Note
In this post, we discuss why our research required reimplementing the Topics API for the web, and we explain how classification is actually performed in Chrome by presenting all the pre- and post-processing steps used by Google. We also point out a formatting issue that we found along the way, which affects the intended accuracy of the API classification for some domains.
Note
We regularly crawl the Web for the presence of .well-known resources and files that were introduced by the following Privacy Sandbox mechanisms:
- Related Website Sets: /.well-known/related-website-set.json
  - Served over HTTPS only
  - Hosted on the eTLD+1 only, where the Public Suffix List (PSL) is the authoritative source for eTLDs
  - The JSON format differs depending on whether the site is the primary of the set or another member
  - There is a canonical public list of sets, but some sets (for instance google.com/youtube.com) are missing from it
  - Generator
- Attestation File
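As an illustration of the HTTPS-only and eTLD+1-only constraints listed above for Related Website Sets, here is a minimal sketch of how a crawler might derive the URL to probe from an arbitrary hostname. The names `DEMO_PUBLIC_SUFFIXES`, `registrable_domain`, and `rws_url` are hypothetical and not taken from our crawler. The three-entry suffix set is only a stand-in for the real PSL, and the sketch ignores the PSL's wildcard and exception rules.

```python
# Path of the Related Website Sets resource, as listed above.
RWS_PATH = "/.well-known/related-website-set.json"

# Hypothetical, tiny stand-in for the Public Suffix List (PSL).
# A real crawl would load the full list instead of this demo set.
DEMO_PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def registrable_domain(hostname, suffixes=DEMO_PUBLIC_SUFFIXES):
    """Return the eTLD+1 of `hostname`, or None if it has none.

    Simplification (assumption): plain longest-suffix matching only,
    without the wildcard and exception rules of the real PSL.
    """
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest one.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # eTLD+1 is the matched suffix plus one label to its left.
            return ".".join(labels[i - 1:])
    return None


def rws_url(hostname, suffixes=DEMO_PUBLIC_SUFFIXES):
    """Build the HTTPS URL of the .well-known file on the eTLD+1."""
    domain = registrable_domain(hostname, suffixes)
    if domain is None:
        return None
    return f"https://{domain}{RWS_PATH}"


if __name__ == "__main__":
    for host in ("www.example.co.uk", "shop.example.com", "com"):
        print(host, "->", rws_url(host))
```

Subdomains collapse to their registrable domain before the path is appended, so `www.example.co.uk` and `example.co.uk` map to the same URL, while a bare public suffix such as `com` yields no URL at all.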
Note
Over the summer and fall of 2024, we collaborated with the HTTP Archive as part of the Web Almanac project, and contributed instrumentation to their monthly crawl to detect the presence of the Privacy Sandbox APIs on visited websites. We also classified the hostnames of all websites ever visited by the project using the latest version of the Topics API at the time.
The objective of this post is to list different datasets and software that could be useful to other researchers who are evaluating the privacy claims of different advertising and web proposals.
You may also want to check out similar content elsewhere on this website under the “Datasets & Software” tag.
Datasets
- HTTP Archive: This project regularly crawls top websites to record different information about the resources being fetched, APIs used, etc. The dataset and historical crawls can be accessed through BigQuery.