Privacy Sandstorm

Reimplementing the Topics API Classifier

Note

In this post, we discuss why our research required reimplementing the Topics API for the web, and we explain how classification is actually performed in Chrome by presenting all the pre- and post-processing steps used by Google. We also point out a formatting issue we found along the way that affects the intended accuracy of the API's classification for some domains.

Read full post

.well-known Crawler

Note

We regularly crawl the Web for the presence of .well-known resources and files that were introduced by the following Privacy Sandbox mechanisms:

  • Related Website Sets

    • /.well-known/related-website-set.json
    • Served over HTTPS only
    • Hosted on the eTLD+1 only, where the Public Suffix List (PSL) is the authoritative source for eTLDs
    • The JSON format differs depending on whether the site is the set primary or another member
    • There is a canonical public list of sets, but some sets (for instance google.com/youtube.com) are missing from it
    • Generator
  • Attestation File

Read full post

HTTP Archive Contributions

Note

Over the summer and fall of 2024, we collaborated with the HTTP Archive as part of the Web Almanac project: we added instrumentation to its monthly crawl to detect the presence of the Privacy Sandbox APIs on visited websites. We also classified the hostnames of all websites ever visited by the project with the latest version of the Topics API available at the time.

Read full post

Datasets & Software

The objective of this post is to list datasets and software that could be useful to other researchers evaluating the privacy claims of advertising and web proposals.

You may also want to check out similar content located elsewhere on this website under the “Datasets & Software” tag.

Datasets

  • HTTP Archive: This project regularly crawls top websites and records information about the resources fetched, the APIs used, etc. The dataset, including historical crawls, can be accessed through BigQuery.

Read full post