Reimplementing the Topics API Classifier

Note

Author: Yohan Beugin (post also published on my website)

GitHub repository: https://github.com/yohhaan/topics_classifier

In this post, we explain why our research required us to reimplement the Topics API classifier for the web, and we describe how classification is actually performed in Chrome by presenting all the pre- and post-processing steps used by Google. We also point out a formatting issue that we found along the way and that affects the intended accuracy of the API classification for some domains.

Read full post

.well-known Crawler

Note

Author: Yohan Beugin

GitHub repository: Crawler & Analysis code

We regularly crawl the Web for the presence of .well-known resources and files that were introduced by the following Privacy Sandbox mechanisms:

Related Website Sets
- /.well-known/related-website-set.json
- HTTPS only
- on the eTLD+1 only, where the Public Suffix List (PSL) is the authoritative source for eTLDs
- the JSON format differs depending on whether the site is the set primary or another member
- There is a canonical public list of sets, but some sets (for instance google.com/youtube.com) are missing from it
- Generator
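
As an illustration of the constraints listed above (HTTPS only, eTLD+1 only), here is a minimal Python sketch of how a crawler might derive the URL to request for a given hostname. This is not the code from our crawler repository: the function names and the tiny `TOY_SUFFIXES` set are hypothetical, and a real crawler would load the full Public Suffix List, including its wildcard and exception rules, which this sketch ignores.

```python
import json
import urllib.request

WELL_KNOWN_PATH = "/.well-known/related-website-set.json"

# Assumption: toy stand-in for the Public Suffix List (PSL).
# A real crawler would load the complete, up-to-date list instead.
TOY_SUFFIXES = {"com", "org", "co.uk"}


def etld_plus_one(hostname, suffixes=TOY_SUFFIXES):
    """Return the eTLD+1 of `hostname`, or None if it has none.

    Simplified matching: plain suffix rules only, no wildcard or
    exception rules as found in the real PSL.
    """
    labels = hostname.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before a shorter suffix could be.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def rws_url(hostname, suffixes=TOY_SUFFIXES):
    """Build the HTTPS well-known URL on the eTLD+1 of `hostname`."""
    site = etld_plus_one(hostname, suffixes)
    if site is None:
        raise ValueError(f"no eTLD+1 found for {hostname!r}")
    return f"https://{site}{WELL_KNOWN_PATH}"


def fetch_rws(hostname, timeout=10):
    """Fetch and parse the well-known file (performs a network request)."""
    with urllib.request.urlopen(rws_url(hostname), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `rws_url("www.example.co.uk")` yields a URL on `example.co.uk` rather than on the `www` subdomain. `fetch_rws` does not interpret the returned JSON, since its format differs between the primary and the other members of a set.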
Attestation File

Read full post

HTTP Archive Contributions

Note

Author: Yohan Beugin

Hostnames Classification: Documentation & GitHub repository

Web Almanac:

Chapters: Cookies 2024 & Privacy 2024

Instrumentation: see pull requests 129 and 131

Over the summer and fall of 2024, we collaborated with the HTTP Archive as part of the Web Almanac project and added instrumentation to their monthly crawl to detect the presence of the Privacy Sandbox APIs on visited websites. We also classified the hostnames of all the websites ever visited by the project with the latest version of the Topics API at the time.

Read full post

Datasets & Software

Our objective here is to list resources that could be useful to researchers who are evaluating the privacy claims of different advertising and web proposals. Also check out similar content elsewhere on our website under the “Datasets & Software” tag.

Datasets

HTTP Archive: This project regularly crawls top websites to record different information about the resources being fetched, APIs used, etc. The dataset and historical crawls can be accessed through BigQuery.

Read full post