.well-known crawler

Description

Crawls the well-known resources introduced by the Privacy Sandbox:

Related Website Sets
- /.well-known/related-website-set.json
- HTTPS only
- Served on the eTLD+1 only, with the Public Suffix List (PSL) as the authoritative source for eTLDs
- The JSON format differs depending on whether the site is the set primary or another member
- There is a canonical public list of sets, but some sets (for instance google.com/youtube.com) are missing from it
- Generator

Attestation File
- /.well-known/privacy-sandbox-attestations.json
- Obtained by submitting a form; the JSON file is then sent by Google
- No public list of participants
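
The location rules above can be sketched in Python. This is a minimal sketch and not the crawler's actual code: the helper names `well_known_urls` and `fetch_json` are hypothetical, and the input is assumed to already be an eTLD+1 (deriving it from the PSL needs a third-party library and is left out here).

```python
import http.client
import json
import urllib.request

# Paths taken from the description above.
WELL_KNOWN_PATHS = {
    "rws": "/.well-known/related-website-set.json",
    "attestation": "/.well-known/privacy-sandbox-attestations.json",
}


def well_known_urls(etld_plus_one: str) -> dict[str, str]:
    """Build the HTTPS-only URLs for a site that is assumed to be an eTLD+1."""
    host = etld_plus_one.strip().lower().rstrip(".")
    # Reject anything that looks like a URL, a path, or a host:port pair.
    if not host or "/" in host or ":" in host:
        raise ValueError(f"expected a bare domain name, got {etld_plus_one!r}")
    return {name: f"https://{host}{path}" for name, path in WELL_KNOWN_PATHS.items()}


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch and decode a JSON resource; return None on network, HTTP or parse errors."""
    if not url.startswith("https://"):
        raise ValueError("well-known resources are served over HTTPS only")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, http.client.HTTPException, ValueError):
        # URLError/HTTPError are OSError subclasses; JSON and decoding
        # errors are ValueError subclasses.
        return None
```

For example, `well_known_urls("example.com")["rws"]` gives the URL that `fetch_json` would then request. Returning `None` on failure keeps a crawl loop simple, at the cost of not distinguishing a missing file from a malformed one.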

Repositories

Crawler: https://github.com/privacysandstorm/well-known-crawler
Post-analysis: https://github.com/privacysandstorm/well-known-crawler-analysis

Datasets

Requester Pays buckets

Datasets for this crawler are stored on Amazon S3 in Requester Pays buckets. This means that you pay the request and data transfer charges associated with downloading the data. All datasets are stored in the us-east-2 region, so you can avoid data transfer fees by running your analysis within this region.

Buckets

Data is stored in one of two buckets depending on the nature of the data:

s3://well-known-crawler-data - raw crawl data are stored within this bucket. You may use any AWS IAM account to download from this bucket.
s3://well-known-crawler-analysis - analysis artifacts are stored within this bucket. You may use any AWS IAM account to download from this bucket.
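
Because the buckets bill the requester, each request has to acknowledge the charges explicitly. The following is a minimal sketch using boto3 (a third-party package), assuming AWS credentials are already configured. The helper names are hypothetical, and the layout of keys inside the buckets is not documented here, so any `prefix` or `key` you pass is your own assumption.

```python
BUCKET_REGION = "us-east-2"
RAW_BUCKET = "well-known-crawler-data"
ANALYSIS_BUCKET = "well-known-crawler-analysis"


def requester_pays_list_args(bucket: str, prefix: str = "") -> dict:
    """Arguments for list_objects_v2 that acknowledge requester-pays charges."""
    return {"Bucket": bucket, "Prefix": prefix, "RequestPayer": "requester"}


def list_keys(bucket: str, prefix: str = "") -> list[str]:
    """List object keys in a bucket; needs boto3 and configured AWS credentials."""
    import boto3  # third-party: pip install boto3

    s3 = boto3.client("s3", region_name=BUCKET_REGION)
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(**requester_pays_list_args(bucket, prefix)):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def download(bucket: str, key: str, destination: str) -> None:
    """Download one object to a local path, paying the transfer as requester."""
    import boto3  # third-party: pip install boto3

    s3 = boto3.client("s3", region_name=BUCKET_REGION)
    s3.download_file(bucket, key, destination, ExtraArgs={"RequestPayer": "requester"})
```

Calling `list_keys(RAW_BUCKET)` first is a cheap way to discover what is available before downloading anything. Requests that omit `RequestPayer` are typically rejected with an access-denied error.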

Results

The crawl runs automatically at least twice per month, followed by the post-analysis. You can always access the most up-to-date version of the results at the following URLs (in addition to the buckets above):

List of origins with a /.well-known/related-website-set.json file: https://privacysandstorm-public-data.s3.us-east-2.amazonaws.com/well-known-crawler/rws_known_origins.json
List of origins with a /.well-known/privacy-sandbox-attestations.json file: https://privacysandstorm-public-data.s3.us-east-2.amazonaws.com/well-known-crawler/attestation_known_origins.json
Generated list of enrollment sites and corresponding APIs: https://privacysandstorm-public-data.s3.us-east-2.amazonaws.com/well-known-crawler/attestation_known_apis.tsv
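
The public result files can be read with the Python standard library alone. This is a minimal sketch: the function names are hypothetical, and the schemas are not documented here, so it assumes that the two JSON files contain a top-level array and returns the TSV rows as-is without interpreting the columns.

```python
import csv
import io
import json
import urllib.request

BASE_URL = (
    "https://privacysandstorm-public-data.s3.us-east-2.amazonaws.com/"
    "well-known-crawler/"
)


def fetch_text(name: str, timeout: float = 30.0) -> str:
    """Download one of the public result files and return it as text."""
    with urllib.request.urlopen(BASE_URL + name, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_origins(text: str) -> list:
    """Parse an origins file, assuming (unverified) a top-level JSON array."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of origins")
    return data


def parse_tsv(text: str) -> list[list[str]]:
    """Split a tab-separated file into rows of string fields."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))
```

A typical use would be `parse_origins(fetch_text("rws_known_origins.json"))` or `parse_tsv(fetch_text("attestation_known_apis.tsv"))`. Keeping the parsers separate from the download makes them easy to reuse on files copied from the buckets.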