Privacy Sandstorm

Google 2023 Paper

Note
This paper appears to have been submitted in slightly different versions to several venues in 2023, including SecWeb'23, SIGMOD'23, and RegML at NeurIPS; we link below to the arXiv version.

Title: Measuring Re-identification Risk

Authors: Cj Carey (Google), Travis Dick (Google), Alessandro Epasto (Google), Adel Javanmard (USC and Google), Josh Karlin (Google), Shankar Kumar (Google), Andrés Muñoz Medina (Google), Vahab Seyed Mirrokni (Google), Gabriel H. Nunes (UFMG and Google), Sergei Vassilvitskii (Google), Peilin Zhong (Google)

Abstract/Summary: Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as Chrome’s Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
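To make the attack side of this concrete, here is a toy sketch (our own illustration, not the paper's code, dataset, or exact algorithm) of a matching-style re-identification attack in a Topics-like setting: each user has a stable set of interests, the "API" exposes a top-k sample per epoch, and the attacker links observations across two epochs by Jaccard similarity. All parameters (taxonomy size, number of preferred topics, population size) are arbitrary choices for the simulation.

```python
import random

random.seed(0)

NUM_TOPICS = 350   # rough size of the initial Topics taxonomy
TOP_K = 5          # topics observed per user per epoch
NUM_USERS = 1000
PREFS = 10         # assumed number of stable interests per user (toy choice)

# Toy model: each user has a stable set of preferred topics; each epoch the
# observed top set is a random TOP_K-subset of those preferences.
users = [random.sample(range(NUM_TOPICS), PREFS) for _ in range(NUM_USERS)]

def observe(prefs):
    return frozenset(random.sample(prefs, TOP_K))

epoch1 = [observe(u) for u in users]  # attacker's reference observations
epoch2 = [observe(u) for u in users]  # observations the attacker tries to link

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Matching attack: link each epoch-2 observation to the most similar epoch-1 one.
correct = 0
for i, obs in enumerate(epoch2):
    guess = max(range(NUM_USERS), key=lambda j: jaccard(obs, epoch1[j]))
    if guess == i:
        correct += 1

print(f"attack accuracy: {correct / NUM_USERS:.3f} "
      f"(random guessing: {1 / NUM_USERS:.4f})")
```

Because two samples from the same small preference set overlap far more than samples from different users, this naive attack links most users correctly, which is the kind of gap between baseline and attack success that a re-identification risk measure is meant to capture.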

Remarks

From our SecWeb'24 paper:

“Google released two privacy analyses of the fingerprinting risk of the Topics API; a white paper computing the aggregate information leakage per API call and for two consecutive calls and a second work using a theoretical framework to measure re-identification risk. However, the empirical measurements in both works have been performed on a private dataset, preventing the verification of the claims being made. Similarly, only aggregate and final results are reported, the lack of details across the distribution of users can hide the privacy risks of Topics for specific users as already pointed out by Thomson. Additionally, Google’s second analysis assumes and infers from an aggregate statistic that “for every user, samples of top sets [of topics] are independent across time”, while prior web measurement studies have found that users’ interests exhibit some stability over time.”
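A small simulation can illustrate why the independence-across-time assumption matters. The sketch below (our own illustration, with arbitrary toy parameters, not taken from either analysis) compares a cross-epoch matching attack under two regimes: top sets drawn independently each epoch, versus stable interests where only one topic churns per epoch.

```python
import random

random.seed(1)

NUM_TOPICS, TOP_K, NUM_USERS = 350, 5, 500

def jaccard(a, b):
    return len(a & b) / len(a | b)

def attack_accuracy(epoch1, epoch2):
    """Fraction of epoch-2 observations linked to the correct epoch-1 user."""
    hits = 0
    for i, obs in enumerate(epoch2):
        guess = max(range(len(epoch1)), key=lambda j: jaccard(obs, epoch1[j]))
        hits += (guess == i)
    return hits / len(epoch2)

def fresh_top_set():
    return frozenset(random.sample(range(NUM_TOPICS), TOP_K))

# Regime (a): top sets are independent across time (each epoch is a fresh draw).
indep1 = [fresh_top_set() for _ in range(NUM_USERS)]
indep2 = [fresh_top_set() for _ in range(NUM_USERS)]

# Regime (b): stable interests -- the next epoch keeps TOP_K - 1 topics
# and replaces a single one.
def churn(top_set):
    kept = random.sample(sorted(top_set), TOP_K - 1)
    new = random.choice([t for t in range(NUM_TOPICS) if t not in top_set])
    return frozenset(kept + [new])

stable1 = [fresh_top_set() for _ in range(NUM_USERS)]
stable2 = [churn(s) for s in stable1]

acc_indep = attack_accuracy(indep1, indep2)
acc_stable = attack_accuracy(stable1, stable2)
print("independent epochs:", acc_indep)
print("stable interests:  ", acc_stable)
```

Under independence the attacker does barely better than guessing, while with mostly-stable top sets linking succeeds for almost everyone; an analysis that assumes independence can therefore understate the risk for users whose interests persist.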