Practical examples of why k-anonymity is not enough
/images/2023-12-08-k_anonymity/ (Banner created from a photo by Scott Webb on Pexels.com)
Why share data?
Quasi-identifiers

- What is a quasi-identifier? A combination of attributes (that an adversary may know) that uniquely identifies a large fraction of the population.
- There can be many sets of quasi-identifiers. If Q = {B, Z, S} is a quasi-identifier, then Q + {N} is also a quasi-identifier.
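To make this concrete, here is a minimal sketch in pure Python. The records, column choices, and the `unique_fraction` helper are all hypothetical; the attribute combination {birth year, ZIP, sex} simply mirrors the classic re-identification example.

```python
from collections import Counter

# Toy, hypothetical records: (birth_year, zip_code, sex, name)
records = [
    (1975, "02138", "F", "Alice"),
    (1975, "02138", "F", "Carol"),
    (1975, "02139", "M", "Bob"),
    (1980, "02139", "M", "Dan"),
]

def unique_fraction(rows, qi_indices):
    """Fraction of rows uniquely identified by the given quasi-identifier columns."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return sum(
        1 for row in rows
        if counts[tuple(row[i] for i in qi_indices)] == 1
    ) / len(rows)

# Q = {birth_year, zip, sex} already pins down half of this toy population:
print(unique_fraction(records, [0, 1, 2]))     # 0.5
# And any superset of a quasi-identifier is still a quasi-identifier:
print(unique_fraction(records, [0, 1, 2, 3]))  # 1.0
```

Note the second call illustrates the Q + {N} remark above: adding an attribute can only refine the groups, never merge them.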
We saw examples of such attacks before:

- The Massachusetts governor attack
- The AOL privacy breach
- The Netflix attack
- Social network attacks
K-Anonymity
- Each released record should be indistinguishable from at least (k-1) others on its quasi-identifier (QI) attributes.
- Alternatively: the cardinality of any query result on the released data should be at least k.
- k-anonymity is (the first) one of many privacy definitions in this line of work: l-diversity, t-closeness, m-invariance, delta-presence, …
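A quick way to check the first bullet on a concrete table is to group rows by their QI attributes and look at the smallest group. This is a sketch on toy, already-generalized data; the `k_anonymity` helper and the column layout are my own illustration, not a standard API.

```python
from collections import Counter

def k_anonymity(rows, qi_indices):
    """Smallest equivalence-class size over the QI columns.
    The table is k-anonymous for every k up to this value."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return min(counts.values())

# Hypothetical generalized table: (age_range, zip_prefix, sensitive_disease)
table = [
    ("20-30", "021**", "Flu"),
    ("20-30", "021**", "HIV"),
    ("30-40", "022**", "Flu"),
    ("30-40", "022**", "Flu"),
]

print(k_anonymity(table, [0, 1]))  # 2 -> this release is 2-anonymous
```

Every record shares its (age range, ZIP prefix) pair with at least one other record, so any query on the QI columns returns at least 2 rows, matching the alternative formulation above.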
Sometimes this may be too restrictive. For instance, entropy l-diversity (requiring the entropy of the sensitive values in each class to be at least log l) is hard to satisfy when some sensitive values are very common, because then the entropy of the entire table may already be very low. This leads to the less conservative notion of recursive (c,l)-diversity:

- The most frequent sensitive value must not appear too frequently. Let r_1 >= r_2 >= ... >= r_m be the counts of the sensitive values in an equivalence class, sorted in descending order. The class satisfies recursive (c,l)-diversity if:

  r_1 < c * (r_l + r_{l+1} + ... + r_m)
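The inequality above translates directly into code. This is a sketch with an illustrative helper name and toy sensitive values of my choosing:

```python
from collections import Counter

def recursive_cl_diverse(sensitive_values, c, l):
    """Check r_1 < c * (r_l + ... + r_m), where r_1 >= ... >= r_m
    are the counts of sensitive values in one equivalence class."""
    freqs = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(freqs) < l:         # fewer than l distinct values: cannot qualify
        return False
    return freqs[0] < c * sum(freqs[l - 1:])

# Counts here are [3, 1, 1] (Flu, HIV, Cancer):
cls = ["Flu", "Flu", "Flu", "HIV", "Cancer"]
print(recursive_cl_diverse(cls, c=3, l=2))  # True:  3 < 3 * (1 + 1)
print(recursive_cl_diverse(cls, c=2, l=3))  # False: 3 < 2 * 1 fails
```

Larger c or smaller l makes the test easier to pass, which is exactly the "less conservative" knob the definition provides.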
Limitations of l-diversity
l-diversity may be both difficult and unnecessary to achieve. Consider a single sensitive attribute with two values, HIV positive (1%) and HIV negative (99%), which have very different degrees of sensitivity.

- l-diversity can be unnecessary: 2-diversity is unnecessary for an equivalence class that contains only negative records.
- l-diversity can be difficult to achieve: suppose there are 10000 records in total. To have distinct 2-diversity, there can be at most 10000 * 1% = 100 equivalence classes.
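Both halves of this limitation are easy to demonstrate on toy numbers. The `distinct_l` helper and the all-negative class below are my own illustration:

```python
def distinct_l(eq_class):
    """Distinct l-diversity of one class: the number of distinct sensitive values."""
    return len(set(eq_class))

# Difficult: with 10000 records and 1% HIV positive, every distinctly
# 2-diverse class must contain at least one of the positive records,
# so the partition can have at most that many equivalence classes.
n_records = 10_000
n_positive = int(n_records * 0.01)
print(n_positive)  # 100

# Unnecessary: a class of only negatives fails 2-diversity
# even though publishing it discloses almost nothing sensitive.
print(distinct_l(["HIV negative"] * 50))  # 1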
L-Diversity

Guarding against unknown adversarial knowledge:

- Limit adversarial knowledge: assume the adversary knows at most (l-2) negation statements of the form "Umeko does not have a heart disease."
- Consider the worst case: all possible conjunctions of at most (l-2) such statements.
T-Closeness
- k-anonymity prevents identity disclosure but not attribute disclosure.
- To solve that problem, l-diversity requires that each equivalence class has at least l values for each sensitive attribute.
- But l-diversity has some limitations.
- t-closeness requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of that attribute in the overall table.
Privacy is measured by the information gain of an observer:

- Information Gain = Posterior Belief - Prior Belief
- Q = the distribution of the sensitive attribute in the whole table
- P = the distribution of the sensitive attribute in an equivalence class
Principle:

- An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t.
- A table is said to have t-closeness if all equivalence classes have t-closeness.
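The original t-closeness paper mainly uses the Earth Mover's Distance; as a simpler stand-in, the sketch below measures closeness with the total variation (variational) distance, which the paper also discusses. The helper names and the toy table are my own:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def variational_distance(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

def t_close(eq_class, whole_table, t):
    """Does this equivalence class satisfy t-closeness (under TV distance)?"""
    return variational_distance(
        distribution(eq_class), distribution(whole_table)
    ) <= t

table = ["Flu", "Flu", "HIV", "Flu", "Cancer", "Flu"]
print(t_close(["Flu", "HIV"], table, t=0.4))  # True:  distance = 1/3
print(t_close(["HIV", "HIV"], table, t=0.4))  # False: the class is all-HIV
```

The second class is rejected because its sensitive-value distribution (100% HIV) is far from the table-wide one, which is exactly the attribute-disclosure scenario t-closeness is designed to block.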
Conclusion

- t-closeness protects against attribute disclosure but not identity disclosure, so in practice it is applied on top of k-anonymity.
- It requires that the distribution of a sensitive attribute in any equivalence class stays close to the distribution of that attribute in the overall table.

Further reading

- Programming Differential Privacy (website & GitHub repo)
- "k-anonymity, the parent of all privacy definitions" by Damien Desfontaines