K-anonymity: An Introduction (2024)

Organisations today are entrusted with personal data that they use to serve customers and improve decision making, but much of the value in that data still goes untapped. It could be invaluable to third-party researchers and analysts in answering questions ranging from town planning to fighting cancer, so organisations often want to share this data whilst protecting the privacy of individuals. At the same time, it is important to preserve the utility of the data so that analytical outcomes remain accurate.

Data owners want a way to transform a dataset containing highly sensitive information into a privacy-preserving, low-risk set of records that can be shared with anyone from researchers to corporate partners. Increasingly, however, there have been cases of companies releasing datasets they believed to be anonymised, only for a significant fraction of the records to be re-identified. It is therefore vital to understand how anonymisation techniques work, where they can be safely applied, and what their strengths and limitations are.

This introduction looks at k-anonymity, a privacy model commonly applied to protect data subjects’ privacy in data sharing scenarios, and the guarantees that k-anonymity can provide when used to anonymise data. In many privacy-preserving systems, the end goal is anonymity for the data subjects. Taken at face value, anonymity just means being nameless, but a closer look quickly makes clear that removing names from a dataset is not sufficient to achieve anonymisation. Supposedly anonymised data can be re-identified by linking it with another dataset. The data may include pieces of information that are not themselves unique identifiers but can become identifying when combined with other datasets; these are known as quasi-identifiers.

For example, around 87 percent of the US population can be uniquely identified by just their 5-digit ZIP code, gender, and date of birth taken together. Even in cases where only a small fraction of individuals are uniquely identifiable, the result can still be a severe privacy breach for those affected. It is never possible to know the full set of additional information that exists elsewhere, and therefore what could turn out to be identifying.
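To make the linkage risk concrete, here is a minimal sketch in Python (with entirely hypothetical records and column names, not data from any real release) showing how a dataset that merely has names removed can be re-identified by joining it to a public dataset on shared quasi-identifiers:

```python
import pandas as pd

# Hypothetical "anonymised" health records: names removed, quasi-identifiers kept.
health = pd.DataFrame({
    "Postcode": ["SW1A 1AA", "NW1 4RY"],
    "DOB":      ["1985-02-14", "1959-07-30"],
    "Gender":   ["F", "M"],
    "Disease":  ["Asthma", "Cancer"],
})

# Hypothetical public dataset (for example, an electoral roll) that lists
# names alongside the same quasi-identifiers.
public = pd.DataFrame({
    "Name":     ["Alice Smith", "Bob Jones"],
    "Postcode": ["SW1A 1AA", "NW1 4RY"],
    "DOB":      ["1985-02-14", "1959-07-30"],
    "Gender":   ["F", "M"],
})

# Joining on the quasi-identifiers re-attaches names to the "anonymous" records.
reidentified = health.merge(public, on=["Postcode", "DOB", "Gender"])
print(reidentified[["Name", "Disease"]])
```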

The Technique

K-anonymity is a key concept that was introduced to address the risk of re-identification of anonymised data through linkage to other datasets. The k-anonymity privacy model was first proposed in 1998 by Latanya Sweeney in the paper ‘Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression’. For k-anonymity to be achieved, there need to be at least k individuals in the dataset who share each combination of attributes that might become identifying. K-anonymity might be described as a ‘hiding in the crowd’ guarantee: if each individual is part of a group of at least k records that look identical on those attributes, then any record in the group could correspond to any one of those people, and no individual can be singled out.
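As a concrete illustration (a minimal sketch with hypothetical column names, not code from the original model), the k value of a dataset can be measured by grouping the records on the chosen quasi-identifiers and taking the size of the smallest group:

```python
import pandas as pd

def k_value(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the k-anonymity level of df: the size of the smallest group of
    records that share the same combination of quasi-identifier values."""
    return int(df.groupby(quasi_identifiers, observed=True).size().min())

# Hypothetical records: one person (the 62-year-old) is unique on the
# quasi-identifiers, so the dataset as a whole is only 1-anonymous.
records = pd.DataFrame({
    "Postcode": ["SW1A 1AA", "SW1A 1AA", "SW1A 1AA", "NW1 4RY"],
    "Age":      [34, 34, 34, 62],
    "Gender":   ["F", "F", "F", "M"],
    "Disease":  ["Asthma", "Diabetes", "Asthma", "Cancer"],
})

print(k_value(records, ["Postcode", "Age", "Gender"]))  # prints 1
```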

Consider a small example dataset of patient records.

Name is a direct identifier and would be removed outright. Postcode, Age, and Gender are attributes that could be used to narrow a record down to an individual; they are considered quasi-identifiers, as they could be found in other data sources. Disease is the sensitive attribute that we wish to study and which we assume the individual has an interest in keeping private.

To anonymise this data to achieve k-anonymity with k = 3, some quasi-identifier attributes are generalised and others are redacted. In such a small example the data has to be distorted quite significantly, but the larger the dataset, the less distortion is required to reach the desired level of k.
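The following is a hedged sketch of that process (hypothetical records, not the article’s original table): postcodes are generalised to a prefix, ages are generalised to bands, gender is redacted, and the result is checked to be 3-anonymous:

```python
import pandas as pd

# Hypothetical patient records; Name is a direct identifier and is simply dropped.
raw = pd.DataFrame({
    "Postcode": ["SW1A 1AA", "SW1A 2BB", "SW1B 3CC", "NW1 4RY", "NW1 5TZ", "NW2 6QD"],
    "Age":      [34, 36, 39, 61, 64, 68],
    "Gender":   ["F", "F", "F", "M", "M", "M"],
    "Disease":  ["Asthma", "Diabetes", "Asthma", "Cancer", "Heart disease", "Cancer"],
})

anonymised = pd.DataFrame({
    # Generalise: keep only the first two characters of the postcode.
    "Postcode": raw["Postcode"].str.slice(0, 2) + "*",
    # Generalise: replace exact ages with 20-year bands.
    "Age":      pd.cut(raw["Age"], bins=[20, 40, 60, 80],
                       labels=["21-40", "41-60", "61-80"]),
    # Redact: suppress gender entirely.
    "Gender":   "*",
    # The sensitive attribute is left untouched so it can still be analysed.
    "Disease":  raw["Disease"],
})

quasi_identifiers = ["Postcode", "Age", "Gender"]
k = anonymised.groupby(quasi_identifiers, observed=True).size().min()
print(f"k = {k}")  # k = 3: every quasi-identifier combination occurs at least 3 times
```

Coarser generalisation (wider age bands, shorter postcode prefixes) raises k but distorts the data further, which is the utility trade-off described above.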

While k-anonymity can provide some useful guarantees, the technique comes with the following conditions:

  1. The sensitive columns of interest must not reveal information that was redacted in the generalised columns. For example, certain diseases occur only in men or only in women, which could reveal a redacted gender attribute.
  2. The values in the sensitive columns must not all be the same for a particular group of k. If the sensitive values are all the same for a set of k records that share quasi-identifying attributes, then the dataset is still vulnerable to a so-called homogeneity attack. In a homogeneity attack, the attacker exploits the fact that it is enough to find the group of records the individual belongs to if all of them share the same sensitive value. For example, if all men over 60 in our dataset have cancer, and I know Bob is over 60 and is in the dataset, then I now know Bob has cancer. Moreover, even if the values are not all the same for a group of k, insufficient diversity still gives an attacker a good chance of learning something about Bob: if around 90 percent of the records in the group share the same sensitive value, an attacker can infer the individual’s sensitive attribute with high confidence. Measures such as l-diversity and t-closeness can be used to require a given amount of diversity amongst the sensitive values of any k matching records (see the sketch after this list).
  3. The dimensionality of the data must be sufficiently low. If the data is of high dimensionality, such as time series data, it becomes very hard to give the same privacy guarantee as with low-dimensional data. For types of data such as transaction or location data, it can be possible to identify an individual uniquely by stringing together multiple data points. Also, as the dimensionality of the data increases, the data points tend to be very sparsely distributed, which makes it difficult to group records without heavily distorting the data to achieve k-anonymity. By combining this approach with data minimisation and only releasing the columns people really need, the dimensionality can be reduced to manageable levels (at the cost of making different releases for different purposes).
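To illustrate the homogeneity caveat from point 2, here is a minimal sketch (hypothetical data and column names, a simple distinct-values check rather than a full l-diversity implementation) that reports the smallest number of distinct sensitive values in any group of records sharing the same quasi-identifiers; a result of 1 signals a group vulnerable to a homogeneity attack:

```python
import pandas as pd

def distinct_l_diversity(df: pd.DataFrame,
                         quasi_identifiers: list[str],
                         sensitive: str) -> int:
    """Return the smallest number of distinct sensitive values found in any
    group of records sharing the same quasi-identifier combination."""
    return int(df.groupby(quasi_identifiers, observed=True)[sensitive].nunique().min())

# Hypothetical 3-anonymous release: the second group has only one distinct
# disease, so knowing someone belongs to it reveals their diagnosis outright.
release = pd.DataFrame({
    "Postcode": ["SW*"] * 3 + ["NW*"] * 3,
    "Age":      ["21-40"] * 3 + ["61-80"] * 3,
    "Disease":  ["Asthma", "Diabetes", "Asthma", "Cancer", "Cancer", "Cancer"],
})

print(distinct_l_diversity(release, ["Postcode", "Age"], "Disease"))  # prints 1
```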

K-anonymisation is still a powerful tool when applied appropriately and with the right safeguards in place, such as access control and contractual safeguards. It forms an important part of the arsenal of privacy enhancing technologies, alongside alternative techniques such as differentially private algorithms. As big data becomes the norm rather than the exception, we are seeing increasing dimensionality of data, as well as more and more public datasets that can be used to aid re-identification efforts, which makes understanding the limits of these guarantees all the more important.
