Data Anonymization: The What, Why, and How of Data Anonymization
Introduction to data anonymization
In the era of big data, personal privacy is a topic of increasing concern for consumers and businesses alike. Particularly as more of the services we rely on go online, and with the rise of personalization in marketing and product recommendations, companies are always looking for more intelligent ways to use data.
But these innovations cannot be at the expense of secure data storage and privacy protection. If companies are going to use customer data, they have a responsibility to protect it, and data anonymization is a critical part of good data security and privacy strategy.
Some big-name companies have had steep penalties levied against them for violations of the European Union’s General Data Protection Regulation (GDPR), and enforcement of the GDPR shows no sign of slowing down. In the United States, regulations like the California Privacy Rights Act (CPRA), which amends the California Consumer Privacy Act (CCPA), mandate creation of a state-level Privacy Protection Agency to handle alleged violations and enforcement.
No company wants to be an accidental or negligent offender. What’s more, ensuring data privacy compliance is increasingly a key way to build consumer trust in your brand, just as a data breach can be a surefire way to lose a hard-won positive reputation.
We’ll take a look at how user data is anonymized, where this anonymization is required, and how companies can take steps to comply with privacy regulations.
Read on to learn more about:
- Anonymization 101
- How is data anonymized?
- Anonymization best practices
- Impacts of data anonymization
Data anonymization: key terms and definitions
In order to better understand data anonymization and its importance in compliance, it will be helpful to have an overview of some key terms. It’s a complicated topic with lots of different laws and regulations, so it’s important to get the basics down first.
We’ll start with what we mean by de-identified data. De-identification refers to the removal of personally identifying information (PII) from datasets in order to protect individuals’ privacy. In other words, data processors should be able to handle the information, such as for analytics and research, without having any recognizable link to, or being able to directly identify, the person it came from.
Pseudonymization is a form of de-identification in which personal identities are replaced with artificial identifiers, or pseudonyms. For example, stripping a real name and replacing it with “Jane Doe” is pseudonymization, though in practice the replacement is usually a random ID. The key thing to recognize is that de-identified or pseudonymized data can be re-associated with the person it came from, so the information necessary to do this must be kept separate and secure to avoid privacy violations.
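As a rough illustration of pseudonymization, the minimal Python sketch below swaps a name for a random ID and returns the linking table separately. The function name `pseudonymize`, the field names, and the sample records are hypothetical and chosen only for this example; a real system would also need to encrypt and access-control the linking table.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a random ID.

    Returns the pseudonymized records plus a separate key table
    (real identity -> pseudonym). Anyone holding the key table can
    re-identify the data, so it must be stored apart from the records.
    """
    key_table = {}
    output = []
    for record in records:
        identity = record[id_field]
        # Reuse the same pseudonym for repeat appearances of one person.
        if identity not in key_table:
            key_table[identity] = "user_" + secrets.token_hex(8)
        cleaned = dict(record)  # copy, so the input is left untouched
        cleaned[id_field] = key_table[identity]
        output.append(cleaned)
    return output, key_table


# Hypothetical sample data for illustration only.
records = [
    {"name": "Alice Smith", "purchase": "laptop"},
    {"name": "Bob Jones", "purchase": "phone"},
    {"name": "Alice Smith", "purchase": "mouse"},
]
pseudonymized, key_table = pseudonymize(records)
```

Because the same person always maps to the same pseudonym, analysts can still count purchases per customer, but the link back to a real name exists only in `key_table`.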
In contrast, anonymization is a more stringent standard of de-identification. It refers to the act of permanently stripping PII in such a way that the identification link can never be re-established. Put another way, de-identification and pseudonymization disconnect the person from the data but keep the linking information stored separately, while anonymization requires that there be zero risk of re-identification between the person and the information.
We’ll get into the technical aspects of how this is done in a bit, but let’s first look at what the law says about anonymization. In the United States, the Dodd-Frank Wall Street Reform and Consumer Protection Act (which, for convenience, we will refer to as the CPA) was passed in 2010, providing a major overhaul of consumers’ rights to their data.
Among other things, the CPA grants individuals access to their own financial data and the ability to move it or share it with others. This was crucial for trends like the push for open banking and to enable new companies to compete with traditional institutions. These financial changes can also affect ecommerce.
Data privacy is still a work in progress, however. The US Consumer Financial Protection Bureau (CFPB) and international bodies are regularly re-evaluating existing privacy laws – and new ones continue to be drafted in countries around the world – in the face of new developments in technology and the new risks these advancements create.
In 2018, the GDPR became enforceable, holding companies that process the data of EU residents accountable to wide-spanning regulations. The GDPR has since influenced other regulations, such as the CCPA.
Under the GDPR and similar regulations using an “opt-in” model, data controllers must inform users when data is being collected, and allow an individual the right to prevent that collection and processing at any time. Individuals are also granted explicit ownership of their data in the sense that they must be able to transfer personal data from one system to another.
Under an opt-out model, which has been more commonly adopted in the US to date, consumers’ consent must only be obtained before collected personal data is sold (or, in some cases, shared), rather than before it is collected.
The above distinction between de-identification and anonymization comes into play here, as data that has been fully anonymized is not subject to these consent requirements, but data that has only been de-identified is. Perhaps unsurprisingly, this poses some challenges for businesses that store personal identifiers and rely on users’ digital fingerprints to provide their services, such as any service that monitors users’ online behavior.
How is data anonymized?
Today, most businesses collect some form of personal data, particularly in ecommerce. In order to provide a seamless checkout experience for users, for example, companies would need invoicing software that includes features like payment reminders and automatic billing for repeat customers. But in order to do this, a company must use browser cookies and store both personal information and payment information.
So how, then, can data be properly anonymized? Or, perhaps a better question would be, can data be truly anonymized at all? This is a big problem for privacy experts who are always looking for ways to make data storage more secure.
There are a number of ways that personally identifiable information like names, Social Security numbers, physical or email addresses, etc. can be disassociated from the individuals it belongs to:
- Masking. Some common data masking techniques include word or character substitution and character shuffling. But as you can probably guess, this information can be re-identified, so it is not true anonymization.
- Generalization. This technique eliminates sensitive parts of data without changing the important information. For example, removing some parts of home addresses while still keeping the general geographic location intact.
- Swapping/shuffling/permutation. As the name suggests, this method rearranges data so the same data points are in the dataset, just not in the original order.
- Perturbation. This technique uses a proportional factor to add what data scientists call “random noise” to a dataset. This can be a complex process, but random noise can also be filtered out, so this method isn’t foolproof either.
- Synthetic data. This is the only technique that could be acceptable under the GDPR and similar regulations. It involves creating artificial datasets that look like (that is, maintain the relevant properties of) the original dataset. Though the GDPR doesn’t explicitly discuss synthetic data, it states that the regulations apply only to data that has a link to “an identifiable natural person”, which synthetic data does not, even if it mimics real user information.
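The first four techniques in the list above can be sketched in a few lines of Python. This is only a simplified illustration using the standard library: the function names, the email and ZIP code examples, and the noise scale are assumptions made for this sketch, not a compliance-ready implementation. Synthetic data is left out because it requires a statistical model of the original dataset.

```python
import random


def mask_email(email):
    """Masking: substitute '*' for all but the first character before the '@'."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


def generalize_zip(zip_code, keep=3):
    """Generalization: keep only the leading digits, hiding the exact area."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def shuffle_column(rows, field, rng):
    """Swapping: permute one column so values no longer line up with their rows."""
    values = [row[field] for row in rows]
    rng.shuffle(values)
    return [{**row, field: value} for row, value in zip(rows, values)]


def perturb(value, scale, rng):
    """Perturbation: add zero-mean Gaussian noise with standard deviation `scale`."""
    return value + rng.gauss(0, scale)


rng = random.Random(42)  # seeded only so the example is repeatable

masked = mask_email("jane.doe@example.com")   # "j*******@example.com"
area = generalize_zip("94107")                # "941**"

# Hypothetical rows for illustration only.
rows = [
    {"id": 1, "salary": 50000},
    {"id": 2, "salary": 62000},
    {"id": 3, "salary": 71000},
]
shuffled = shuffle_column(rows, "salary", rng)
noisy_salary = perturb(50000, scale=500, rng=rng)
```

Note that shuffling keeps the exact same set of salary values in the dataset, and the noise added by perturbation averages out over many records, which is part of why none of these techniques amounts to true anonymization on its own.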
Anonymization best practices
Perhaps the most common use of data de-identification has always been in the healthcare sector, where providers must store medical records in a way that does not put individuals at risk in the event of a breach. However, big data has pushed most businesses to think about privacy compliance, whether they run ecommerce stores, social media marketing campaigns, or other online operations.
For businesses online, privacy should be at the core of digital processes, including websites. A consent management platform can be a key tool in securing user consent and achieving data privacy compliance.
Impacts of data anonymization
There are some obvious benefits of data protection for online users. It’s not hard to see how it could be dangerous for information like health data, account credentials, or contact information to be made widely available. There have been breaches of privacy from anonymization errors that have proven the danger of violations and the need for enforcement of privacy regulations.
While an increasing number of consumers and internet users express concern over data privacy, they also show preferences for things like personalization in recommendations and advertising. This presents a challenge for marketers because, while high-level performance metrics like ROI and various on-page SEO KPIs can be tracked even with de-identification, de-identified data cannot be used for direct marketing efforts or personalization.
That being said, data privacy can also be used to the advantage of marketers if businesses make privacy a part of their brand. In fact, improving trust in the business is considered one of the best ways to build brand value to increase revenue growth, and customers knowing that their data is protected is a crucial part of building that trust. By following these tips and investing in the right compliance monitoring tools, companies can better ensure the safety of customers’ data.
With data breaches rapidly increasing and privacy regulations drawing greater national and international attention, businesses that aren’t yet focused on data privacy need to start immediately, and those that have already taken steps to secure their operations and customers’ data need to keep up with changing technologies and legislation. A website data privacy audit is a good start toward compliance with relevant privacy regulations. And as always, our experts are happy to answer any questions.