What Should Software Engineers Know About GDPR?

December 10, 2017

EU General Data Protection Regulation (GDPR) is moving out of the transition period next summer to become enforceable. GDPR strongly emphasizes risk-based thinking; you take every step to mitigate privacy risks until the risks become something you can tolerate. As a software developer, this will affect you. This is what you need to know.
Are you going to create new software solutions in 2018? If so, it might be a good idea to read this. The EU General Data Protection Regulation (GDPR) is moving out of the transition period next summer to become enforceable. Violating its terms might lead you to face fines of up to 20 million euros or 4% of worldwide annual turnover, whichever is higher, which means much more for large organizations. In addition to sanctions listed in the regulation, jail time is even possible for individuals responsible for great neglect or data breaches.
Obviously, this sounds severe. I have seen two extreme approaches to GDPR: one is to pretend it does not apply to you and try to ignore it; the other is to declare that the sky is falling and that no development can touch personal data anymore. Both approaches are wrong, misinformed, and could lead to huge losses. GDPR does not create an end-to-all-personal-data scenario; instead, it sets rules for transparent and secure handling of personal data and threatens those who ignore them with very juicy sanctions.
GDPR strongly emphasizes risk-based thinking; you take every step to mitigate privacy risks until the risks become something you can tolerate. I appreciate this regulation: there is enough software that has absolutely no security or privacy built into the design. This sort of software and its breaches lead users to mistrust how their personal data is being used. It’s time to change that.
This topic is huge, so I am concentrating purely on the process of crafting new software solutions. There is a lot to be said about organizational support and legacy systems, but they are highly dependent on the starting point. The GDPR does not allow many exceptions to the rule, so big and small businesses, non-profits, and government organizations all need to know the main points.
One key point of the new regulation is transparency for the data subjects. When you have a registry — for example, a database — that contains personally identifiable data, the GDPR holds that its use should be transparent to the data subjects. This means that people whose data you are collecting should be able to find out what you are collecting, your purpose, who has access to the data, and how long the data lives within the systems. To cope with this requirement, you naturally should know all these things and document them. Along with transparency, you need to provide better access to said data. Your data subjects should be able to verify, correct, export, move, and erase their data as easily as they gave it to you in the first place.
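To make the access rights above concrete, here is a minimal sketch of a registry that lets a data subject export, correct, and erase their own record. The names SubjectRecord and Registry are hypothetical and the storage is an in-memory dictionary; a real system would sit on a database and would also have to cover backups and downstream copies.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SubjectRecord:
    """Hypothetical record holding everything stored about one data subject."""
    subject_id: str
    email: str
    purpose: str            # why the data is collected
    retention_days: int     # how long the data lives in the system
    attributes: dict = field(default_factory=dict)


class Registry:
    """In-memory stand-in for a database of personal data (an assumption)."""

    def __init__(self):
        self._records = {}

    def store(self, record):
        self._records[record.subject_id] = record

    def export(self, subject_id):
        """Return the subject's data in a portable, machine-readable form (JSON)."""
        return json.dumps(asdict(self._records[subject_id]), sort_keys=True)

    def correct(self, subject_id, **changes):
        """Let the subject fix inaccurate fields; unknown field names are rejected."""
        record = self._records[subject_id]
        for name, value in changes.items():
            if not hasattr(record, name):
                raise AttributeError(name)
            setattr(record, name, value)

    def erase(self, subject_id):
        """Delete the subject's record; return True if something was removed."""
        return self._records.pop(subject_id, None) is not None
```

Keeping the purpose and retention period next to the data is one way to make sure the answers to "what, why, and for how long" are always available when a data subject asks.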
Another important topic is privacy by design/default. This should actually be integrated into every bit of architecture from now on. It should have been an automatic element of design before this regulation, but people often don’t want to pay for security or privacy until something happens. GDPR gives a powerful incentive to take care of this now: an incentive worth up to 20 million euros. Privacy by default means a lot of things, but it essentially aims to protect personally identifiable data and its privacy, with suitable controls. This typically requires, for example, clear audit trails in the form of who did what when, including and especially read access of personally identifiable information. Additionally, you should pay attention to data when it’s being stored and in transit between different layers, and apply suitable encryption to avoid data leaking from your systems.
You should also have a valid basis for processing personal data, meaning what specifically gives you the right to collect and process the information. The basis, for example, could be a law that requires you to collect and store information on individuals for a period of time. The basis for processing personal data may be a contract, agreement, or transaction.
You can ask for consent to collect and process personal data but GDPR does not let you off easy here. It is not acceptable to have a checkbox already checked with a statement like “I accept that my information may be used for marketing purposes.” Consent must be clear, precise, and understandable — and cannot be pre-set. It should be as easy to cancel the consent as it is to consent in the first place. Software designers can decide none of this on their own but need to discuss it with whoever owns the software.
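As a rough illustration of the consent rules above, the following sketch models consent that is off by default, is only switched on by an explicit action, and can be withdrawn with a single call. The Consent class and may_send_marketing function are hypothetical names; what counts as valid consent wording is a decision for the software owner, not for this code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Consent:
    """Hypothetical consent record for one subject and one purpose."""
    subject_id: str
    purpose: str
    granted: bool = False                  # never pre-checked
    changed_at: Optional[datetime] = None
    history: list = field(default_factory=list)

    def _record(self, granted):
        # Keep a timestamped history so you can show when consent changed.
        self.granted = granted
        self.changed_at = datetime.now(timezone.utc)
        self.history.append((self.changed_at, granted))

    def give(self):
        """Call only on an explicit user action, such as ticking the box."""
        self._record(True)

    def withdraw(self):
        """Withdrawal is one call, as easy as giving consent."""
        self._record(False)


def may_send_marketing(consent):
    """Marketing is allowed only with currently granted marketing consent."""
    return consent.purpose == "marketing" and consent.granted
```

Every code path that uses the data for that purpose should go through a check like may_send_marketing, so that a withdrawal takes effect immediately.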
Here’s an interesting point. If the team members that build the software have access to actual personal data while building it, they become data processors and liable to the same sanctions and responsibilities. The same goes for the operations team. If they have access to databases and data, they are liable and responsible. You might want to think hard about that. It is possible to build and operate most systems without accessing actual customer data, after all.
GDPR is only interested in personally identifiable information (PII). GDPR does not apply to data that is not attached to a person, such as product or accounting information. You might still classify it as sensitive and might still want to protect it, but GDPR considers it non-PII data and ignores those situations.
GDPR identifies two classes of PII data. There is data that can be used to uniquely identify a person like social-security number, e-mail address, or anything directly connected to these identifiers such as purchase history. Then there is extra-sensitive data such as medical/health information, religion, sexual orientation, or any information on/collected from a minor.
Do note that according to GDPR, combinations of information that may not be unique in isolation can potentially identify an individual. So PII also includes identities that may be deduced from values like postcode, travel, or multiple locations such as places of purchase. Tiny datasets and rare combinations of values make personal identification easier.
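One way to check for this in your own data is to count how many rows share each combination of values. The sketch below does that with two hypothetical helpers, smallest_group and unique_rows; the column names you pass in are your own guess at which fields could identify someone in combination.

```python
from collections import Counter


def _key(row, quasi_identifiers):
    """Combination of values that together might single out a person."""
    return tuple(row[q] for q in quasi_identifiers)


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing one value combination.

    A result of 1 means at least one person is unique on these fields.
    """
    groups = Counter(_key(row, quasi_identifiers) for row in rows)
    return min(groups.values())


def unique_rows(rows, quasi_identifiers):
    """Rows whose value combination appears exactly once in the dataset."""
    groups = Counter(_key(row, quasi_identifiers) for row in rows)
    return [row for row in rows if groups[_key(row, quasi_identifiers)] == 1]
```

In a four-row dataset with two postcodes and two birth years, each field alone can put every person in a group of two, while the pair of fields makes every row unique. That is exactly the effect described above, and it gets stronger as datasets get smaller.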
Since any information attached to or collected from a person is protected under privacy rules, most databases are going to contain PII, with some exceptions. I would estimate 70%-80% of typical systems data to be PII. It’s not only social-security numbers and credit-card numbers that you should protect.
There’s been a lot of discussion of access logs, audit logs, etc. that contain IP addresses or surrogate keys. Are these personal data? Are they registers? Do all personal rights extend to them? How strongly should they be protected? Experts seem to disagree about the answers. We have to wait and see how this evolves. I would advise, however, to avoid hysteria and to use common sense in grey areas. This sort of information could and should be protected to some extent, depending on how much harm a data breach would inflict. But I simply don’t see every web server in the world becoming a PII registry in the most demanding sense of the definition.
The cheapest way to have your software comply with GDPR is to build the requirements right in. How comprehensively you want to do this depends on the risk level of the particular system in question:
If you have few users and the information that you collect is neither sensitive nor harmful, you might consider your system a low-risk environment and use more cost-effective controls to protect it. On the other hand, if your system contains sensitive data for many users, you would want to apply stronger protection.
A good audit trail is a minimal requirement. An audit trail not only shows that you have applied controls, it also helps you limit the damages in case of a data breach. After any data breach, whether by an internal or external party, the first thing you need to do is find forensics that can show which users are affected and which data were accessed. This is the information that you need to report to data-protection authorities. Additionally, these are the users you may need to notify about the data breach. If you have no forensics, you need to assume that a breach may have affected all users and all records.
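To show what such forensics could look like, here is a small sketch that takes audit events and works out which data subjects and which fields a compromised account touched during a time window. AuditEvent and breach_scope are hypothetical names, and the event fields are just one possible shape for "who did what when".

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit record: who did what, when, to whose data."""
    at: datetime
    actor: str         # who
    action: str        # what: "read", "update", "delete"
    subject_id: str    # whose data was touched
    fields: tuple      # which fields were touched


def breach_scope(events, actor, start, end):
    """Map each affected subject to the set of fields the actor touched
    between start and end (both inclusive)."""
    scope = {}
    for event in events:
        if event.actor == actor and start <= event.at <= end:
            scope.setdefault(event.subject_id, set()).update(event.fields)
    return scope
```

Note that this only works if read access is logged too. Without those events the result would be empty, and you would be back to assuming that all users and all records were affected.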
A good audit trail is also tamper-resistant and supports non-repudiation: in other words, it cannot be altered or damaged, even by system administrators, so nobody can plausibly deny a recorded action. You might want to use audit trails to see what data a system administrator was violating, for example. This has happened before, and will happen again. Audit trails are also classified as PII: they have a unique identity and data directly connected to that.
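One common building block for this is a hash chain, where every entry's hash covers the previous entry's hash. The sketch below (HashChainedLog is a hypothetical name) makes edits to old entries detectable; it does not make them impossible. An administrator who can rewrite the whole log could recompute the chain, so in practice the latest hash would also need to be stored somewhere that administrator cannot reach.

```python
import hashlib
import json


class HashChainedLog:
    """Append-only audit log in which each entry's hash covers its predecessor."""

    GENESIS = "0" * 64  # starting value for the first entry (an arbitrary choice)

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash, event):
        # Serialize deterministically so the same event always hashes the same.
        payload = json.dumps(event, sort_keys=True).encode("utf-8")
        return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()

    def append(self, event):
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append({"event": event, "hash": self._digest(prev_hash, event)})

    def verify(self):
        """Recompute the chain; return False if any entry was altered or removed
        from the middle of the log."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["hash"] != self._digest(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True
```

Changing or deleting an earlier entry breaks every hash after it, so a periodic call to verify() flags the tampering.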
After audit trails, the next task is to limit the exposure of data. The best way to do this is to limit what data you collect and how long it is stored. By introducing some kind of archival/erasure mechanisms in your software right from the beginning, you can document this for your users. If a data breach happens, it can only affect data that was actually in the targeted system at that point. Many systems continue to collect all data but never clean it up, even when the data becomes obsolete. GDPR encourages you to clearly define data lifecycles and to document them. You should also restrict access to data to only what’s really necessary. This is especially true for sensitive data.
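A data lifecycle can be as simple as a retention period per kind of record plus a job that removes what has expired. The following is a minimal sketch under that assumption; the RETENTION table, its periods, and the record layout are all hypothetical, and the right periods depend on your legal basis for keeping each kind of data.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy: how long each kind of record may be kept.
RETENTION = {
    "support_ticket": timedelta(days=365),
    "order": timedelta(days=7 * 365),
}


def purge_expired(records, now):
    """Split records into (kept, purged) according to the retention policy.

    Each record is a dict with "kind" and a timezone-aware "created_at".
    """
    kept, purged = [], []
    for record in records:
        expired = now - record["created_at"] > RETENTION[record["kind"]]
        (purged if expired else kept).append(record)
    return kept, purged
```

A scheduled job would call purge_expired(records, datetime.now(timezone.utc)) and then delete or archive the purged records. Having the policy in one table also gives you something concrete to document for your users.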
I already mentioned that you should have sufficient protection mechanisms for data that’s resting in a database or file system and that’s moving through a network, especially to other parties. Encryption is efficient but it has its weak spots. The most powerful encryption technology encrypts early, secures your keys, and decrypts late. Unfortunately, this is a complex and costly solution to implement. Cloud services, on the other end of the spectrum, often let you cheaply and simply encrypt an entire database with a checkbox or offer to manage keys and encryption for you. While easy, these mechanisms have weak spots. You just have to find what works for you, based on risks and sensitivity of data.
It’s worth mentioning that anonymization and pseudonymization mechanisms can help you with things like test data or analysis data. Anonymization basically removes all identifiable information by deleting or masking fields. Pseudonymization replaces identifiable information with pseudonyms, which typically keeps identities separate in the data. Both practices, however, are difficult to do right and may not offer perfect means to help your GDPR compliance. Still, these are valuable tools in your toolkit.
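As a sketch of pseudonymization for test or analysis data, the code below replaces identifying fields with a keyed hash (HMAC-SHA256 from the Python standard library) and drops fields that are not needed at all. The function names and the choice of which fields to hash or drop are assumptions for illustration. Anyone holding the key can recreate the pseudonyms for known values, so the key has to be stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Replace a value with a stable pseudonym derived from a secret key (bytes)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_row(row, key, identifying=("email",), drop=("name",)):
    """Return a copy of row with identifying fields replaced by pseudonyms
    and unneeded fields removed; all other fields are kept unchanged."""
    result = {}
    for field_name, value in row.items():
        if field_name in drop:
            continue
        if field_name in identifying:
            result[field_name] = pseudonymize(value, key)
        else:
            result[field_name] = value
    return result
```

The same input and key always give the same pseudonym, so rows belonging to one person can still be joined for analysis. The remaining fields may still identify someone in combination, which is one reason this is difficult to do right.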
You might want to revisit your logging standards and guidelines. It’s easiest if you can make sure that your logs do not contain PII — otherwise, they become PII registries as well with all the implications. Some logs are attached to individuals already: access logs and audit logs for example. But don’t pollute operational debug logs by writing user IDs, names, or similar values in them.
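The safest fix is not to pass personal data to the logger at all, but a filter can act as a safety net. The sketch below uses Python's standard logging.Filter to mask e-mail addresses before a record is written; RedactPII is a hypothetical name, and the regular expression is a deliberately simple approximation that will not catch every kind of personal data.

```python
import logging
import re

# Simple approximation of an e-mail address; an assumption, not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactPII(logging.Filter):
    """Mask e-mail addresses in log messages before they are emitted."""

    def filter(self, record):
        # Merge the arguments into the message first, so values passed as
        # logging arguments are masked as well.
        message = record.getMessage()
        record.msg = EMAIL.sub("[redacted-email]", message)
        record.args = ()
        return True  # keep the record, only its content is changed
```

Attaching it with handler.addFilter(RedactPII()) applies the masking to everything that handler writes, whichever logger produced the record.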