Big data: With great data comes great responsibility

June 23, 2021

Dr Florian Kerschbaum from the University of Waterloo argues that with great data comes great responsibility in this big data focus

Big data helps us combat the pandemic and build cleaner technology, amongst many other benefits. We collect exposure data to contain and trace the spread of the coronavirus. Genome databases can help build better medications and vaccines.

Personalised medicine may eventually allow us to defeat cancer. We collect household energy consumption data to build a smart grid that helps save energy and supports renewable energy sources. Car- and ride-sharing services, in combination with electric cars, will eventually revolutionise transportation and lead to significantly lower carbon emissions.

Big data

All these services are based on the collection, processing and sharing of big data, in particular personal data. They come with great societal benefits, but they also come with risks if the data is not handled properly. Personal data stored in large corporate repositories is regularly exfiltrated in data breaches. Almost every individual in Western society has had their data exposed, almost always in multiple data breaches. Nation-state actors have been exposed conducting cyber-espionage against individuals and companies. Although it is 2021, the world of George Orwell’s novel 1984 is a distinct possibility. So, we must ask ourselves: How do we reconcile Western civil liberties with the advance of ubiquitous data collection technologies?

This is a multi-faceted, interdisciplinary question spanning areas such as law, politics, and all aspects of technology. As a technologist, I feel the responsibility to educate people about the possibilities, good and bad, and to advance the state of the art and the state of practice in the security and privacy of data. On the one hand, data science is the scientific discipline that deals with getting from sensor measurements to useful insights and predictions. It progresses in five steps: data collection, preparation, management, analysis, and use.

For each of those steps, specific technologies have been developed, and each of those technologies and steps comes with specific security and privacy challenges. On the other hand, computer security, privacy, and cryptography deal with securing data at rest (e.g., stored on a permanent medium), in transit (e.g., transmitted over a network) or in use (e.g., while being processed in a computer). We have different technologies that are adept at providing protection in each of these states. However, we do not have an integrated, end-to-end technology stack that covers the entire data science process. I want to highlight some challenges that make this task difficult:

Human-in-the-loop: Data science is mostly driven and performed by humans, the data scientists. Many steps in the successful conversion from sensor readings to insights require human intervention. While we can control data in electronic form, controlling data once it has been exposed to a human being is a different matter. Almost all recent crime fiction involves an investigator obtaining electronic records about a suspect, legally or deceitfully. We need to design our data protection mechanisms with the human in mind.

Unintended side effects: Raw data leaves traces throughout the entire data science process. For example, Homer’s attack showed that aggregate genomic studies can be used to infer information about the health status of participants in these studies, and motivated the NIH to remove some genome studies and data from the public domain. Sweeney’s attack recovered the health records of the governor of Massachusetts by combining anonymised patient records with the voter registry. Privacy researchers are continually discovering new data leaks of this kind. We need to design principled protection mechanisms that respect the data science process.

Distinct application requirements: There exist different data protection technologies, but not all are equally fit for all applications. For example, a study on vaccine efficacy cannot work with aggregate or perturbed data, which rules out many protection mechanisms. However, the data is collected in a very controlled environment. Mobile phone data, by contrast, is collected by many stakeholders (network providers, platform providers and app developers), but for many uses aggregate data suffices.

Technological shortcomings: There are inherent limitations to data protection mechanisms that require trade-offs in the design of a privacy-respecting data science process. For example, I argue that it is impossible to efficiently outsource a relational database to a single service provider without leaking information about the data being processed[1]. My team and I have shown that supposedly provably secure protection mechanisms for machine learning models fail when the training data is biased[2]. We need to understand the limits of current data protection technologies and strive to improve them or develop better ones.
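To make the unintended side effects challenge more concrete, the following is a minimal Python sketch of a linkage attack in the spirit of the one attributed to Sweeney above. It is not the original method or data: the records, the names, the `link_records` function and the choice of ZIP code, birth date and sex as quasi-identifiers are all illustrative assumptions. The sketch joins an "anonymised" patient table with a public registry and reports every patient whose quasi-identifiers match exactly one registry entry.

```python
from collections import defaultdict

# Assumed quasi-identifiers: attributes present in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(anonymised, registry, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for records that match exactly one registry entry."""
    # Index the registry by its quasi-identifier tuple.
    index = defaultdict(list)
    for person in registry:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in anonymised:
        names = index.get(tuple(record[k] for k in keys), [])
        # A unique match re-identifies the supposedly anonymous record.
        if len(names) == 1:
            matches.append((names[0], record["diagnosis"]))
    return matches


# Entirely fictional toy data (hypothetical names and values).
patients = [
    {"zip": "00001", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "condition A"},
    {"zip": "00002", "birth_date": "1980-05-17", "sex": "F", "diagnosis": "condition B"},
]
voters = [
    {"name": "Alice Example", "zip": "00002", "birth_date": "1980-05-17", "sex": "F"},
    {"name": "Bob Example", "zip": "00001", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Carol Example", "zip": "00003", "birth_date": "1975-09-30", "sex": "F"},
]

reidentified = link_records(patients, voters)
print(reidentified)
```

In this toy example both patient records are re-identified even though neither contains a name, which is why removing direct identifiers alone is usually not considered sufficient protection.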

Driving data protection technology

These challenges highlight how difficult, yet important, it is to drive data protection technology from its inception and design to its ultimate adoption and deployment in the data science process. This requires a joint effort by academia, industry, and government. Each party needs to adapt to meet the challenge. Academia needs to understand and work towards industrially or societally relevant challenges. It needs to provide sound foundations and strive towards scientific excellence.

Industry needs to understand the technology and embrace the motivation of society. While unconstrained data collection offers the highest immediate revenues, it is socially detrimental and no longer accepted. Many companies whose business model is solely based on data collection are implementing self-imposed restrictions and increasing their use of privacy-respecting technology. Government needs to guide and balance the process of converting public opinion and economic interests into societal norms and regulations.

Progress

Several important advances in technology and governance have been made, and we need to keep up the pace. The Canadian National Cybersecurity Consortium and its privacy network aim to bring all these stakeholders in Canada together to build an improved innovation pipeline from invention to products and services. The consensus on the need for privacy in contact tracing technology[3] has contributed to the deployment of a privacy-conscious app in Canada.

As early as 2011, my student Marek Jawurek and I outlined the cornerstones of a privacy-preserving smart grid[4]. I very much look forward to a world where the benefits of big data are enjoyed in a privacy-respecting manner consistent with Western civil liberties.

References

[1] https://www.youtube.com/watch?v=auCrSKH2oVI

[2] Thomas Humphries, Matthew Rafuse, Lindsey Tulloch, Simon Oya, Ian Goldberg, Florian Kerschbaum: Differentially Private Learning Does Not Bound Membership Inference. arXiv abs/2010.12112 (2020).

[3] https://uwaterloo.ca/cybersecurity-privacy-institute/news/coronavirus-statement

[4] Marek Jawurek, Martin Johns, Florian Kerschbaum: Plug-In Privacy for Smart Metering Billing. PETS 2011.

Please note: This is a commercial profile