
Protecting data, protecting truth

A manifesto on data protection, data governance, and internationalism. Adapted from a speech delivered at the United Nations International School’s UNIS-UN conference, in the UN General Assembly Hall in New York.
On February 28th, 2018, I delivered a speech at the United Nations International School’s UNIS-UN Conference, held in the United Nations General Assembly hall in New York. This year’s conference was titled “Under CTRL: Technology, Innovation and the Future of Work.”
Though produced by students at UNIS in New York City, the conference was attended by approximately 750 students from international schools around the world. I myself graduated from UNIS in 1984; it’s where I first learned to program computers in BASIC, starting in 1978, on machines including a DEC PDP-8 and a Radio Shack TRS-80 Model 1.
What follows is an adaptation of the speech, presented in a format more suitable to publication as a post here on ZDNet. If you wish to view the speech in its entirety, you can do so here.
Data is nothing new, nor are databases, or even analytics systems, which themselves date back the better part of 50 years.
What is new, though, is how much data we collect, how much we keep, and what we can now do with it. That has changed a lot. We used to collect data at the level of a single transaction: a purchase, for example, or a single playground inspection.
Now we track every click leading up to the purchase, and the upsell ads that were served. And maybe the NYC Parks Department, where I created database systems in the mid-1980s, starting in my freshman year of college, is tracking entrances and exits through the park gates, or the number of baskets through a given hoop on a given court.
With today’s Internet of Things — or IoT — sensors, tracking all of that, in real time, is now quite feasible. And maybe there’s even a college freshman at the Parks Department building the database that handles it.
Moreover, we can keep so much of this data now. The economics allow it, whereas it was cost-prohibitive previously. The cloud provides for cheap storage…sometimes really cheap, if you’re willing to wait a few hours before it’s served up. And even in the on-premises world, new distributed file systems make massive and fault-tolerant data storage possible without needing to buy expensive, proprietary storage appliances.
As I said, what we can now do with the data is even more interesting. If we’re tracking clicks leading up to a purchase, we can start to predict whether someone is going to buy something, how much they’re going to spend and what they’re going to buy. In the case of tracking parks usage, we can predict when peak usage times will be, and thus when to deploy more maintenance workers, trash collectors and Urban Park Rangers. That can help with the budgeting process, too. Although I don’t think we’ve yet discovered a data technology that can create much efficiency in the New York City Council.
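To make that purchase-prediction idea concrete, here’s a minimal, hypothetical sketch using scikit-learn. The clickstream features and numbers are invented for illustration; a real system would train on millions of sessions.

```python
from sklearn.linear_model import LogisticRegression

# Invented clickstream features per session:
# [pages viewed, items added to cart, upsell ads clicked]
sessions = [
    [3, 0, 0],
    [12, 1, 0],
    [25, 2, 1],
    [7, 0, 1],
    [30, 3, 2],
    [5, 1, 0],
]
purchased = [0, 0, 1, 0, 1, 0]  # did each session end in a purchase?

model = LogisticRegression()
model.fit(sessions, purchased)

# Estimated probability that a new session, with 18 page views,
# 2 cart adds and 1 ad click, ends in a purchase.
print(model.predict_proba([[18, 2, 1]])[0][1])
```

The same pattern extends to predicting how much a shopper will spend, just with a regression model instead of a classifier.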
Data points
What I like to say, when I’m feeling corny, is that “data is life.” Every piece of data is a point-in-time recording of something that occurred, involving a person, an organization, a machine or groups of these things. The frequency at which we record these events now is much greater than it was. And so data has become more — shall we say — intimate.
Recording these point-in-time events means that data collection documents objective facts. And in an age when facts are disregarded, disparaged or — worst of all — falsified, this is a key facet of data and analytics that I think we must seize upon. In data lies truth. In analytics lies likelihood. It’s the ultimate weapon against distortion and disinformation. It’s a resource for doing good.
But data can also be used for more sinister, cynical purposes.
For example, it can be used to determine social media ad placement, targeting specific people with specific political tendencies, with content that isn’t data-driven or factual at all, but rather manipulative hyperbole, at best. It can be used for get-out-the-vote efforts and election day ground game management. But it can also provide tactical advice on voter suppression. Data isn’t just a resource for objective truth and good. Its predictive power can be a tool for spreading fear, uncertainty and doubt. So data can be, and has been, a tool for malfeasance.
If we look ahead at where predictive analytics may take us, it could be used to forecast opposition behavior. It could become not just a tool for small parts of a political campaign, but for planning every town-hall meeting, diner meet-and-greet, and full-on rally. It could even be used to determine messaging and policy, tailored for a specific locale. This would be policy optimized not for outcomes helpful to society, but simply to manipulate thinking and garner more votes. We might even imagine a time when predictive analytics could be used to automate and run a war. Data would become — almost literally — weaponized. And that’s extremely troubling.
Keeping AI honest
Even when used for purported good, though, we have to keep our eye on things. I’ve been talking about predictive analytics. That’s one name for it. Another, older one is data mining. The newest name is machine learning, and that, in turn, gets used interchangeably with artificial intelligence, or AI (even though they’re not the same thing).
I actually studied AI in college, from 1986 through 1988. Here, again, the technology is not new. But AI never really caught on then…it couldn’t. Computers weren’t cheap enough and weren’t fast enough. So most predictive models had to be built on a sampling of the data, and even then it took forever to train the models.
Those limitations are mostly gone now. As I said, storage technology now lets us keep tons of data, and computing power now lets us build models using all of it.
We have much more powerful central processing units, or CPUs. And, more important, we now have incredibly powerful GPUs — or graphics processing units. Graphics may not sound relevant to AI but, as it turns out, technology that can do numerous complex calculations simultaneously (in parallel), which is what GPUs do, can turbocharge both graphics and AI.
In fact, NVIDIA, which started out as a graphics and gaming company, is now one of the most important companies in AI. Its GPUs are used on all the major cloud platforms, and its technology is becoming the de facto standard. As AI becomes more pervasive, the companies in leadership positions in the tech industry may well change. Keep an eye on it.
Let’s demystify this, though. AI and machine learning work on a fairly simple premise: by looking at how some data impacts the values of other data, statistical models can be built that predict the latter from the former. That’s pretty straightforward — it’s not magic. Fitting numbers to a curve yields a mathematical model that takes a bunch of inputs and returns a predicted value as an output.
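Here’s what that looks like in a minimal sketch, with made-up numbers borrowed from the parks example (nothing here comes from a real dataset):

```python
import numpy as np

# Made-up observations: hours a park was open vs. pounds of trash collected.
hours_open = np.array([4, 6, 8, 10, 12, 14])
trash_lbs = np.array([35, 52, 71, 88, 110, 125])

# "Fitting numbers to a curve": find the straight line that best
# relates the inputs to the outputs.
slope, intercept = np.polyfit(hours_open, trash_lbs, deg=1)

# The resulting model: input in, predicted value out.
def predict_trash(hours):
    return slope * hours + intercept

print(predict_trash(9))  # predicted pounds of trash for a 9-hour day
```

Real machine learning models fit far more elaborate curves over far more inputs, but the input-to-prediction premise is the same.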
A model for disclosure
But how was that model actually built? Was the data that went into it completely valid? Was it based on IoT data from sensors that were placed according to some bias? In the Parks Department IoT example, were the sensors in parks in well-to-do neighborhoods deployed more carefully than those in poorer neighborhoods? Will resources be distributed unfairly because of it?
The reality is that we just don’t know. The process of building machine learning models is pretty closed. The models themselves are black boxes.
Fifteen years ago, data mining systems were able to visualize their models, disclosing their content and structure. Today’s models are more complex, and the need to visualize them is more acute. Unfortunately, that kind of visualization seems to be a low priority in the industry.
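Some of that capability still exists for the simpler model types. As an illustrative sketch, a scikit-learn decision tree (a classic data mining model) can dump its own content and structure as readable rules; the dataset here is a stock public sample, chosen only for the demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small decision tree on a public sample dataset.
iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# Disclose the model's structure as human-readable if/then rules.
print(export_text(model, feature_names=iris.feature_names))
```

Today’s deep models offer no comparably simple readout, which is exactly the gap I’m describing.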
On the one hand, we might say “who cares?” These models are used by data scientists in corporate or scientific settings, so public accountability seems beside the point. And even if the details of the model were shared, how on earth would a lay person be able to interpret them? On the other hand, we have to be vigilant. We have to notice that machine learning models are being trusted more, relied upon with fewer restrictions, and becoming more ubiquitous.
Disclosure of a model’s contents, even if not interpretable by the vast majority of people, is something specialists working in the public interest could examine and interpret. Transparency is a deterrent against abuse. If we’re lax about it now, then by the time machine learning is pervasive in our lives, we will have ceded both our rights and our responsibilities in its management. Neither outcome is good.
Data ethics
Not only do we want to know how the models were built, but we need to know what data was used to build them, and we need to know that none of it was illegitimately collected. Data ethics is a real — and urgent — concern. It may be corny to say “data is life,” but it does demonstrate how sensitive data can be, and how access to it needs to be restricted and protected.
You guys are probably sick of hearing people say how, by having a smartphone, you are walking around with a powerful computer in your pocket. But it’s true. And your phone is also a homing device, tracking where you’ve been the whole time you’ve had it on your person.
There are about a dozen different sensors in an iPhone, tracking things like your speed, rotation, face proximity to the device, and more. As long as you have your phone on you, you are essentially an IoT device.
There’s a lot of good that can be done with that data, and there’s a lot of unsavory stuff too. It’s good to be able to prove you weren’t somewhere that you shouldn’t have been. But you’re still entitled to your privacy. Do you want everyone to know what section of the library you were in, and when? When you voted? When you were in a drug store and whether you were at the pharmacy counter? All of that is innocent activity, but the intimacy of it likely isn’t something you’d want to share.
