When a massive earthquake and tsunami hit the eastern coast of Japan on March 11, 2011, the Fukushima Daiichi Nuclear Power Plant failed, leaking radioactive material into the atmosphere and water. People around the country as well as others with family and friends in Japan were, understandably, concerned about radiation levels—but there was no easy way for them to get that information. I was part of a small group of volunteers who came together to start a nonprofit organization, Safecast, to design, build, and deploy Geiger counters and a website that would eventually make more than 100 million measurements of radiation levels available to the public.
We started in Japan, of course, but eventually people around the world joined the movement, creating an open global data set. The key to success was the mobile, easy to operate, high-quality but lower-cost kit that the Safecast team developed, which people could buy and build to collect data that they might then share on the Safecast website.
While Chernobyl and Three Mile Island spawned monitoring systems and activist NGOs as well, this was the first time that a global community of experts formed to create a baseline of radiation measurements, so that everyone could monitor radiation levels around the world and measure fluctuations caused by any radiation event. (Different regions have very different baseline radiation levels, and people need to know what those are if they are to understand whether anything has changed.)
More recently, Safecast, which is a not-for-profit organization, has begun to apply this model to air quality in general. The 2017 and 2018 fires in California were the air quality equivalent of the Daiichi nuclear disaster, and Twitter was full of conversations about N95 masks and how they were interfering with Face ID. People excitedly shared posts about air quality; I even saw Apple Watches displaying air quality figures. My hope is that this surge of interest in air quality among Silicon Valley elites will help advance a field, namely the monitoring of air quality, that has been steadily developing but has not yet been as successful as Safecast was with radiation measurements. I believe this lag stems in part from the fact that Silicon Valley believes so much in entrepreneurs that people there try to solve every problem with a startup. But that’s not always the right approach.
Hopefully, interest in data about air quality and the difficulty in getting a comprehensive view will drive more people to consider an open data approach over proprietary ones. Right now, big companies and governments are the largest users of data that we’ve handed to them, mostly for free, to lock up in their vaults. Pharmaceutical firms, for instance, use the data to develop drugs that save lives, but they could save more lives if their data were shared. We need to start using data for more than commercial exploitation, deploying it to understand the long-term effects of policy and to create transparency around those in power, not private citizens. We need to flip the model from short-term commercial use to long-term societal benefit.
The first portable air sensors were the canaries that miners used to monitor for poison gases in coal mines. Portable air sensors that consumers could easily use were developed in the early 2000s, and since then the technology for measuring air quality has changed so rapidly that data collected just a few years ago is often now considered obsolete. Neither “air quality” nor the Air Quality Index is standardized, so levels get defined differently by different groups and governments, with little coordination or transparency.
Yet right now, the majority of players are commercial entities that keep their data locked up, a business strategy reminiscent of software before we “discovered” the importance of making it free and open source. These companies are not coordinating or contributing data to the commons and are diverting important attention and financial resources away from nonprofit efforts to create standards and open data, so we can conduct research and give the public real baseline measurements. It’s as if everyone is building and buying thermometers that measure temperatures in Celsius, Fahrenheit, Delisle, Newton, Rankine, Réaumur, and Rømer, or even making up their own bespoke measurement systems without discussing or sharing conversion rates. Although standardizing would likely benefit these businesses, competing companies have a difficult time coordinating on their own and try to use proprietary, nonstandard improvements as a business advantage.
To attempt to standardize the measurement of small particulates in the air, a number of organizations have created the Air Sensor Workgroup. The ASW is working to build an Air Quality Data Commons to encourage sharing of data with standardized measurements, but there is little participation from the for-profit startups making the sensors that suddenly became much more popular in the aftermath of the fires in California.
Although various groups are making efforts to reach consensus on the science and process of measuring air quality, they are confounded by startups that believe (or whose investors believe) their business depends on big data that is owned and protected. Startups don’t naturally collaborate, share, or conduct open research, and I haven’t seen any air quality startups with a mechanism for making their collected data available if the business shuts down.
Air quality startups may seem like a niche issue. But the issue of sharing pools of data applies to many very important industries. I see, for instance, a related challenge in data from clinical trials.
The lack of central repositories of data from past clinical trials has made it difficult, if not impossible, for researchers to look back at the science that has already been performed. The federal government spends billions of dollars on research, and while some projects like the Cancer Moonshot mandate data openness, most government funding doesn’t require it. Biopharmaceutical firms submit trial data as evidence to the FDA but, as a rule, not to researchers or the general public, in much the same way that most makers of air quality detection gadgets don’t share their data. Clinical trial data and medical research funded by government thus may sit hidden behind corporate doors at big companies. Preventing the use of such data impedes discovery of new drugs through novel techniques and makes it impossible for benefits and results to accrue to other trials.
Open data will be key to modernizing the clinical trial process and integrating AI and other advanced techniques used for analyses, which would greatly improve health care in general. I discuss some of these considerations in more detail in my PhD thesis.
Some clinical trials have already begun requiring the sharing of individual patient data for clinical analyses within six months of a trial’s end. And there are several initiatives sharing data in a noncompetitive manner, which lets researchers create promising ecosystems and data “lakes” that could lead to new insights and better therapies.
Overwhelming public outcry can also help spur the embrace of open data. Before the 2011 earthquake in Japan, only the government there and large corporations held radiation measurements, and those were not granular. People only began caring about radiation measurements when the Fukushima Daiichi site started spewing radioactive material, and the organizations that held that data were reluctant to release it because they wanted to avoid causing panic. However, the public demanded the data, and that drove the activism that fueled the success of Safecast. (Free and open source software also started with hobbyists and academics. Initially there was a great deal of fighting between advocacy groups and corporations, but eventually the business models clicked and free and open source software became mainstream.)
We have a choice about which sensors we buy. Before going out and buying a fancy new sensor or backing that viral Kickstarter campaign, make sure the organization behind it makes a credible case about the scholarship underpinning its technology; explains its data standards; and, most importantly, pledges to share its data using a Creative Commons CC0 dedication. For privacy-sensitive data sets that can’t be fully open, like those at Ancestry.com and 23andMe, advances in cryptography such as multiparty computation and zero-knowledge proofs would allow researchers to learn from data sets without the release of sensitive details.
We have the opportunity and the imperative to reframe the debate on who should own and control our data. Big Data's narrative sells the idea that those owning the data control the market, and it is playing out in a tragedy of the commons, confounding the use of information for society and science.