
Challenges When Aggregating Data Published Across Many Years

By

admin

July 15, 2017


Learn about the challenges of large research data aggregation projects and why you need a strategy for standardizing how you store and publish your data online.
My partner in crime is working on a large data aggregation project regarding ed-tech funding. She is publishing data to Google Sheets, and I’m helping her develop Jekyll templates she can fork and expand using GitHub to publish and tell stories around this data across her network of sites. Like API Evangelist, Hack Education runs as a network of GitHub repositories with a common template across them. We call the overlap between API Evangelist and Hack Education Contrafabulists.
One of the smaller projects she is working on as part of her ed-tech funding research involves pulling the grants made by the Gates Foundation since the 1990s. It is similar to my story a couple of weeks ago about my friend David Kernohan, who wanted to pull data from multiple sources and aggregate it into a single, workable project. Audrey is looking to pull data from a single source, but because the data spans almost 20 years, it ends up being a lot like aggregating data from across multiple sources.
A couple of the challenges she is facing trying to gather the data and aggregate as a common dataset are:
Data research takes time and is tedious, mind-numbing work. I encounter many projects like hers where I have to decide between scraping and manually aggregating and normalizing the data; each approach has its own pros and cons depending on the project. I wish I could help, but it sounds like it will take a significant amount of manual labor to establish a coherent set of data in Google Sheets. Once she is done, though, she has all the tools in place to publish the data as YAML to GitHub and get to work telling stories around it across her sites using Jekyll and Liquid. I’m also helping her make sure she has a JSON representation of each of her data projects, allowing others to build on top of her hard work.
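To give a rough idea of what the normalizing step can look like once the data is in a spreadsheet, here is a minimal Python sketch. It is not Audrey's actual workflow. It assumes each batch of rows has been exported as CSV, that the same field shows up under different header names in different years, and that the goal is one JSON representation. The header names, the COLUMN_ALIASES mapping, the function names, and the sample rows are all invented for illustration.

```python
import csv
import io
import json

# Hypothetical mapping from header variants (as they might differ from one
# year's publication to the next) to a single canonical field name.
COLUMN_ALIASES = {
    "grantee": "grantee",
    "grantee name": "grantee",
    "organization": "grantee",
    "amount": "amount",
    "grant amount": "amount",
    "year": "year",
}


def normalize_rows(csv_text, aliases=COLUMN_ALIASES):
    """Parse CSV text and rename known header variants to canonical names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        row = {}
        for header, value in raw.items():
            # DictReader uses None for overflow columns and missing values.
            if header is None or value is None:
                continue
            key = aliases.get(header.strip().lower())
            if key is not None:  # columns we do not recognize are skipped
                row[key] = value.strip()
        rows.append(row)
    return rows


def to_json(rows):
    """Serialize the normalized rows as a JSON array."""
    return json.dumps(rows, indent=2, sort_keys=True)


# Invented sample data: two exports with different header conventions.
older = "Grantee Name,Grant Amount,Year\nExample Org,1000,1999\n"
newer = "organization,amount,year\nAnother Org,2500,2016\n"
combined = normalize_rows(older) + normalize_rows(newer)
print(to_json(combined))
```

The alias table is where the manual labor ends up. Every new header variant found in an older publication has to be looked at by a person and mapped by hand, which is why no script makes the tedious part go away.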
I wish all companies, organizations, institutions, and agencies would think about how they publish their data publicly. It’s easy to assume that data stewards have ill intentions when they publish data in such a variety of formats, but more likely than not, it is just the result of changes in stewardship over the years. Different folks will have different visions of what sharing data on the web needs to look like, and will have different tools available to them. Without a clear strategy, you’ll end up with a mosaic of published data over the years, which is why I’m telling her story. I am hoping to influence one or two data stewards, or would-be data stewards, to pause for a moment and think through their strategy for standardizing how they store and publish their data online.