Where to find free and open data sets on the web

From government, health, and finance data to weather, baseball, and Star Trek, countless collections of free data are available to scratch your analytical itch

Bosses love to hear the word “free.” Everyone wants to get something for nothing. The good news is that there’s a burgeoning collection of free data available for the taking. Some of it might even be useful for your project or your career.

What’s the catch? Sometimes there’s no catch at all. Many of the sources below come from government agencies. Once they’re done collecting the information, it often costs them very little to share it openly with everyone. Technically it’s not free because you’re paying for it on April 15th. But the good news is that your project budget won’t feel the pinch.

Other data collections are a subtle form of advertising. All of the major cloud companies host various collections of open data sets. You don’t need to use their cloud servers, but the performance will be that much better when the bits are stored in the same data center. The cloud companies could be purchasing 30-second spots on the Super Bowl, but this form of advertising is a better strategy for everyone.

The one danger with working with cost-free data is that the boss will assume that it’s also trouble-free. Many times the data will require a bit more work on your part. Perhaps the government agency that collected it liked to use its own peculiar format. Perhaps the data needs to be re-aggregated for your needs. There’s a good chance you’re going to need to write a bit of code to get it to work.
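As a rough illustration of that extra work, here is a minimal Python sketch that cleans and re-aggregates a small export. The column names, date format, and numbers are all hypothetical; real agency files will have their own quirks.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: daily readings in an agency-style export with
# MM/DD/YYYY dates and thousands separators inside quoted numbers.
RAW = """date,station,rainfall_mm
01/03/2020,A,"1,200"
01/17/2020,A,300
02/02/2020,A,450
02/20/2020,B,50
"""


def monthly_totals(text):
    """Sum rainfall per (year, month), cleaning the numeric column first."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        month, _day, year = row["date"].split("/")
        # Strip the thousands separator before converting to an integer.
        value = int(row["rainfall_mm"].replace(",", ""))
        totals[(int(year), int(month))] += value
    return dict(totals)


print(monthly_totals(RAW))
```

The same pattern (parse, normalize, regroup) applies whether the source is a CSV export, a spreadsheet, or a JSON dump.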

Some of the data projects function like open source software and work best when everyone contributes their own small part. I have a weather station in my backyard hooked up to the Personal Weather Station network that gathers data from close to a quarter million different citizen scientists. Participation is essential, but you’ll be able to leverage the work of everyone else at the same time. If your work is going to help build these projects, be prepared to pull your weight with project management.

The good news is that the barriers to entry are small. You don’t need to ask permission and you don’t need to beg forgiveness. Here are 13 corners of the web where you can just start downloading and exploring.

Data.gov

The General Services Administration (GSA) maintains Data.gov, a big list of data sets the US government shares openly. As of this writing, there are 210,756 entries, many from the agencies that specialize in support of commerce (maritime, agriculture, energy). There are no secrets from classified agencies, though, and nothing from Area 51.
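The catalog can also be searched from code. The sketch below assumes the catalog still exposes the standard CKAN package_search action at the URL shown, which may change or require an API key; the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Data.gov catalog serves the CKAN "package_search" action here.
CATALOG_API = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a keyword search URL for the catalog."""
    return CATALOG_API + "?" + urlencode({"q": query, "rows": rows})


def search_titles(query, rows=5):
    """Fetch the titles of matching data sets (needs network access)."""
    with urlopen(build_search_url(query, rows), timeout=30) as resp:
        payload = json.load(resp)
    # CKAN wraps the matches in result -> results.
    return [pkg["title"] for pkg in payload["result"]["results"]]


print(build_search_url("maritime"))
```

Calling search_titles("maritime") would return a short list of data set titles if the endpoint behaves as assumed.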

Kaggle 

Some of the data sources are not much more than a file repository. Kaggle is more of a cult. They've started with more than 50,000 different data sets and then added the basic tools (Jupyter notebooks) for making sense of them. There are already 400,000 different public notebooks that other data scientists have shared that analyze the data underneath. On top of that, Kaggle has added some online courses on using everything and mixed in some competitions with real cash prizes.

For instance, Cornell’s Laboratory of Ornithology is offering $25,000 to the best classifiers for birdsong, or what they call “bird vocalizations.” The Open Vaccine initiative will award $25,000 to the best models for predicting RNA degradation that will affect the COVID-19 vaccine. There is plenty of serious work to be found among the CSV or JSON files, but if you grow tired you can also have some fun. One data collection, for instance, is filled with lines scraped from all of the Star Trek episodes from the six major series. 

FiveThirtyEight

The FiveThirtyEight website is devoted to reporting stories with the support of a rich collection of data. When they can, they also share these data sets so you can do your own research. There are past records of their predictions for the major sports leagues, explorations of social attitudes such as a survey asking men what it means to be a man, and, of course, endless polls about upcoming political votes.

UNICEF

The UN agency responsible for helping raise healthy children around the world shares a wide variety of data sets that are useful to anyone with the same goals. The big picture can be found in marquee data sets like The State of the World’s Children 2019 Statistical Tables for those who want to track the change numerically. A more focused visualization can be discovered in tables that explore how iodized salt affects disease or the success of primary education.

Financial data

Ohio State’s library keeps a web page current with pointers to some of the biggest collections of economic and financial data. There are historical records of US data sets and also some data collected by the World Bank. Some require an academic account and some are free to the public.

Baseball

America’s sport is blessed by some fans who are adept enough with computers to develop extensive collections of data about the players and the results of their games. Sean Lahman’s database, for instance, contains complete batting and pitching statistics from 1871 through 2019. There are also tables of other details like fielding statistics, managerial changes, and World Series results that may not be complete, but might as well be for the modern era, which in major league baseball begins with the 20th century.
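For a sense of what working with such tables looks like, here is a small Python sketch that computes career batting averages. It assumes a batting table with playerID, yearID, AB (at-bats), and H (hits) columns; the rows are invented placeholders, not real statistics.

```python
import csv
import io

# Tiny hypothetical extract in the shape of a batting table; numbers are made up.
SAMPLE = """playerID,yearID,AB,H
playera01,2018,400,120
playera01,2019,500,150
playerb01,2019,0,0
"""


def career_batting_average(text):
    """Return hits / at-bats per player across all seasons."""
    at_bats, hits = {}, {}
    for row in csv.DictReader(io.StringIO(text)):
        pid = row["playerID"]
        at_bats[pid] = at_bats.get(pid, 0) + int(row["AB"])
        hits[pid] = hits.get(pid, 0) + int(row["H"])
    # Skip players with no at-bats to avoid dividing by zero.
    return {pid: round(hits[pid] / ab, 3) for pid, ab in at_bats.items() if ab > 0}


print(career_batting_average(SAMPLE))
```

With a downloaded file, the same function could be fed the file contents instead of the inline sample.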

Project Retrosheet was started to assemble play-by-play summaries of all major league games whenever possible, and it is now complete through 1974. If you happen to have access to a scorecard from an earlier game, check the “most wanted” list to see if you can fill in a hole. Chadwick Baseball Bureau maintains a GitHub repo for the data if you prefer.

The Society for American Baseball Research maintains a list of other sources including offerings from commercial entities like FanGraphs, Baseball Reference, and Major League Baseball itself.

Google

If you’re just looking for a particular data set, Google Dataset Search lets you search the entire web for data sets using keywords. The results can be filtered by license, data format, and the time since the last update. Some of the most intriguing data sets are also included in Google’s public data directory, which not only lists the sources but offers some interactive dashboards. The World Bank, for instance, charts fertility versus life expectancy and you can track how this changes over the years with a slider.

Amazon Web Services

AWS users who want data stored in S3 buckets can turn to the Registry of Open Data on AWS, or RODA. There’s wide variety in the thousands of data sets, but the highlights tend to come from sources with which AWS is openly collaborating, like the Space Telescope Science Institute (stars), NOAA (NEXRAD weather radar imagery), and Common Crawl (more than 25 billion web pages). There are several good examples to help you get started analyzing the data using, of course, AWS services like Lambda or Comprehend.

Microsoft

Microsoft also has a number of data sets on Azure. City planners can look for insight in the records from the New York City taxi board, which tracks all fares. Economists and traders can look at price records for commodities for insight on inflation and economic changes. All are ready to be analyzed by Microsoft’s machine learning tools.

Facebook

Some of what we store on Facebook is private because we make it so. Some is shared with friends. Some content is completely open. Facebook supports research on the so-called “Facebook graph” with their Graph API. It’s not the same as downloading the entire data set, but it can be useful for some queries. Just remember that not everyone uses the same privacy settings, so you might not see every person or every post.

Yelp

The website known for reviews of restaurants, bars, and other public accommodations shares a great deal of the information in a public data set that you can study. There are more than eight million reviews of more than 200,000 establishments just waiting for you or your AI to parse them. They are a good source for training data for natural language processing and machine learning.
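A first pass over review data often looks like the Python sketch below. It assumes the reviews arrive as one JSON object per line with business_id, stars, and text fields; the sample lines are invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical reviews in a JSON-lines layout (one object per line).
SAMPLE_LINES = [
    '{"business_id": "b1", "stars": 5, "text": "Great tacos."}',
    '{"business_id": "b1", "stars": 3, "text": "Slow service."}',
    '{"business_id": "b2", "stars": 4, "text": "Solid coffee."}',
]


def average_stars(lines):
    """Average the star rating per business, reading one line at a time."""
    totals = defaultdict(lambda: [0, 0])  # business_id -> [star sum, review count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        review = json.loads(line)
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {biz: total / count for biz, (total, count) in totals.items()}


print(average_stars(SAMPLE_LINES))
```

Because the function only iterates over lines, an open file handle can be passed in directly, which keeps memory use flat even with millions of reviews.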

Open Data Kit

The bits distributed by the Open Data Kit community and its JavaScript-based cousin ODK-X aren’t data per se. They’re software designed to support scientists and researchers who are creating the data sets. The code lets you create a user interface that simplifies data collection by the front-line researchers and then begins the classification and cleaning workflow. The tools are used by a diverse group of organizations supporting field research, including the World Mosquito Program and the Red Cross.

Web scraping

Not all data resides in easily accessible databases with APIs. An enormous volume of information is embedded in web pages and needs to be pried out of them with some clever tools. This so-called web scraping is still a pretty good method, but it can have legal limitations. Some sites ban it in their terms of service, and others watch for too many requests from one user and then either cut off the user or slow down the responses.

Tools like Puppeteer make it simpler to spin up one (or many!) headless versions of a web browser, download a web page, extract the right data, and do it again and again. There are now headless versions for most major browsers, thanks to the software testing community that needs to automate the testing process. Web scraping may not always be appropriate, but when it is it can be the fastest way to get the data you need. Nothing is more open than the open web.
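Puppeteer itself is a JavaScript tool that drives a real browser. As a lighter-weight sketch of the same extract-and-repeat idea, the Python below uses only the standard library to check a robots.txt policy and pull table cells out of static HTML. The robots.txt text, the page, and the example.com URLs are placeholders, and this approach will not run any JavaScript on the page.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class CellExtractor(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder inputs; a real scraper would download both.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = "<table><tr><th>City</th><th>Temp</th></tr><tr><td>Oslo</td><td>4</td></tr></table>"

extractor = CellExtractor()
extractor.feed(PAGE)
print(extractor.rows)
print(allowed(ROBOTS, "my-bot", "https://example.com/private/x"))
```

A real scraper would also pause between requests, since sites that watch for bursts of traffic may throttle or block the client.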

Copyright © 2020 IDG Communications, Inc.