Real scientists make their own data

26 Jan 2013

Among budding social scientists and data scientists, a question you often hear is “where can I get data?” It happens so often that people like Hilary Mason, who I’m sure gets this question all the time, have posted pages with resources. Getting new data can be just what you need to practice a technique you are learning or complete a project that you can publish or add to your portfolio.

Here I argue that if you want to make a bigger impact as a scientist, you should make your own data instead of downloading it. Here are my points:

Historically science has been about observing (and sometimes manipulating) phenomena. Some of the most important contributions across all fields of science have occurred through actual hands-on data collection.
Making your own data means you are creating new facts about the world, which gives you privileged access to scientific findings. Novel data is a sustainable source of competitive advantage, unlike being clever about your analysis.
When you invest in making your own data, you are forced to consider your research question long before diving into analyses. Having data that were costly to gather or create conveys that you are a thoughtful, careful researcher who knows a lot about her domain. And it probably means that you are addressing an important question.
If you are the creator of your data set, then you are likely to have a great understanding of the data-generating process. Blindly downloading someone’s CSV file means you are much more likely to make assumptions that do not hold in the data.
Making your own data means you can run randomized experiments, which enable you to make causal inferences. This is a big deal if you want to make any claims about policy decisions.
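
To make the last point concrete, here is a minimal sketch of why randomization helps, using simulated data rather than any real study. The function names, the sample size, and the true effect of 2.0 are all assumptions chosen for illustration. Because a coin flip (and nothing else) decides who gets treated, a simple difference in means between the two groups estimates the causal effect.

```python
import random
import statistics


def assign_treatment(unit_ids, seed=0):
    """Randomly split units into a treatment half and a control half."""
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def simulate_outcomes(unit_ids, treated, true_effect=2.0, seed=1):
    """Hypothetical outcomes: noisy baseline plus an effect for treated units."""
    rng = random.Random(seed)
    return {
        u: rng.gauss(10.0, 1.0) + (true_effect if u in treated else 0.0)
        for u in unit_ids
    }


def difference_in_means(outcomes, treated, control):
    """Estimate the average treatment effect as mean(treated) - mean(control)."""
    treated_mean = statistics.mean(outcomes[u] for u in treated)
    control_mean = statistics.mean(outcomes[u] for u in control)
    return treated_mean - control_mean


units = range(1000)
treated, control = assign_treatment(units)
outcomes = simulate_outcomes(units, treated)
estimate = difference_in_means(outcomes, treated, control)
print(f"Estimated effect: {estimate:.2f} (simulated true effect: 2.0)")
```

With a downloaded data set you rarely know how units ended up "treated", so the same subtraction could just as easily be measuring selection as measuring an effect.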

Think about a famous scientist who has inspired you. It’s likely he or she invested heavily in data collection. From Aristotle to famous natural philosophers–such as Galileo–practicing what evolved into modern science, there is a rich tradition of making careful observations of interesting phenomena. Without modern methodology and theory, these proto-scientists were making important discoveries through careful record-keeping.

Gregor Mendel spent years breeding bees and pea plants in order to discover the laws of modern genetics. Ronald Fisher, whom we tend to regard as a statistician due to his methodological contributions, was actually coordinating large-scale experiments with crops at the Rothamsted Experimental Station. They had to use elbow grease and wait a long time to make important new discoveries.

Stanley Milgram is perhaps the best example of making your own data. The creativity in his work was apparent in how he approached difficult but important questions. He didn’t throw his hands up and resort to casual theorizing; he invented novel empirical strategies like sending chain letters to find answers.

This is my thesis:

Your best chance to make a serious contribution as a business or academic researcher is to find, make and combine novel data.

Almost everybody in your field will be as well-trained as you are. They will be able to run a Google search for data sets just as you can, and they will be able to apply methods like regression, clustering, visualization, etc. to the data they find. If you want to compete, I suggest you allocate a substantial portion of your effort toward both 1) asking excellent questions and 2) constructing your own data that are suited to answering them directly.

Many social/digital scientists are reluctant to invest in making data because it’s much more costly and risky than analyzing data you already have available. Sure, it’s a gamble, but the payoffs can be substantial, both for science and for your reputation. As a thought experiment, I urge you to consider the last time you were really wowed by a finding that wasn’t produced using new data.

But how can you make your own data? Here are my suggestions:

If you already work as a data scientist for a company with a product or service, add your own instrumentation instead of relying on logs. Get permission to run randomized experiments which can tackle tough questions.
If you don’t work for a company, ask to partner with or intern for one. If you are answering a question they find interesting, and you are willing to help plan the data collection and analysis, you might just get their attention. Help out at DataKind and you’re almost guaranteed to get new data that no one has ever seen before.
Experiment on or survey your friends. With social networks and kind friends (and maybe some promises of free drinks), you can often get a large enough convenience sample to test a hypothesis.
Buy some new friends. Many excellent behavioral studies have been conducted on Mechanical Turk, including some in progress by famous data scientists.
Build your own website or application. This is the costliest route, but allows you the most control. I spent two years building Creds so I could study how beliefs about uncertainties are correlated in social networks. Some researchers at Boston University built a news reader application to study how humans allocate attention to news articles. This is a serious gamble that could pay off in a big way.

I’m sick of seeing the same old data sets recycled with a slightly new analysis. Use old data to practice techniques, but if you want to get serious about being a scientist, make your own data.