Sean J. Taylor

JSM Talk on Big Data

My friend Sherri Rose asked me to be a discussant for a session at the Joint Statistical Meetings called Innovations in Statistics for Big Data from the Next Generation. It was a real honor to hang out and discuss statistics for big data with so many talented people.

You can find my slides here:

The point of my talk was that, in my experience at Facebook, I have noticed four main ways in which big data is often limited:

  • Measuring the wrong thing: you have a lot of data because you are substituting a cheaper, easier-to-measure quantity for what you really want to observe.
  • Bias: you have big data because you’re observing a specific population which is easier to get data for (e.g. the people who use your service).
  • Dependency: you have a lot of data because you make many repeated measurements of the same units.
  • Uncommon support: although you have a lot of data, it can be difficult to make the comparisons you want to make (for instance, for the purposes of causal inference) because the set of observations satisfying some set of complex criteria is small.
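
To make the dependency point concrete, here is a rough sketch using the standard Kish design effect approximation, which says that n units measured m times each, with intraclass correlation rho between repeated measurements, carry about as much information as n*m / (1 + (m - 1)*rho) independent observations. The function name and the numbers (1,000 users, 100 measurements each, rho of 0.5) are made up for illustration and did not come from the talk; the formula also assumes every unit is measured the same number of times.

```python
def effective_sample_size(n_units, m_per_unit, icc):
    """Approximate number of independent observations in clustered data.

    Uses the Kish design effect, 1 + (m - 1) * icc, which assumes every
    unit contributes the same number of measurements (m_per_unit) and
    that icc is the intraclass correlation within a unit.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    total_rows = n_units * m_per_unit
    design_effect = 1.0 + (m_per_unit - 1) * icc
    return total_rows / design_effect


# Hypothetical numbers: 1,000 users observed 100 times each.
rows = 1000 * 100
n_eff = effective_sample_size(n_units=1000, m_per_unit=100, icc=0.5)
print(f"{rows} rows behave like roughly {n_eff:.0f} independent observations")
```

Under these assumed values, 100,000 rows are worth only about 1,980 independent observations, which is much closer to the number of users than to the number of rows.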

To summarize my argument: if you have a lot of data, you should probably ask yourself “why do I have so much data?” It’s often because the data were not constructed or collected to answer your specific question, which implies that you’ll likely have to cope with some of the above limitations.