Sean J. Taylor

Characterizing Online Public Discussions through Patterns of Participant Interactions

New Paper at CSCW 2018 with Justine Zhang, Cristian Danescu-Niculescu-Mizil, and Christy Sauper [link to paper]

Facebook has an incredible number of public (between non-friends) discussions that can grow quite large (# participants and # comments). A hard problem we face is understanding which discussions are going well for users. Imagine doing that for dozens of languages and use cases.

The topic of this paper is a natural extension of my work with George Berry where we study whether seeing higher quality comments causes people to write higher quality comments (it does!). We had a very limited definition of “quality” in that paper that allowed us to focus on the social influence component.

The key idea in this new paper is to encode a discussion with multiple participants as a hypergraph. This is a generalization of earlier reply-tree and sequence models of discussions. We actually ignore text entirely and focus on interactions like reactions/replies.

From this hypergraph, we can extract a number of features and create a high-dimensional vector of discussion characteristics. An important idea that we employ is to count subgraph motifs that capture stylized types of interactions people can have in discussions.
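
To make the motif idea concrete, here is a minimal Python sketch (not the paper's actual code): a discussion is stored as a list of hyperedges connecting an actor to the participants their comment or reaction touches, and a few toy motifs are counted from that structure. The motif definitions and data layout are illustrative assumptions.

```python
from collections import Counter

# A discussion as a list of hyperedges. Each hyperedge records one action
# (a comment or reaction) and the set of participants it connects:
# (actor, frozenset_of_targets). These structures are illustrative only.
discussion = [
    ("alice", frozenset({"post_author"})),   # top-level comment
    ("bob",   frozenset({"alice"})),         # reply to alice
    ("alice", frozenset({"bob"})),           # alice replies back
    ("carol", frozenset({"alice", "bob"})),  # one action touching both
    ("dave",  frozenset({"alice"})),         # another reply to alice
]

def motif_counts(hyperedges):
    """Count a few toy interaction motifs from the hyperedge list."""
    directed = Counter()
    for actor, targets in hyperedges:
        for t in targets:
            directed[(actor, t)] += 1

    counts = Counter()
    # Reciprocity motif: A acts on B and B acts on A.
    for (a, b) in directed:
        if a < b and (b, a) in directed:
            counts["reciprocal_pair"] += 1
    # Star motif: one participant receives actions from 3+ distinct others.
    indegree = Counter(t for (_, t) in directed)
    counts["star_center"] = sum(1 for _, d in indegree.items() if d >= 3)
    # Multi-target motif: a single action touching 2+ participants.
    counts["multi_target_action"] = sum(1 for _, ts in hyperedges if len(ts) >= 2)
    return counts

print(motif_counts(discussion))
# -> Counter({'reciprocal_pair': 1, 'star_center': 1, 'multi_target_action': 1})
```

A richer motif vocabulary and distinct reaction types would slot into the same counting loop; the resulting counts become the entries of the discussion's feature vector.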

This high-dimensional representation of discussion structure can be easily embedded in low dimensions. What’s wild is that we can aggregate to the page level, finding that certain types of pages are clearly associated with certain discussion structures. (No text/topics are used here!)

Our discussion embeddings allow us to do some neat tricks, like characterizing whether a conversation “style” emerges from the topic itself or its initiator (the latter!). The discussion vectors are also excellent input features for classification tasks, like predicting blocks.

The takeaway here is that the patterns of interaction in comment threads provide incredibly useful signal about what kind of discussion is happening, both in a descriptive and predictive sense. Some dimensions also have clear human interpretations, which we discuss in the paper.

Our hypergraph representation of online public discussions is flexible, extensible, complementary to text representations, and (we think) more agnostic to language/culture. We also provide code that allows you to apply this method to Reddit data.

Abstract

Public discussions on social media platforms are an intrinsic part of online information consumption. Characterizing the diverse range of discussions that can arise is crucial for these platforms, as they may seek to organize and curate them. This paper introduces a computational framework to characterize public discussions, relying on a representation that captures a broad set of social patterns which emerge from the interactions between interlocutors, comments and audience reactions. We apply our framework to study public discussions on Facebook at two complementary scales. First, we use it to predict the eventual trajectory of individual discussions, anticipating future antisocial actions (such as participants blocking each other) and forecasting a discussion’s growth. Second, we systematically analyze the variation of discussions across thousands of Facebook sub-communities, revealing subtle differences (and unexpected similarities) in how people interact when discussing online content. We further show that this variation is driven more by participant tendencies than by the content triggering these discussions.

Randomized Experiments on Networks

My friend Dean Eckles and I wrote a book chapter on using randomized experiments to detect and estimate social influence in networks.  It will appear in the forthcoming “Spreading Dynamics in Social Systems.”  It’s really an all-star cast of chapter authors, so it’s quite humbling to be included.

You can download our chapter on arXiv and find links to all of the other chapters on the book website.  The book will be published later this year.

It’s written mostly from a practitioner’s standpoint and is a good starting point to find many more advanced studies that might help you design and analyze field experiments in order to measure social influence.

We conceptualize an experiment as having four parts (a minimal code sketch follows the list):

  1. A target population of units (i.e. individuals, subjects, vertices, nodes) who are connected by some interaction network.
  2. A treatment which can plausibly affect behaviors or interactions. 
  3. A randomization strategy mapping units to probabilities of treatments. 
  4. An outcome behavior or attitude of interest and measurement strategy for capturing it. 
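
Here is that sketch: a hypothetical network of units with independent Bernoulli randomization and a simple neighbor-exposure summary. The network, treatment probability, and labels are placeholders rather than anything from the chapter.

```python
import random

# Part 1: a hypothetical interaction network, unit -> set of neighbors.
network = {
    "u1": {"u2", "u3"},
    "u2": {"u1"},
    "u3": {"u1", "u4"},
    "u4": {"u3"},
}

def assign_treatments(units, p=0.5, seed=42):
    """Part 3: map each unit to a treatment with probability p (part 2 is
    whatever intervention the 'treated' label triggers in the product)."""
    rng = random.Random(seed)
    return {u: ("treated" if rng.random() < p else "control") for u in units}

assignment = assign_treatments(network.keys())

# A common derived quantity: each unit's exposure to treated neighbors,
# which matters when social influence can spill over network edges.
exposure = {
    u: sum(assignment[v] == "treated" for v in nbrs) / len(nbrs)
    for u, nbrs in network.items()
}
print(assignment)
print(exposure)  # part 4 would compare measured outcomes across assignment/exposure
```

More elaborate designs in the chapter (edge-level or cluster randomization, for example) change the randomization step but keep this same four-part skeleton.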

Here’s the full abstract:

Estimation of social influence in networks can be substantially biased in observational studies due to homophily and network correlation in exposure to exogenous events. Randomized experiments, in which the researcher intervenes in the social system and uses randomization to determine how to do so, provide a methodology for credibly estimating causal effects of social behaviors. In addition to addressing questions central to the social sciences, these estimates can form the basis for effective marketing and public policy.

In this review, we discuss the design space of experiments to measure social influence through combinations of interventions and randomizations. We define an experiment as a combination of (1) a target population of individuals connected by an observed interaction network, (2) a set of treatments whereby the researcher will intervene in the social system, (3) a randomization strategy which maps individuals or edges to treatments, and (4) a measurement of an outcome of interest after treatment has been assigned. We review experiments that demonstrate potential experimental designs and we evaluate their advantages and tradeoffs for answering different types of causal questions about social influence. We show how randomization also provides a basis for statistical inference when analyzing these experiments.

Prophet: a practical forecasting tool

My friend Ben Letham and I finally open sourced an in-house forecasting tool we developed at Facebook called Prophet.  It was incredibly exciting to release work done inside Facebook and make it available to the public.  We’ve already gotten a lot of great press and some excellent contributions.

We’d love it if you tried Prophet and found it useful.  Prophet has a very simple interface and is implemented in both R and Python so most data scientists should find it immediately useful.  Please tell us if you encounter issues or have success using it!
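
For a sense of that interface, here is a minimal Python example. Depending on the version you install, the package imports as fbprophet (older releases) or prophet (newer releases); the input is a data frame with a ds date column and a y value column, and the toy series below is made up.

```python
import numpy as np
import pandas as pd
from prophet import Prophet  # on older installs: from fbprophet import Prophet

# Toy daily series with a slow trend and a weekday bump; replace with real data.
dates = pd.date_range("2015-01-01", periods=365, freq="D")
df = pd.DataFrame({
    "ds": dates,
    "y": 10 + 0.01 * np.arange(365) + 2 * (dates.dayofweek < 5),
})

m = Prophet()                                 # defaults: trend + seasonality
m.fit(df)                                     # fits the model (Stan underneath)
future = m.make_future_dataframe(periods=90)  # extend 90 days past the data
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```

The R interface follows the same fit/predict pattern, which is a big part of why the tool has been easy for teams to adopt.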

Discussion quality diffuses in the digital public square

I’m excited to announce that my paper with George Berry, “Discussion quality diffuses in the digital public square,” was accepted for publication at WWW 2017.  You can download the pre-print here.

We ran a field experiment to determine the effect of comment ranking on the comments that people see on Facebook, and more interestingly, the effect on the comments that they write.  We found that the ranking system caused even very high-quality comment authors to write higher quality comments than they would have under a reverse chronological ordering.

Here is the abstract:

Studies of online social influence have demonstrated that friends have important effects on many types of behavior in a wide variety of settings. However, we know much less about how influence works among relative strangers in digital public squares, despite important conversations happening in such spaces. We present the results of a study on large public Facebook pages where we randomly used two different methods—most recent and social feedback—to order comments on posts. We find that the social feedback condition results in higher quality viewed comments and response comments. After measuring the average quality of comments written by users before the study, we find that social feedback has a positive effect on response quality for both low and high quality commenters. We draw on a theoretical framework of social norms to explain this empirical result. In order to examine the influence mechanism further, we measure the similarity between comments viewed and written during the study, finding that similarity increases for the highest quality contributors under the social feedback condition. This suggests that, in addition to norms, some individuals may respond with increased relevance to high-quality comments.

Identification of Peer Effects in Networked Panel Data

Excited to share that my paper with Daniel Rock and Sinan Aral, “Identification of Peer Effects in Networked Panel Data” was published in the Proceedings of the 2016 International Conference on Information Systems.

This work has a long history going back to early in graduate school for me.  Researchers are always looking for tricks to detect peer effects in observational data, and estimating time-dependent models with longitudinal data has always seemed like a promising approach.  We prove some results about a hybrid spatial-autoregressive model, extending the classic Bramoullé et al. (2009) results to the dynamic setting and estimating the model on some large-scale data.  Our results are somewhat pessimistic, in that it takes fairly strong (basically unrealistic) assumptions to identify a causal effect using panel data models. 

Here’s the abstract:

After product adoption, consumers make decisions about continued use. These choices can be influenced by peer decisions in networks, but identifying causal peer influence effects is challenging. Correlations in peer behavior may be driven by correlated effects, exogenous consumer and peer characteristics, or endogenous peer effects of behavior (Manski 1993). Extending the work of Bramoullé et al. (2009), we apply proofs of peer effect identification in networks under a set of exogeneity assumptions for the panel data case. With engagement data for Yahoo Go, a mobile application, we use the network topology of application users in an instrumental variables setup to estimate usage peer effects, comparing a variety of regression models. We find this type of analysis may be useful for ruling out endogenous peer effects as a driver of behavior. Omitted variables and violation of exogeneity assumptions can bias regression coefficients toward finding statistically significant peer effects.

AIS members can read the draft now, but we’ll be sharing a full draft with expanded findings soon.

Cartoon Character Space

I had a ton of fun crowdsourcing personality measurements of cartoon characters with Shiu Pei Luu. We were able to map the Big Five scale back to specific characters:

Big Five character scale

We also made a t-SNE embedding of the characters in 2D space:

t-SNE

Lots more in our writeup!

Facebook post on NFL fan friendships

A couple weeks ago I got to release my third analysis of NFL fans on Facebook:

2013: NFL Fans on Facebook

2014: The Emotional Highs and Lows of the NFL Season

2016: NFL Fan Friendships on Facebook

This year I focused on the friendships between US NFL fans.  I found some fun patterns in homophily and also re-made the original NFL map using a new visualization.

image

A Measurement Error Model of Dichotomous Democracy Status

I’m not actually a political scientist, but I like to pretend I am sometimes.  Jay Ulfelder and I just submitted a draft of our paper to SSRN that you can check out.  Here’s the abstract:

We use a Bayesian measurement error model to derive a probabilistic measure of democracy from several existing dichotomous data sets. This approach accepts the premise that democracy may usefully be construed as a bivalent concept for certain theoretical and technical purposes, but it also makes more explicit our collective uncertainty about where some cases fall in that binary scheme. We believe the resulting data provide a firmer foundation than measures of countries’ degree of democracy for studies that require a dichotomous measure of regime type.

This allows us to do things like measure the probability that “an expert would say a country in a region is democratic,” borrowing information across time to produce plots like this one:

image


How we did it

This research is totally open and reproducible.  We used R and Python as well as some Stan to implement our HMC estimation.  The code and the source data are available on GitHub.

How this came about

I met Jay Ulfelder on Twitter a couple years ago and we periodically discuss research over email.  He emailed me last year to ask:

Is there a statistical model or process you know that solves the following problem?

Say you have 2 to 5 more or less independent, binary measures of the same thing in a time-series cross-sectional data set. Let’s call those measures “votes” and think of them as each source’s best guess at the yes/no status of each record on some feature of interest. Is there a way—probably some kind of Bayesian model that didn’t even exist when I was in grad school—to use those votes to generate a unified probabilistic measure or z-score or something similar of each case’s status?

I had an idea about how to do this right away: item-response models provide a nice framework for thinking about this and all we had to do was model the underlying time series process. Bayesian inference was a perfect fit for this problem.
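
To show the flavor of that model without reproducing our Stan code, here is a small Python simulation of the generative story we assumed: a sticky latent democracy status per country-year, and each dichotomous data set voting on it with its own sensitivity and specificity. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J = 50, 4                     # years, number of dichotomous data sets

# Latent democracy status for one country: a sticky binary time series.
latent = np.zeros(T, dtype=int)
for t in range(1, T):
    stay = 0.95                  # persistence: regime type rarely flips
    latent[t] = latent[t - 1] if rng.random() < stay else 1 - latent[t - 1]

# Each source j reports a noisy vote with its own error rates (assumed values).
sensitivity = rng.uniform(0.80, 0.95, size=J)   # P(vote = 1 | democracy)
specificity = rng.uniform(0.85, 0.98, size=J)   # P(vote = 0 | non-democracy)
votes = np.empty((T, J), dtype=int)
for j in range(J):
    p1 = np.where(latent == 1, sensitivity[j], 1 - specificity[j])
    votes[:, j] = rng.binomial(1, p1)

# A crude, non-Bayesian summary: the share of sources voting "democracy"
# each year. The actual model replaces this with a posterior P(democracy).
naive_estimate = votes.mean(axis=1)
print(np.c_[latent, naive_estimate][:10])
```

Inverting this generative story with HMC is what gives the posterior probability of democracy for each country-year, along with honest uncertainty where the sources disagree.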

Learning About At-Risk Veterans Using Online Network Surveys

On December 12th last year, Carlos Diuk, Akos Lada, Alex Peysakhovich, and I hacked on Open Data and Innovations for Suicide Prevention, specifically working on addressing the issue for United States veterans (of whom 22 commit suicide every day). The hackathon was sponsored by the White House and the US Dept. of Veterans Affairs.

What We Did

We decided to treat at-risk veterans (those with depression, PTSD, or suicidal thoughts) as a “hidden” (i.e. hard to access and survey) population and designed a network survey in which we ask respondents about their friends who are veterans rather than about themselves, yielding information on network neighbors of the seed nodes we survey.  We’d like to thank Dennis Feehan, who’s an expert on this methodology, for his help in designing this strategy.

We first designed a survey, and then created a Facebook Page to publish posts linking to the survey. We ran Facebook ads targeting those who were employees of the US Navy, Army, and Marine Corps, as well as those living near military bases and interested in those topics.

Within four hours, we had 400 brand new survey responses to analyze. We found patterns that seem to replicate existing understanding of incidences of risk factors in veterans. We also found some evidence for self-reporting bias in PTSD and suicidal thoughts, indicating that existing survey data may understate the problem. You can learn more from viewing our presentation slides:

What’s Cool About This

  1. Using Facebook ads to recruit survey respondents yielded an excellent cost per respondent (about 40 cents).
  2. Survey instruments that ask about the respondent’s friends instead of the respondent help you learn about people who are hard to reach, such as people at risk of mental health problems.
  3. Asking about friends has the added benefit of potentially yielding a less-biased signal when there may be response bias.

Next Steps

We’re actively collaborating with the San Francisco VA to get this method into production:

  • Iterate on survey design and recruitment procedures (optimizing ads targeting and copy).
  • Recruit for this survey regularly to detect longitudinal trends in the incidence of mental health risk factors for veterans.
  • Work on survey-weighting to adjust for biased survey population. Produce estimates of population levels instead of rates.
  • Potentially direct survey respondents to resources for themselves and the friends they identify as at-risk.

Upcoming Talks and Lectures (Fall 2015)

I’ll be speaking at the following events this fall:

How Things Catch On

My friend and colleague Lada Adamic invited me to join a panel called “How Things Catch On” for the Sustainability@Scale 2015 workshop back on June 30th.  If you skip to about 18:00 in this video, you can listen to me talk about why most things don’t become popular and why it’s hard to predict in advance which ones will.

Upcoming Talks

Lots of talks coming up for me.  Here’s where to catch me and hear about the research I’ve been working on!

Putting the Magic in Data Science

Here are slides from my talk at QCon last week:

Here’s my talk (which was not recorded) in a nutshell:

  • If we want to be valuable as data scientists, we should aspire to create as many “how did they do that?” moments as we can.  I call these “hoverboards.”  If we just count things, we are being terrible magicians. We probably don’t want to end up being the accountants of the 21st century.
  • Furthermore, these moments of magic should have impact: they should change the strategic or tactical decisions that people make.
  • I argue that magic in data science often comes from combining various “tricks” in novel ways. I describe four common tricks we use at Facebook, as well as a grab bag of others that I’ve found useful.
  • Tricks alone are not enough.  People have to use the technology you create.  That requires considering what I call Data Science’s last mile problem. How do we make the data inform/change people’s behaviors?
  • I describe four important (non-exhaustive) last mile considerations: reliability, latency/interactivity, simplicity, and interestingness. Many of these are achieved not through data science alone, but by combining data science with tricks from software engineering, design, and computer science.

What is Data Science?

There’s one little side-argument I made that I’d like to cover here. One thing we have trouble agreeing on is what data science actually is. My latest theory is we all work in different parts of a technology pipeline (see below), which starts with academics publishing papers at conferences like KDD and NIPS and ends with non-experts getting utility from products and services we create. We’re all probably doing “data science” but it might look a lot different (pure math, problem discovery, design/visualization, engineering) depending on what stage you’re involved in.

image

World Cup Recap

I know it’s a couple months late, but I wanted to recap some of the fun stuff I worked on at Facebook for the World Cup.  All of this work was a collaboration with my friends Dustin Cable and Alan Clark.


First, after the first week of games, we compiled all of the first checkins in Brazil (actually within a certain radius of any of the 12 host cities) in the month of June.  We used the first checkin as a crude measure of a person’s “arrival” in Brazil to attend the tournament.  Dustin then worked on animating the paths that people took across the globe from their home towns to the cities in Brazil where they first checked in.

image

Next, we took all these people and induced a social network on them.  We took the population of all people who checked in on Facebook in Brazil sometime during June and found any friendship that was created between two of these people after at least one of them arrived.  We grouped this by pairs of cities to show the new international friendships that formed during the World Cup.

There are many more details in this Facebook Data Science blog post, and Time magazine also did a writeup. I’m not a huge soccer fan, but the World Cup is such a huge, international event that it was fun to study through the Facebook lens.

JSM Talk on Big Data

My friend Sherri Rose asked me to be a discussant for a session at the Joint Statistical Meetings called Innovations in Statistics for Big Data from the Next Generation. It was a real honor to hang out with and discuss statistics for big data with so many talented people.

You can find my slides here:

The point of my talk was that in my experience at Facebook, I have noticed four main ways in which big data is often limited:

  • Measuring the wrong thing: you have a lot of data because you are substituting a cheaper, easier to measure quantity for what you really want to observe.
  • Bias: you have big data because you’re observing a specific population which is easier to get data for (e.g. the people who use your service).
  • Dependency: you have a lot of data because you make many repeated measurements of the same units.
  • Uncommon support: although you have a lot of data, it can be difficult to make the comparisons you want to make (for instance, for the purposes of causal inference) because the set of observations satisfying some set of complex criteria is small.

To summarize my argument:  If you have a lot of data, you should probably ask yourself “why do I have so much data?”  It’s often because the data were not constructed or collected to answer your specific question and this implies that you’ll likely have to cope with some of the above limitations.

Acknowledgments

The following is an excerpt from my dissertation. It felt sad burying all my gratitude in a gigantic research document that not many people will read. To all the amazing people in my life that aren’t mentioned here: this was written somewhat (very) hastily and I’m (probably) grateful to you as well :)

At the end, graduate school felt like Zeno’s dichotomy paradox. In order to get halfway to the finish, I had to get halfway to that halfway point. In order to get there, I must have gone halfway once again. So there were thousands of tiny steps that were perhaps forgettable at the time, but comprised something much greater in aggregate. I took most of those infinitesimal steps with someone’s help, which makes it difficult to properly acknowledge all the people who made it possible for me to achieve the most ambitious and important thing I have ever undertaken. I will do my best here.

My family has been a constant source of motivation, and most importantly, got me to the starting line of this endeavor with all the tools I needed to succeed. I have the most incredible parents I could imagine and I owe my intellectual curiosity to their emphasis on education and my ambitiousness to their unwavering support. To my brother and sisters: I’m proud to be your older brother and I’ll always be more impressed with the things you accomplish than anything I do myself.

There’s a social support system built into graduate school – dozens of people who I experienced tribulations and triumphs with over the past few years. They’ve become friends for life. There are too many to mention here, but my biggest thanks go to my officemates Lauren Rhue, Beibei Li, and Jing Wang, as well as my study-break partners, Erin Smith and Jason Hong. I can’t imagine doing something like this without people as amazing as you all to share it with. To my friends who weren’t in grad school with me but were there by my side during it: thank you for dragging me out to drinks, on trips, and to all those weddings. Thanks also for picking up the check now and again when I had experienced a larger-than-expected expense shock.

The Information Systems group at NYU, characteristic of the IS community in general, was an amazing resource and a perfect fit for my interests (although I perhaps did not know that in advance). I thank them for knowing that I would fit well in their program and for providing a stimulating academic environment in which I could thrive. In particular I would like to thank Arun Sundararajan, Natalia Levina, Vasant Dhar, Panos Ipeirotis, and Sonny Tambe for their open office doors and mentorship.

Many impressive people at Facebook – too many to list here – contributed to the quality of the work in this dissertation and made my life a lot more fun for a year during graduate school. Two in particular, Eytan Bakshy and Dean Eckles, are like the academic brothers I was separated from at (academic) birth. They’ve been as supportive (and sometimes as hard on me) as brothers as well. I hope to continue collaborating with them long into my “stodgy old scientist” days.

My committee members, Anindya Ghose, Foster Provost, and Duncan Watts, went above and beyond what I could have expected. I thank them all for their support, kind criticism, and lenience with due dates. I hope that someday I am able to collaborate more directly with each one of them.

I remember meeting my advisor, Sinan Aral, for the first time and realizing he had actually read the research statement in my graduate school application. It was terrifying to be pushed on my ideas and realize the high standard of thinking for which I had signed up. But Sinan was then – and has always been – supportive of what I think and constructive in his feedback. Hundreds more discussions like that in his office comprise much of my learning in graduate school. The only pressure I ever felt from him was through the notion that good ideas and important questions deserve hard work. His attitude brought deep meaning to the research I was doing and made it easier to push through the setbacks. I can’t thank him enough for all this and too many other ways he’s gone beyond what I could have ever reasonably asked for from an advisor.

Learn Exploratory Data Analysis

My friends Moira, Dean, and Solomon - all members of Facebook’s Data Science team - worked with Udacity to create a fantastic course on exploratory data analysis. If you’re new to R/ggplot, or just want to hear about how experts think about visualizing and exploring data, this would be a great place to start. I really can’t recommend these instructors highly enough.

They asked several members of our team to talk about an EDA project they worked on and these interviews are included in the course. You can check out me talking about visualizing the sentiment from posts about NFL teams here.  I talk about using splines versus more flexible models for time series data and the bias-variance tradeoff.

Tutorial: Online experiments for computational social science

Eytan Bakshy and I are giving a tutorial this year at ICWSM (6/1 in Ann Arbor).  Sign up and learn some awesome stuff!

Registration for ICWSM isn’t open yet, but you can sign up for a reminder when it goes live. We’ll only email you one time.

Taught by two researchers on the Facebook Data Science team, this tutorial teaches attendees how to design, plan, implement, and analyze online experiments. First, we review basic concepts in causal inference and motivate the need for experiments. Then we will discuss basic statistical tools to help plan experiments: exploratory analysis, power calculations, and the use of simulation in R.  We then discuss statistical methods to estimate causal quantities of interest and construct appropriate confidence intervals. Particular attention will be given to scalable methods suitable for “big data”, including working with weighted data and clustered bootstrapping. We then discuss how to design and implement online experiments using PlanOut, an open-source toolkit for advanced online experimentation used at Facebook.  We will show how basic “A/B tests”, within-subjects designs, as well as more sophisticated experiments can be implemented.  We demonstrate how experimental designs from social computing literature can be implemented, and also review in detail two very large field experiments conducted at Facebook using PlanOut.  Finally, we will discuss issues with logging and common errors in the deployment and analysis of experiments. Attendees will be given code examples and participate in the planning, implementation, and analysis of a Web application using Python, PlanOut, and R.
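
As a taste of the PlanOut portion, here is a minimal example using PlanOut's Python API; the experiment and its parameters are invented for illustration and are not from the tutorial materials.

```python
from planout.experiment import SimpleExperiment
from planout.ops.random import UniformChoice, WeightedChoice

class CommentRankingExperiment(SimpleExperiment):
    """Toy experiment: randomize comment ordering and a banner message."""
    def assign(self, params, userid):
        # Deterministic hash-based randomization keyed on userid, so the
        # same user always gets the same assignment.
        params.ranking = UniformChoice(
            choices=["chronological", "social_feedback"], unit=userid)
        params.banner = WeightedChoice(
            choices=["none", "be_kind"], weights=[0.8, 0.2], unit=userid)

exp = CommentRankingExperiment(userid=42)
print(exp.get("ranking"), exp.get("banner"))
# SimpleExperiment logs exposures automatically (to a local log file by
# default), which is what makes downstream analysis straightforward.
```

Within-subjects and more sophisticated designs are expressed the same way, by changing what the assign method does with its input units.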

On Computational Social Science

Computational Social Science is a new phrase that people throw around a lot–I even call myself that explicitly here on my website. I’d like to get my perspective on what it means to me written down.

One thing to keep in mind is that it’s clearly not the best possible name for the kind of science that I (and many others) do.  But those of us who self-identify as CSSs are mostly trained in social/computer sciences and care about academic audiences, so we needed a term other than data scientist. This one is catching on and I’m going with it.

In a way, the most unfortunate part of the term is that it puts the method (computational) before the social science (credit to Duncan Watts for this thought).  In this way, it emphasizes how you are doing science rather than what questions you are attempting to answer. But actually, the distinction about methods is often the salient one. Being a creator (or early adopter) of methods allows you to do science in a way that wasn’t possible before.  The innovation in CSS comes from what new knowledge is made possible through technology – we are often asking the same basic questions that social scientists have asked for a long time, but now we are just better equipped to answer them.

To illustrate this, I came up with this crude model of how new technology affects the production of scientific knowledge.  On the y-axis is measurement technology, enabled by new instrumentation or statistical techniques that let us measure more constructs, more precisely, about more people.  On the x-axis is our level of experimental control.  Technology allows us to exert greater control over social systems (especially those which are digital), changing the structures of interactions in ways that allow us to gain knowledge of the underlying causal structure.

image

At the origin, we have casual empiricism: social science with no ability to measure anything well and no ability to effect a change in the environment.  Basically, what you and I do every day :)

As we move up and to the right on this plot, we gain the ability to more satisfyingly answer important questions. This is the essence of CSS.  As we innovate in measurement/control, we end up being able to produce higher quality, more complete, and more accurate social scientific knowledge, even if the underlying questions haven’t changed.

Where does theory fit in here?  I won’t go so far as to say that we don’t need theory anymore. There is nothing in this shifting production possibility frontier that conflicts with the Popperian conception of scientific progress.  However, I would posit that with better measurement/control we should shift our effort away from spending more time thinking deductively and theorizing toward thinking empirically and directly testing the specific, often highly contextual, questions we have as social scientists.

Update: Some insightful responses via Twitter.

Kevin Collins pointed out that my “measurement” dimension conflates a lot of different things. He offered these expanded dimensions:

@seanjtaylor I’d say there are four relevant dims: external validity, internal validity, “Richness of data,” “Scale of data,”

— Kevin Collins (@kwcollins) February 19, 2014

Michael Tofias took issue with my discussion of the diminished role of theory with better technology. I’m open minded that we could probably use a more theoretical grounding here, but I’ve yet to see it used as a compelling complement to innovation in measurement/experiments.

@seanjtaylor @drewconway couldn’t agree less. Explicit theorizing needed now more than ever.

— Michael Tofias (@tofias) February 19, 2014

There’s obviously a nice middle ground here where we don’t let the increase in data lead to “theory gone wild.”

@tofias @SolomonMg @seanjtaylor @drewconway exciting time to test existing theory while being thoughtful in the creation of new theory

— Tristan Botelho (@TBotelho1) February 19, 2014

Kim Weeden asked good questions about what differentiates CSS from SS.

@seanjtaylor 1) How are correlations etc (UL quadrant) “new”? 2) What differentiates CSS from SS, some of which in UR, LR quads? Just size?

— Kim Weeden (@WeedenKim) February 19, 2014

My reply to the last two questions:

  1. if we observe a new correlation between two previously unmeasured variables, then it’s new knowledge about the world.
  2. For me it’s pushing upward-rightward by early adoption of new technologies, using those innovations to get better answers.

Tracking the Sentiment toward NFL Teams

Last week I published a post on the Facebook Data Science blog about the sentiment of NFL fans as revealed by their status updates.  It was a really fun project and I made a bunch of neat plots, for instance the mean sentiment paths over the course of games:

image

I spent a bunch of time making graphics which I feel tell the unique story of each team over their 2013 seasons.  Here is a version for just the Super Bowl teams, but I produced plots for every division in the NFL.

image

New Working Paper: Identity and Opinion

I recently submitted a new paper with Lev Muchnik and Sinan Aral entitled “Identity and Opinion: A Randomized Experiment.”  Here is the abstract:

Identity and content are inextricably linked in social media.  Content items are almost always displayed alongside the identity of the user who shares them. This social context enables social advertising but thwarts marketers’ efforts to separate causal engagement effects of content from the identity of the user who shares it. To identify the role of identity in driving engagement, we conducted a large-scale randomized experiment on a social news website. For any comment on the site, 5% of random viewers could not see the commenter’s identity, allowing us to measure how users interact with anonymous content. We conducted the experiment over two years, facilitating within-commenter measurements that characterize heterogeneity in identity effects. Our results establish three conclusions. First, identity cues affect rating behavior and discourse, and these effects depend on the specific identity of the content producer. Identity cues improve some users’ content ratings and responses, while reducing ratings and replies for others. Second, both selective turnout and opinion change drive the results, meaning identity cues actually change people’s opinions. Third, we find an association between users’ past scores and identity effects, implying that users bias ratings toward past identity-rating associations. This work improves our understanding of the persuasive impact of identity and helps marketers create more effective social advertising.

If you would like to read it and give feedback, please email me and let me know.

Frequently Asked Questions

I occasionally get emails from people asking me for advice or about my experiences before, during, and after graduate school. Here are some questions I’ve answered in the past. I’ll add more here as I receive and answer them.

Before Grad School

What was your experience like at the Fed Reserve Board and Matrix Group International, and how did they impact your decision to pursue a PhD in Information Systems at NYU?

I wanted to get a PhD in economics. At the Fed I did macroeconomic research and realized I liked research but didn’t like macro that much. At Matrix, I learned how to program professionally, which turned out to be a very valuable skill to have. I came to NYU because I wanted to study the social science of software development based on my experience at Matrix. I didn’t end up studying that, but IS is a very broad field where I eventually found a niche.

How did you choose to go to grad school? Or for that matter to go into the field of data science?

I wanted to be an economist in college, so my plan while working was always to go back to grad school for economics. It turns out I found a field that was a better way to merge my interests in economics, computer science and machine learning, so I ended up in Information Systems (which is a field offered by some business schools). For me the decision was mostly about wanting to know some subject incredibly well. It’s possible to do that on your own, but the structure/community that grad school provides makes you way likelier to succeed at becoming an expert. I didn’t choose “data science” but what I ended up learning was a really close fit to that nascent field.

If I had to start over, I’d get a PhD in Statistics instead of Information Systems. Though I love the social science research that I get to do, I think statisticians have the most impressive toolkits for approaching problems, and are definitely the most employable :)

About NYU Stern and Applying

What attracted you to Stern?

The professors in my department are leaders in our field. After meeting with them and hearing about their research, I knew that coming to Stern would give me the opportunity to work on something groundbreaking.

And now that you’re at Stern, what are some of the things you like best about being here?

Stern has a vibrant research community. Just about every week, there’s a student or professor giving a fascinating seminar. The faculty are eager to meet with students to discuss ideas and help out with research. Because we’re in New York City, we are able to attract some amazing visiting speakers from all over the world. Within the doctoral student community, I’ve made some great friends who are incredibly helpful and supportive. It’s nice to share your challenges with people you respect and admire.

How have you grown as a student/researcher since being at Stern? What allowed that to happen?

The doctoral program provides a steady stream of challenges, from introductory classes to qualifiers and dissertation work. It shapes the way you think, changing what you think is interesting and how you approach problems. There’s no single event I could point to, it’s a cumulative change. You learn that almost everything is more complex than you believe at first. When you accept that, you can research with an open mind and really make a contribution.

How would you describe the people or community (whether it be NYU, Stern, the doctoral program, or your dept)?

The professors in my department are friendly and supportive. They understand what you’re going through and are always forthcoming with advice. I’m obviously closest with my cohort in my department–I’ve shared an office and taken most of my classes with them for three years now. We’ve built a strong bond by going through the same challenges again and again. It’s a real privilege to learn and work with such talented people.

Does the stipend include a certain amount for rent, or does Stern have specific housing for PhDs? How has living in NY been like on the stipend, have you needed to supplement it?

No, the stipend is just a payment you receive twice a month. You can spend it on housing any way you want. I think for a limited number of students there is NYU housing available, but it is not super common for US students to receive it.

It is a bit difficult to live on the money, but you actually work very hard so there isn’t much time to spend. After you pass your qualifier you can take a small loan each year, so between that and the stipend, I have been fine. It is not possible to work during the program except during the summers as an intern.

Is there anything else you think it would help me to know about before applying to grad school?

Enjoy not being a graduate student before you start. It is a major life change. It’s not bad, but you’ll be busier than ever and your hobbies/interests/relationships will all have to evolve as a result.

Should I mention names of specific professors at NYU Stern in my SOP? If yes, whom should I mention?

I wouldn’t mention any specific professors unless you can write cogently about some of their work which interests you. I would encourage you to read their websites and see if any of their research is something you’d consider building on.

How should I substantiate why I want to study at NYU Stern?

Make sure you know what our department does best (look at faculty websites) and make the case that you have similar interests to our faculty.

Are there any other admission tips that you think could help a prospective international applicant?

Make sure your personal statement is thoughtful and shows some broad research interests. The faculty look for technical skills and also passion for a specific topic or area. You don’t have to know exactly what you’d like to study, but I would write about what questions you’d like to answer as a researcher.

What specifically does the admission committee look for in successful applicants for the Information Systems stream?

I haven’t been in the meetings where they admit students. However, the three keys seem to be 1) technical proficiency, 2) intellectual curiosity and 3) a good fit between your interests and the department’s.

Graduate School

What are the first two years of grad school like?

For the first two years, you generally spend the bulk of your time on coursework, much like a masters. Some courses are very hard and require a lot of time. Others are lighter and just meant to provide some familiarity with some topics, revolving around class discussion (surveys or seminars are like this). I’d say you can expect something like 10 hours in class and 40+ hours of work outside of class, mostly reading and then writing papers. This can be frustrating, but it’s pretty vital to be “caught up to the current conversation”–having the background knowledge required to have a cogent discussion of the interesting questions and issues in your field. A HUGE part of academia is just knowing how things are defined, what people already know, and what things they think are interesting. If you’re like most people, you’ll come out of this period realizing that what you previously thought was interesting has already been studied pretty thoroughly, so you’ll have settled on some new questions you’d like to address in your research. It’s possible and normal to end up somewhere pretty far from what you expected.

What kind of demands are there on your time?

In my program I had to work on research alongside my course work right away. I would characterize this as exactly the juggling act you described. Faculty members put pressure on you to finish projects with them. You have reading and assignments for course work (which can continue into your 3-5th years as you address subjects you feel weak in). You tend to have some papers or projects for courses or program requirements that also are very time consuming. You’re expected to attend 1-2 seminars a week with visiting speakers or members of your department. In addition, you’ll always have some papers and books you feel like reading because they fit your interests. Some people can really find some synergy between all these requirements, for instance writing their course papers about things they eventually want to research, so as not to duplicate reading. I’ve found this to be really difficult. The point is that there are many demands on your time and it is usually overwhelming. It certainly detracts from your ability to focus on ideas you’d like to work on.

That said, you would be hard pressed to find a career path that encourages independent and creative thinking as much as a PhD. The reward for being stretched in so many directions is that the important deliverables are always your own ideas, though they must be written and researched very carefully. I’d say most people have a good idea of the benefits of a career in academia, but they don’t realize a) how much of your time it takes just to be a functioning member of a doctoral program (the stuff I wrote about above) and b) how time-consuming and difficult producing quality research is. Really a doctoral program takes up way more of your time than you’ll ever imagine beforehand. Unlike in a job, you’re rarely completely done with anything and it’s always possible and desirable to put more time in. One of the hardest things is knowing when to stop and go back to normal life, or even feeling like you’re allowed to do so. Everyone I talk to about their programs admits to me that they have had to stop reading for fun, watching much tv, hanging out with friends, and pursuing the hobbies they had before.

How do you know if a PhD is a good fit?

I genuinely enjoy the challenge of the process and others do, too. Some people in my program seem to have made a mistake. I think the key difference is that for them, the work seems like an obligation instead of something they volunteered for. It’s easy to get caught up in meeting requirements and what it takes to finish, but really you are signing up for a lifetime of this stuff. If you don’t like the day-to-day stuff, which requires juggling a few tasks simultaneously, it will be tough to be happy in a program like mine. In fact, I think they go to great lengths to make you do a lot of things at once to get used to what it’s like being a professor.

Research

How long before you should start to publish articles?

Publishing is different by field, so I’m not sure how much advice I can offer there. I’m entering my 5th year and I only have conference (no journal) publications, but that’s not very unusual. My other general advice is all about thinking long term. There’s never as much pressure as you think there is, and it’s something to be enjoyed (or else why would you spend 5+ years doing it?). The real key is finding a balance where you can be happy and productive, while fulfilling what they expect of you to graduate. I think many people miss this point and see the goal as “getting the most papers” or “getting the best job,” when really what you are optimizing is your own life. Early on, the key is generating an idea of what you love to work on and find interesting enough to devote time to later. The happiest/best researchers are those who have some problem they love and are dying to tackle. They find this problem early in grad school (or even enter with the idea). So basically find out what’s interesting to you early on (and prize this above getting good grades, which barely matter). The least happy people are the ones who are trying to think of a research topic (or just pick one arbitrarily) so they can start/finish their dissertations. There’s also a host of advice to be given about choosing an advisor, but the gist is: pick someone you personally like, who does what you want to do, and who isn’t overloaded.

What skills, both technical and otherwise, do you find most valuable for the research that you do?

Asking good questions is the most important thing. This is a skill you develop by talking to people in a field of research for a long time and realizing what are the big open things that people don’t understand very well. Think about the problem a journalist faces. You want people to read and talk about what you write. If you can’t make people interested, all the analysis and data collection will go to waste anyway.

Besides that, programming is incredibly important because you always have to manipulate data in some way. Statistics is also a huge part of what I do. You can never know too much of either of these things. I’ve recently gotten into data visualization, which is an underrated skill.

It seems to me that the largest obstacles to new data creation/collection are large fixed costs for an investment (research) with an unknown return. As a result, you typically need to have external funding. How do you think an individual can best overcome these obstacles for conducting an individual research project without external funding?

It’s a bit of a cop-out, but I would say creativity is the important part here. The best research I read, I often think “wow that was brilliant to get data in that way.” It’s not so much the analysis, it’s the idea to gather data from somewhere. As an example, when I was at the Fed in 2004, I had the idea to scrape Craigslist to get apartment prices in various cities. I never ended up working with that data, but I know that could have turned into a big research project if I had followed through.

One specific tip I will give is that often the best way to create novel data is to combine two sources people hadn’t thought of joining before.

As someone very interested in applied behavioral research (for a number of different topics ranging from the NFL to public policy), are you aware of any opportunities to become more involved in this type of work?

There are plenty of jobs out there if you are looking for formal employment. Your best shot is to work for a small company or a startup that will give you the leeway to learn on the job. Bigger places are looking for people with prior experience, so you’ll face a cold-start problem there. Plus they may not give you much room to shape your own job.

If you want to pursue it as a hobby, there are options like volunteering for a project at DataKind, or finding a hackathon which is data-oriented (I would look on meetup.com). You really need to be plugged into a community of researchers to do this kind of thing, because it’s hard to do any projects in isolation. In NYC, this revolves around several meetups which many of the data scientists in the area attend. There’s also academic communities out there and they are usually happy to let interested people show up to talks. Ask to get on mailing lists for visiting speakers if you are near a university and know a student there.

Facebook

How did you get involved with Facebook, and what motivated you to decide to do your research on NFL/likes?

I interned at Facebook last summer. I proposed some of my dissertation research to them, applied for an internship, and got the job. I did the NFL work because I like the NFL and I wanted to write a blog post to get some of my work in front of people. Academic research doesn’t have as broad a reach as small data projects like my NFL analysis.

What is it like doing research at Facebook?

Facebook is a great place to work or do research. They are very supportive of asking and answering hard/interesting questions. You can observe a lot of what a large number of people are doing online. The downside is that you must study human behavior on the web, which kind of limits the questions you can ask (e.g. you can’t say much about consumer behavior past clicking on ads because you never observe purchases).

The Data Science team at Facebook does top notch research on par with most academic departments that study similar areas. The size of the data isn’t really the major advantage, it’s the richness. It’s very fine grained, so you can answer many questions that were not possible before.

What was a day in the life like as a data science intern?

It’s a pretty sweet job. You work on problems/projects that are anywhere from 3 months to a year. They tend to be deep dives into problems/questions/techniques where the company would be interested in the results. They are higher risk than most projects because you’re trying to go for a big impact and novel work. There are some shorter-term projects mixed in that involve answering more immediate questions, and you tend to consult with other teams on problems where they don’t have enough expertise. Basically, you’re paid to think really hard along with a bunch of other smart people and collaboratively come up with completely new stuff. So that’s obviously pretty great. You also get a lot of choice in what you’d like to work on because there are usually more than enough problems/projects to choose from.

Data Science

What is the career of a data scientist like? What happens after that first entry level job?

This is pretty new ground, I think. I honestly don’t know what the field will look like in a few years. I see demand increasing, so there will be more entry level work or tasks that start to be done by MBAs or software engineers, but the more advanced stuff is not going away either. I would view the career as a series of new challenges. You want to work somewhere that gives you *hard* problems because those are the interesting ones and also where you can differentiate yourself. You get good at one kind of problem and it’s not hard anymore (or you solve it with technology), so you move onto new ones. I have friends that switch jobs just so they can work on something different. You could picture kind of hopping around from place to place and accumulating experience in different domains and trying to make the biggest impact – that’s how I see my future career path.

Finding Nate Silver

My (five minute) talk from Ignite Foo Camp on how we can build a reputation system to identify the people who are best at making predictions.

Science paper on Social Influence Bias

Exciting news! The paper I co-authored with Lev Muchnik and my advisor Sinan Aral, “Social Influence Bias: A Randomized Experiment,” was published in the August 9th edition of Science. Since many of you will read about the findings of the paper as interpreted by journalists, I thought it would be useful to give my own TL;DR version here. Plus a bonus plot that didn’t make it into the paper:

image

The first order result is that we have evidence that seeing prior ratings has a causal effect on rating behavior. When you rate things online, you are often exposed to others’ ratings (either aggregated or listed individually). It turns out that this does impact rating decisions and creates path dependence in ratings. The implication is that high or low ratings do not necessarily imply high or low quality because of social influence bias in rating systems. 

This finding requires a randomized experiment because items with current high ratings could simply be high quality, so if we see future high ratings it may not be because of any bias at all. We exogenously manipulate the initial ratings (“up-treatment” and “down-treatment”) in our design in order to isolate a pure influence effect.

I believe our study innovates beyond earlier work in this field (such as the excellent music lab experiments by Matt Salganik, Peter Dodds, and Duncan Watts) in at least two key ways.

First, because we examine both up- and down-treatments, we can characterize an interesting asymmetry in social influence. Up-treatment works exactly as we expect, creating a 25% increase in final scores of comments. However, the effect of down-treatment is more nuanced. People seem to respond differently to negative ratings, either by correcting them or herding on them. These two opposing responses roughly cancel out in the aggregate treatment effect on long-run ratings.

Second, repeated observation of the same users over time combined with the fact that users are able to either up-vote, down-vote, or abstain in response to treatments allows us to decompose treatment effects into selective turnout and opinion change. We find little evidence that our treatments inspire different types of people to rate (e.g., that up-treatment causes only positive people to vote), but we do find evidence that our treatments change the proportion of up-votes used among subgroups in our sample. We conclude that while our manipulations do draw attention to comments and inspire more voting, they don’t do it in any systematic way that we can identify, and opinion change is at least one significant component of the effects we observe.
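
As a stylized illustration of that decomposition (not the estimation in the paper or its supplementary materials), you can split each arm's responses into a turnout margin and an opinion margin with made-up data:

```python
import pandas as pd

# Hypothetical per-viewer records: treatment arm and the viewer's response
# (+1 up-vote, -1 down-vote, 0 abstain). All values are invented.
votes = pd.DataFrame({
    "arm":      ["control"] * 6 + ["up"] * 6 + ["down"] * 6,
    "response": [1, 0, 0, -1, 0, 1,
                 1, 1, 0, 0, 1, -1,
                 1, -1, 0, -1, 0, 1],
})

grouped = votes.groupby("arm")["response"]
summary = pd.DataFrame({
    "turnout": grouped.apply(lambda r: (r != 0).mean()),           # voted at all?
    "up_share": grouped.apply(lambda r: (r[r != 0] == 1).mean()),  # opinion among voters
    "mean_score": grouped.mean(),                                  # net effect per viewer
})
print(summary)
# Differences in turnout across arms point to selective turnout; differences
# in up_share among voters point to opinion change.
```

The actual analysis has to handle repeated observations of the same users and much messier selection, which is what made this decomposition a technical challenge.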

Distinguishing between these explanations was a sticking point of the reviewers and a major technical challenge in the paper. If you’re interested in more detail, I highly recommend reading the supplementary materials to see what I had been working on all Spring.

There are a bundle of other interesting and suggestive results in the paper and the supplementary materials, so I encourage you to read both (and pass any questions along to us!). Many thanks to my co-authors Lev and Sinan for being so awesome to collaborate with, and to the anonymous reviewers at Science who made the paper much better through their thoughtful criticism.

Foo Camp Report

It was an honor to be invited to my first Foo Camp this year. The weekend certainly lived up to its billing as a gathering of “alpha geeks” that imposes only loose structure on their interactions.

Coming from the academic world, the flexible format and mix of attendees were completely foreign to me. Academic conferences are attended by (mostly) homogenous groups who have read the same literature, have attended the same conferences before, and know each other personally or by reputation. This approach streamlines communication because everyone is steeped in jargon and shares a large set of common experience.

Foo Camp is the opposite. The diversity of skills and perspectives makes transmitting and receiving information inefficient and uncomfortable. But I firmly believe that these feelings are precisely what happens when you are learning and making new connections. The weekend fell solidly on the exploration side of the exploration-exploitation tradeoff in a way that was refreshing.

The other campers were actually homogeneous in one important way: they were all people who get stuff done. Each person I met seemed to have been through at least a few cycles of taking on ambitious challenges and succeeding at them. (As my advisor would say, these people all had great red zone offenses.) This can be more inspiring than innovative ideas: it’s nice to hear stories from others who have worked hard to take their projects across the finish line.

My contribution to the discussions stemmed from my two current interests. First, I tried to relate as many ideas as possible to my own experimentalist worldview. Ben Waber and I led a session on broadening the scope of experimentation in organizations to include self-experimentation. I think there is a false conceptual divide between the experimenter and the subject, leaving us uncomfortable applying scientific methods to ourselves. Yet, as the quantified self movement has shown for individuals, rigorously collecting data and applying carefully planned manipulations can help us learn about ourselves in addition to our environments.

Second, I shared my vision for systems which reward skill in prediction. We take for granted that smart people like Nate Silver will become popular, but I believe we should re-evaluate how reliably we come to pay attention to the people who actually add value to the conversation. My Ignite talk outlined the missing pieces needed to learn who is worth listening to when we care about what will happen in the future. The success of open source software projects and competitive platforms like Kaggle are inspiring early examples of how information technology and incentives can help reward the right people for their investments in skills and expertise.

I have to thank Tim O’Reilly and Sara Winge for curating such a memorable event. Bringing together so many amazing people and keeping them captive/cogitating for a whole weekend is a huge coordination problem. After years of focusing on my Ph.D. in a small field, I really enjoyed talking with people who were hackers like me and yet completely, wonderfully different.

Full Time

I’m excited to announce that today I accepted a full-time job offer to join Facebook’s Data Science team. It was a long road to this decision but I’m pretty pleased with where I ended up. I’ll be starting in the fall on – as luck would have it – the same day as my friend John Myles White.

Leaving Academia

It was tough to leave the traditional academic route behind. I considered applying for tenure track positions up until just a couple of weeks ago. Ultimately I decided in favor of what I feel to be a more collaborative, cross-disciplinary environment than most departments I could have joined.  I couldn’t picture a better team of people to learn from while I work on hard problems. The access to data and experimentation is icing on the cake.

Choosing an Industry Job

I had another offer from a company that I love and it was incredibly difficult to choose between the two. I won’t go into details but it made the whole process that much more difficult. In the end, Facebook provided a unique mix of social science research combined with interesting business problems that would be hard to duplicate elsewhere. The company is very supportive of academic publications – a benefit I currently find hard to give up. I guess the other factor was that having done an internship there, I had great information on how talented the team is.

Finishing my PhD

I’ve got a lot of work left to do before I defend my dissertation in October! I look forward to posting updates on some chapters as they progress over the summer. I’m currently living and writing in San Francisco but will be back in New York for some time around my defense. The work focuses on the results of three experiments in online social interactions, two that I was involved in at Facebook and one on a social news discussion website similar to Reddit.

While I’ll be working pretty hard over the next few months, please feel free to get in touch if you’re in the Bay Area and would like to grab coffee/beer and talk about causal inference, experiments, etc. 

Talk on Ranking NFL Teams

I gave a talk last month at the New York Open Statistical Programming Meetup on ranking systems, specifically applied to the NFL. You can find slides, code, and an IPython notebook which contains most of the information. I encourage you to look at the slides, which I spent a lot of time on.  They contain two embedded interactive visualizations.  I did get my Super Bowl prediction wrong, though.

There were about 200 attendees, but unfortunately there is no video of the talk. Thanks to everyone who came; it was incredibly fun for me.

The talk was mostly a review/comparison of different methods:

  • Pythagorean wins
  • Eigenvector methods
  • the Bradley-Terry-Luce model
  • optimal rankings

The last one warrants more explanation. I had previously reviewed the optimal descriptive ranking problem and my solution. It’s a fascinating application of graph theory to a problem that most people wouldn’t consider graph-theoretic. Once the ranking problem is posed as a topological sort of a graph which contains cycles, it’s easy to describe an exact (if non-unique) solution as well as an algorithm to approximate it. The results are quite striking: a 10+% increase in the number of correctly described games over the other models.

NFL Fans on Facebook

I made a promise to myself when I interned at Facebook that I would write at least one blog post on the NFL.  Today I got to publish it!  Technically this was all quite trivial, mostly just aggregating users by teams and geography, then producing the maps using D3 (I had to rasterize them to post them on Facebook).

The NFL friendships piece involved joining the social graph with the Like graph. If I had had more time, I would have looked for rivalries: pairs of teams whose fans, despite living in close proximity, are unlikely to be friends with each other, holding relative geography fixed while varying the two teams.

The finding that winning is correlated with Likes is also interesting.  To an economist, this would seem to be an excellent instrumental variable.  (This is not a new idea).  Conditional on the spread of the game, wins and losses are basically coin flips, and yet they seem to be correlated with big increases in team fan bases. This strategy could potentially be used to identify peer effects if one were willing to make a bundle of assumptions about dynamics.
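
Here is a toy sketch of that reduced-form idea in Python, with simulated data and made-up effect sizes (the column names are just for illustration): conditional on the spread, winning is close to a coin flip, so the coefficient on the win indicator can be read causally.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated stand-in for one row per team-game: the pregame point
    # spread, whether the team won, and the change in its fan count.
    rng = np.random.default_rng(0)
    n = 500
    spread = rng.normal(0, 7, n)
    won = (spread + rng.normal(0, 10, n) > 0).astype(int)
    likes_gained = 1000 + 500 * won + 30 * spread + rng.normal(0, 200, n)
    games = pd.DataFrame({"spread": spread, "won": won,
                          "likes_gained": likes_gained})

    # Controlling for the spread, the remaining variation in `won` is
    # close to random, so its coefficient estimates the Like bump.
    model = smf.ols("likes_gained ~ won + spread", data=games).fit()
    print(model.params["won"])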

Real scientists make their own data

Among budding social and data scientists, a question you often hear is “where can I get data?” It comes up so often that people like Hilary Mason, who I’m sure gets this question all the time, have posted pages of resources. Getting new data can be just what you need to practice a technique you are learning or to complete a project you can publish or add to your portfolio.

Here I argue that if you want to make a bigger impact as a scientist, you should  make your own data instead of downloading it. Here are my points:

  1. Historically science has been about observing (and sometimes manipulating) phenomena. Some of the most important contributions across all fields of science have occurred through actual hands-on data collection.
  2. Making your own data means you are creating new facts about the world which gives you privileged access to scientific findings. Novel data is a source of competitive advantage that is sustainable, unlike being clever about your analysis.
  3. When you invest in making your own data, you are forced to consider your research question long before diving into analyses. Having data that were costly to gather or create conveys that you are a thoughtful, careful researcher who knows a lot about her domain. And it probably means that you are addressing an important question.
  4. If you are the creator of your data set, then you are likely to have a great understanding of the data-generating process. Blindly downloading someone’s CSV file means you are much more likely to make assumptions which do not hold in the data.
  5. Making your own data means you can run randomized experiments, which enable you to make causal inferences. This is a big deal if you want to make any claims about policy decisions.

Think about a famous scientist who has inspired you. It’s likely he or she invested heavily in data collection. From Aristotle to the natural philosophers, such as Galileo, who practiced what evolved into modern science, there is a rich tradition of making careful observations of interesting phenomena. Without modern methodology and theory, these proto-scientists were making important discoveries through careful record-keeping.

Gregor Mendel spent years breeding bees and pea plants in order to discover the laws of modern genetics. Ronald Fisher, whom we tend to regard as a statistician because of his methodological contributions, was actually coordinating large-scale crop experiments at the Rothamsted Experimental Station. They had to use elbow grease and wait a long time to make important new discoveries.

Stanley Milgram is perhaps the best example of making your own data. The creativity in his work is apparent in how he approached difficult but important questions. He didn’t throw his hands up and resort to casual theorizing; he invented novel empirical strategies, like sending chain letters, to find answers.

This is my thesis:

Your best chance to make a serious contribution as a business or academic researcher is to find, make and combine novel data.

Almost everybody in your field will be as well-trained as you are. They will be able to run a Google search for data sets just as you can, and they will be able to apply methods like regression, clustering, and visualization to the data they find. If you want to compete, I suggest you allocate a substantial portion of your effort toward both 1) asking excellent questions and 2) constructing your own data that can directly answer them.

Many social and data scientists are reluctant to invest in making data because it’s much more costly and risky than analyzing data you already have available. Sure, it’s a gamble, but the payoffs can be substantial, both for science and for your reputation. As a thought experiment, I urge you to consider the last time you were really wowed by a finding that wasn’t produced using new data.

But how can you make your own data?  Here are my suggestions:

  1. If you already work as a data scientist for a company with a product or service, add your own instrumentation instead of relying on logs. Get permission to run randomized experiments which can tackle tough questions.
  2. If you don’t work for a company, ask to partner with or intern for one. If you are answering a question they find interesting, and you are willing to help plan the data collection and analysis, you might just get their attention. Help out at DataKind and you’re almost guaranteed to get new data that no one has ever seen before.
  3. Experiment on or survey your friends. With social networks and kind friends (and maybe some promises of free drinks), you can often get a large enough convenience sample to test a hypothesis.
  4. Buy some new friends. Many excellent behavioral studies have been conducted on Mechanical Turk, including some in progress by famous data scientists.
  5. Build your own website or application. This is the costliest route, but allows you the most control. I spent two years building Creds so I could study how beliefs about uncertainties are correlated in social networks. Some researchers at Boston University built a news reader application to study how humans allocate attention to news articles. This is a serious gamble that could pay off in a big way.

I’m sick of seeing the same old data sets recycled with a slightly new analysis. Use old data to practice techniques, but if you want to get serious about being a scientist, make your own data.

Creds Blog: The Simon-Ehrlich Wager and Another Role for Markets

Creds is a side-project I’ve been working on for awhile now. I encourage you to 1) check it out and 2) read this post to see my inspiration.

credsblog:

In 1980, an economist named Julian Simon and a biologist named Paul Ehrlich had a disagreement about the future scarcity of natural resources on Earth. Being gentlemen, they decided to stake some token money, as well as their reputations, on what they believed.

The Statistics Software Signal

Last night on Twitter, I went on a bit of a rant about statistics packages (namely Stata and SPSS).  My point was not that these software packages are bad per se, but that I have found them to be correlated with bad quality science.  Here is my theory why.

  1. When you don’t have to code your own estimators, you probably won’t understand what you’re doing. I’m not saying that you definitely won’t, but push-button analyses make it easy to compute numbers that you are not equipped to interpret.
  2. When it’s extremely cheap to perform inference, you are likely to perform a lot of inferences. When your first regression gives a non-result, you run a second one, then a third one, and so on. This leads untrained researchers into multiple comparisons problems and inflates the risk of Type I errors (see the quick simulation after this list).
  3. When operating software doesn’t require much training, users of that software are likely to be poorly trained. This is an adverse selection issue. Researchers who care enough about statistics tend to have gravitated toward R at some point. I also trust results produced using R more, not because it is better software, but because it is difficult to learn. The software is not causing you to be a better scientist, but better scientists will be using it.
  4. When you use proprietary software, you are sending the message that you don’t care about whether people can replicate your analyses or verify that the code was correct.  Most commercial software is closed source and expensive.  We can never know if the statisticians at Stata have a bug in their code unless we trust them to tell us.  Also consider researchers from schools or companies which can’t afford expensive commercial software.  Should they not be able to reproduce your results?
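
The quick simulation promised in point 2, a back-of-the-envelope illustration of how cheap, repeated inference inflates false positives (pure noise, twenty tests, no particular software package implied):

    import numpy as np
    from scipy import stats

    # Simulate an analyst who keeps testing pure noise until something
    # "works": 20 independent tests, each at the 5% level.
    rng = np.random.default_rng(1)
    n_sims, n_tests = 2000, 20
    found_something = 0
    for _ in range(n_sims):
        pvals = [stats.ttest_ind(rng.normal(size=50),
                                 rng.normal(size=50)).pvalue
                 for _ in range(n_tests)]
        found_something += min(pvals) < 0.05

    # Roughly 1 - 0.95**20, about 64%, of these analysts "discover" an
    # effect that isn't there.
    print(found_something / n_sims)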

I do think these packages are valuable and can be used for good. I have used Stata and it has saved me plenty of time. My main point is that there are a number of mechanisms through which bad science can be correlated with using push-button statistics software, not that one is a direct consequence of the other.


What your statistical software says about you (to me):

  • R : You are willing to invest in learning something difficult.  You do not care about aesthetics, only availability of packages and getting results quickly. 
  • Python or JVM languages : You are a hacker who may have already been a programmer before you delved into statistics. You are probably willing to run alpha or beta-quality algorithms because the statistical package ecosystem is still evolving. You care about integrating your statistics code into a production codebase.
  • Julia : You are John Myles White.
  • Stata : You are an economist who doesn’t care to code your own estimators, probably because your comparative advantage lies elsewhere.  Possibly you are doing sophisticated work with panel data where Stata is the only game in town.  You don’t care that you can’t do proper programming because you’re not a programmer.
  • SPSS : You love using your mouse and discovering options using menus. You are nervous about writing code and probably manage your data in Microsoft Excel.
  • Matlab : You definitely know what you’re doing and you care about performance. You know Matlab is expensive but you aren’t the one paying for it. You live in a bubble where everyone you know uses Matlab.
  • Mathematica : You are an aesthete who believes everything Stephen Wolfram says.
  • SAS : You are an analyst for a large pharmaceutical company, and SAS is all you have ever known. You have a large library of custom SAS macros, so that (clearly) makes you a programmer. That anyone would want to hand-code statistical methods leaves you utterly baffled. If SAS does not ship with a particular statistical method, then it probably isn’t important. (h/t Chris Fonnesbeck)

Trouble in the Sandia Mountains

Last night I made a series of decisions that seriously jeopardized my life.

The Sandia mountains rise up to the East of Albuquerque and provide views of a beautiful Southwest sunset.  You can take a cable car up to a lodge at the crest of the mountains.  When we arrived at the departure point, I decided I would go for a quick trail run when we got to the top.  Having been trapped in the car all day on our way from Flagstaff, I was excited to move.  I had been running mostly in San Francisco the last few months and missed the rugged terrain I frequented while I was living in Palo Alto over the summer.

But more than that, I had hiked on South Crest Trail before, six years ago when visiting Albuquerque with a then-girlfriend.  I thought the trail was amazing and wanted to push further, but she was pragmatic and made me turn around earlier than I wanted. From the cable car you can track the crest of the trail and see dozens of spectacular Western-facing viewpoints along the way.  I thought this was a great opportunity to see some of the views that I had missed a few years ago.

We arrived with only 75 minutes of sunlight left, so I dressed quickly and left my phone in the car. It was 65 degrees, sunny, and calm at the base of the mountain, but in the waiting room for the cable car there was a weather station showing that going up to 10,000 feet of elevation would drop the temperature to 36 degrees and raise the wind to 28mph. I looked down at my shorts and running hoodie and decided it would be okay, but that I was pushing it.

When we got to the top I quaffed some water at the fountain in the tram station and took off to the South. It was immediately familiar, so I stopped only briefly to enjoy a few views and pushed on to see what was further. The trail was rough: lots of rocks and quick elevation changes, surrounded by dense pine trees. I figured I would run about 25 minutes out and then turn around in time to use the dusk light to find my way back. The trail didn’t seem complicated enough to get lost on, with only a couple of obvious-seeming turns.

Sadly, the trail meandered to the left along the Eastern side of the ridge. This meant two things: no views of the sun setting to the West, and it was getting darker faster than I had expected. I figured if I ran a bit further the trail would eventually come back up to the crest and I could catch the beginning of the sunset. I picked up my pace.

This actually worked: the trail wound to the right almost 180 degrees and ended up giving me some of the views I was craving. Perhaps it was altitude-induced optimism, but I thought that maybe it would provide an alternative route home that wouldn’t require retracing all my steps. I pushed on for longer than I should have before realizing I was on a completely separate ridge of the mountain. I would need to run even further to get home.

I ran faster as I realized I was running out of daylight. The Eastern side of the mountain was already quite dark. As I ran out of light, I had to slow my pace to avoid turning my ankle on a rock. After 20 minutes of hurrying to get back I realized I was actually descending the mountain on the Eastern side.  I had made a wrong turn at some point and it wasn’t obvious to me where.  The difference between the trail and the woods was now barely visible.

I began yelling “Hello” and “Help” at regular intervals, hoping that I could find another hiker. I never heard a response. I turned around and headed back, straining my eyes to look for landmarks.  I knew that keeping calm was the most important thing but it was a struggle to do so.  It was only getting colder and darker.  The one upshot was that I knew I couldn’t be more than a couple miles from the cable car station and that my dad and sister would be wondering where I was.  At worst I would need to be found by rangers with flashlights.  The questions were whether it would get to that point and how long that would take.

Doubling back, I felt more lost than before because I was headed in the wrong direction. Perhaps the scariest part was finding a turn in the right direction, following it for a while, and realizing I was looping back the wrong way. Then I experienced the old “getting lost” trope of seeing the same landmark twice, in this case a pair of white logs crossed in an X, visible in the moonlight.

At this point it was so dark I was barely staying on the trail. I had to reach down and feel that the ground was soft and loose to make sure I was walking where I was supposed to. Despite this strategy, I made countless mistakes and tore up my legs walking through brush. Losing the trail like this was the scariest thing I experienced. At some points I would get on my hands and knees and crawl around until I found soft dirt to walk on.

I backtracked more and found a signpost I knew was on my original route. This was where I had made the wrong turn. Following the trail out from here was harder than I expected, but from this point it was only a matter of time until I found my way out. I knew this and started deliriously belting out “99 bottles of beer on the wall” as I marched toward the station. This last stretch was firmly on the ridge and dimly lit by Albuquerque’s city lights. I knew I would be ok.

I arrived at the cable car station at around 7:15pm, two hours after I expected to be getting home. I was deliriously happy until I saw how upset my dad and sister were. They were on the phone with the state police and desperately trying to get them to send out a search party.  I had gotten there just in time to prevent that.

The lessons in this story can only be described as common sense. I went running too late, pushed too hard, and underestimated how dark it would get. It was pure trail-runner arrogance that got me into this situation.  I hadn’t brought a flashlight or a phone, either of which would have made this situation far less hopeless.

Despite the stupidity I demonstrated, I feel like it was a valuable experience. Sometimes you need to make stupid mistakes and suffer the consequences to learn lessons. Escaping the danger I faced in the woods, in the dark, has made me more appreciative than ever of how wonderful my life is.

Optimal Descriptive NFL Rankings

Most NFL fans, like myself, obsess over who’s going to win games or which players to start in our fantasy football leagues. One of the fundamental tools we use to think about this is a ranking. Rankings are the simplest possible model that can represent a total order, which you can think of as a function that lets you compare every possible pair in your set.

Update : Though I came up with this on my own, this idea is pretty old. See Ali, Cook and Kress (1986). Next time I’ll spend some more time on Google Scholar :)

Describing the NFL season

In the NFL regular season, each of the 32 teams plays 16 games. The result is 256 binary outcomes. If we treat them like coin flips (independent trials of a fair coin), the entire season contains at most 256 bits of information. One way to think about this is that if you could send a string of 256 1s and 0s from January back in time to September, then with a simplistic coding scheme your past self could correctly predict all 256 games.

I began to think: we all love rankings, but how well can you describe the season with a ranking of teams? There are 32! possible rankings, so a ranking of all 32 teams contains log2(32!) bits of information, or about 118 bits. This is the smallest amount of information you could use to describe what happens in all possible pairings of teams. How accurate is it to describe 256 bits with 118 bits? How many games would you get wrong?
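
The arithmetic, as a quick check:

    import math

    # 256 games, each a binary outcome: at most 256 bits of information.
    games_bits = 256

    # A ranking is one of 32! possible orderings, so specifying one
    # takes log2(32!) bits.
    ranking_bits = math.log2(math.factorial(32))
    print(games_bits, round(ranking_bits, 1))  # 256 vs ~117.7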

This is also a good way to look at how random outcomes are. If rankings can’t describe the season well, then it’s hard to have a good intuition about which teams are the best in the league.

Ranking algorithms

Given my training in statistics and love of the NFL, I have tried ranking teams before. The model I used then was a variant of the BTL model. These models posit that teams can be assigned a real number representing their strength (they exist in a latent “quality” space), and that the probability of team A beating team B is an increasing (logistic) function of the difference in their strengths. Essentially this is a logistic regression with dummy variables for teams.
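
A minimal sketch of that formulation, using a made-up four-team schedule rather than real NFL results:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy schedule: (winner, loser) pairs for a 4-team league.
    games = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3), (3, 1)]
    n_teams = 4

    # One row per game with a +1/-1 dummy for each team involved; y = 1
    # means the +1-coded team won. Alternate the orientation so both
    # outcome classes appear.
    X = np.zeros((len(games), n_teams))
    y = np.zeros(len(games))
    for row, (winner, loser) in enumerate(games):
        if row % 2 == 0:
            X[row, winner], X[row, loser], y[row] = 1.0, -1.0, 1.0
        else:
            X[row, loser], X[row, winner], y[row] = 1.0, -1.0, 0.0

    # Logistic regression with no intercept: each coefficient is a latent
    # strength, and P(i beats j) = sigmoid(strength_i - strength_j).
    btl = LogisticRegression(fit_intercept=False).fit(X, y)
    print(np.argsort(-btl.coef_[0]))  # teams from strongest to weakest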

As I looked into the mathematics of ranking, I found that many smart people have thought long and hard about this problem. The abstract problem is how to take a series of pairwise comparisons and generate a total order relation.

The tricky bit is that these problems have a recursive structure: first you rank teams naively, then you re-rank adjusting for strength of schedule from your naive ranking, then you re-rank using the new strengths, and so forth. It’s no surprise then that most sophisticated ranking procedures use something like the power method to compute eigenvectors of a square matrix that encodes the pairwise comparisons.
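
A small sketch of the eigenvector idea on the same toy schedule (real systems differ in how they weight and smooth the matrix):

    import numpy as np

    # Win matrix for a toy 4-team schedule: W[i, j] = 1 if i beat j.
    wins = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3), (3, 1)]
    W = np.zeros((4, 4))
    for winner, loser in wins:
        W[winner, loser] += 1.0

    # A little smoothing makes the matrix strictly positive, so the power
    # method is guaranteed to converge to its leading eigenvector.
    A = W + 0.01

    # Power method: each team is repeatedly credited with the strength of
    # the teams it beat, then the vector is renormalized.
    r = np.ones(4)
    for _ in range(100):
        r = A @ r
        r /= np.linalg.norm(r)

    print(np.argsort(-r))  # teams ordered by eigenvector rating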

The BTL model is probabilistic, while the eigenvector model is derived using linear algebra. They are computed differently but there is a deep relationship between the two in that they optimize very similar objective functions.

Optimal ranking with graphs

The pairwise comparison matrix (the element at row i, column j is 1 if i beat j) is also a representation for a directed graph. Using this representation, the ranking problem can then be posed slightly differently.

In the BTL model and the eigenvector model, we wanted to project each team onto the real line. If we think of our pairwise comparisons as constituting a directed graph, then the ranking is actually a path through the graph. At first I thought this path might be related to the TSP with appropriate edge-weights, but this was a bit off.

The algorithm that finds this path is called a topological sort: an ordering of the vertices of a directed graph such that every edge points from an earlier vertex to a later one. The problem is that you can’t perform a topological sort when the graph contains cycles. Cycles are a generalization of transitivity violations: situations where A beat B, B beat C, but C beat A. We can’t sort the graph if these exist; if they don’t, the sort runs in linear time.

The set of edges (games) which introduce cycles into the graph is called the feedback arc set (FAS). You can think of this as the set of cases the ranking model is too simplistic to explain. It seems a bit backwards, but if we can get rid of all the upsets in advance, then we can rank the teams easily and quickly. Finding the minimum FAS is an NP-hard problem (of course!), but for a 256-edge graph it’s not a problem to compute with igraph, among other packages. The minimal FAS is not guaranteed to be unique (for a trivial example, the minimal FAS of a 3-cycle can be any one of the three edges [1]), so unfortunately the ranking may not be unique either.

After removing the FAS, the topological sort produces the optimal ranking in the following sense: no other ranking will make fewer mistakes when applied to the 256 games. This is essentially minimizing 0-1 loss of the model, which is exactly our objective function if we want rankings that make the fewest mistakes. This ranking may not be unique, but you cannot describe the season better with any ranking model.
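
A bare-bones sketch of that procedure with python-igraph, again on a made-up four-team schedule rather than the full 256-game season (not the code linked below, just the idea):

    import igraph as ig

    # Toy season as a directed graph: one edge (winner, loser) per game.
    # The 3-over-1 result creates a cycle (1 beat 2, 2 beat 3, 3 beat 1).
    wins = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3), (3, 1)]
    g = ig.Graph(n=4, edges=wins, directed=True)

    # Remove a minimum feedback arc set: the smallest set of "upsets"
    # whose removal leaves an acyclic graph.
    upsets = g.feedback_arc_set()
    g.delete_edges(upsets)

    # A topological sort of the remaining DAG is the optimal descriptive
    # ranking; the games it gets wrong are exactly the removed upsets.
    print(g.topological_sorting(), "upsets removed:", len(upsets))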

Results

I implemented a few ranking systems in Python. You can check out the code on GitHub. What I find amazing is how poorly the BTL and eigenvector methods compare, at least descriptively, to the optimal ranking. Those rankings are only about 70-75% accurate, while the optimal ranking almost always breaks 80%. That’s about 10-20 games that would be considered upsets under the BTL and eigenvector methods but not under the optimal ranking.

Visualizing

I put together a visualization of the win-matrix using d3.js. Squares are colored if the row team beat the column team. Our “loss” comes from wins that are inconsistent with our ranking model: any square beneath the diagonal. You can also look at a bigger and interactive version of the visualization.

[Interactive ranking table omitted; columns: Name, Wins, BTL, Optimal Rank, Loss]

A Challenge: minimum description length

I have shown that, at least for the task of description, a graph-based algorithm which minimizes the correct loss function is superior to probabilistic models which don’t actually minimize the right thing. But the ranking is still not a perfect representation of the NFL season, getting about 20% of the games wrong on average. My question to you is: describe a 100% accurate encoding of the NFL season that uses the fewest possible bits. I have thought hard about this and still don’t have a satisfying answer.