I occasionally get emails from people asking me for advice or about my experiences before, during, and after graduate school. Here are some questions I’ve answered in the past. I’ll add more here as I receive and answer them.
Exciting news! The paper I co-authored with Lev Muchnik and my advisor, Sinan Aral. Social Influence Bias: A Randomized Experiment was published in the August 9th edition of Science. Since many of you will read about the findings of the paper as interpreted by journalists, I thought it would be useful to give my own TL;DR version here. Plus a bonus plot that was not useful to include in the paper:
It was an honor to be invited to my first Foo Camp this year. The weekend certainly lived up to its billing as a gathering of “alpha geeks” that imposes only loose structure on their interactions.
Coming from the academic world, the flexible format and mix of attendees were completely foreign to me. Academic conferences are attended by (mostly) homogenous groups who have read the same literature, have attended the same conferences before, and know each other personally or by reputation. This approach streamlines communication because everyone is steeped in jargon and shares a large set of common experience.
Foo Camp is the opposite. The diversity of skills and perspectives makes transmitting and receiving information inefficient and uncomfortable. But I firmly believe that these feelings are precisely what happens when you are learning and making new connections. The weekend fell solidly on the exploration side of the exploration-exploitation tradeoff in a way that was refreshing.
The other campers were actually homogenous in one important way: they were all people who get stuff done. Each person I met seemed to have been through at least a few cycles of taking on ambitious challenges and succeeding at them. (As my advisor would say, these people all had great red zone offenses.) This can be more inspiring than innovative ideas — it’s nice to hear stories from others who have worked hard to take their projects across the finish line.
My contribution to the discussions stemmed from my two current interests. First, I tried to relate as many ideas as possible to my own experimentalist worldview. Ben Waber and I led a session on broadening the scope of experimentation in organizations to include self-experimentation. I think there is a false conceptual divide between the experimenter and the subject, leaving us uncomfortable applying scientific methods on ourselves. Yet, as the quantified self movement has shown for individuals, rigorously collecting data and applying carefully planned manipulations can help us learn about ourselves in addition to our environments.
Second, I shared my vision for systems which reward skill in prediction. We take for granted that smart people like Nate Silver will become popular, but I believe we should re-evaluate how reliably we come to pay attention to those who add some value to the conversation. My Ignite talk outlined the missing pieces needed to learn who is worth listening to when we care about what will happen in the future. The success of open source software projects and competitive platforms like Kaggle are inspiring early examples of how information technology and incentives can help reward the right people for their investments in skills and expertise.
I have to thank Tim O’Reilly and Sara Winge for curating such a memorable event. Bringing together so many amazing people and keeping them captive/cogitating for a whole weekend is a huge coordination problem. After years of focusing on my Ph.D. in a small field, I really enjoyed talking with people who were hackers like me and yet completely, wonderfully different.
I’m excited to announce that today I accepted a full time job offer at Facebook’s Data Science team. It was a long road to this decision but I’m pretty pleased with where I ended up. I’ll be starting in the Fall on — as luck would have it — the same day as my friend John Myles White.
It was tough to leave the traditional academic route behind. I considered applying for tenure track positions up until just a couple of weeks ago. Ultimately I decided in favor of what I feel to be a more collaborative, cross-disciplinary environment than most departments I could have joined. I couldn’t picture a better team of people to learn from while I work on hard problems. The access to data and experimentation is icing on the cake.
Choosing an Industry Job
I had another offer from a company that I love and it was incredibly difficult to choose between the two. I won’t go into details but it made the whole process that much more difficult. In the end, Facebook provided a unique mix of social science research combined with interesting business problems that would be hard to duplicate elsewhere. The company is very supportive of academic publications — a benefit I currently find hard to give up. I guess the other factor was that having done an internship there, I had great information on how talented the team is.
Finishing my PhD
I’ve got a lot of work left to do before I defend my dissertation in October! I look forward to posting updates on some chapters as they progress over the summer. I’m currently living and writing in San Francisco but will be back in New York for some time around my defense. The work focuses on the results of three experiments in online social interactions, two that I was involved in at Facebook and one on a social news discussion website similar to Reddit.
While I’ll be working pretty hard over the next few months, please feel free to get in touch if you’re in the Bay Area and would like to grab coffee/beer and talk about causal inference, experiments, etc.
There were about 200 attendees, but unfortunately there is no video of the talk. Thanks to everyone who came; it was incredibly fun for me.
The talk was mostly a review/comparison of different methods:
the Bradley-Terry-Luce model
The last one warrants more explanation. I had previously reviewed the optimal descriptive ranking problem and my solution. It’s a fascinating application of graph theory to a problem that most people wouldn’t consider to be graph-theoretic. Once the ranking problem is posed as a topological sort of a graph which contains cycles, it’s easy to describe an exact (if non-unique) solution as well as find an algorithm to approximate it. The results are quite stunning: a 10+% increase in the number of correctly described games from the other models.
I made a promise to myself when I interned at Facebook that I would write at least one blog post on the NFL. Today I got to publish it! Technically this was all quite trivial, mostly just aggregating users by teams and geography, then producing the maps using D3 (I had to rasterize them to post them on Facebook).
The NFL friendships thing involved a join on the social graph with the Like graph. If I had had more time, I would have looked for rivalries. My plan was to look for friendships that were unlikely holding relative geography fixed but changing the two teams. The idea was to look for some pairs of teams that despite having pairs of fans in close proximity are unlikely to have fan friendships.
The finding that winning is correlated with Likes is also interesting. To an economist, this would seem to be an excellent instrumental variable. (This is not a new idea). Conditional on the spread of the game, wins and losses are basically coin flips, and yet they seem to be correlated with big increases in team fan bases. This strategy could potentially be used to identify peer effects if one were willing to make a bundle of assumptions about dynamics.
Around budding social- and data scientists, a question you often hear is “where can I get data?” It happens so often that people like Hilary Mason, who I’m sure gets this question all the time, have posted pages with resources. Getting new data can be just what you need to practice a technique you are learning or complete a project that you can publish or add to your portfolio.
Here I argue that if you want to make a bigger impact as a scientist, you should make your own data instead of downloading it. Here are my points:
In 1980, an economist named Julian Simon and a biologist named Paul Ehrlich had a disagreement about the future scarcity of natural resources on Earth. Being gentlemen, they decided to stake some token money-as well as their reputations-on what they believed.
Last night on Twitter, I went on a bit of a rant about statistics packages (namely Stata and SPSS). My point was not that these software packages are bad per se, but that I have found them to be correlated with bad quality science. Here is my theory why.
Last night I made a series of decisions that seriously jeopardized my life.
The Sandia mountains rise up to the East of Albuquerque and provide views of a beautiful Southwest sunset. You can take a cable car up to a lodge at the crest of the mountains. When we arrived at the departure point, I decided I would go for a quick trail run when we got to the top. Having been trapped in the car all day on our way from Flagstaff, I was excited to move. I had been running mostly in San Francisco the last few months and missed the rugged terrain I frequented while I was living in Palo Alto over the summer.
Most NFL fans, like myself, obsess over who’s going to win games or which players to start in our fantasy football leagues. One of the fundamental tools we use to look at this are rankings. Rankings are the simplest possible model that can represent a total order, which you can think of as a function that allows you to compare all possible pairs in your set.
Describing the NFL season
In the NFL regular season, each of the 32 teams plays 16 games. The result is 256 binary outcomes, which if we assume are the same as coin flips (independent trials of a fair coin), then the entirety of the season contains at most 256 bits of information. One way to think about this is that if you could send yourself a string of 256 1s or 0s from January back in time to Septmeber, with a simplistic coding scheme, your past self could correctly predict all 256 games.
I began to think: we all love rankings, but how well can you describe the season with a ranking of teams? There are 32! possible rankings, so a ranking of all 32 teams contains log2(32!) bits of information, or about 118 bits. This is the smallest amount of information you could use to describe what happens in all possible pairings of teams. How accurate is it to describe 256 bits with 118 bits? How many games would you get wrong?
This is also a good way to look at how random outcomes are. If rankings can’t describe the season well, then it’s hard to have a good intuition about which teams are the best in the league.
Given my training in statistics and love of the NFL, I have tried ranking teams before. The model I used then is a variant of the BTL model. These models essentially posit that teams can be assigned a real number that represents their strength (they exist in a latent “quality” space), and that the probability of team A beating team B is proportional to the differences in their strengths. Essentially this is a logistic regression with dummy variables for teams.
As I looked into the mathematics of ranking, I found that many smart people have thought long and hard about this problem. The abstract problem is how to take a series of pairwise comparisons and generate a total order relation.
The tricky bit is that these problems have a recursive structure: first you rank teams naively, then you re-rank adjusting for strength of schedule from your naive ranking, then you re-rank using the new strengths, and so forth. It’s no surprise then that most sophisticated ranking procedures use something like the power method to compute eigenvectors of a square matrix that encodes the pairwise comparisons.
The BTL model is probabilistic, while the eigenvector model is derived using linear algebra. They are computed differently but there is a deep relationship between the two in that they optimize very similar objective functions.
Optimal ranking with graphs
The pairwise comparison matrix (the element at row i, column j is 1 if i beat j) is also a representation for a directed graph. Using this representation, the ranking problem can then be posed slightly differently.
In the BTL model and the eigenvector model, we wanted to project each team onto the real line. If we think of our pairwise comparisons as constituting a directed graph, then the ranking is actually a path through the graph. At first I thought this path might be related to the TSP with appropriate edge-weights, but this was a bit off.
The algorithm to find this path is called a topological sort, the directed graph equivalent of a minimum spanning tree for an undirected graph. The problem is, you can’t perform a topological sort when there are cycles in the graph. Cycles are a generalization of transitivity violations: situation where A beat B, B beat C, but C beat A. We can’t sort the graph if these exist. If they don’t, the sort can take place in linear time.
The set of edges (games) which introduce cycles into the graph is called the feedback arc set (FAS). You can think of this as the set of cases the ranking model is too simplistic to explain. It seems a bit backwards, but if we can get rid of all the upsets in advance, then we can rank the teams easily and quickly. Finding the minimum FAS is an NP-hard problem (of course!), but for a 256 edge graph, it’s not a problem to compute it with igraph among others packages. The minimal FAS set is not guaranteed to be unique (for a trival example, the minimal FAS set for a 3-cycle can be any of the three edges ), so unfortuntately the ranking may not be unique.
After removing the FAS, the topological sort produces the optimal ranking in the following sense: no other ranking will make fewer mistakes when applied to the 256 games. This is essentially minimizing 0-1 loss of the model, which is exactly our objective function if we want rankings that make the fewest mistakes. This ranking may not be unique, but you cannot describe the season better with any ranking model.
I implemented a few ranking systems in Python. You can checkout the code on github. What I find amazing is how poorly BTL and eigenvector methods compare, at least descriptively, to optimal ranking. These rankings only are only about 70-75% accurate, while optimal ranking almost always breaks 80%. This is about 10-20 games which would be considered upsets under BTL and eigenvector methods that would not be considered upsets under the optimal ranking.
I put together a visualization of the win-matrix using d3.js. Squares are colored if the row team beat the column team. Our “loss” comes from wins that are inconsistent with our ranking model—any square beneath the diagonal. You can also look at a bigger and interactive version of the visualization.
I have shown that, at least for the task of description, a graph-based algorithm which minimizes the correct loss function is superior to probabilistic models which don’t actually minimize the right thing. But the ranking is still not a perfect representation of the NFL season, getting about 20% of the games wrong on average. My question to you is: describe a 100% accurate encoding of the NFL season that uses the fewest possible bits. I have thought hard about this and still don’t have a satisfying answer.