Benn Stancil

Are Home Run Derby Hitters Different?

2014-08-06T11:09:04-07:00

Tonight’s Home Run Derby features ten of the world’s best home run hitters. But the Derby isn’t about just hitting home runs—it’s about seeing how far they fly.

The graphic below explores how this year’s participants’ homers compare to the 40,000 home runs that have been hit over the last seven-and-a-half MLB seasons. Do they hit the ball harder, higher, and farther than average?

Data for the graphic was provided by ESPN. The full dataset for all regular season home runs can be found and analyzed on Mode.

Five Public Datasets, and Lots of Ideas for Exploring Them

2014-07-23T00:40:15-07:00

The world is full of interesting datasets. But even though data is increasingly accessible, it’s sometimes hard to think up an interesting problem to analyze. Maybe there are just too many possible questions, maybe it’s a pain to set up analytical tools, or maybe it’s just too easy to get distracted by animal GIFs.

Whatever the case, we want to make it easier to start working on interesting problems right away. Here are five datasets, already loaded into Mode’s public database, that you can query, analyze, and visualize right now.

For each dataset, I’ve provided a link to the table in Mode’s public data warehouse. If you’re feeling lazy and only want to work with a tiny amount of data (as in, one row), I found the best single row of data from each dataset. And if you’re feeling ambitious—and want to get popular on the internet or explain some things—I added some ideas for turning these datasets into maps.

FEC Campaign Finance Data

The Federal Election Commission requires candidates to make their campaign expenditures public. This dataset includes over 200,000 campaign expenditures from the 2012 U.S. presidential campaign, and is full of fascinating discoveries. Like Herman Cain’s $150,000 expense on Herman Cain. And the $5,000–the most of any candidate by far–Mitt Romney spent at liquor stores. And Ron Paul’s and Romney’s addiction to fast food (and Obama’s clear preference for Subway).

Herman Cain be like:

What kinds of questions can I ask? What do candidates spend the most money on? Do some candidates get a spending lead early, while others save for the end? How do spending patterns differ for the incumbent compared to challengers?
What’s the table called? cooldata.fec_2012_presidential_campaign_expenditures
What’s the best row? wut.
Can I use this data to make a map? Absolutely. You could see how money was spent in each state. Or you could see if different candidates focus on different regions. Or, if you were feeling particularly ambitious, you could map candidate expenses by day to sketch out how they traveled across the country during their campaigns.

Crunchbase

Crunchbase is quickly becoming the dataset of record for the startup and venture capital communities. It can provide information on anything from what industries are hot (biotech) to the potential effects of founder experience or age. The dataset includes funding, investment, and acquisition data on over 40,000 companies.

What kinds of questions can I ask? Are there characteristics of a company—industry, location, etc.—that differ by VC? Do some VCs typically invest together, while others rarely do so? Are companies raising more money earlier? ARE WE IN A BUBBLE??
What are the tables called? crunchbase.acquisitions; crunchbase.companies; crunchbase.investments; crunchbase.rounds.
What’s the best row? This one, which is approaching the theoretical limit of how good a row of data can be.
Can I use this data to make a map? Yes! Like this rather uninformative one, showing the number of startups by the county where they’re headquartered.

UFO Sightings

Quandl, which provides millions of free datasets on a vast range of subjects, added data on UFO sightings to Mode. The data includes the number of reported sightings by month. Quandl gets the data from the National UFO Reporting Center (and in case you need to report a sighting, they have a hotline).

What kinds of questions can I ask? Are some months more popular for sightings? What correlates with UFO sightings?
What’s the table called? thomas.ufo_sightings
What’s the best row? The first one, on a sighting from June 1400. The first sighting of the Black Knight?
Can I use this data to make a map? No. But you can probably combine it with some Independence Day GIFs and make a killer listicle.

FiveThirtyEight

FiveThirtyEight, Nate Silver’s data journalism site, produces a lot of great analysis. For some articles, they publish the underlying data on GitHub. If you want to explore their data or expand on their analyses, we’ve uploaded most of their datasets. A few topics include classic rock radio plays, the ages of Congressional representatives, World Cup predictions, and surveys about defining U.S. geographic regions and international cuisine preferences.

What kinds of questions can I ask? From the cuisine survey, do people from different areas of the country prefer different foods? Can we predict what food someone would like based on their other preferences? From the data on Congress age, it might be interesting to see if people from different states tend to elect representatives of different ages—and are those ages related to the age of the constituents? And from the classic rock data, which classic rock songs should we be most sick of by now? Which radio stations have the laziest DJs?
What are the tables called? cooldata.fivethirtyeight_region_survey; cooldata.fivethirtyeight_congress_age; cooldata.fivethirtyeight_world_cup_predictions; cooldata.fivethrityeight_classic_rock_plays; cooldata.fivethirtyeight_classic_rock_songs; fivethirtyeight_food_world_cup.
What’s the best row? Too soon?
Can I use this data to make a map? Yes! I made the map below to explore how people from different states defined the South and Midwest. You could map cuisine preferences by region or show how the ages of states’ Congressional representatives have changed over time.

Holidays all over the world

This dataset includes a list of all the holidays in the world over the next year. While this data is useful for analysis, it could be even more valuable for figuring out which parts of the world—and which of your customers—are on vacation.

What kinds of questions can I ask? Which countries have the most holidays? Which months and days have the most holidays? Which countries share a lot of holidays, and which only share a few?
What’s the table called? reference_lookups.holidays_by_country
What’s the best row? The freedom row.
Can I use this data to make a map? Yes! You could show the average number of holidays during the year by country, or where there’s a holiday on any given day.

Ideas for More?

Inspired to do something fun with one of these datasets? Send us a link to your project on Twitter or Facebook, and we’ll share some of the best work! And if you want to make a map, we’ll soon be publishing a quick tutorial on how to make one, but feel free to email us if you have any questions now.

Are Taxi Drivers Racist?

2014-06-23T15:57:14-07:00

Last week, Chris Whong published a massive dataset of every taxi trip taken in New York in 2013. The data, provided through a Freedom of Information Law request, includes an incredible amount of detail on where trips started, where they ended, when they occurred, how much they cost, and how many passengers there were.

A number of people have already done incredible things with this data, including making a remarkably detailed map of where cabs typically pick up and drop off passengers. A dataset of this detail opens the door for countless questions and angles of exploration.

One such question surrounds accusations that New York City cabs discriminate against potential passengers. A number of anecdotes claim that New York cabs are reluctant to stop for black passengers, especially after dark. Could this new dataset shed any light on this issue?

The dataset, which provides no personal details on passengers and drivers (sort of), can’t answer this question directly. However, by looking at where cabs pick up and drop off passengers—and by considering the racial makeup of those neighborhoods—we can start piecing together the evidence. It’s not conclusive, but it could be a start.

Where Cabs Go—And Where They Don’t

On the surface, taxis appear to avoid picking up passengers in neighborhoods more heavily populated by minorities. The chart below shows the number of taxi pickups per 1,000 people in areas bucketed by the demographic makeup of the residential population. As the chart shows (and the map above makes even clearer), the larger the minority share of an area’s population, the fewer taxi trips originate there.

A few details about the numbers above (and the numbers considered in the rest of this post) are important to note. First, the data above only includes trips from one week in June (chosen because it’s not affected by major holidays or severe weather). Even though this represents less than a 50th of the dataset, it still includes over 3 million trips. Second, because of the complex politics behind the New York City taxi system and its services outside of Manhattan, this analysis is limited to trips that originated in Manhattan (officially, New York County).

In order to determine the demographic makeup of pickup and dropoff locations, I rounded the latitude and longitude of each pickup and dropoff point to the nearest thousandth, which approximates the location within about 100 meters. For every rounded location, I looked up the Census tract that it falls in via the FCC Census block API. The Census provides demographic and economic data by Census tract, allowing the mapping between GPS coordinates and neighborhood demographics.
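The rounding step can be sketched in a few lines of Python. This is only an illustration, not the script used for the analysis: the function names, the sample point, and the rough figure of 111,320 meters per degree of latitude are all assumptions. It shows why rounding to the nearest thousandth of a degree lands within roughly 100 meters at Manhattan's latitude.

```python
import math

def round_coordinate(lat, lon, places=3):
    """Round a GPS point to a fixed number of decimal places."""
    return (round(lat, places), round(lon, places))

def grid_size_meters(lat, places=3):
    """Approximate north-south and east-west size of one rounding step.

    Uses a rough average of 111,320 meters per degree of latitude
    (an approximation, not a value taken from the post).
    """
    step = 10 ** -places
    meters_per_degree_lat = 111_320
    north_south = step * meters_per_degree_lat
    # Longitude lines converge toward the poles, so scale by cos(latitude).
    east_west = north_south * math.cos(math.radians(lat))
    return north_south, east_west

# Hypothetical pickup point in Midtown Manhattan
point = round_coordinate(40.758896, -73.985130)
ns, ew = grid_size_meters(40.758896)
print(point, round(ns), round(ew))
```

Each distinct rounded point then needs only one tract lookup, which keeps the number of API calls far below the number of trips.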

Though the chart above raises suspicions about racial profiling among taxi drivers, the result is far from conclusive. After all, pickup rates are determined by passenger demand as well as the preferences of cab drivers. Neighborhoods with more minorities could simply have fewer prospective passengers.

Why this might be true is a question worth its own exploration, but for the sake of this analysis, I added a couple factors that could serve as proxies for taxi demand:

1. Residents’ incomes - Wealthier people likely take more cabs than the less-well-off. If neighborhoods with mostly white residents tend to be more affluent (which they are), the effect above could be caused by economics rather than discrimination.

2. Location - Central Manhattan is largely populated by whites. Though the graph above controls for population size, it doesn’t account for commercial or tourist activity, which is likely concentrated in central Manhattan. The apparent high rate of cab activity in white tracts could actually be because whites live in central areas with more commercial activity, while minorities live on Manhattan’s edges.

To attempt to account for these two factors, I made a simple model that estimates how many taxi pickups are expected in a Census tract given its population, median income, distance from Times Square (roughly the center of Manhattan), and non-white population.
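The post doesn't say how the distance from Times Square was computed. One common approach is the haversine (great-circle) formula, sketched below. The Times Square coordinates are approximate, and the function name simply mirrors the distance_in_miles variable in the regression output; both are assumptions for illustration.

```python
import math

# Approximate location of Times Square (an assumption, not from the post)
TIMES_SQUARE = (40.7580, -73.9855)
EARTH_RADIUS_MILES = 3958.8

def distance_in_miles(lat, lon, origin=TIMES_SQUARE):
    """Great-circle distance in miles between a point and the origin."""
    lat1, lon1 = math.radians(origin[0]), math.radians(origin[1])
    lat2, lon2 = math.radians(lat), math.radians(lon)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# A hypothetical tract centroid 0.01 degrees north of Times Square
print(distance_in_miles(40.7680, -73.9855))
```

At Manhattan's scale, a straight-line distance like this ignores the street grid, which is probably fine for a rough proxy of centrality.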

A basic model with incomes, population, distance from Times Square, and the size of the white population suggests that these other factors—and primarily distance from Times Square—account for the apparent effect shown above. Nevertheless, even controlling for these variables, the correlation between the size of the white population and taxi pickups is strong enough to at least warrant further investigation.

# This model regresses the number of pickups by Census tract against the tract's income, population, white population ratio, and distance from Times Square. As expected, incomes are positively correlated with pickups, and distance from Times Square is negatively correlated. The size of the white population is also positively correlated with pickups, though, as shown by the p-value in the final column (about 0.14), the relationship isn't statistically significant at conventional levels.

Regression output:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.204e+04  2.149e+03   5.600 5.69e-08 ***
median_income      6.346e-02  2.006e-02   3.163  0.00176 ** 
population        -7.785e-02  1.728e-01  -0.450  0.65282    
white_percent      4.748e+03  3.244e+03   1.464  0.14455    
distance_in_miles -2.840e+03  3.126e+02  -9.084  < 2e-16 ***
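For readers less familiar with this kind of output, each t value is the estimate divided by its standard error. A minimal sketch that recomputes the column from the numbers printed above follows. Because the printed estimates and errors are rounded, the results should match the reported values only to about two decimal places.

```python
# (estimate, standard error, reported t value), copied from the output above
rows = {
    "median_income": (6.346e-02, 2.006e-02, 3.163),
    "population": (-7.785e-02, 1.728e-01, -0.450),
    "white_percent": (4.748e+03, 3.244e+03, 1.464),
    "distance_in_miles": (-2.840e+03, 3.126e+02, -9.084),
}

def t_value(estimate, std_error):
    """A coefficient's t statistic: estimate divided by its standard error."""
    return estimate / std_error

for name, (estimate, std_error, reported) in rows.items():
    recomputed = t_value(estimate, std_error)
    print(f"{name:18s} recomputed {recomputed:7.3f}   reported {reported:7.3f}")
```

As a rule of thumb for samples of this size, an absolute t value below about 2 corresponds to a p-value above 0.05, which is why white_percent doesn't earn a significance star.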

One claim worth exploring is the specific case highlighted in the anecdotes above: Cab drivers are reluctant to pick up black people (not necessarily all minorities) at night. If we limit our model to trips taken between 9:00 PM and 6:00 AM, the size of the black population appears to have a significantly negative effect on cab pickups.

# This model examines the number of pickups between 9:00 PM and 6:00 AM by Census tract. This restriction weakens the correlation between pickups and income, but the relationship between pickups and the size of the black population becomes very strong. 

Regression output:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7.188e+03  8.790e+02   8.177 1.53e-14 ***
median_income      1.510e-03  6.055e-03   0.249 0.803301    
population        -9.184e-02  6.859e-02  -1.339 0.181815    
black_percent     -3.937e+03  1.142e+03  -3.447 0.000667 ***
distance_in_miles -9.869e+02  1.194e+02  -8.266 8.56e-15 ***

Before concluding that this is damning evidence against cabs, it’s important to note that accounting for Manhattan’s geography in this model is a trickier statistical problem than just adding it as a variable to a regression. Because many of the non-white areas are far from Times Square, there’s a strong correlation between distance from Times Square and the racial makeup of the Census tract. This creates collinearity problems that undermine the results above.

Though more complex methods can correct this, I chose to make a simple approximation. There’s little correlation between race and distance among tracts between 3 and 4 miles from Times Square. Limiting the analysis to this set of tracts, we can apply the model as before. In this band, only income, and not racial makeup, appears to matter.

# This model considers the number of pickups between 9:00 PM and 6:00 AM by Census tract, but only includes Census tracts between 3 and 4 miles from Times Square. This limited dataset shows no correlation between race and pickups. The model only finds a significant relationship between pickups and tract income.

Regression output:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        8.294e+02  1.821e+03   0.456    0.652    
median_income      2.756e-02  3.675e-03   7.501 8.73e-09 ***
population         5.658e-02  5.119e-02   1.105    0.277    
black_percent     -3.785e+02  6.655e+02  -0.569    0.573    
distance_in_miles -4.579e+02  5.124e+02  -0.894    0.378

Considering Dropoffs

Dropoff patterns and their relationships to pickup locations could provide an additional angle of exploration. Dropoffs, as the chart below shows, are also heavily biased towards white areas.

Though this supports the argument that minority neighborhoods are underserved, it says little about racial motivations. On the one hand, if cab drivers were discriminating against minorities, we’d expect taxis to be taking fewer passengers to minority neighborhoods and therefore have fewer dropoffs in those areas. On the other hand, the low dropoff rate could be caused by the same reasons as the low pickup rate: There’s little demand to travel to and from these neighborhoods.

Dropoff data combined with subsequent pickup data could show something far more interesting. Because passengers can’t pre-arrange for a New York yellow cab to pick them up, most trips come from street hails. If cabs wanted to avoid minority fares, they would likely move away from areas where they would get hailed by minorities once they dropped off a passenger, especially if that fare was dropped off in a minority neighborhood.

Nevertheless, this approach suffers from the same flaw as the chart above—non-discriminatory cabs would exhibit this same behavior if demand were higher in white neighborhoods.

A different angle can provide some conclusions. Cabs do pick up passengers in minority neighborhoods. If these pickups were preceded exclusively by dropoffs in the same neighborhoods, that would suggest cabs only traveled to these areas when a passenger requested they do so (the reasons for their reluctance would still be unclear). However, if pickups in minority neighborhoods followed dropoffs in white neighborhoods (in other words, if drivers voluntarily traveled to non-white areas), that could provide a piece of evidence in support of non-discriminatory practices.

As the chart below shows, pickups from minority neighborhoods are typically preceded by dropoffs in white neighborhoods, suggesting pickups in minority neighborhoods aren’t dependent on dropoffs in minority neighborhoods.

Digging Deeper

This ultimately provides one strong conclusion and one mixed one. First, by nearly every measure, cab drivers have a strong preference for white neighborhoods and non-white neighborhoods are severely underserved. This is undeniable.

Drivers’ motivations, however, aren’t clear. There isn’t a lot of evidence for geographically opportunistic discrimination: Cabs in white neighborhoods aren’t much more likely to stay there than cabs elsewhere. Still, the degree to which minority neighborhoods are underserved and the models above, despite their flaws, do raise questions. But the evidence supports several plausible explanations for these disparities in service.

Regardless of the results of this analysis, racism exists, and the experiences described by those above—and many others—are real and can’t be discounted. No matter what this smoothed, stylized, and aggregated perspective describes, we also have to acknowledge the view from the street.

Furthermore, these arguments are only a very cursory first step into the data. There are numerous problems with these conclusions, and many questions that could be explored further. Just to name a few:

Data quality is a concern. This is particularly true for trip sequencing because some trips appear to begin before the previous trip ended.
Because GPS coordinates were rounded to the nearest 100 meters, Census tract matching is imprecise.
Demographic figures for Census tracts represent residential populations. The demographics of commuters, tourists, and others traveling to and from different tracts are unknown.
Income is an imperfect measure of taxi demand. After incomes reach a certain point, residents may begin to favor other forms of transportation, like their own cars or limo services.
Other unmeasured factors can affect demand as well. For example, older populations may be more likely to take cabs, while residents in tracts with a high concentration of restaurants and bars may prefer to stay in their neighborhoods and demand fewer cabs. Demand in these neighborhoods could represent that of outsiders as much as that of residents. Moreover, taxi demand could be endogenous to discrimination—minority populations may hail fewer cabs because they’re concerned about racism.
Census tracts are not homogeneous, but by identifying them as largely white or non-white, we’re flattening each tract into a single category. It’s possible that many passengers from a non-white tract are white, or many passengers from a white district are non-white.
Similarly, non-white districts are homogenized into a single group. A more detailed inspection of demographics could yield different results. As noted above, there seems to be evidence that black passengers face worse treatment. This analysis could be greatly expanded, or applied to other groups.
The data only includes one form of travel. Different parts of the city may be better served by other methods of transportation (subways, buses, Green cabs); travel choices may be more affected by these options rather than the biases of cab drivers.
It’s possible that these conclusions change during different times of the day. The type of passenger that rides a cab at 9:00 AM on a Monday morning is likely different than the type of passenger looking for a cab at 11:00 PM on Friday night. These potential differences are largely unexplored.

All these questions and caveats could be investigated further. To that end, all of my analysis above can be easily accessed via in-text links or links below the graphs. Additional details are provided below. The links to Mode provide access to both the raw data and the underlying analysis. For anyone interested in exploring these ideas further—or digging into an entirely new issue with the same data—I invite you to copy and extend my work however you see fit.

Data

Taxi data was provided by Chris Whong and Andrés Monroy. June data is available as trip_data_6. I trimmed this dataset down to one week in June using Python. In the process, I also added a by-driver trip counter in order to identify trip order. Data on the New York Census tracts was provided by the Census. This data was matched to GPS coordinates using another Python script. Finally, simple regression models were constructed in R. The models presented above, as well as several intermediate steps, can be found in GitHub.
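The by-driver trip counter mentioned above is what makes it possible to pair each pickup with the same driver's previous dropoff. A minimal sketch of that idea follows. The record layout, field names, and sample trips are hypothetical and are not taken from the actual scripts.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical trips: (driver_id, pickup_time, pickup_tract, dropoff_tract)
trips = [
    ("driver_b", "2013-06-10 08:05", "T3", "T1"),
    ("driver_a", "2013-06-10 09:30", "T2", "T4"),
    ("driver_a", "2013-06-10 08:00", "T1", "T2"),
    ("driver_a", "2013-06-10 10:15", "T5", "T1"),
]

def add_trip_counter(trips):
    """Number each driver's trips in time order and attach the prior dropoff."""
    # Sort by driver, then by pickup time (these timestamps sort correctly as text).
    ordered = sorted(trips, key=itemgetter(0, 1))
    numbered = []
    for _driver, driver_trips in groupby(ordered, key=itemgetter(0)):
        previous_dropoff = None  # a driver's first trip has no prior dropoff
        for counter, trip in enumerate(driver_trips, start=1):
            driver, pickup_time, pickup_tract, dropoff_tract = trip
            numbered.append({
                "driver_id": driver,
                "trip_number": counter,
                "pickup_time": pickup_time,
                "pickup_tract": pickup_tract,
                "previous_dropoff_tract": previous_dropoff,
            })
            previous_dropoff = dropoff_tract
    return numbered

numbered_trips = add_trip_counter(trips)
for row in numbered_trips:
    print(row)
```

Rows where pickup_tract differs from previous_dropoff_tract are the cases where the driver moved between fares, which is the comparison the dropoff analysis relies on.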

Benn Stancil is the chief analyst of Mode.

Where Americans Think They Live

2014-05-05T10:50:47-07:00

Last week, FiveThirtyEight’s Walt Hickey wrote a couple of interesting articles about which states are in the Midwest and South. Being from North Carolina—where people definitely consider themselves Southern—I was surprised to see that only two-thirds of all respondents to the FiveThirtyEight survey said North Carolina was in the South.

This made me wonder: How do people from different states define the South and Midwest? And specifically, how do the views of people who live in a state differ from those of people who don’t live there?

In keeping with their commitment to release an article’s underlying data, FiveThirtyEight published the full survey results on GitHub. Opening this data enables further exploration, as in the interactive graphic below. I looked at how every state in the U.S. views the South and Midwest, and how local opinions compare to national views.

The data reveals many additional findings:

People tend to think the South and Midwest are close to them. Western states think the regions extend further west than states on the East Coast.
It’s not clear if this is because people perceive the regions this way, or if it’s because they forget about states that are farther away (though this seems unlikely, as it appears respondents were provided a list of states to choose from).
I’m not sure where people from Wyoming think Alabama is, or where people from Delaware think most U.S. states are (maybe they, um, I believe, and such as, don’t have maps.)
Michigan and Oklahoma are the most geographically contentious states.
My intuition about North Carolina was right—95% of North Carolinians see the state as being in the South.
Despite people across the country having a mixed view of Arkansas, Arkansans very clearly want to be in the South.
By contrast, West Virginians do not agree with their state’s Southern reputation.

Explore the data

The graphic takes a moment to load.

Benn Stancil is the chief analyst of Mode.

Are the Playoffs Taking Forever?

2014-04-29T17:33:56-07:00

We’re over ten days into this year’s NBA playoffs, and several of the opening series have made it all the way to…Game 5. The NHL playoffs, which are entering their third week but only the second round, are also in no apparent hurry.

So far, this year’s exciting playoffs are keeping critics quiet. But if tight overtime games and buzzer beaters give way to blowouts and snoozers, the annual complaints about the length of the NBA and NHL playoffs will probably resurface.

So I decided to take a detailed look. Sure, the playoffs feel long, but are they really?

Measuring playoff length isn’t actually a straightforward question. Because each league structures its playoffs differently, direct comparisons aren’t always appropriate. However, by attempting to standardize the pace and length across leagues—and by collecting data on every regular season and playoff game in the MLB, NBA, NFL, and NHL since 1970—we can draw some conclusions. The result? The problem might not be with the playoffs, but with us.

How Long Do They Take?

One obvious way to measure the length of the playoffs is by figuring out how long they take from beginning to end. The chart below shows the length of each league’s playoff since 1970. The NBA and NHL playoffs now last close to 60 days, which is considerably longer than both the MLB and NFL playoffs, and the NBA and NHL playoffs of previous decades.

Sixty days feels like a long time, but is it really? When compared to the relative length of the regular season, the NBA playoffs are again the longest. In recent years, the NBA playoffs have stretched to over a third of the length of the regular season, which is slightly longer than the NHL playoffs and nearly twice that of the MLB. It also happens to be twice as long as March Madness, which is about 15% as long as college basketball’s 130-day regular season.

Put another way, the 60 days of the NBA playoffs take up more of the year than Mondays.

The Frequency of Play

The second way to measure the playoffs is by pace. As many have said, it’s not the length of the playoffs that’s the problem; it’s the downtime. The NBA playoffs are riddled with off days, which means a seven-game series often drags on for over two weeks.

The chart below shows the average pace of playoff games for all the teams that went on to win a championship in the last 40 years. Somewhat surprisingly, the rate at which teams play games hasn’t changed significantly in any sport.

A few notes about this graph: First, delays from the 1989 Loma Prieta earthquake caused the MLB spike. Second, the NFL occasionally eliminated the week off before the Super Bowl, creating the downward spikes in the NFL playoffs. Finally, as a minor technical detail, pace isn’t quite defined here as the number of days in the playoffs divided by the number of games. Instead, it’s the number of days minus one, divided by the number of games minus one. To understand why, imagine that an NFL team played two games on consecutive Sundays. The pace of those games is clearly one game every seven days, not two games in eight days, or one game every four days.
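The pace definition above can be written out directly. This is a small sketch rather than the code behind the chart; the function name and the sample dates (two consecutive Sundays in January 2014) are assumptions chosen to mirror the NFL example.

```python
from datetime import date

def pace_in_days_per_game(game_dates):
    """Days between games: (days spanned - 1) / (games - 1).

    Days are counted inclusively, so two games a week apart span eight days.
    """
    if len(game_dates) < 2:
        raise ValueError("pace needs at least two games")
    days = (max(game_dates) - min(game_dates)).days + 1
    games = len(game_dates)
    return (days - 1) / (games - 1)

# Two games on consecutive Sundays
two_sundays = [date(2014, 1, 5), date(2014, 1, 12)]
pace = pace_in_days_per_game(two_sundays)

# The naive version: days spanned divided by games played
naive = ((two_sundays[1] - two_sundays[0]).days + 1) / len(two_sundays)
print(pace, naive)
```

Subtracting one from both counts measures the gaps between games rather than the games themselves, which is what "one game every seven days" describes.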

A more interesting measure of pace, however, is not how pace has changed from season to season—it’s how pace changes from the regular season to the postseason.

The chart below, inspired by XKCD’s popular “Frequency” comic, illustrates the relative pace of play. Each square flashes according to how often teams in that league play games. Hover over squares to see the associated value.

As the graphic shows, the pace of games for all sports slows down. However, despite the perception of a dramatic slowdown in the NBA, the NBA’s pace only slows down by 25%. That’s slightly less of a slowdown than that of the NFL and far less than the MLB’s, which slows its pace by nearly 50%. Even in absolute terms, the NBA and NHL playoffs actually slow less than the MLB and NFL, suggesting that pace may be more of a perceived problem than a real one.

The Number of Games

Even if two months isn’t “too long” and the gaps between games don’t grow as much as they seem to, the sheer volume of games might be another reason why the NBA and NHL playoffs feel like they take forever.

Since 2003, the NBA and NHL playoffs have each averaged about 85 games a year. For both leagues, that’s more games than a single team plays in a regular season. The NFL playoffs amount to about three-fourths of a team’s regular season, while the MLB playoffs amount to only 20 percent.

Because professional sports’ playoffs are almost always on national TV and heavily covered by the sports media, it’s possible that the raw number of games in the NHL and NBA playoffs wears on fans. This effect could be amplified by the relative lack of interest in the regular season compared to the playoffs. Unlike baseball or football, basketball and hockey draw far more attention during the playoffs than during the regular season, which likely means a higher percentage of playoff viewers are marginal fans who might grow impatient during the early rounds of the playoffs. Of course, it’s also possible the relationship goes the other way, and interest in the regular season is low because fans know they have an entire season of meaningful games to watch during the playoffs. But in either case, it seems plausible that complaints about the playoffs don’t reflect a change from the sport, but a change in fanbases.

The NBA and NHL playoffs could also feel particularly long because there are so many teams in these playoffs, and so much “noise” is required to find a champion. Over the last ten years, NBA and NHL champions have averaged playing 23 games a year, or 27% of the total number of playoff games. Super Bowl champions participate in an average of 33% of NFL playoff games, while World Series winners play in 45% of baseball playoff games. For casual fans (which the NBA and NHL playoffs appear to attract), the early rounds may feel like a long and unnecessary preamble to the NBA and Stanley Cup Finals.

So in the end, it’s likely not the pace of play that makes the playoffs feel sluggish. Instead, it’s likely the volume of playoffs we have to get through that’s the problem—and the fact that many of us complaining may not be committed fans in the first place.

Data

MLB game data was collected from Retrosheet. NBA, NFL, and NHL data was collected from Sports Reference. Charts were created using D3. All the analysis, complete datasets, and visualization code can be found in Mode, and is free for anyone to use, copy, and modify.

Benn Stancil is the chief analyst of Mode.

Finding the Most Gerrymandered Districts

2014-04-16T11:35:44-07:00

Yesterday, I came across an interesting Vox.com article discussing Congressional gerrymandering. In one of the article’s cards, author Andrew Prokop highlighted several of the country’s most gerrymandered districts. Having recently crunched some numbers on geographic data, I wondered: why not try to quantitatively define the most gerrymandered districts and states?

Defining Gerrymandering

As Prokop noted, there’s not a great way to determine if a district is gerrymandered. Nevertheless, researchers have proposed a few ideas to approximate it. The proposals largely measure gerrymandering in one of two ways: by calculating how far various points on the district’s boundary are from the district’s geographic center, or by comparing the perimeter of the district to that of a similar-sized district with a regular shape (in this case, a circle). Both calculations are far from perfect (the first doesn’t work for noncontiguous districts, while the second is affected by any irregular boundaries, including coastlines and state borders), but they give decent estimates.

Because the effect created by irregular borders is both smaller and a bit easier to adjust for, I chose to rank districts by the ratio of the perimeter of the district to the circumference of a circle of equal area. The larger that ratio, the more gerrymandered it is.
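The ratio described above is straightforward to compute once a district's perimeter and area are known. A minimal sketch follows; the function name and the sample shapes are assumptions for illustration. A circle with area A has circumference 2·sqrt(π·A), so a perfectly circular district scores exactly 1 and anything less compact scores higher.

```python
import math

def gerrymander_score(perimeter, area):
    """Ratio of a district's perimeter to the circumference of an equal-area circle."""
    equal_area_circumference = 2 * math.sqrt(math.pi * area)
    return perimeter / equal_area_circumference

# Illustrative shapes in arbitrary units (not real districts)
circle = gerrymander_score(perimeter=2 * math.pi, area=math.pi)  # radius 1
square = gerrymander_score(perimeter=4, area=1)                  # 1 x 1
sliver = gerrymander_score(perimeter=202, area=100)              # 1 x 100
print(circle, square, sliver)
```

Because perimeter and area enter the ratio in matching units, the score does not depend on a district's overall size, only on its shape.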

Importantly, this ignores any political definitions for gerrymandering and assumes the best drawn districts are all circles. This assumption clearly doesn’t work geometrically or politically. A truer measure of gerrymandering would compare actual lines to those that make sense given the area’s geography and population distribution, but that adds significantly more complexity to the problem.

Ranking the Districts

After calculating the area and perimeter of each district as well as the circumference of a circle with the same area (discussed in more detail below), districts can be ranked by how severely they’re gerrymandered. The list below shows the top 10 most gerrymandered districts in the United States.

The winner, North Carolina’s 12th District, is hardly a surprise. (According to Wikipedia, “It is an example of gerrymandering.”) Maryland’s 3rd and Florida’s 5th are also examples of some “creative” geographic interpretations. One unexpected addition to the list is Hawaii’s 2nd. In this case, Hawaii’s geography appears to be confusing the ranking, which misinterprets the district’s several islands as severely gerrymandered boundaries. It’s worth noting, however, that this method still ranks three other contiguous districts as having been gerrymandered worse than a district broken into several pieces.

For those interested in the full ranking of all 435 districts, it is available here.

Ranking the States #

Four of the top ten most gerrymandered districts are in North Carolina. This certainly makes North Carolina the frontrunner for the most gerrymandered state, but is it actually the worst?

It turns out that’s not a straightforward question to answer. You can’t simply sum each district’s gerrymandering scores because states with more districts tend to have much higher scores. Furthermore, taking the average of all the districts in each state doesn’t work either, because states with irregular borders or coastlines (like Alaska, Hawaii, Louisiana, and North Carolina) will appear more gerrymandered than they actually are.

To attempt to correct for these issues, I adjusted each state’s average score by how gerrymandered the state would appear if it were a single district. The more irregular the state border, the less penalized that state is for having irregular districts.
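The post doesn't spell out the exact arithmetic of this adjustment, so the sketch below is only one plausible version. It assumes the adjustment is a simple division: the state's average district score over the score the state's own outline would get as a single district. The function name and all numbers are hypothetical.

```python
from statistics import mean

def adjusted_state_score(district_scores, state_outline_score):
    """Average district score, scaled down by the state's own irregularity.

    ASSUMPTION: the adjustment is a division. The original analysis may
    have used a different correction.
    """
    return mean(district_scores) / state_outline_score

# Two hypothetical states with identical district scores:
# a jagged coastal outline (2.0) and a boxy outline (1.2).
coastal = adjusted_state_score([2.5, 3.5, 3.0], 2.0)
boxy = adjusted_state_score([2.5, 3.5, 3.0], 1.2)

print(coastal, boxy)
```

With the same districts, the state with the more irregular outline ends up with the lower adjusted score, which matches the intent described above.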

As the chart below shows, North Carolina is in fact the worst offender. In addition to ranking states, the chart also colors each state according to who controls the process of drawing district lines. States with split executive and legislative branches or with split legislative houses are categorized as having split control. States that only have one Congressional district are grouped in with those with independent commissions.

While the chart above implicates both parties, it does suggest that independent commissions, unsurprisingly, help reduce gerrymandering. Half of the ten least gerrymandered states (excluding those with one district) employ independent commissions, while only two of the ten most gerrymandered states do so. (Additionally, many of the less gerrymandered states have fewer districts. This makes sense because fewer districts provide fewer opportunities for gerrymandering.)

Importantly, this isn’t the only way to rank states—there are other (likely better) ways to measure district gerrymandering, other ways to aggregate it, and additional political and population data that might be useful. If anyone would like to try a different method of analysis or add additional data, I’ll be happy to provide full access to my data and work in Mode.

Data #

To rank Congressional districts, I needed to figure out two things: the area and perimeter of each district. Fortunately, D3, the JavaScript visualization library, can do this pretty easily. Using several of D3’s geographic functions and GeoJSON data on Congressional districts, I was able to calculate each district’s (as well as each state’s) area and perimeter. That raw data is available here, as is the script that generated it (it’s in an .html file because I’m a hack). Note that the area data and perimeter data are in different units. Ratios between the two can be compared, but you can’t convert one to the other. Data on who controls the redistricting process was provided by Justin Levitt.

Benn is the chief analyst of Mode.

Plotting the Rest of the Baseball Season

2014-04-11T09:19:51-07:00

We’re less than two weeks into the 2014 baseball season, and most people would say that it’s too early to make any forecasts about the rest of the year.

Still, as others have noted, though ten games only represents 6% of an MLB season, surely these early games provide some indication of how a team will finish. Does Milwaukee’s 7-2 start mean that they may not be the sub-.500 team they were predicted to be? Are the 4-8 Diamondbacks likely to be even worse than expected?

The graphic below explores this question. It plots the full season for every team over the last 10 years, or 300 seasons in total. By filtering by record, you can see how teams with similar starts fared over the rest of the season, and how this compares to an average season. (Because the graphic is loading nearly 50,000 games, it takes a moment to first display.)

Note that for records with fewer than 5 teams, the graphic expands the win filter to include at least 5 teams. For example, because only one team started 6-0, the graphic shows teams that started both 6-0 and 5-1.
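One way such a widening filter could work is sketched below in Python. This is not the graphic's actual code: the function name is hypothetical, and the rule of widening by one win in each direction per step is an assumption.

```python
from collections import Counter

def expand_win_filter(wins, games, starts, min_teams=5):
    """Return a (low, high) win range covering at least `min_teams` teams.

    `starts` lists each team-season's win count after `games` games.
    ASSUMPTION: the range widens by one win in each direction per step.
    """
    counts = Counter(starts)
    low = high = wins

    def teams_in_range():
        return sum(counts[w] for w in range(low, high + 1))

    # Stop once enough teams match or the range covers every record.
    while teams_in_range() < min_teams and (low > 0 or high < games):
        low = max(low - 1, 0)
        high = min(high + 1, games)
    return low, high

# Hypothetical sample: one 6-0 start, six 5-1 starts, ten 4-2 starts.
starts = [6] + [5] * 6 + [4] * 10
print(expand_win_filter(6, 6, starts))  # 6-0 alone is too rare
print(expand_win_filter(5, 6, starts))  # 5-1 already has enough teams
```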

As the graphic shows, there’s generally a lot of noise early in the season. A small bit of history, however, is on the Brewers’ side. Based on the 15 teams that also started 7-2, the Brewers have around an 80% chance to finish above .500. In Arizona, things don’t look too desperate yet: Though teams that start 4-8 average 6 fewer wins than the overall mean, the distribution is still quite wide and many teams finish above .500.

Of course, forecasting (in the loosest sense of the word) future records based on teams’ current records is a very simple way to approach this problem. There are a number of other factors, such as runs scored, runs allowed, run differentials, results against other strong or weak teams, and results in home and away games, that could be indicative of future success.

As an example, run differentials could provide more information than just win totals. The table below shows the relationship between run differentials in wins and season win totals. As it shows, teams that win by more runs in their first 30 games tend to have better seasons. While that’s unsurprising, it suggests that wins alone are probably a crude metric, especially early in seasons when the sample is so small.

For anyone interested in exploring this data further, Retrosheet data is available in Mode for every season since 1980. This data can be analyzed, visualized, and shared directly through Mode, and I can provide access to anyone who is interested. If you’d like to modify or double-check any of my analysis above, you can click through the embed links to access the work and data directly.

Benn Stancil is the chief analyst of Mode.

FiveThirtyEight vs. The Oddsmakers

2014-03-27T21:37:03-07:00

You come at the king, you best not miss. - Omar Little

At the start of this year’s NCAA tournament, FiveThirtyEight, the new website of reigning forecast champion Nate Silver, predicted each team’s chances of making it to different rounds of the tournament. In an update yesterday, FiveThirtyEight looked into how their forecasts were doing. Having made my own predictive bracket based on Las Vegas odds, I figured I’d do the same—and see who comes out on top.

How Did FiveThirtyEight Do? #

Rather than simply forecasting winners, FiveThirtyEight’s predictions—like mine—calculate each team’s probability of winning every game. To assess how well these forecasts performed, it’s not appropriate to see how many of their “favorites” won. Instead, it’s better to see if favorites win more or less often than expected. In other words, if FiveThirtyEight identified 100 games in which the favorite had a 60% chance of winning, the favorite should actually win 60 of them. If the results are substantially different from that, then it’s an indication that something’s wrong with the model.
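This kind of calibration check can be sketched in a few lines of Python. The function name, the ten-point bucket width, and the sample games are all made up for illustration; this is not FiveThirtyEight's code or mine from this post.

```python
from collections import defaultdict

def calibration_table(predictions, bucket_pct=10):
    """Group games by the favorite's predicted win chance.

    `predictions` holds (favorite_win_probability, favorite_won) pairs.
    Returns rows of (bucket_floor_pct, games, mean_predicted, actual_rate).
    """
    buckets = defaultdict(list)
    top_bucket = 100 // bucket_pct - 1
    for prob, won in predictions:
        # Round to whole percentage points to avoid float edge cases,
        # and put a 100% prediction in the top bucket.
        index = min(round(prob * 100) // bucket_pct, top_bucket)
        buckets[index].append((prob, won))

    rows = []
    for index in sorted(buckets):
        games = buckets[index]
        mean_predicted = sum(p for p, _ in games) / len(games)
        actual_rate = sum(1 for _, won in games if won) / len(games)
        rows.append((index * bucket_pct, len(games), mean_predicted, actual_rate))
    return rows

# Hypothetical games: (predicted chance for the favorite, did it win?)
sample = [(0.62, True), (0.65, True), (0.68, False), (0.91, True)]
table = calibration_table(sample)
for row in table:
    print(row)
```

A well-calibrated model has an actual rate close to the mean predicted chance in every bucket, given enough games per bucket.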

Over the last several years, Silver’s predictions have performed well. The chart below—reproduced using data provided by FiveThirtyEight—compares game results to FiveThirtyEight’s forecasts. As it shows, if you group games by the predicted odds of the favorite winning, the actual results are close to that range.

As Silver noted in his post, though the results in each bucket don’t precisely match the forecast, they fall reasonably close and well within his confidence intervals. Silver’s model, it appears, works reasonably well.

Silver vs. Vegas #

While FiveThirtyEight’s bracket is based on team rankings and a few other factors, I based my bracket solely on Las Vegas odds. Though the predictions are different, our brackets’ favorites are all the same, except for the Championship game. FiveThirtyEight gave a slight edge to Louisville over Florida, while Vegas preferred Florida by a slim margin.

Unfortunately, though Silver’s predictions go back to 2011, I only made forecasts for the tournaments last year and this year. To make the comparison equal, I first trimmed Silver’s data to include only 2013 and 2014 results. The chart below shows the same calculations as above for FiveThirtyEight (the buckets were made larger to adjust for the smaller sample size).

Unsurprisingly, with a smaller sample (especially one that includes the chaos of last year’s tournament), Silver’s model looks tarnished (sorry). Still, the trend is generally in the right direction.

Compared to my predictions using Vegas odds, however, Silver regains his luster. Vegas (or my method of interpreting Vegas) performs worse than FiveThirtyEight. As the chart shows—which overlays my model’s results with Silver’s—the model does a particularly poor job of identifying solid but not overwhelming favorites: Favorites only won half the games in which they were expected to win 70% to 80% of the time.

Why The Difference? #

Before conceding to Nate Silver’s sterling record and accepting that he is just better at this than me, it’s worth looking into why our predictions came out so differently. Fortunately, there’s a fairly clear explanation. Silver recalculates his forecasts as the tournament progresses, updating the predictions after each game. This has two effects. First, his calculations respond to positive and negative signals from previous games. For instance, Virginia’s blowout win against Memphis could improve their odds against Michigan State, or Iowa State’s loss of Georges Niang to injury could lower their chances against Connecticut. My calculations were based on Vegas odds at the beginning of the tournament, and were not responsive to new results.

Second—and more importantly—for all the games beyond the first round (when matchups were unknown), I computed game odds by comparing each team’s odds of making the Final Four. This is an imperfect calculation, most notably because those odds are based on a team’s entire path to the Final Four. For teams that face very challenging first games, their odds of making the Final Four are quite low. However, my model doesn’t make any adjustments for teams that overcome this first game.

Florida Gulf Coast’s Cinderella run last year is a perfect example of both of these cases. Not only did Florida Gulf Coast demonstrate that they were a better team than many thought by beating Georgetown and San Diego State by a total of 20 points, but they also cleared two tough hurdles between them and the Final Four. In part because of both of these factors, Silver’s model gave Florida Gulf Coast a 5.8% chance against Florida. My model—which was still based on Florida Gulf Coast’s original 1% chance of making the Final Four—only gave them a 0.2% chance against Florida.

This problem, however, can be partially corrected by only looking at first round games. Because these match-ups are known, forecasts are based on actual game lines rather than derived matchups.

As suspected, the differences in projections are less apparent in the first round. Charting the relative predictions for each game shows that, in the first round, the differences between models are more or less random, and clustered around zero. In later rounds—when I’m deriving probabilities—Vegas odds almost universally overestimate the favorite’s chances. This makes sense, given the Florida Gulf Coast example above.

As this suggests, looking at only games in the first round, the models fare similarly. The chart below shows the same buckets as before, but only includes first round games—and in this case, the model predictions are more closely aligned.

Based on this, when picking your bracket next year, it doesn’t really matter if you go with Vegas or Nate Silver in the first round. For later round games, Nate Silver has left me hiding behind a car as he whistles The Farmer in the Dell (WARNING: that link is a Wire spoiler). But this does raise an interesting question: If FiveThirtyEight predicted every matchup at the start of the tournament, how would their results look? And how do FiveThirtyEight’s forecasts compare to Vegas lines at the start of each game? In other words, if we level the playing field, should I bet on Nate Silver or the true kings of sports forecasting—the oddsmakers?

Data #

Data was collected from FiveThirtyEight and calculated using Vegas odds. All analysis, data, and visualization code can be found in Mode. The graphs are backed by Variance, an excellent new visualization library.

Benn Stancil is the chief analyst of Mode.

The Odds of Your NCAA Bracket

2014-03-18T10:55:04-07:00

This year, Warren Buffett promised a billion dollars to anyone who picks a perfect bracket. Unfortunately, the odds aren’t in your favor—the chance of picking a perfect bracket if you pick every game at random is one in 9 quintillion (or 9,000,000,000,000,000,000).
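The arithmetic behind that number is short enough to check directly. The Python below assumes a 64-team bracket (play-in games excluded), which means 63 games, each picked as a coin flip.

```python
# A 64-team single-elimination bracket has 63 games,
# because every game eliminates exactly one team.
GAMES = 63

# Two possible winners per game, so 2**63 distinct brackets.
total_brackets = 2 ** GAMES
random_pick_odds = 0.5 ** GAMES

print(f"{total_brackets:,}")  # 9,223,372,036,854,775,808
print(random_pick_odds)
```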

But that’s just a hypothetical bracket. What are the odds of your bracket? The interactive below lets you figure that out. Using the betting lines for each game and each team’s chances of making the Final Four and winning the NCAA Championship, the graphic calculates the odds of every possible NCAA matchup—and every possible NCAA bracket. The graphic also shows how each team affects your bracket’s odds, and which picks lower your chances of winning a billion dollars the most.

Click on the bracket to see the interactive

The odds for first round games are calculated using the betting lines for those games. In matchups between two teams after the first round, each game’s odds are calculated using the relative odds that the two teams make the Final Four (in the case of regional games) or the relative odds that the two teams win the championship (in the case of Final Four games).
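The exact formula behind "relative odds" isn't given here, so the Python below shows just one plausible reading: scale the two teams' chances so they sum to one. The function name and the sample probabilities are hypothetical, and this may not be the precise calculation the graphic uses.

```python
def matchup_probability(team_a_chance, team_b_chance):
    """Chance that team A beats team B, from each team's chance of
    reaching the same later milestone (Final Four or championship).

    ASSUMPTION: "relative odds" means normalizing the two chances
    so they sum to one.
    """
    return team_a_chance / (team_a_chance + team_b_chance)

# Hypothetical: A has a 30% chance of making the Final Four, B has 10%.
a_over_b = matchup_probability(0.30, 0.10)
b_over_a = matchup_probability(0.10, 0.30)

print(a_over_b, b_over_a)
```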

To figure out how much a team contributes to lowering your bracket’s odds, I compared the odds of the selected bracket to the odds of a bracket in which that team has a 100% chance of winning every game you picked them to win. The more this adjustment increased the odds of the bracket (that is, the more likely it made the bracket), the larger that team’s circle.
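That comparison can be sketched in Python as follows. The sketch treats games as independent, so a bracket's odds are the product of the odds of each pick, and it measures a team's contribution as the ratio between the adjusted and original bracket odds. Both choices are assumptions, and the team names and probabilities are invented.

```python
from math import prod

def bracket_probability(picks):
    """picks: list of (team, win_probability), one entry per picked game.

    ASSUMPTION: games are independent, so the probabilities multiply.
    """
    return prod(p for _, p in picks)

def team_contribution(picks, team):
    """How many times likelier the bracket becomes if `team` is certain
    to win every game it was picked to win."""
    adjusted = [(t, 1.0 if t == team else p) for t, p in picks]
    return bracket_probability(adjusted) / bracket_probability(picks)

# Hypothetical three-game slice of a bracket.
picks = [("Favorite", 0.9), ("Favorite", 0.8), ("Underdog", 0.25)]

base = bracket_probability(picks)
favorite_factor = team_contribution(picks, "Favorite")
underdog_factor = team_contribution(picks, "Underdog")

print(base, favorite_factor, underdog_factor)
```

In this example the single underdog pick hurts the bracket more than both favorite picks combined, so it would get the larger circle.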

The data that powers the bracket, as well as the code for the visualization, are in this GitHub folder.

Benn Stancil is the chief analyst of Mode.

Engineering a Best Picture

2014-02-27T10:00:41-08:00

When Netflix wanted to create a hit TV show, it turned to data. By analyzing its viewers’ habits, Netflix uncovered that its customers particularly liked Kevin Spacey, director David Fincher, and political thrillers. In part because of these interests, Netflix brought the three together to create House of Cards—and thus far, the results have been tremendous.

Having binged our way through Season 2 of House of Cards, we in the entertainment world now turn our attention to the Oscars, and particularly, the race for Best Picture. In doing so, perhaps we could take a page from Netflix’s book. Perhaps, using data about movies and the relationships between them, we can identify a perfect cocktail of movie attributes—PG-13-rated biopics about celebrities, or heart-wrenching World War II stories directed by Steven Spielberg, or anything related to Michael Bay—that strikes every Best Picture nerve. Perhaps, just like a hit TV series, a Best Picture can be engineered.

Using data collected from Rovi, I explored attributes that define Best Pictures. In addition to finding characteristics that frequently appear in nominees for Best Picture, I also looked for what made candidates for Best Picture stand out from other Oscar nominees. For example, it’s well known that most Best Picture nominees are dramas, but hundreds of dramas are released every year. Maybe there are rare, niche genres that only produce a few films a year—but always get noticed by the Academy.

Furthermore, concluding “we should make a drama” isn’t very instructive. Why not sketch out the entire movie, complete with a plot, themes and tones, and a cast and crew?

The following does exactly that. I first define the rough outline and plot of the movie, then cast it and pick a crew. I then build a model to blend together movie titles and synopses from Oscar-nominated films. The result is the ultimate Frankensteinian Oscar-bait—and like Frankenstein, it could be a triumph or an abomination.

Sorting out the Basics #

When engineering a Best Picture, a few elements are essential. First, popular opinion about dramas is correct—dramas do much better than other genres. Fifty percent of movies nominated for any Oscar are dramas, compared to 75 percent for Best Picture nominees. However, there’s a stronger bias for crime dramas and biographical films, suggesting that some specialization could be beneficial.

On the other end of the spectrum, comedy, science fiction, action, and horror movies are all under-represented in Best Pictures. Out of the roughly 550 movies of these types nominated for Oscars, fewer than 30 were nominated for Best Picture.

Second, the film shouldn’t be a sequel. Though sequels do well at the box office (for example, Transformers 2 and 3; The Matrix 2 and 3; Spider-Man 2 and 3; Batman 2 and 3; Pirates of the Caribbean 2, 3, and 4; Twilight 2, 3, 4, and 5; and Harry Potter 2, 3, 4, 5, 6, 7, and 8), they rarely impress Oscar voters. Only three of over 80 sequels nominated for Oscars were nominated for Best Picture.

While the Academy shuns sequels for Best Pictures, it actually favors adapted writing over original writing. Over 50 percent of the nominees for Best Adapted Screenplay (or a past variant of the award) were also nominated for Best Picture, but less than 40 percent of Best Original Screenplays garnered Best Picture nominations.

Constructing a Plot #

This gives us the basic outline of a film—it should be a well-written, adapted biopic about crime. But other studies have already confirmed these discoveries. Surely we can be more specific. Fortunately, Rovi provides detailed classifications of films’ tones and moods, and identifies specific characteristics and keywords related to their plots. Using these attributes, I developed a more well-defined Best Picture.

The table shows the themes and plot elements most over-represented in Best Picture nominees (it excludes characteristics that only appear in very few films). Broadly, Best Picture nominees should take on sweeping themes and be bittersweet and compassionate. They should never be goofy, silly, or campy.

Though the tone of the films is best in a minor key, they should end triumphantly. Light, just-for-fun adrenaline rushes struggle (apologies to Michael Bay).

Regarding the specifics of the plot, movies about cross-cultural relations and forbidden loves are strong performers. The best cultural differences to explore are those that emerge from economic inequality—stories about class differences and servants and employees are among the Academy’s favorites. Social injustices, particularly those that address racism and mental illness, are well-received, though movies that touch on injustices done to Native Americans may be ignored. Other important characteristics to avoid are kidnapping, self-referential movies about filmmaking, pregnancy, and evil aliens (apologies to Michael Bay).

Though these plotlines are solid bets, they’re all also well-trodden—what if we want to try something a little edgier? Though they’re infrequently made (and excluded from the table above), films about wheelchairs and farm life have played well to Best Picture voters. A movie about the struggles of a wheelchair-ridden farmer, perhaps?

Finally, the movie should present these difficult and complex themes in depth and without censorship. Oscar-nominated films are an average of 114 minutes long, Best Picture nominees are an average of 130 minutes long, and Best Picture winners are 142 minutes long. Moreover, the film should be R-rated: 40 percent of Oscar-nominated films are R-rated, compared to 50 percent of Best Picture nominees.

Casting a Best Picture #

Bad acting can ruin a film. Can good acting make a movie a Best Picture?

Though it doesn’t appear critical, casting good actors certainly helps: About 60 percent of films nominated for best actor or best supporting actor were nominated for Best Picture. Sadly, the gendered “actor” isn’t incidental—only around 45 percent of films recognized for outstanding performances by actresses are nominated for Best Picture.

When filling the roles, we may be inclined to turn to the greats like Meryl Streep, Tom Hanks, and Jack Nicholson. These three—along with Harrison Ford, Dustin Hoffman, Robert De Niro and Leonardo DiCaprio—have been in a number of films nominated for Best Picture, but they’ve also been in many great films that weren’t nominated for Best Picture. When casting the film—especially if it’s on a budget—we want actors and actresses that collect Best Picture nominations as efficiently as possible.

Tragically, the undisputed champion of the Best Picture nod—John Cazale—died 35 years ago. Remarkably, Cazale appeared in five films in his career, and all five were nominated for Best Picture. Outside of Cazale, Daniel Day-Lewis and Al Pacino are the best male leads. For female leads, Ellen Page and Jessica Chastain are excellent choices. Despite their commercial success, Sean Connery, Ewan McGregor, Eddie Murphy, Colin Farrell, and Susan Sarandon have all had little success with Best Picture nominations and are all stars to avoid.

To round out the supporting cast, Billy Boyd—a poor man’s John Cazale—is an obvious first choice. All four of the Oscar-nominated movies in which Boyd has had parts (all three Lord of the Rings films and Master and Commander) were nominated for Best Picture. After Boyd, Shane Rimmer and Peter Cellier are among the best men, while Miranda Otto and Talia Shire stand out among the women.

(As an aside, this film has run into a minor issue at this point. The Academy indirectly told us that we should focus on issues about race. However, the Oscars have historically favored white men. Maybe Oscar-nominee The Last Samurai found the solution to this dilemma—make the hero of the oppressed race…Tom Cruise.)

Finding a Worthy Crew #

A quality crew is perhaps even more important than the cast. Seventy-five percent of films nominated for Best Director were also nominated for Best Picture, which is the strongest overlap Best Picture nominees have with any other Oscar category.

To lead the crew, Martin Scorsese is a clear choice for the director. He has more Best Picture nominations than any other director, and has collected them with amazing efficiency. Beyond Scorsese, Norman Jewison, Ang Lee, and James Brooks would all be strong choices. On the other side of the coin, Tim Burton and Michael Bay have been very successful at having their films nominated for Oscars, but not as Best Pictures (apologies to Michael Bay).

The “Moneyball” picks for the crew are editor Thelma Schoonmaker and cinematographer Robert Richardson. The two have been involved in a total of 27 Oscar-nominated films, 15 of which were nominated for Best Picture. Moreover, films nominated for best editing and best cinematography were also nominated for Best Picture at rates above 50 percent, suggesting these two may provide the most bang for the buck out of any cast or crew member. Notably, Schoonmaker typically works with Scorsese, so she may be riding his coattails—or he may be riding hers.

When looking for other crew members, we should focus on sound over visual elements. Forty-two percent of films nominated for Best Score received Best Picture nominations. (Interestingly, the rate was only 14 percent for films nominated for Best Song.) By contrast, the overlaps between Best Picture nominees and Best Costumes, Best Makeup, and Best Visual Effects nominees are among the weakest of any Oscar category, excluding those dedicated to particular genres like documentaries or shorts.

Finally, the Weinstein brothers easily top the list as the best candidates to bankroll the movie. Like Scorsese, they’ve been remarkably successful in getting films they produce nominated for Best Picture, and done so with impressive efficiency.

Bullet to the Dark Side #

Based on this outline, 12 Years a Slave appears to be the clearest Oscar-bait among this year’s Best Picture nominees. It’s a dramatic biopic; crime is central to the plot (though one of the chief crimes is kidnapping); it was nominated for Best Directing and Best Editing; and the arc of the plot—a hopeless social injustice, accented with cross-cultural relationships, that breaks the audience’s spirit before ending in poignant salvation—fits the mold perfectly.

But we can do better. The cast could be improved. At “only” 134 minutes, it could be longer. And imagine if Scorsese had directed it.

What film would be better? Using a model to blend titles and plot synopses of all the Oscar nominees over the last five decades, I generated several potential Best Picture-worthy titles and plot summaries. Among the randomly-generated plots and titles produced by the model, the two titles and six plots below best fit the themes and tones recommended from the analysis above (I paired the synopses with the titles they best matched). Some of the results could be clear winners; others, however, might need some of that Schoonmaker magic…

Suggested Title 1: Bullet to the Dark Side #

Possible Plot 1: A portrait of a newlywed couple who are reunited in the Afghan mountains.
Possible Plot 2: A ‘50s housewife and a disgraced cop team up to exact revenge upon her one-time lover.
Possible Plot 3: A crooked cop tries to obtain the ultimate Dalmatian coat.

Suggested Title 2: Hurt Me the Hidden World #

Possible Plot 1: A suicidal former Union soldier ends up joining a Sioux tribe. He then takes up arms to defend them when they become entangled with Russian mobsters in London.
Possible Plot 2: A farmer tries to woo a wealthy uncle, meets and falls for an agnostic Roman soldier during WWII.
Possible Plot 3: A rich playboy who escapes from prison to reunite their divorced dad poses as an eccentric teacher at an unconventional brothel.

So to all the aspiring writers and filmmakers in the world, you now know what to do. The path before you is clear. Six outstanding movies are practically written. The necessary themes and plot twists are known. All that’s left to do is assemble the right cast and crew, and collect the inevitable hardware.

Data #

I collected data via the Rovi Cloud Services API. While Rovi provides an impressive amount of data on each film, the dataset still has a few holes, most notably regarding the awards each film was nominated for (data on Best Picture nominees is complete). Additionally, Rovi provided no data on about 30 of the 2,900 films that were nominated for an Oscar over the last fifty years.

To determine the top attributes in a Best Picture, I found which attributes were most over-represented in Best Pictures relative to all Oscar nominees. Unless otherwise noted, when finding top themes and plot elements, I only considered those attributes that appeared in at least 10 of the nearly 3,000 Oscar-nominated movies. The list of Oscar winners was collected from the Academy Awards Database.
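One simple way to score over-representation is sketched below in Python. It assumes the measure is a ratio of shares: an attribute's share among Best Picture nominees divided by its share among all Oscar nominees. The function name and the counts are hypothetical, and the original analysis may have used a different statistic.

```python
def over_representation(all_counts, bp_counts, total_all, total_bp, min_films=10):
    """Rank attributes by (share among Best Picture nominees) divided by
    (share among all Oscar nominees).

    ASSUMPTION: a ratio of shares is the measure of over-representation.
    Attributes seen in fewer than `min_films` nominees are skipped.
    """
    scores = {}
    for attribute, n_all in all_counts.items():
        if n_all < min_films:
            continue
        share_all = n_all / total_all
        share_bp = bp_counts.get(attribute, 0) / total_bp
        scores[attribute] = share_bp / share_all
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical counts: 1,000 Oscar nominees, 100 of them Best Picture nominees.
all_counts = {"drama": 500, "goofy": 40, "farm life": 5}
bp_counts = {"drama": 75, "farm life": 3}

ranking = over_representation(all_counts, bp_counts, total_all=1000, total_bp=100)
print(ranking)
```

A score above 1 means the attribute shows up more often among Best Picture nominees than among Oscar nominees overall. In this example "farm life" is dropped by the ten-film minimum.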

While comparing Best Picture nominees to other Oscar nominees rather than all films introduces some bias (Oscar nominees aren’t a perfect sample of all movies), it has benefits as well. The dataset is restricted to movies that had some degree of critical or popular success, excludes made-for-TV movies, and largely focuses on American films (few foreign films are nominated for Best Picture).

The title and synopsis mash-ups were randomly generated using a Markov n-gram model trained on a dataset of all Oscar nominees. Because the set of Best Picture nominees is small, creating n-gram models using only Best Picture titles and synopses unfortunately isn’t possible.
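For anyone curious how a Markov n-gram generator works, here is a minimal word-level sketch in Python. It is not the model used for this post: the function names, the two-word context, and the three training synopses are all invented for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"

def train(texts, order=2):
    """Map each run of `order` words to the words seen right after it."""
    model = defaultdict(list)
    for text in texts:
        tokens = [START] * order + text.split() + [END]
        for i in range(len(tokens) - order):
            state = tuple(tokens[i:i + order])
            model[state].append(tokens[i + order])
    return model

def generate(model, order=2, max_words=30, rng=random):
    """Walk the chain from the start state until END or `max_words`."""
    state = (START,) * order
    words = []
    while len(words) < max_words:
        next_word = rng.choice(model[state])
        if next_word == END:
            break
        words.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(words)

# Made-up synopses standing in for the real training data.
synopses = [
    "A crooked cop tries to save a small farm.",
    "A retired farmer tries to save a crooked town.",
    "A crooked cop falls for a retired farmer.",
]
model = train(synopses)
sample_plot = generate(model, rng=random.Random(0))
print(sample_plot)
```

Because each next word depends only on the previous two, the output is locally grammatical but free to wander between source synopses, which is where the mash-ups come from. A larger `order` yields more coherent text but needs much more training data, which fits the point above about the Best Picture set being too small on its own.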

To anyone who is interested in this work and would like to explore other methods for characterizing Best Picture nominees, I’m happy to share all my analysis. The conclusions presented here are a simple start to figuring out what makes a Best Picture; like so many other analyses, it could be greatly strengthened by others’ data and others’ ideas.

Benn Stancil is the chief analyst at Mode.