Episode 34: How to Harness the Power & Beauty of a Box Plot - Featured Data Visualization by Eric William Lin

Welcome to episode 34 of Data Viz Today. When's the last time you saw a box plot? How about the last time you created one?! It's been a long time for me, but this week's featured data visualization by Eric William Lin has convinced me to reconsider using this often clinical chart type as a beautiful and powerful way to tell a story. In this episode, we'll hear how Eric built his Kantar IIB Shortlisted viz, plus a few suggestions for how and when you could try a box plot!

Listen on Apple Podcasts, Google Play, Google Podcasts, Stitcher, SoundCloud & Spotify.

  • Welcome! I'm Alli Torban.

  • 00:25 - Today’s episode is all about the classic chart type - the box plot! Also known as the box and whisker plot. We’ll talk about the visualization that inspired me to reconsider the beauty and functionality of a box plot, how it was built, plus a few suggestions for how and when YOU could try a box plot!

  • 01:02 - Today’s featured data viz is called Casting Shakespeare: How age, gender, and race affect casting by Eric William Lin

  • 01:10 - Eric is a musician-turned-software engineer based in New York City. He occasionally teaches classes in programming and has recently become obsessed with designing data visualizations, which has led to this featured visualization showing up on the shortlist for the Kantar Information is Beautiful Awards! Public voting is open until the 19th, so vote for this viz!

  • 01:40 - The spark that led to this shortlisted viz was actually from Shirley Wu - she was featured in episode 4 How to Find Answers in Survey Results. Last year, she gave a talk at a JavaScript meetup in Brooklyn about her beautiful visualization of all the words in Hamilton the musical for The Pudding. Turns out Eric was in the audience, and as a former high school theater kid and music major, he found the thought of visualizing theater a really exciting way to combine his two loves - coding and theater.

  • 02:15 - He began brainstorming about which plays to focus on and what would be an interesting angle, which can be a big struggle like we talked about in the last episode.

  • 02:25 - But the first piece of the puzzle for Eric was remembering that the New York Philharmonic had open-sourced their performance history data. While looking through the dataset, he decided to focus on Shakespeare plays, but with a twist - instead of focusing on the lines of text, he would focus on the characteristics of the actors who have performed in those plays over time, like their age, gender and race.

  • 02:55 - So he began gathering data for that, and said this turned out to be the most difficult part of the project. All that information was scattered around on different sites, in different formats, or not available at all. He had to scrape a lot of data from production websites using Python, and deduce some actors' ages by finding an old article that referenced their age and comparing it to the production date of the play.

  • 03:20 - But once he got everything that he needed, Eric was able to move on to the fun, creative part - visualizing the data. His first instinct was to create a 2-dimensional scatter plot. The x-axis would be the year of the production, and the y-axis would be the age of the actor at the time of production. He would then repeat this scatter plot for each character and present them as a series of small multiples. But he quickly realized that showing the actors like this would make it hard to visually tell a story or a narrative about interesting patterns in the data…

  • 04:00 - His breakthrough moment was realizing that he could frame the story around the actor’s perspective. What if instead of looking at each character one-at-a-time and looking at how they were cast historically year-by-year, he could ask: As an actor, what roles are available to me at my current age? What roles should I audition for, and what characters would directors cast me in based on past data?

  • 04:40 - This led him to the box plot - he could show the distribution of ages for each character side-by-side. Another innovative benefit: he could slowly reveal the box plot to tell the story of an aging actor - like now you’re 30 years old, so you’re probably not going to be cast as Romeo or Juliet because 75% of actors who played those roles were under 30.

  • 05:15 - The final visualization was built with JavaScript, D3.js, Aliza Aufrichtig’s Coordinator, and Susie Lu’s d3-annotation - you can hear more about that in episode 7 How to Annotate Like a Boss with Susie Lu!

  • 05:25 - Experience the viz here!

  • 07:40 - Eric showed us that box plots can be beautiful and aid in storytelling, but let’s get a quick refresher on what a box plot is, and then we’ll talk about some pros/cons, and some variations.

  • 07:53 - A box plot is a standardized way of showing the distribution of data. It gives you a quick way to see how your data points are spread out. If someone tells you only the median of a dataset, you don’t know whether most of the points are clumped around that value or spread out.

  • 08:13 - In a box plot, there’s a specific mark to show the median, the lower and upper quartiles, the upper and lower fences, and any outliers. Listen for a more detailed description of how to build one. Check out Nathan Yau’s extremely helpful blog post about how to read a box plot.

  • 09:55 - Pros: you can glean a lot of information about the distribution from just a few marks, and they don’t take up a lot of room. If you try to show distributions with histograms or density plots, it’s harder to put them all side by side and compare. But it’s easy to stack box plots into one chart and compare distributions among various groups.

  • 10:25 - Cons: The benefit of something like a histogram is that you can see more detail. The box plot uses summary statistics, so you don’t have any control over the granularity like you would with a histogram by varying the bin size. It also hides the sample size, so comparing groups with separate box plots could be a little misleading if the sample size varies widely between groups. You could annotate it, or I like what Eric did by actually showing the points as slightly transparent dots behind the box plot. The box plot is also less intuitive for some people, but you could mitigate that by doing what Eric did and showing a How to Read chart beforehand.

  • 11:25 - Check out box plot variations from Data Viz Catalogue!

  • 12:04 - Box plots in the wild:

    • New York Times - they showed projected career earnings for college graduates, and they had a box plot for each major and you could see the median and the spread of the projected earnings in dollars for each one.

    • FiveThirtyEight showed the median and spread of Yelp reviews for restaurants with different numbers of Michelin stars. I liked that they have an inset box that explains what the box plot shows.

  • 12:50 - Tools that make box plots: Tableau, RAWGraphs, Excel...

  • 13:20 - My final takeaway is that next time you’re visualizing the distribution of points and also want to compare distributions across many groups, consider using a box plot. It’s a clean way to show distributions, and you can experiment with different variations to show more detail, and even use it as a storytelling tool like Eric did! Just make sure your audience understands how to read it because it could’ve been a minute since they learned about box plots in math class.

  • 13:30 - Listen for Eric’s amazing advice to designers just starting out!

  • 14:55 - You can follow him on Twitter, and check out his website.

  • 15:05 - I'm sharing my essential Adobe Illustrator tips in my new course! Check if it's right for you HERE!
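
For anyone who wants to see the 08:13 refresher as numbers, here's a minimal Python sketch of the marks a box plot draws for one group. Everything in it is an assumption for illustration: the function name `box_plot_stats`, the made-up ages (not Eric's data), and the common 1.5 × IQR rule for the fences. Quartile conventions also differ between tools, so other software may give slightly different values.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Compute the marks of a Tukey-style box plot for one group.

    Hypothetical helper: assumes at least two values and uses the
    common 1.5 * IQR rule to place the fences.
    """
    data = sorted(values)

    # Lower quartile, median, and upper quartile (linear interpolation).
    q1, med, q3 = quantiles(data, n=4, method="inclusive")

    # The fences sit 1.5 interquartile ranges beyond the box edges.
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Whiskers usually stop at the most extreme points inside the fences;
    # anything beyond the fences is drawn as an individual outlier.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]

    return {
        "n": len(data),  # sample size, which the box itself hides
        "q1": q1,
        "median": med,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up ages of actors cast in one role (illustrative only).
sample_ages = [17, 19, 21, 22, 24, 25, 26, 28, 29, 31, 52]
stats = box_plot_stats(sample_ages)

for name, value in stats.items():
    print(f"{name}: {value}")
```

In this made-up sample the upper quartile lands below 30, which is the kind of reading behind a statement like "75% of actors who played this role were under 30." Returning `n` with the other marks is one simple way to keep the sample size in view when you compare groups, since the box alone doesn't show it.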


Allison Torban
lin, boxplot