Welcome to episode 35 of Data Viz Today. What should you do when you plot your data points and realize they're all on top of each other? I recently learned that this is called "overplotting," and in this episode I'll offer 3 techniques to help you handle it so you can get back to analyzing and visualizing!
Welcome! I'm Alli Torban.
00:30 - Today’s episode is about how to deal with overplotting. Overplotting is when a lot of the data in your chart overlaps. That makes it difficult to see how much data there is and where it’s most concentrated, which hinders both your analysis and your ability to convey your message visually.
01:15 - When I finally figured out that this was called overplotting, I was able to find a lot of great resources, specifically this article by Stephen Few with lots of ideas.
01:40 - So let’s talk a little more about what overplotting looks like and 3 solutions that you can test out next time you run up against this in your practice.
01:46 - Overplotting is pretty common in scatter plots and line charts when you have a large dataset and/or many points plotted on the same or similar values. It also happens when your x-axis plots a discrete variable (one with a finite number of possible categories), because a lot of points end up in the same place.
02:36 - There are a couple of solutions that you’d probably think of immediately: make the points or lines slightly transparent, or decrease their size. Here are three more techniques to try as well.
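For those two quick fixes, here is a minimal Python sketch. It assumes standard alpha compositing, where overlapping markers get darker as they stack, which is why transparency reveals density. The function names, the default `alpha` and `size` values, and the use of matplotlib are my assumptions for illustration, not something from the episode.

```python
def stacked_opacity(alpha, n):
    """Combined opacity where n markers with the same alpha overlap.

    Assumes standard 'over' compositing: each layer hides a fraction
    `alpha` of whatever is still showing through from below.
    """
    return 1 - (1 - alpha) ** n


def plot_with_transparency(xs, ys, alpha=0.2, size=8):
    """Scatter plot with small, semi-transparent markers (hypothetical helper)."""
    import matplotlib.pyplot as plt  # assumed to be installed; imported lazily

    fig, ax = plt.subplots()
    # alpha controls transparency, s controls marker area in points^2
    ax.scatter(xs, ys, alpha=alpha, s=size, edgecolors="none")
    return fig
```

With `alpha=0.2`, a single point is faint, but a pile of ten overlapping points is close to solid, so the darkest regions of the chart are the densest ones.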
03:00 - First, you can try aggregating the data. Maybe you don’t need to see every point or line, so consider whether showing something like an average or median would work for your goal. Similarly, you can filter your data in certain ways and create a series of small multiples.
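As a rough sketch of both ideas, the Python below reduces each group to one summary value and, separately, draws one small panel per filtered subset. The row format (a list of dictionaries), the function names, and the use of matplotlib for the panels are assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def aggregate_by_group(rows, key, value, agg=median):
    """Collapse rows to one summary number per group (median by default)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: agg(values) for name, values in groups.items()}


def small_multiples(rows, facet, x, y):
    """Draw one scatter panel per distinct value of `facet` (hypothetical helper)."""
    import matplotlib.pyplot as plt  # assumed to be installed; imported lazily

    names = sorted({row[facet] for row in rows})
    # Shared axes keep the panels directly comparable
    fig, axes = plt.subplots(1, len(names), sharex=True, sharey=True, squeeze=False)
    for ax, name in zip(axes[0], names):
        subset = [row for row in rows if row[facet] == name]
        ax.scatter([r[x] for r in subset], [r[y] for r in subset], s=8)
        ax.set_title(str(name))
    return fig
```

Passing `agg=statistics.mean` instead of the default would give averages rather than medians.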
03:35 - Second, you can try to convey where your data is densest by adding a distribution chart in the margin of your scatter plot. The actual data in the scatter plot stays the same, but a distribution line on the side of the chart shows where the points are most dense. Similarly, you can create a contour plot, which draws concentric rings underneath your data points. The rings center on the densest areas and radiate out as the data becomes less dense.
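Here is one way this could look in Python. It counts points in a simple grid of bins, sums the grid along each axis for the marginal charts, and draws contour lines from the same counts. Binned counts are a crude stand-in for a smoothed density estimate, and all function names, the `bins` default, and the matplotlib layout are assumptions for this sketch. It also assumes the x and y values are not all identical.

```python
def bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi] to a bin number in 0..bins-1."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)


def density_grid(xs, ys, bins=10):
    """Count points per cell; grid[row][col] has row = y bin, col = x bin."""
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        grid[bin_index(y, y_lo, y_hi, bins)][bin_index(x, x_lo, x_hi, bins)] += 1
    return grid


def marginals(grid):
    """Sum the grid along each axis to get the two marginal distributions."""
    x_counts = [sum(col) for col in zip(*grid)]
    y_counts = [sum(row) for row in grid]
    return x_counts, y_counts


def bin_centers(lo, hi, bins):
    step = (hi - lo) / bins
    return [lo + (i + 0.5) * step for i in range(bins)]


def plot_density(xs, ys, bins=10):
    """Scatter with contour lines underneath and marginal bar charts."""
    import matplotlib.pyplot as plt  # assumed to be installed; imported lazily

    grid = density_grid(xs, ys, bins)
    x_counts, y_counts = marginals(grid)
    xc = bin_centers(min(xs), max(xs), bins)
    yc = bin_centers(min(ys), max(ys), bins)
    fig, axes = plt.subplots(
        2, 2, sharex="col", sharey="row",
        gridspec_kw={"width_ratios": [4, 1], "height_ratios": [1, 4]},
    )
    main, top, right = axes[1][0], axes[0][0], axes[1][1]
    axes[0][1].axis("off")  # unused corner panel
    main.contour(xc, yc, grid)            # rings around the densest cells
    main.scatter(xs, ys, s=6, alpha=0.4)  # the data itself is unchanged
    top.bar(xc, x_counts, width=(max(xs) - min(xs)) / bins)
    right.barh(yc, y_counts, height=(max(ys) - min(ys)) / bins)
    return fig
```

The scatter plot keeps every original point; only the contour lines and the two marginal charts are derived from the binned counts.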
04:22 - Third, you can add some jitter to your points. That’s when you slightly alter the values of points that are close together so they don’t overlap, or overlap less. The points end up kind of huddled together rather than obscuring each other. A similar solution that I found is called the gatherplot. I stumbled across a research paper by Niklas Elmqvist and others that introduced the gatherplot, and it’s kind of like adding jitter to your points in a scatter plot, but then ordering the points in a more meaningful way. Picture the gridlines on your scatter plot: whichever points fall within one cell are lined up in an orderly way rather than jittered all around or overlapping. So you get the benefit of jittering because the points aren’t overlapping, but it’s a little more organized, so you can compare the size of each group of points more easily. Plus, if you’re coloring the points by some other variable, it’s easier to compare the number of points of each color when they’re lined up and ordered within the cell rather than jittered randomly.
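To make the difference concrete, here is a small Python sketch of both ideas. `jitter` adds a small random offset to each value. `gather` is only a loose approximation of the gatherplot idea as described above (points in the same cell are sorted and lined up in rows), not the algorithm from the paper. The function names and the `width`, `columns`, and `spacing` parameters are hypothetical.

```python
import random
from collections import defaultdict


def jitter(values, width=0.2, seed=0):
    """Shift each value by a random amount in [-width, width]."""
    rng = random.Random(seed)  # fixed seed so the chart is reproducible
    return [v + rng.uniform(-width, width) for v in values]


def gather(cells, sort_keys, columns=5, spacing=0.1):
    """Line up the points of each cell in ordered rows instead of jittering.

    cells:     one (cx, cy) cell center per point
    sort_keys: one value per point (for example its color category),
               used to order the points within a cell
    Returns one (x, y) position per point, in the original order.
    """
    members = defaultdict(list)
    for i, cell in enumerate(cells):
        members[cell].append(i)

    positions = [None] * len(cells)
    for (cx, cy), indices in members.items():
        indices.sort(key=lambda i: sort_keys[i])
        for slot, i in enumerate(indices):
            row, col = divmod(slot, columns)
            # Center each row horizontally on the cell; stack rows upward
            x = cx + (col - (columns - 1) / 2) * spacing
            y = cy + row * spacing
            positions[i] = (x, y)
    return positions
```

Because `gather` sorts by `sort_keys` before placing the points, points that share a color end up next to each other, which is what makes the per-color counts easier to compare than with random jitter.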
05:45 - My final takeaway is that the next time you have an overplotting problem, where there are a lot of overlapping points in your chart, you can try
playing with transparency,
decreasing the size of the points,
aggregating the data,
creating small multiples with filtered data,
using a contour plot,
adding jitter, or
using a gatherplot.
06:15 - And if you’ve been wanting to try creating data viz in Adobe Illustrator, they offer a 7-day free trial with no credit card required, and you can get going designing and editing charts quickly with my new course, Design Your First Visualization in Adobe Illustrator in Under 30 Minutes.