Four Mistakes To Avoid If You’re Analyzing Data

Analyzing and graphing data helps us understand our work in science, business, and everyday life. We’ve written this post with a few principles we think about as a startup. We used Plotly’s free web app. Contact us if you’d like to use Plotly Enterprise to power your graphing and collaboration.

Image for post
Source: xkcd

1. Choose The Right Metrics

There are two types of data teams must be able to differentiate between: vanity metrics relate data that sounds appealing but is ultimately irrelevant while actionable metrics relate data that is relevant to a team.

For example, a startup studying only the first graph below might conclude that things are going well. The second graph reveals that despite the increased traffic, only 1% of visitors are actually signing up. Thus, we could make a new goal: increase not only the number of visitors who sign up, but the proportion thereof.

Image for post
See the interactive plot
Image for post
See the interactive plot

2. Correlation vs. Causation

As the comic at the beginning of this post notes, correlation does not imply causation. An increase in sales can’t directly be attributed to a new marketing startegy, just like cheese consumption can’t directly be attributed to doctorates awarded. A “correlation means causation” argument needs to pass further testing, analysis, and study.

Image for post
See the interactive plot

3. Ignoring the Tail

Looking at “top 10” for metrics is natural. It can be misleading if the “other” category largely exceeds the top categories for the metric. For example, consider the next two graphs. Most of the traffic is coming from smaller contributors in the “other” category, yet someone looking at the first graph might only focus on Facebook and Twitter. For more on distributions, see heavy-tailed distribution.

Image for post
See the interactive plot

4. Avoid Averages

Focusing too much on averages can be misleading as they do not accurately portray exactly how the data is dispersed. For example, say that our analytics say that “Average Time Spent” on the site is 1 minute and 33 seconds. Yet graphing out all the times spent on the site yields this:

Image for post
See the interactive plot

The two factors to analyze are (1) that many users are leaving the site in under 10 seconds, and (2) that a portion of them stay between 181 and 1800 seconds. In this case, the average does not explain how users are interacting with the site. Pro tip: look at a histogram or a boxplot to get a better feel for a distribution.

Analyzing data is not easy. We hope this post helps. Has your team made or avoided any of these mistakes? Do you have suggestions for a future post? Let us know; we’re @plotlygraphs, or email us at feedback at plot dot ly.

Written by

The leading front-end for ML & data science models in Python, R, and Julia.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store