A “Startlingly Neat & Simple” Rule & Five Graphs About Patterns That Might Surprise You

To see the interactive versions of these charts, head over to our blog. If you’re inclined to share, here is the link: http://blog.plot.ly/post/111292824082/a-startlingly-neat-simple-rule-five-graphs

George Zipf popularized an idea, Zipf’s Law, that approximates the populations of cities, the distribution of money in countries, and how frequently words are used. Nobel Prize-winning columnist Paul Krugman wrote of Zipf’s Law that

“the usual complaint about economic theory is that our models are oversimplified — that they offer excessively neat views of complex, messy reality. [In the case of Zipf’s law] the reverse is true: we have complex, messy models, yet reality is startlingly neat and simple.”

Read on to learn more. Let us know if you want to run Plotly Enterprise on-premises.

A Zipfian distribution is a type of power law. A power law occurs when one quantity varies as a power of another. One application of Zipf’s law states that in natural-language texts (e.g., books), a word’s frequency is roughly inversely proportional to its rank: the most common word is used about twice as often as the second most common word, three times as often as the third, and so on. The graph below applies the rule to word usage in 29 UK books. “The” was the most commonly used word, with 225,300 occurrences. Note that the graph is interactive; you can click the “play with this data” link to edit, embed, and share your own version.

See the interactive plot
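
In its idealized form, Zipf’s law says the count of the word at rank r is about the top count divided by r. As a rough sketch, assuming an exponent of exactly 1 (real texts only approximate this), the Python below projects counts for lower-ranked words from the 225,300 figure quoted for “the”. The function name `zipf_expected_count` is a hypothetical helper, and the printed values are predictions of the idealized rule, not counts taken from the dataset.

```python
def zipf_expected_count(top_count: float, rank: int, exponent: float = 1.0) -> float:
    """Idealized Zipf prediction: count at `rank` is top_count / rank**exponent.

    The default exponent of 1.0 is an assumption; fitted values vary by corpus.
    """
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return top_count / rank ** exponent


TOP_COUNT = 225_300  # occurrences of "the" quoted in the text

# Projected counts for the five highest-ranked words under the idealized rule.
for rank in range(1, 6):
    print(rank, round(zipf_expected_count(TOP_COUNT, rank)))
```

Comparing these projections with observed counts is one quick way to see how closely a text follows the rule.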

We can test for a power law by plotting frequency (y-axis) against rank (x-axis) on log-log axes and checking for a straight line. The graph below shows three attempts to fit a power law function to datasets. The plot on the left is a good fit, the plot in the middle is a decent fit, and the plot on the right is not a good fit.

See the interactive plot
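
The straight-line check can also be done numerically. The sketch below fits a least-squares line to log(frequency) versus log(rank) using only the standard library. The function name `fit_power_law` and the synthetic input are assumptions made for illustration; they are not the datasets behind the plots. A slope near -1 with R² close to 1 suggests a Zipf-like pattern, though this is a rough check and not a rigorous statistical test.

```python
import math


def fit_power_law(frequencies):
    """Least-squares line through (log rank, log frequency).

    `frequencies` must be positive and sorted from most to least frequent,
    so index 0 corresponds to rank 1.
    Returns (slope, intercept, r_squared) in log-log space.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two points")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 of a simple linear fit equals the squared correlation coefficient.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic data generated from an exact Zipf curve (not real measurements).
ideal = [1000 / rank for rank in range(1, 51)]
slope, intercept, r2 = fit_power_law(ideal)
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```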

Another application of Zipf’s law is to city populations. We’ve used ggplot2 to graph the population of each city (y-axis) against its rank (x-axis). In this dataset, New York has the highest population and is ranked first.

See the interactive plot
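
To build this kind of rank-versus-population view, cities are sorted by population and numbered from 1. The Python sketch below does that and also reports rank × population, which stays roughly constant under an idealized Zipf distribution with exponent 1. The helper name `rank_size_table` is hypothetical, and the city names and numbers are placeholders invented for illustration, not the data behind the chart.

```python
def rank_size_table(populations):
    """Sort cities by population (largest first) and attach 1-based ranks.

    Returns a list of (rank, city, population, rank * population) tuples.
    Under an idealized Zipf distribution with exponent 1, the last column
    is roughly constant.
    """
    ordered = sorted(populations.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, city, pop, rank * pop)
        for rank, (city, pop) in enumerate(ordered, start=1)
    ]


# Placeholder values invented for illustration; not real census data.
example = {
    "City A": 8_000_000,
    "City B": 4_000_000,
    "City C": 2_700_000,
    "City D": 2_000_000,
}

for row in rank_size_table(example):
    print(row)
```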

Country GDP plotted against rank also approaches a Zipfian distribution.

See the interactive plot

Researchers use power laws to determine how much infrastructure a city needs, estimate the number of gas stations a city requires, and much more.

See the interactive plot

If you liked what you read, please consider sharing. Find us at feedback@plot.ly and @plotlygraphs.
