What is an Outlier?

Malcolm Gladwell, author of the book “Outliers”, answered the question “What is an outlier?” this way: “‘Outlier’ is a scientific term to describe things or phenomena that lie outside normal experience.”

In many situations, scatter plots are used to visually detect suspected and extreme outliers. For example, on the first scatter plot below, two data points (labeled E and G) are outliers, and the data point labeled H is only a suspected outlier.

On the next scatter plot below, two data points are marked “Outlier!” and one data point is marked “Outlier?” because it is a suspected outlier:

J. W. Tukey invented the box-and-whisker plot to display a group of data, with the specific goal of visualizing outliers. To create a box-and-whisker plot, draw a box with ends at the quartiles Q1 and Q3, so that the central rectangle spans the interquartile range (IQR). Draw the statistical median as a horizontal line in the box.

Now extend the “whiskers” to the farthest points that are not outliers (i.e., that are within 3/2 times the interquartile range of Q1 and Q3). Then, for every point more than 3/2 times the interquartile range from the end of the box, draw a dot. If two dots have the same value, draw them side by side.
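The construction steps above can be sketched in a few lines of Python. This is only an illustration, not Tukey's own procedure: the function name box_plot_summary and the sample numbers are made up, and the "inclusive" quartile method is just one of several conventions (different tools can give slightly different Q1 and Q3 for the same data).

```python
from statistics import quantiles


def box_plot_summary(values, k=1.5):
    """Return quartiles, whisker ends and dots for a box-and-whisker plot.

    Hypothetical helper for illustration; k=1.5 is the 3/2 multiplier
    from the text.
    """
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions exist.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit, high_limit = q1 - k * iqr, q3 + k * iqr
    # Whiskers reach the farthest data points that are still within the limits.
    inside = [x for x in data if low_limit <= x <= high_limit]
    # Everything beyond the limits is drawn as an individual dot.
    dots = [x for x in data if x < low_limit or x > high_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "dots": dots,
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up numbers
summary = box_plot_summary(sample)
print(summary)
```

With this sample, the box spans 3.25 to 7.75, the whiskers end at 1 and 9, and the value 30 is drawn as a dot.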

It is not uncommon for real datasets to display surprisingly high maximums or surprisingly low minimums, called outliers. John Tukey (see above) provided a precise definition for two types of outliers. Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first quartile.

Suspected outliers are slightly more “central” versions of outliers: either 1.5×IQR or more above the third quartile or 1.5×IQR or more below the first quartile, without reaching the 3×IQR threshold. If either type of outlier is present, the whisker on the appropriate side is drawn to 1.5×IQR from the quartile (the “inner fence”) rather than to the max or min, and individual outlying data points are displayed as unfilled circles (for suspected outliers) or filled circles (for outliers). (The “outer fence” is 3×IQR from the quartile.)
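As a rough sketch of the two fences described above, the Python snippet below labels each value as regular, a suspected outlier, or an outlier. The function name classify_outliers and the sample values are hypothetical, and the "inclusive" quartile method is an assumption, since quartile conventions differ between tools.

```python
from statistics import quantiles


def classify_outliers(values):
    """Label each value using the inner (1.5×IQR) and outer (3×IQR) fences.

    Hypothetical helper for illustration only.
    """
    # Assumption: "inclusive" quartiles; other conventions exist.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    labeled = []
    for x in values:
        # "or more" in the definition means the fences themselves count.
        # Note: if the IQR is zero the fences collapse onto the quartiles,
        # and this rule is not meaningful.
        if x <= outer_low or x >= outer_high:
            label = "outlier"
        elif x <= inner_low or x >= inner_high:
            label = "suspected outlier"
        else:
            label = "regular"
        labeled.append((x, label))
    return labeled


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 30]  # made-up numbers
labels = dict(classify_outliers(sample))
print(labels)
```

For this sample, Q1 is 3.5 and Q3 is 8.5, so the upper inner fence is at 16 and the upper outer fence is at 23.5. The value 18 falls between the fences and is a suspected outlier, while 30 lies beyond the outer fence and is an outlier.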

You can read more about Tukey in many places; one of them is here: http://blogs.sas.com/content/jmp/2013/09/02/celebrating-statisticians-john-w-tukey/

Outlier Detection

Below you can find two charts demonstrating the visualization of outliers: the first (upper) chart is a radar chart and the second (bottom) chart is a box-and-whisker diagram.

The first chart below shows the number of visitors to a website as a function of the time of day. The counts are averaged over time, grouped by day of the week, and stacked on top of each other from the beginning of the week (Monday) to the end of the week (Sunday in this case), so daily trends are visible along with the average weekly picture. It is easy to see that around 11pm and 4am each day the number of visits far exceeds that at any other time of day. Those visits either require more attention from the salespeople who are trying to convert visitors into customers, or need to be analyzed in more detail.

Such an analysis gives a simple answer: those two spikes in web visits are created by so-called web crawlers and spiders, which are simply scheduled to visit this particular website at those times, and therefore those visits should be completely ignored!

The next chart below is a sample of how the box-and-whisker diagram can be used to visualize snowfall (shown are data points for monthly snowfall in Newport, VT during 1977-2007).

Permalink: https://apandre.wordpress.com/visible-data/outliers/

