{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simple data exploration\n", "\n", "In this notebook we will explore a dataset from an article by a team at Autodesk Research (linked in the final section).\n", "This is the kind of simple data exploration you might do when you first start working with a new dataset.\n", "\n", "First, we will load pandas, NumPy, and matplotlib, and read the comma-separated values (CSV) file into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from matplotlib import rcParams\n", "# default figure size in inches (width, height)\n", "rcParams['figure.figsize'] = (6, 6)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | x | y | label |
|---|---|---|---|
| 0 | 32.331110 | 61.411101 | away |
| 1 | 53.421463 | 26.186880 | away |
| 2 | 63.920202 | 30.832194 | away |
| 3 | 70.289506 | 82.533649 | away |
| 4 | 34.118830 | 45.734551 | away |
| ... | ... | ... | ... |
| 1841 | 34.794594 | 13.969683 | x_shape |
| 1842 | 79.221764 | 22.094591 | x_shape |
| 1843 | 36.030880 | 93.121733 | x_shape |
| 1844 | 34.499558 | 86.609985 | x_shape |
| 1845 | 31.106867 | 89.461635 | x_shape |

1846 rows × 3 columns
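
The cell that produced this preview is not shown, so the file name is unknown. As a minimal sketch of the loading step, the example below reads CSV text from an in-memory buffer with `pd.read_csv`. The ten rows are copied from the preview above; `csv_text` is a stand-in for the real file, which has 1846 rows.

```python
import io

import pandas as pd

# Ten rows copied from the preview above. The real file has 1846 rows and
# its name is not shown, so the CSV text is read from memory (an assumption
# made only to keep this example self-contained).
csv_text = """x,y,label
32.331110,61.411101,away
53.421463,26.186880,away
63.920202,30.832194,away
70.289506,82.533649,away
34.118830,45.734551,away
34.794594,13.969683,x_shape
79.221764,22.094591,x_shape
36.030880,93.121733,x_shape
34.499558,86.609985,x_shape
31.106867,89.461635,x_shape
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                    # (rows, columns)
print(df.dtypes)                   # x and y are floats, label is a string column
print(df["label"].value_counts())  # number of rows per label
```

With a file on disk, the same call takes a path instead of the buffer.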
| | x | y |
|---|---|---|
| x | 281.069988 | -29.113933 |
| y | -29.113933 | 725.515961 |
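
The 2×2 table above is a covariance matrix: the variances of `x` and `y` sit on the diagonal and their covariance sits off the diagonal. Its values match the `dino` rows of the per-label table, so it was presumably computed on that subset. The sketch below shows how such a table can be produced with `DataFrame.cov` and checks one entry against the textbook formula. The four-point frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-point dataset, used only to illustrate the call.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 1.0, 4.0, 3.0]})

# Sample covariance matrix (pandas divides by n - 1).
cov = df[["x", "y"]].cov()
print(cov)

# The same off-diagonal entry computed by hand:
# sum((x - mean_x) * (y - mean_y)) / (n - 1)
dx = df["x"] - df["x"].mean()
dy = df["y"] - df["y"].mean()
manual_xy = (dx * dy).sum() / (len(df) - 1)

print(np.isclose(manual_xy, cov.loc["x", "y"]))
```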
| label | | x | y |
|---|---|---|---|
| away | x | 281.227029 | -28.971572 |
| | y | -28.971572 | 725.749775 |
| bullseye | x | 281.207393 | -30.979902 |
| | y | -30.979902 | 725.533372 |
| circle | x | 280.898024 | -30.846620 |
| | y | -30.846620 | 725.226844 |
| dino | x | 281.069988 | -29.113933 |
| | y | -29.113933 | 725.515961 |
| dots | x | 281.156953 | -27.247681 |
| | y | -27.247681 | 725.235215 |
| h_lines | x | 281.095333 | -27.874816 |
| | y | -27.874816 | 725.756931 |
| high_lines | x | 281.122364 | -30.943012 |
| | y | -30.943012 | 725.763490 |
| slant_down | x | 281.124206 | -31.153399 |
| | y | -31.153399 | 725.553749 |
| slant_up | x | 281.194420 | -30.992806 |
| | y | -30.992806 | 725.688605 |
| star | x | 281.197993 | -28.432772 |
| | y | -28.432772 | 725.239695 |
| v_lines | x | 281.231512 | -31.371608 |
| | y | -31.371608 | 725.638809 |
| wide_lines | x | 281.232887 | -30.075267 |
| | y | -30.075267 | 725.650560 |
| x_shape | x | 281.231481 | -29.618418 |
| | y | -29.618418 | 725.224991 |
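
A per-label table like the one above can be produced by grouping on `label` before calling `cov`. This is a minimal sketch that assumes the column names `x`, `y`, and `label` from the preview. The labels `up` and `down` and their points are hypothetical.

```python
import pandas as pd

# Two hypothetical groups: one rising line and one falling line.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [1.0, 2.0, 3.0, 3.0, 2.0, 1.0],
    "label": ["up", "up", "up", "down", "down", "down"],
})

# One 2x2 covariance matrix per label, stacked under a
# two-level (label, column) row index.
grouped_cov = df.groupby("label")[["x", "y"]].cov()
print(grouped_cov)

# A single entry is addressed by (label, row) and column.
print(grouped_cov.loc[("up", "x"), "y"])
print(grouped_cov.loc[("down", "x"), "y"])
```

Here both groups have the same variance in `x` and in `y`, while the sign of the covariance separates them.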
| label | x | y |
|---|---|---|
| away | 54.266100 | 47.834721 |
| bullseye | 54.268730 | 47.830823 |
| circle | 54.267320 | 47.837717 |
| dino | 54.263273 | 47.832253 |
| dots | 54.260303 | 47.839829 |
| h_lines | 54.261442 | 47.830252 |
| high_lines | 54.268805 | 47.835450 |
| slant_down | 54.267849 | 47.835896 |
| slant_up | 54.265882 | 47.831496 |
| star | 54.267341 | 47.839545 |
| v_lines | 54.269927 | 47.836988 |
| wide_lines | 54.266916 | 47.831602 |
| x_shape | 54.260150 | 47.839717 |
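
The group means above differ by less than 0.01 across the thirteen labels, in both `x` and `y`. The sketch below shows one way to compute per-label means and measure that spread. The two labels `flat` and `diag` and their points are hypothetical, chosen so that the groups look different but share the same means.

```python
import pandas as pd

# Two hypothetical groups with different shapes but identical means.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 0.0, 2.0, 4.0],
    "y": [2.0, 2.0, 2.0, 1.0, 2.0, 3.0],
    "label": ["flat", "flat", "flat", "diag", "diag", "diag"],
})

# Mean of x and y within each label.
means = df.groupby("label")[["x", "y"]].mean()
print(means)

# Largest minus smallest group mean, per column.
# A spread near zero means the summary cannot tell the groups apart.
spread = means.max() - means.min()
print(spread)
```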
\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final thoughts\n", "\n", "The dataset above comes from [this publication by Autodesk Research](https://www.autodeskresearch.com/publications/samestats):\n", "\n", "``` Justin Matejka, George Fitzmaurice (2017)\n", " Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing\n", " CHI 2017 Conference proceedings:\n", " ACM SIGCHI Conference on Human Factors in Computing Systems\n", "```\n", "\n", "It was inspired by this tweet from Alberto Cairo:\n", "\n", "Don't trust summary statistics. Always visualize your data first https://t.co/63RxirsTuY pic.twitter.com/5j94Dw9UAf
— Alberto Cairo (@AlbertoCairo) August 15, 2016
\n", "\n", "A better-known example of the same effect is [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).\n", "\n", "Justin Matejka, one of the paper's authors, makes a similar point about boxplots:\n", "\n", "Be wary of boxplots! They might be obscuring important information. https://t.co/amnbAYvsq1 pic.twitter.com/7YxslPGp1n
— Justin Matejka (@JustinMatejka) August 9, 2017
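
Following the advice in the tweets above, a quick way to look at the data is one scatter panel per label. This is a minimal sketch that assumes the `x`, `y`, and `label` columns from the preview. The two labels and their points are hypothetical, and the `Agg` backend line is only there so the example runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Two hypothetical groups with identical means but different shapes.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 0.0, 2.0, 4.0],
    "y": [2.0, 2.0, 2.0, 1.0, 2.0, 3.0],
    "label": ["flat", "flat", "flat", "diag", "diag", "diag"],
})

labels = sorted(df["label"].unique())

# One panel per label, with shared axes so the shapes are comparable.
fig, axes = plt.subplots(
    1, len(labels), figsize=(6, 3), sharex=True, sharey=True, squeeze=False
)
for ax, name in zip(axes[0], labels):
    subset = df[df["label"] == name]
    ax.scatter(subset["x"], subset["y"], s=10)
    ax.set_title(name)

fig.tight_layout()
```

With thirteen labels, a grid of rows and columns would likely read better than a single row.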