02 - Data Handling
Getting data
Visualizing data
Characterizing data
Manipulating data
http://archive.ics.uci.edu/ml/index.php
https://aws.amazon.com/fr/datasets/
Meta portals:
http://dataportals.org/
https://www.opendatamonitor.eu
Given a file with 3 columns of values
10 0 cold
25 0 warm
15 5 cold
20 3 warm
18 7 cold
20 10 cold
22 5 warm
24 6 warm
Read data into a list
import csv

def csv_file_to_list(csv_file_name):
    # read the entire CSV file into a list of rows
    with open(csv_file_name, 'r', newline='') as f:
        reader = csv.reader(f)
        data = list(reader)
    return data
Read data into a dictionary
(first two columns form the key, the last column is the value)
def load_3row_data_to_dic(input_file):
    dic = {}
    with open(input_file, 'r') as f:
        for entry in f.read().splitlines():
            values = entry.split(' ')
            # the two leading numbers form a tuple key, the label is the value
            dic[int(values[0]), int(values[1])] = values[2]
    return dic
Write dictionary back to file
def save_3row_data_from_dic(output_file, data):
    with open(output_file, 'w') as f:
        for key, value in data.items():
            f.write(str(key[0]) + ' ' + str(key[1]) + ' ' + value + '\n')
import sys

def printf(format, *args):
    sys.stdout.write(format % args)

printf('hello %s world', 'good')
printf('pi is %f', 3.1415)
Plots
http://matplotlib.org
Bar charts
Pie charts
Line plots
Scatter plots
Histograms
…
Simple library
The figure is built up internally, step by step
Finally, show() displays the result
Many features -> see the online manual
import matplotlib.pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title (GDP gross domestic product)
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi",
"West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
# bars are by default width 0.8
xs = [i for i, _ in enumerate(movies)]
# plot bars with left x-coordinates [xs], heights [num_oscars]
plt.bar(xs, num_oscars)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")
# label x-axis with movie names at bar centers
plt.xticks([i for i, _ in enumerate(movies)], movies)
plt.show()
from collections import Counter
grades = [83,95,91,87,70,0,85,82,100,67,73,77,0]
# round down to the nearest ten
decile = lambda grade: grade // 10 * 10
histogram = Counter(decile(grade) for grade in grades)
# give each bar a width of 8
plt.bar(list(histogram.keys()), list(histogram.values()), 8)
# x-axis -5 .. 105, y-axis 0 .. 5 , labels 0 .. 100
plt.axis([-5, 105, 0, 5])
plt.xticks([10 * i for i in range(11)])
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()
# y value series
variance = [1,2,4,8,16,32,64,128,256]
bias_squared = [256,128,64,32,16,8,4,2,1]
# zip() combines two data series to tuples
total_error = [x + y for x, y in zip(variance, bias_squared)]
# x values
xs = range(len(variance))
# we can make multiple calls to plt.plot
# to show multiple series on the same chart
# green solid line, red dot-dashed line, blue dotted line
plt.plot(xs, variance, 'g-', label='variance')
plt.plot(xs, bias_squared, 'r-.', label='bias^2')
plt.plot(xs, total_error, 'b:', label='total error')
# because we've assigned labels to each series
# we can get a legend for free
# loc=9 means "top center"
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.title("The Bias-Variance Tradeoff")
plt.show()
friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)
# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
plt.annotate(label, xy=(friend_count, minute_count),
xytext=(5, -5), textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
plt.pie([0.95, 0.05],
labels=["Uses pie charts", "Knows better"])
# make sure pie is a circle and not an oval
plt.axis("equal")
plt.show()
Statistics
Small data sets can simply be represented by
giving the numbers
For larger data sets this is opaque
(imagine a million numbers …)
-> we need statistics
from __future__ import division # Python 2 only: make / return floats instead of truncating to int
from collections import Counter
from linear_algebra import sum_of_squares, dot
import math
import matplotlib.pyplot as plt
num_friends = [100,49,41,40,25,21,21,19,19,18,18,
16,15,15,15,15,14,14,13,13,13,13,12,12,11,
10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
8,8,8,8,8,8,8,8,8,8,8,8,8,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
Vector calculations
Matrix calculations
def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)
friend_counts = Counter(num_friends)
xs = range(101)
ys = [friend_counts[x] for x in xs]
plt.bar(xs, ys)
plt.axis([0,101,0,25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
num_points = len(num_friends) # 204
largest_value = max(num_friends) # 100
smallest_value = min(num_friends) # 1
sorted_values = sorted(num_friends)
smallest_value = sorted_values[0] # 1
second_smallest_value = sorted_values[1] # 1
second_largest_value = sorted_values[-2] # 49
Average (mean)
def mean(x):
    return sum(x) / len(x)
mean(num_friends)
The average depends on every single value
◦ Outliers can heavily influence the average
Median value
(middle-most value of the data set)
To find the middle value the data set must be
sorted
if (number of points is odd)
◦ take the middle one
else
◦ take the mean of the two middle values
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the two middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
Generalization of the median
The p-quantile is the value below which the
fraction p of the data set lies.
def quantile(x, p):
    """returns the pth-percentile value in x"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]
quantile(num_friends, 0.10) # 1
quantile(num_friends, 0.25) # 3
quantile(num_friends, 0.75) # 9
quantile(num_friends, 0.90) # 13
Might be more than one value
def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]
mode(num_friends) # [1, 6]
Measure how spread out the data is
◦ If near 0 -> hardly spread out
◦ If large number -> spread out
Range of data values
def data_range(x):
    return max(x) - min(x)
def de_mean(x):
    """translate x by subtracting its mean
    (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)
The variance is expressed in the squared unit of the data
Therefore the standard deviation is
introduced as the square root of the variance
def standard_deviation(x):
    return math.sqrt(variance(x))
These metrics again depend on every single
value and are sensitive to extreme
values (outliers)
Take the difference between 75% and 25%
percentile values
def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)
Correlation
Compare data sets to find commonalities
Example:
◦ Is the amount of time people spend on our web site related to their number of friends?
◦ Start with a list of daily_minutes each user spends on the web site
daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,
54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,
32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,
26.02, 27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,
25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,
35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,
29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,
13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,
24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,
18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,
43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,
13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,
26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,
36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,
29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,
13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,
31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,
33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,
22.61,26.89,23.48,8.38,27.81,32.35,23.84]
Analogous to the variance
Dot product of the two data sets' deviations
from their mean values
def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)
May be hard to interpret
◦ Large positive covariance: if x is big, y is also big
◦ Large negative covariance: if x is big, y is small (and vice versa)
Divide out the standard deviation
def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0  # if no variation, correlation is zero
correlation(num_friends, daily_minutes) # 0.247
Single extreme values can distort the plot
Always check your data for outliers
Maybe it is safe to ignore them?
outlier = num_friends.index(100)  # index of the outlier

num_friends_good = [x
                    for i, x in enumerate(num_friends)
                    if i != outlier]

daily_minutes_good = [x
                      for i, x in enumerate(daily_minutes)
                      if i != outlier]
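With the outlier removed, the correlation can be recomputed using the functions defined above; the exact value depends on the data, but it rises noticeably:

correlation(num_friends_good, daily_minutes_good)  # noticeably higher than 0.247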
Simpson's Paradox
Beware if boundary conditions are different
◦ "The only difference is the observation"
◦ "All else is equal"
Confounding variables can influence the
correlation (see the sketch below)
Always check AND UNDERSTAND your data
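A minimal numeric sketch of the paradox (hypothetical counts, modeled on the classic kidney-stone example): within every subgroup, treatment X has the higher success rate, yet aggregated over both subgroups Y looks better.

# hypothetical (successes, trials) per subgroup and treatment
data = {
    "small": {"X": (81, 87),   "Y": (234, 270)},
    "large": {"X": (192, 263), "Y": (55, 80)},
}
for group, treatments in data.items():
    for t, (s, n) in treatments.items():
        print(group, t, s / n)
# per group X wins: small 0.931 vs 0.867, large 0.730 vs 0.688
for t in ["X", "Y"]:
    s = sum(data[g][t][0] for g in data)
    n = sum(data[g][t][1] for g in data)
    print("overall", t, s / n)
# overall Y wins: 0.780 vs 0.826 -- the confounder (subgroup size) reverses the result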
Correlation of zero means that there is no
linear relationship between the two variables
X = [-2, -1, 0, 1, 2]
Y = [ 2, 1, 0, 1, 2]
have zero correlation, but they have a
relationship (which is non-linear)
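A quick check with the correlation function defined earlier confirms this:

X = [-2, -1, 0, 1, 2]
Y = [ 2, 1, 0, 1, 2]
correlation(X, Y)  # 0 -- yet Y = |X|, a perfect (non-linear) relationship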
Correlation says nothing about the magnitude
of the relationship
X = [-2, -1, 0, 1, 2]
Y = [99.98, 99.99, 100, 100.01, 100.02]
The data is perfectly correlated, but are you
interested in a relationship this small?
Correlation is NOT causation
If data correlates, this might be because
◦ There is an underlying relationship
◦ There is something bigger causing this behavior
(external forces)
◦ The correlation is by coincidence and means
nothing
Conduct randomized experiments to corroborate
the results
Probability
Quantify the uncertainty associated with
the occurrence of events
Given an event E we describe the probability
of the event happening as P(E)
Two events E and F are dependent if knowing
something about whether E happens gives us
information about whether F happens.
Otherwise, they are independent
If independent, then
P(E, F) = P(E) * P(F)
If events E, F are not necessarily independent
P(E|F) = P(E, F) / P(F)
P(E, F) = P(E|F) * P(F)
Probability that E happens if we know that F
happens
Assumptions:
◦ Each child is equally likely to be a boy or a girl
◦ Gender of second child is independent of gender of
the first one
◦ P(B) = P(G) = 0.5
[Probability tree: from Start, the first child is a Boy with P(B) or a Girl with P(G); each branch splits again for the second child, yielding the four outcomes P(B,B), P(B,G), P(G,B), P(G,G).]
Conditional probability that both kids are girls,
given that the first one is a girl:
P(G,G|G) = P(G,G) / P(G)
P(G,G|G) = ¼ / ½ = ½
Conditional probability of two girls, given that at least
one kid is a girl:
P(2girls | min1girl) = P(2girls, min1girl) / P(min1girl)
P(2girls | min1girl) = ¼ / ¾ = 1/3
So if you know that there is at least one girl in a family with
two kids, the odds are 2:1 that the other kid is a boy.
from __future__ import division
from collections import Counter
import math, random

def random_kid():
    return random.choice(["boy", "girl"])

both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)    # 0.514 ~ 1/2
print("P(both | either):", both_girls / either_girl)  # 0.342 ~ 1/3
Provides a way to "reverse" a conditional
probability:
P(E, F) = P(E) * P(F|E)
P(E|F) = P(E, F) / P(F)
Question:
how do we express P(E|F) in terms of P(F|E)?
Substituting the first equation into the second:
P(E|F) = P(E) * P(F|E) / P(F)
P(F) can be split into two parts:
P(F) = P(F, E) + P(F, ¬E)
the probability that F happens together with E, plus
the probability that F happens and E does not
P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|¬E)P(¬E)]
Given a disease that affects 1 out of 10000 people
The test gives the correct result in 99% of cases
What is the probability that you are sick if the
test is positive?
P(Disease|TestPositive): P(D|T)
We know
◦ P(T|D) = 0.99 -> P(T|¬D) = 0.01
◦ P(D) = 1/10000 = 0.0001 -> P(¬D) = 0.9999
P(D|T) = P(T|D)P(D) / [P(T|D)P(D) + P(T|¬D)P(¬D)]
P(D|T) = 0.99 * 0.0001 / (0.99 * 0.0001 + 0.01 * 0.9999)
P(D|T) ≈ 0.0098, i.e. less than 1% despite the positive test
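A few lines of plain Python verify the arithmetic (no assumptions beyond the numbers above):

p_t_given_d = 0.99      # P(T|D): test positive given disease
p_t_given_not_d = 0.01  # P(T|¬D): false positive rate
p_d = 0.0001            # P(D): prior probability of the disease

p_d_given_t = (p_t_given_d * p_d) / (p_t_given_d * p_d + p_t_given_not_d * (1 - p_d))
print(p_d_given_t)      # ~0.0098: a positive test still means <1% chance of disease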
Uniform Distribution
◦ Equal weight on all numbers between 0 and 1
◦ -> weight for single point = 0!
Better representation using the Probability
Density Function (pdf)
Cumulative Distribution Function (cdf)
y = 0 for x < 0
y = x for 0 <= x <= 1
y = 1 for x > 1
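A minimal sketch of both functions for the uniform distribution on [0, 1), in the style of the other snippets in this section:

def uniform_pdf(x):
    # density is 1 on [0, 1) and 0 everywhere else
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x):
    """probability that a uniform random variable is <= x"""
    if x < 0:
        return 0   # uniform is never below 0
    elif x < 1:
        return x   # e.g. P(X <= 0.4) = 0.4
    else:
        return 1   # uniform is always below 1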
Bell-curve distribution
f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
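The formula translates directly into Python; a sketch using only the math module:

import math

def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt_two_pi * sigma)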
Cumulative Distribution Function
Sometimes it is necessary to find the x for a
given probability (the inverse of the CDF)
There is no closed-form way to do that, but we can
use binary search to find the value
http://en.wikipedia.org/wiki/Binary_search_algorithm
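A sketch of the normal CDF (expressed via math.erf) and its inversion by binary search; the tolerance default is an illustrative choice:

import math

def normal_cdf(x, mu=0, sigma=1):
    # the normal CDF can be written in terms of the error function
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find approximate x with normal_cdf(x) == p via binary search"""
    # reduce to the standard normal, then rescale the answer
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z = -10.0  # normal_cdf(-10) is (very close to) 0
    hi_z = 10.0    # normal_cdf(10) is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2   # consider the midpoint
        if normal_cdf(mid_z) < p:
            low_z = mid_z            # midpoint is still too low, search above it
        else:
            hi_z = mid_z             # midpoint is too high, search below it
    return mid_z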
Filters & Conversions
Data values might differ hugely in magnitude
For some applications the relative position of a
value is more interesting than its absolute size
Rescaling to percent of (MIN, MAX)
scaled = (value-min)/(max-min)
To get rid of extremes you can use the
quantile / percentile function
Find high/low extremes and cut them off
Then re-evaluate
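A sketch of both ideas, min-max rescaling and quantile-based clipping (function names are illustrative; clip_extremes reuses the quantile function defined earlier):

def rescale(values):
    """map each value linearly to the range [0, 1]"""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def clip_extremes(values, p_low=0.05, p_high=0.95):
    """keep only values between the p_low and p_high quantiles"""
    lo, hi = quantile(values, p_low), quantile(values, p_high)
    return [v for v in values if lo <= v <= hi]

print(rescale([1, 2, 3, 4, 100]))  # the extreme value squeezes the rest toward 0
num_friends_clipped = clip_extremes(num_friends)  # cut off extremes, then re-evaluate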
Simulation of Galton Board
Write a program which simulates
a Galton board with 10 levels.
The probability of deflection at
each level can be parameterized;
when using p = q = 0.5 a
Gaussian distribution is expected
With the program, run experiments for at least
3 different p/q settings (p + q = 1)
Use different numbers of marbles
(1E2, 1E4, 1E6, 1E8 and 1E10)
Document the results in histograms, graphs,
…
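A minimal simulation sketch under the stated assumptions (pure Python; parameter names are illustrative). Note that 1E10 marbles is impractical to simulate one by one; for the largest runs you would sample, or draw bin counts from the binomial distribution directly:

import random
from collections import Counter
import matplotlib.pyplot as plt

def galton(num_marbles, levels=10, p=0.5):
    """drop marbles through `levels` rows of pins;
    each pin deflects the marble to the right with probability p"""
    bins = Counter()
    for _ in range(num_marbles):
        # the final bin is the number of rightward deflections
        position = sum(1 for _ in range(levels) if random.random() < p)
        bins[position] += 1
    return bins

bins = galton(10000, levels=10, p=0.5)  # p = q = 0.5 -> bell shape expected
plt.bar(list(bins.keys()), list(bins.values()))
plt.xlabel("bin")
plt.ylabel("# of marbles")
plt.title("Galton board, 10 levels, p = 0.5")
plt.show()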