
Descriptive Statistics with Python


In contrast with inferential statistics, descriptive statistics does not draw conclusions beyond the data we have, nor does it reach conclusions about any hypotheses we may make. In other words, we do not try to infer the characteristics of the population; we simply present a quantitative description of the data.

Measures and concepts, together with simple graphics, make up the descriptive statistics toolbox and are applied in every quantitative analysis of data.

Workflow

One of the first tasks when analyzing data is to collect and prepare the data in a format appropriate for analysis. The most common steps for data preparation are:

  1. Collecting the Data: Information can be read directly from a file, fetched through an API, or collected by scraping the web.

  2. Parsing the Data: The right parsing procedure depends on the file format: plain text, spreadsheet-like, etc.

  3. Cleaning the Data: There will almost always be empty values in the data, so it is necessary to decide on a strategy to handle them. One option is to remove the entire row; another is to fill the gap with the mean of the variable, if it is a quantitative one. In the end, the choice depends on the analyst and the type of data being analyzed.

  4. Building the Data Structure: Once the data has been read, it is necessary to check whether it fits in the computer's main memory. If it does not, a database is usually built to store the data, acting as an out-of-memory data structure.
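As an illustration of step 3, here is a minimal sketch of the two cleaning strategies just mentioned, using a small made-up frame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values
raw = pd.DataFrame({'age': [25, np.nan, 47, 31],
                    'sex': ['Male', 'Female', None, 'Female']})

# Option 1: remove every row that has an empty value
dropped = raw.dropna()

# Option 2: fill the gaps of a quantitative variable with its mean
filled = raw.copy()
filled['age'] = filled['age'].fillna(filled['age'].mean())
```

Dropping rows is the safest default, while filling with the mean preserves the sample size at the cost of slightly distorting the distribution.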

Let’s take an example

Consider the dataset called “adult”, stored in UCI's Machine Learning Repository (but you can download it here). It contains approximately 32,000 observations with different financial parameters related to the US population: age, sex, marital status, country, income (a Boolean variable indicating whether the person earns more than $50,000 per year), education (the highest grade achieved), occupation, etc.

The best way to explore the data is to ask questions such as: are men more likely than women to become high-income professionals, that is, to receive an income over $50,000? (This example was made using Python 2.)

# importing libraries we are going to need
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
#----------------------------------------------------------------
# Importing Data
#----------------------------------------------------------------
file = open('../files/adult.data','r')
#----------------------------------------------------------------
# Process data Function
#----------------------------------------------------------------
def chr_int(a):
    if a.isdigit(): return int(a)
    else: return 0

data = []
for line in file:
    data1 = line.split(', ')
    if len(data1)==15:
        data.append([chr_int(data1[0]),data1[1],chr_int(data1[2]),data1[3],chr_int(data1[4]),data1[5],data1[6],\
            data1[7],data1[8],data1[9],chr_int(data1[10]),chr_int(data1[11]),chr_int(data1[12]),data1[13],\
            data1[14]])

We look at the output:

print data[1:2]

[Output: a sample parsed record]

One of the easiest ways to manage data in Python is the DataFrame structure from the pandas library. It is a two-dimensional, size-mutable structure; it looks like a spreadsheet but is more powerful.

df = pd.DataFrame(data)
df.columns = [
    'age', 'type_employer', 'fnlwgt', 'education',
    'education_num', 'marital', 'occupation', 'relationship',
    'race', 'sex', 'capital_gain', 'capital_loss', 'hr_per_week',
    'country', 'income'
]
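As an aside, pandas can also parse this comma-plus-space format in one call with read_csv, skipping the manual loop. A sketch using two inline sample records in the same format (swap the StringIO buffer for the real file path; the column names follow the dataset's documented attributes):

```python
import io
import pandas as pd

cols = ['age', 'type_employer', 'fnlwgt', 'education', 'education_num',
        'marital', 'occupation', 'relationship', 'race', 'sex',
        'capital_gain', 'capital_loss', 'hr_per_week', 'country', 'income']

# Two sample records in the same comma-plus-space format as adult.data
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
# a multi-character separator needs the python engine
df_alt = pd.read_csv(sample, names=cols, sep=', ', engine='python')
```

read_csv also infers the column types, so the chr_int conversion above becomes unnecessary.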

We can find out how many records we have via the shape attribute, which gives us exactly the number of rows (records) and the number of columns (variables).

df.shape

[Output: the shape of the DataFrame]

One of the advantages of working with DataFrames is their flexibility. For example, to answer our question it is necessary to divide the sample by gender into two groups, men and women. We also create subgroups by income.

ml = df[(df.sex == 'Male')]
ml1 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
fm = df[(df.sex == 'Female')]
fm1 = df[(df.sex == 'Female') & (df.income == '>50K\n')]

Each particular measure represents a characteristic such as country of origin, education, etc. These measures and categories form a sample distribution of the variable, which is an approximation to the distribution of the population. One of the main objectives of descriptive statistics is to visualize and summarize the sample distribution, allowing us to make tentative assumptions about the population distribution.

Our observed data represent only one finite group out of an almost infinite number of possible samples. The characteristics of our random sample are interesting to the degree that they represent the characteristics of the population from which it was drawn.


One of the first measurements we use when summarizing the data and obtaining a sample statistic is the mean: given a sample of n values, x_i, i = 1, …, n, the mean, μ, is the sum of the values divided by the number of values; that is:

\mu = \frac{1}{n}\sum_{i=1}^{n} x_{i}

The mean is the most basic and important statistic: it describes the central tendency of the sample.
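The formula translates directly into Python; as a sanity check with made-up numbers:

```python
def mean(xs):
    """Sum of the values divided by the number of values."""
    return sum(xs) / float(len(xs))

print(mean([2, 4, 6]))  # 4.0
```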

print 'The average age of men is: ', ml['age'].mean()
print 'The average age of women is: ', fm['age'].mean()
print 'The average age of high-income men is: ', ml1['age'].mean()
print 'The average age of high-income women is: ', fm1['age'].mean()

[Output: the four mean ages]

There is a difference in the mean of the samples that can be considered as the first evidence that confirms our hypothesis.

Sample Variance

The mean is not usually a sufficient descriptor of the data; we can go further by knowing two numbers: mean and variance. The variance σ² describes the spread of the data and is defined as follows:

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2

The square root of the variance is known as the standard deviation. We usually prefer the standard deviation to the variance because it is expressed in the same units as the data, which makes it easier to interpret.
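A minimal sketch of these definitions on a made-up array. Note that the formula above divides by n, which is NumPy's default; pandas' Series.var() and Series.std() divide by n − 1 instead (the unbiased estimator), so the pandas results below can differ slightly:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up sample
mu = x.mean()
var_pop = ((x - mu) ** 2).sum() / len(x)   # divides by n, as in the formula
std_pop = var_pop ** 0.5
# np.var(x) matches var_pop (ddof=0 by default);
# a pandas Series would use ddof=1 and divide by n - 1
```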

Let us compute the age mean, variance and standard deviation.

ml_mu = ml['age'].mean()
fm_mu = fm['age'].mean()
ml_var = ml['age'].var()
fm_var = fm['age'].var()
ml_std = ml['age'].std()
fm_std = fm['age'].std()
print 'Statistics of age for men: mu:', ml_mu, 'var:', ml_var, 'std:', ml_std
print 'Statistics of age for women: mu:', fm_mu, 'var:', fm_var, 'std:', fm_std

[Output: mean, variance, and standard deviation of age by gender]

Skewness: Measuring Asymmetry

For univariate data, the skewness is a statistic that measures the asymmetry of a set of n data samples, x_i:

g = \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{3}}{(n-1)\, s^{3}}

A negative value indicates that the distribution skews left (it extends farther to the left than to the right), while a positive value indicates that it skews right. The skewness of the normal distribution is zero, and any symmetric data set must have a skewness close to zero.

#---------------------------------------------------------------
# Skewness Function
#---------------------------------------------------------------
def skewness(x):
    res = 0
    m = x.mean()
    s = x.std()
    for i in x:
        res += (i - m) * (i - m) * (i - m)
    res /= ((len(x) - 1) * s * s * s)   # (n - 1) * s^3, as in the formula above
    return res
# Here we apply the skewness function to the data without outliers
# (ml2_age and fm2_age), which we define later
print "Skewness of the male population = ", skewness(ml2_age)
print "Skewness of the female population = ", skewness(fm2_age)

[Output: skewness of the male and female populations]

That is, the female population is more skewed than the male one, probably because men may be more prone to retiring later than women.

Data Distribution

Summarizing data by looking only at their mean and variance can be dangerous: very different data sets can be described by the same statistics. For that reason, it is a best practice to validate the data by inspecting them directly, looking at the data distribution, which describes how often each value appears. The most common representation of a distribution is a histogram.

Let us show the age of working men and women separately.

fm_age = fm['age']
ml_age = ml['age']
fm_age.hist(normed=0, histtype='stepfilled', alpha=.5, bins=20)
ml_age.hist(normed=0, histtype='stepfilled', alpha=.5, color=sns.desaturate("indianred", .75), bins=10)
plt.xlabel('Age',fontsize=15)
plt.ylabel('Samples',fontsize=15)
plt.show();

[Figure: histogram of age for men and women]

Note that we are visualizing the absolute values of the numbers of people in our data set according to their age. As a side effect, we can see that there are many more men in these conditions than women.

We can normalize the frequencies of the histogram by dividing by n, the number of samples. The normalized histogram is called the Probability Mass Function (PMF).

fm_age.hist(normed=1, histtype='stepfilled', alpha=.5, bins=20)   # default number of bins = 10
ml_age.hist(normed=1, histtype='stepfilled', alpha=.5, color=sns.desaturate("indianred", .75), bins=10)
plt.xlabel('Age',fontsize=15)
plt.ylabel('PMF',fontsize=15)
plt.show()

[Figure: PMF of age for men and women]

Outlier Treatment

Outliers are data samples with a value that is far from the central tendency. Different rules can be defined to detect outliers, as follows:

  • Computing samples that are far from the median.

  • Computing samples whose values exceed the mean by two or three standard deviations.
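The second rule can be sketched as a small helper (the function name and the toy ages are made up for illustration):

```python
import numpy as np

def std_outliers(x, k=2):
    """Flag samples more than k standard deviations away from the mean."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > k * sigma

ages = np.array([22, 25, 31, 37, 40, 45, 90])
mask = std_outliers(ages)   # only the 90-year-old is flagged
```

Below, however, we follow the median-based rule, which is more robust because the median itself is not dragged around by the outliers.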

For example, in our case we are interested in the age statistics of men versus women with high incomes, and we can see that in our data set the minimum age is 17 years and the maximum is 90 years. We can consider that some of these samples are due to errors or are not representative. Applying domain knowledge, we focus on the range around the median age (37, in our case) from 22 up to 72 years old, and we consider the rest as outliers.

df2 = df.drop(df.index[(df.income == '>50K\n') &
                       ((df['age'] > df['age'].median() + 35) |
                        (df['age'] < df['age'].median() - 15))
                      ])
ml1_age = ml1['age']
fm1_age = fm1['age']
ml2_age = ml1_age.drop(ml1_age.index[
    (ml1_age > df['age'].median() + 35) |
    (ml1_age < df['age'].median() - 15)
])
fm2_age = fm1_age.drop(fm1_age.index[
    (fm1_age > df['age'].median() + 35) |
    (fm1_age < df['age'].median() - 15)
])

We can check how the mean and the median changed once the data were cleaned.

mu2ml = ml2_age.mean()
std2ml = ml2_age.std()
md2ml = ml2_age.median()
mu2fm = fm2_age.mean()
std2fm = fm2_age.std()
md2fm = fm2_age.median()
print "Men statistics:"
print "Mean:", mu2ml, "Std:", std2ml
print "Median:", md2ml
print "Min:", ml2_age.min(), "Max:", ml2_age.max()
print "Women statistics:"
print "Mean:", mu2fm, "Std:", std2fm
print "Median:", md2fm
print "Min:", fm2_age.min(), "Max:", fm2_age.max()

[Output: statistics of age after removing outliers]

Let us visualize how many outliers were removed from the whole data set:

plt.figure(figsize = (13.4, 5))
df.age[(df.income == '>50K\n')].plot(alpha = .25, color = 'blue')
df2.age[(df2.income == '>50K\n')].plot(alpha = .45, color = 'red');

[Figure: ages of high-income people before and after outlier removal]

The chart above shows the outliers in blue and the remaining data in red. Next, we verify the statistics again to see whether the outliers were biasing them.

print 'The mean difference with outliers is: %4.2f.' % (ml_age.mean() - fm_age.mean())
print 'The mean difference without outliers is: %4.2f.' % (ml2_age.mean() - fm2_age.mean())

[Output: mean differences with and without outliers]

In our case, there were more outliers among men than women. While the difference in the mean values before removing the outliers was 2.5, after removing them it slightly decreased to 2.44.

Let us observe the difference between men's and women's incomes in the cleaned subset in some more detail.

countx, divisionx = np.histogram(ml2_age, normed=True)
county, divisiony = np.histogram(fm2_age, normed=True)
# plot the difference of the two normalized histograms at each bin center
val = [(divisionx[i] + divisionx[i+1]) / 2 for i in range(len(divisionx) - 1)]
plt.plot(val, countx - county, 'o-')
plt.title('Differences in promoting men vs. women')
plt.xlabel('Age',fontsize=15)
plt.ylabel('Differences',fontsize=15)
plt.show();

[Figure: differences between the male and female age histograms]

One can see that the differences between male and female values are slightly negative before age 42 and positive after it. Hence, women tend to be promoted (receive more than 50K) earlier than men.

Conclusion

After exploring the data, we obtained some apparent effects that seem to support our initial assumptions. For example, the mean age for men in our dataset is 39.4 years; while for women, it is 36.8 years. When analyzing the high-income salaries, the mean age for men increased to 44.6 years; while for women, it increased to 42.1 years. When the data were cleaned from outliers, we obtained mean age for high-income men: 44.3, and for women: 41.8. Moreover, histograms and other statistics show the skewness of the data and the fact that women used to be promoted a little bit earlier than men, in general.

Congratulations! You have reached the end of this tutorial. It was based on the book “Introduction to Data Science” by Laura Igual and Santiago Seguí. I hope it helps you; if you want it, the Jupyter notebook is here.