The aim of this IPython Notebook is to show how we can use Python to build predictive algorithms that solve data science problems in the arena of education.
This notebook is still heavily under construction.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# Get the data: Algebra 2005-2006
train_filepath = 'data/algebra0506/algebra_2005_2006_train.txt'
test_filepath = 'data/algebra0506/algebra_2005_2006_test.txt'
traindata = pd.read_table(train_filepath)
More information on the data format can be found on the challenge website.
# Inspect some of the training data
traindata.head()
Let's begin by asking some basic questions of the data.
# Take the column of anonymized student IDs and count the number of unique entries
print('Number of students: {}'.format(len(np.unique(traindata['Anon Student Id']))))
csd = traindata['Correct Step Duration (sec)']
csd.describe()
So, ignoring all the steps that were not solved correctly, the average duration of a problem step was about 18 seconds.
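The average above leaves out unsolved steps because pandas skips missing values (NaN) when it computes summary statistics. A minimal sketch, assuming a missing duration is stored as NaN; the values below are made up for illustration and are not taken from the challenge data:

```python
import numpy as np
import pandas as pd

# Hypothetical step durations in seconds; NaN stands for a step with no correct attempt
toy_csd = pd.Series([5.0, 12.0, np.nan, 30.0, np.nan, 13.0])

# mean() and describe() both skip NaN by default
toy_mean = toy_csd.mean()
toy_count = int(toy_csd.describe()['count'])

print('Mean over correct steps: {}'.format(toy_mean))
print('Steps counted: {} of {}'.format(toy_count, len(toy_csd)))
```

Only four of the six toy entries contribute to the mean, which is why the count reported by describe() can be smaller than the number of rows.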
Let's histogram this data to see the distribution.
%matplotlib inline
hist = plt.hist(np.array(csd.dropna()), bins=100, density=True, log=False, range=(0, 100))
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Fraction')
plt.show()
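# hist[0] holds the normalized bin heights, hist[1] the bin edges
counts, bins = hist[0], hist[1]
# Multiply by the bin widths so the running sum is a cumulative fraction
cdf = np.cumsum(counts * np.diff(bins))
plt.plot(bins[1:], cdf)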
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Cumulative fraction')
plt.axis((0,100,0,1.0))
plt.show()
The histogram shows visually what the summary statistics only hint at. The distribution is heavily weighted towards steps that are solved in under 20 seconds. The cumulative distribution function (CDF) shows that roughly 80% of correctly solved steps are finished within 20 seconds, and about 90% within 40 seconds. Almost none take longer than 80 seconds.
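The fractions read off the CDF plot can also be computed directly, without going through a histogram. A minimal sketch on made-up durations; `fraction_within` is a hypothetical helper introduced here for illustration, not part of the notebook or of any library:

```python
import numpy as np

def fraction_within(durations, threshold):
    """Fraction of non-missing durations that are <= threshold."""
    values = np.asarray(durations, dtype=float)
    values = values[~np.isnan(values)]  # drop steps with no correct attempt
    return float(np.mean(values <= threshold))

# Hypothetical durations in seconds (not from the challenge data)
toy_durations = [3, 8, 15, 19, 22, 35, 41, 90, np.nan, 7]

for t in (20, 40, 80):
    print('Within {} sec: {:.2f}'.format(t, fraction_within(toy_durations, t)))
```

One difference worth keeping in mind: the histogram is restricted to the range 0 to 100 seconds, so a CDF built from it is normalized over durations inside that range only, whereas this direct computation is normalized over all non-missing durations.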
OK, let's ask a slightly harder question: how are students doing problem by problem? The answer will take several parts.
First, let's get the number of unique problems.
# The unique identifier for each problem is the 'Problem Name'
problems = traindata['Problem Name']
# Get just the uniques
problems = np.unique(problems)
print('Number of unique problems: {}'.format(len(problems)))
# Median correct-step duration for each problem
pmedian_times = {}
for p in problems:
    pmedian_times[p] = traindata[traindata['Problem Name'] == p]['Correct Step Duration (sec)'].median()
import operator
# Sort the problems by median duration, slowest first
sorted_times = sorted(pmedian_times.items(), key=operator.itemgetter(1), reverse=True)
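The per-problem median and the sort can also be expressed with a pandas groupby, which avoids filtering the full table once per problem. A minimal sketch on a made-up miniature of the training table; the problem names and durations are hypothetical, and only the two column names come from the data used above:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training table (made-up names and values)
toy = pd.DataFrame({
    'Problem Name': ['P1', 'P1', 'P2', 'P2', 'P2', 'P3'],
    'Correct Step Duration (sec)': [10.0, 20.0, 5.0, np.nan, 7.0, 40.0],
})

# Median duration per problem (NaN is skipped), sorted slowest first
toy_medians = (toy.groupby('Problem Name')['Correct Step Duration (sec)']
                  .median()
                  .sort_values(ascending=False))
print(toy_medians)
```

The result is a Series indexed by problem name, so the slowest problems can be inspected with `toy_medians.head()`.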
traindata.columns
traindata['Step Name']