Predicting Student Performance

A data science experiment using data from the KDD 2010 Educational Data Mining Challenge

The aim of this IPython Notebook is to show how we can use Python to build predictive algorithms that solve data science problems in the arena of education.

This notebook is still heavily under construction.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
In [2]:
# Get the data: Algebra 2005-2006
train_filepath = 'data/algebra0506/algebra_2005_2006_train.txt'
test_filepath  = 'data/algebra0506/algebra_2005_2006_test.txt'
traindata = pd.read_table(train_filepath)

More information on the data format can be found on the challenge website.

In [3]:
# Inspect some of the training data
traindata.head()
Out[3]:
| | Row | Anon Student Id | Problem Hierarchy | Problem Name | Problem View | Step Name | Step Start Time | First Transaction Time | Correct Transaction Time | Step End Time | Step Duration (sec) | Correct Step Duration (sec) | Error Step Duration (sec) | Correct First Attempt | Incorrects | Hints | Corrects | KC(Default) | Opportunity(Default) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | 3(x+2) = 15 | 2005-09-09 12:24:35.0 | 2005-09-09 12:24:49.0 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:15.0 | 40 | NaN | 40 | 0 | 2 | 3 | 1 | [SkillRule: Eliminate Parens; {CLT nested; CLT... | 1 |
| 1 | 2 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | x+2 = 5 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 16 | 16 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 1~~1 |
| 2 | 3 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | 2-8y = -4 | 2005-09-09 12:25:36.0 | 2005-09-09 12:25:43.0 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:12.0 | 36 | NaN | 36 | 0 | 2 | 3 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 2 |
| 3 | 4 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | -8y = -6 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 22 | 22 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove coefficient; {ax+b=c, divid... | 1~~1 |
| 4 | 5 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 2 | -7y-5 = -4 | 2005-09-09 12:26:38.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 118 | 118 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 3~~1 |

5 rows × 19 columns
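
The last two columns can pack several values into one string: opportunity counts such as 1~~1 appear to be joined with `~~`, presumably one count per knowledge component listed in KC(Default). A minimal sketch of unpacking such a field, assuming `~~` is the separator; the helper name `split_multi` is hypothetical and the sample values only mimic the style of the table above:

```python
def split_multi(value, sep='~~'):
    """Split a packed field such as '1~~1' into its parts.

    Missing values (None or NaN) yield an empty list.
    """
    if value is None or value != value:  # NaN is the only value unequal to itself
        return []
    return [part.strip() for part in str(value).split(sep)]


# Sample opportunity values in the style of the table above
print(split_multi('3~~1'))        # ['3', '1']
print(split_multi('2'))           # ['2']
print(split_multi(float('nan')))  # []
```

On the real data the same helper could be applied column-wise, for example with `traindata['Opportunity(Default)'].apply(split_multi)`.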

Let's begin by asking some basic questions of the data.

How many students are interacting with the system?

In [4]:
# Take the column of anonymized student IDs and count the number of unique entries
# (print is written as a function call so the cell runs on both Python 2 and 3)
print('Number of students: {}'.format(len(np.unique(traindata['Anon Student Id']))))
Number of students: 574
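
pandas can produce the same count directly, and a per-student tally of rows shows how many steps each student contributed. A small sketch on made-up rows (the student IDs below are invented for illustration and are not taken from the data set):

```python
import pandas as pd

# Toy stand-in for traindata; the student IDs are invented
toy = pd.DataFrame({'Anon Student Id': ['s1', 's1', 's2', 's3', 's3', 's3']})

n_students = toy['Anon Student Id'].nunique()               # number of distinct IDs
steps_per_student = toy['Anon Student Id'].value_counts()   # rows recorded per ID

print(n_students)
print(steps_per_student)
```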

How long does it take a student to solve any problem step on average?

In [5]:
csd = traindata['Correct Step Duration (sec)']
csd.describe()
Out[5]:
count    620129.000000
mean         18.071478
std          34.796694
min           0.000000
25%           5.000000
50%           8.000000
75%          18.000000
max        1907.000000
Name: Correct Step Duration (sec), dtype: float64

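So, ignoring the steps that were not solved correctly, the average duration of a problem step was about 18 seconds. The median is only 8 seconds, so the mean is being pulled up by a small number of very slow steps (the maximum is 1907 seconds).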
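
Note that `describe()` skips missing entries, which is why the count above (620129) is smaller than the 809694 rows in the training set. A minimal sketch of how missing values and outliers affect these summaries, using invented durations rather than the real column:

```python
import numpy as np
import pandas as pd

# Toy durations in seconds; NaN marks a step without a correct duration
durations = pd.Series([5.0, 8.0, np.nan, 18.0, 120.0, np.nan])

print(durations.count())         # counts non-missing entries only
print(durations.notna().mean())  # fraction of rows that have a duration
print(durations.mean())          # pulled upward by the 120 s outlier
print(durations.median())        # much less sensitive to the outlier
```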

Let's histogram this data to see the distribution.

In [6]:
%matplotlib inline
# density=True replaces the removed normed=True; with 1-second bins the bar heights sum to 1
hist = plt.hist(csd.dropna().values, bins=100, density=True, range=(0, 100))
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Fraction')
plt.show()
In [7]:
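counts, bins = hist[0], hist[1]
# Multiply each density by its bin width so the running sum is a cumulative fraction
cdf = np.cumsum(counts * np.diff(bins))
plt.plot(bins[1:], cdf)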
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Cumulative fraction')
plt.axis((0,100,0,1.0))
plt.show()

The histogram shows visually what the summary statistics only hint at. The distribution is heavily weighted towards steps that are solved in under 20 seconds. The cumulative distribution function (CDF) shows that roughly 80% of correctly solved steps are completed within 20 seconds, and about 90% within 40 seconds. Almost none take longer than 80 seconds. Note that the histogram is normalized over the range 0 to 100 seconds, so these fractions are relative to the steps finished within 100 seconds.
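
The cumulative fractions can also be computed straight from the durations, without binning or restricting the range, so that steps slower than 100 seconds stay in the denominator. A sketch under that assumption; the function name `fraction_within` and the toy values are illustrative and not part of the original notebook:

```python
import numpy as np

def fraction_within(durations, threshold):
    """Fraction of non-missing durations that are <= threshold."""
    values = np.asarray(durations, dtype=float)
    values = values[~np.isnan(values)]   # drop missing durations
    return float(np.mean(values <= threshold))

# Toy durations in seconds (invented for illustration)
toy = [3, 5, 8, 8, 12, 18, 25, 40, 150, np.nan]

for t in (20, 40, 100):
    print(t, fraction_within(toy, t))
```

On the real data this would be called as `fraction_within(csd, 20)`.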

Completion time by problem

OK, let's ask a slightly harder question: how are students doing problem by problem? The answer will take several parts.

First, let's get the number of unique problems.

In [8]:
# The unique identifier for each problem is the 'Problem Name'
problems = traindata['Problem Name']
In [9]:
# Get just the uniques
problems = np.unique(problems)
print('Number of unique problems: {}'.format(len(problems)))
Number of unique problems: 1084
In [10]:
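# Median time to a correct answer for each problem, computed in a single grouped pass
# instead of filtering the full table once per problem
pmedian_times = traindata.groupby('Problem Name')['Correct Step Duration (sec)'].median().to_dict()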
In [11]:
import operator
In [12]:
# Rank problems from slowest to fastest median time; .items() works on Python 2 and 3
sorted_times = sorted(pmedian_times.items(), key=operator.itemgetter(1), reverse=True)
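
The same ranking can be sketched with pandas alone: `groupby` computes the per-problem medians and `sort_values` orders them, placing any problem whose median is `NaN` (no correct durations at all) last by default. The problem names and durations below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for traindata with three hypothetical problems
toy = pd.DataFrame({
    'Problem Name': ['P1', 'P1', 'P2', 'P2', 'P3'],
    'Correct Step Duration (sec)': [10.0, 30.0, 5.0, 7.0, np.nan],
})

median_times = (toy.groupby('Problem Name')['Correct Step Duration (sec)']
                   .median()
                   .sort_values(ascending=False))

print(median_times.head(2))  # the two slowest problems by median time
```

Keeping the result as a Series, rather than a sorted list of tuples, makes it easy to slice the slowest problems with `head()`.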
In [13]:
traindata.columns
Out[13]:
Index([u'Row', u'Anon Student Id', u'Problem Hierarchy', u'Problem Name', u'Problem View', u'Step Name', u'Step Start Time', u'First Transaction Time', u'Correct Transaction Time', u'Step End Time', u'Step Duration (sec)', u'Correct Step Duration (sec)', u'Error Step Duration (sec)', u'Correct First Attempt', u'Incorrects', u'Hints', u'Corrects', u'KC(Default)', u'Opportunity(Default)'], dtype='object')
In [14]:
traindata['Step Name']
Out[14]:
0        3(x+2) = 15
1            x+2 = 5
2          2-8y = -4
3           -8y = -6
4         -7y-5 = -4
5            -7y = 1
6           7y+4 = 7
7             7y = 3
8         -5+9y = -6
9            9y = -1
10        -7-3x = -2
11    -7-3x+7 = -2+7
12           -3x = 5
13     -3x/-3 = 5/-3
14         -9 = 8y+9
...
809679             -4x = 5
809680       -4x/-4 = 5/-4
809681            x = 5/-4
809682          0 = -1y-10
809683    0+10 = -1y-10+10
809684      10 = -1y-10+10
809685            10 = -1y
809686      10/-1 = -1y/-1
809687        -10 = -1y/-1
809688           -7+2x = 4
809689       -7+2x+7 = 4+7
809690            2x = 4+7
809691             2x = 11
809692         2x/2 = 11/2
809693           -2+5x = 8
Name: Step Name, Length: 809694, dtype: object
In [17]: