Predicting Student Performance

A data science experiment using data from the KDD 2010 Educational Data Mining Challenge

The aim of this IPython Notebook is to show how we can use Python to build predictive algorithms that solve data science problems in the arena of education.

This notebook is still heavily under construction.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Get the data: Algebra 2005-2006
train_filepath = 'data/algebra0506/algebra_2005_2006_train.txt'
test_filepath  = 'data/algebra0506/algebra_2005_2006_test.txt'
traindata = pd.read_table(train_filepath)

More information on the data format can be found on the challenge website.

# Inspect some of the training data
traindata.head()

Let's begin by asking some basic questions of the data.

How many students are interacting with the system?

# Take the column of anonymized student IDs and count the number of unique entries
print('Number of students: %d' % len(np.unique(traindata['Anon Student Id'])))

Number of students: 574
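As an aside, pandas can count distinct values directly with `Series.nunique()`, and `value_counts()` shows how many rows each student contributes. The sketch below runs on a tiny made-up frame: the student IDs are hypothetical and only the column name is taken from the data set.

```python
import pandas as pd

# Hypothetical toy data; only the column name mirrors the real data set
toy = pd.DataFrame({'Anon Student Id': ['s1', 's2', 's1', 's3', 's2', 's1']})

# Number of distinct student IDs
n_students = toy['Anon Student Id'].nunique()

# Number of rows (problem steps) logged per student, most active first
rows_per_student = toy['Anon Student Id'].value_counts()

print('Number of students: %d' % n_students)
print(rows_per_student)
```

Applied to the real data, the same two calls on `traindata['Anon Student Id']` should reproduce the student count and also reveal how unevenly the steps are spread across students.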

How long does it take a student to solve any problem step on average?

csd = traindata['Correct Step Duration (sec)']
csd.describe()

count    620129.000000
mean         18.071478
std          34.796694
min           0.000000
25%           5.000000
50%           8.000000
75%          18.000000
max        1907.000000
Name: Correct Step Duration (sec), dtype: float64

So, ignoring the problem steps with no recorded correct-step duration, the average duration of a problem step was about 18 seconds. The median is only 8 seconds, so the mean is being pulled upward by a tail of slow steps.
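The gap between the mean (about 18 seconds) and the median (8 seconds) in the summary above is typical of right-skewed data. As a small illustration, with made-up durations rather than values from the data set, a single very slow step is enough to drag the mean far above the median:

```python
import numpy as np

# Hypothetical step durations in seconds, with one very slow outlier
durations = np.array([5, 6, 8, 8, 9, 12, 18, 300], dtype=float)

mean_duration = durations.mean()          # sensitive to the outlier
median_duration = np.median(durations)    # robust to the outlier

print('mean:   %.2f s' % mean_duration)
print('median: %.2f s' % median_duration)
```

This is why the median is often the more representative "typical" value for timing data like this.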

Let's histogram this data to see the distribution.

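%matplotlib inline
# density=True replaces the removed normed=True; with 1-second bins the bar heights are fractions
hist = plt.hist(np.array(csd.dropna()), bins=100, density=True, log=False, range=(0, 100))
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Fraction')
plt.show()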

counts, bins = hist[0], hist[1]
# The bins are 1 second wide, so the cumulative sum of the bar heights is the CDF
cdf = np.cumsum(counts)
plt.plot(bins[1:], cdf)
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Cumulative fraction')
plt.axis((0, 100, 0, 1.0))
plt.show()

The histogram shows visually what the summary statistics hint at. The distribution is heavily weighted towards problem steps solved in under 20 seconds. The cumulative distribution function (CDF) shows that roughly 80% of correctly solved steps are finished within 20 seconds, and about 90% within 40 seconds. Almost none take longer than 80 seconds. Note that both plots are restricted to the range 0 to 100 seconds, so these fractions are relative to the steps completed within that window.
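Because the CDF above is built from a histogram limited to 0 to 100 seconds, it is normalized over that window only. One way to cross-check the percentages is an unbinned empirical CDF over all non-missing values. The following is a minimal sketch; the helper names `ecdf` and `fraction_within` and the toy durations are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

def ecdf(values):
    """Return the sorted non-NaN values and the fraction of observations <= each one."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])               # drop missing durations, then sort
    y = np.arange(1, len(x) + 1) / float(len(x))
    return x, y

def fraction_within(values, threshold):
    """Fraction of non-NaN observations that are <= threshold."""
    x, _ = ecdf(values)
    return np.searchsorted(x, threshold, side='right') / float(len(x))

# Hypothetical durations in seconds, including one missing value
toy = [5, 8, 8, 12, 18, 40, np.nan, 250]
x, y = ecdf(toy)
print('Fraction within 20 s: %.3f' % fraction_within(toy, 20))
```

On the real data, `x, y = ecdf(csd)` followed by `plt.plot(x, y)` would draw the full CDF, and `fraction_within(csd, 20)` would give the share of steps finished within 20 seconds without depending on the histogram range.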

Completion time by problem

OK, let's ask a slightly harder question: how are students doing problem by problem? The answer will take several parts.

First, let's get the number of unique problems:

# The unique identifier for each problem is the 'Problem Name'
problems = traindata['Problem Name']

# Keep only the unique problem names
problems = np.unique(problems)
print('Number of unique problems: %d' % len(problems))

Number of unique problems: 1084

# Median correct-step duration for each problem
pmedian_times = {}
for p in problems:
    pmedian_times[p] = traindata[traindata['Problem Name'] == p]['Correct Step Duration (sec)'].median()

import operator

# Sort the problems from slowest to fastest median time
sorted_times = sorted(pmedian_times.items(), key=operator.itemgetter(1), reverse=True)
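The loop above filters the full table once per problem, which can be slow with over a thousand problems. A possible alternative, sketched here on made-up data, is to let pandas group the rows and compute all medians in one pass. The toy problem names and durations are hypothetical; only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real data set
toy = pd.DataFrame({
    'Problem Name': ['P1', 'P1', 'P2', 'P2', 'P2', 'P3'],
    'Correct Step Duration (sec)': [10.0, 30.0, 5.0, np.nan, 7.0, np.nan],
})

# Median duration per problem; NaN durations are skipped within each group
median_by_problem = (toy.groupby('Problem Name')['Correct Step Duration (sec)']
                        .median()
                        .dropna()                      # drop problems with no correct steps
                        .sort_values(ascending=False)) # slowest problems first

print(median_by_problem)
```

Dropping the NaN medians before sorting also avoids a pitfall of sorting a dictionary whose values include NaN, since comparisons with NaN do not give a well-defined order.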

traindata.columns

Index([u'Row', u'Anon Student Id', u'Problem Hierarchy', u'Problem Name', u'Problem View', u'Step Name', u'Step Start Time', u'First Transaction Time', u'Correct Transaction Time', u'Step End Time', u'Step Duration (sec)', u'Correct Step Duration (sec)', u'Error Step Duration (sec)', u'Correct First Attempt', u'Incorrects', u'Hints', u'Corrects', u'KC(Default)', u'Opportunity(Default)'], dtype='object')

traindata['Step Name']

0        3(x+2) = 15
1            x+2 = 5
2          2-8y = -4
3           -8y = -6
4         -7y-5 = -4
5            -7y = 1
6           7y+4 = 7
7             7y = 3
8         -5+9y = -6
9            9y = -1
10        -7-3x = -2
11    -7-3x+7 = -2+7
12           -3x = 5
13     -3x/-3 = 5/-3
14         -9 = 8y+9
...
809679             -4x = 5
809680       -4x/-4 = 5/-4
809681            x = 5/-4
809682          0 = -1y-10
809683    0+10 = -1y-10+10
809684      10 = -1y-10+10
809685            10 = -1y
809686      10/-1 = -1y/-1
809687        -10 = -1y/-1
809688           -7+2x = 4
809689       -7+2x+7 = 4+7
809690            2x = 4+7
809691             2x = 11
809692         2x/2 = 11/2
809693           -2+5x = 8
Name: Step Name, Length: 809694, dtype: object

	Row	Anon Student Id	Problem Hierarchy	Problem Name	Problem View	Step Name	Step Start Time	First Transaction Time	Correct Transaction Time	Step End Time	Step Duration (sec)	Correct Step Duration (sec)	Error Step Duration (sec)	Correct First Attempt	Incorrects	Hints	Corrects	KC(Default)	Opportunity(Default)
0	1	0BrbPbwCMz	Unit ES_04, Section ES_04-1	EG4-FIXED	1	3(x+2) = 15	2005-09-09 12:24:35.0	2005-09-09 12:24:49.0	2005-09-09 12:25:15.0	2005-09-09 12:25:15.0	40	NaN	40	0	2	3	1	[SkillRule: Eliminate Parens; {CLT nested; CLT...	1
1	2	0BrbPbwCMz	Unit ES_04, Section ES_04-1	EG4-FIXED	1	x+2 = 5	2005-09-09 12:25:15.0	2005-09-09 12:25:31.0	2005-09-09 12:25:31.0	2005-09-09 12:25:31.0	16	16	NaN	1	0	0	1	[SkillRule: Remove constant; {ax+b=c, positive...	1~~1
2	3	0BrbPbwCMz	Unit ES_04, Section ES_04-1	EG40	1	2-8y = -4	2005-09-09 12:25:36.0	2005-09-09 12:25:43.0	2005-09-09 12:26:12.0	2005-09-09 12:26:12.0	36	NaN	36	0	2	3	1	[SkillRule: Remove constant; {ax+b=c, positive...	2
3	4	0BrbPbwCMz	Unit ES_04, Section ES_04-1	EG40	1	-8y = -6	2005-09-09 12:26:12.0	2005-09-09 12:26:34.0	2005-09-09 12:26:34.0	2005-09-09 12:26:34.0	22	22	NaN	1	0	0	1	[SkillRule: Remove coefficient; {ax+b=c, divid...	1~~1
4	5	0BrbPbwCMz	Unit ES_04, Section ES_04-1	EG40	2	-7y-5 = -4	2005-09-09 12:26:38.0	2005-09-09 12:28:36.0	2005-09-09 12:28:36.0	2005-09-09 12:28:36.0	118	118	NaN	1	0	0	1	[SkillRule: Remove constant; {ax+b=c, positive...	3~~1