Monday, December 28, 2015

Kaggle Titanic tutorial on www.dataquest.io

Today I started the Kaggle Titanic tutorial on Dataquest. As usual, I decided to play with the code and change some parameters of the library functions. In step 10 they use the KFold class from the sklearn.cross_validation module.
The example uses shuffle=False, which gives an accuracy of 78%. When I changed the parameter to shuffle=True I got 100%, and when I increased the n_folds parameter to 4 I got 22%.
I understood that something was going wrong, and this is actually quite scary: I was working on a small dataset, got obviously wrong results, and could do nothing about it. So I decided to track the problem down.

Here is the code from this part:

import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)

I spent about 5 hours on this :(
The problem was in this line:

accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)

The expression predictions == titanic["Survived"] returns an array of True and False values, and when you pass this result as an index to the predictions array, you get back the values from rows 0 and 1 of predictions.

With unshuffled data this works well, because the predicted Survived value in the first position is 0 and in the second position is 1, so every True adds 1 to the sum and every False adds 0. But after shuffling, both positions sometimes hold Survived = 1, and then you get 100% successful predictions; or the two values are inverted, and then you get 22% successful predictions (the complement of the correct 78%).
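
The effect can be sketched on a few made-up values (these arrays are hypothetical, not the Titanic data). The sketch assumes the comparison result is read as integer positions 0 and 1, as described above; the .astype(int) cast is added here only to force that reading, because a boolean NumPy array used directly as an index acts as a mask instead.

```python
import numpy as np

# Hypothetical toy labels, standing in for titanic["Survived"].
actual = np.array([0, 1, 1, 0, 1])


def buggy_accuracy(predictions, actual):
    # Read the True/False comparison as integer positions 1 and 0,
    # so the lookup only ever returns predictions[1] or predictions[0].
    positions = (predictions == actual).astype(int)
    return predictions[positions].sum() / len(predictions)


# predictions[0] == 0 and predictions[1] == 1: the result happens to be right.
lucky = np.array([0.0, 1.0, 0.0, 0.0, 1.0])
# predictions[0] == 1 and predictions[1] == 1: every row counts as a hit.
both_ones = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
# predictions[0] == 1 and predictions[1] == 0: hits and misses are swapped.
inverted = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

print(buggy_accuracy(lucky, actual))      # true accuracy is 4 of 5
print(buggy_accuracy(both_ones, actual))  # true accuracy is 3 of 5
print(buggy_accuracy(inverted, actual))   # true accuracy is 3 of 5
```

In the second case the formula reports a perfect score, and in the third it reports the share of wrong predictions instead of the share of right ones.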

Actually, we do not need the predictions[...] wrapper around predictions == titanic["Survived"], because sum counts the True values even though they are booleans. So for this code to work correctly, the last line should be written as:
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
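
A quick check of the corrected line on made-up values (a plain NumPy array named survived stands in for titanic["Survived"] here, and both arrays are hypothetical). Taking np.mean of the comparison is an equivalent way to write the same thing.

```python
import numpy as np

# Hypothetical toy data: 3 of the 5 predictions match the labels.
survived = np.array([0, 1, 1, 0, 1])
predictions = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

# Element-wise comparison gives one True/False per row.
matches = predictions == survived

# True counts as 1 and False as 0, so the sum is the number of hits.
accuracy = sum(matches) / len(predictions)

# Same value, written as the mean of the boolean array.
accuracy_mean = np.mean(matches)

print(accuracy, accuracy_mean)
```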