Monday, December 28, 2015

Kaggle Titanic tutorial on www.dataquest.io

Today I started the Kaggle Titanic tutorial on Dataquest. As usual, I decided to play with the code and change some parameters in the library functions. In step 10 they use the KFold class from the sklearn.cross_validation module.
They use it in the example with shuffle=False, and this gives a result of 78%. When I changed the parameter to shuffle=True, I received 100%, and when I increased n_folds to 4, I received 22%.
I understood that something was going wrong, and actually this is quite scary: I made it on a small dataset, received obviously wrong results, and could do nothing about it. So I decided to solve this problem.
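
For context, the cross-validation loop looks roughly like this. This is a sketch from memory, not the exact tutorial code: the model and the predictor columns are my assumptions, and it uses the old pre-0.18 sklearn.cross_validation API that the tutorial was written for:

import pandas as pd
from sklearn.cross_validation import KFold  # old API, removed in later sklearn versions
from sklearn.linear_model import LinearRegression

titanic = pd.read_csv("train.csv")  # the Kaggle Titanic training set

# The tutorial cleans these columns (fills missing ages, encodes sex and
# embarkation port as numbers) before this step; the column list is my assumption.
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

alg = LinearRegression()
# shuffle=False is the tutorial's setting; shuffle=True is what broke things.
kf = KFold(titanic.shape[0], n_folds=3, shuffle=False)

predictions = []
for train, test in kf:
    # Fit on the training rows of this fold and predict the held-out test rows.
    alg.fit(titanic[predictors].iloc[train, :], titanic["Survived"].iloc[train])
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)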

Here is the code from the part that computes the accuracy:

import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
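# The next line turned out to be the source of the whole problem (see below).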
accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)

I killed about five hours on this :(
The trick was in this part:

accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)

The expression predictions == titanic["Survived"] returns an array of True and False values, and when you pass this result as an index into the predictions array, the True and False values pick out the values from rows 1 and 0.

When the values are not shuffled, this works well, because in the first position you have Survived = 0 and in the second position you have Survived = 1. But when you shuffle the values, sometimes you end up with Survived = 1 in both positions, and as a result you receive 100% successful predictions; or you end up with inverted values, and then you receive 22% successful predictions.
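
To make this concrete, here is a tiny toy example of my own (not code from the tutorial). I use .astype(int) to spell out the "booleans as indices" interpretation explicitly, because whether a boolean array acts as a mask or as integer indices has changed between numpy versions:

import numpy as np

predictions = np.array([1., 0., 1., 0.])
actual      = np.array([1., 1., 1., 0.])

matches = predictions == actual  # [ True, False,  True,  True]

# The buggy behavior: True/False used as indices 1/0, so the result
# depends only on the first two prediction values.
picked = predictions[matches.astype(int)]  # predictions[[1, 0, 1, 1]] -> [0., 1., 0., 0.]
print(sum(picked) / len(predictions))      # 0.25 -- nonsense

# The correct accuracy: booleans already count as ones and zeros when summed.
print(sum(matches) / len(predictions))     # 0.75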

Actually, we do not need this indexing wrapper around predictions == titanic["Survived"] at all, because boolean values are already counted as ones and zeros when summed. So for this code to work correctly, the last line should be written as:
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
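
An equivalent and, to my taste, clearer way to write it uses the fact that the mean of a boolean array is exactly the fraction of True values (this is just my alternative, not code from the tutorial):

import numpy as np

# Same predictions array and titanic DataFrame as above.
accuracy = np.mean(predictions == titanic["Survived"])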

Tuesday, October 27, 2015

This is a very cool course
but it has a very detailed walkthrough of data visualization and then a very sudden and quick plunge into machine learning. The usual thing for a lot of courses: a slow and very detailed explanation of simple things, and then a sudden switch to a fast and quite superficial explanation of hard topics.

Maybe it's just because of my lack of math knowledge. Today I gave up and started to listen to the statistics lessons of Brandon Foltz, and in the middle of his lecture I started to look for information about logarithms.
It was really scary to see this whole mountain of information and the tons of lectures on Khan Academy. I know that I shouldn't look too far ahead and compare myself with other people, but this is very difficult to do. There are always some bad thoughts about how realistic my desire to work in the machine learning sphere is: Am I too old for this? Is my lack of math OK for this domain? Is it possible to learn all of high-school and undergraduate math within a realistic timeframe?

I have tried to take a "beginner" course on machine learning. Actually, I have already tried it twice. The first time was some years ago, and I gave up after the first hour of lectures. This time was a little better: I did not give up until the fourth week, but after the third week I realized that I was completing the tasks without really understanding what I was doing. It was painful after four weeks, but I decided to quit the course until I have a better understanding of the basics of machine learning.
We live in a strange time. I run into some unknown topic, and within minutes I can find information and listen to explanations from professionals on this topic. Too difficult? I can find slower explanations. Too simple? I can look for more advanced lectures.

Fifteen years ago, if I wanted to learn something about logarithms, there was no other way than to go to a shop and buy some book, or to go to the library. There was no guarantee that the book would be good, or at least intelligible to me. Now, within seconds, I can find ten different teachers who discuss this topic in different manners, and I have the opportunity to choose the best teacher for me. An MIT grad or a Duke professor gives me lectures for free, and I don't even need to take notes because I can easily replay these lectures at any time.

People who live now and say that they have no money for education are just liars. There is only one possible excuse: when one can't decide what he wants to learn.