Hello all, today we will take a look at a very interesting dataset and make predictions from it. The dataset is the Titanic dataset, which is based on real data about the people who were on board the Titanic when the "unsinkable" ship sank!
The dataset has been uploaded at the following links. Make sure you download both files:
Training file: https://www.4shared.com/file/ouSVAkjTei/titanic_train.html
Testing file: https://www.4shared.com/file/anr1PN6aca/titanic_test.html
Python should be installed on your system along with recent versions of the libraries used below (pandas and scikit-learn). You can find information about installing Python and the required libraries in my previous blog posts.
So, let's begin! After downloading the datasets, open titanic_train.csv. You will see a lot of data spread over roughly 8 to 10 columns. You will also see that many entries are missing or only partially filled in. This is a problem you will face often when working with Machine Learning. We need to fix it, but first let's identify the features and the labels of the data.
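A quick way to see which columns are incomplete is to count the missing values per column. The sketch below uses a tiny hand-made table; the values are invented for illustration and the column names are only assumed to match titanic_train.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_train.csv; the values are invented for illustration.
sample = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd"],
    "age": [29.0, np.nan, 2.0, np.nan],
    "sex": ["female", "male", "male", "female"],
    "survived": [1, 0, 1, 0],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```

On the real file you would call the same `isna().sum()` on the DataFrame returned by `pd.read_csv`.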
The features of a dataset are the columns that play an important role in predicting the answer, which is called the label. The features we use from this dataset are {row_names, passenger class, age, gender}. As we know from the Titanic movie, women and children were put on the rescue boats before men, so men had a lower chance of survival. That is why age and gender are included in the features. Passenger class also played an important role in survival, because first-class travellers had their rooms on the upper levels, which the water reached last. To sum up, children and women travelling in first class had the highest chance of survival. But we don't know exactly what the survival percentage was for each group, and this is where Machine Learning helps us.
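Before training anything, the survival percentage for each group can be checked directly with a group-by. This is only a sketch on a toy table whose values are invented for illustration; the column names `sex`, `pclass` and `survived` are assumed to match the real file.

```python
import pandas as pd

# Toy data, invented for illustration; the real file has many more rows.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": ["1st", "3rd", "1st", "3rd", "3rd", "1st"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_group = toy.groupby(["sex", "pclass"])["survived"].mean()
print(rate_by_sex)
print(rate_by_group)
```

A table like this gives a feel for the data, while the classifier learns how the features combine.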
Let us begin with the Python code:
1) Import the necessary Python libraries
import pandas as pd
pd.options.mode.chained_assignment = None
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
2) Place titanic_train.csv and titanic_test.csv in the working folder where your Python script is located. Now load titanic_train.csv with pandas:
df = pd.read_csv("titanic_train.csv")
3) Now that we have loaded the whole CSV file, we know that not every column can be a feature, so we extract only the columns needed for our prediction. Taking a copy lets us modify these columns without touching the original DataFrame.
features = df[['row_names','pclass','age','sex']].copy()
4) Machine Learning algorithms cannot do calculations on string or character data, so we need to convert whatever text values we have into numbers.
# Encode sex as 0 (male) / 1 (female) and passenger class as 1, 2, 3.
features['sex'] = features['sex'].map({'male': 0, 'female': 1})
features['pclass'] = features['pclass'].map({'1st': 1, '2nd': 2, '3rd': 3})
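The text-to-number conversion can be tried out on a small table first. The sketch below uses dictionaries with `Series.map`; the dictionary names `sex_codes` and `class_codes` and the toy rows are made up for this example.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "pclass": ["3rd", "1st", "2nd"],
})

sex_codes = {"male": 0, "female": 1}
class_codes = {"1st": 1, "2nd": 2, "3rd": 3}

encoded = toy.copy()
encoded["sex"] = encoded["sex"].map(sex_codes)
encoded["pclass"] = encoded["pclass"].map(class_codes)

# map() leaves NaN for any value that is not in the dictionary,
# so this check catches unexpected spellings in the data.
assert encoded.notna().all().all()
print(encoded)
```

One thing to keep in mind is that `map` turns unknown values into NaN, which is why the check at the end is useful.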
5) Age is one of the important features in the dataset, but if we look at the age column we can see that a lot of values are missing. We cannot leave the gaps as they are, and we do not want to throw away those rows either. One solution is to compute the average of the ages we do have and fill the blanks with that average.
mean = int(features['age'].mean())
features['age'] = features['age'].fillna(mean)
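To see what mean filling does, here is a minimal sketch on a made-up list of ages (the numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Five toy ages, two of them missing.
ages = pd.Series([20.0, np.nan, 40.0, np.nan, 30.0], name="age")

mean_age = ages.mean()           # missing values are skipped by default
filled = ages.fillna(mean_age)   # returns a new Series with the gaps filled
print(mean_age)
print(filled.tolist())
```

The known ages stay untouched and only the gaps receive the average.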
6) Now our data is cleaned, although part of it (the filled-in ages) is only an estimate. The next step is to identify the label, which in our case is the 'survived' column of the dataset.
label = df['survived']
7) Now that our features and labels are ready, we can shuffle and split the data. Shuffling makes sure the order of the rows does not influence the result, and splitting keeps some rows aside that the algorithm never sees during training. We split the data 90/10, that is 90% for training and 10% for testing.
x_train,x_test,y_train,y_test = train_test_split(features,label,test_size=0.1)
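Because the split is random, every run gives a slightly different result. The sketch below shows one way to make it repeatable on toy arrays. The `random_state` and `stratify` arguments are additions that are not part of the call above, and the data is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # 20 toy rows with 2 features each
y = np.array([0, 1] * 10)          # 10 rows of each class

# random_state fixes the shuffle; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

With 20 rows and `test_size=0.1`, 2 rows go to the test part and 18 to the training part.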
8) Finally, we create our classifier, call its 'fit()' method to train it and its 'score()' method to get the accuracy, which in our case is around 75-85%. Because the split is random, the exact figure changes from run to run. The accuracy is not as high as on some other datasets because a lot of data is missing here, so we had to fill in the mean age. The classifier we are using is a random forest classifier.
clf = RandomForestClassifier()
clf.fit(x_train,y_train)
Accuracy = clf.score(x_test,y_test)
print(Accuracy)
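A single 10% test split is small, which is one reason the accuracy moves around between runs. Cross-validation is one way to get a steadier estimate. The sketch below uses synthetic data from `make_classification` as a stand-in; the values of `n_estimators`, `cv` and `random_state` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; with the real file you would pass the
# cleaned feature columns and the 'survived' column instead.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # accuracy on 5 held-out folds
print(scores.mean(), scores.std())
```

The mean of the five scores is usually a more stable number to report than the score from one split.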
9) Printing the accuracy tells us how well the classifier did on the 10% of rows we kept aside. Now we turn to the second file, titanic_test.csv. We need to clean it in the same way as the first file. It will be used to make predictions on unseen data, since our classifier was trained on different data. Load the file, select the same feature columns (assuming it has the same column names as the training file) and repeat the cleaning steps:
pred_df = pd.read_csv("titanic_test.csv")
pred_feat = pred_df[['row_names','pclass','age','sex']].copy()
pred_feat['sex'] = pred_feat['sex'].map({'male': 0, 'female': 1})
pred_feat['pclass'] = pred_feat['pclass'].map({'1st': 1, '2nd': 2, '3rd': 3})
mean2 = int(pred_feat['age'].mean())
pred_feat['age'] = pred_feat['age'].fillna(mean2)
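Since the same cleaning steps are applied to both files, one option is to wrap them in a small function. This is only a sketch: `prepare_features` and its `age_fill` parameter are names made up here, and the toy table stands in for the real CSV files.

```python
import numpy as np
import pandas as pd

def prepare_features(df, age_fill=None):
    """Select and encode the feature columns (hypothetical helper)."""
    feats = df[["row_names", "pclass", "age", "sex"]].copy()
    feats["sex"] = feats["sex"].map({"male": 0, "female": 1})
    feats["pclass"] = feats["pclass"].map({"1st": 1, "2nd": 2, "3rd": 3})
    if age_fill is None:
        # Fall back to the mean age of the table being cleaned.
        age_fill = feats["age"].mean()
    feats["age"] = feats["age"].fillna(age_fill)
    return feats

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "row_names": [1, 2, 3],
    "pclass": ["1st", "3rd", "2nd"],
    "age": [30.0, np.nan, 10.0],
    "sex": ["female", "male", "male"],
    "survived": [1, 0, 0],
})

prepared = prepare_features(toy)
print(prepared)
```

Passing the mean age of the training file as `age_fill` when cleaning the second file would keep the two files consistent, if you prefer that over computing a separate mean.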
10) Finally, we call the 'predict()' method to get predictions for the data from the second file. We can then check the answers manually to see whether our classifier predicted them properly or not.
Answer = clf.predict(pred_feat.head(15))
print(Answer)
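Checking by hand is easier when each prediction sits next to the passenger it belongs to. The sketch below does this on toy data that is invented for illustration. The `report` table and the `p_survived` column are names made up here, and `predict_proba` is used in addition to the `predict()` call above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy, already-encoded training rows (invented for illustration).
train = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 2, 3],
    "age": [30, 4, 40, 25, 35, 50],
    "sex": [1, 1, 0, 0, 1, 0],
})
labels = [1, 1, 0, 0, 1, 0]
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(train, labels)

# Two unseen passengers with the same columns as the training rows.
unseen = pd.DataFrame({"pclass": [1, 3], "age": [28, 45], "sex": [1, 0]})

report = unseen.copy()
report["predicted"] = clf.predict(unseen)
# Column 1 of predict_proba is the estimated probability of class 1 (survived).
report["p_survived"] = clf.predict_proba(unseen)[:, 1]
print(report)
```

The probability column shows how sure the classifier is, which helps when a prediction looks surprising.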
You can find the whole code at the following GitHub link: https://github.com/codecenterorg/Machine_Learning/blob/master/titanic