
Practice Kaggle Data_Titanic

I. Practice Kaggle Data

  1. Mount Google Drive
  2. Install the Kaggle API
  3. Download a Kaggle Token
  4. Load the Titanic Data

1. Mount Google Drive

Whenever you start Google Colab, you first need to mount Google Drive.

from google.colab import drive # import the drive package
from os.path import join

ROOT = "/content/drive" # base path of Google Drive
print(ROOT) # print ROOT (optional)
drive.mount(ROOT) # mount Google Drive at the base path

MY_GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/Python/python/practice' # project path
PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH) # full project path
print(PROJECT_PATH)
/content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks/Python/python/practice

If the code below runs without an error, you can go ahead and load the data.

%cd "{PROJECT_PATH}"
/content/drive/My Drive/Colab Notebooks/Python/python/practice

2. Install the Kaggle API

Run the code below to install the Kaggle API in Google Colab.

!pip install kaggle # the ! prefix is required when installing packages in Google Colab
Requirement already satisfied: kaggle in /usr/local/lib/python3.6/dist-packages (1.5.9)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.41.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.0.1)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from kaggle) (2020.6.20)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.8.1)
Requirement already satisfied: slugify in /usr/local/lib/python3.6/dist-packages (from kaggle) (0.0.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (2.10)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.6/dist-packages (from python-slugify->kaggle) (1.3)

3. Download a Kaggle Token

Download an API token from Kaggle.
Go to [Kaggle] - [My Account] - [API] - [Create New API Token]; a kaggle.json file will be downloaded.
Move the file to your desktop, then run the code below.

from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))

# move kaggle.json into the folder below and give the file the proper permissions
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json



Upload widget is only available when the cell has been executed in the
current browser session. Please rerun this cell to enable.

Saving kaggle.json to kaggle.json
uploaded file "kaggle.json" with length 62 bytes

If the code below runs without an error message, the json file has been uploaded successfully.

ls -1ha ~/.kaggle/kaggle.json
/root/.kaggle/kaggle.json

4. Load the Titanic Data

First, list the available Kaggle competitions.

!kaggle competitions list
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.9 / client 1.5.4)
ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
contradictory-my-dear-watson                   2030-07-01 23:59:00  Getting Started     Prizes        134           False  
gan-getting-started                            2030-07-01 23:59:00  Getting Started     Prizes        185           False  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started  Knowledge        315           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       2356           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      18058            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge       4536            True  
connectx                                       2030-01-01 00:00:00  Getting Started  Knowledge        390           False  
nlp-getting-started                            2030-01-01 00:00:00  Getting Started  Knowledge       1184           False  
rock-paper-scissors                            2021-02-01 23:59:00  Playground          Prizes        152           False  
riiid-test-answer-prediction                   2021-01-07 23:59:00  Featured          $100,000       1466           False  
nfl-big-data-bowl-2021                         2021-01-05 23:59:00  Analytics         $100,000          0           False  
competitive-data-science-predict-future-sales  2020-12-31 23:59:00  Playground           Kudos       9343           False  
halite-iv-playground-edition                   2020-12-31 23:59:00  Playground       Knowledge         43           False  
predict-volcanic-eruptions-ingv-oe             2020-12-28 23:59:00  Playground            Swag        193           False  
hashcode-drone-delivery                        2020-12-14 23:59:00  Playground       Knowledge         79           False  
cdp-unlocking-climate-solutions                2020-12-02 23:59:00  Analytics          $91,000          0           False  
lish-moa                                       2020-11-30 23:59:00  Research           $30,000       3395           False  
google-football                                2020-11-30 23:59:00  Featured            $6,000        916           False  
conways-reverse-game-of-life-2020              2020-11-30 23:59:00  Playground            Swag        131           False  
lyft-motion-prediction-autonomous-vehicles     2020-11-25 23:59:00  Featured           $30,000        778           False  

From the competition list above, download the dataset of the competition you want.

# Practice: download the Titanic data
!kaggle competitions download -c titanic
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.9 / client 1.5.4)
Downloading train.csv to /content/drive/My Drive/Colab Notebooks/Python/python/practice/data
  0% 0.00/59.8k [00:00<?, ?B/s]
100% 59.8k/59.8k [00:00<00:00, 8.18MB/s]
Downloading gender_submission.csv to /content/drive/My Drive/Colab Notebooks/Python/python/practice/data
  0% 0.00/3.18k [00:00<?, ?B/s]
100% 3.18k/3.18k [00:00<00:00, 443kB/s]
Downloading test.csv to /content/drive/My Drive/Colab Notebooks/Python/python/practice/data
  0% 0.00/28.0k [00:00<?, ?B/s]
100% 28.0k/28.0k [00:00<00:00, 3.97MB/s]
!ls # ls: Linux command that lists all files in the current directory
gender_submission.csv  test.csv  train.csv

A total of three data files were downloaded.

  • gender_submission.csv
  • test.csv
  • train.csv
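Note: the read_csv calls in the next section load these files from a data/ subfolder of the project path. As a hedged aside, the Kaggle CLI can download straight into such a folder with the -p option (newer CLI versions may instead save a single zip file that needs to be unzipped first):

!kaggle competitions download -c titanic -p data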

II. Kaggle Data Practice: Titanic

  1. Exploring the Data
  2. Transcribing a Kaggle Kernel

1. Exploring the Data

Run the code below to import the packages needed for EDA.

import pandas as pd # data manipulation and transformation
import pandas_profiling # automated EDA reports
import numpy as np # numerical operations, arrays and matrices
import matplotlib as mpl # visualization
import matplotlib.pyplot as plt # visualization
from matplotlib.pyplot import figure # visualization
import seaborn as sns

from IPython.core.display import display, HTML
from pandas_profiling import ProfileReport
%matplotlib inline
import matplotlib.pylab as plt

plt.rcParams["figure.figsize"] = (14,4)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.color'] = 'r'
plt.rcParams['axes.grid'] = True

(1) Data Collection

  • gender_submission.csv
  • test.csv
  • train.csv
gender = pd.read_csv('data/gender_submission.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
print("data import is done")
data import is done

(2) Checking the Data
After loading Kaggle data, the first thing to check is the size of each dataset.

  • The number of variables
  • The number of numeric vs. categorical variables, etc.
    cf) The test set usually has one fewer variable than the train set.
gender.shape, train.shape, test.shape
((418, 2), (891, 12), (418, 11))
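To see exactly which column is missing from test, the column sets can be compared directly (a minimal sketch):

set(train.columns) - set(test.columns) # expected: {'Survived'}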
# look at the first 5 rows of the train data
display(train.head())

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Next, separate the numerical and categorical variables.

  • Selecting numeric_features
numeric_features = train.select_dtypes(include=[np.number])
print(numeric_features.columns)
print("The total number of numeric features are: ", len(numeric_features.columns))

numeric_features = test.select_dtypes(include=[np.number])
print(numeric_features.columns)
print("The total number of numeric features are: ", len(numeric_features.columns))
Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')
The total number of numeric features are:  7
Index(['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')
The total number of numeric features are:  6

The train set has 7 numeric variables and the test set has 6; the difference is the target variable Survived, which exists only in train.

  • Extracting the variables not included in numeric_features
categorical_features = train.select_dtypes(exclude=[np.number])
print(categorical_features.columns)
print("The total number of non numeric features are: ", len(categorical_features.columns))

categorical_features = test.select_dtypes(exclude=[np.number])
print(categorical_features.columns)
print("The total number of non numeric features are: ", len(categorical_features.columns))
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
The total number of non numeric features are:  5
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
The total number of non numeric features are:  5

Both the train and test sets have 5 non-numeric variables.
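pandas_profiling was imported above but not used yet; as a hedged, minimal sketch, an automated EDA report for the train set could be generated like this (the title and output file name are arbitrary choices):

profile = ProfileReport(train, title="Titanic Train EDA") # build the report object
profile.to_file("titanic_train_report.html") # save the report as an HTML file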

2. Transcribing a Kaggle Kernel

- Reference kernel: EDA To Prediction (DieTanic)

Part 1: Exploratory Data Analysis (EDA)

  1. Analysis of the features
  2. Finding any relations or trends considering multiple features
# import the packages used below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
train.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train.isnull().sum() # checking for total null values
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Age, Cabin and Embarked features have null values.
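To put these counts in perspective, the share of missing values per column can also be checked (a small sketch):

train.isnull().mean() * 100 # percentage of missing values per column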

How many Survived?

f, ax = plt.subplots(1, 2, figsize = (18, 8))
train['Survived'].value_counts().plot.pie(explode = [0, 0.1], autopct = '%1.1f%%', ax = ax[0], shadow = True)
ax[0].set_title('Survived') # title of the left plot
ax[0].set_ylabel('')
sns.countplot('Survived', data=train, ax=ax[1])
ax[1].set_title('Survived')
plt.show()

[Figure: pie chart and count plot of Survived]

It is evident that not many passengers survived the accident.

Out of 891 passengers in the training set, only around 350 survived, i.e. only 38.4% of the total training set survived the crash. We need to dig deeper into the data to see which categories of passengers survived and which did not.
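The exact proportion can be checked directly (a small sketch):

train['Survived'].value_counts(normalize=True) # roughly 0.616 did not survive, 0.384 survived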

We will try to check the survival rate using the different features of the dataset, such as Sex, Port of Embarkation, Age, etc.

First let us understand the different types of features.

**Types Of Features**

  • Categorical Features:
    A categorical variable is one that has two or more categories, and each value in the feature can be assigned to one of them. For example, gender is a categorical variable with two categories (male and female). We cannot sort or impose any ordering on such variables. They are also known as Nominal Variables.

  • Categorical Features in the dataset: Sex, Embarked


  • Ordinal Features:
    An ordinal variable is similar to a categorical variable, but the difference is that the values have a relative ordering. For example, if we have a feature like Height with values Tall, Medium, Short, then Height is an ordinal variable, because the values can be sorted relative to each other.

  • Ordinal Features in the dataset: Pclass


  • Continuous Feature:
    A feature is said to be continuous if it can take any value between two points, or between the minimum and maximum values in the feature column.

  • Continuous Features in the dataset: Age (a small encoding sketch for these feature types follows this list)
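As a brief, hedged illustration of how each feature type is commonly prepared for modeling (this is not part of the original kernel; the bin edges are illustrative):

# Hedged sketch: typical handling per feature type (column names from the Titanic data)
embarked_dummies = pd.get_dummies(train['Embarked'], prefix='Embarked') # nominal -> one-hot columns
pclass_ordinal = train['Pclass'] # ordinal -> keep the integer ordering as-is
age_band = pd.cut(train['Age'], bins=[0, 16, 32, 48, 64, 80]) # continuous -> bin into age ranges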

**Analysing The Features**

Sex -> Categorical Feature

train.groupby(['Sex', 'Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
f, ax = plt.subplots(1, 2, figsize = (18, 8))
train[['Sex', 'Survived']].groupby(['Sex']).mean().plot.bar(ax = ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot('Sex', hue='Survived', data=train, ax = ax[1])
ax[1].set_title('Sex: Survived vs Dead')
plt.show()

[Figure: survival rate by Sex (bar chart) and Survived vs Dead counts by Sex]

This looks interesting. The number of men on the ship is a lot higher than the number of women. Still, the number of women saved is almost twice the number of men saved. The survival rate for women on the ship is around 75%, while for men it is around 18-19%.
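These rates can be confirmed numerically (a small sketch):

train.groupby('Sex')['Survived'].mean() # female ~0.74, male ~0.19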

This looks to be a very important feature for modeling. But is it the best? Let’s check other features.

**Pclass -> Ordinal Feature**

# number of dead/survived passengers per class
pd.crosstab(train.Pclass, train.Survived, margins=True).style.background_gradient(cmap = 'summer_r')
Survived    0    1  All
Pclass
1          80  136  216
2          97   87  184
3         372  119  491
All       549  342  891
f, ax=plt.subplots(1, 2, figsize = (18, 8))
train['Pclass'].value_counts().plot.bar(color = ['#CD7F32', '#FFDF00', '#D3D3D3'], ax = ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
sns.countplot('Pclass', hue='Survived', data=train, ax = ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()

[Figure: number of passengers by Pclass and Survived vs Dead counts by Pclass]

People say money can't buy everything. But we can clearly see that passengers of Pclass 1 were given very high priority during the rescue. Even though the number of passengers in Pclass 3 was much higher, their survival rate was very low, around 25%.

For Pclass 1 the survival rate is around 63%, while for Pclass 2 it is around 48%. So money and status matter. Such a materialistic world.
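Again, the exact rates can be verified (a small sketch):

train.groupby('Pclass')['Survived'].mean() # roughly 0.63, 0.47 and 0.24 for Pclass 1, 2 and 3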

Let's dive in a little more and check for other interesting observations, starting with the survival rate by Sex and Pclass together.

# number of Survived passengers grouped by Sex and Pclass
pd.crosstab([train.Sex, train.Survived], train.Pclass, margins=True).style.background_gradient(cmap='summer_r')
Pclass            1    2    3  All
Sex    Survived
female 0          3    6   72   81
       1         91   70   72  233
male   0         77   91  300  468
       1         45   17   47  109
All             216  184  491  891
sns.factorplot('Pclass', 'Survived', hue = 'Sex', data = train)
plt.show()

[Figure: factor plot of survival rate by Pclass, split by Sex]

We use a factor plot in this case, because it makes the separation of categorical values easy.

Looking at the crosstab and the factor plot, we can easily infer that survival for women from Pclass 1 is about 95-96%, as only 3 out of 94 women from Pclass 1 died.
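The same table of survival rates can be produced directly (a small sketch; pivot_table averages Survived by default):

train.pivot_table('Survived', index='Sex', columns='Pclass') # mean survival rate per Sex and Pclass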

It is evident that, irrespective of Pclass, women were given first priority during the rescue. Even men from Pclass 1 have a very low survival rate.

Looks like Pclass is also an important feature. Let’s analyse other features.

**Age -> Continuous Feature**

print('Oldest Passenger was of:', train['Age'].max(), 'Years')
print('Youngest Passenger was of:', train['Age'].min(), 'Years')
print('Average Age on the ship:', train['Age'].mean(), 'Years')
Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years
f, ax = plt.subplots(1, 2, figsize = (18, 8))
sns.violinplot("Pclass", "Age", hue="Survived", data=train, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot("Sex", "Age", hue="Survived", data=train, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()

[Figure: violin plots of Age by Pclass and by Sex, split by Survived]

Observations:

  • The number of children increases with Pclass, and the survival rate for passengers below age 10 (i.e. children) looks good irrespective of the Pclass.

  • Survival chances for passengers aged 20-50 from Pclass 1 are high, and even better for women.

  • For males, the survival chances decrease with an increase in age.


As we saw earlier, the Age feature has 177 null values. To replace these NaN values, we could assign them the mean age of the dataset.

But the problem is that the passengers had many different ages; we can't just assign a 4-year-old kid the mean age of 29 years. Is there any way to find out which age band a passenger belongs to?


We can check the Name feature. Looking at it, we can see that the names contain a salutation like Mr or Mrs. Thus we can assign the mean age of the Mr and Mrs groups to the respective missing values.

"What's in a name?" -> Feature

train['Initial']=0
for i in train:
    train['Initial']=train.Name.str.extract('([A-Za-z]+)\.') # extract the Salutations

Here we use the regex '([A-Za-z]+)\.': it looks for strings made of the letters A-Z or a-z followed by a .(dot). So we successfully extract the Initials from the Name.

pd.crosstab(train.Initial, train.Sex).T.style.background_gradient(cmap='summer_r') # checking the Initials with the sex
Initial  Capt  Col  Countess  Don  Dr  Jonkheer  Lady  Major  Master  Miss  Mlle  Mme   Mr  Mrs  Ms  Rev  Sir
Sex
female      0    0         1    0   1         0     1      0       0   182     2    1    0  125   1    0    0
male        1    2         0    1   6         1     0      2      40     0     0    0  517    0   0    6    1

There are some misspelled Initials like Mlle or Mme that stand for Miss. I will replace them with Miss, and do the same for the other values.

train['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don'], ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr'], inplace=True)
train.groupby('Initial')['Age'].mean() # check the average age by Initials
Initial
Master     4.574167
Miss      21.860000
Mr        32.739609
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64

Filling NaN Ages

# assigning the NaN Values with the Ceil values of the mean ages
train.loc[(train.Age.isnull())&(train.Initial=='Mr'),'Age']=33
train.loc[(train.Age.isnull())&(train.Initial=='Mrs'),'Age']=36
train.loc[(train.Age.isnull())&(train.Initial=='Master'),'Age']=5
train.loc[(train.Age.isnull())&(train.Initial=='Miss'),'Age']=22
train.loc[(train.Age.isnull())&(train.Initial=='Other'),'Age']=46
train.Age.isnull().any() # so no null values left finally
False
f, ax=plt.subplots(1, 2, figsize=(20, 10))
train[train['Survived']==0].Age.plot.hist(ax=ax[0], bins=20, edgecolor='black', color='red')
ax[0].set_title('Survived=0')
x1=list(range(0, 85, 5))
ax[0].set_xticks(x1)
train[train['Survived']==1].Age.plot.hist(ax=ax[1], color='green', bins=20, edgecolor='black')
ax[1].set_title('Survived=1')
x2=list(range(0, 85, 5))
ax[1].set_xticks(x2)
plt.show()

[Figure: Age histograms for Survived=0 and Survived=1]

Observations:

  • Toddlers (age < 5) were saved in large numbers (the Women and Child First policy)
  • The oldest passenger (80 years) was saved
  • The maximum number of deaths was in the 30-40 age group
sns.factorplot('Pclass', 'Survived', col='Initial', data=train)
plt.show()

[Figure: factor plot of survival rate by Pclass, one panel per Initial]

The Women and Child first policy thus holds true irrespective of the class.

**Embarked -> Categorical Value**

pd.crosstab([train.Embarked, train.Pclass], [train.Sex, train.Survived], margins=True).style.background_gradient(cmap='summer_r')
Sex              female       male      All
Survived              0    1     0    1
Embarked Pclass
C        1            1   42    25   17   85
         2            0    7     8    2   17
         3            8   15    33   10   66
Q        1            0    1     1    0    2
         2            0    2     1    0    3
         3            9   24    36    3   72
S        1            2   46    51   28  127
         2            6   61    82   15  164
         3           55   33   231   34  353
All                  81  231   468  109  889

Chances for Survival by Port of Embarkation

sns.factorplot('Embarked', 'Survived', data=train)
fig=plt.gcf()
fig.set_size_inches(5, 3)
plt.show()

[Figure: factor plot of survival rate by Embarked]

The chance of survival is highest for Port C, around 0.55, while it is lowest for S.
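The rates per port can also be read off directly (a small sketch):

train.groupby('Embarked')['Survived'].agg(['mean', 'count']) # survival rate and passenger count per port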

f,ax=plt.subplots(2,2,figsize=(20,15))
sns.countplot('Embarked',data=train,ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')
sns.countplot('Embarked',hue='Sex',data=train,ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
sns.countplot('Embarked',hue='Survived',data=train,ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
sns.countplot('Embarked',hue='Pclass',data=train,ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

[Figure: count plots of Embarked vs number of passengers, Sex, Survived and Pclass]

Observations:

  • The maximum number of passengers boarded at S, the majority of them from Pclass 3.
  • The passengers from C look to be lucky, as a good proportion of them survived. The reason for this may be the rescue of all the Pclass 1 and Pclass 2 passengers.
  • Port S looks to be the port from which the majority of the rich boarded. Still, the chance of survival is low here, because around 81% of its many Pclass 3 passengers did not survive.
  • Almost 95% of the passengers from Port Q were from Pclass 3.
sns.factorplot('Pclass', 'Survived', hue='Sex', col='Embarked', data=train)
plt.show()

[Figure: factor plot of survival rate by Pclass, split by Sex, one panel per Embarked]

Observations:

  • The survival chances are almost 1 for women from Pclass 1 and Pclass 2, irrespective of the port.
  • Port S looks to be very unlucky for Pclass 3 passengers, as the survival rate for both men and women is very low. (Money matters.)
  • Port Q looks to be the unluckiest for men, as almost all of them were from Pclass 3.

Filling Embarked NaN
Since the maximum number of passengers boarded at Port S, we replace the NaN values with S.

train['Embarked'].fillna('S', inplace=True)
train.Embarked.isnull().any() # Finally No NaN values
False

**SibSp -> Discrete Feature**
This feature represents whether a person is alone or with family members.

Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife
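The post stops here before the SibSp analysis itself; as a hedged sketch, the feature could be inspected against survival with the same crosstab/groupby pattern used above:

pd.crosstab(train.SibSp, train.Survived, margins=True) # counts per number of siblings/spouses aboard
train.groupby('SibSp')['Survived'].mean() # survival rate per SibSp value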

Part 2: Feature Engineering and Data Cleaning

  1. Adding a few new features
  2. Removing redundant features
  3. Converting features into a form suitable for modeling (see the sketch below)
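
This part of the post was left empty. The following is only a hedged sketch of the kind of steps the referenced DieTanic kernel performs here; the bin edges and mappings are illustrative assumptions, not the kernel's exact code:

# Hedged sketch of Part 2-style steps (illustrative values, not the original kernel's exact code)
train['Family_Size'] = train['SibSp'] + train['Parch'] # new feature: family size
train['Alone'] = (train['Family_Size'] == 0).astype(int) # new feature: travelling alone or not
train['Age_band'] = pd.cut(train['Age'], bins=[0, 16, 32, 48, 64, 80],
                           labels=[0, 1, 2, 3, 4]).astype(int) # continuous Age -> ordinal band
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1}) # string -> numeric
train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}) # string -> numeric
train['Initial'] = train['Initial'].map({'Mr': 0, 'Mrs': 1, 'Miss': 2, 'Master': 3, 'Other': 4}) # string -> numeric
train.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True) # drop redundant columns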
Part 3: Predictive Modeling

  1. Running Basic Algorithms
  2. Cross Validation
  3. Ensembling
  4. Important Features Extraction
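
This part was also left empty; a minimal, hedged sketch of steps 1 and 2 (a basic algorithm plus cross-validation) with scikit-learn, assuming the train DataFrame has already been made fully numeric as in the Part 2 sketch above:

# Hedged sketch: basic model + cross-validation (not the original kernel's code)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = train.drop('Survived', axis=1) # features
y = train['Survived'] # target
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5) # 5-fold cross-validated accuracy
print(scores.mean())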