The good ol’ Titanic Kaggle Competition Part 1

It’s been over two months since I finished the Data Science certificate program through the University of Washington. Since then I’ve been trying to figure out my next step. The annoying thing about the internet is that it gives you too many options. Every time I search “learning data science”, or “how to become a data scientist”, or “what data science tools should I learn”, I get completely inundated with information. I can’t tell you how many times one article has led to several others, and in the end I can’t even remember where I started. In all of this noise, I’ve realized one thing: you just HAVE TO START SOMEWHERE. I’ve done Kaggle in the past and I’m pretty familiar with R, so I figured I would go back to the Titanic problem and see what happens. I won’t rehash the entire problem, but basically you are given a set of features about passengers on the Titanic, and you have to use them to build a model that predicts whether each passenger died or survived. I have to give a shoutout to Trevor Stephens and his blog for getting me started.

For my analysis, I started by building some simple proportion tables to see what impact different categorical features had on survival. You can see my code on GitHub for all the details. Passenger Class and Sex were the most obvious features to test since they have only 3 and 2 levels respectively, and they seem likely to provide some insight into survival (unlike the Embarked feature). I found that 3rd class passengers and males were the most likely to die. I created a few submissions based on sex and class. My females-only prediction is currently my best score at 0.76555.
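To make the proportion-table idea concrete, here is a minimal sketch on a tiny made-up data frame. The `toy` data and its values are invented for illustration and are not taken from the Kaggle file. It shows how the `margin` argument of `prop.table` changes what the numbers mean: without it, each cell is a share of all passengers, while `margin = 2` gives rates within each column.

```r
# Toy data, made up for illustration only (not the Kaggle file)
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1),
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male")
)

# Rows are Survived (0/1), columns are Sex
counts <- table(toy$Survived, toy$Sex)

# No margin: each cell is a share of ALL passengers
overall <- prop.table(counts)

# margin = 2: each column (sex) sums to 1, so a cell reads as
# "share of that sex who died / survived"
by_sex <- prop.table(counts, margin = 2)

print(overall)
print(by_sex)
```

If the question is "what fraction of women survived?", the `by_sex` table is the one to read, since the `overall` table mixes in how many women were on board in the first place.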

After that I began playing around with logistic regression. So far, none of my attempts at logistic regression have improved my score, but I have some ideas for tomorrow (I already reached my submission limit for today). I do realize now that I need a plan for my logistic regression models: I need to determine which features are most likely to provide signal instead of blindly plugging in different ones. Since the code for this portion is short, I included it below.

# Kaggle Titanic Problem

rm(list=ls())
train <- read.csv("~/Documents/RStudio/Titanic/train.csv")
test <- read.csv("~/Documents/RStudio/Titanic/test.csv")
str(train)
table(train$Survived)
prop.table(table(train$Survived))
test$Survived <- rep(0, nrow(test))

# First submission, assume everybody dies
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

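# margin = 2 makes each column sum to 1, so the rows read as rates within each class or sex
prop.table(table(train$Survived, train$Pclass), margin = 2)
# About three quarters of 3rd class passengers died, most 1st class passengers lived
prop.table(table(train$Survived, train$Sex), margin = 2)
# Most males died
prop.table(table(train$Sex, train$Pclass))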
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1

# Second submission, all 1st class females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

test$Survived <- rep(0, nrow(test))
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1
test$Survived[test$Sex == "female" & test$Pclass == 2] <- 1

# Third submission, only 1st and 2nd class females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

test$Survived <- rep(0, nrow(test))
test$Survived[test$Sex == "female"] <- 1

# Fourth submission, only females live
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###

# Fill missing Age (train and test) and missing Fare (test) with the column mean
ave_agetr <- mean(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- ave_agetr
ave_agete <- mean(test$Age, na.rm = TRUE)
test$Age[is.na(test$Age)] <- ave_agete
ave_farete <- mean(test$Fare, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- ave_farete
logist <- glm(Survived ~ Sex + Fare + Pclass + Age, data = train, family = "binomial")
# Predicted survival probability, cut at 0.5 to get a 0/1 label
test$Survived <- as.integer(predict(logist, newdata = test, type = "response") > 0.5)

# Fifth submission, Logistic regression using Sex, Fare, Pclass, and Age
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)
###
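Since the daily submission limit makes it slow to try features one at a time, one option is to check a model locally before submitting. The sketch below is one possible way to do that, not what I actually ran: it holds out a quarter of the rows, fits the same formula on the rest, prints the coefficient table, and measures accuracy on the held-out rows. To keep it self-contained it uses a simulated data frame (`sim`) whose columns mimic `train.csv`; every value and the survival rule are made up, and names like `holdout`, `fit_rows`, and `val_rows` are just placeholders.

```r
set.seed(42)  # make the random split reproducible

# Simulated stand-in for train.csv; column names mimic the real ones,
# but every value here is made up.
n <- 400
sim <- data.frame(
  Sex    = sample(c("female", "male"), n, replace = TRUE),
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70)),
  Fare   = round(runif(n, 5, 100), 2)
)
# Invented survival rule: women and higher classes do better
p <- plogis(1.5 * (sim$Sex == "female") - 0.6 * sim$Pclass + 0.3)
sim$Survived <- rbinom(n, 1, p)

# Hold out 25% of the rows for validation
holdout  <- sample(n, size = n / 4)
fit_rows <- sim[-holdout, ]
val_rows <- sim[holdout, ]

fit <- glm(Survived ~ Sex + Fare + Pclass + Age,
           data = fit_rows, family = "binomial")

# Coefficient table: estimates, standard errors, z values, p-values
print(coef(summary(fit)))

# Accuracy on rows the model never saw during fitting
val_prob <- predict(fit, newdata = val_rows, type = "response")
val_pred <- as.integer(val_prob > 0.5)
accuracy <- mean(val_pred == val_rows$Survived)
print(accuracy)
```

With the real data, `sim` would be replaced by `train` after the missing ages are filled in. The hold-out accuracy is only a rough estimate of the leaderboard score, but it costs no submissions, and the coefficient table gives a first hint about which features carry signal.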
