It’s been over two months since I finished the Data Science certificate program through the University of Washington. Since then I’ve been trying to figure out my next step. The annoying thing about the internet is that it gives you too many options. Every time I search “learning data science”, “how to become a data scientist”, or “what data science tools should I learn”, I get completely inundated with information. I can’t tell you how many times one article has led to several others until I couldn’t even remember where I started. In all of this noise, I’ve realized one thing: you just HAVE TO START SOMEWHERE. I’ve done Kaggle in the past and I’m pretty familiar with R, so I figured I would go back to the Titanic problem and see what happens. I won’t rehash the entire problem, but basically you are given a set of features about passengers on the Titanic, and you have to use them to build a model that predicts whether each passenger died or survived. I have to give a shoutout to Trevor Stephens and his blog for getting me started.
For my analysis, I started with some simple proportion tables to see what impact different categorical features had on survival. You can see my code on GitHub for all the details. Passenger Class and Sex were the most obvious features to test, since they have only 3 and 2 levels respectively and they seemed likely to provide some insight into survival (unlike the Embarked feature). I found that 3rd class passengers and males were the most likely to die. I created a few submissions based on sex and class. My females-only prediction is currently my best score at 0.76555.
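One thing worth spelling out about proportion tables: by default `prop.table()` divides every cell by the grand total, so to read a cell as "share of this group that survived" you need the `margin` argument. Here is a minimal sketch of the difference. The `toy` data frame is made up purely for illustration and is not the Kaggle training set; only its column names mirror the real ones.

```r
# Toy data, invented for illustration only (NOT the Kaggle training set)
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0),
  Sex      = c("female", "female", "male", "female", "male",
               "male", "male", "female", "female", "male"),
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
)

# Joint proportions: every cell is divided by the grand total,
# so the whole table sums to 1
joint <- prop.table(table(toy$Survived, toy$Sex))

# Column-wise proportions (margin = 2): each column sums to 1, so a cell
# reads as "share of this sex that died (row 0) or survived (row 1)"
by_sex <- prop.table(table(toy$Survived, toy$Sex), margin = 2)

# Same idea for passenger class
by_class <- prop.table(table(toy$Survived, toy$Pclass), margin = 2)

print(round(by_sex, 2))
print(round(by_class, 2))
```

With `margin = 2` the survival rate for each sex or class can be read straight off the bottom row, which is the number I actually care about when picking features.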
After that I began playing around with logistic regression. So far, none of my attempts at logistic regression have improved my score, but I have some ideas for tomorrow (I already reached my submission limit for today). I do realize now that I need a plan for my logistic regression models: I need to determine which features are most likely to provide signal instead of blindly plugging in different ones. Since the code for this portion is short, I included it below.
# Kaggle Titanic Problem
rm(list = ls())

train <- read.csv("~/Documents/RStudio/Titanic/train.csv")
test <- read.csv("~/Documents/RStudio/Titanic/test.csv")

str(train)
table(train$Survived)
prop.table(table(train$Survived))

# First submission, assume everybody dies
test$Survived <- rep(0, 418)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)

###
# margin = 2 gives proportions within each column (each class or sex),
# so the cells read directly as death and survival rates
prop.table(table(train$Survived, train$Pclass), margin = 2)
# About 76% of 3rd class passengers died, most 1st class passengers lived
prop.table(table(train$Survived, train$Sex), margin = 2)
# Most males died
prop.table(table(train$Sex, train$Pclass))

# Second submission, all 1st class females live
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)

###
# Third submission, only 1st and 2nd class females live
test$Survived <- rep(0, 418)
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1
test$Survived[test$Sex == "female" & test$Pclass == 2] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)

###
# Fourth submission, only females live
test$Survived <- rep(0, 418)
test$Survived[test$Sex == "female"] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)

###
# Replace missing Age and Fare values with the column mean
ave_agetr <- mean(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- ave_agetr
ave_agete <- mean(test$Age, na.rm = TRUE)
test$Age[is.na(test$Age)] <- ave_agete
ave_farete <- mean(test$Fare, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- ave_farete

# Fifth submission, logistic regression using Sex, Fare, Pclass, and Age
logist <- glm(Survived ~ Sex + Fare + Pclass + Age, data = train, family = "binomial")
prob <- predict(logist, newdata = test, type = "response")
# Predicted probability above 0.5 counts as survived (1), otherwise died (0)
test$Survived <- as.integer(prob > 0.5)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission.csv", row.names = FALSE)

###
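One way I could plan the logistic regression attempts instead of burning Kaggle submissions is to hold out part of the labeled data and score each candidate formula locally. The sketch below is only an illustration of that idea, not something I have run against the real files: `sim` is simulated stand-in data (with survival deliberately tied to Sex and Pclass but not Age), and `holdout_accuracy` is a hypothetical helper name I made up.

```r
# Simulated stand-in for train.csv (hypothetical data, NOT the Kaggle file)
set.seed(42)
n <- 400
sim <- data.frame(
  Sex    = sample(c("female", "male"), n, replace = TRUE),
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)
# Survival odds are made to depend on Sex and Pclass, but not on Age
log_odds <- 1.5 - 2.5 * (sim$Sex == "male") - 0.6 * (sim$Pclass - 1)
sim$Survived <- rbinom(n, 1, plogis(log_odds))

# Hold out 25% of the labeled rows for local validation
holdout_idx <- sample(n, size = n / 4)
fit_rows <- sim[-holdout_idx, ]
holdout  <- sim[holdout_idx, ]

# Hypothetical helper: fit a formula on fit_rows, return hold-out accuracy
holdout_accuracy <- function(formula) {
  fit  <- glm(formula, data = fit_rows, family = "binomial")
  prob <- predict(fit, newdata = holdout, type = "response")
  mean(as.integer(prob > 0.5) == holdout$Survived)
}

acc_sex  <- holdout_accuracy(Survived ~ Sex)
acc_full <- holdout_accuracy(Survived ~ Sex + Pclass + Age)
print(c(sex_only = acc_sex, full = acc_full))

# Coefficient table: estimates, standard errors, z values, and p-values
full_fit <- glm(Survived ~ Sex + Pclass + Age, data = fit_rows, family = "binomial")
coefs <- coef(summary(full_fit))
print(round(coefs, 3))
```

Comparing the hold-out accuracies shows whether an extra feature actually helps before it costs a submission, and the coefficient table gives a rough read on which features carry signal. On the real data the same pattern would apply with `train` in place of `sim`.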