Linear Programming for Titanic Survivors

Daryl Felix
6 min read · Nov 8, 2021

Predict Titanic survivors with Linear Programming

The Titanic competition on Kaggle is a famous starting point for Machine Learning (ML) competitions and a fun entry into the world of data analysis and modeling. With a few skills in Python (or another programming language) and some experience with models, you can start to participate.

I graduated from the EPFL Extension School Machine Learning program. During the courses, I participated in the Titanic competition to train my skills, and to be honest, this classification problem is very interesting.

I submitted a few sets of predictions, all based on classification models such as Decision Tree, Random Forest or Linear Classification. I also submitted a set predicted with a PyCaret model.

Recently, I was reading some posts and found several articles on Linear Programming. The most popular example is the Sudoku solving problem, and it gave me the idea of using LP to predict Titanic survivors.

What is Linear Programming?

Linear Programming optimises an objective while respecting a set of constraints. For those who spend time on Sudoku, it is the task of filling a grid with numbers (1 to 9) while respecting rules such as:

  • each number appears only once per row and per column
  • each 3x3 sub-grid contains the numbers 1 to 9 exactly once

These 2 bullet points can be seen as « constraints ». The optimisation objective could be the time spent solving the Sudoku or the number of trial-and-error attempts.

For this article, I worked with « pulp », a very good and simple-to-use Python library. You can find more information here. There are other libraries, such as Pyomo, and even scikit-learn proposes some methods.
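To get a feel for the library, here is a minimal sketch of how an objective and constraints are expressed in pulp (a toy problem of my own, not related to the competition):

from pulp import LpProblem, LpMaximize, LpVariable, LpStatus, value

# maximise x + 2y subject to x + y <= 4 and x <= 2, with x, y >= 0
prob = LpProblem("tiny_example", LpMaximize)
x = LpVariable("x", lowBound=0)
y = LpVariable("y", lowBound=0)

prob += x + 2 * y     # the objective is the first expression added
prob += x + y <= 4    # constraint 1
prob += x <= 2        # constraint 2

prob.solve()
print(LpStatus[prob.status], value(x), value(y))  # Optimal 0.0 4.0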

Titanic Competition


The goal of the competition is to predict survivors (and passengers who died) in an unseen dataset of 418 observations (passenger details). To do that, we train a model on a dataset of 891 observations for which we know the target (survived = 1 or died = 0). Each observation is a list of features for a passenger, such as age, class, where the passenger boarded, whether the passenger travelled with family, sex, and a few other details.

Dataset description

The training set used to teach the model contains 891 observations, of which 342 survived. We know, from Wikipedia, that 1316 passengers were on the boat and 37.8% survived, which means 498 passengers survived. The unseen dataset therefore contains 498 − 342 = 156 survivors. We must find those 156 passengers in the unseen dataset.

The internet is a gold mine of information about this tragedy. You can find the ratio of passengers who survived by class, how the lifeboats were boarded, and even the names of the passengers who survived.

From the training dataset we know that 136 passengers from the 1st class survived, 87 from the 2nd class and 119 from the 3rd class.

The internet tells us that 61% (202) of the 1st class passengers survived, 42% (118) of the 2nd class and 24% (178) of the 3rd class. So we must find in the unseen dataset:

  • 202 − 136 = 66 passengers from the 1st class
  • 118 − 87 = 31 passengers from the 2nd class
  • 178 − 119 = 59 passengers from the 3rd class

We have 3 constraints to start the problem optimisation :-)
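These targets are easy to recompute from the training file; here is a small sketch (the historical totals per class are the Wikipedia figures quoted above):

import pandas as pd

train = pd.read_csv('titanic/train.csv')

# survivors per class in the training set: {1: 136, 2: 87, 3: 119}
train_survivors = train.query('Survived == 1').groupby('Pclass').size()

# historical survivor totals per class, as quoted above
historical = pd.Series({1: 202, 2: 118, 3: 178})

# survivors per class still to be found in the unseen dataset
targets = historical - train_survivors
print(targets)        # 1: 66, 2: 31, 3: 59
print(targets.sum())  # 156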

Linear Programming to solve the problem

With a few lines of code we can easily implement this problem and ask « pulp » to solve it!

Let’s start the Notebook by importing libraries.

# pyforest lazy-imports the common data libraries, e.g. pandas as pd
import pyforest
from pulp import *

Then we can load the datasets, which can be found on the Kaggle website.

train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')
test.shape, train.shape  # ((418, 11), (891, 12))

Prepare the dataset for LP

# create a copy of the test set with Pclass and Sex
cols = ['PassengerId', 'Pclass', 'Sex']
df = test[cols].copy()
df = df.astype({"Pclass": 'str'})
# set PassengerId as index
df.set_index('PassengerId', inplace=True)
df['Passenger'] = df.index
df = df.astype({"Passenger": 'str'})
# one-hot encode Pclass, Sex and Passenger
df = pd.get_dummies(df)
# add a column with constant 1 which will be used to make sure
# each passenger is chosen at most once
df['Selected'] = 1
# add a column with the value used in the optimisation objective;
# this value will be tuned later to maximise the score
df['Score'] = 1
df.head()
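A quick sanity check (my own addition) that get_dummies produced the expected columns: three class indicators, two sex indicators, and one indicator per passenger:

# non-passenger columns should be the class/sex dummies plus the helpers:
# ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
#  'Selected', 'Score']
print([c for c in df.columns if not c.startswith('Passenger_')])
print(sum(c.startswith('Passenger_') for c in df.columns))  # 418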

Define a function for the solver

def setup_problem(df, sex=False, sex_class=False):

    # Linear Programming problem: maximise the total score of the
    # selected passengers
    prob = LpProblem("titanic_canot", LpMaximize)
    passengers_items = list(df.index)
    # create a dictionary of scores for all passengers
    unique = dict(zip(passengers_items, df['Score']))
    # class membership indicators per passenger
    class1 = dict(zip(passengers_items, df['Pclass_1']))
    class2 = dict(zip(passengers_items, df['Pclass_2']))
    class3 = dict(zip(passengers_items, df['Pclass_3']))
    # one integer decision variable per passenger; the 0/1 range is
    # enforced by the per-passenger constraints added below
    passengers_vars = LpVariable.dicts("PassengerId", passengers_items, lowBound=0, cat='Integer')
    # objective function: sum of the scores of the selected passengers
    prob += lpSum([unique[i] * passengers_vars[i] for i in passengers_items])

Now we start writing the constraints. As I want an exact number of passengers per class, I defined two constraints per class ({>= x and <= y}), both with the same value.

    # exact class counts: 66 from the 1st, 31 from the 2nd, 59 from the 3rd
    prob += lpSum([class1[f] * passengers_vars[f] for f in passengers_items]) >= 66
    prob += lpSum([class1[f] * passengers_vars[f] for f in passengers_items]) <= 66
    prob += lpSum([class2[f] * passengers_vars[f] for f in passengers_items]) >= 31
    prob += lpSum([class2[f] * passengers_vars[f] for f in passengers_items]) <= 31
    prob += lpSum([class3[f] * passengers_vars[f] for f in passengers_items]) >= 59
    prob += lpSum([class3[f] * passengers_vars[f] for f in passengers_items]) <= 59
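Note that pulp also accepts equality constraints, so each pair above could be collapsed into a single line; this equivalent form is an alternative, not the article's original code:

    # equivalent single-constraint form
    prob += lpSum([class1[f] * passengers_vars[f] for f in passengers_items]) == 66
    prob += lpSum([class2[f] * passengers_vars[f] for f in passengers_items]) == 31
    prob += lpSum([class3[f] * passengers_vars[f] for f in passengers_items]) == 59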

I added a special set of constraints to make sure each passenger is chosen at most once (or not at all).

    # one pair of constraints per passenger (PassengerId 892 to 1309):
    # each passenger can be selected at most once
    peoples = []
    for i in range(892, 1310):
        key = 'Passenger_' + str(i)
        temp = dict(zip(passengers_items, df[key]))
        peoples.append(temp)
    for people in peoples:
        prob += lpSum([people[f] * passengers_vars[f] for f in passengers_items]) >= 0
        prob += lpSum([people[f] * passengers_vars[f] for f in passengers_items]) <= 1
    return prob
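An alternative (my own suggestion, not the article's approach) is to declare the decision variables as binary, which makes these 418 pairs of constraints unnecessary:

    # with cat='Binary', each variable is already restricted to {0, 1},
    # so the per-passenger >= 0 / <= 1 constraints can be dropped
    passengers_vars = LpVariable.dicts("PassengerId", passengers_items, cat='Binary')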

Call the solver

prob = setup_problem(df, False)
prob.solve()
print("Status:", LpStatus[prob.status])
# collect the passengers selected by the solver (the predicted survivors)
to_remove = []
for v in prob.variables():
    if v.varValue > 0:
        to_remove.append(int(v.name.replace('PassengerId_', '')))
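A quick sanity check (my own addition) confirms that the solver selected exactly the expected number of passengers:

print(len(to_remove))         # 156 expected (66 + 31 + 59)
print(value(prob.objective))  # 156.0, since every Score is 1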

Create submission for Kaggle

# create a submission dataset: selected passengers get Survived = 1,
# all the others Survived = 0
submission = test[['PassengerId']].copy()
submission = submission.set_index('PassengerId')
submission.loc[to_remove, 'Survived'] = 1
submission.fillna(0, inplace=True)
submission = submission.astype({"Survived": 'int'})
submission.reset_index().to_csv('submission-lp-002.csv', index=False)
submission.reset_index().head()

Look at the breakdown of our submission

test_df = submission.merge(test, left_on='PassengerId', right_on='PassengerId')
test_df.query('Survived == 1')[['PassengerId', 'Pclass', 'Sex']].groupby(['Pclass', 'Sex']).count()

Score and more constraints

This first submission returned a score of 58%, which is not bad for a few lines of code and only 3 constraints.

Let’s make another try by adding constraints on sex.

I updated the problem function by adding constraints requiring at least 85 women and at least 30 men. That pins down at least 115 of the 156 passengers who must be chosen from the unseen dataset. After running the problem and submitting the result to Kaggle, the score jumped to 68%.

    if sex:
        # sex membership indicators per passenger
        sex_female = dict(zip(passengers_items, df['Sex_female']))
        sex_male = dict(zip(passengers_items, df['Sex_male']))
        # at least 85 women and at least 30 men among the selected survivors
        prob += lpSum([sex_female[f] * passengers_vars[f] for f in passengers_items]) >= 85
        prob += lpSum([sex_male[f] * passengers_vars[f] for f in passengers_items]) >= 30
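With this in place, rebuild and solve the problem with the sex constraints enabled:

prob = setup_problem(df, sex=True)
prob.solve()
print("Status:", LpStatus[prob.status])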

Conclusion

By using Linear Programming with constraints, we get close to 70% just by setting constraints on the number of passengers to save per class and deciding on a breakdown between women and men. We could add many more constraints to optimise the score, for example (a sketch of a possible age constraint follows the list):

  • age (adult or children)
  • family group (based on the SibSp or Parch feature)
  • cabin (depending on lifeboat location)
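As an illustration, an age constraint could look like the sketch below. This is my own hypothetical example: the child indicator, the age cut-off of 16 and the threshold of 25 children are illustrative numbers only, and the Age column contains missing values that need handling:

# hypothetical sketch: require at least 25 children among the selected
# survivors (missing ages are treated as adults here);
# the constraint line would go inside setup_problem
df['is_child'] = (test['Age'].fillna(30) < 16).astype(int).values
is_child = dict(zip(passengers_items, df['is_child']))
prob += lpSum([is_child[f] * passengers_vars[f] for f in passengers_items]) >= 25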

In a next post, I will show how I tried to board the lifeboats with constraints based on descriptions we can find on the internet of how the lifeboats were boarded.

The Titanic was a real tragedy and, unfortunately, also an appealing topic for machine learning practitioners.

This story has been inspired by this article.


Daryl Felix

Passionate programmer with more than 30 years of experience, from the first Cobol programs to Python.