“Using ChatGPT and Machine Learning to Predict Titanic Survivors: A Case Study”

Daryl Felix
Jan 8, 2023

Recently, I started working with ChatGPT (from OpenAI: https://openai.com/blog/chatgpt/).

After a few trials, I wondered today whether I could create a Python program to predict survivors on the Titanic dataset (a famous one in machine learning) without writing a single line of code.

Photo by Kelly Sikkema on Unsplash

Here is the result of my experiment.

Prerequisites

  • Not a single line of code written by me
  • Good knowledge of the Titanic dataset
  • Asking ChatGPT questions about what I want to do and copy-pasting the Python code into a notebook

Conventions

  • My questions are in bold
  • Code generated by ChatGPT is copied without any changes
  • ChatGPT's descriptions are in quotes and plain text
  • My comments are in italics

Machine Learning project pattern

I will work with a standard machine learning pattern to train the model and predict values:

  • load the dataset
  • make some statistical description
  • exploratory data analysis
  • feature engineering
  • model training and grid search + tuning parameters
  • model evaluation
  • predictions

Let’s start

Generate Python code to load a local file named train.csv in a Pandas Dataframe and count number of rows

import pandas as pd

# Load the train.csv file into a DataFrame
df = pd.read_csv('train.csv')

# Count the number of rows in the DataFrame
row_count = df.shape[0]
print(f'Number of rows: {row_count}')

Print description of the loaded dataset

import pandas as pd

# Load the train.csv file into a DataFrame
df = pd.read_csv('train.csv')

# Print the description of the dataset
print(df.describe())

To print the description of the loaded dataset, you can use the describe method of the Pandas DataFrame. This method generates descriptive statistics of the data, including the count, mean, standard deviation, minimum and maximum values, and the quartiles.

Count the missing value in the dataset

The code below prints the number of missing values in each column of the DataFrame. If a column has no missing values, its count is 0.

It also uses the isnull method to count the total number of missing values in the DataFrame:

import pandas as pd

# Load the train.csv file into a DataFrame
df = pd.read_csv('train.csv')

# Count the number of missing values in each column
missing = df.isnull().sum()
print(missing)

# Count the total number of missing values
total_missing = df.isnull().sum().sum()
print(f'Total number of missing values: {total_missing}')
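Raw counts are easier to judge next to the dataset size. As a minimal sketch, the share of missing values per column can be computed with isnull().mean(). The small DataFrame below is a made-up stand-in for train.csv, so that the example runs on its own; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv (values are made up for illustration)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share
missing_pct = sample.isnull().mean() * 100

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```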

Dealing with missing values

I asked ChatGPT to fill the missing age values with the average age of each group defined by sex and travel class. This way, if the men in 3rd class are young, their missing ages are not skewed by the ages of older men travelling in 1st class.

def fill_age(group):
    group['age'] = group['age'].fillna(group['age'].mean())
    return group

df = df.groupby(['sex', 'class']).apply(fill_age)

Attention: ChatGPT generates code with lower-case column names only; your file might use names like “Age” or “Sex”.
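One way to reconcile the names is to lower-case the DataFrame columns before running the generated code. The sketch below assumes capitalised column names such as “Sex”, “Pclass” and “Age” (the class column would then become pclass, not class), and uses made-up rows. It also fills the ages with groupby(...).transform("mean"), which is a swapped-in alternative to the apply-based function above, not what ChatGPT generated.

```python
import numpy as np
import pandas as pd

# Made-up rows with capitalised column names (an assumption about the file)
sample = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan],
})

# Lower-case the column names so they match the generated code
sample.columns = sample.columns.str.lower()

# Mean age of each sex/class group, aligned row by row with the original frame
group_mean = sample.groupby(["sex", "pclass"])["age"].transform("mean")

# Only the missing ages are replaced; known ages stay untouched
sample["age"] = sample["age"].fillna(group_mean)
print(sample)
```

Because transform returns a Series with the same index as the input, the DataFrame keeps its original row order and index.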

Create a new feature by binning age column

Create a new column by binning Age columns in 5 bins

# Create the bins for the age column
bins = [0, 20, 40, 60, 80, 100]

# Create a new column with the binned values of the age column
df['age_bins'] = pd.cut(df['age'], bins)

# Print the first few rows of the DataFrame
print(df.head())

Encode the new column age_bins as numerical values

The new column holds categorical values, which must be converted to numbers before the dataset can be used for training.

# Encode the age_bins column as numerical values with a prefix
df = pd.get_dummies(df, columns=['age_bins'], prefix='age')
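To see what the two steps produce together, here is a small sketch on made-up ages. The labels list is hypothetical and only makes the resulting column names easier to read. Note that pd.cut uses right-closed intervals by default, so an age of exactly 20 lands in the first bin and an age of exactly 0 would fall outside all bins.

```python
import pandas as pd

# Made-up ages, including values that sit exactly on a bin edge
ages = pd.DataFrame({"age": [5, 20, 35, 61, 80]})

bins = [0, 20, 40, 60, 80, 100]
labels = ["0-20", "21-40", "41-60", "61-80", "81-100"]  # hypothetical labels

# Intervals are (0, 20], (20, 40], ... because right=True is the default
ages["age_bins"] = pd.cut(ages["age"], bins, labels=labels)

# One indicator column per bin, prefixed with "age_"
encoded = pd.get_dummies(ages, columns=["age_bins"], prefix="age")
print(encoded)
```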

Keep numerical features only

Delete non-numerical columns from the dataset

In machine learning, most models can only be trained on numerical data. The Titanic dataset contains “object” (text) values. I could have continued asking ChatGPT to transform them, but to keep this story simple I chose to delete the non-numerical columns.

# Select only the numerical columns
df = df.select_dtypes(include=['int64', 'float64','uint8'])
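As a hedged sketch of the transformation route, a two-category text column can be mapped to integers before the filter, so it survives the selection. The rows and the 0/1 mapping below are made up. One caveat: recent pandas versions return boolean dummy columns from get_dummies rather than uint8, so the filter above may drop them; including "bool" in the selection is one way around that.

```python
import pandas as pd

# Made-up rows: one free-text column, one two-category column,
# one float column and one dummy column as newer pandas would produce it
sample = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 7.92],
    "age_(0, 20]": [True, False, False],
})

# Map the two text categories to integers (the 0/1 choice is arbitrary)
sample["Sex"] = sample["Sex"].map({"male": 0, "female": 1})

# "number" covers every int and float dtype; "bool" keeps boolean dummies
numeric = sample.select_dtypes(include=["number", "bool"])
print(numeric.dtypes)
```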

Split dataset and train a DecisionTree model

Split the dataset into train and test dataset and train a DecisionTree model from Scikit-Learn using train set


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split



# Select the features and target columns
X = df.drop(columns='Survived')
y = df['Survived']

# Split the data into a train set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree model on the train set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
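The project pattern listed earlier mentions grid search and parameter tuning, while the code above trains the tree with default settings. A minimal sketch of that step with scikit-learn's GridSearchCV could look like this. It runs on synthetic data from make_classification rather than the Titanic features, and the parameter grid is illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features (not the real Titanic data)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search space: every combination is tried with 5-fold CV
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best model is refitted on the whole train set and scored on the test set
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(search.best_params_)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Limiting max_depth or raising min_samples_leaf constrains the tree, which is a common way to reduce overfitting with decision trees.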

Evaluate the model

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model performance
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Here is the result
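Accuracy is a single number and does not show which class the model gets wrong. As a small sketch, a confusion matrix and a per-class report from scikit-learn give that breakdown. The labels and predictions below are made up, and the class names "died" and "survived" are assumptions about how 0 and 1 are encoded.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and predictions (0 = died, 1 = survived, by assumption)
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 0, 1, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Precision, recall and F1 score for each class
print(classification_report(y_true, y_hat, target_names=["died", "survived"]))
```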

Summary

This story is a simple piece of work with ChatGPT, which offers many possibilities. At the beginning of this article I wrote that I would not change any line of code, but in fact I had to adapt some variable names (lower case to upper case) and delete some lines, because ChatGPT reloads the original dataset in every answer.

But in general, the bot proposed valuable lines of code, which I simply pasted into my notebook.

Conclusion

I worked with ChatGPT while already having good skills in Python and machine learning. Did I save time? I am not sure: this notebook took 90 minutes, which I am sure is what I would have spent doing it by myself. So, first outcome:

  • not sure I saved time!

Is it simple to create a program like this?

  • Why not: it is an alternative to keeping the Stack Overflow website open and looking for code samples

Is it fun?

  • Totally. I recommend the challenge; I asked ChatGPT some other questions and got valuable answers.

Print of my experimentation.

Daryl Felix

Passionate programmer with more than 30 years of experience, from the first Cobol programs to Python.