“Using ChatGPT and Machine Learning to Predict Titanic Survivors: A Case Study”
Recently, I started working with ChatGPT (from OpenAI: https://openai.com/blog/chatgpt/).
After a few experiments, I wondered today whether I could create a Python program that predicts survivors in the Titanic dataset (a well-known dataset in machine learning) without writing a single line of code myself.
Here is the result of my experiment.
Prerequisites
- Not a single line of code written by me
- Good knowledge of the Titanic dataset
- Asking ChatGPT what I want to do and copy-pasting the Python code into a notebook
Conventions
- My questions are in bold
- Code generated by ChatGPT is copied without any changes
- ChatGPT's descriptions are in quotes and plain text
- My comments are in italics
Machine Learning project pattern
I will work with a standard machine learning pattern to train the model and predict values:
- load the dataset
- compute some descriptive statistics
- exploratory data analysis
- feature engineering
- model training, grid search and parameter tuning
- model evaluation
- predictions
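The walkthrough trains a single decision tree with default settings, so here is a minimal sketch of what the grid search step from the list above could look like. It runs on a small synthetic table rather than on train.csv; the values, the parameter grid and the variable names (`toy`, `search`) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (made-up values, not the real file)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n).round(),
    "Fare": rng.uniform(5, 100, size=n).round(2),
})
# Let the target loosely depend on the class so there is something to learn
toy["Survived"] = ((toy["Pclass"] == 1) | (rng.random(n) < 0.3)).astype(int)

X = toy.drop(columns="Survived")
y = toy["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try every combination of two tree hyperparameters with 5-fold cross-validation
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Test accuracy: {search.score(X_test, y_test):.2f}")
```

After fitting, `search` behaves like the best model found, so `search.predict(X_test)` can be used in place of a hand-tuned tree.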
Let’s start
Generate Python code to load a local file named train.csv in a Pandas Dataframe and count number of rows
import pandas as pd
# Load the train.csv file into a DataFrame
df = pd.read_csv('train.csv')
# Count the number of rows in the DataFrame
row_count = df.shape[0]
print(f'Number of rows: {row_count}')
Print description of the loaded dataset
import pandas as pd
# Load the train.csv file into a DataFrame
df = pd.read_csv('train.csv')
# Print the description of the dataset
print(df.describe())
To print the description of the loaded dataset, you can use the describe method of the Pandas DataFrame. This method generates descriptive statistics of the data, including the count, mean, standard deviation, minimum and maximum values, and the quartiles.
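By default, describe only summarizes numerical columns, which matters for a dataset that also contains text fields. The sketch below shows the difference on a small made-up sample; the column names imitate the usual Titanic file and are assumptions, not values read from train.csv.

```python
import pandas as pd

# Small hypothetical sample; column names mimic the usual Titanic file
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# Default behaviour: numerical columns only
numeric_summary = sample.describe()
print(numeric_summary)

# include="all" adds count, unique, top and freq for text columns
full_summary = sample.describe(include="all")
print(full_summary)
```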
Count the missing value in the dataset
This will print the number of missing values in each column of the DataFrame. If there are no missing values, the corresponding value will be 0.
You can also use the isnull method to count the total number of missing values in the DataFrame:
import pandas as pd
# Load the train.csv file into a DataFrame
df = pd.read_csv('train.csv')
# Count the number of missing values in each column
missing = df.isnull().sum()
print(missing)
# Count the total number of missing values
total_missing = df.isnull().sum().sum()
print(f'Total number of missing values: {total_missing}')
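Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a made-up sample (the column names and values are assumptions) that reports both:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two columns
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

missing_count = sample.isnull().sum()
# The mean of a boolean column is the fraction of True values
missing_share = sample.isnull().mean()

report = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(report.sort_values("share", ascending=False))
```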
Dealing with missing values
I asked ChatGPT to fill the missing ages with the average age of each group defined by sex and travel class. That way, if the men in 3rd class are young, their missing ages are not skewed by the older men travelling in 1st class.
def fill_age(group):
    group['age'] = group['age'].fillna(group['age'].mean())
    return group
df = df.groupby(['sex', 'class']).apply(fill_age)
Attention: ChatGPT generates code with lower-case column names only; your file might use names like “Age” or “Sex”.
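The same group-wise fill can also be written with groupby and transform, which keeps the original row index untouched. This is an alternative sketch, not the code ChatGPT produced; the capitalized column names `Sex`, `Pclass` and `Age` are assumed to match the file, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two missing ages in two different groups
sample = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 60.0],
})

# transform returns one value per row: the mean age of that row's group
group_mean_age = sample.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# Only the missing ages are replaced; known ages are left as they are
sample["Age"] = sample["Age"].fillna(group_mean_age)
print(sample)
```

The missing age of the 3rd class man becomes 25 (the mean of 20 and 30) and is not affected by the 60-year-old man in 1st class.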
Create a new feature by binning age column
Create a new column by binning Age columns in 5 bins
# Create the bins for the age column
bins = [0, 20, 40, 60, 80, 100]
# Create a new column with the binned values of the age column
df['age_bins'] = pd.cut(df['age'], bins)
# Print the first few rows of the DataFrame
print(df.head())
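One detail worth checking with these bins: pd.cut builds intervals that are open on the left and closed on the right by default, so an age of exactly 20 falls in the first bin and an age of exactly 0 falls in none. The sketch below illustrates this on a few made-up ages; the `labels` list is a hypothetical addition for readability.

```python
import pandas as pd

ages = pd.Series([0.5, 20, 21, 45, 80])
bins = [0, 20, 40, 60, 80, 100]
# Hypothetical readable labels, one per interval
labels = ["0-20", "21-40", "41-60", "61-80", "81-100"]

# Default right=True: intervals are (0, 20], (20, 40], and so on
age_bins = pd.cut(ages, bins=bins, labels=labels)
print(age_bins.tolist())

# An age of exactly 0 lies outside (0, 20] and becomes NaN;
# include_lowest=True would put it in the first bin instead
print(pd.cut(pd.Series([0]), bins=bins).isna().tolist())
```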
Encode the new age_bins column as numerical values
The new column holds categorical values, which must be converted to numbers before the dataset can be used for training.
# Encode the age_bins column as numerical values with a prefix
df = pd.get_dummies(df, columns=['age_bins'], prefix='age')
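To see what this encoding produces, here is a small sketch on three made-up ages. Each bin becomes its own column and every row has a 1 in exactly one of them. The `dtype=int` argument is my addition so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Three hypothetical ages, one per bin
sample = pd.DataFrame(
    {"age_bins": pd.cut(pd.Series([10, 30, 50]), [0, 20, 40, 60])}
)

# One indicator column per bin, named with the "age" prefix
encoded = pd.get_dummies(sample, columns=["age_bins"], prefix="age", dtype=int)
print(encoded)
```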
Keep numerical features only
Delete non-numerical columns from the dataset
In machine learning, models such as the one trained here need numerical input. The Titanic dataset contains “object” (text) values. I could have kept asking ChatGPT to transform them, but to keep this story simple I chose to delete the non-numerical columns.
# Select only the numerical columns
df = df.select_dtypes(include=['int64', 'float64','uint8'])
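A caveat about the dtype list: in recent pandas versions (2.0 and later, as far as I know), get_dummies returns boolean columns by default, so a filter on int64, float64 and uint8 would silently drop the encoded age columns. A more tolerant selection, sketched on a made-up sample, uses the generic `number` selector plus `bool`:

```python
import pandas as pd

# Hypothetical sample with integer, float, text and boolean columns
sample = pd.DataFrame({
    "Survived": [0, 1],
    "Fare": [7.25, 71.28],
    "Name": ["Braund", "Cumings"],
    "age_child": [True, False],
})

# "number" covers every integer and float width; "bool" keeps dummy columns
numeric = sample.select_dtypes(include=["number", "bool"])
print(numeric.columns.tolist())
```

Only the text column `Name` is dropped here.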
Split dataset and train a DecisionTree model
Split the dataset into train and test dataset and train a DecisionTree model from Scikit-Learn using train set
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Select the features and target columns
X = df.drop(columns='Survived')
y = df['Survived']
# Split the data into a train set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Decision Tree model on the train set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
Evaluate the model
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model performance
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Here is the result
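Accuracy on its own does not say which kind of mistake the model makes. As a small complement, the sketch below computes a confusion matrix next to the accuracy. The label lists are made up for illustration and are not the predictions from my notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print(f"Accuracy: {accuracy:.2f}")
print(f"Survivors correctly predicted: {tp} of {tp + fn}")
```

In this toy case the accuracy is 0.70, yet only 2 of the 4 survivors are found, which the single accuracy figure hides.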
Summary
This story is a simple exercise with ChatGPT, which offers many possibilities. At the beginning of this article I wrote that I would not change any line of code, but in fact I had to adapt some variable names (lower case to upper case) and delete some lines, because ChatGPT reloads the original dataset in every answer.
But in general, the bot proposed valuable lines of code that I simply pasted into my notebook.
Conclusion
I worked with ChatGPT while already having good skills in Python and machine learning. Did I save time? I am not sure: this notebook took 90 minutes, which is about what I would have spent doing it by myself. So, first outcome:
- not sure it saves time!
Is it simple to create a program like this?
- Why not: it is an alternative to keeping the Stack Overflow website open and searching for code samples
Is it fun?
- Totally. I recommend the challenge. I asked ChatGPT some other questions and got valuable answers.
Printout of my experiment.