Football WorldCup Predictive Model

Jul 2, 2022

The football World Cup is a great topic for practicing machine learning skills and Python programming techniques. I assume there are many articles on the subject; mine is not a revolution, just a description of my approach to this project.

My approach to the problem

I am interested in the ability of betting platforms to offer odds on any game that look attractive to bettors while still making money overall on all the bets placed.

I decided to train a model on features computed from odds, using a multi-output regressor to predict the outcome.

The project pattern is a traditional one with the following steps:

define the problem
collect the data
clean data and prepare training dataset
train a model
evaluate and fine tune the model
predict a future World Cup

Define the problem

The main objective is to train a model with past odds and known results. The result is the game score, which is a multi-target label: one value for the home score and one for the away score.

Game: France — England — [list of features]: target 2–1

The list of features

For this game, features can be:

when the game has been played (year, season, ...)
where the game has been played (neutral location, home/away stadium)
odds for this game from a betting platform in the 1N2 pattern [1.35, 2.15, 2.85], which means France to win at odds of 1.35, a draw at odds of 2.15, or England to win at odds of 2.85. If you bet $1 on France to win and France wins, your return is $1.35, which is a net profit of $0.35 (35%).
past results for the same game
last x games played, e.g. France won all of its last 3 games [w,w,w] and England won the last one but lost the 2 others [l,l,w]
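To make the 1N2 odds more concrete, here is a small sketch that converts decimal odds into implied probabilities and the platform's margin. The function names are hypothetical and not part of the original project; the only fact used is that the implied probability of a decimal odd is 1 divided by the odd.

```python
def implied_probabilities(odds):
    """Convert decimal 1N2 odds into raw implied probabilities (1 / odd)."""
    return [1.0 / o for o in odds]


def bookmaker_margin(odds):
    """Sum of the implied probabilities minus 1: the platform's built-in edge."""
    return sum(implied_probabilities(odds)) - 1.0


def normalized_probabilities(odds):
    """Rescale the implied probabilities so that they sum to 1."""
    raw = implied_probabilities(odds)
    total = sum(raw)
    return [p / total for p in raw]


odds_1n2 = [1.35, 2.15, 2.85]  # home win, draw, away win (example from the text)
print(implied_probabilities(odds_1n2))
print(bookmaker_margin(odds_1n2))
print(normalized_probabilities(odds_1n2))
```

When the margin is positive, the implied probabilities sum to more than 1. This is how a platform can offer odds that look attractive and still earn money over all the bets.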

Many other features can be added to the model, but remember that when the model makes a prediction, every feature must be available: it cannot receive a feature whose value is unknown at that time.

For this project, I simplified the features list by working with:

odds for the game (collected on website)
FIFA ranking at the time of the game
historical FIFA ranking and changes (computed with a Custom Transformer)

Collect the data, clean & prepare training set

On the internet it’s easy to find websites or databases with past odds and game results. I focused my research on www.betexplorer.com, which lists historical international games with a large quantity of odds (result: 1N2, scores, half-time, and many more).

A simple scraping notebook helped me create a historical database of 5K games from 2000 until now.

Example of a game from the 2006 World Cup in Germany.

The dataset can be used to train several kinds of model: a multi-class classification using the “target” column as “y” [0=home team wins, 1=draw, 2=away team wins], a binary classification with “best_odd_won” [True/False], or a multi-output regression with [home_score & away_score] to predict.
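As an illustration of how the same rows can serve two of these problem types, here is a minimal sketch in plain Python. The `result_label` helper and the sample rows are hypothetical; only the column names `home_score` and `away_score` and the 0/1/2 encoding come from the description above.

```python
def result_label(home_score, away_score):
    """Return 0 if the home team wins, 1 for a draw, 2 if the away team wins."""
    if home_score > away_score:
        return 0
    if home_score == away_score:
        return 1
    return 2


# Made-up rows for illustration; team names other than the first are placeholders
games = [
    {"home_team": "France", "away_team": "England", "home_score": 2, "away_score": 1},
    {"home_team": "Team A", "away_team": "Team B", "home_score": 3, "away_score": 3},
    {"home_team": "Team C", "away_team": "Team D", "home_score": 0, "away_score": 1},
]

# Classification target: one label per game
y_classification = [result_label(g["home_score"], g["away_score"]) for g in games]

# Multi-output regression target: two values per game
y_regression = [[g["home_score"], g["away_score"]] for g in games]

print(y_classification)  # [0, 1, 2]
print(y_regression)      # [[2, 1], [3, 3], [0, 1]]
```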

Train & Evaluate a model

I chose a DecisionTreeRegressor from scikit-learn because it can be used as a multi-output regressor, which means a prediction returns 2 values [home score and away score].
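To show what multi-output means in practice, here is a minimal sketch on synthetic data. The data is random and purely illustrative (it is not the project's dataset); the point is only the shape of what `predict` returns when `y` has two columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only: 3 odds columns -> 2 score columns
rng = np.random.default_rng(0)
X = rng.uniform(1.1, 5.0, size=(50, 3))   # fake 1N2 odds
y = rng.integers(0, 4, size=(50, 2))      # fake [home_score, away_score]

model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)

pred = model.predict(X[:5])
print(pred.shape)  # (5, 2): one [home_score, away_score] pair per game
```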

I created a standard pipeline with transformers to prepare the data, fit the regressor and evaluate the score.

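import joblib
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# CustomTransformer is the project's own class (see "Experimental Transformation");
# trainset and testset are the prepared DataFrames.

regressor = DecisionTreeRegressor()

# features (X) and the two target columns (y) for the train and test sets
train_X = trainset.drop(['home_score','away_score'],axis=1)
train_y = trainset[['home_score', 'away_score']]
test_X = testset.drop(['home_score','away_score'],axis=1)
test_y = testset[['home_score', 'away_score']]

# numeric columns: fill missing values with the median, then standardize
numeric_features = ['1N2_1', '1N2_N', '1N2_2', 'away_team_rank_FIFA', 'home_team_rank_FIFA','season']
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])

# categorical columns: one-hot encode the team names
categorical_features = ['home_team','away_team']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# the final step is named "classifier" but holds the regressor
clf = Pipeline(steps=[
    ('experimental_trans', CustomTransformer('1N2_1')),
    ("preprocessor", preprocessor),
    ("classifier", regressor),
])

clf.fit(train_X, train_y)
print("model score: %.3f" % clf.score(test_X, test_y))

# save your model or results
joblib.dump(clf,'models/multi-regressor-score.pkl')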

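This code prepares the train and test sets (X, y), then sets up a pipeline that transforms the data before fitting the regressor. The printed score is the R² coefficient of determination, which is what the score method of a scikit-learn regressor returns. Finally, the model is saved so it can be loaded in another notebook.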

Experimental Transformation

To add some uncertainty to the data injected into the model, the odds are multiplied by a random factor, giving more weight to a win than to a draw.

This is the job of the Custom Transformer!

o1 = r['1N2_1']*random.uniform(0.805, 1.85) #* diff_fifa
oN = r['1N2_N']*random.uniform(0.505, 0.135) #* diff_fifa
o2 = r['1N2_2']*random.uniform(0.705, 1.65) #* diff_fifa
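The three lines above are only an excerpt, so here is a hedged sketch of how such a step could be wrapped in a scikit-learn compatible transformer. The class name `OddsNoiseTransformer`, its parameters and the seeding are assumptions for illustration; this is not the author's CustomTransformer. The factor ranges are copied from the excerpt. Note that `random.uniform` accepts its bounds in either order, so the draw factor falls between 0.135 and 0.505.

```python
import random

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class OddsNoiseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: scales each 1N2 odd by a random factor."""

    def __init__(self, ranges=None, seed=None):
        self.ranges = ranges
        self.seed = seed

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Default ranges are copied from the excerpt above
        ranges = self.ranges or {
            "1N2_1": (0.805, 1.85),
            "1N2_N": (0.505, 0.135),
            "1N2_2": (0.705, 1.65),
        }
        rng = random.Random(self.seed)
        X = X.copy()  # leave the caller's DataFrame untouched
        for col, (a, b) in ranges.items():
            X[col] = [value * rng.uniform(a, b) for value in X[col]]
        return X


games = pd.DataFrame({"1N2_1": [1.35], "1N2_N": [2.15], "1N2_2": [2.85]})
noisy = OddsNoiseTransformer(seed=42).fit_transform(games)
print(noisy)
```

Because the factors are random, each call without a fixed seed produces different odds, which is what later makes repeated tournament runs differ from one another.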

Computing a Worldcup

This is another fun part of the project. A World Cup is played over 64 games. For the first stage, the group composition is known: 8 groups (from A to H) of 4 national teams [A1, A2, A3, A4, B1, B2, …, H4].

Computing the Group Stage

To compute the group stage I needed to call the model for 48 games, store the result of each game (home_score / away_score) and then compute the ranking of each group to find the 1st and 2nd teams, which go through to the knockout stage of the competition.
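The ranking step can be sketched as follows. The article does not spell out the ranking rules, so this sketch assumes 3 points for a win, 1 for a draw, and ties broken by goal difference and then goals scored; the function name and the sample scores are hypothetical.

```python
from collections import defaultdict


def group_table(results):
    """Rank teams from a list of (home, away, home_score, away_score) tuples.

    Assumed rules: 3 points for a win, 1 for a draw, ties broken by goal
    difference and then goals scored.
    """
    stats = defaultdict(lambda: {"points": 0, "gd": 0, "gf": 0})
    for home, away, home_score, away_score in results:
        stats[home]["gf"] += home_score
        stats[away]["gf"] += away_score
        stats[home]["gd"] += home_score - away_score
        stats[away]["gd"] += away_score - home_score
        if home_score > away_score:
            stats[home]["points"] += 3
        elif home_score < away_score:
            stats[away]["points"] += 3
        else:
            stats[home]["points"] += 1
            stats[away]["points"] += 1
    return sorted(
        stats,
        key=lambda team: (stats[team]["points"], stats[team]["gd"], stats[team]["gf"]),
        reverse=True,
    )


# Made-up predicted scores for the 6 games of one group
results = [
    ("A1", "A2", 2, 0), ("A3", "A4", 1, 1), ("A1", "A3", 1, 0),
    ("A2", "A4", 3, 1), ("A1", "A4", 0, 0), ("A2", "A3", 2, 2),
]
ranking = group_table(results)
print(ranking[:2])  # the two teams that go to the knockout stage
```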

The World Cup structure is stored in a JSON file. Look at the round-of-16 allocation: the winner of Group A plays against the runner-up of Group B, and so on.

{
  "stage": {
    "groups": {
      "group-a": ["Uruguay", "Russia", "Saudi Arabia", "Egypt"],
      "group-b": ["Spain", "Portugal", "Iran", "Morocco"],
      "group-c": ["France", "Denmark", "Peru", "Australia"],
      "group-d": ["Croatia", "Argentina", "Nigeria", "Iceland"],
      "group-e": ["Brazil", "Switzerland", "Costa Rica", "Serbia"],
      "group-f": ["Sweden", "Mexico", "South Korea", "Germany"],
      "group-g": ["Belgium", "England", "Panama", "Tunisia"],
      "group-h": ["Colombia", "Japan", "Senegal", "Poland"]
    }
  },
  "knockouts": {
    "knockout-16": {
      "kn16-01": [["1A", "2B"], ["1C", "2D"]],
      "kn16-02": [["1E", "2F"], ["1G", "2H"]],
      "kn16-03": [["1B", "2A"], ["1D", "2C"]],
      "kn16-04": [["1F", "2E"], ["1H", "2G"]]
    }
  }
}
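Here is a minimal sketch of how codes such as "1A" and "2B" could be turned into real fixtures once the group rankings are known. The `resolve` and `round_of_16` helpers and the `rankings` dictionary are hypothetical, and the JSON is only an excerpt of the structure above.

```python
import json

# Excerpt of the knockout structure shown above
worldcup = json.loads("""
{
  "knockouts": {"knockout-16": {
    "kn16-01": [["1A", "2B"], ["1C", "2D"]],
    "kn16-03": [["1B", "2A"], ["1D", "2C"]]
  }}
}
""")

# Hypothetical group-stage outcome: 1st and 2nd team of each group
rankings = {
    "group-a": ["Uruguay", "Russia"],
    "group-b": ["Spain", "Portugal"],
    "group-c": ["France", "Denmark"],
    "group-d": ["Croatia", "Argentina"],
}


def resolve(code, rankings):
    """Turn a code such as '1A' into the team ranked 1st in group A."""
    position, group = int(code[0]), code[1].lower()
    return rankings[f"group-{group}"][position - 1]


def round_of_16(worldcup, rankings):
    """Replace every code in the round-of-16 table by a team name."""
    fixtures = {}
    for key, pairs in worldcup["knockouts"]["knockout-16"].items():
        fixtures[key] = [tuple(resolve(code, rankings) for code in pair) for pair in pairs]
    return fixtures


fixtures = round_of_16(worldcup, rankings)
print(fixtures["kn16-01"])  # [('Uruguay', 'Portugal'), ('France', 'Argentina')]
```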

I created a class “PredictGame” which receives the game to predict, prepares the data, calls the saved model and returns the result. This class takes a special attribute (knockout: True/False) because in a knockout-stage game the model cannot predict a draw: a winner must be returned.


payout = wc.loc[m,features].values.reshape(1,-1)

# call the predictGame class and predict method
feat = ['1N2_1','1N2_N','1N2_2','home_team_rank_FIFA','away_team_rank_FIFA','season','home_team','away_team']
play = pd.DataFrame(wc.iloc[m,:]).T
g = pg.PredictGame(play)
p = g.predict()
t = p[0][0][7]

The model is called through the “predict” method of the PredictGame class, which returns a set of values. The element at index 7 is the array of scores [2.10245, 0.54646], which can be turned into a 2–0 game result.
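The conversion from raw predictions to a game result can be sketched like this. Two assumptions are made here and are not confirmed by the article: the scores are truncated rather than rounded (rounding 0.54646 would give 1, not the 0 reported above), and a draw in a knockout game is broken in favour of the team with the larger raw prediction. The function name `to_result` is hypothetical.

```python
def to_result(predicted, knockout=False):
    """Turn raw predicted scores into an integer (home, away) result.

    Assumptions: scores are truncated (2.10245 -> 2, 0.54646 -> 0), and a
    knockout draw is broken in favour of the larger raw prediction.
    """
    home_raw, away_raw = predicted
    home, away = int(home_raw), int(away_raw)
    if knockout and home == away:
        # a winner must be returned: give one extra goal to the stronger side
        if home_raw >= away_raw:
            home += 1
        else:
            away += 1
    return home, away


print(to_result([2.10245, 0.54646]))         # (2, 0)
print(to_result([1.4, 1.9], knockout=True))  # (1, 2)
```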

With the group-stage results and the World Cup structure, the program continues to predict games until the final. The team with the best score in the final is the winner of this prediction.

Testing the model with 2018 Worldcup

Data from the 2018 World Cup were kept out of the training process and of the test & evaluation process. I call them “unseen data”.

I loaded the dataset, which contains the group-stage composition and the odds collected from the betting website. Then I played the group stage (48 games), computed the tables and ran the knockout stage from the round of 16 to the final. As mentioned previously, the odds are altered by an uncertainty factor, which allows me to repeat the 2018 World Cup computation many times without getting the same result each time.

500 iterations!

I computed 500 iterations of the World Cup and stored the winner of each. Finally, a group-by table with counts gave me the following result.

France won 20.2% of the 500 computed World Cups.
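The counting step can be sketched with the standard library. The `simulate_worldcup` function below is only a placeholder that draws a winner from made-up weights so that the example runs; in the real project it would be the full group-stage and knockout computation. All names here are hypothetical.

```python
import random
from collections import Counter


def simulate_worldcup(rng):
    """Placeholder for one full tournament run; returns the winner's name.

    It draws from made-up weights so that the example is runnable.
    """
    teams = ["Team A", "Team B", "Team C", "Team D"]
    weights = [4, 3, 2, 1]
    return rng.choices(teams, weights=weights, k=1)[0]


def winner_shares(n_iterations, seed=0):
    """Run the tournament n times and return each winner's share of the runs."""
    rng = random.Random(seed)
    winners = Counter(simulate_worldcup(rng) for _ in range(n_iterations))
    return {team: count / n_iterations for team, count in winners.most_common()}


shares = winner_shares(500)
for team, share in shares.items():
    print(f"{team}: {share:.1%}")
```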

Conclusion

In this post, I shared a method to collect football odds from a betting website and my approach to building a multi-output regression dataset to train a model.

I also described a program that can “run” (or compute) a FIFA World Cup.

The next stage will be to collect the group-stage composition of the 2022 World Cup and the odds published online, and to run the program before betting $1 on my winner!

Have fun …