Statistical Analysis to Classify Pass Outcomes
In this post, we will learn how to use statistical learning to build models for classifying pass outcomes. Classification is the task of assigning data points to one of several classes, for example labeling whether a pass from a player will be successful or unsuccessful depending on a set of particular features. The pass outcome is the dependent class variable and the features are the independent variables. Classification tasks can be either binary or multiclass.
We will again use the StatsBomb open data to collect information about different kinds of passes and use various classification algorithms from the statistical learning literature to classify their outcomes. First, we will import the relevant packages:
from statsbombpy import sb # statsbomb api
import matplotlib.pyplot as plt # matplotlib for plotting
import seaborn as sns # seaborn for plotting useful statistical graphs
import numpy as np # numerical python package
import pandas as pd # pandas for manipulating and analysing data
Let us now look into the different competitions available:
comp = sb.competitions()
## credentials were not supplied. open data access only
print(comp.to_markdown())
## | | competition_id | season_id | country_name | competition_name | competition_gender | season_name | match_updated | match_available |
## |---:|-----------------:|------------:|:-------------------------|:------------------------|:---------------------|:--------------|:---------------------------|:---------------------------|
## | 0 | 16 | 4 | Europe | Champions League | male | 2018/2019 | 2021-04-19T17:36:05.724116 | 2021-04-19T17:36:05.724116 |
## | 1 | 16 | 1 | Europe | Champions League | male | 2017/2018 | 2021-01-23T21:55:30.425330 | 2021-01-23T21:55:30.425330 |
## | 2 | 16 | 2 | Europe | Champions League | male | 2016/2017 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
## | 3 | 16 | 27 | Europe | Champions League | male | 2015/2016 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
## | 4 | 16 | 26 | Europe | Champions League | male | 2014/2015 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
## | 5 | 16 | 25 | Europe | Champions League | male | 2013/2014 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
## | 6 | 16 | 24 | Europe | Champions League | male | 2012/2013 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
## | 7 | 16 | 23 | Europe | Champions League | male | 2011/2012 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
## | 8 | 16 | 22 | Europe | Champions League | male | 2010/2011 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 9 | 16 | 21 | Europe | Champions League | male | 2009/2010 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 10 | 16 | 41 | Europe | Champions League | male | 2008/2009 | 2020-08-30T10:18:39.435424 | 2020-08-30T10:18:39.435424 |
## | 11 | 16 | 39 | Europe | Champions League | male | 2006/2007 | 2021-03-31T04:18:30.437060 | 2021-03-31T04:18:30.437060 |
## | 12 | 16 | 37 | Europe | Champions League | male | 2004/2005 | 2021-04-01T06:18:57.459032 | 2021-04-01T06:18:57.459032 |
## | 13 | 16 | 44 | Europe | Champions League | male | 2003/2004 | 2021-04-01T00:34:59.472485 | 2021-04-01T00:34:59.472485 |
## | 14 | 16 | 76 | Europe | Champions League | male | 1999/2000 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 15 | 37 | 42 | England | FA Women's Super League | female | 2019/2020 | 2021-04-28T19:48:01.172671 | 2021-04-28T19:48:01.172671 |
## | 16 | 37 | 4 | England | FA Women's Super League | female | 2018/2019 | 2021-04-28T19:48:01.166958 | 2021-04-28T19:48:01.166958 |
## | 17 | 43 | 3 | International | FIFA World Cup | male | 2018 | 2020-10-25T14:03:50.263266 | 2020-10-25T14:03:50.263266 |
## | 18 | 11 | 42 | Spain | La Liga | male | 2019/2020 | 2020-12-18T12:10:38.985394 | 2020-12-18T12:10:38.985394 |
## | 19 | 11 | 4 | Spain | La Liga | male | 2018/2019 | 2021-04-20T03:24:51.029365 | 2021-04-20T03:24:51.029365 |
## | 20 | 11 | 1 | Spain | La Liga | male | 2017/2018 | 2021-04-19T17:36:05.805404 | 2021-04-19T17:36:05.805404 |
## | 21 | 11 | 2 | Spain | La Liga | male | 2016/2017 | 2021-02-02T23:24:58.985975 | 2021-02-02T23:24:58.985975 |
## | 22 | 11 | 27 | Spain | La Liga | male | 2015/2016 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 23 | 11 | 26 | Spain | La Liga | male | 2014/2015 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 24 | 11 | 25 | Spain | La Liga | male | 2013/2014 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 25 | 11 | 24 | Spain | La Liga | male | 2012/2013 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 26 | 11 | 23 | Spain | La Liga | male | 2011/2012 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 27 | 11 | 22 | Spain | La Liga | male | 2010/2011 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 28 | 11 | 21 | Spain | La Liga | male | 2009/2010 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 29 | 11 | 41 | Spain | La Liga | male | 2008/2009 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 30 | 11 | 40 | Spain | La Liga | male | 2007/2008 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 31 | 11 | 39 | Spain | La Liga | male | 2006/2007 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 32 | 11 | 38 | Spain | La Liga | male | 2005/2006 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 33 | 11 | 37 | Spain | La Liga | male | 2004/2005 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 34 | 49 | 3 | United States of America | NWSL | female | 2018 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 35 | 2 | 44 | England | Premier League | male | 2003/2004 | 2020-08-31T20:40:28.969635 | 2020-08-31T20:40:28.969635 |
## | 36 | 72 | 30 | International | Women's World Cup | female | 2019 | 2020-07-29T05:00 | 2020-07-29T05:00 |
Our aim is to get access to all of Barcelona's available pass event data in La Liga, stretching from the 2004/05 season to the 2019/20 season. We will filter comp by setting competition_name to 'La Liga':
comp = comp[comp["competition_name"] == "La Liga"]
print(comp.to_markdown())
## | | competition_id | season_id | country_name | competition_name | competition_gender | season_name | match_updated | match_available |
## |---:|-----------------:|------------:|:---------------|:-------------------|:---------------------|:--------------|:---------------------------|:---------------------------|
## | 18 | 11 | 42 | Spain | La Liga | male | 2019/2020 | 2020-12-18T12:10:38.985394 | 2020-12-18T12:10:38.985394 |
## | 19 | 11 | 4 | Spain | La Liga | male | 2018/2019 | 2021-04-20T03:24:51.029365 | 2021-04-20T03:24:51.029365 |
## | 20 | 11 | 1 | Spain | La Liga | male | 2017/2018 | 2021-04-19T17:36:05.805404 | 2021-04-19T17:36:05.805404 |
## | 21 | 11 | 2 | Spain | La Liga | male | 2016/2017 | 2021-02-02T23:24:58.985975 | 2021-02-02T23:24:58.985975 |
## | 22 | 11 | 27 | Spain | La Liga | male | 2015/2016 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 23 | 11 | 26 | Spain | La Liga | male | 2014/2015 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 24 | 11 | 25 | Spain | La Liga | male | 2013/2014 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 25 | 11 | 24 | Spain | La Liga | male | 2012/2013 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 26 | 11 | 23 | Spain | La Liga | male | 2011/2012 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 27 | 11 | 22 | Spain | La Liga | male | 2010/2011 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 28 | 11 | 21 | Spain | La Liga | male | 2009/2010 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 29 | 11 | 41 | Spain | La Liga | male | 2008/2009 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 30 | 11 | 40 | Spain | La Liga | male | 2007/2008 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 31 | 11 | 39 | Spain | La Liga | male | 2006/2007 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 32 | 11 | 38 | Spain | La Liga | male | 2005/2006 | 2020-07-29T05:00 | 2020-07-29T05:00 |
## | 33 | 11 | 37 | Spain | La Liga | male | 2004/2005 | 2020-07-29T05:00 | 2020-07-29T05:00 |
We see that the competition_id is 11 for all the rows. So, now, we need to collect all the values of season_id. Let us get the values:
season_ids = comp.season_id.unique()
print(season_ids)
## [42 4 1 2 27 26 25 24 23 22 21 41 40 39 38 37]
Now that we have all the values of La Liga's season_id and we know the competition_id, we can get all the required pass event data. Let us fetch the event data from the last three seasons:
ev = {} # dictionary collecting the event data of each match
i = 0
for si in season_ids[:3]: # the three most recent seasons
    mat = sb.matches(competition_id = 11, season_id = si) # all matches of the season
    match_ids = mat.match_id.unique()
    for mi in match_ids:
        events = sb.events(match_id = mi) # event data of a single match
        ev[i] = events
        i += 1
L = list(ev.values())
events = pd.concat(L)
Note that the above chunk will take quite some time to finish collecting the data. The reader should take a break in the meantime and go grab a cup of coffee/tea! It is recommended to store this dataset as a .csv file for later use, so that these slow API calls need not be repeated.
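A minimal sketch of that round trip (the filename here is just an assumption):
events.to_csv('la_liga_events.csv') # hypothetical filename; save the collected events to disk
events = pd.read_csv('la_liga_events.csv') # reload later without re-querying the API
Now let us filter the dataset by discarding unnecessary columns and picking up those which are relevant to pass events: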
E_pass = events[['type', 'pass_angle', 'pass_height', 'pass_length', 'pass_outcome', 'team']]
print(E_pass.head(10).to_markdown())
## | | Unnamed: 0 | type | pass_angle | pass_height | pass_length | pass_outcome | team |
## |---:|-------------:|:------------|-------------:|:--------------|--------------:|---------------:|:-----------------|
## | 0 | 0 | Starting XI | nan | nan | nan | nan | Deportivo Alavés |
## | 1 | 1 | Starting XI | nan | nan | nan | nan | Barcelona |
## | 2 | 2 | Half Start | nan | nan | nan | nan | Barcelona |
## | 3 | 3 | Half Start | nan | nan | nan | nan | Deportivo Alavés |
## | 4 | 4 | Half Start | nan | nan | nan | nan | Barcelona |
## | 5 | 5 | Half Start | nan | nan | nan | nan | Deportivo Alavés |
## | 6 | 6 | Pass | 3.09995 | Ground Pass | 16.8146 | nan | Barcelona |
## | 7 | 7 | Pass | -2.25894 | Ground Pass | 11.6516 | nan | Barcelona |
## | 8 | 8 | Pass | 1.71269 | Ground Pass | 7.77817 | nan | Barcelona |
## | 9 | 9 | Pass | -1.51327 | Ground Pass | 19.1316 | nan | Barcelona |
print(len(E_pass))
## 402627
First, we will keep only those rows where type is set to 'Pass' and team is set to 'Barcelona':
E_pass = E_pass[(E_pass['type'] == 'Pass') & (E_pass['team'] == 'Barcelona')]
print(E_pass.head(10).to_markdown())
## | | Unnamed: 0 | type | pass_angle | pass_height | pass_length | pass_outcome | team |
## |---:|-------------:|:-------|-------------:|:--------------|--------------:|---------------:|:----------|
## | 6 | 6 | Pass | 3.09995 | Ground Pass | 16.8146 | nan | Barcelona |
## | 7 | 7 | Pass | -2.25894 | Ground Pass | 11.6516 | nan | Barcelona |
## | 8 | 8 | Pass | 1.71269 | Ground Pass | 7.77817 | nan | Barcelona |
## | 9 | 9 | Pass | -1.51327 | Ground Pass | 19.1316 | nan | Barcelona |
## | 10 | 10 | Pass | 1.27468 | Ground Pass | 6.16847 | nan | Barcelona |
## | 11 | 11 | Pass | 2.50258 | Ground Pass | 22.3002 | nan | Barcelona |
## | 12 | 12 | Pass | 1.31242 | Ground Pass | 14.4807 | nan | Barcelona |
## | 13 | 13 | Pass | -2.30539 | Ground Pass | 20.8866 | nan | Barcelona |
## | 14 | 14 | Pass | -0.447427 | High Pass | 38.5996 | nan | Barcelona |
## | 15 | 15 | Pass | -2.16891 | Low Pass | 24.6854 | nan | Barcelona |
print(len(E_pass))
## 72069
Note that we have reduced the size of the dataset from 402627 rows to 72069. We also see that pass_height in E_pass is a categorical column, so we need to engineer this feature to give it numerical values. Let us check the unique types in pass_height:
print(E_pass.pass_height.unique())
## ['Ground Pass' 'High Pass' 'Low Pass']
Intuition tells us that the pass height affects the chance of success, with a 'Ground Pass' the most likely to be completed. We can use a lookup table to assign these categories ordered numerical values. Let us create a dict object to do so:
pass_height_types = {'Ground Pass': 3, 'High Pass': 2, 'Low Pass': 1}
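This encoding assumes an ordering of the pass heights, a higher value meaning a pass more likely to succeed. If we preferred not to impose an order, one-hot encoding would be an alternative; a minimal sketch using pandas (not used in the rest of this post):
E_pass_onehot = pd.get_dummies(E_pass, columns = ['pass_height']) # one indicator column per pass height type
We stick with the ordinal encoding here, as it keeps the feature set small.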
We will replace the entries of the pass_height column with the above numerical values from the dictionary:
E_pass_new = E_pass.replace({"pass_height": pass_height_types})
print(E_pass_new.head(10).to_markdown())
## | | Unnamed: 0 | type | pass_angle | pass_height | pass_length | pass_outcome | team |
## |---:|-------------:|:-------|-------------:|--------------:|--------------:|---------------:|:----------|
## | 6 | 6 | Pass | 3.09995 | 3 | 16.8146 | nan | Barcelona |
## | 7 | 7 | Pass | -2.25894 | 3 | 11.6516 | nan | Barcelona |
## | 8 | 8 | Pass | 1.71269 | 3 | 7.77817 | nan | Barcelona |
## | 9 | 9 | Pass | -1.51327 | 3 | 19.1316 | nan | Barcelona |
## | 10 | 10 | Pass | 1.27468 | 3 | 6.16847 | nan | Barcelona |
## | 11 | 11 | Pass | 2.50258 | 3 | 22.3002 | nan | Barcelona |
## | 12 | 12 | Pass | 1.31242 | 3 | 14.4807 | nan | Barcelona |
## | 13 | 13 | Pass | -2.30539 | 3 | 20.8866 | nan | Barcelona |
## | 14 | 14 | Pass | -0.447427 | 2 | 38.5996 | nan | Barcelona |
## | 15 | 15 | Pass | -2.16891 | 1 | 24.6854 | nan | Barcelona |
First, we will study logistic regression for building a binary classifier model. For this, our pass_outcome column should take exactly two values: 0 for unsuccessful passes and 1 for successful passes. Let us now look into the unique entries of the pass_outcome column:
print(E_pass_new.pass_outcome.unique())
## [nan 'Incomplete' 'Out' 'Unknown' 'Injury Clearance' 'Pass Offside']
We know that in StatsBomb data a pass_outcome with a nan value actually denotes a successful pass. So we will replace the nan values in this column with 1 and all other values with 0:
pass_outcome_types = {'Incomplete':0, 'Out':0, 'Unknown':0, 'Injury Clearance':0, 'Pass Offside':0}
E_pass_new = E_pass_new.replace({"pass_outcome": pass_outcome_types})
E_pass_new = E_pass_new.fillna({'pass_outcome':1})
print(E_pass_new.head(10).to_markdown())
## | | Unnamed: 0 | type | pass_angle | pass_height | pass_length | pass_outcome | team |
## |---:|-------------:|:-------|-------------:|--------------:|--------------:|---------------:|:----------|
## | 6 | 6 | Pass | 3.09995 | 3 | 16.8146 | 1 | Barcelona |
## | 7 | 7 | Pass | -2.25894 | 3 | 11.6516 | 1 | Barcelona |
## | 8 | 8 | Pass | 1.71269 | 3 | 7.77817 | 1 | Barcelona |
## | 9 | 9 | Pass | -1.51327 | 3 | 19.1316 | 1 | Barcelona |
## | 10 | 10 | Pass | 1.27468 | 3 | 6.16847 | 1 | Barcelona |
## | 11 | 11 | Pass | 2.50258 | 3 | 22.3002 | 1 | Barcelona |
## | 12 | 12 | Pass | 1.31242 | 3 | 14.4807 | 1 | Barcelona |
## | 13 | 13 | Pass | -2.30539 | 3 | 20.8866 | 1 | Barcelona |
## | 14 | 14 | Pass | -0.447427 | 2 | 38.5996 | 1 | Barcelona |
## | 15 | 15 | Pass | -2.16891 | 1 | 24.6854 | 1 | Barcelona |
print(E_pass_new.pass_outcome.unique())
## [1. 0.]
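Before modelling, it is worth checking how balanced the two classes are; a quick sketch (output omitted here, but successful passes heavily outnumber unsuccessful ones for Barcelona):
print(E_pass_new.pass_outcome.value_counts(normalize = True)) # relative frequency of outcomes 1 and 0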
Now that we have manipulated our data, it is time to start building the model. For that we need the scikit-learn package, which is used for predictive statistical learning with Python; the user should install it with pip to begin with. As we are going to use logistic regression to build our model, we need to import the LogisticRegression class from scikit-learn:
from sklearn.linear_model import LogisticRegression
In addition to this, we need to split our dataset into a training dataset that is used to fit our classifier and a test dataset that is used to measure the accuracy of our model. So we need to import the train_test_split function from scikit-learn:
from sklearn.model_selection import train_test_split
Finally, we need to calculate the evaluation metrics of our model, so we import the metrics module too:
As pass_outcome is the column for the dependent variable and the rest are the columns for the independent variables, we have to divide the dataset into dependent and independent variables in the following way:
x = E_pass_new[['pass_angle', 'pass_height', 'pass_length']]
y = E_pass_new['pass_outcome']
print(x.head(10).to_markdown())
## | | pass_angle | pass_height | pass_length |
## |---:|-------------:|--------------:|--------------:|
## | 6 | 3.09995 | 3 | 16.8146 |
## | 7 | -2.25894 | 3 | 11.6516 |
## | 8 | 1.71269 | 3 | 7.77817 |
## | 9 | -1.51327 | 3 | 19.1316 |
## | 10 | 1.27468 | 3 | 6.16847 |
## | 11 | 2.50258 | 3 | 22.3002 |
## | 12 | 1.31242 | 3 | 14.4807 |
## | 13 | -2.30539 | 3 | 20.8866 |
## | 14 | -0.447427 | 2 | 38.5996 |
## | 15 | -2.16891 | 1 | 24.6854 |
print(y.head(10).to_markdown())
## | | pass_outcome |
## |---:|---------------:|
## | 6 | 1 |
## | 7 | 1 |
## | 8 | 1 |
## | 9 | 1 |
## | 10 | 1 |
## | 11 | 1 |
## | 12 | 1 |
## | 13 | 1 |
## | 14 | 1 |
## | 15 | 1 |
Here y is the outcome and x is the set of columns representing the features. Now we split the whole dataset into training and test datasets:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.4, random_state = 0)
Here, the argument train_size=0.4 states that 40% of our data will be used for training (and the remaining 60% for testing), and the argument random_state=0 ensures that the randomly selected rows that make up the two sets are the same every time the function is called on our original dataset.
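As a quick sanity check, with 72069 passes a 40/60 split should give 28827 training rows and 43242 test rows:
print(len(x_train), len(x_test)) # expected: 28827 43242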
Let us look into the training and test datasets:
print(x_train.head(10).to_markdown())
## | | pass_angle | pass_height | pass_length |
## |-------:|-------------:|--------------:|--------------:|
## | 12504 | -2.05555 | 3 | 16.9532 |
## | 43466 | 0.57442 | 2 | 42.8823 |
## | 248698 | 2.9105 | 3 | 17.4642 |
## | 114318 | 1.91635 | 3 | 15.9424 |
## | 375561 | -2.85014 | 3 | 20.8806 |
## | 129796 | 0.785398 | 3 | 18.3848 |
## | 43127 | 1.36116 | 3 | 14.4156 |
## | 165840 | -2.67795 | 2 | 6.7082 |
## | 394965 | -0.504861 | 2 | 43.4166 |
## | 158924 | -1.10715 | 3 | 11.1803 |
print(x_test.head(10).to_markdown())
## | | pass_angle | pass_height | pass_length |
## |-------:|-------------:|--------------:|--------------:|
## | 268532 | 0.432408 | 3 | 14.3178 |
## | 76336 | -0.922866 | 3 | 18.0602 |
## | 177139 | -1.24905 | 2 | 3.16228 |
## | 134226 | -2.19105 | 3 | 8.60233 |
## | 83969 | 0.18948 | 3 | 14.8661 |
## | 197188 | 1.15839 | 3 | 17.4642 |
## | 283676 | 3.07917 | 3 | 16.0312 |
## | 91742 | 0.0658676 | 3 | 37.9824 |
## | 233291 | 0.266252 | 3 | 11.4018 |
## | 91553 | 2.67795 | 3 | 11.1803 |
print(y_train.head(10).to_markdown())
## | | pass_outcome |
## |-------:|---------------:|
## | 12504 | 1 |
## | 43466 | 0 |
## | 248698 | 1 |
## | 114318 | 1 |
## | 375561 | 1 |
## | 129796 | 1 |
## | 43127 | 1 |
## | 165840 | 0 |
## | 394965 | 0 |
## | 158924 | 1 |
Now, we will create an instance of the logistic regression model:
lr = LogisticRegression()
Next, we will train our model on the training dataset:
lr.fit(x_train, y_train)
## LogisticRegression()
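It can be instructive to peek at what the model has learned; coef_ holds one weight per feature (pass_angle, pass_height, pass_length) and intercept_ the bias term (values omitted here):
print(lr.coef_, lr.intercept_) # learned weights and intercept of the fitted logistic model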
Once the training is done, we will predict the outcomes of the passes on the test data:
y_predicted = lr.predict(x_test)
Yes, we have generated our predicted outcomes. To check the accuracy of our model, we use the metrics.accuracy_score() function:
accuracy = metrics.accuracy_score(y_test, y_predicted)
accuracy
## 0.8747282734378613
We see that our classification model has around 87.47% accuracy. Not bad! Next, we need a way to compare y_test and y_predicted in more detail. This is usually done by visualizing the confusion matrix (also called the error matrix):
error_matrix = metrics.confusion_matrix(y_test, y_predicted,labels = [0, 1])
sns.heatmap(error_matrix, annot=True, cmap = 'Blues_r', linewidths = 3, linecolor = 'red')
Now, from this confusion matrix we can calculate the true negatives, false positives, false negatives and true positives:
TN, FP, FN, TP = error_matrix.ravel()
print(TN, FP, FN, TP)
## 20 5392 25 37805
So, there are 20 true negatives, 5392 false positives, 25 false negatives and 37805 true positives to be precise. Note how lopsided these counts are: successful passes dominate the test set (37830 of the 43242 rows, about 87.5%), so a trivial model that always predicts a successful pass would achieve almost the same accuracy, something the accuracy score alone does not reveal. We can also visualize the errors by plotting a histogram of the difference between the predicted value and the true value:
sns.displot((y_test - y_predicted), bins = 50, color = 'red')
We can finally calculate the mean absolute error:
mae = metrics.mean_absolute_error(y_test, y_predicted)
mae
## 0.12527172656213867
So, our model prediction is off, on average, by the value given by mae. In fact, for a 0/1 outcome the mean absolute error is just the misclassification rate, (FP + FN)/n = (5392 + 25)/43242 ≈ 0.1253, which is exactly one minus the accuracy. Next, we will study how to perform multiclass classification on the same dataset by using another statistical learning algorithm called the Naive Bayes algorithm.
First, let us clean the E_pass dataset a little more by discarding those pass outcomes which are either 'Unknown' or 'Injury Clearance':
E_pass = E_pass[~E_pass['pass_outcome'].isin(['Unknown', 'Injury Clearance'])]
print(E_pass.pass_outcome.unique())
## [nan 'Incomplete' 'Out' 'Pass Offside']
As we are going to work with multiclass classification, let us modify the pass_outcome_types lookup table so that every remaining outcome gets its own label:
pass_outcome_types = {'Incomplete':0, 'Out':-1, 'Pass Offside':-2}
We will now alter E_pass by re-encoding the pass_height column and changing the pass_outcome column based on the new lookup table:
E_pass_new = E_pass.replace({"pass_height": pass_height_types})
E_pass_new = E_pass_new.replace({"pass_outcome": pass_outcome_types})
E_pass_new = E_pass_new.fillna({'pass_outcome':1})
print(E_pass_new.head(10).to_markdown())
## | | Unnamed: 0 | type | pass_angle | pass_height | pass_length | pass_outcome | team |
## |---:|-------------:|:-------|-------------:|--------------:|--------------:|---------------:|:----------|
## | 6 | 6 | Pass | 3.09995 | 3 | 16.8146 | 1 | Barcelona |
## | 7 | 7 | Pass | -2.25894 | 3 | 11.6516 | 1 | Barcelona |
## | 8 | 8 | Pass | 1.71269 | 3 | 7.77817 | 1 | Barcelona |
## | 9 | 9 | Pass | -1.51327 | 3 | 19.1316 | 1 | Barcelona |
## | 10 | 10 | Pass | 1.27468 | 3 | 6.16847 | 1 | Barcelona |
## | 11 | 11 | Pass | 2.50258 | 3 | 22.3002 | 1 | Barcelona |
## | 12 | 12 | Pass | 1.31242 | 3 | 14.4807 | 1 | Barcelona |
## | 13 | 13 | Pass | -2.30539 | 3 | 20.8866 | 1 | Barcelona |
## | 14 | 14 | Pass | -0.447427 | 2 | 38.5996 | 1 | Barcelona |
## | 15 | 15 | Pass | -2.16891 | 1 | 24.6854 | 1 | Barcelona |
We are going to apply the Naive Bayes algorithm to build our multiclass classification model. Naive Bayes algorithms are a family of simple probabilistic classifiers built upon Bayes' theorem, with the naive assumption that all features are independent of each other given the class: for features (x1, ..., xn) and class c, the model scores P(c | x1, ..., xn) ∝ P(c) P(x1 | c) ... P(xn | c), and the Gaussian variant models each P(xi | c) as a normal distribution. First we need to import the GaussianNB class from scikit-learn:
from sklearn.naive_bayes import GaussianNB
We will next divide our dataset into dependent and independent variables:
x = E_pass_new[['pass_angle', 'pass_height', 'pass_length']]
y = E_pass_new['pass_outcome']
print(x.head(10).to_markdown())
## | | pass_angle | pass_height | pass_length |
## |---:|-------------:|--------------:|--------------:|
## | 6 | 3.09995 | 3 | 16.8146 |
## | 7 | -2.25894 | 3 | 11.6516 |
## | 8 | 1.71269 | 3 | 7.77817 |
## | 9 | -1.51327 | 3 | 19.1316 |
## | 10 | 1.27468 | 3 | 6.16847 |
## | 11 | 2.50258 | 3 | 22.3002 |
## | 12 | 1.31242 | 3 | 14.4807 |
## | 13 | -2.30539 | 3 | 20.8866 |
## | 14 | -0.447427 | 2 | 38.5996 |
## | 15 | -2.16891 | 1 | 24.6854 |
print(y.head(10).to_markdown())
## | | pass_outcome |
## |---:|---------------:|
## | 6 | 1 |
## | 7 | 1 |
## | 8 | 1 |
## | 9 | 1 |
## | 10 | 1 |
## | 11 | 1 |
## | 12 | 1 |
## | 13 | 1 |
## | 14 | 1 |
## | 15 | 1 |
Now, we split the whole dataset into training and test datasets:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.4, random_state = 0)
print(x_train.head(10).to_markdown())
## | | pass_angle | pass_height | pass_length |
## |-------:|-------------:|--------------:|--------------:|
## | 24742 | -0.214967 | 3 | 23.4395 |
## | 150862 | -0.432408 | 3 | 14.3178 |
## | 341342 | -0.124355 | 2 | 8.06226 |
## | 106600 | 2.14671 | 3 | 9.18096 |
## | 209720 | 1.71269 | 3 | 7.07107 |
## | 177450 | 1.15257 | 3 | 19.6977 |
## | 205081 | 0.291457 | 2 | 10.4403 |
## | 287988 | 1.3734 | 3 | 5.09902 |
## | 318663 | -1.87668 | 3 | 19.9249 |
## | 283670 | -1.3633 | 3 | 19.4165 |
print(x_test.head(10).to_markdown())
## | | pass_angle | pass_height | pass_length |
## |-------:|-------------:|--------------:|--------------:|
## | 84315 | -2.10613 | 3 | 16.8585 |
## | 237632 | -0.927295 | 1 | 5 |
## | 76289 | -0.268841 | 3 | 10.1651 |
## | 15998 | -2.80159 | 1 | 15.5926 |
## | 364198 | 2.14213 | 3 | 16.6433 |
## | 129666 | -3.04192 | 3 | 10.0499 |
## | 329616 | 2.70175 | 3 | 18.7883 |
## | 133871 | -1.73595 | 3 | 12.1655 |
## | 75295 | -0.54172 | 1 | 13.1883 |
## | 229777 | 2.35619 | 1 | 4.24264 |
print(y_train.head(10).to_markdown())
## | | pass_outcome |
## |-------:|---------------:|
## | 24742 | 1 |
## | 150862 | 1 |
## | 341342 | 1 |
## | 106600 | 1 |
## | 209720 | 1 |
## | 177450 | 1 |
## | 205081 | 0 |
## | 287988 | 0 |
## | 318663 | 1 |
## | 283670 | 1 |
print(y_test.head(10).to_markdown())
## | | pass_outcome |
## |-------:|---------------:|
## | 84315 | 1 |
## | 237632 | 1 |
## | 76289 | 1 |
## | 15998 | 1 |
## | 364198 | 1 |
## | 129666 | 1 |
## | 329616 | 1 |
## | 133871 | 1 |
## | 75295 | 1 |
## | 229777 | 1 |
Then, we will create an instance of the Naive Bayes model:
nb = GaussianNB()
Next, we will train our model on the training dataset:
nb.fit(x_train, y_train)
## GaussianNB()
Once the training is done, we will predict the outcomes of the passes on the test data:
y_predicted = nb.predict(x_test)
After we predict the outcomes, we test the accuracy of our model:
accuracy = metrics.accuracy_score(y_test, y_predicted)
accuracy
## 0.8430598453356866
Our model has an accuracy of about 84.3%. We then compute and visualize the error matrix; since we now have four classes, it is a 4×4 matrix rather than the 2×2 matrix of the binary case, and ravel() flattens its 16 entries row by row:
error_matrix = metrics.confusion_matrix(y_test, y_predicted,labels = [-2, -1, 0, 1])
sns.heatmap(error_matrix, annot=True, cmap = 'Blues_r', linewidths = 3, linecolor = 'red')
## <AxesSubplot:>
plt.show()
error_matrix.ravel()
## array([ 0, 1, 37, 108, 0, 31, 144, 211, 0,
## 91, 1114, 3467, 0, 92, 2607, 35158], dtype=int64)
Alternatively, pd.crosstab builds the same matrix with labelled axes:
error_matrix = pd.crosstab(y_test, y_predicted, rownames=['Original'], colnames=['Predicted'])
sns.heatmap(error_matrix, annot=True, cmap = 'Blues_r', linewidths = 3, linecolor = 'red')
## <AxesSubplot:xlabel='Predicted', ylabel='Original'>
plt.show()
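Beyond raw counts, a per-class breakdown of precision and recall is informative for such a skewed multiclass problem; a minimal sketch (output omitted):
print(metrics.classification_report(y_test, y_predicted)) # per-class precision, recall and F1-score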
Finally, let us visualize the difference histogram and compute the mean absolute error:
sns.displot((y_test - y_predicted), bins = 50, color = 'blue')
mae = metrics.mean_absolute_error(y_test, y_predicted)
mae
## 0.16985207031885
Note that the mean absolute error is less meaningful here than in the binary case, since the numerical distances between our arbitrary class codes (-2, -1, 0, 1) carry no real significance; the confusion matrix is the better diagnostic. This completes our post on classifying different pass outcomes using two statistical learning algorithms, one for binary classification and the other for multiclass classification.