A Data Science Perspective to playing FIFA 19
I’ve always loved football.
And FIFA was what got me into football, as a child.
I loved FIFA.
Still do, for that matter. The difference now, though, is that I can harness the power of data visualization!
And that’s what this blog post is about.
A huge shout-out to the guys who posted this amazing dataset!
Find it here.
Without further ado, let’s dive in (not the Ramos way, though).
Let’s start with the pesky imports and get them out of the way.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = 'dark')
And of course, the data itself.
fifa = pd.read_csv('data.csv',index_col=0)
THE (Real) BEST Player
We all can agree that The Best Player awards were… vague. Let’s rectify that.
The simple objective is to find the best player based on various attributes!
The Overall Rating
So, what does the distribution of overall ratings look like?
plt.figure(figsize=(18,10))
sns.countplot(x='Overall', data=fifa, palette='rocket')
plt.show()
Looks like we have quite a normal distribution here. Kudos to the FIFA team on that.
Not surprised though, I expected something similar. Most players are average, some are just disappointingly wasted, and some extend all the way to extreme levels of awesomeness.
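Eyeballing the plot is fine, but the "normal distribution" hunch can also be checked with a few summary numbers. Here's a minimal sketch, assuming a hypothetical helper called `describe_shape` and a small made-up series of ratings (not taken from the dataset). Skew and excess kurtosis close to zero would be consistent with a bell shape.

```python
import pandas as pd


def describe_shape(ratings: pd.Series) -> dict:
    """Summarise how bell-shaped a column of ratings looks."""
    return {
        'mean': ratings.mean(),
        'median': ratings.median(),
        'std': ratings.std(),
        'skew': ratings.skew(),      # about 0 for a symmetric distribution
        'kurtosis': ratings.kurt(),  # excess kurtosis, about 0 for a normal curve
    }


# Hypothetical toy ratings, NOT taken from the FIFA dataset
toy = pd.Series([60, 62, 64, 66, 66, 66, 68, 70, 72])
shape = describe_shape(toy)
print(shape)
```

On the real data this would be called as `describe_shape(fifa['Overall'])`.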
Moving on…
The Eldest Players
fifa.sort_values(by = 'Age' , ascending = False)[['Name','Club','Nationality','Overall', 'Age' ]].head(5)
Index | Name | Club | Nationality | Overall | Age |
---|---|---|---|---|---|
4741 | O. Pérez | Pachuca | Mexico | 71 | 45 |
18183 | K. Pilkington | Cambridge United | England | 48 | 44 |
17726 | T. Warner | Accrington Stanley | Trinidad & Tobago | 53 | 44 |
10545 | S. Narazaki | Nagoya Grampus | Japan | 65 | 42 |
7225 | C. Muñoz | CD Universidad de Concepción | Argentina | 68 | 41 |
Not going to lie, I’ve never heard of these players. But a look at their overall might explain it!
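This sort-then-head pattern shows up again and again in this post, so it could be wrapped in a small helper. A minimal sketch, assuming a hypothetical function named `top_players` and a made-up three-row frame for illustration. It uses `DataFrame.nlargest`, which may order ties differently from `sort_values`.

```python
import pandas as pd


def top_players(df: pd.DataFrame, column: str, n: int = 5) -> pd.DataFrame:
    """Return the n rows with the largest values in `column`."""
    cols = ['Name', 'Club', 'Nationality', 'Overall', 'Age']
    if column not in cols:
        cols.append(column)  # show the ranked attribute alongside the basics
    return df.nlargest(n, column)[cols]


# Hypothetical toy frame, NOT real players
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Club': ['X', 'Y', 'Z'],
    'Nationality': ['N1', 'N2', 'N3'],
    'Overall': [70, 80, 60],
    'Age': [30, 25, 35],
    'FKAccuracy': [50.0, 90.0, 70.0],
})

oldest = top_players(toy, 'Age', n=2)
best_fk = top_players(toy, 'FKAccuracy', n=2)
print(oldest)
print(best_fk)
```

For the youngest players, `DataFrame.nsmallest` would be the mirror-image call.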
The Youngest Players
fifa.sort_values(by = 'Age' , ascending = True)[['Name','Club','Nationality','Overall', 'Age','Potential' ]].head(5)
Index | Name | Club | Nationality | Overall | Age | Potential |
---|---|---|---|---|---|---|
18206 | G. Nugent | Tranmere Rovers | England | 46 | 16 | 66 |
17743 | J. Olstad | Sarpsborg 08 FF | Norway | 52 | 16 | 69 |
13293 | H. Massengo | AS Monaco | France | 62 | 16 | 75 |
16081 | J. Italiano | Perth Glory | Australia | 58 | 16 | 79 |
18166 | N. Ayéva | Örebro SK | Sweden | 48 | 16 | 72 |
Again, who are they even?
Sure does look like my boy Massengo has a bright future.
The Best Freekick Takers
fifa.sort_values(by = 'FKAccuracy' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','FKAccuracy']].head(5)
Index | Name | Club | Nationality | Overall | Age | FKAccuracy |
---|---|---|---|---|---|---|
0 | L. Messi | FC Barcelona | Argentina | 94 | 31 | 94.0 |
293 | S. Giovinco | Toronto FC | Italy | 82 | 31 | 93.0 |
72 | M. Pjanić | Juventus | Bosnia Herzegovina | 86 | 28 | 92.0 |
1113 | E. Bardhi | Levante UD | FYR Macedonia | 77 | 22 | 91.0 |
449 | H. Çalhanoğlu | Milan | Turkey | 80 | 24 | 90.0 |
Okay, this is clearly no surprise. Messi at the top makes sense.
But Ronaldo not in the top 5? What’s up EA?
The Best Penalty Kick Taker
fifa.sort_values(by = 'Penalties' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','Penalties']].head(5)
Index | Name | Club | Nationality | Overall | Age | Penalties |
---|---|---|---|---|---|---|
206 | M. Balotelli | OGC Nice | Italy | 83 | 27 | 92.0 |
118 | Fabinho | Liverpool | Brazil | 84 | 24 | 91.0 |
16 | H. Kane | Tottenham Hotspur | England | 89 | 24 | 90.0 |
297 | M. Kruse | SV Werder Bremen | Germany | 82 | 30 | 90.0 |
945 | L. Baines | Everton | England | 77 | 33 | 90.0 |
Wow, I thought Ronaldo would definitely make at least this list. So much for calling him penaldo huh?
The One with the Ball Control
fifa.sort_values(by = 'BallControl' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','BallControl']].head(5)
Index | Name | Club | Nationality | Overall | Age | BallControl |
---|---|---|---|---|---|---|
0 | L. Messi | FC Barcelona | Argentina | 94 | 31 | 96.0 |
2 | Neymar Jr | Paris Saint-Germain | Brazil | 92 | 26 | 95.0 |
30 | Isco | Real Madrid | Spain | 88 | 26 | 95.0 |
5 | E. Hazard | Chelsea | Belgium | 91 | 27 | 94.0 |
1 | Cristiano Ronaldo | Juventus | Portugal | 94 | 33 | 94.0 |
Ah, this gave us pretty standard values.
The Fastest
Okay, pretty sure Mbappé and Sané should be near the top, but let’s see where FIFA ranks them.
fifa.sort_values(by = 'SprintSpeed' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','SprintSpeed']].head(5)
Index | Name | Club | Nationality | Overall | Age | SprintSpeed |
---|---|---|---|---|---|---|
1968 | Adama | Wolverhampton Wanderers | Spain | 75 | 22 | 96.0 |
55 | L. Sané | Manchester City | Germany | 86 | 22 | 96.0 |
25 | K. Mbappé | Paris Saint-Germain | France | 88 | 19 | 96.0 |
1489 | I. Bebou | Hannover 96 | Togo | 76 | 24 | 95.0 |
36 | G. Bale | Real Madrid | Wales | 88 | 28 | 95.0 |
Finally getting a few predictions right, huh?
Famous Clubs
Okay, let’s start going through the data club-wise.
Age Distribution
Let’s start with the age distribution in these clubs. My bet: Barcelona will top this list, given the number of old players they have.
clubs = ['Chelsea', 'Arsenal', 'Juventus', 'Paris Saint-Germain', 'FC Bayern München',
         'Real Madrid', 'FC Barcelona', 'Borussia Dortmund', 'Manchester United',
         'FC Porto', 'Liverpool', 'Manchester City']
Questionable decision adding Liverpool, I know.
fifa_club_age = fifa.loc[fifa['Club'].isin(clubs) & fifa['Age'].notna()]
plt.figure(1 , figsize = (15 ,7))
sns.violinplot(x = 'Club' , y = 'Age' , data = fifa_club_age,palette='rocket')
plt.title('Age Distribution in famous clubs')
plt.xticks(rotation = 50)
plt.show()
Well, that’s something.
Looks like Real Madrid, Liverpool and Porto have the most young talent. Good to know!
Overall Rating
fifa_club_rating = fifa.loc[fifa['Club'].isin(clubs) & fifa['Overall'].notna()]
plt.figure(1 , figsize = (15 ,7))
sns.violinplot(x = 'Club' , y = 'Overall' , data = fifa_club_rating, palette='rocket')
plt.title('Overall Rating Distribution in famous clubs')
plt.xticks(rotation = 50)
plt.show()
Woah, look at Juventus go. Pretty sure they’ll lead in Freekick Accuracy as well.
Really surprised by Barcelona and Real Madrid though. Ah, the things data can tell you.
The Best Club?
Real Madrid supporter here, so you know what I would want.
best_dict = {}
for club in fifa['Club'].unique():
    # Sum the overall ratings of every player at this club
    overall_rating = fifa['Overall'][fifa['Club'] == club].sum()
    best_dict[club] = overall_rating
best_club = pd.DataFrame.from_dict(best_dict,orient='index', columns = ['overall'])
best_club['club'] = best_club.index
best_club = best_club.sort_values(by = 'overall' , ascending = False)
plt.figure(1 , figsize = (15 , 6))
sns.barplot(x = 'club' , y = 'overall' , data = best_club.head(5),palette='rocket')
plt.xticks(rotation = 70)
plt.xlabel("Club")
plt.ylabel('Sum of Overall Rating of players in club')
plt.title('Clubs with best Players (sum of overall ratings of players per club)')
plt.ylim(2450 , 2600)
plt.show()
Ha, at the top.
Even Manchester United. Sweet, got 2 of my favorite clubs in the top 5.
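One thing to keep in mind: a sum of ratings also rewards clubs for simply having more players listed. A minimal sketch of the same ranking done with `groupby`, which also reports the squad size and the average next to the total. The function name `club_totals` and the toy frame are hypothetical, just for illustration.

```python
import pandas as pd


def club_totals(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Rank clubs by summed Overall, keeping squad size and average alongside."""
    totals = (
        df.groupby('Club')['Overall']
          .agg(total='sum', players='count', average='mean')
          .sort_values('total', ascending=False)
    )
    return totals.head(n)


# Hypothetical toy data, NOT from the FIFA dataset
toy = pd.DataFrame({
    'Club': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Overall': [80, 90, 85, 70, 70, 70],
})

ranking = club_totals(toy)
print(ranking)
```

In the toy data, club C tops the sum with three 70-rated players even though its average is the lowest, which is exactly the effect to watch out for.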
Popular Countries
Age Distribution
countries = ['England' , 'Brazil' , 'Portugal' ,'Argentina',
'Italy' , 'Spain' , 'Germany' ,'Netherlands','India']
India, you ask? Can’t not. My country after all!
fifa_country_age = fifa.loc[fifa['Nationality'].isin(countries) & fifa['Age'].notna()]
plt.figure(1 , figsize = (15 ,7))
sns.violinplot(x = 'Nationality' , y = 'Age' , data = fifa_country_age, palette='rocket')
plt.title('Age Distribution in popular countries')
plt.xticks(rotation = 50)
plt.show()
Looks like it might finally be good times for England?
But hey, Spain and Germany aren’t too far behind either.
Guess the Overall Rating can tell us more.
Overall Rating
fifa_country_rating = fifa.loc[fifa['Nationality'].isin(countries) & fifa['Overall'].notna()]
plt.figure(1 , figsize = (15 ,7))
sns.violinplot(x = 'Nationality' , y = 'Overall' , data = fifa_country_rating, palette='rocket')
plt.title('Overall Rating Distribution in popular countries')
plt.xticks(rotation = 50)
plt.show()
Brazil really seems to be pretty dominant here. Must be the experienced players, as the previous plots have shown us.
The Best Country?
Germany might be there at the top, along with Spain maybe?
What do you think?
best_dict = {}
for country in fifa['Nationality'].unique():
    # Sum the overall ratings of every player from this country
    overall_rating = fifa['Overall'][fifa['Nationality'] == country].sum()
    best_dict[country] = overall_rating
best_country = pd.DataFrame.from_dict(best_dict,orient='index', columns = ['overall'])
best_country['club'] = best_country.index
best_country = best_country.sort_values(by = 'overall' , ascending = False)
plt.figure(1 , figsize = (15 , 6))
sns.barplot(x = 'club' , y = 'overall' , data = best_country.head(10),palette='rocket')
plt.xticks(rotation = 70)
plt.xlabel("Country")
plt.ylabel('Sum of Overall Rating of players in a country')
plt.title('Countries with best Players (sum of overall ratings of players per country)')
plt.show()
England, wow, that’s just… well y’know.
But let’s put things into perspective.
best_dict = {}
for country in countries:
    # Number of players in the dataset with this nationality
    count = fifa['Overall'][fifa['Nationality'] == country].count()
    best_dict[country] = count
best_country = pd.DataFrame.from_dict(best_dict,orient='index', columns = ['count'])
best_country['club'] = best_country.index
sns.barplot(x = 'club' , y = 'count' , data = best_country, palette='rocket')
plt.xticks(rotation = 70)
plt.xlabel("Country")
plt.ylabel('Count of players in a country')
plt.show()
And this is where we realise that England just has a lot of players who’re probably just average.
But let’s visualize that as well.
best_dict = {}
for country in countries:
    overall = fifa['Overall'][fifa['Nationality'] == country].sum()
    count = fifa['Overall'][fifa['Nationality'] == country].count()
    # Average overall rating per player for this country
    country_overall = overall / count
    best_dict[country] = country_overall
best_country = pd.DataFrame.from_dict(best_dict,orient='index', columns = ['country_overall'])
best_country['club'] = best_country.index
sns.barplot(x = 'club' , y = 'country_overall' , data = best_country, palette='rocket')
plt.xticks(rotation = 70)
plt.xlabel("Country")
plt.ylabel('Average overall rating of players in a country')
plt.show()
Hey, look. India is pretty close to England now huh?
Now, before you get triggered, take a closer look at the dataset. It lists every player from England, not just those who play for the national team.
Hence, this is an accurate representation of all the players from England, not of their national team as such.
If you did want to visualise their national team, you’ll need to provide Player Names as well.
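Without a list of squad names, one rough stand-in is to average only the highest-rated players per country. Here's a minimal sketch under that assumption: treating the top 23 by Overall as a "national squad" is a guess for illustration, not something the dataset encodes, and the `top_k_mean` helper and toy frame are hypothetical.

```python
import pandas as pd


def top_k_mean(df: pd.DataFrame, countries: list, k: int = 23) -> pd.Series:
    """Average Overall of the k best-rated players per country (ASSUMED squad proxy)."""
    subset = df[df['Nationality'].isin(countries)]
    # Keep the k highest-rated rows within each nationality
    best = (
        subset.sort_values('Overall', ascending=False)
              .groupby('Nationality')
              .head(k)
    )
    return best.groupby('Nationality')['Overall'].mean().sort_values(ascending=False)


# Hypothetical toy data, NOT from the FIFA dataset
toy = pd.DataFrame({
    'Nationality': ['England', 'England', 'England', 'England', 'India', 'India'],
    'Overall': [80, 70, 50, 50, 60, 58],
})

squad_avg = top_k_mean(toy, ['England', 'India'], k=2)
print(squad_avg)
```

In the toy data, England's average over all four players is 62.5, but over its best two it is 75, so the gap to India widens once the long tail of average players is dropped.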
Positions of Players
plt.figure(1 , figsize = (15 , 6))
sns.countplot(x = 'Position' , data = fifa , palette = 'rocket' )
plt.title('Count Plot of Positions of players')
plt.show()
Pretty much what is expected.
But let’s look at the five attributes that rate highest, on average, for each position.
# Let's define the various player features
player_features = ['Crossing', 'Finishing', 'HeadingAccuracy',
'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy',
'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed',
'Agility', 'Reactions', 'Balance', 'ShotPower', 'Jumping',
'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions',
'Positioning', 'Vision', 'Penalties', 'Composure', 'Marking',
'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
'GKKicking', 'GKPositioning', 'GKReflexes']
for i, val in fifa.groupby(fifa['Position'])[player_features].mean().iterrows():
    print('Position {}: {}, {}, {}, {}, {}'.format(i, *tuple(val.nlargest(5).index)))
Position CAM: Balance, Agility, Acceleration, SprintSpeed, BallControl
Position CB: Strength, Jumping, StandingTackle, Aggression, HeadingAccuracy
Position CDM: Stamina, Aggression, Strength, ShortPassing, Jumping
Position CF: Agility, Balance, Acceleration, SprintSpeed, Dribbling
Position CM: Balance, ShortPassing, Agility, Stamina, Acceleration
Position GK: GKReflexes, GKDiving, GKPositioning, GKHandling, GKKicking
Position LAM: Agility, Balance, SprintSpeed, Acceleration, Dribbling
Position LB: SprintSpeed, Acceleration, Stamina, Balance, Agility
Position LCB: Strength, Jumping, StandingTackle, Aggression, HeadingAccuracy
Position LCM: Stamina, ShortPassing, Balance, Agility, BallControl
Position LDM: Stamina, ShortPassing, Strength, Aggression, BallControl
Position LF: Balance, Agility, Acceleration, Dribbling, BallControl
Position LM: Acceleration, SprintSpeed, Agility, Balance, Dribbling
Position LS: SprintSpeed, Strength, Acceleration, ShotPower, Positioning
Position LW: Acceleration, SprintSpeed, Agility, Balance, Dribbling
Position LWB: SprintSpeed, Acceleration, Stamina, Agility, Balance
Position RAM: Agility, Balance, Acceleration, SprintSpeed, Dribbling
Position RB: SprintSpeed, Stamina, Acceleration, Balance, Jumping
Position RCB: Strength, Jumping, Aggression, StandingTackle, HeadingAccuracy
Position RCM: Stamina, ShortPassing, Agility, Balance, BallControl
Position RDM: Stamina, ShortPassing, Aggression, Strength, Jumping
Position RF: Agility, Acceleration, Balance, BallControl, SprintSpeed
Position RM: Acceleration, SprintSpeed, Agility, Balance, Dribbling
Position RS: SprintSpeed, Strength, Acceleration, Agility, ShotPower
Position RW: Acceleration, SprintSpeed, Agility, Balance, Dribbling
Position RWB: SprintSpeed, Acceleration, Stamina, Agility, Balance
Position ST: SprintSpeed, Strength, Acceleration, Jumping, Finishing
The Top 10 Players
Ah, the thing you’ve all been waiting for.
fifa_best_players = fifa.sort_values(by = 'Overall' , ascending = False).head(10).copy()
plt.figure(1 , figsize = (15 , 5))
sns.barplot(x ='Name' , y = 'Overall' , data = fifa_best_players,palette='rocket')
plt.ylim(87 , 95)
plt.show()
No surprises there (from FIFA’s standpoint at least).
And lastly, The Highest Earner
If you take a look at the dataset, you’ll see that the Wage column writes amounts with different suffixes, i.e. M (millions) and K (thousands).
One way to deal with this is to convert everything to one base metric.
def normalizing_wage(x):
    # Strip the euro sign, then scale by the M (millions) or K (thousands) suffix
    if '€' in str(x) and 'M' in str(x):
        c = str(x).replace('€' , '')
        c = str(c).replace('M' , '')
        c = float(c) * 1000000
    else:
        c = str(x).replace('€' , '')
        c = str(c).replace('K' , '')
        c = float(c) * 1000
    return c
fifa['Normalized_Wage'] = fifa['Wage'].apply(lambda x : normalizing_wage(x))
fifa.sort_values(by = 'Normalized_Wage' , ascending = False)[['Name','Club','Nationality','Overall',
'Age','Normalized_Wage','Wage']].head(5)
Index | Name | Club | Nationality | Overall | Age | Normalized_Wage | Wage |
---|---|---|---|---|---|---|---|
0 | L. Messi | FC Barcelona | Argentina | 94 | 31 | 565000.0 | €565K |
7 | L. Suárez | FC Barcelona | Uruguay | 91 | 31 | 455000.0 | €455K |
6 | L. Modrić | Real Madrid | Croatia | 91 | 32 | 420000.0 | €420K |
1 | Cristiano Ronaldo | Juventus | Portugal | 94 | 33 | 405000.0 | €405K |
8 | Sergio Ramos | Real Madrid | Spain | 91 | 32 | 380000.0 | €380K |
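The same conversion can be written with a small suffix lookup, which makes it easier to handle missing values and amounts without any suffix. A minimal sketch, assuming a hypothetical helper named `parse_money`. Unlike the function above, it treats a value with no suffix as plain euros rather than thousands, which is a design choice and not something the dataset confirms. The `€1.5M` entry is a made-up example.

```python
import pandas as pd

# Suffix -> multiplier; anything without a suffix is ASSUMED to be plain euros
MULTIPLIERS = {'K': 1_000, 'M': 1_000_000}


def parse_money(value) -> float:
    """Convert strings such as '€565K' into a number of euros."""
    if pd.isna(value):
        return float('nan')
    text = str(value).replace('€', '').strip()
    if not text:
        return float('nan')
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    return float(text)


# '€565K' and '€455K' appear in the table above; the rest are hypothetical
wages = pd.Series(['€565K', '€455K', '€1.5M', '€0', None])
parsed = wages.apply(parse_money)
print(parsed)
```

On the real data this would be applied as `fifa['Wage'].apply(parse_money)`.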