The README of this repository is a resources guide of learning materials, data sources, libraries, papers, blogs, etc., created by all those who have contributed to the open source football analytics community. This GitHub repository and resources list is always a work in progress, with new resources added semi-regularly. If you feel there are any resources that I have missed, please feel free to create a pull request or send me a message on the links above and I'll get back to you as quickly as I can!
If you like the repo, please feel free to give it a star (top right). Cheers!
For more information about this repository and the author, see the following:
The code in this repository is written in a mix of both Python and R. Before you begin, ensure that you have the following prerequisites installed:
General Python data science libraries:
* NumPy for multidimensional array computing;
* pandas for data analysis and manipulation;
* matplotlib and Seaborn for data visualisation; and
* scikit-learn and SciPy for Machine Learning.

Football analytics Python libraries:
* kloppy - a package for standardising tracking and event data by Koen Vossen and Jan Van Haaren. See the YouTube tutorial [link]
* floodlight by floodlight-sports - a package for streamlined analysis of sports data. It is designed with a clear focus on scientific computing and built upon popular libraries such as numpy or pandas. See the following documentation [link]
* matplotsoccer - a Python library for visualising soccer event data by Tom Decroos
* mplsoccer - a Python library for plotting football pitches in matplotlib by Andrew Rowlinson
* PySport including PySport Soccer - a collection of open-source sport packages including many of those mentioned in this section, by Koen Vossen
* ScraperFC by Owen Seymour - a Python package to scrape data from FiveThirtyEight, FBref, Understat, Club Elo, Capology and TransferMarkt. Previously scraped Opta event data through the WhoScored? match center (functionality now removed but see old versions and GitHub repos to find this code)
* statsbombapi - a Python API wrapper and dataclasses for StatsBomb data
* statsbombpy - a Python library written by Francisco Goitia to access StatsBomb data
* socceraction - a Python library for valuing the individual actions performed by soccer players. Includes an Expected Threat (xT) implementation by Tom Decroos et al.
* soccer_xg by ML KU Leuven - a Python package for training and analyzing expected goals (xG) models in football
* soccerdata - scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, SoFIFA and WhoScored by Pieter Robberechts
* tyrone_mings by FCrSTATS - a Python TransferMarkt webscraper

General R data science libraries:
Football analytics R libraries:
* ggsoccer by Ben Torvaney - a soccer visualisation library in R
* ggshakeR by Abhishek Mishra - an analysis and visualisation R package that works with publicly available soccer data. See the following documentation [link]
* StatsBombR - an R package to easily stream StatsBomb data into R, either from the API using your login credentials or, free of charge, from the Open Data GitHub repository
* soccermatics by Joe Gallagher - an R package for the visualisation and analysis of soccer tracking and event data
* worldfootballR by Jason Zivkovic - an R package for extracting world football (soccer) data from FBref, TransferMarkt, Understat and fotmob (see guide on how to use this package [link])
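Before running any of the notebooks, it can help to check which of the general Python prerequisites are already available in your environment. The following is a minimal sketch using only the standard library; the function name `check_installed` is hypothetical, and the module names are the usual import names for the libraries listed above (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the general Python data science libraries listed above.
# Note that scikit-learn is imported as "sklearn".
GENERAL_LIBRARIES = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "scipy"]


def check_installed(module_names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    for name, installed in check_installed(GENERAL_LIBRARIES).items():
        print(f"{name}: {'installed' if installed else 'missing'}")
```

Anything reported as missing can then be installed with your package manager of choice.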
The contents of this GitHub repository are organised as follows:
eddwebster/football_analytics/ ➡️ central repository of code and analysis by Edd Webster ⚽
│
├── dashboards/ ➡️ store of Tableau dashboards used for analysis
│
├── data/ ➡️ a selection of raw and processed data extracts by various providers
│ ├── capology
│ ├── davies
│ ├── elo
│ ├── fbref
│ ├── fifa
│ ├── guardian
│ ├── metrica-sports
│ ├── opta
│ ├── reference
│ ├── sb
│ ├── shots
│ ├── stats-perform
│ ├── stratabet
│ ├── tm
│ ├── touchline-analytics
│ ├── twenty-first-group
│ ├── understat
│ └── wyscout
│
├── docs/ ➡️ store of documentation for different vendors
│ ├── centre-circle
│ ├── metrica-sports
│ ├── opta
│ ├── sb
│ ├── shots
│ ├── stratabet
│ └── wyscout
│
├── fonts/ ➡️ store of custom and externally acquired fonts used for data visualisation ✍️
│
├── .gitignore ➡️ ignore unnecessary files for version control with Git
│
├── img/ ➡️ store of images used for analysis including club badges, vendor logos and official media images
│ ├── club_badges/ # badges for football clubs
│ ├── edd_webster/ # images related to Edd Webster
│ ├── fig/ # generated figures derived from analysis and reports in this repository
│ ├── gif/ # GIF images
│ ├── memes/ # memes
│ ├── pitches/ # images of football pitches and goals used mostly for Tableau visualisation
│ ├── players/ # images of football players
│ ├── vendors/ # logos for data vendors e.g. StatsBomb
│ ├── vizpiration/ # high-quality visualisations and analysis from renowned members of the football analytics community
│ └── websites-blogs/ # logos for data analysis websites and blogs e.g. Club Elo
│
├── scripts/ ➡️ store of libraries and Python and open source code
│
├── notebooks/ ➡️ Jupyter notebooks for exploration and visualisation
│ ├── 1_data_scraping/ # notebooks with code to acquire data via webscraping
│ │ ├── Capology Player Salary Web Scraping.ipynb
│ │ ├── FBref Player Stats Web Scraping.ipynb
│ │ └── TransferMarkt Player Bio and Status Web Scraping.ipynb
│ │
│ ├── 2_data_parsing/ # notebooks with code to acquire data via APIs
│ │ ├── Elo Team Ratings Data Parsing.ipynb
│ │ ├── StatsBomb Data Parsing.ipynb
│ │ └── Wyscout Data Parsing.ipynb
│ │
│ ├── 3_data_engineering/ # notebooks with code to engineer raw, unprocessed data to processed data
│ │ ├── Capology Player Salary Data Engineering.ipynb
│ │ ├── Centre Circle Opta CPL Data Engineering.ipynb
│ │ ├── FBref Player Stats Data Engineering.ipynb
│ │ ├── Opta #mcfcanalytics PL 2011-2012.ipynb
│ │ ├── StatsBomb Data Engineering.ipynb
│ │ ├── The Guardian Player Recorded Transfer Fees Data Engineering.ipynb
│ │ ├── TransferMarkt Historical Market Value Data Engineering.ipynb
│ │ ├── TransferMarkt Player Bio and Status Data Engineering.ipynb
│ │ ├── TransferMarkt Player Recorded Transfer Fees Data Engineering.ipynb
│ │ ├── Understat Data Engineering.ipynb
│ │ └── Wyscout Data Engineering.ipynb
│ │
│ ├── 4_data_unification/ # notebooks with code to unify disparate datasets
│ │ └── Unification of Aggregated Seasonal Football Datasets.ipynb
│ │
│ └── 5_data_analysis_and_projects/ # notebooks with code for example projects and analysis
│   ├── player_similarity_and_clustering
│   │ └── PCA and K-Means Clustering of 'Piqué-like' Defenders.ipynb
│   │
│   ├── tracking_data
│   │ ├── metrica_sports
│   │ │ └── Metrica Tracking Data EDA.ipynb
│   │ └── signality
│   │   ├── Signality Tracking Data Engineering.ipynb
│   │   └── Signality Tracking Data EDA.ipynb
│   │
│   └── xg_modeling
│     ├── shots_dataset
│     │ ├── Logistic Regression Expected Goals Model.ipynb
│     │ └── XGBoost Expected Goals Model.ipynb
│     └── opta_dataset
│       └── Training of an Expected Goals Model Using Opta Event Data.ipynb
│
├── README.md ➡️ project description and setup guide for better structure and collaboration
│
├── research/ ➡️ central repository of research and publicly available resources in football analytics ⚽
│ ├── documents/ # documents
│ ├── papers/ # published academic papers and literature
│ └── slides/ # PowerPoint slides for published research
│
└── video/ ➡️ store of videos used or generated for analysis
The code in this repository is mostly written in Jupyter notebooks or Python scripts, organised in the following workflow:
For Tableau dashboards produced using the data engineered in the notebooks in this repository, please see my Tableau Public profile: public.tableau.com/profile/edd.webster.
Example Tableau dashboards:
Credit to the following resources that were all used to plug gaps in this resources guide once it was published:
* analytics-handbook GitHub repo by Devin Pleuler - a GitHub repo for getting started in soccer analytics
* awesome-football by football.db (Gerald Bauer) - a collection of awesome football datasets
* awesome-football-analytics by Diego Pastor
* awesome-soccer-analytics by Matias Mascioto
* guideR by Dom Samangy - a Google spreadsheet with 200+ R resources, 100+ Python tutorials, 30+ packages, 25+ accounts to follow, 10 cheatsheets, and several free books & blogs. GitHub repo [link]
* soccer-analytics-resources GitHub repo by Jan Van Haaren
Good resources for those new to the use of data in football:
* soccer-analytics-handbook by Devin Pleuler
* awesome-football-analytics by Diego Pastor
* awesome-soccer-analytics by Matias Mascioto
* soccer-analytics-resources by Jan Van Haaren
This section lists publicly available data sources and datasets relating to football, ranging from Tracking data and Event data to aggregated player performance data, detailed match statistics, injury records, transfer values, and more.
Data sources that have been used in the code and analysis in this repository can be found in the data subfolder of this repository or in Google Drive (due to GitHub's 100 MB file limit) [link]. However, all code in this repository should enable you to scrape, parse, and engineer the datasets as per the output used for the analysis and visualisations featured.
To learn more about the different types of data available, such as Event and Tracking data, see the "Where can I get data?" section of Devin Pleuler's soccer_analytics_handbook [link].
For a quick primer of the free football data resources available, see the following Twitter thread by James Nalton [link].
Event Data is labelled data for each on-the-ball event that takes place during a game. The data is manually collected from television footage. To learn more about the data collection, see the following video [link].
Each match of event data has around 2-3 thousand individual events (rows), depending on the provider.
The main providers of this data are StatsBomb, Stats Perform (formerly Opta), and Wyscout.
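To make the shape of event data concrete, here is a minimal sketch of a handful of event rows and two simple queries on them. The `Event` class, its field names and the rows themselves are purely illustrative assumptions; they do not follow the schema of any specific provider, each of which defines its own fields and coordinate system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Event:
    """One on-the-ball event. Field names are illustrative, not a provider schema."""
    minute: int
    team: str
    player: str
    event_type: str
    x: float  # location of the event, in whatever coordinate system the provider uses
    y: float


# Hypothetical rows for illustration; a real match has roughly 2-3 thousand of these.
events = [
    Event(1, "Home", "Player A", "Pass", 50.0, 40.0),
    Event(1, "Home", "Player B", "Pass", 62.5, 35.0),
    Event(2, "Away", "Player C", "Tackle", 64.0, 36.0),
    Event(2, "Home", "Player B", "Shot", 105.0, 42.0),
]

# Count how many events of each type were recorded.
events_per_type = Counter(event.event_type for event in events)

# Filter down to the shots taken by one team.
home_shots = [e for e in events if e.team == "Home" and e.event_type == "Shot"]

print(events_per_type)
print(home_shots)
```

In practice the same operations are usually done on a pandas DataFrame with one row per event, which is the format most of the libraries listed in the prerequisites return.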
Name | Comments | Source / method(s) to get the data |
---|---|---|
StatsBomb Open Data | | StatsBomb Open Data GitHub Repo |
StrataData by StrataBet | Chance shooting data provided | No longer made available (since 2018), however, it can be found in GitHub repos of old analysis (including this one) [link]. |
Soccer Video and Player Position Dataset | Dataset of elite soccer player movements and corresponding videos, made available by the University of Oslo. See the accompanying paper [link] | [Link] (appears to no longer be working) |
Opta | Event data for 20+ leagues including the 'Big 5' European leagues, some of which go back to the 09/10 season. | Data available by scraping WhoScored? Match Centre using the following methods: |
Opta (11/12 sample dataset) | Match-by-match aggregated player performance data for the 11/12 season and F24 Event data for an 11/12 match of Manchester City vs. Bolton Wanderers as part of the #mcfcanalytics initiative | No longer made available (since 2012), however, it can be found in GitHub repos of old analysis (including this one). |
Understat | Shooting and meta data including xG values for the 'Big 5' European leagues and Russian Premier League | This data can be accessed through the following: |
Wyscout | Event data for the 17/18 season for the 'Big 5' European leagues, the Euro 2016 Championship, and the 2018 World Cup made available by Luca Pappalardo, Alessio Rossi, and Paolo Cintia. See their paper A public data set of spatio-temporal match events in soccer competitions. | Figshare |
Tracking Data records the x and y coordinates of every player on the field, as well as the ball, a number of times per second (usually 10-25). For this reason, the dataset is quite large, much larger than event data at around 2-3 million rows per game.
The data is collected by cameras installed in a stadium and is therefore not widely available, with teams usually only having access to the data in their own league.
The main providers of this data are Second Spectrum, Stats Perform, Metrica Sports, and Signality.
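The row count quoted above follows from simple arithmetic. The sketch below assumes a long format with one row per tracked object per frame, 22 players plus the ball, and a 90-minute match with no stoppage time; the function name `tracking_rows_per_match` is hypothetical.

```python
def tracking_rows_per_match(frames_per_second, minutes=90, players=22, include_ball=True):
    """Estimate rows in a long-format tracking dataset (one row per object per frame)."""
    frames = frames_per_second * minutes * 60
    tracked_objects = players + (1 if include_ball else 0)
    return frames * tracked_objects


print(tracking_rows_per_match(25))  # 3105000
print(tracking_rows_per_match(10))  # 1242000
```

At 25 frames per second this gives roughly 3.1 million rows, in line with the figure above, while a 10 Hz feed comes out closer to 1.2 million. Providers that store one row per frame with a column per player will have far fewer rows but the same amount of data.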
Name | Comments | Source / method(s) to get the data |
---|---|---|
Last Row Tracking-like data by Ricardo Tavares | Tracking-like data collected by Ricardo Tavares. See the Liverpool Analytics Challenge for which this data was used (winners discussed on Friends of Tracking [link]). | GitHub repo |
Metrica Sports Sample Tracking and corresponding Event data | Three sample matches of synced event and tracking data. For code to work with this data including Pitch Control modelling, see the LaurieOnTracking GitHub repo by Laurie Shaw and the corresponding Friends of Tracking tutorials. | GitHub repo |
Signality Tracking data | Three matches of tracking data from the Allsvenskan - Hammarby vs. IF Elfsborg (22/07/2019), Hammarby 5 vs. 1 Örebro (30/09/2019), and Hammarby vs. Malmö FF (20/10/2019). | This data was made available as part of the 2020 Mathematical Modelling of Football course. The password to download the data is not publicly available, but can be found in the Uppsala Mathematical Modelling of Football Slack group [link]. For access, contact Novosom Salvador via Twitter or email, or feel free to contact me. Note that the second half of the Hammarby-Örebro match is incomplete. |
Broadcast Tracking is collected from broadcast footage using computer vision techniques. Unlike in-stadium tracking data, the dataset is incomplete, as players who are out of shot of the broadcast footage are missing. However, the great benefit is that the data is much cheaper to collect and covers far more leagues, which is extremely useful for tasks such as recruitment analysis.
The main providers of this data are SkillCorner and Sportlogiq.
Name | Comments | Source / method(s) to get the data |
---|---|---|
SkillCorner broadcast Tracking data | 9 matches of broadcast tracking data, including matches from 2019/2020 for the league champions and runners up in English Premier League, French L1, Spanish LaLiga, Italian Serie A and German Bundesliga. To find out more about broadcast tracking data and its use cases, see the following Medium article [link]. | GitHub repo |
Name | Comments | Source / method(s) to get the data |
---|---|---|
DAVIES modelling data | Estimated player evaluation data by Sam Goldberg and Mike Imburgio for American Soccer Analysis. To learn more about DAVIES, see the following blog post [link]. | Shiny App |
FBref season-on-season aggregated player performance data provided by Stats Perform | Aggregated player performance data for the following competitions: | Note: there was a change in the data provider used by FBref for their statistics in October 2022, from StatsBomb to Stats Perform. Therefore, the following scraping code is split into current working solutions and archived solutions: |
Stats Perform and Centre Circle Canadian Premier League data | Aggregated player performance data | Google Drive |
Name | Comments | Source / method(s) to get the data |
---|---|---|
Elo club rankings | Elo ratings for club football based on past results to allow for estimation of each club's strength, allowing predictions for the future. | Data available through: |
Euro Club Index | Ranking of the football teams in the highest division of all European countries, that shows their relative playing strengths at a given point in time, and the development of playing strengths in time. To see more about the methodology used to calculate these rankings, see the following page [link] | Link |
FiveThirtyEight Club Ranking | Global Club Soccer Rankings. How 637 international club teams compare by Soccer Power Index | Data available through: |
Opta Power Rankings | Opta Power Rankings | Data available through: |
UEFA Club Coefficients | UEFA club coefficient rankings based on the results of all European clubs in UEFA club competition. | Data available through: |
World Football / Soccer Clubs Ranking | Club ranking website | Link |
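Several of the rankings above are built on Elo-style ratings, where the rating gap between two clubs implies an expected result and ratings move after each match according to how the actual result compares. The sketch below shows the textbook Elo formula only; the K-factor of 20 is an assumption, and the sites listed may add their own adjustments (for example for home advantage or goal margin) that are not reproduced here.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score (between 0 and 1) for team A against team B under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, score_a, k=20.0):
    """Return both ratings after a match; score_a is 1 for a win, 0.5 draw, 0 loss.

    k=20 is an illustrative assumption, not the value used by any listed source.
    """
    expected_a = elo_expected_score(rating_a, rating_b)
    change = k * (score_a - expected_a)
    return rating_a + change, rating_b - change


print(elo_expected_score(1600, 1500))
print(elo_update(1500, 1500, 1.0))
```

Two evenly rated teams each have an expected score of 0.5, so a win moves the winner up by half the K-factor and the loser down by the same amount.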
Name | Comments | Source / method(s) to get the data |
---|---|---|
Bundesliga physical data | Bundesliga player stats, powered by AWS | Link (not scraped into a CSV) |
Name | Comments | Source / method(s) to get the data |
---|---|---|
2018 FIFA World Cup Rosters | Goals, caps, club, and date of birth for players on 2018 FIFA World Cup rosters. Source: data.world | Excel |
engsoccerdata | English and European soccer results 1871-2017 | GitHub repo |
FIFA World Cup Match Results | Matchups and results of FIFA World Cup matches from 1930 - 2014. Source: data.world | Excel |
FotMob | Dataset including team and player stats including xG and post-shot xG. | This data can be scraped using: |
Football Lineups | A database of team tactics and formations crowdsourced by the users. | Link |
international_results | Repository of 44,353 results of international football matches, from the very first official match in 1872 up to 2022. | GitHub repo |
smarterscout | Scouting and player rating information platform for evaluating the performance of football players around the world. The platform was developed by Dan Altman at North Yard Analytics to assess players' contributions to winning, their playing style, and their skill level. Note: this is a subscription service. | Link |
SofaScore | Live scores, lineups, standings, heatmaps, and basic teams, coaches and player data | Link |
Soccerway | Match sheet data | Link |
Name | Comments | Source / method(s) to get the data |
---|---|---|
Capology | Player salaries | See the Capology Player Salary Web Scraping notebook for Python code to scrape Capology data or access saved CSV files in the data subfolder |
KPMG Football Benchmark | player valuation data | |
The Price of Football Master Spreadsheet | data from the finance/business aspect of football by Kieran Maguire | Link |
spotrac | Player contracts, salaries, and transfer information for the Premier League, MLS, and NWSL | |
TransferMarkt | Player bio, contractual, and estimated value data | This data can be accessed through the following: |
Guardian Player Transfer data | Collated by Tom Worville (see Tweet [link]) | GitHub |
Name | Comments | Source / method(s) to get the data |
---|---|---|
BetExplorer | odds data | Link |
FiveThirtyEight Soccer Predictions database | football prediction data | Link |
Football-Data.co.uk | free bets and football betting, historical football results and a betting odds archive, live scores, odds comparison, betting advice and betting articles | Link |
International football results from 1872 to 2020 | an up-to-date dataset of over 40,000 international football results by Mart Jürisoo | Link |
See Mark Wilkin's Twitter thread for more about how to plot your own event data [link]:
Name | Comments | Source / method(s) to get the data |
---|---|---|
xT grid | League-wide Expected Threat (xT) values from the 2017-18 Premier League season (12x8 grid) determined by Karun Singh. For more information about xT, see Karun's blog post [link] | Link |
EPV grid | Grid of Expected Possession Values determined by Laurie Shaw. See the following lecture for more information [link] | Link |
Zones of a pitch | Breakdown of a pitch into zones, for use with visualisation. Created by Rob Carroll | Link |
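Grid-based values such as the 12x8 xT grid are typically used by mapping an event location to a grid cell and reading off the value stored there. The sketch below assumes a 105 x 68 metre pitch with the origin in one corner and the x-axis running along the length of the pitch; the function names are hypothetical, and the grid contains placeholder numbers for illustration, not real xT values.

```python
def grid_cell(x, y, pitch_length=105.0, pitch_width=68.0, n_cols=12, n_rows=8):
    """Map a pitch location to a (row, col) cell of an n_rows x n_cols grid."""
    if not (0 <= x <= pitch_length and 0 <= y <= pitch_width):
        raise ValueError("location is outside the pitch")
    # min() keeps locations exactly on the far touchline or goal line in the last cell.
    col = min(int(x / pitch_length * n_cols), n_cols - 1)
    row = min(int(y / pitch_width * n_rows), n_rows - 1)
    return row, col


def grid_value(grid, x, y):
    """Look up the value stored in the grid cell containing the location (x, y)."""
    row, col = grid_cell(x, y, n_cols=len(grid[0]), n_rows=len(grid))
    return grid[row][col]


# Placeholder 8 x 12 grid (NOT real xT values): the value simply grows with the column.
placeholder_grid = [[col / 100 for col in range(12)] for _ in range(8)]

print(grid_cell(52.5, 34.0))
print(grid_value(placeholder_grid, 100.0, 34.0))
```

Before using a published grid, check its orientation and the coordinate system of your event data, since providers differ in pitch dimensions and in where the origin sits.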
Name | Comments | Source / method(s) to get the data |
---|---|---|
awesome-football by football.db (Gerald Bauer) | A collection of awesome football (national teams, clubs, match schedules, players, stadiums, etc.) datasets | GitHub repo |
Data Hub Football data | Link | |
European Soccer Database | 25k+ matches, players & teams attributes for European Professional Football | Link |
FIFA 15-22 player rating data | Scraped from SoFIFA by Stefano Leone | Link |
FIFA 18 Player Ratings | 17k+ players, 70+ attributes extracted from FIFA 18, provided by sofifa | Link |
FootballData | "A hodgepodge of JSON and CSV Football data" | GitHub |
footballcsv | Historical soccer results in CSV format | Link |
football.db | A free and open public domain football database & schema for use in any (programming) language (e.g. uses plain datasets) | Link |
Football xG | Link | |
Guide to Football/Soccer data and APIs by Joe Kampschmid | Link | |
My Football Facts | Link | |
Physio Room | Link | |
PlusMinusData | play by play data from espn.com | Link |
Rec.Sport.Soccer Statistics Foundation | Historical league tables and football results | Link |
RoboCup Soccer Simulator | RoboCup Soccer Simulator Data | Link |
Squawka | Link | |
Stat Bunker | Link | |
Tableau data resources | including sports data | Link |
Transfer League | Link | |
Twelve Football | Link | |
wosostats | Women's soccer data from around the world | Link |
All documentation is saved locally in the documentation subfolder, including:
* soccer_analytics by Kraus Clemens - a Python project that provides a starting point for analytics
* Football-Analytics-With-Python by Anmol Durgapal

Check out the Tableau for Sports Discord server, organised by Ninad Barbadikar, to interact with a community of Tableau developers.
For a YouTube playlist of Tableau-football videos and tutorials that I have collated from various sources including the Tableau Football User Group, Rob Carroll, Tom Goodall, and Ninad Barbadikar, see the following [link].