In this repo, I collect and explore data related to tracks featured in the Spotify Charts. It builds on an earlier repo of mine, where I had already covered quite a lot of ground. Unfortunately, the data processing in that earlier project was really messy, so I decided to start from scratch to create a cleaner dataset.
I have come up with CLI scripts for fetching data related to the Spotify Charts. They can be found in the cli_scripts subfolder of this project. As the folder name suggests, the scripts can all be invoked directly from the command line. You can call each script with the -h option to get information about the accepted arguments (example: python cli_scripts/spotify_api/get_all.py -h).
First, I came up with scripts for assembling data for tracks of the Spotify Daily Top 200. Unfortunately, this data is not available via the API. Furthermore, one has to download chart CSV files for each region and date separately (by navigating to the Spotify Charts page and clicking the download button), which is very inconvenient.
However, I worked around this by creating two scripts (see the spotify_charts subfolder):
download.py: automates the process of downloading charts CSV files for several regions (either all or a subset specified via arguments) and a given date range (start + end date) using selenium (requires Spotify account/credentials!)
combine_charts.py: combines downloaded Spotify chart CSV files located in the specified directory into a single .parquet file

A lot of interesting information and metadata about music on Spotify can also be retrieved from Spotify's official API. All scripts using the Spotify API (via the spotipy Python API wrapper) can be found in spotify_api:
get_track_metadata.py: fetches track metadata from the /tracks API endpoint for unique track IDs mentioned in a provided .parquet file. Outputs a folder of several metadata .parquet files
get_album_metadata.py: does the same thing as above, only for albums instead of tracks (using the /albums API endpoint)
get_artist_metadata.py: fetches artist metadata for all unique artist IDs among several input files (each having an artists_id column), also storing metadata in a folder like the other scripts above
get_all.py: combines all scripts, getting track metadata first, then album metadata for all albums associated with tracks and finally artist metadata for all track and album artists.

Unfortunately, the information for track credits (specifically, songwriters and producers) is also not available via the public Spotify API. However, I came up with a way to work around that. One can extract the request headers that are used for specific requests made by the Spotify Web App (e.g. when opening the Show Credits popup on a track page) and reuse them to make other requests to the same (unofficial/internal) API endpoint.
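
To give a rough idea of what reusing captured headers could look like, here is a minimal sketch using only the standard library. It is not the code used in this repo: the function names are made up for illustration, the URL is a placeholder (the actual internal endpoint is not documented here), and the header values stand in for whatever you copy from the browser's network tab.

```python
import urllib.request


def parse_raw_headers(raw):
    """Turn 'name: value' lines (as copied from the browser's network tab) into a dict."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            # Skips blank lines, malformed lines and HTTP/2 pseudo-headers like ":authority".
            continue
        headers[name.strip()] = value.strip()
    return headers


def build_request(url, headers):
    """Prepare (but do not send) a GET request that reuses the captured headers."""
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values only: neither the endpoint nor the token comes from this README.
RAW_HEADERS = """\
authorization: Bearer <token copied from the browser>
accept: application/json
"""
CREDITS_URL_TEMPLATE = "https://example.invalid/credits/{track_id}"

request = build_request(
    CREDITS_URL_TEMPLATE.format(track_id="TRACK_ID"),
    parse_raw_headers(RAW_HEADERS),
)
# urllib.request.urlopen(request) would then actually perform the call.
```

Since the headers come from a logged-in browser session, they will most likely stop working after a while and have to be captured again.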
This approach can be used to
# download CSVs; might take a loooong time, can be interrupted and restarted/resumed later
python cli_scripts/spotify_charts/download.py -s 2022-01-01 -e 2022-12-31 -o data/scraper_downloads
# combine downloaded CSVs into single parquet file
python cli_scripts/spotify_charts/combine.py -o data/top200_2022
# fetch track, album and artist metadata for all tracks in the combined charts file
python cli_scripts/spotify_api/get_all.py -i data/top200_2022/charts.parquet
TODO: add proper command once scripts are 'finished' (good enough)
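
For a rough picture of what the combine and metadata steps above involve, here is a simplified sketch. It is not the actual implementation of the scripts: all function names are hypothetical, the filename pattern (regional-REGION-daily-DATE.csv) and the uri column are assumptions about the downloaded CSV files, and client is assumed to be any object that behaves like spotipy.Spotify, i.e. has a tracks(ids) method returning a dict with a "tracks" list. The batch size of 50 reflects the limit of the Web API's tracks endpoint.

```python
import csv
import re
from pathlib import Path

# Assumed filename pattern, e.g. "regional-de-daily-2022-01-01.csv".
FILENAME_PATTERN = re.compile(
    r"regional-(?P<region>[a-z]+)-daily-(?P<date>\d{4}-\d{2}-\d{2})\.csv$"
)


def collect_chart_rows(directory):
    """Read every matching CSV in `directory` and tag each row with region and date."""
    rows = []
    for path in sorted(Path(directory).glob("*.csv")):
        match = FILENAME_PATTERN.search(path.name)
        if match is None:
            continue  # skip files that do not look like chart downloads
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["region"] = match["region"]
                row["date"] = match["date"]
                rows.append(row)
    return rows


def unique_track_ids(rows, uri_column="uri"):
    """Return track IDs in first-seen order, assuming URIs like 'spotify:track:<id>'."""
    seen = {}
    for row in rows:
        seen.setdefault(row[uri_column].rsplit(":", 1)[-1], None)
    return list(seen)


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_track_metadata(client, track_ids, batch_size=50):
    """Request metadata batch by batch; `client` is assumed to act like spotipy.Spotify."""
    metadata = []
    for batch in batched(track_ids, batch_size):
        metadata.extend(client.tracks(batch)["tracks"])
    return metadata
```

From there, something like pandas.DataFrame(rows).to_parquet(...) could write the combined table to a single .parquet file (this needs pandas plus a parquet engine such as pyarrow).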
To make everything work, you can follow these instructions (assuming you have a recent version of Python installed).
If you want, you can create a new environment, e.g. with conda:
conda env create --name=spotify-charts-analysis
conda activate spotify-charts-analysis
To make all the scripts work out-of-the-box, you can simply install the helpers package by running:
pip install -e .
Alternatively, you can of course also just install packages one by one as you run into issues trying to execute things lol
Track lyrics on Spotify can be incorrect, even for fairly popular songs (for example this instrumental track that for some reason has lyrics).