The 2021 Rolling Stone Top 500 Songs of All Time List Has Become a Personality Trait

In a previous post, I ranted about ranking systems and web-scraped data on a fascinating list: Rolling Stone Magazine's Top 500 Greatest Songs of All Time. In this post, I pull the data again using the Spotify API and perform more in-depth analytics on it, including supervised learning, dimensionality reduction, and unsupervised learning. I also use Streamlit to make a web application that allows you to explore the data and see the results in a more interactive way.

TL;DR because this article is long

For those who don’t want to read the article, I built an accompanying Streamlit application for further exploration of the Rolling Stone top 500 data. Try it out here. You can also download the dataset here for your own use.

Background

In September 2021, Rolling Stone attempted to rank what maybe shouldn't be ranked. Across all genres over the last 100 years, they ordered from 1 to 500 what they considered to be the best songs of all time. It was the first time they had made a list like this in 17 years. In constructing the list, they surveyed a bunch of artists and music critics about their 50 favorite songs and aggregated the results to create their ranking. Possibly because of who they surveyed, and possibly because of how they scored the survey results, the resulting list makes the listener stop and think, "Really? 'Gasolina' by Daddy Yankee is in the top 50?" Yet for all of its spicy takes, it makes for an excellent ~34 hour playlist, and I've been obsessed with it for almost a year.

You can read more about how they came up with this list on their website, but I cite it below for reference:

In 2004, Rolling Stone published its list of the 500 Greatest Songs of All Time. It’s one of the most widely read stories in our history, viewed hundreds of millions of times on this site. But a lot has changed since 2004; back then the iPod was relatively new, and Billie Eilish was three years old. So we’ve decided to give the list a total reboot. To create the new version of the RS 500 we convened a poll of more than 250 artists, musicians, and producers — from Angelique Kidjo to Zedd, Sam Smith to Megan Thee Stallion, M. Ward to Bill Ward — as well as figures from the music industry and leading critics and journalists. They each sent in a ranked list of their top 50 songs, and we tabulated the results.

Nearly 4,000 songs received votes. Where the 2004 version of the list was dominated by early rock and soul, the new edition contains more hip-hop, modern country, indie rock, Latin pop, reggae, and R&B. More than half the songs here — 254 in all — weren’t present on the old list, including a third of the Top 100. The result is a more expansive, inclusive vision of pop, music that keeps rewriting its history with every beat.

Why I Scraped Data from the Rolling Stone Top 500... Twice

In my last post I described how and why I scraped data from the Rolling Stone top 500. In my own words, “I’ve been so into my playlist that I wanted to be more exact when I told people where certain songs ranked in the Rolling Stone top 500 list”. My methodology was to use the BeautifulSoup library to scrape the songs off of the website and then merge in other information like genre and track time using the now deprecated iTunes API. The result had a few misses in some information but overall was pretty thorough.

So I pulled this dataset and sent it to my little brother to run some kind of analysis on. After waiting a few months, I saw that someone else had made the same playlist on Spotify that I had made in YouTube Music. For me this was a game changer, because the Spotify API is free to use and returns high quality results. Besides filling in some records that were missing from linking my web scraping to the iTunes API, it also provided a bunch of quantitative metrics for evaluating and comparing music, such as the danceability and liveness of a given song.

Below, I show just how easy this was to do. In a few lines I pull more data than I ever had before on all of the songs in the top 500 list as well as metadata about the corresponding artists and albums.

Pulling Data

Imports

import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import tqdm
import yaml

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Authentication

with open('secrets.yaml') as file:
    secrets = yaml.safe_load(file)
    client_id = secrets.get('client_id')
    client_secret = secrets.get('client_secret')

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
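
For reference, secrets.yaml just holds the two app credentials from the Spotify developer dashboard. A hypothetical example (substitute your own values):

# secrets.yaml -- hypothetical placeholder values
client_id: "your-spotify-client-id"
client_secret: "your-spotify-client-secret"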

Pull track URIs and general info

playlist_link = "https://open.spotify.com/playlist/4EdmaJCUXvXsXJKXZGvj2r"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]
# playlist_tracks returns at most 100 items per call, so page through the 500 songs in 5 chunks
playlist_data = []
for i in range(5):
    playlist_data = playlist_data + sp.playlist_tracks(playlist_URI, offset=i*100)['items']
track_uris = [x["track"]["uri"] for x in playlist_data]
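
Quick aside: the hard-coded range(5) above only works because I know the list has exactly 500 songs. A more general sketch (assuming nothing about playlist length) would follow the API's paging cursor using spotipy's next() helper:

# Page through the playlist until Spotify reports no further pages
results = sp.playlist_tracks(playlist_URI)
playlist_data = results['items']
while results['next']:
    results = sp.next(results)
    playlist_data.extend(results['items'])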

# Print information from Stronger by Kanye West
playlist_data[0]
{'added_at': '2021-09-15T18:48:41Z',
 'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/117710003'},
  'href': 'https://api.spotify.com/v1/users/117710003',
  'id': '117710003',
  'type': 'user',
  'uri': 'spotify:user:117710003'},
 'is_local': False,
 'primary_color': None,
 'track': {'album': {'album_type': 'album',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/5K4W6rqBFWDnAN6FQUkS6x'},
     'href': 'https://api.spotify.com/v1/artists/5K4W6rqBFWDnAN6FQUkS6x',
     'id': '5K4W6rqBFWDnAN6FQUkS6x',
     'name': 'Kanye West',
     'type': 'artist',
     'uri': 'spotify:artist:5K4W6rqBFWDnAN6FQUkS6x'}],
   'available_markets': [],
   'external_urls': {'spotify': 'https://open.spotify.com/album/5fPglEDz9YEwRgbLRvhCZy'},
   'href': 'https://api.spotify.com/v1/albums/5fPglEDz9YEwRgbLRvhCZy',
   'id': '5fPglEDz9YEwRgbLRvhCZy',
   'images': [{'height': 640,
     'url': 'https://i.scdn.co/image/ab67616d0000b2739bbd79106e510d13a9a5ec33',
     'width': 640},
    {'height': 300,
     'url': 'https://i.scdn.co/image/ab67616d00001e029bbd79106e510d13a9a5ec33',
     'width': 300},
    {'height': 64,
     'url': 'https://i.scdn.co/image/ab67616d000048519bbd79106e510d13a9a5ec33',
     'width': 64}],
   'name': 'Graduation',
   'release_date': '2007-09-11',
   'release_date_precision': 'day',
   'total_tracks': 13,
   'type': 'album',
   'uri': 'spotify:album:5fPglEDz9YEwRgbLRvhCZy'},
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/5K4W6rqBFWDnAN6FQUkS6x'},
    'href': 'https://api.spotify.com/v1/artists/5K4W6rqBFWDnAN6FQUkS6x',
    'id': '5K4W6rqBFWDnAN6FQUkS6x',
    'name': 'Kanye West',
    'type': 'artist',
    'uri': 'spotify:artist:5K4W6rqBFWDnAN6FQUkS6x'}],
  'available_markets': [],
  'disc_number': 1,
  'duration_ms': 311866,
  'episode': False,
  'explicit': True,
  'external_ids': {'isrc': 'USUM70741299'},
  'external_urls': {'spotify': 'https://open.spotify.com/track/4fzsfWzRhPawzqhX8Qt9F3'},
  'href': 'https://api.spotify.com/v1/tracks/4fzsfWzRhPawzqhX8Qt9F3',
  'id': '4fzsfWzRhPawzqhX8Qt9F3',
  'is_local': False,
  'name': 'Stronger',
  'popularity': 31,
  'preview_url': None,
  'track': True,
  'track_number': 3,
  'type': 'track',
  'uri': 'spotify:track:4fzsfWzRhPawzqhX8Qt9F3'},
 'video_thumbnail': {'url': None}}

Put Into a Neat Dataframe

playlist_dict = {
    
    'rank': [],
    'artist_name':[],
    'track_name':[],
    
    # Track data
    'track_popularity':[],
    'track_duration_ms':[],
    'track_is_explicit':[],
    'track_number':[],
    'track_danceability': [],
    'track_energy': [],
    'track_key': [],
    'track_loudness': [],
    'track_mode': [],
    'track_speechiness': [],
    'track_acousticness': [],
    'track_instrumentalness': [],
    'track_liveness': [],
    'track_valence': [],
    'track_tempo': [],
    'track_time_signature': [],
    'track_analysis_url': [],

    # Album data
    'album_name':[],
    'album_release_date':[],
    'album_release_year':[],
    'album_image':[],
    
    # Artist data
    'artist_genre':[],
    'artist_popularity':[],
    
}
for count, i in tqdm.tqdm(enumerate(playlist_data)):
    
    # Query for More Song Data
    track_info = sp.audio_features(i['track']['id'])[0]

    # The playlist runs from #500 down to #1, so rank counts down as we enumerate
    playlist_dict['rank'].append(500 - count)
    playlist_dict['artist_name'].append(i['track']['artists'][0]['name'])
    playlist_dict['track_name'].append(i['track']['name'])
    

    # Song Stuff
    playlist_dict['track_popularity'].append(i['track']['popularity'])
    playlist_dict['track_duration_ms'].append(i['track']['duration_ms'])
    playlist_dict['track_is_explicit'].append(i['track']['explicit'])
    playlist_dict['track_number'].append(i['track']['track_number'])
    playlist_dict['track_danceability'].append(track_info['danceability'])
    playlist_dict['track_energy'].append(track_info['energy'])
    playlist_dict['track_key'].append(track_info['key'])
    playlist_dict['track_loudness'].append(track_info['loudness'])
    playlist_dict['track_mode'].append(track_info['mode'])
    playlist_dict['track_speechiness'].append(track_info['speechiness'])
    playlist_dict['track_acousticness'].append(track_info['acousticness'])
    playlist_dict['track_instrumentalness'].append(track_info['instrumentalness'])
    playlist_dict['track_liveness'].append(track_info['liveness'])
    playlist_dict['track_valence'].append(track_info['valence'])
    playlist_dict['track_tempo'].append(track_info['tempo'])
    playlist_dict['track_time_signature'].append(track_info['time_signature'])
    playlist_dict['track_analysis_url'].append(track_info['analysis_url'])
    
    
    # Album Stuff
    playlist_dict['album_name'].append(i['track']['album']['name'])
    playlist_dict['album_release_date'].append(i['track']['album']['release_date'])
    playlist_dict['album_release_year'].append(np.nan)  # placeholder; derived from album_release_date after the loop
    playlist_dict['album_image'].append(i['track']['album']['images'][0]['url'])
    
    # Artist Stuff
    # Query more artist-level data (popularity and genres)
    artist_uri = i["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)

    playlist_dict['artist_genre'].append(', '.join(artist_info['genres']))
    playlist_dict['artist_popularity'].append(artist_info['popularity'])


500it [02:10,  3.84it/s]

df = (
    pd.DataFrame(data = playlist_dict)
    .assign(album_release_date = lambda x: pd.to_datetime(x.album_release_date))
    .assign(album_release_year = lambda x: x.album_release_date.dt.year)
)
# Check for null values
df.isna().any().sum()
0
df.head()
df.tail()
df.to_csv('data/spotify_music_output.csv',index=False)

rank artist_name track_name track_popularity track_duration_ms track_is_explicit track_number track_danceability track_energy track_key ... track_valence track_tempo track_time_signature track_analysis_url album_name album_release_date album_release_year album_image artist_genre artist_popularity
0 500 Kanye West Stronger 31 311866 True 3 0.617 0.717 10 ... 0.490 103.992 4 https://api.spotify.com/v1/audio-analysis/4fzs... Graduation 2007-09-11 2007 https://i.scdn.co/image/ab67616d0000b2739bbd79... chicago rap, rap 90
1 499 The Supremes Baby Love 67 158040 False 3 0.595 0.643 5 ... 0.730 135.633 4 https://api.spotify.com/v1/audio-analysis/5uES... Where Did Our Love Go 1964-08-31 1964 https://i.scdn.co/image/ab67616d0000b273d5ea12... brill building pop, classic girl group, disco,... 61
2 498 Townes Van Zandt Poncho & Lefty 55 220573 False 8 0.636 0.277 1 ... 0.685 133.363 4 https://api.spotify.com/v1/audio-analysis/6QXt... The Late Great Townes Van Zandt 1972-01-01 1972 https://i.scdn.co/image/ab67616d0000b273cbf571... alternative country, cosmic american, country ... 53
3 497 Lizzo Truth Hurts 78 173325 True 13 0.715 0.624 4 ... 0.412 158.087 4 https://api.spotify.com/v1/audio-analysis/3HWz... Cuz I Love You (Super Deluxe) 2019-04-19 2019 https://i.scdn.co/image/ab67616d0000b2737bebcd... dance pop, escape room, minnesota hip hop, pop... 80
4 496 Harry Nilsson Without You 68 201000 False 6 0.381 0.186 4 ... 0.142 65.058 4 https://api.spotify.com/v1/audio-analysis/6MrI... Nilsson Schmilsson 1971-01-01 1971 https://i.scdn.co/image/ab67616d0000b2734df5b5... art rock, brill building pop, classic rock, cl... 59

5 rows × 26 columns

rank artist_name track_name track_popularity track_duration_ms track_is_explicit track_number track_danceability track_energy track_key ... track_valence track_tempo track_time_signature track_analysis_url album_name album_release_date album_release_year album_image artist_genre artist_popularity
495 5 Nirvana Smells Like Teen Spirit 14 301920 False 1 0.502 0.912 1 ... 0.720 116.761 4 https://api.spotify.com/v1/audio-analysis/1f3y... Nevermind (Deluxe Edition) 1991-09-26 1991 https://i.scdn.co/image/ab67616d0000b27328a90d... alternative rock, grunge, permanent wave, rock 80
496 4 Bob Dylan Like a Rolling Stone 71 369600 False 1 0.482 0.721 0 ... 0.557 95.263 4 https://api.spotify.com/v1/audio-analysis/3AhX... Highway 61 Revisited 1965-08-30 1965 https://i.scdn.co/image/ab67616d0000b27341720e... classic rock, country rock, folk, folk rock, r... 70
497 3 Sam Cooke A Change Is Gonna Come 66 191160 False 7 0.212 0.383 10 ... 0.452 173.790 3 https://api.spotify.com/v1/audio-analysis/0KOE... Ain't That Good News 1964-03-01 1964 https://i.scdn.co/image/ab67616d0000b2737329db... adult standards, brill building pop, classic s... 64
498 2 Public Enemy Fight The Power 57 282640 True 20 0.797 0.582 2 ... 0.415 105.974 4 https://api.spotify.com/v1/audio-analysis/1yo1... Fear Of A Black Planet 1990-04-10 1990 https://i.scdn.co/image/ab67616d0000b2732e3d1d... conscious hip hop, east coast hip hop, gangste... 56
499 1 Aretha Franklin Respect 73 147600 False 1 0.805 0.558 0 ... 0.965 114.950 4 https://api.spotify.com/v1/audio-analysis/7s25... I Never Loved a Man the Way I Love You 1967-03-10 1967 https://i.scdn.co/image/ab67616d0000b2736aa931... classic soul, jazz blues, memphis soul, soul, ... 68

5 rows × 26 columns

The Measures We Have

In less than two minutes, we have complete data on all 500 songs in the list! I'm almost upset with how easy this was compared with scraping the data from Rolling Stone Magazine. Great job, Spotify. Still, you might wonder why I bothered to pull this data again given I already had it, even if the new version is cleaner and more reliable. As I said above, the other reason I went through the trouble is the beautiful quantitative measures we get for each song on the list. I'll show why these are so useful for my purposes in a little bit. As an example, below are some of the new measures we have for the song "Stronger" by Kanye West.

sp.audio_features(track_uris[0])[0]
{'danceability': 0.617,
 'energy': 0.717,
 'key': 10,
 'loudness': -7.858,
 'mode': 0,
 'speechiness': 0.153,
 'acousticness': 0.00564,
 'instrumentalness': 0,
 'liveness': 0.408,
 'valence': 0.49,
 'tempo': 103.992,
 'type': 'audio_features',
 'id': '4fzsfWzRhPawzqhX8Qt9F3',
 'uri': 'spotify:track:4fzsfWzRhPawzqhX8Qt9F3',
 'track_href': 'https://api.spotify.com/v1/tracks/4fzsfWzRhPawzqhX8Qt9F3',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/4fzsfWzRhPawzqhX8Qt9F3',
 'duration_ms': 311867,
 'time_signature': 4}

Some of these metrics deserve a bit of explanation. I pulled these definitions straight from the Spotify documentation and include them here for completeness.

  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

  • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

  • key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. >= -1, <= 11

  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.

  • mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

  • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

  • time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of “3/4”, to “7/4”. >= 3, <= 7

  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
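
As a quick sanity check on the pull (my own aside, not from the docs), the confidence-style features are all documented as 0.0 to 1.0, so their summary stats should respect those bounds:

# Values outside [0, 1] here would signal a problem with the pull
bounded_cols = ['track_danceability', 'track_energy', 'track_speechiness',
                'track_acousticness', 'track_instrumentalness', 'track_liveness', 'track_valence']
df[bounded_cols].describe().T[['mean', 'min', 'max']]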

Some Quick EDA

Alright, now that I have my dataset, I'd like to answer some important questions. I'm currently writing this piece on July 9, 2022, and I'm wondering which tracks and artists are most and least popular right now.

(
    df[['track_name', 'artist_name', 'track_popularity', 'artist_popularity', 'rank']]
    .sort_values(by='track_popularity', ascending=False)
)

track_name artist_name track_popularity artist_popularity rank
440 Running Up That Hill (A Deal With God) Kate Bush 97 82 60
181 Everybody Wants To Rule The World Tears For Fears 86 71 319
195 Every Breath You Take The Police 86 73 305
143 Blank Space Taylor Swift 85 92 357
122 Mr. Brightside The Killers 85 76 378
... ... ... ... ... ...
486 Waterloo Sunset The Kinks 3 65 14
259 The Humpty Dance Digital Underground 3 54 241
50 Powderfinger - 2016 Remaster Neil Young 2 60 450
105 Planet Rock Afrika Bambaataa 1 42 395
380 Oh Bondage! Up Yours! X-Ray Spex 0 36 120

500 rows × 5 columns

Man, Kate Bush's "Running Up That Hill" as number 1. Given the recent release of season 4 of Stranger Things, that tracks.

As a Neil Young fan, I'm kind of sad to see "Powderfinger" so low, although it's not really one of my favorite songs by him anyway. "The Humpty Dance" is amusing. If you've never listened to it, I recommend checking it out.

Following on rank against popularity, I’m curious if more currently popular songs tend to be ranked higher on the list. Below I make a simple scatter plot showing the rank against the popularity of the top 500 songs.

import plotly.express as px

def plot_popularity_against_track_rank(df, outpath=None):
    fig = px.scatter(
        data_frame=df, 
        x='rank', 
        y='track_popularity', 
        hover_name='track_name', 
        title='Comparing Rank and Track Popularity',
        hover_data=['artist_name'],
        trendline="lowess"
    )
    fig = fig.update_coloraxes(showscale=False)
    fig = fig.update_xaxes(autorange="reversed")
    fig = fig.update_layout(
        plot_bgcolor="white", 
        margin=dict(t=50, l=10, b=10, r=10),
        xaxis_title='Song Ranking',
        yaxis_title="Current Song Popularity",
        width = 800, height = 400
    )
    if outpath:
        fig.write_image(outpath)
    return fig

plot_popularity_against_track_rank(df, outpath="images/track_pop.png")

[Image: scatter plot of song ranking vs. current track popularity]

Dammnnn, Kate Bush so far up there. And we see that there is no real link between current popularity and ranking. I mean, fair. Now I'd like to do something similar with artists in general. Below I explore rank against popularity at the artist level by grabbing the best-ranked song for each artist and comparing it against the artist-level popularity.

artist_df = (
    df
    .groupby('artist_name')
    .agg({'rank':'min', 'artist_popularity':'max'})
    .reset_index()
    .sort_values(by='artist_popularity', ascending=False)
)
artist_df

artist_name rank artist_popularity
21 Bad Bunny 329 100
86 Drake 129 95
127 Harry Styles 428 93
279 Taylor Swift 69 92
19 BTS 346 92
... ... ... ...
354 Woody Guthrie 229 41
320 The Slits 381 39
258 Screamin' Jay Hawkins 299 38
356 X-Ray Spex 120 36
112 Funky 4 + 1 288 23

359 rows × 3 columns

Wow, this definitely tracks. Props to Harry Styles and X-Ray Spex for lining their artist popularity up with their song popularity. Now I'll make the same plot I built above at the artist level. It turns out Kanye West and Kendrick Lamar are the two most popular artists with songs in the top 50, with "Runaway" and "Alright" respectively.

def plot_popularity_against_artist_rank(df, outpath=None):

    fig = px.scatter(
        data_frame=df,
        x='rank', 
        y='artist_popularity', 
        hover_name='artist_name', 
        title='Best Track Rank and Artist Popularity',
        trendline= 'lowess'
    )
    fig = fig.update_coloraxes(showscale=False)
    fig = fig.update_xaxes(autorange="reversed")
    fig = fig.update_layout(
        plot_bgcolor="white", 
        margin=dict(t=50, l=10, b=10, r=10),
        xaxis_title='Best Song Ranking',
        yaxis_title="Current Artist Popularity",
        width = 800, height = 400
    )
    if outpath:
        fig.write_image(outpath)
    return fig

plot_popularity_against_artist_rank(artist_df, outpath="images/artist_pop.png")

[Image: scatter plot of best song ranking vs. current artist popularity]

Advanced Analytics Part 1 - Forecasting the Album Release Date

With all of these new quantitative measures, I started to wonder if I could actually use machine learning in some way on this list. For example, maybe in different years there was a trend for the top songs to have varying amounts of acousticness or loudness. Or, if I wanted to, I could leverage the artist genres to see if there was a correlation between the top songs and their genres. Below, I implement a simple pipeline using scikit-learn with a random forest. First, I split my data and define my datatypes. Then I provide two helpers for handling boolean variables and for running singular value decomposition on the combined results. Finally, I define my pipeline, fit it to the data, and predict release years.

## Packages
from sklearn import model_selection
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer
SEED = 42

column_type_mapping = {
    "track_danceability": "numeric",
    "track_energy": "numeric",
    "track_loudness": "numeric",
    "track_speechiness": "numeric",
    "track_acousticness": "numeric",
    "track_instrumentalness": "numeric",
    "track_liveness": "numeric",
    "track_valence": "numeric",
    "track_tempo": "numeric",
    "track_duration_ms": "numeric",
    "track_popularity": "numeric",
    "track_is_explicit": "categorical",
    "artist_popularity": "numeric",
    "track_time_signature": "categorical",
    "track_key": "categorical",
    "album_release_year": "numeric",
    "artist_genre": "text",
}


target = 'album_release_year'
# Exclude the target from the feature list so the release year can't leak into X
features = [col for col in column_type_mapping.keys() if col != target]

numeric_cols = [i for i in features if column_type_mapping[i] == "numeric"]
categorical_cols = [i for i in features if column_type_mapping[i] == "categorical"]
text_cols = [i for i in features if column_type_mapping[i] == "text"]

X, y = df[features], df[target]
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, y, test_size=.2, random_state=42)

# Modified TruncatedSVD that doesn't fail if n_components >= the number of columns
class MyTruncatedSVD(TruncatedSVD):
    def fit_transform(self, X, y=None):
        if X.shape[1] <= self.n_components:
            self.n_components = X.shape[1] - 1
        return TruncatedSVD.fit_transform(self, X=X, y=y)

def to_string(x):
    """Handle values as string.  They may be treated as booleans or numerics otherwise"""
    return x.astype(str)

# Define the hyperparameter grid
param_grid = {
    'svd__n_components' : [5, 10, 30],
    'rf__max_depth': [5, 10, 20],
    'rf__n_estimators': [30, 100, 200]
    }

# Build Pipeline
numeric_pipeline = Pipeline(
        steps=[
            ("scaler", MinMaxScaler()),
        ]
    )

# Handle Categorical Variables
categorical_pipeline = Pipeline(
    steps=[
        ("convert_to_string", FunctionTransformer(to_string)),
        ("onehot", OneHotEncoder(categories="auto", handle_unknown="ignore")),
    ]
)


# Handle Text features (just artist genre)
text_pipeline = Pipeline(
    steps=[
        ("ngrams", CountVectorizer(ngram_range=(1, 2))),
    ]
)

# Handle each variable type as overall preprocessing
preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
        ("text", text_pipeline, text_cols[0])
        ]
)

# Run SVD on the results
steps = [
    ("preprocessing", preprocessing_pipeline),
    ("svd", MyTruncatedSVD(random_state=SEED),),
    ("rf", RandomForestRegressor())
    ]

pipe = Pipeline(
    steps=steps,
    verbose=False,
)

# Run grid search to tune hyperparameters
estimator = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
estimator.fit(X_train, Y_train)

[Image: diagram of the fitted scikit-learn pipeline]

Alright, we've fit our model and found our hyperparameters. I also printed out the average cross validation score from the best hyperparameter settings. We have an in-sample MAE of about 10. In other words, when the model predicts a release year, it's off by 10 years on average. Considering the space of songs spans about 100 years in total, the performance is actually a little better than I expected. There are two easy reasons I can use to explain this:

  1. This estimate is overly optimistic, as I explain in my article about nested cross validation. When I evaluate the model on the test set, performance will probably be a little worse.
  2. This estimate and the holdout set have some leakage because I randomly split my dataset instead of partitioning it by artist name. The problem with random partitioning here is that artist-level features like popularity and genre are essentially duplicated across the training, validation, and holdout sets. For example, the list of artist genres affiliated with every Beatles song is [beatlesque, british invasion, classic rock, merseybeat, psychedelic rock, rock]. No other artist has 'beatlesque' in their genre, so when the model sees it in the holdout data, it knows the album's release year is much earlier than for other songs. This leakage persists in the holdout set. I should repartition my data more thoughtfully (a sketch of a grouped split follows below), but I'm just kind of having fun here. My article, my rules!

print(f"Best parameters found: {estimator.best_params_}")
print(f"In sample MAE: {round(abs(estimator.best_score_),2)}")
Best parameters found: {'rf__max_depth': 20, 'rf__n_estimators': 200, 'svd__n_components': 30}
In sample MAE: 10.05
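
On the second point, scikit-learn has group-aware splitters for exactly this situation. Here's a minimal sketch (an assumption of how I would do it, not what I actually ran) that keeps every song by a given artist on one side of the split:

from sklearn.model_selection import GroupShuffleSplit

# Hold out ~20% of artists rather than 20% of songs, so artist-level
# features like genre can't leak across the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
train_idx, test_idx = next(gss.split(X, y, groups=df['artist_name']))
X_train_g, X_test_g = X.iloc[train_idx], X.iloc[test_idx]
Y_train_g, Y_test_g = y.iloc[train_idx], y.iloc[test_idx]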

Surprisingly, the MAE actually dropped by nearly two years on the test set, and even the Root Mean Squared Error, a metric that penalizes far-off predictions, is under 12 years. As I listed before, there is a good amount of leakage in this model from the artist-level features. Still, I'm pretty impressed with how low the model's error is.

preds = estimator.predict(X_test)
# "Fraction of deviance explained" is an MAE analogue of R^2: one minus the
# ratio of absolute error to absolute deviation around the mean prediction
r2 = round((1 - np.sum(np.abs(Y_test - preds))/np.sum(np.abs(Y_test - preds.mean()))) * 100,2)
print(f'MAE: {mean_absolute_error(Y_test, preds)}')
print(f'RMSE: {round(np.sqrt(mean_squared_error(Y_test, preds)),2)}')
print(f"Fraction of deviance explained: {r2}%")

MAE: 8.301839416971916
RMSE: 11.88
Fraction of deviance explained: 41.82%
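
And on the first point, the imports above already include cross_val_score and KFold, which is all nested cross validation needs: wrap the grid search itself in an outer CV loop. A minimal sketch (not something I ran for this post):

# The outer CV scores the *tuned* pipeline, giving a less optimistic
# estimate of generalization error than the grid search's own CV score
outer_scores = cross_val_score(
    GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1),
    X_train, Y_train,
    cv=KFold(n_splits=5, shuffle=True, random_state=SEED),
    scoring='neg_mean_absolute_error',
)
print(f"Nested CV MAE: {round(abs(outer_scores.mean()), 2)}")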

Advanced Analytics Part 2 - Dimensionality Reduction and Clustering

Now I want to dispense with machine learning in the predictive sense. Instead, I'm going to try to find tracks that are similar to each other. For example, I just moved in with a couple in Lawrenceville. One of my new roommates told me, "I think all of those 80's rock bands sound the same. Like I never know which song is AC/DC or Kiss, or the Eagles".

Now first of all, the Eagles became famous in the mid 70's. Kiss, surprisingly, also became famous in the late 70's rather than the 80's. Timelines aside, I would also debate the claim that the Eagles sound at all similar to AC/DC or Kiss. Luckily, I now have data on at least one of each of these bands' most famous songs. So, I'm going to do three things here:

  1. Use UMAP (a popular dimensionality reduction algorithm) on some columns on the dataset to project it onto a two dimensional plane
  2. Run DBSCAN on the dataset to assign our songs to distinct groups or clusters.
  3. Plot the results and look at where we stand

Doing this might sound complicated, and from a mathematics perspective it is. Luckily, Python has many packages to help us, and running UMAP and DBSCAN on a dataset like this will be pretty easy.

# Select the artists (note: Don Henley was the Eagles' drummer and one of their lead singers)
search_artists = ['kiss', 'ac/dc', 'eagles', 'don henley']

# Compute the artist mask once and reuse it for the indexes and the display
artist_mask = df.artist_name.str.lower().isin(search_artists)
artist_indexes = df.loc[artist_mask].index

df.loc[artist_mask]

rank artist_name track_name track_popularity track_duration_ms track_is_explicit track_number track_danceability track_energy track_key ... track_valence track_tempo track_time_signature track_analysis_url album_name album_release_date album_release_year album_image artist_genre artist_popularity
96 404 KISS Rock And Roll All Nite 76 168840 False 10 0.654 0.929 1 ... 0.902 144.774 4 https://api.spotify.com/v1/audio-analysis/6KTv... Dressed To Kill 1975-03-19 1975 https://i.scdn.co/image/ab67616d0000b27365a7b0... album rock, glam rock, hard rock, metal, rock 72
189 311 Eagles Hotel California - 2013 Remaster 84 391376 False 1 0.579 0.508 2 ... 0.609 147.125 4 https://api.spotify.com/v1/audio-analysis/40ri... Hotel California (2013 Remaster) 1976-12-08 1976 https://i.scdn.co/image/ab67616d0000b273463734... album rock, classic rock, country rock, folk r... 74
213 287 AC/DC You Shook Me All Night Long 81 210173 False 7 0.532 0.767 7 ... 0.755 127.361 4 https://api.spotify.com/v1/audio-analysis/2SiX... Back In Black 1980-07-25 1980 https://i.scdn.co/image/ab67616d0000b2730b51f8... australian rock, hard rock, rock 79
291 209 Don Henley The Boys Of Summer 78 288733 False 1 0.516 0.549 6 ... 0.907 176.941 4 https://api.spotify.com/v1/audio-analysis/4gve... Building The Perfect Beast 1984-01-01 1984 https://i.scdn.co/image/ab67616d0000b273e59d7f... album rock, art rock, classic rock, country ro... 63

4 rows × 26 columns

Build Pipeline

The pipe I will use for this will be similar to the one I used for predicting the album release year. In fact, I can even reuse the preprocessing steps from the previous section and then just apply UMAP and DBSCAN on the results.

from sklearn.cluster import DBSCAN
import umap

preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
        ("text", text_pipeline, text_cols[0])
        ]
)

pipeline = Pipeline(
    steps = [
        ("preprocessing", preprocessing_pipeline),
        ("svd", MyTruncatedSVD(n_components=10, random_state=SEED)),
        ("umap", umap.UMAP(random_state=42)),
        ("dbscan", DBSCAN())
    ],
    verbose=False
)

[Image: diagram of the preprocessing + SVD + UMAP + DBSCAN pipeline]

Now I'll run everything except the final DBSCAN step to decompose the dataset onto a two-dimensional plane. This will let me plot the results in 2D. Then I let the full pipeline run so I can get cluster assignments.


umap_embedding = (
    pipeline[:-1]
    .set_params(umap__n_components = 2)
    .fit_transform(df)
)

fit_clustering_params = {
    "umap__n_components": 2, 
    "umap__min_dist":0.0,
    "umap__n_neighbors": 30,
    "dbscan__eps":.5,
    "dbscan__min_samples":5
}

clusters_2d = (
    pipeline
    .set_params(**fit_clustering_params)
    .fit_predict(df, y=None)
)


[Pipeline] ..... (step 1 of 3) Processing preprocessing, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing svd, total=   0.0s
[Pipeline] .............. (step 3 of 3) Processing umap, total=   2.3s
[Pipeline] ..... (step 1 of 4) Processing preprocessing, total=   0.0s
[Pipeline] ............... (step 2 of 4) Processing svd, total=   0.0s
[Pipeline] .............. (step 3 of 4) Processing umap, total=   2.9s
[Pipeline] ............ (step 4 of 4) Processing dbscan, total=   0.0s

Plotting the results

Below I plot my two-dimensional embedding and clustering results, marking the artists I was comparing with a star. In this case, "The Boys of Summer" by Don Henley was placed right near "Hotel California" by the Eagles, while "You Shook Me All Night Long" by AC/DC was placed near "Rock and Roll All Nite" by Kiss. While these were the results I expected to prove, I actually see a problem with them. Far to the left are all of the Beatles songs. Why are they all so close together? Do they really all sound the same? No, they certainly do not. The issue here is that I'm using the artist genre in my dimensionality reduction process. Because I'm running ngrams on that text field, it creates a ton of extra features, and each of these has the same weight as the numeric features. I'll need to run this again and remove the text features. I'll also throw out the categorical features for the same reason.

import plotly.graph_objects as go

def plot_embedding(df, umap_embedding, clusters, artist_indexes, outpath=None):

    text = [
            f"Artist Name: {artist_name} </br> Song Name: {track_name}"
            for artist_name, track_name in zip(
                df.artist_name, df.track_name
            )
        ]
    hovertemplate = "</br>%{text}<extra></extra>"

    # Add in the listed bands from above
    filter_df = df.iloc[artist_indexes]

    text_record = [
            f"Artist Name: {artist_name} </br> Song Name: {track_name}"
            for artist_name, track_name in zip(
                filter_df.artist_name, filter_df.track_name
            )
        ]
    fig = go.Figure()

    if umap_embedding.shape[1] == 2:
        fig = fig.add_trace(
            go.Scatter(
                x=umap_embedding[:, 0],
                y=umap_embedding[:, 1],
                hovertemplate=hovertemplate,
                text=text,
                showlegend=False,
                mode="markers",
                marker_color=clusters,
                marker_line_color="black",
                marker_line_width=1,
            )
        )



        fig = fig.add_trace(
            go.Scatter(
                x=umap_embedding[artist_indexes, 0],
                y=umap_embedding[artist_indexes, 1],
                text=text_record,
                showlegend=True,
                mode="markers",
                marker_symbol="star",
                marker_line_color="midnightblue",
                marker_color="lightskyblue",
                marker_line_width=1,
                marker=dict(size=15),
            )
        )
    else:
        fig = go.Figure(data=[go.Scatter3d(
        x=umap_embedding[:, 0],
        y=umap_embedding[:, 1],
        z=umap_embedding[:, 2],
        mode='markers',
        text=text,
        marker_line_color="black",
        marker_line_width=1,
        marker=dict(
            size=10,
            color=clusters,  
            opacity=0.75
        )
        )])


        fig = fig.add_trace(
            go.Scatter3d(
                    x=umap_embedding[artist_indexes, 0],
                    y=umap_embedding[artist_indexes, 1],
                    z=umap_embedding[artist_indexes, 2],
                    text=text_record,
                    showlegend=True,
                    mode="markers",
                    marker_symbol="diamond",
                    marker_line_color="midnightblue",
                    marker_color="lightskyblue",
                    marker_line_width=1,
                    marker=dict(size=12),
            )
        )

    fig = fig.update_layout(
        title_font_size=20,
        height=400,
        width=800,
        plot_bgcolor="#ffffff",
        hoverlabel=dict(
            bgcolor="white", font_size=16, font_family="Rockwell", namelength=-1
        ),
        margin=dict(l=20, r=20, t=30, b=20),
    )
    if outpath:
        fig.write_image(outpath)
    return fig

plot_embedding(df, umap_embedding, clusters_2d, artist_indexes, outpath="images/umap_2d_first_take.png")

[Image: 2D UMAP embedding colored by DBSCAN cluster, comparison artists starred]

Dimensionality Reduction and Clustering - Take 2

Below I run the same pipeline, this time only taking my numeric features. Since there will be far fewer features to run UMAP on, I also remove the Singular Value Decomposition step.

from sklearn.cluster import DBSCAN
import umap

numeric_preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ]
)

pipeline = Pipeline(
    steps = [
        ("preprocessing", numeric_preprocessing_pipeline),
        ("umap", umap.UMAP(random_state=42)),
        ("dbscan", DBSCAN())
    ],
    verbose=False
)

[Image: diagram of the numeric-only preprocessing + UMAP + DBSCAN pipeline]

Below I run the pipeline to get the UMAP embeddings and cluster assignments again. Because everyone loves 3D charts, I also extract the embeddings and cluster assignments in three dimensions. We’ll see how this looks in a sec.


umap_embedding = (
    pipeline[:-1]
    .set_params(umap__n_components = 2)
    .fit_transform(df)
)
umap_embedding_3d = (
    pipeline[:-1]
    .set_params(umap__n_components = 3)
    .fit_transform(df)
)

fit_clustering_params = {
    "umap__n_components": 2, 
    "umap__min_dist": 0.0,
    "umap__n_neighbors": 30,
    "dbscan__eps":.3,
    "dbscan__min_samples":4
}


clusters_2d = (
    pipeline
    .set_params(**fit_clustering_params)
    .fit_predict(df, y=None)
)

fit_clustering_params['umap__n_components'] = 3
clusters_3d = (
    pipeline
    .set_params(**fit_clustering_params)
    .fit_predict(df, y=None)
)

[Pipeline] ..... (step 1 of 2) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 2) Processing umap, total=   2.8s
[Pipeline] ..... (step 1 of 2) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 2) Processing umap, total=   2.5s
[Pipeline] ..... (step 1 of 3) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 3) Processing umap, total=   3.2s
[Pipeline] ............ (step 3 of 3) Processing dbscan, total=   0.0s
[Pipeline] ..... (step 1 of 3) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 3) Processing umap, total=   2.8s
[Pipeline] ............ (step 3 of 3) Processing dbscan, total=   0.0s
plot_embedding(df, umap_embedding, clusters_2d, artist_indexes, outpath="images/umap_2d_second_take.png")

[Image: 2D UMAP embedding of numeric features only, colored by DBSCAN cluster]

Interestingly, now all of the tracks are pretty spread out from each other. Does this make the representation meaningless? I think the answer is no. Looking around the far left of the chart, I found "Crying" by Roy Orbison next to "River" by Joni Mitchell and "Yesterday" by the Beatles. On the far right, I see a bunch of rap songs like "The Message" by Grandmaster Flash and "Nuthin' but a 'G' Thang" by Dr. Dre. These actually kind of track. What looks less good is the actual cluster assignments, which don't seem to contribute much knowledge at all. In a future implementation, I'd like to try this again using k-means or HDBSCAN. Then again, a friend of mine once told me that clustering is "like reading chicken scratch". It would also be nice to see which features have the most influence on the structure of this embedding. I know I could find this if I had used PCA instead of UMAP; a quick sketch of what that would look like follows, though I'll leave actually running it to another post.
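
The sketch below is my own aside (I didn't run it for this post): fit PCA on the same scaled numeric features the UMAP pipeline uses and read the feature influence off the component loadings.

from sklearn.decomposition import PCA

# Reuse the numeric-only preprocessing, then fit a 2-component PCA
scaled_features = numeric_preprocessing_pipeline.fit_transform(df)
pca = PCA(n_components=2, random_state=SEED)
pca.fit(scaled_features)

# The loadings say how strongly each feature drives each component
loadings = pd.DataFrame(pca.components_.T, index=numeric_cols, columns=['pc1', 'pc2'])
print(loadings.sort_values('pc1'))

In the meantime, let's look at the 3D embedding.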

plot_embedding(df, umap_embedding_3d, clusters_3d, artist_indexes, outpath="images/umap_3d_second_take.png")

[Image: 3D UMAP embedding of numeric features, colored by DBSCAN cluster]

Welp, it’s a mess but it’s my mess. There are a few conclusions I can draw from this exercise.

  • The features you include in dimensionality reduction matter, and without proper handling, categorical and especially text features will have outsized influence. One way we could potentially handle this is by running SVD earlier in the pipeline, exclusively on the categorical and text variables (see the sketch after this list).
  • Be careful with your clustering methodology. As we see from the chart above, the cluster assignments created from running DBSCAN with one set of parameters are not great.
  • Unsupervised learning can be a great way to find patterns in high-dimensional data, but it can also be like reading signs from chicken scratch. We can drastically change cluster assignments and embeddings by varying which features we include as well as the parameters for UMAP and DBSCAN.
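
Here's a minimal sketch of that first idea (my assumption, not something I ran): compress just the text features with SVD so they enter UMAP as a handful of columns instead of thousands of ngram counts (the categorical block could be folded in the same way).

# Compress the ngram features before they reach UMAP, so the text
# field can't swamp the handful of numeric columns
text_svd_pipeline = Pipeline(
    steps=[
        ("ngrams", CountVectorizer(ngram_range=(1, 2))),
        ("svd", TruncatedSVD(n_components=5, random_state=SEED)),
    ]
)

balanced_preprocessing = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("text", text_svd_pipeline, text_cols[0]),
    ]
)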

Of course, this wouldn't be complete if I did not allow you as a reader to play around with this data yourself. As with many other projects I've undertaken, I built an accompanying Streamlit application for further exploration of UMAP and DBSCAN. Try it out here. You can also download the dataset here for your own use.

If you are interested in learning more or talking to me about this subject (especially about ranking music), feel free to look me up on GitHub or add me on LinkedIn! The greatest reward I get from doing any of this is hearing from people who took something away from my post.

References

  • “2.3. Clustering.” Scikit-Learn, https://scikit-learn.org/stable/modules/clustering.html. Accessed 9 July 2022.
  • “6.1. Pipelines and Composite Estimators.” Scikit-Learn, https://scikit-learn.org/stable/modules/compose.html. Accessed 9 July 2022.
  • 500 Best Songs of All Time - Rolling Stone. https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/. Accessed 9 July 2022.

Credits

None of this has anything to do with DataRobot, the company I work at. Nonetheless, I thank all of my coworkers who came to that one deep dive session on Streamlit I did a couple weeks ago because it inspired me to build my app and make this post.
