Home
  • Portfolio
  • Tableau
  1. Reports
  2. Understanding Spotify User Behavior
  • Reports
    • BI with SQL & Markdown
    • BI as Code
    • Docs as Code
    • Slides as Code
    • Duckdb & Evidence
    • Duckdb & Google Sheets
    • Judiciary Elections in Mexico
    • Social Security in Mexico
    • National Guard in Mexico
    • The Economist & Data Scientist Role in Public Security
    • Plotting with ggplot2 in Python
    • Understanding Spotify User Behavior
    • Gasoline Prices in Mexico
    • Farmacias Similares Locations
    • Cluster analysis with Python and Polars
  • Slideshows
    • Gasoline Prices in Mexico
    • Water Collection Systems in Mexico
    • Goodbye, Lake Zumpango
    • Why BI tools Fall short
    • Window Functions in SQL
    • Dashboards vs Web reports
  • Dashboards
    • Gasoline Prices in Mexico
    • Farmacias Similares Dashboard
    • Mexico City Crime Dashboard
    • Streamlit Financial Dashboard
    • Percepción de Seguridad en México ENSU 2015-2025
    • Car Sales Dashboard

Table of Contents

  • Overview
  • Environment settings
    • Database connection
  • Dataset
  • Unique Artists, Albums and Songs
  • Minutes Played by Month and Year
  • Top 05 Artists by Year
  • Top 05 Albums by Year
  • Top 05 Songs by Year
  • Conclusions
  • References
  • Appendix
  • Contact

Other Links

  • Medium
  • Linkedin
  • Twitter
  1. Reports
  2. Understanding Spotify User Behavior

Understanding Spotify User Behavior

  • Show All Code
  • Hide All Code

Manipulating and Visualizing User Patterns with Polars

Author

Jesus LM

Published

Jan, 2026

Abstract

This project explores a dataset of Spotify music listening habits, leveraging the power of the Polars DataFrame library for efficient and scalable analysis. By employing Polars’ optimized data manipulation capabilities, we efficiently process and analyze Spotify data, extracting meaningful insights into user behavior and the dynamics of the music ecosystem. Furthermore, this work highlights the potential of Polars as a powerful tool for large-scale data analysis in the music industry and beyond.

Figure 1: Spotify, a leading digital music, podcast, and video streaming service, has revolutionized the way individuals consume audio content. By offering a vast library of on-demand content through a freemium model, Spotify has significantly shifted music consumption from ownership to access. Spotify’s expansion into podcasting and other audio content reflects its ambition to become a comprehensive audio platform.

Overview

This study conducts a historical analysis of Spotify data, leveraging the efficiency of Polars, the analytical power of DuckDB, and the visualization capabilities of Plotly. By combining these tools, we examine trends in music consumption, artist popularity, and playlist dynamics over time.

Polars facilitates rapid data manipulation and cleaning of large Spotify datasets, while DuckDB enables complex SQL queries for in-depth analysis.

Plotly is then employed to create interactive visualizations, revealing patterns and insights that would be difficult to discern from raw data alone.

This analysis explores the evolution of musical genres, the impact of algorithmic recommendations, and the shifting landscape of artist distribution.

We demonstrate the efficacy of this toolchain for extracting meaningful narratives from Spotify’s extensive historical data, providing a comprehensive understanding of the platform’s influence on the music industry.

Environment settings

Show code
# import libraries
import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import os
from dotenv import load_dotenv
load_dotenv('.env')
import plotly.express as px
from great_tables import GT, md

Database connection

Duckdb is a powerful tool for data analysts and developers who need to perform fast and efficient analytical queries on large datasets, especially in environments where simplicity and portability are crucial.
Show code
# database connection
token = os.getenv('MD_TOKEN')
conn = db.connect(f"md:?motherduck_token={token}")
# create dataframe
df = conn.sql('select * from projects.spotify_history').pl()
conn.close()

Polars is a modern DataFrame library designed for speed and efficiency, offering a compelling alternative to traditional data manipulation tools.

Dataset

Show code
df.schema
Schema([('spotify_track_uri', String),
        ('ts', Datetime(time_unit='us', time_zone=None)),
        ('platform', String),
        ('ms_played', Int64),
        ('track_name', String),
        ('artist_name', String),
        ('album_name', String),
        ('reason_start', String),
        ('reason_end', String),
        ('shuffle', Boolean),
        ('skipped', Boolean)])
Show code
df = (
    df.select(pl.exclude('spotify_track_uri'))
        .with_columns(year=pl.col('ts').dt.year())
)
Show code
(
    GT(df.head())
    .tab_header(title=md("### Spotify Songs History Dataset"))
    .fmt_date(columns='ts', date_style='iso')
    .fmt_number(columns='ms_played', decimals=0)
    .cols_hide(columns='year')
    .cols_label(
        ts='Date',
        platform='Platform',
        ms_played='Mins played',
        track_name='Song',
        artist_name='Artist',
        album_name='Album',
        reason_start='Reason start',
        reason_end='Reason end',
        shuffle='Shuffle',
        skipped='Skipped',
    )
    .tab_options(table_font_size='90%')
    .tab_options(table_width='90%')
    .tab_source_note(source_note='Source: https://mavenanalytics.io')
)

Spotify Songs History Dataset

Date Platform Mins played Song Artist Album Reason start Reason end Shuffle Skipped
2013-07-08 web player 3,185 Say It, Just Say It The Mowgli's Waiting For The Dawn autoplay clickrow false false
2013-07-08 web player 61,865 Drinking from the Bottle (feat. Tinie Tempah) Calvin Harris 18 Months clickrow clickrow false false
2013-07-08 web player 285,386 Born To Die Lana Del Rey Born To Die - The Paradise Edition clickrow unknown false false
2013-07-08 web player 134,022 Off To The Races Lana Del Rey Born To Die - The Paradise Edition trackdone clickrow false false
2013-07-08 web player 0 Half Mast Empire Of The Sun Walking On A Dream clickrow nextbtn false false
Source: https://mavenanalytics.io
Table 1: Preview of the dataset, showing the columns and a sample of data used in the analysis.
Show code
print(f'The data set has {df.shape[0]:,.0f} rows.')
print(f'The dataset goes from {df['year'].min()} up to {df['year'].max()}.')
The data set has 149,860 rows.
The dataset goes from 2013 up to 2024.

For further information about the fields of the dataset see Table 7

Unique Artists, Albums and Songs

Show code
# unique values per year
tops = (
        df.select('year','artist_name','album_name','track_name')
            .group_by('year', maintain_order=True)
            .n_unique()
)
Show code
(
    GT(tops)
    .tab_header(title=md("### Unique Artists, Albums and Songs Played by Year"))
    .fmt_number(columns=['artist_name','album_name','track_name'], decimals=0)
    .cols_label(
        year='Year',
        artist_name='Artists',
        album_name='Albums',
        track_name='Songs',
    )
    .tab_options(table_font_size='90%')
    .tab_options(table_width='90%')
    .tab_source_note(source_note='Source: Self-elaboration by author')
)

Unique Artists, Albums and Songs Played by Year

Year Artists Albums Songs
2013 63 95 149
2014 21 22 23
2015 610 924 1,357
2016 458 940 2,535
2017 647 1,357 3,326
2018 439 947 2,917
2019 493 1,029 2,934
2020 820 1,569 3,929
2021 1,578 2,674 5,126
2022 1,220 2,284 4,477
2023 1,456 2,336 4,042
2024 1,072 1,828 3,587
Source: Self-elaboration by author
Table 2: Unique Artists, Albums and Songs by Year.

Minutes Played by Month and Year

Show code
years = (
    df.group_by('year', maintain_order=True)
        .agg((pl.col('ms_played').sum()/60_000).round(2).alias('total_mins'))
)
Show code
months = (
    df.group_by_dynamic('ts', every='1mo')
        .agg((pl.col('ms_played').sum()/60_000).round(2)
        .alias('total_mins'))
)
Show code
fig = px.line(years, x='year', y='total_mins',)

fig.update_layout(
    title=dict(text='Spotify Played Minutes by Year', font=dict(size=30), yref='paper'),
    template='ggplot2',
    autosize=False,
    width=830,
    height=500,
    showlegend=True,
    plot_bgcolor='#f8f8f8',
    yaxis=dict(title='Minutes'),
    xaxis=dict(title=''),
    margin=dict(
        l=20,
        r=20,
        b=20,
        t=50,
        pad=4
    )
)

fig.update_traces(line_color='#007f00')
fig.show()
Figure 2: Total of Minutes Played By Year.
Show code
fig = px.line(months, x='ts', y='total_mins',)

fig.update_layout(
    title=dict(text='Spotify Played Minutes by Month', font=dict(size=30), yref='paper'),
    template='ggplot2',
    autosize=False,
    width=830,
    height=500,
    showlegend=True,
    plot_bgcolor='#f8f8f8',
    yaxis=dict(title='Minutes'),
    xaxis=dict(title=''),
    margin=dict(
        l=20,
        r=20,
        b=20,
        t=50,
        pad=4
    )
)

fig.update_traces(line_color='#007f00')
fig.show()
Figure 3: Total of Minutes Played By Year and Month.
Show code
(
    GT(years)
    .fmt_number(columns='total_mins', decimals=0)
    .cols_label(
        year='Year',
        total_mins='Minutes Played'
    )
    .tab_options(table_font_size='90%')
    .tab_options(table_width='80%')
    .tab_source_note(source_note='Source: Self-elaboration')
)
Year Minutes Played
2013 470
2014 62
2015 3,555
2016 11,857
2017 40,239
2018 28,454
2019 28,750
2020 55,239
2021 53,506
2022 38,495
2023 30,904
2024 28,962
Source: Self-elaboration

Top 05 Artists by Year

Show code
top_artists = (
    df
        .group_by('year','artist_name')
        .agg(pl.len().alias('count'))
        .filter(
            pl.col('count').rank(descending=True, method='dense').over('year') <= 5
        )
        .group_by('year')
        .agg(pl.col('artist_name'))
        .sort('year')
)
Show code
(
    GT(top_artists)
    .tab_header(title=md("### Top 05 Artist by Year"))
     .cols_width(
        cases={
            'year': '10%',
            'artist_name': '100%',
        }
    )
    .cols_label(
        year='Year',
        artist_name='Artist'
    )
    .cols_align(align='left', columns=('artist_name'))
    .tab_options(table_font_size='90%')
    .tab_options(table_width='90%')
    .tab_source_note(source_note='Source: Self-elaboration by author')
)

Top 05 Artist by Year

Year Artist
2013 ['Lana Del Rey', 'Coldplay', 'John Mayer', 'The Kooks', 'Passion Pit', 'U2']
2014 ['James Gang', 'Young Wonder', 'The Shins', 'Freddie King', 'The Neighbourhood', 'Ted Nugent', 'Funeral Suits', 'Arctic Monkeys', 'The Strokes', 'Blur', 'Karmon', 'Parachute Youth', 'JR JR', 'Ra Ra Riot', 'Blondfire', 'The Beatles Recovered Band', 'Atlas Genius', 'Discovery', 'Stevie Ray Vaughan', 'Mystery Jets', 'Bloc Party']
2015 ['Frank Sinatra', 'We The Kings', 'The Rolling Stones', 'Justin Bieber', 'The Script', 'David Bisbal']
2016 ['Vampire Weekend', 'Johnny Cash', 'Elvis Presley', 'The Beatles', 'Led Zeppelin']
2017 ['Radiohead', 'The Beatles', 'The Killers', 'Bob Dylan', 'John Mayer']
2018 ['The Killers', 'The Beatles', 'Paul McCartney', 'John Mayer', 'Bob Dylan']
2019 ['John Mayer', 'Bob Dylan', 'The Beatles', 'The Killers', 'Paul McCartney']
2020 ['The Strokes', 'The Beatles', 'The Killers', 'Bob Dylan', 'John Mayer']
2021 ['Bob Dylan', 'Kings of Leon', 'The Beatles', 'John Mayer', 'The Killers']
2022 ['The Beatles', 'The Killers', 'Joaquín Sabina', 'John Mayer', 'Howard Shore']
2023 ['Bob Dylan', 'The Killers', 'John Mayer', 'The Beatles', 'Howard Shore']
2024 ['The Killers', 'The Beatles', 'ABBA', 'Paul McCartney', 'John Mayer']
Source: Self-elaboration by author
Table 3: Top 05 Artists by Year.

Top 05 Albums by Year

Show code
top_albums = (
    df
        .group_by('year','album_name')
        .agg(pl.len().alias('count'))
        .filter(
            pl.col('count').rank(descending=True, method='dense').over('year') <= 5
        )
        .group_by('year')
        .agg(pl.col('album_name'))
        .sort('year')
)
Show code
(
    GT(top_albums)
    .tab_header(title=md("### Top 05 Albums by Year"))
     .cols_width(
        cases={
            'year': '10%',
            'album_name': '100%',
        }
    )
    .cols_label(
        year='Year',
        album_name='Album'
    )
    .cols_align(align='left', columns=('album_name'))
    .tab_options(table_font_size='90%')
    .tab_options(table_width='90%')
    .tab_source_note(source_note='Source: Self-elaboration by author')
)

Top 05 Albums by Year

Year Album
2013 ['Born To Die - The Paradise Edition', 'Oracular Spectacular', 'The 20/20 Experience (Deluxe Version)', 'Where the Light Is: John Mayer Live In Los Angeles', 'Paper Doll', 'Inside In / Inside Out', 'Battle Studies', 'Gossamer']
2014 ['When The Sun Goes Down', 'I Love You.', "I'm Sorry...", 'LP', '20th Century Masters: The Millennium Collection: Best Of Joe Walsh', 'Wincing The Night Away', 'My Someday', 'Through the Glass EP', 'Feel It EP', 'Lily of the Valley', 'Bloc Party EP', 'Twenty One', 'The Best Of', 'Can’t Get Better Than This', 'The Essential Ted Nugent', 'The Best Of Freddie King: The Shelter Years', 'Ra Ra Riot', 'Young Wonder', 'Blur: The Best Of', 'Room On Fire', "It's a Corporate World", '30 Beatles Top Hits']
2015 ['Somewhere Somehow', 'Ultimate Sinatra', 'Tú Y Yo', 'Hozier', 'No Sound Without Silence', 'Save Rock And Roll']
2016 ['At Folsom Prison', 'Elvis At Sun', 'Ultimate Sinatra', 'New York', 'The Beatles']
2017 ['Help!', 'The Beatles', 'Abbey Road', 'Past Masters', 'The Wall']
2018 ['The Wall', 'Egypt Station', 'Past Masters', 'Beatles For Sale - Remastered', 'The Beatles']
2019 ['Prismism', 'Past Masters', 'The Beatles', 'Abbey Road', 'Egypt Station']
2020 ['The Beatles', 'Hot Fuss', 'Imploding The Mirage', 'The Wall', 'The New Abnormal']
2021 ['Abbey Road', 'When You See Yourself', 'Pressure Machine', 'The Beatles', 'Past Masters']
2022 ['The Wall', 'The Beatles', 'The Lord of the Rings: The Fellowship of the Ring - the Complete Recordings', 'Past Masters', 'Pressure Machine', 'Nos Sobran Los Motivos', 'Revolver']
2023 ['Past Masters', 'Hot Fuss', 'The Beatles', "Sam's Town", 'Abbey Road']
2024 ['Paradise Valley', 'Born and Raised', 'The Beatles', 'Past Masters', 'Arrival']
Source: Self-elaboration by author
Table 4: Top 05 Albums by Year.

Top 05 Songs by Year

Show code
top_songs = (
    df
        .group_by('year','track_name')
        .agg(pl.len().alias('count'))
        .filter(
            pl.col('count').rank(descending=True, method='dense').over('year') <= 5
        )
        .group_by('year')
        .agg(pl.col('track_name'))
        .sort('year')
)
Show code
(
    GT(top_songs)
    .tab_header(title=md("### Top 05 Songs by Year"))
     .cols_width(
        cases={
            'year': '10%',
            'track_name': '100%',
        }
    )
    .cols_label(
        year='Year',
        track_name='Song'
    )
    .cols_align(align='left', columns=('track_name'))
    .tab_options(table_font_size='90%')
    .tab_options(table_width='90%')
    .tab_source_note(source_note='Source: Self-elaboration by author')
)

Top 05 Songs by Year

Year Song
2013 ['Breath Of Life', 'In Your Atmosphere - Live at the Nokia Theatre, Los Angeles, CA - December 2007', 'Beautiful Day', 'Impossible', 'Do I Wanna Know?', 'Heartbreak Warfare', 'Clocks', "I Still Haven't Found What I'm Looking For", 'Ooh La', 'Smile', 'Paper Doll', 'Next To Me', 'Take a Walk', 'Born To Die', "Free Fallin' - Live at the Nokia Theatre, Los Angeles, CA - December 2007", 'Millionaires', 'Last Nite', 'I Want You', 'Young And Beautiful', 'Half of My Heart', 'Mirrors', 'Dreaming of You']
2014 ["She's Hearing Voices", 'Feel It', 'Half in Love with Elizabeth', 'When The Sun Goes Down', 'Midnight Man', 'Parklife', 'Can You Tell', 'Wires', 'Sleeping Lessons', 'Florida', 'Little Wing', 'Wang Dang Sweet Poontang', 'Symptoms - EP Version', 'Skeletons', 'Flesh', "I Can't Win", 'I Want to Hold Your Hand', 'Pretty Young Thing', 'Awake Now', 'End of a Century', 'Walking By Myself - Remastered', 'Staying Up', 'Swing Tree']
2015 ['Superheroes', 'Switzerland', 'Diez Mil Maneras', 'What Do You Mean?', 'Paraíso']
2016 ['The River', 'My Way', 'South Bound Saurez - Remaster', 'Teenagers', "I've Got You Under My Skin", 'Not Today', 'Hallelujah']
2017 ['Perfect', 'The Man', 'Wildfire', 'In the Blood', 'Gravity', 'Bones', 'Married with Children - 2014 Remaster']
2018 ['Universal Gleam', "I Don't Know", 'Who Cares', 'People Want Peace', 'Dominoes', 'Band On The Run - 2010 Remaster', 'Sé Que Te Duele', 'No Words - 2010 Remaster', 'Come On To Me', 'With God on Our Side']
2019 ['Once', 'Reminder', 'Band On The Run - 2010 Remaster', 'Mariposa Traicionera', "Maybe It's Time"]
2020 ['Ode To The Mets', 'Dying Breed', 'Caution', 'Imploding The Mirage', 'My Own Soul’s Warning']
2021 ['When You See Yourself, Are You Far Away', 'Crucify Your Mind', 'Not in Nottingham', '100,000 People', 'Did My Best', 'Common People - Full Length Version']
2022 ['Telefonía', 'Postdata', 'Yolanda', 'Se Me Olvidó Otra Vez - Unplugged; 2020 Remasterizado', '19 Dias y 500 Noches - En Directo', 'Por el Bulevar de los Sueños Rotos']
2023 ['Always Alright', 'Primera Cita', 'Malibu', 'Qué Bonito Es Querer', 'Billy 4', 'You Sexy Thing']
2024 ['Dancing Queen', 'Superdeli', 'Does Your Mother Know', 'Knowing Me, Knowing You', 'Take A Chance On Me', 'Waterloo', 'Why Did It Have To Be Me?']
Source: Self-elaboration by author
Table 5: Top 05 Songs by Year.
Show code
platform = (
    df.group_by('platform')
        .agg(pl.len())
        .sort('len', descending=True)
        .with_columns((pl.col("len") / pl.sum("len")).alias("percent"))
)
Show code
(
    GT(platform)
    .tab_header(title=md("### Percentage of Use by Platform"))
    .fmt_number(columns='len', decimals=0)
    .fmt_percent('percent', decimals=2)
    .cols_label(
        platform='Platform',
        len='Total',
        percent='Percentage'
    )
    .cols_align(align='left', columns=('platform'))
    .tab_options(table_font_size='90%')
    .tab_options(table_width='90%')
    .tab_source_note(source_note='Source: Self-elaboration by author')
)

Percentage of Use by Platform

Platform Total Percentage
android 139,821 93.30%
cast to device 3,898 2.60%
iOS 3,049 2.03%
windows 1,691 1.13%
mac 1,176 0.78%
web player 225 0.15%
Source: Self-elaboration by author
Table 6: Total of Records Played by Platform.

Conclusions

In this analysis, we have shown the following metrics:

  • Total mins played by year and month
  • Total unique artists played by year
  • Total unique albums played by year
  • Total unique songs played by year
  • Top 05 artists by year
  • Top 05 albums by year
  • Top 05 songs by year
  • Use by platform

References

  • Maven Analytics (2025). Maven Music Challenge Dataset. Retrieved from Maven Analytics Website.
  • Yonak, R (2019). How Spotify has changed the way we listen to music. Retrieved from Audioxide Website.

Appendix

Data dictionary

Field Description
spotify_track_uri Spotify URI that uniquely identifies each track
ts Timestamp indicating when the track stopped playing in UTC
platform Platform used when streaming the track
ms_played Number of milliseconds the stream was played
track_name Name of the track
artist_name Name of the artist
album_name Name of the album
reason_start Why the track started
reason_end Why the track ended
shuffle If shuffle mode was used when playing the track (boolean)
skipped If the user skipped to the next song (boolean)
Table 7: Data dictionary, detailing field names, and descriptions

Contact

Jesus LM
Economist & Data Scientist

Medium | Linkedin | Twitter

Plotting with ggplot2 in Python
Gasoline Prices in Mexico
 
  • License

  •   © 2026