# import libraries
import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import os
from dotenv import load_dotenv
load_dotenv('.env')
import plotly.express as px
from great_tables import GT, md
Manipulating and Visualizing User Patterns with Polars
Jesus LM
Jan, 2026
This project explores a dataset of Spotify listening history using the Polars DataFrame library. Polars’ optimized data manipulation lets us process and analyze the data efficiently and extract meaningful insights into user behavior and the dynamics of the music ecosystem. The work also highlights the potential of Polars as a tool for large-scale data analysis in the music industry and beyond.
This study conducts a historical analysis of Spotify data, leveraging the efficiency of Polars, the analytical power of DuckDB, and the visualization capabilities of Plotly. By combining these tools, we examine trends in music consumption, artist popularity, and playlist dynamics over time.
Polars facilitates rapid data manipulation and cleaning of large Spotify datasets, while DuckDB enables complex SQL queries for in-depth analysis.
Plotly is then employed to create interactive visualizations, revealing patterns and insights that would be difficult to discern from raw data alone.
This analysis explores the evolution of musical genres, the impact of algorithmic recommendations, and the shifting landscape of artist distribution.
We demonstrate the efficacy of this toolchain for extracting meaningful narratives from Spotify’s extensive historical data, providing a comprehensive understanding of the platform’s influence on the music industry.
DuckDB is a powerful tool for data analysts and developers who need to run fast, efficient analytical queries on large datasets, especially in environments where simplicity and portability are crucial.
Polars is a modern DataFrame library designed for speed and efficiency, offering a compelling alternative to traditional data manipulation tools.
Schema([('spotify_track_uri', String),
('ts', Datetime(time_unit='us', time_zone=None)),
('platform', String),
('ms_played', Int64),
('track_name', String),
('artist_name', String),
('album_name', String),
('reason_start', String),
('reason_end', String),
('shuffle', Boolean),
('skipped', Boolean)])
(
GT(df.head())
.tab_header(title=md("### Spotify Songs History Dataset"))
.fmt_date(columns='ts', date_style='iso')
.fmt_number(columns='ms_played', decimals=0)
.cols_hide(columns='year')
.cols_label(
ts='Date',
platform='Platform',
ms_played='Ms played',
track_name='Song',
artist_name='Artist',
album_name='Album',
reason_start='Reason start',
reason_end='Reason end',
shuffle='Shuffle',
skipped='Skipped',
)
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: https://mavenanalytics.io')
)
Spotify Songs History Dataset

| Date | Platform | Ms played | Song | Artist | Album | Reason start | Reason end | Shuffle | Skipped |
|---|---|---|---|---|---|---|---|---|---|
| 2013-07-08 | web player | 3,185 | Say It, Just Say It | The Mowgli's | Waiting For The Dawn | autoplay | clickrow | false | false |
| 2013-07-08 | web player | 61,865 | Drinking from the Bottle (feat. Tinie Tempah) | Calvin Harris | 18 Months | clickrow | clickrow | false | false |
| 2013-07-08 | web player | 285,386 | Born To Die | Lana Del Rey | Born To Die - The Paradise Edition | clickrow | unknown | false | false |
| 2013-07-08 | web player | 134,022 | Off To The Races | Lana Del Rey | Born To Die - The Paradise Edition | trackdone | clickrow | false | false |
| 2013-07-08 | web player | 0 | Half Mast | Empire Of The Sun | Walking On A Dream | clickrow | nextbtn | false | false |

Source: https://mavenanalytics.io
The dataset has 149,860 rows.
It covers the years 2013 through 2024.
For further information about the fields of the dataset, see Table 7.
(
GT(tops)
.tab_header(title=md("### Unique Artists, Albums and Songs Played by Year"))
.fmt_number(columns=['artist_name','album_name','track_name'], decimals=0)
.cols_label(
year='Year',
artist_name='Artists',
album_name='Albums',
track_name='Songs',
)
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)
Unique Artists, Albums and Songs Played by Year

| Year | Artists | Albums | Songs |
|---|---|---|---|
| 2013 | 63 | 95 | 149 |
| 2014 | 21 | 22 | 23 |
| 2015 | 610 | 924 | 1,357 |
| 2016 | 458 | 940 | 2,535 |
| 2017 | 647 | 1,357 | 3,326 |
| 2018 | 439 | 947 | 2,917 |
| 2019 | 493 | 1,029 | 2,934 |
| 2020 | 820 | 1,569 | 3,929 |
| 2021 | 1,578 | 2,674 | 5,126 |
| 2022 | 1,220 | 2,284 | 4,477 |
| 2023 | 1,456 | 2,336 | 4,042 |
| 2024 | 1,072 | 1,828 | 3,587 |

Source: Self-elaboration by author
fig = px.line(years, x='year', y='total_mins')
fig.update_layout(
title=dict(text='Spotify Played Minutes by Year', font=dict(size=30), yref='paper'),
template='ggplot2',
autosize=False,
width=830,
height=500,
showlegend=True,
plot_bgcolor='#f8f8f8',
yaxis=dict(title='Minutes'),
xaxis=dict(title=''),
margin=dict(
l=20,
r=20,
b=20,
t=50,
pad=4
)
)
fig.update_traces(line_color='#007f00')
fig.show()
fig = px.line(months, x='ts', y='total_mins')
fig.update_layout(
title=dict(text='Spotify Played Minutes by Month', font=dict(size=30), yref='paper'),
template='ggplot2',
autosize=False,
width=830,
height=500,
showlegend=True,
plot_bgcolor='#f8f8f8',
yaxis=dict(title='Minutes'),
xaxis=dict(title=''),
margin=dict(
l=20,
r=20,
b=20,
t=50,
pad=4
)
)
fig.update_traces(line_color='#007f00')
fig.show()
| Year | Minutes Played |
|---|---|
| 2013 | 470 |
| 2014 | 62 |
| 2015 | 3,555 |
| 2016 | 11,857 |
| 2017 | 40,239 |
| 2018 | 28,454 |
| 2019 | 28,750 |
| 2020 | 55,239 |
| 2021 | 53,506 |
| 2022 | 38,495 |
| 2023 | 30,904 |
| 2024 | 28,962 |

Source: Self-elaboration
(
GT(top_artists)
.tab_header(title=md("### Top 05 Artists by Year"))
.cols_width(
cases={
'year': '10%',
'artist_name': '100%',
}
)
.cols_label(
year='Year',
artist_name='Artist'
)
.cols_align(align='left', columns=('artist_name'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)
Top 05 Artists by Year

| Year | Artist |
|---|---|
| 2013 | ['Lana Del Rey', 'Coldplay', 'John Mayer', 'The Kooks', 'Passion Pit', 'U2'] |
| 2014 | ['James Gang', 'Young Wonder', 'The Shins', 'Freddie King', 'The Neighbourhood', 'Ted Nugent', 'Funeral Suits', 'Arctic Monkeys', 'The Strokes', 'Blur', 'Karmon', 'Parachute Youth', 'JR JR', 'Ra Ra Riot', 'Blondfire', 'The Beatles Recovered Band', 'Atlas Genius', 'Discovery', 'Stevie Ray Vaughan', 'Mystery Jets', 'Bloc Party'] |
| 2015 | ['Frank Sinatra', 'We The Kings', 'The Rolling Stones', 'Justin Bieber', 'The Script', 'David Bisbal'] |
| 2016 | ['Vampire Weekend', 'Johnny Cash', 'Elvis Presley', 'The Beatles', 'Led Zeppelin'] |
| 2017 | ['Radiohead', 'The Beatles', 'The Killers', 'Bob Dylan', 'John Mayer'] |
| 2018 | ['The Killers', 'The Beatles', 'Paul McCartney', 'John Mayer', 'Bob Dylan'] |
| 2019 | ['John Mayer', 'Bob Dylan', 'The Beatles', 'The Killers', 'Paul McCartney'] |
| 2020 | ['The Strokes', 'The Beatles', 'The Killers', 'Bob Dylan', 'John Mayer'] |
| 2021 | ['Bob Dylan', 'Kings of Leon', 'The Beatles', 'John Mayer', 'The Killers'] |
| 2022 | ['The Beatles', 'The Killers', 'Joaquín Sabina', 'John Mayer', 'Howard Shore'] |
| 2023 | ['Bob Dylan', 'The Killers', 'John Mayer', 'The Beatles', 'Howard Shore'] |
| 2024 | ['The Killers', 'The Beatles', 'ABBA', 'Paul McCartney', 'John Mayer'] |

Source: Self-elaboration by author
(
GT(top_albums)
.tab_header(title=md("### Top 05 Albums by Year"))
.cols_width(
cases={
'year': '10%',
'album_name': '100%',
}
)
.cols_label(
year='Year',
album_name='Album'
)
.cols_align(align='left', columns=('album_name'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)
Top 05 Albums by Year

| Year | Album |
|---|---|
| 2013 | ['Born To Die - The Paradise Edition', 'Oracular Spectacular', 'The 20/20 Experience (Deluxe Version)', 'Where the Light Is: John Mayer Live In Los Angeles', 'Paper Doll', 'Inside In / Inside Out', 'Battle Studies', 'Gossamer'] |
| 2014 | ['When The Sun Goes Down', 'I Love You.', "I'm Sorry...", 'LP', '20th Century Masters: The Millennium Collection: Best Of Joe Walsh', 'Wincing The Night Away', 'My Someday', 'Through the Glass EP', 'Feel It EP', 'Lily of the Valley', 'Bloc Party EP', 'Twenty One', 'The Best Of', 'Can’t Get Better Than This', 'The Essential Ted Nugent', 'The Best Of Freddie King: The Shelter Years', 'Ra Ra Riot', 'Young Wonder', 'Blur: The Best Of', 'Room On Fire', "It's a Corporate World", '30 Beatles Top Hits'] |
| 2015 | ['Somewhere Somehow', 'Ultimate Sinatra', 'Tú Y Yo', 'Hozier', 'No Sound Without Silence', 'Save Rock And Roll'] |
| 2016 | ['At Folsom Prison', 'Elvis At Sun', 'Ultimate Sinatra', 'New York', 'The Beatles'] |
| 2017 | ['Help!', 'The Beatles', 'Abbey Road', 'Past Masters', 'The Wall'] |
| 2018 | ['The Wall', 'Egypt Station', 'Past Masters', 'Beatles For Sale - Remastered', 'The Beatles'] |
| 2019 | ['Prismism', 'Past Masters', 'The Beatles', 'Abbey Road', 'Egypt Station'] |
| 2020 | ['The Beatles', 'Hot Fuss', 'Imploding The Mirage', 'The Wall', 'The New Abnormal'] |
| 2021 | ['Abbey Road', 'When You See Yourself', 'Pressure Machine', 'The Beatles', 'Past Masters'] |
| 2022 | ['The Wall', 'The Beatles', 'The Lord of the Rings: The Fellowship of the Ring - the Complete Recordings', 'Past Masters', 'Pressure Machine', 'Nos Sobran Los Motivos', 'Revolver'] |
| 2023 | ['Past Masters', 'Hot Fuss', 'The Beatles', "Sam's Town", 'Abbey Road'] |
| 2024 | ['Paradise Valley', 'Born and Raised', 'The Beatles', 'Past Masters', 'Arrival'] |

Source: Self-elaboration by author
(
GT(top_songs)
.tab_header(title=md("### Top 05 Songs by Year"))
.cols_width(
cases={
'year': '10%',
'track_name': '100%',
}
)
.cols_label(
year='Year',
track_name='Song'
)
.cols_align(align='left', columns=('track_name'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)
Top 05 Songs by Year

| Year | Song |
|---|---|
| 2013 | ['Breath Of Life', 'In Your Atmosphere - Live at the Nokia Theatre, Los Angeles, CA - December 2007', 'Beautiful Day', 'Impossible', 'Do I Wanna Know?', 'Heartbreak Warfare', 'Clocks', "I Still Haven't Found What I'm Looking For", 'Ooh La', 'Smile', 'Paper Doll', 'Next To Me', 'Take a Walk', 'Born To Die', "Free Fallin' - Live at the Nokia Theatre, Los Angeles, CA - December 2007", 'Millionaires', 'Last Nite', 'I Want You', 'Young And Beautiful', 'Half of My Heart', 'Mirrors', 'Dreaming of You'] |
| 2014 | ["She's Hearing Voices", 'Feel It', 'Half in Love with Elizabeth', 'When The Sun Goes Down', 'Midnight Man', 'Parklife', 'Can You Tell', 'Wires', 'Sleeping Lessons', 'Florida', 'Little Wing', 'Wang Dang Sweet Poontang', 'Symptoms - EP Version', 'Skeletons', 'Flesh', "I Can't Win", 'I Want to Hold Your Hand', 'Pretty Young Thing', 'Awake Now', 'End of a Century', 'Walking By Myself - Remastered', 'Staying Up', 'Swing Tree'] |
| 2015 | ['Superheroes', 'Switzerland', 'Diez Mil Maneras', 'What Do You Mean?', 'Paraíso'] |
| 2016 | ['The River', 'My Way', 'South Bound Saurez - Remaster', 'Teenagers', "I've Got You Under My Skin", 'Not Today', 'Hallelujah'] |
| 2017 | ['Perfect', 'The Man', 'Wildfire', 'In the Blood', 'Gravity', 'Bones', 'Married with Children - 2014 Remaster'] |
| 2018 | ['Universal Gleam', "I Don't Know", 'Who Cares', 'People Want Peace', 'Dominoes', 'Band On The Run - 2010 Remaster', 'Sé Que Te Duele', 'No Words - 2010 Remaster', 'Come On To Me', 'With God on Our Side'] |
| 2019 | ['Once', 'Reminder', 'Band On The Run - 2010 Remaster', 'Mariposa Traicionera', "Maybe It's Time"] |
| 2020 | ['Ode To The Mets', 'Dying Breed', 'Caution', 'Imploding The Mirage', 'My Own Soul’s Warning'] |
| 2021 | ['When You See Yourself, Are You Far Away', 'Crucify Your Mind', 'Not in Nottingham', '100,000 People', 'Did My Best', 'Common People - Full Length Version'] |
| 2022 | ['Telefonía', 'Postdata', 'Yolanda', 'Se Me Olvidó Otra Vez - Unplugged; 2020 Remasterizado', '19 Dias y 500 Noches - En Directo', 'Por el Bulevar de los Sueños Rotos'] |
| 2023 | ['Always Alright', 'Primera Cita', 'Malibu', 'Qué Bonito Es Querer', 'Billy 4', 'You Sexy Thing'] |
| 2024 | ['Dancing Queen', 'Superdeli', 'Does Your Mother Know', 'Knowing Me, Knowing You', 'Take A Chance On Me', 'Waterloo', 'Why Did It Have To Be Me?'] |

Source: Self-elaboration by author
(
GT(platform)
.tab_header(title=md("### Percentage of Use by Platform"))
.fmt_number(columns='len', decimals=0)
.fmt_percent('percent', decimals=2)
.cols_label(
platform='Platform',
len='Total',
percent='Percentage'
)
.cols_align(align='left', columns=('platform'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)
Percentage of Use by Platform

| Platform | Total | Percentage |
|---|---|---|
| android | 139,821 | 93.30% |
| cast to device | 3,898 | 2.60% |
| iOS | 3,049 | 2.03% |
| windows | 1,691 | 1.13% |
| mac | 1,176 | 0.78% |
| web player | 225 | 0.15% |

Source: Self-elaboration by author
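In this analysis, we have shown the following metrics: unique artists, albums, and songs played by year; minutes played by year and by month; the top artists, albums, and songs by year; and the percentage of use by platform.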
Data dictionary
| Field | Description |
|---|---|
| spotify_track_uri | Spotify URI that uniquely identifies each track |
| ts | Timestamp indicating when the track stopped playing in UTC |
| platform | Platform used when streaming the track |
| ms_played | Number of milliseconds the stream was played |
| track_name | Name of the track |
| artist_name | Name of the artist |
| album_name | Name of the album |
| reason_start | Why the track started |
| reason_end | Why the track ended |
| shuffle | If shuffle mode was used when playing the track (boolean) |
| skipped | If the user skipped to the next song (boolean) |
Jesus LM
Economist & Data Scientist