Data Alchemy, SQL Analysis within Python

A Case Study using Duckdb, Polars and Plotly

python
sql
Author

Jesus LM

Published

Jan, 2025

Abstract

This project focuses on leveraging the strengths of DuckDB, Polars, and Plotly for efficient data analysis and visualization. DuckDB is used for fast in-memory data processing, Polars provides a user-friendly and high-performance DataFrame library, and Plotly offers interactive and customizable visualizations.

Project phases

  • Data Loading

    • DuckDB: Load large datasets directly into DuckDB for efficient in-memory operations and query execution.
    • Polars: Utilize Polars’ fast data loading capabilities, especially for CSV and Parquet files, to quickly ingest data into its DataFrame structure.
  • Data Transformation

    • Polars: Perform data cleaning, filtering, aggregation, and other transformations within the Polars DataFrame efficiently.
    • DuckDB: Execute complex SQL queries directly on the in-memory data within DuckDB for advanced data manipulation.
  • Data Visualization

    • Polars: Use Polars’ built-in plotting capabilities for quick exploratory visualizations.
    • Plotly: Leverage Plotly’s extensive library of interactive and customizable plots for in-depth analysis and presentation.
  • Performance Optimization

    • Minimize Data Transfers: Avoid unnecessary data transfers between tools. For example, if possible, perform data transformations within DuckDB and then directly visualize results using Plotly.
    • Utilize Parallel Processing: Leverage the parallel processing capabilities of both DuckDB and Polars to speed up data processing tasks.
    • Optimize Queries: Write efficient SQL queries and use appropriate data types to maximize DuckDB’s performance.
    • Caching: Cache intermediate results to avoid redundant computations.

Environment settings

Code
import numpy as np
import polars as pl
import random
import duckdb as db
import plotly.express as px

Create dummy dataset

Code
def generate_dummy_data(num_rows=10_000):
  """
  Generates dummy data for restaurants with name, rating_count, cost, city, and cuisine.
  Args:
    num_rows: Number of rows to generate.
  Returns:
    A list of dictionaries, where each dictionary represents a restaurant with the specified fields.
  """

  data = []
  names = ["The Cozy Nook", "Spice & Bloom", "The Golden Wok", "Midnight Diner", "Ocean Breeze", 
           "Cafe Delight", "The Burger Joint", "Pizza Palace", "Taste of Italy", "French Delights",
           "The Curry House", "Sushi Corner", "Greek Gyros", "Taco Town", "The BBQ Shack"]
  cities = ["New York", "London", "Paris", "Tokyo", "Rome", 
            "Berlin", "Sydney", "Madrid", "Amsterdam", "Lisbon"]
  cuisines = ["Italian", "Mexican", "Indian", "Chinese", "Japanese", 
              "American", "French", "Thai", "Greek", "Spanish"]

  for _ in range(num_rows):
    restaurant = {
        'name': random.choice(names),
        'rating_count': random.randint(100, 5000),
        'cost': random.uniform(10.0, 100.0),
        'city': random.choice(cities),
        'cuisine': random.choice(cuisines),
        'rating': random.randint(0, 5)
    }
    data.append(restaurant)

  return data
Code
# Generate rows of dummy data
dummy_restaurants = generate_dummy_data()

Create dataframe

Code
# Create dataframe
restaurants = pl.DataFrame(dummy_restaurants)
Code
# Show dataframe
(
    restaurants
        .to_pandas()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
name rating_count cost city cuisine rating
The Golden Wok 3,408 $37.75 Lisbon French 2
Greek Gyros 1,484 $93.44 Paris Italian 0
Ocean Breeze 2,792 $32.26 Berlin Mexican 5
Greek Gyros 2,038 $20.78 London Spanish 5
The Curry House 2,321 $94.79 Lisbon French 1
Table 1: Restaurants dataset

Connect to database

Code
# connect to database
conn = db.connect('my_database.db')
#conn = db.connect('restaurants.db')
Code
# retrieve data from table
res = conn.sql('select * from restaurants limit 5')
Code
# Show table
(
    res.df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
name rating_count cost city cuisine rating
The Golden Wok 1,477 $33.62 Berlin American 5
Greek Gyros 770 $68.39 New York French 1
Taste of Italy 4,420 $88.23 Amsterdam Chinese 0
Midnight Diner 2,155 $12.97 Lisbon Mexican 1
Taste of Italy 3,375 $52.79 Sydney Chinese 1
Table 2: Restaurants table from database

Queries

Which restaurant of London is visited by the least number of people?

Code
(
    conn.sql('''
    select * from restaurants
    where city = 'London' and rating_count = (select min(rating_count)
                                            from restaurants 
                                            where city = 'London')
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
name rating_count cost city cuisine rating
Greek Gyros 101 $13.82 London Indian 2
Table 3: Query 1 result table

Which restaurant has generated maximum revenue?

Code
(
    conn.sql('''
    select * from restaurants
    where cost*rating_count = (select max(cost*rating_count)
                                from restaurants)
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
name rating_count cost city cuisine rating
Ocean Breeze 4,969 $99.92 Madrid Chinese 4
Table 4: Query 2 result table

How many restaurants are having a rating more than the average rating?

Code
(
    conn.sql('''
    select * from restaurants
    where rating > (select avg(rating)
                    from restaurants)
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
name rating_count cost city cuisine rating
The Golden Wok 1,477 $33.62 Berlin American 5
The Burger Joint 3,631 $25.36 London American 4
The Curry House 3,033 $56.05 Amsterdam Italian 3
The Burger Joint 4,656 $19.38 Amsterdam American 3
Pizza Palace 1,862 $71.95 Paris Japanese 5
Table 5: Query 3 result table

Which restaurant of New York has generated the most revenue?

Code
(
    conn.sql('''
    select * from restaurants
    where city = 'New York' and cost*rating_count = (select max(cost*rating_count)
                                                    from restaurants 
                                                    where city = 'New York')
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
name rating_count cost city cuisine rating
The Curry House 4,955 $97.92 New York Japanese 5
Table 6: Query 4 result table

Which restaurant chain has the maximum number of restaurants?

Code
(
    conn.sql('''
    select name, count(name) as no_of_chains
    from restaurants
    group by name
    order by no_of_chains DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
name no_of_chains
The Burger Joint 721
Pizza Palace 703
Greek Gyros 696
Cafe Delight 692
French Delights 681
The BBQ Shack 671
The Golden Wok 667
Ocean Breeze 665
Spice & Bloom 665
Midnight Diner 657
Table 7: Query 5 result table
Code
res_chains = conn.sql('''
    select name, count(name) as no_of_chains
    from restaurants
    group by name
    order by no_of_chains DESC
    limit 10
''').pl()

fig = px.bar(res_chains, 
             x='no_of_chains',
             y='name',
             orientation='h',
             hover_data=['no_of_chains', 'name',],
             height=600,
             width=940,
             text_auto=True,
             title='Chains by Restaurant',)
fig.update_traces(textfont_size=12, 
                  textangle=0,
                  textposition='outside',
                  marker_color='rgb(55, 83, 109)',
                  marker_line_color='#000000',
                  marker_line_width=1.5,
                  opacity=0.8,)
fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Chains', yaxis_title='Restaurant')
fig.show()
Figure 2: Restaurant Chains

Which restaurant chain has generated maximum revenue?

Code
(
    conn.sql('''
    select name, sum(rating_count * cost) as revenue
    from restaurants
    group by name
    order by revenue DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'revenue': '{:,.2f}'})
)
name revenue
The Burger Joint 108,820,424.64
Pizza Palace 100,382,853.92
Cafe Delight 98,037,063.93
Greek Gyros 96,403,445.25
The BBQ Shack 96,286,414.02
Ocean Breeze 95,664,612.77
The Golden Wok 94,860,125.75
Spice & Bloom 91,824,854.56
French Delights 91,236,701.38
Midnight Diner 91,170,162.30
Table 8: Query 6 result table
Code
rev = conn.sql('''
            select name, sum(rating_count * cost) as revenue
            from restaurants
            group by name
            order by revenue DESC
            limit 10
            ''').pl()

fig = px.bar(rev, 
             x='revenue',
             y='name',
             orientation='h',
             hover_data=['revenue', 'name',],
             height=600,
             width=940,
             text_auto=True,
             title='Revenue by Restaurant',)
fig.update_traces(textfont_size=12, 
                  textangle=0,
                  textposition='inside',
                  marker_color='#004700',
                  marker_line_color='#000000',
                  marker_line_width=1.5,
                  opacity=0.8,
                  hovertemplate='<br>'.join([
                      'Restaurant: %{y}',
                      'Revenue: %{x}',
                  ]),
                 )
fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Revenue', yaxis_title='Restaurant',
                 xaxis_tickprefix='$', xaxis_tickformat=',.2f')
fig.show()
Figure 3: Revenue by Restaurant

Which city has the maximum number of restaurants?

Code
(
    conn.sql('''
    select city, count(*) as no_of_restaurants
    from restaurants
    group by city
    order by no_of_restaurants DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'no_of_restaurants': '{:,.0f}'})
)
city no_of_restaurants
Tokyo 1,071
Amsterdam 1,038
Lisbon 1,005
Madrid 1,003
London 993
Paris 992
Rome 979
Berlin 977
New York 971
Sydney 971
Table 9: Query 7 result table

Which city has generated maximum revenue?

Code
(
    conn.sql('''
    select city, sum(rating_count * cost) as revenue
    from restaurants
    group by city
    order by revenue DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'revenue': '${:,.2f}'})
)
city revenue
Amsterdam $148,839,878.62
Tokyo $148,035,421.32
Madrid $141,487,618.97
Paris $141,219,374.56
London $140,876,613.54
Rome $139,622,129.63
New York $138,621,609.28
Lisbon $136,814,247.27
Berlin $136,434,163.25
Sydney $131,656,513.97
Table 10: Query 8 result table
Code
cities = conn.sql('''
                select city, sum(rating_count * cost) as revenue
                from restaurants
                group by city
                order by revenue DESC
                limit 10
                ''').pl()

fig = px.bar(cities, 
             x='revenue',
             y='city',
             orientation='h',
             hover_data=['revenue', 'city',],
             height=600,
             width=940,
             text_auto=True,
             title='Revenue by City',)
fig.update_traces(textfont_size=12, 
                  textangle=0,
                  textposition='inside',
                  marker_color='#880808',
                  marker_line_color='#000000',
                  marker_line_width=1.5,
                  opacity=0.8,
                  hovertemplate='<br>'.join([
                      'City: %{y}',
                      'Revenue: %{x}',
                  ]),
                 )
fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Revenue', yaxis_title='City',
                 xaxis_tickprefix='$', xaxis_tickformat=',.2f')
fig.show()
Figure 4: Revenue by City

List 10 least expensive cuisines

Code
(
    conn.sql('''
    select cuisine, avg(cost) as avg_cost
    from restaurants
    group by cuisine
    order by avg_cost asc
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'avg_cost': '${:.2f}'})
)
cuisine avg_cost
Indian $53.92
American $53.96
Italian $54.61
Thai $54.64
French $54.65
Mexican $55.47
Spanish $55.54
Japanese $55.55
Greek $55.77
Chinese $56.26
Table 11: Query 9 result table

List 10 most expensive cuisines

Code
(
    conn.sql('''
    select cuisine, avg(cost) as avg_cost
    from restaurants
    group by cuisine
    order by avg_cost desc
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'avg_cost': '${:.2f}'})
)
cuisine avg_cost
Chinese $56.26
Greek $55.77
Japanese $55.55
Spanish $55.54
Mexican $55.47
French $54.65
Thai $54.64
Italian $54.61
American $53.96
Indian $53.92
Table 12: Query 10 result table

Conclusions

This project highlights the value of combining the strengths of SQL and Python for data-intensive tasks, providing a robust and efficient solution for a wide range of data analysis challenges.

By carefully considering the interaction between DuckDB, Polars, and Plotly, you can create a powerful and efficient data analysis and visualization pipeline.

This approach can lead organizations to unlock valuable insights from their data more quickly and effectively, driving data-driven decision-making.

Contact

Jesus LM
Economist & Data Scientist

Medium | Linkedin | Twitter