Data Alchemy, SQL Analysis within Python

Abstract

This project focuses on leveraging the strengths of DuckDB, Polars, and Plotly for efficient data analysis and visualization. DuckDB is used for fast in-memory data processing, Polars provides a user-friendly and high-performance DataFrame library, and Plotly offers interactive and customizable visualizations.

Project phases

Data Loading
- DuckDB: Load large datasets directly into DuckDB for efficient in-memory operations and query execution.
- Polars: Utilize Polars’ fast data loading capabilities, especially for CSV and Parquet files, to quickly ingest data into its DataFrame structure.
Data Transformation
- Polars: Perform data cleaning, filtering, aggregation, and other transformations within the Polars DataFrame efficiently.
- DuckDB: Execute complex SQL queries directly on the in-memory data within DuckDB for advanced data manipulation.
Data Visualization
- Polars: Use Polars’ built-in plotting capabilities for quick exploratory visualizations.
- Plotly: Leverage Plotly’s extensive library of interactive and customizable plots for in-depth analysis and presentation.
Performance Optimization
- Minimize Data Transfers: Avoid unnecessary data transfers between tools. For example, if possible, perform data transformations within DuckDB and then directly visualize results using Plotly.
- Utilize Parallel Processing: Leverage the parallel processing capabilities of both DuckDB and Polars to speed up data processing tasks.
- Optimize Queries: Write efficient SQL queries and use appropriate data types to maximize DuckDB’s performance.
- Caching: Cache intermediate results to avoid redundant computations.

Environment settings

Code

import numpy as np
import polars as pl
import random
import duckdb as db
import plotly.express as px

Create dummy dataset

Code

def generate_dummy_data(num_rows=10_000):
  """
  Generates dummy data for restaurants with name, rating_count, cost, city, and cuisine.
  Args:
    num_rows: Number of rows to generate.
  Returns:
    A list of dictionaries, where each dictionary represents a restaurant with the specified fields.
  """

  data = []
  names = ["The Cozy Nook", "Spice & Bloom", "The Golden Wok", "Midnight Diner", "Ocean Breeze", 
           "Cafe Delight", "The Burger Joint", "Pizza Palace", "Taste of Italy", "French Delights",
           "The Curry House", "Sushi Corner", "Greek Gyros", "Taco Town", "The BBQ Shack"]
  cities = ["New York", "London", "Paris", "Tokyo", "Rome", 
            "Berlin", "Sydney", "Madrid", "Amsterdam", "Lisbon"]
  cuisines = ["Italian", "Mexican", "Indian", "Chinese", "Japanese", 
              "American", "French", "Thai", "Greek", "Spanish"]

  for _ in range(num_rows):
    restaurant = {
        'name': random.choice(names),
        'rating_count': random.randint(100, 5000),
        'cost': random.uniform(10.0, 100.0),
        'city': random.choice(cities),
        'cuisine': random.choice(cuisines),
        'rating': random.randint(0, 5)
    }
    data.append(restaurant)

  return data

Code

# Generate rows of dummy data
dummy_restaurants = generate_dummy_data()

Create dataframe

Code

# Create dataframe
restaurants = pl.DataFrame(dummy_restaurants)

Code

# Show dataframe
(
    restaurants
        .to_pandas()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)

name	rating_count	cost	city	cuisine	rating
The Golden Wok	3,408	$37.75	Lisbon	French	2
Greek Gyros	1,484	$93.44	Paris	Italian	0
Ocean Breeze	2,792	$32.26	Berlin	Mexican	5
Greek Gyros	2,038	$20.78	London	Spanish	5
The Curry House	2,321	$94.79	Lisbon	French	1

Table 1: Restaurants dataset

Connect to database

Code

# connect to database
conn = db.connect('my_database.db')
#conn = db.connect('restaurants.db')

Code

# retrieve data from table
res = conn.sql('select * from restaurants limit 5')

Code

# Show table
(
    res.df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)

name	rating_count	cost	city	cuisine	rating
The Golden Wok	1,477	$33.62	Berlin	American	5
Greek Gyros	770	$68.39	New York	French	1
Taste of Italy	4,420	$88.23	Amsterdam	Chinese	0
Midnight Diner	2,155	$12.97	Lisbon	Mexican	1
Taste of Italy	3,375	$52.79	Sydney	Chinese	1

Table 2: Restaurants table from database

Queries

Which restaurant of London is visited by the least number of people?

Code

(
    conn.sql('''
    select * from restaurants
    where city = 'London' and rating_count = (select min(rating_count)
                                            from restaurants 
                                            where city = 'London')
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)

name	rating_count	cost	city	cuisine	rating
Greek Gyros	101	$13.82	London	Indian	2

Table 3: Query 1 result table

Which restaurant has generated maximum revenue?

Code

(
    conn.sql('''
    select * from restaurants
    where cost*rating_count = (select max(cost*rating_count)
                                from restaurants)
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)

name	rating_count	cost	city	cuisine	rating
Ocean Breeze	4,969	$99.92	Madrid	Chinese	4

Table 4: Query 2 result table

How many restaurants are having a rating more than the average rating?

Code

(
    conn.sql('''
    select * from restaurants
    where rating > (select avg(rating)
                    from restaurants)
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)

name	rating_count	cost	city	cuisine	rating
The Golden Wok	1,477	$33.62	Berlin	American	5
The Burger Joint	3,631	$25.36	London	American	4
The Curry House	3,033	$56.05	Amsterdam	Italian	3
The Burger Joint	4,656	$19.38	Amsterdam	American	3
Pizza Palace	1,862	$71.95	Paris	Japanese	5

Table 5: Query 3 result table

Which restaurant of New York has generated the most revenue?

Code

(
    conn.sql('''
    select * from restaurants
    where city = 'New York' and cost*rating_count = (select max(cost*rating_count)
                                                    from restaurants 
                                                    where city = 'New York')
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)

name	rating_count	cost	city	cuisine	rating
The Curry House	4,955	$97.92	New York	Japanese	5

Table 6: Query 4 result table

Which restaurant chain has the maximum number of restaurants?

Code

(
    conn.sql('''
    select name, count(name) as no_of_chains
    from restaurants
    group by name
    order by no_of_chains DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)

name	no_of_chains
The Burger Joint	721
Pizza Palace	703
Greek Gyros	696
Cafe Delight	692
French Delights	681
The BBQ Shack	671
The Golden Wok	667
Ocean Breeze	665
Spice & Bloom	665
Midnight Diner	657

Table 7: Query 5 result table

Code

res_chains = conn.sql('''
    select name, count(name) as no_of_chains
    from restaurants
    group by name
    order by no_of_chains DESC
    limit 10
''').pl()

fig = px.bar(res_chains, 
             x='no_of_chains',
             y='name',
             orientation='h',
             hover_data=['no_of_chains', 'name',],
             height=600,
             width=940,
             text_auto=True,
             title='Chains by Restaurant',)
fig.update_traces(textfont_size=12, 
                  textangle=0,
                  textposition='outside',
                  marker_color='rgb(55, 83, 109)',
                  marker_line_color='#000000',
                  marker_line_width=1.5,
                  opacity=0.8,)
fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Chains', yaxis_title='Restaurant')
fig.show()

Figure 2: Restaurant Chains

Which restaurant chain has generated maximum revenue?

Code

(
    conn.sql('''
    select name, sum(rating_count * cost) as revenue
    from restaurants
    group by name
    order by revenue DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'revenue': '{:,.2f}'})
)

name	revenue
The Burger Joint	108,820,424.64
Pizza Palace	100,382,853.92
Cafe Delight	98,037,063.93
Greek Gyros	96,403,445.25
The BBQ Shack	96,286,414.02
Ocean Breeze	95,664,612.77
The Golden Wok	94,860,125.75
Spice & Bloom	91,824,854.56
French Delights	91,236,701.38
Midnight Diner	91,170,162.30

Table 8: Query 6 result table

Code

rev = conn.sql('''
            select name, sum(rating_count * cost) as revenue
            from restaurants
            group by name
            order by revenue DESC
            limit 10
            ''').pl()

fig = px.bar(rev, 
             x='revenue',
             y='name',
             orientation='h',
             hover_data=['revenue', 'name',],
             height=600,
             width=940,
             text_auto=True,
             title='Revenue by Restaurant',)
fig.update_traces(textfont_size=12, 
                  textangle=0,
                  textposition='inside',
                  marker_color='#004700',
                  marker_line_color='#000000',
                  marker_line_width=1.5,
                  opacity=0.8,
                  hovertemplate='<br>'.join([
                      'Restaurant: %{y}',
                      'Revenue: %{x}',
                  ]),
                 )
fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Revenue', yaxis_title='Restaurant',
                 xaxis_tickprefix='$', xaxis_tickformat=',.2f')
fig.show()

Figure 3: Revenue by Restaurant

Which city has the maximum number of restaurants?

Code

(
    conn.sql('''
    select city, count(*) as no_of_restaurants
    from restaurants
    group by city
    order by no_of_restaurants DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'no_of_restaurants': '{:,.0f}'})
)

city	no_of_restaurants
Tokyo	1,071
Amsterdam	1,038
Lisbon	1,005
Madrid	1,003
London	993
Paris	992
Rome	979
Berlin	977
New York	971
Sydney	971

Table 9: Query 7 result table

Which city has generated maximum revenue?

Code

(
    conn.sql('''
    select city, sum(rating_count * cost) as revenue
    from restaurants
    group by city
    order by revenue DESC
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'revenue': '${:,.2f}'})
)

city	revenue
Amsterdam	$148,839,878.62
Tokyo	$148,035,421.32
Madrid	$141,487,618.97
Paris	$141,219,374.56
London	$140,876,613.54
Rome	$139,622,129.63
New York	$138,621,609.28
Lisbon	$136,814,247.27
Berlin	$136,434,163.25
Sydney	$131,656,513.97

Table 10: Query 8 result table

Code

cities = conn.sql('''
                select city, sum(rating_count * cost) as revenue
                from restaurants
                group by city
                order by revenue DESC
                limit 10
                ''').pl()

fig = px.bar(cities, 
             x='revenue',
             y='city',
             orientation='h',
             hover_data=['revenue', 'city',],
             height=600,
             width=940,
             text_auto=True,
             title='Revenue by City',)
fig.update_traces(textfont_size=12, 
                  textangle=0,
                  textposition='inside',
                  marker_color='#880808',
                  marker_line_color='#000000',
                  marker_line_width=1.5,
                  opacity=0.8,
                  hovertemplate='<br>'.join([
                      'City: %{y}',
                      'Revenue: %{x}',
                  ]),
                 )
fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Revenue', yaxis_title='City',
                 xaxis_tickprefix='$', xaxis_tickformat=',.2f')
fig.show()

Figure 4: Revenue by City

List 10 least expensive cuisines

Code

(
    conn.sql('''
    select cuisine, avg(cost) as avg_cost
    from restaurants
    group by cuisine
    order by avg_cost asc
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'avg_cost': '${:.2f}'})
)

cuisine	avg_cost
Indian	$53.92
American	$53.96
Italian	$54.61
Thai	$54.64
French	$54.65
Mexican	$55.47
Spanish	$55.54
Japanese	$55.55
Greek	$55.77
Chinese	$56.26

Table 11: Query 9 result table

List 10 most expensive cuisines

Code

(
    conn.sql('''
    select cuisine, avg(cost) as avg_cost
    from restaurants
    group by cuisine
    order by avg_cost desc
    limit 10
    ''')
        .df()
        .head(10)
        .style
        .hide()    
        .format({'avg_cost': '${:.2f}'})
)

cuisine	avg_cost
Chinese	$56.26
Greek	$55.77
Japanese	$55.55
Spanish	$55.54
Mexican	$55.47
French	$54.65
Thai	$54.64
Italian	$54.61
American	$53.96
Indian	$53.92

Table 12: Query 10 result table

What city has Mexican food as the most popular cuisine?

Code

(
    conn.sql('''
    select city, avg(cost) avg_cost, count(*) as restaurants
    from restaurants
    where cuisine = 'Mexican'
    group by city
    order by restaurants desc
    ''')
        .df()
        .head()
        .style
        .hide()    
        .format({'avg_cost': '${:.2f}'})
)

city	avg_cost	restaurants
London	$52.30	112
Tokyo	$48.83	111
Rome	$57.11	108
Amsterdam	$60.67	108
Lisbon	$56.35	101

Table 13: Query 11 result table

Code

# close connection
conn.close()

Conclusions

This project highlights the value of combining the strengths of SQL and Python for data-intensive tasks, providing a robust and efficient solution for a wide range of data analysis challenges.

By carefully considering the interaction between DuckDB, Polars, and Plotly, you can create a powerful and efficient data analysis and visualization pipeline.

This approach can lead organizations to unlock valuable insights from their data more quickly and effectively, driving data-driven decision-making.

Contact

Jesus LM
Economist & Data Scientist

Medium | Linkedin | Twitter