This project focuses on leveraging the strengths of DuckDB, Polars, and Plotly for efficient data analysis and visualization. DuckDB is used for fast in-memory data processing, Polars provides a user-friendly and high-performance DataFrame library, and Plotly offers interactive and customizable visualizations.
DuckDB: Load large datasets directly into DuckDB for efficient in-memory operations and query execution.
Polars: Utilize Polars’ fast data loading capabilities, especially for CSV and Parquet files, to quickly ingest data into its DataFrame structure.
Data Transformation
Polars: Perform data cleaning, filtering, aggregation, and other transformations within the Polars DataFrame efficiently.
DuckDB: Execute complex SQL queries directly on the in-memory data within DuckDB for advanced data manipulation.
Data Visualization
Polars: Use Polars’ built-in plotting capabilities for quick exploratory visualizations.
Plotly: Leverage Plotly’s extensive library of interactive and customizable plots for in-depth analysis and presentation.
Performance Optimization
Minimize Data Transfers: Avoid unnecessary data transfers between tools. For example, if possible, perform data transformations within DuckDB and then directly visualize results using Plotly.
Utilize Parallel Processing: Leverage the parallel processing capabilities of both DuckDB and Polars to speed up data processing tasks.
Optimize Queries: Write efficient SQL queries and use appropriate data types to maximize DuckDB’s performance.
Caching: Cache intermediate results to avoid redundant computations.
Environment settings
Code
import numpy as npimport polars as plimport randomimport duckdb as dbimport plotly.express as px
Create dummy dataset
Code
def generate_dummy_data(num_rows=10_000):""" Generates dummy data for restaurants with name, rating_count, cost, city, and cuisine. Args: num_rows: Number of rows to generate. Returns: A list of dictionaries, where each dictionary represents a restaurant with the specified fields. """ data = [] names = ["The Cozy Nook", "Spice & Bloom", "The Golden Wok", "Midnight Diner", "Ocean Breeze", "Cafe Delight", "The Burger Joint", "Pizza Palace", "Taste of Italy", "French Delights","The Curry House", "Sushi Corner", "Greek Gyros", "Taco Town", "The BBQ Shack"] cities = ["New York", "London", "Paris", "Tokyo", "Rome", "Berlin", "Sydney", "Madrid", "Amsterdam", "Lisbon"] cuisines = ["Italian", "Mexican", "Indian", "Chinese", "Japanese", "American", "French", "Thai", "Greek", "Spanish"]for _ inrange(num_rows): restaurant = {'name': random.choice(names),'rating_count': random.randint(100, 5000),'cost': random.uniform(10.0, 100.0),'city': random.choice(cities),'cuisine': random.choice(cuisines),'rating': random.randint(0, 5) } data.append(restaurant)return data
Code
# Generate rows of dummy datadummy_restaurants = generate_dummy_data()
Which restaurant of London is visited by the least number of people?
Code
( conn.sql(''' select * from restaurants where city = 'London' and rating_count = (select min(rating_count) from restaurants where city = 'London') ''') .df() .head() .style .hide() .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'}))
name
rating_count
cost
city
cuisine
rating
Greek Gyros
101
$13.82
London
Indian
2
Table 3: Query 1 result table
Which restaurant has generated maximum revenue?
Code
( conn.sql(''' select * from restaurants where cost*rating_count = (select max(cost*rating_count) from restaurants) ''') .df() .head() .style .hide() .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'}))
name
rating_count
cost
city
cuisine
rating
Ocean Breeze
4,969
$99.92
Madrid
Chinese
4
Table 4: Query 2 result table
How many restaurants are having a rating more than the average rating?
Code
( conn.sql(''' select * from restaurants where rating > (select avg(rating) from restaurants) ''') .df() .head() .style .hide() .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'}))
name
rating_count
cost
city
cuisine
rating
The Golden Wok
1,477
$33.62
Berlin
American
5
The Burger Joint
3,631
$25.36
London
American
4
The Curry House
3,033
$56.05
Amsterdam
Italian
3
The Burger Joint
4,656
$19.38
Amsterdam
American
3
Pizza Palace
1,862
$71.95
Paris
Japanese
5
Table 5: Query 3 result table
Which restaurant of New York has generated the most revenue?
Code
( conn.sql(''' select * from restaurants where city = 'New York' and cost*rating_count = (select max(cost*rating_count) from restaurants where city = 'New York') ''') .df() .head() .style .hide() .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'}))
name
rating_count
cost
city
cuisine
rating
The Curry House
4,955
$97.92
New York
Japanese
5
Table 6: Query 4 result table
Which restaurant chain has the maximum number of restaurants?
Code
( conn.sql(''' select name, count(name) as no_of_chains from restaurants group by name order by no_of_chains DESC limit 10 ''') .df() .head(10) .style .hide() .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'}))
name
no_of_chains
The Burger Joint
721
Pizza Palace
703
Greek Gyros
696
Cafe Delight
692
French Delights
681
The BBQ Shack
671
The Golden Wok
667
Ocean Breeze
665
Spice & Bloom
665
Midnight Diner
657
Table 7: Query 5 result table
Code
res_chains = conn.sql(''' select name, count(name) as no_of_chains from restaurants group by name order by no_of_chains DESC limit 10''').pl()fig = px.bar(res_chains, x='no_of_chains', y='name', orientation='h', hover_data=['no_of_chains', 'name',], height=600, width=940, text_auto=True, title='Chains by Restaurant',)fig.update_traces(textfont_size=12, textangle=0, textposition='outside', marker_color='rgb(55, 83, 109)', marker_line_color='#000000', marker_line_width=1.5, opacity=0.8,)fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Chains', yaxis_title='Restaurant')fig.show()
Figure 2: Restaurant Chains
Which restaurant chain has generated maximum revenue?
Code
( conn.sql(''' select name, sum(rating_count * cost) as revenue from restaurants group by name order by revenue DESC limit 10 ''') .df() .head(10) .style .hide() .format({'revenue': '{:,.2f}'}))
name
revenue
The Burger Joint
108,820,424.64
Pizza Palace
100,382,853.92
Cafe Delight
98,037,063.93
Greek Gyros
96,403,445.25
The BBQ Shack
96,286,414.02
Ocean Breeze
95,664,612.77
The Golden Wok
94,860,125.75
Spice & Bloom
91,824,854.56
French Delights
91,236,701.38
Midnight Diner
91,170,162.30
Table 8: Query 6 result table
Code
rev = conn.sql(''' select name, sum(rating_count * cost) as revenue from restaurants group by name order by revenue DESC limit 10 ''').pl()fig = px.bar(rev, x='revenue', y='name', orientation='h', hover_data=['revenue', 'name',], height=600, width=940, text_auto=True, title='Revenue by Restaurant',)fig.update_traces(textfont_size=12, textangle=0, textposition='inside', marker_color='#004700', marker_line_color='#000000', marker_line_width=1.5, opacity=0.8, hovertemplate='<br>'.join(['Restaurant: %{y}','Revenue: %{x}', ]), )fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Revenue', yaxis_title='Restaurant', xaxis_tickprefix='$', xaxis_tickformat=',.2f')fig.show()
Figure 3: Revenue by Restaurant
Which city has the maximum number of restaurants?
Code
( conn.sql(''' select city, count(*) as no_of_restaurants from restaurants group by city order by no_of_restaurants DESC limit 10 ''') .df() .head(10) .style .hide() .format({'no_of_restaurants': '{:,.0f}'}))
city
no_of_restaurants
Tokyo
1,071
Amsterdam
1,038
Lisbon
1,005
Madrid
1,003
London
993
Paris
992
Rome
979
Berlin
977
New York
971
Sydney
971
Table 9: Query 7 result table
Which city has generated maximum revenue?
Code
( conn.sql(''' select city, sum(rating_count * cost) as revenue from restaurants group by city order by revenue DESC limit 10 ''') .df() .head(10) .style .hide() .format({'revenue': '${:,.2f}'}))
city
revenue
Amsterdam
$148,839,878.62
Tokyo
$148,035,421.32
Madrid
$141,487,618.97
Paris
$141,219,374.56
London
$140,876,613.54
Rome
$139,622,129.63
New York
$138,621,609.28
Lisbon
$136,814,247.27
Berlin
$136,434,163.25
Sydney
$131,656,513.97
Table 10: Query 8 result table
Code
cities = conn.sql(''' select city, sum(rating_count * cost) as revenue from restaurants group by city order by revenue DESC limit 10 ''').pl()fig = px.bar(cities, x='revenue', y='city', orientation='h', hover_data=['revenue', 'city',], height=600, width=940, text_auto=True, title='Revenue by City',)fig.update_traces(textfont_size=12, textangle=0, textposition='inside', marker_color='#880808', marker_line_color='#000000', marker_line_width=1.5, opacity=0.8, hovertemplate='<br>'.join(['City: %{y}','Revenue: %{x}', ]), )fig.update_layout(yaxis=dict(autorange='reversed'), xaxis_title='Revenue', yaxis_title='City', xaxis_tickprefix='$', xaxis_tickformat=',.2f')fig.show()
Figure 4: Revenue by City
List 10 least expensive cuisines
Code
( conn.sql(''' select cuisine, avg(cost) as avg_cost from restaurants group by cuisine order by avg_cost asc limit 10 ''') .df() .head(10) .style .hide() .format({'avg_cost': '${:.2f}'}))
cuisine
avg_cost
Indian
$53.92
American
$53.96
Italian
$54.61
Thai
$54.64
French
$54.65
Mexican
$55.47
Spanish
$55.54
Japanese
$55.55
Greek
$55.77
Chinese
$56.26
Table 11: Query 9 result table
List 10 most expensive cuisines
Code
( conn.sql(''' select cuisine, avg(cost) as avg_cost from restaurants group by cuisine order by avg_cost desc limit 10 ''') .df() .head(10) .style .hide() .format({'avg_cost': '${:.2f}'}))
cuisine
avg_cost
Chinese
$56.26
Greek
$55.77
Japanese
$55.55
Spanish
$55.54
Mexican
$55.47
French
$54.65
Thai
$54.64
Italian
$54.61
American
$53.96
Indian
$53.92
Table 12: Query 10 result table
What city has Mexican food as the most popular cuisine?
Code
( conn.sql(''' select city, avg(cost) avg_cost, count(*) as restaurants from restaurants where cuisine = 'Mexican' group by city order by restaurants desc ''') .df() .head() .style .hide() .format({'avg_cost': '${:.2f}'}))
city
avg_cost
restaurants
London
$52.30
112
Tokyo
$48.83
111
Rome
$57.11
108
Amsterdam
$60.67
108
Lisbon
$56.35
101
Table 13: Query 11 result table
Code
# close connectionconn.close()
Conclusions
This project highlights the value of combining the strengths of SQL and Python for data-intensive tasks, providing a robust and efficient solution for a wide range of data analysis challenges.
By carefully considering the interaction between DuckDB, Polars, and Plotly, you can create a powerful and efficient data analysis and visualization pipeline.
This approach can lead organizations to unlock valuable insights from their data more quickly and effectively, driving data-driven decision-making.