Implementing a Database System with DuckDB for Local Processing and MotherDuck for Scalable Cloud Storage

An Overview

python
sql
Author

Jesus LM

Published

Jan, 2025

Abstract

This project explores how to leverage the strengths of DuckDB and MotherDuck to build a robust data processing and storage solution. DuckDB excels at fast in-memory analytics, while MotherDuck provides a scalable and cost-effective cloud data warehouse. By combining these technologies, you can achieve optimal performance for both local and cloud-based data operations.

Figure 1: Data Warehouse by Niklas Lang

Overview

MotherDuck is serverless, easy-to-use, cloud-based platform, based on the DuckDB query engine.
MotherDuck makes it easy to connect to your data, explore it, and ask questions about it.
It is a great option for businesses of all sizes who need to do basic data analysis.

  • MotherDuck based on DuckDB

Known for its speed and ease of use, and provides a managed service in the cloud. This means you don’t have to worry about setting up and maintaining your own DuckDB instance.

  • Data Warehousing and Analytics

MotherDuck can function as a data warehouse, allowing you to store and analyze large datasets. It can also be used as a query engine for data lakes, which are central repositories for storing various data formats.

  • Collaboration Features

One of MotherDuck’s strengths is collaboration. You can share snapshots of your databases with colleagues, enabling everyone to work from the same data view. This is useful for data exploration and building machine learning models.

  • Hybrid Data Execution

MotherDuck can handle data residing in the cloud (on MotherDuck itself) and data stored locally on your device. It intelligently figures out the most efficient way to run your queries, considering where the data resides.

  • Ease of Use

Just like DuckDB, MotherDuck is known for being user-friendly. This makes it accessible to data analysts who may not be experts in complex database systems.

Environment settings

Code
import polars as pl
import duckdb as db
import glob

Extraction from data sources

csv files

Code
csv_files = glob.glob('./datasets/*.csv')
Code
list(enumerate(csv_files))
[(0, './datasets/watercollection.csv'),
 (1, './datasets/ContainsNull.csv'),
 (2, './datasets/sales_info.csv'),
 (3, './datasets/cdmx-subway.csv'),
 (4, './datasets/airports.csv'),
 (5, './datasets/colors.csv'),
 (6, './datasets/sets.csv'),
 (7, './datasets/appl_stock.csv'),
 (8, './datasets/sales.csv')]

json files

Code
json_files = glob.glob('./datasets/*.json')
Code
list(enumerate(json_files))
[(0, './datasets/prevalencia.json'), (1, './datasets/people.json')]

database tables

Code
db_files = glob.glob('datasets/*.db')
Code
list(enumerate(db_files))
[(0, 'datasets/retail_db.db'), (1, 'datasets/restaurants.db')]

Data warehouse creation

Code
conn = db.connect('my_database.db')

Data warehouse load

Code
conn.sql(f"create or replace table water_collection as select * from '{csv_files[0]}' ")
Code
conn.sql(f"create or replace table contains_null as select * from '{csv_files[1]}' ")
Code
conn.sql(f"create or replace table sales_info as select * from '{csv_files[2]}' ")
Code
conn.sql(f"create or replace table cdmx_subway as select * from '{csv_files[3]}' ")
Code
conn.sql(f"create or replace table airports as select * from '{csv_files[4]}' ")
Code
conn.sql(f"create or replace table colors as select * from '{csv_files[5]}' ")
Code
conn.sql(f"create or replace table sets as select * from '{csv_files[6]}' ")
Code
conn.sql(f"create or replace table appl_stock as select * from '{csv_files[7]}' ")
Code
conn.sql(f"create or replace table sales as select * from '{csv_files[8]}' ")
Code
conn.sql(f"create or replace table prevalencia as select * from '{json_files[0]}' ")
Code
conn.sql(f"create or replace table people as select * from '{json_files[1]}' ")
Code
retail = db.connect('./datasets/retail_db.db')
retail.sql('show tables')
┌──────────────┐
│     name     │
│   varchar    │
├──────────────┤
│ retail_sales │
└──────────────┘
Code
retail_sales_pl = retail.sql('select * from retail_sales').pl()
Code
conn.execute("create or replace table retail_sales as from retail_sales_pl");
Code
restaurants = db.connect('./datasets/restaurants.db')
restaurants.sql('show tables')
┌─────────────┐
│    name     │
│   varchar   │
├─────────────┤
│ restaurants │
└─────────────┘
Code
restaurants_pl = restaurants.sql('select * from restaurants').pl()
Code
conn.execute("create or replace table restaurants as from restaurants_pl");

Data retrieval

Code
conn.sql('show databases')
┌───────────────┐
│ database_name │
│    varchar    │
├───────────────┤
│ my_database   │
└───────────────┘
Code
conn.sql('show tables')
┌──────────────────┐
│       name       │
│     varchar      │
├──────────────────┤
│ airports         │
│ appl_stock       │
│ cdmx_subway      │
│ colors           │
│ contains_null    │
│ people           │
│ prevalencia      │
│ restaurants      │
│ retail_sales     │
│ sales            │
│ sales_info       │
│ sets             │
│ water_collection │
├──────────────────┤
│     13 rows      │
└──────────────────┘
Code
conn.sql('select * from restaurants limit 5').pl()
shape: (5, 6)
name rating_count cost city cuisine rating
str i64 f64 str str i64
"The Golden Wok" 1477 33.620488 "Berlin" "American" 5
"Greek Gyros" 770 68.388874 "New York" "French" 1
"Taste of Italy" 4420 88.23168 "Amsterdam" "Chinese" 0
"Midnight Diner" 2155 12.965985 "Lisbon" "Mexican" 1
"Taste of Italy" 3375 52.785226 "Sydney" "Chinese" 1

Cloud Data Warehouse with Motherduck

Code
dw = db.connect('md')
Code
dw.sql('select current_database()').show()
┌────────────────────┐
│ current_database() │
│      varchar       │
├────────────────────┤
│ my_portfolio       │
└────────────────────┘

Convert queries from local database to polars

Code
airports_pl = conn.sql('select * from airports').pl()
appl_stock_pl = conn.sql('select * from appl_stock').pl()
cdmx_subway_pl = conn.sql('select * from cdmx_subway').pl()
colors_pl = conn.sql('select * from colors').pl()
contains_null_pl = conn.sql('select * from contains_null').pl()
people_pl = conn.sql('select * from people').pl()
prevalencia_pl = conn.sql('select * from prevalencia').pl()
restaurants_pl = conn.sql('select * from restaurants').pl()
retail_sales_pl = conn.sql('select * from retail_sales').pl()
sales_pl = conn.sql('select * from sales').pl()
sales_info_pl = conn.sql('select * from sales_info').pl()
sets_pl = conn.sql('select * from sets').pl()
water_collection_pl = conn.sql('select * from water_collection').pl()

Upload dataframes to MotherDuck

Code
dw.sql(f"create or replace table airports as select * from airports_pl");
Code
dw.sql(f"create or replace table appl_stock as select * from appl_stock_pl");
Code
dw.sql(f"create or replace table cdmx_subway as select * from cdmx_subway_pl");
Code
dw.sql(f"create or replace table colors as select * from colors_pl");
Code
dw.sql(f"create or replace table contains_null as select * from contains_null_pl");
Code
dw.sql(f"create or replace table people as select * from people_pl");
Code
dw.sql(f"create or replace table prevalencia as select * from prevalencia_pl");
Code
dw.sql(f"create or replace table restaurants as select * from restaurants_pl");
Code
dw.sql(f"create or replace table retail_sales as select * from retail_sales_pl");
Code
dw.sql(f"create or replace table sales as select * from sales_pl");
Code
dw.sql(f"create or replace table sales_info as select * from sales_info_pl");
Code
dw.sql(f"create or replace table sets as select * from sets_pl");
Code
dw.sql(f"create or replace table water_collection as select * from water_collection_pl");

Check uploaded tables

Code
dw.sql('show tables')
┌──────────────────┐
│       name       │
│     varchar      │
├──────────────────┤
│ airports         │
│ appl_stock       │
│ cdmx_subway      │
│ colors           │
│ contains_null    │
│ people           │
│ prevalencia      │
│ restaurants      │
│ retail_sales     │
│ sales            │
│ sales_info       │
│ sets             │
│ water_collection │
├──────────────────┤
│     13 rows      │
└──────────────────┘

Close all database connections

Code
# close db connections
conn.close()
retail.close()
restaurants.close()
dw.close()

Conclusions

By combining DuckDB’s in-memory processing capabilities with MotherDuck’s cloud-based data warehousing, you can create a powerful and flexible data processing and storage solution. This approach allows you to efficiently handle both local and cloud-based data operations, optimize performance, and scale your data infrastructure as needed.

MotherDuck is a cloud service that makes working with DuckDB, a fast and easy relational database, even easier.
MotherDuck provides a managed DuckDB experience, so you don’t have to worry about setup or maintenance.
You can use it for data warehousing, analytics, and data exploration in collaboration with others.
Even if your data is spread between the cloud and your device, MotherDuck can handle it.

Contact

Jesus LM
Economist & Data Scientist

Medium | Linkedin | Twitter