Geospatial Analysis in Python

A Case Study using DuckDB, Polars and Folium

python
Author

Jesus LM

Published

Jan, 2025

Abstract

Geospatial analysis involves the application of spatial concepts and techniques to data that has geographic coordinates. With the rise of big data and the increasing availability of geospatial information, the demand for effective geospatial analysis tools has grown significantly. Python, with its rich ecosystem of libraries, has emerged as a powerful and popular choice for geospatial data scientists.

Figure 1: Image by Henrikki Tenkanen, Vuokko Heikinheimo, David Whipp

Environment setting

Code
# import libraries
import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import folium
from great_tables import GT, md
from warnings import filterwarnings
filterwarnings('ignore')

Data collection

Code
conn = db.connect('datasets/geospatial.db')
Code
conn.sql('show tables')
┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ zomato  │
└─────────┘
Code
data = conn.sql('select * from zomato').pl()
Code
data.columns
['url',
 'address',
 'name',
 'online_order',
 'book_table',
 'rate',
 'votes',
 'phone',
 'location',
 'rest_type',
 'dish_liked',
 'cuisines',
 'approx_cost(for two people)',
 'reviews_list',
 'menu_item',
 'listed_in(type)',
 'listed_in(city)']
Code
data.shape
(51717, 17)

Data preprocessing

Code
data.is_duplicated().sum()
0
Code
data.select(pl.all().is_null().sum()).to_dicts()
[{'url': 0,
  'address': 0,
  'name': 0,
  'online_order': 0,
  'book_table': 0,
  'rate': 7775,
  'votes': 0,
  'phone': 1208,
  'location': 21,
  'rest_type': 227,
  'dish_liked': 28078,
  'cuisines': 45,
  'approx_cost(for two people)': 346,
  'reviews_list': 0,
  'menu_item': 0,
  'listed_in(type)': 0,
  'listed_in(city)': 0}]
Code
# Only a few rows are missing a location, so we can drop them
data = data.drop_nulls(subset='location')
Code
(
    GT(data.select('address','name','rate','votes','location','rest_type','dish_liked','cuisines').head(3))
    .tab_header(
        title=md('Zomato Restaurants')
    )
    .cols_width(cases={'rate': '50px'})
    .tab_source_note(source_note=md('<br> *Source: Shan Singh*'))
)
Zomato Restaurants
address | name | rate | votes | location | rest_type | dish_liked | cuisines
942, 21st Main Road, 2nd Stage, Banashankari, Bangalore | Jalsa | 4.1/5 | 775 | Banashankari | Casual Dining | Pasta, Lunch Buffet, Masala Papad, Paneer Lajawab, Tomato Shorba, Dum Biryani, Sweet Corn Soup | North Indian, Mughlai, Chinese
2nd Floor, 80 Feet Road, Near Big Bazaar, 6th Block, Kathriguppe, 3rd Stage, Banashankari, Bangalore | Spice Elephant | 4.1/5 | 787 | Banashankari | Casual Dining | Momos, Lunch Buffet, Chocolate Nirvana, Thai Green Curry, Paneer Tikka, Dum Biryani, Chicken Biryani | Chinese, North Indian, Thai
1112, Next to KIMS Medical College, 17th Cross, 2nd Stage, Banashankari, Bangalore | San Churro Cafe | 3.8/5 | 918 | Banashankari | Cafe, Casual Dining | Churros, Cannelloni, Minestrone Soup, Hot Chocolate, Pink Sauce Pasta, Salsa, Veg Supreme Pizza | Cafe, Mexican, Italian

Source: Shan Singh
Table 1: Zomato Restaurants from Singh, S (2024) Geospatial Data Science in Python
Code
# create a copy
df = data.clone()

Let's make every place name more specific so that the geocoder returns more accurate geographical coordinates.

Code
df = df.with_columns(
    location=(pl.col('location') + ', Bangalore, Karnataka, India')
)
Code
df.select('location').sample(5).to_dicts()
[{'location': 'HSR, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'}]
Code
df.schema
Schema([('url', String),
        ('address', String),
        ('name', String),
        ('online_order', Boolean),
        ('book_table', Boolean),
        ('rate', String),
        ('votes', Int64),
        ('phone', String),
        ('location', String),
        ('rest_type', String),
        ('dish_liked', String),
        ('cuisines', String),
        ('approx_cost(for two people)', String),
        ('reviews_list', String),
        ('menu_item', String),
        ('listed_in(type)', String),
        ('listed_in(city)', String)])

Extract coordinates from data

First, we will extract latitudes and longitudes using the ‘location’ feature.

Code
rest_loc = pl.DataFrame()
Code
# one row per unique location, stored in a column called 'name'
rest_loc = df.select(pl.col('location').alias('name')).unique()
Code
rest_loc.sample(5).to_dicts()
[{'name': 'Jakkur, Bangalore, Karnataka, India'},
 {'name': 'Kalyan Nagar, Bangalore, Karnataka, India'},
 {'name': 'RT Nagar, Bangalore, Karnataka, India'},
 {'name': 'Koramangala 7th Block, Bangalore, Karnataka, India'},
 {'name': 'Kaggadasapura, Bangalore, Karnataka, India'}]
Code
# Nominatim is a tool to search OpenStreetMap data by address or location
from geopy.geocoders import Nominatim
Code
geolocator = Nominatim(user_agent='app', timeout=None)
Code
lat = []  # latitudes, one per unique location
lon = []  # longitudes, one per unique location

for name in rest_loc['name']:
    location = geolocator.geocode(name)

    if location is None:
        # no match found: store NaN so the lists stay aligned with rest_loc
        lat.append(np.nan)
        lon.append(np.nan)
    else:
        lat.append(location.latitude)
        lon.append(location.longitude)
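
The public Nominatim server asks clients to throttle their requests, so it can help to pause between lookups and to avoid geocoding the same name twice. Below is a minimal sketch of that idea, not the method used in this article. The helper geocode_all and the stand-in fake_geocode are hypothetical names introduced only for illustration; fake_geocode replaces the real network call so the example runs offline.

```python
import time
from typing import Callable, Optional

Coords = Optional[tuple[float, float]]


def geocode_all(
    names: list[str],
    geocode: Callable[[str], Coords],
    min_delay: float = 1.0,
) -> dict[str, Coords]:
    """Geocode each distinct name once, pausing between lookups."""
    results: dict[str, Coords] = {}
    for name in names:
        if name in results:   # already resolved, skip the repeated lookup
            continue
        if results:           # pause before every request except the first
            time.sleep(min_delay)
        try:
            results[name] = geocode(name)
        except Exception:     # treat a failed lookup like "not found"
            results[name] = None
    return results


# Hypothetical stand-in for a real geocoder, so the sketch runs offline.
def fake_geocode(name: str) -> Coords:
    known = {'BTM, Bangalore, Karnataka, India': (12.9163603, 77.604733)}
    return known.get(name)


coords = geocode_all(
    [
        'BTM, Bangalore, Karnataka, India',
        'Sadashiv Nagar, Bangalore, Karnataka, India',
        'BTM, Bangalore, Karnataka, India',  # duplicate, looked up only once
    ],
    fake_geocode,
    min_delay=0.0,  # no pause needed for the offline stand-in
)
print(coords)
```

With geopy, the geocode argument could be a small wrapper around geolocator.geocode that returns (latitude, longitude) or None. geopy also ships a RateLimiter class in geopy.extra.rate_limiter that covers the throttling part.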
Code
lat[:10]
[13.0621474,
 12.9846713,
 12.981015523680384,
 12.985098650000001,
 12.9096941,
 nan,
 12.9067683,
 12.938455602031697,
 12.9176571,
 12.9489339]
Code
rest_loc = rest_loc.with_columns(
    lat=pl.Series(lat), # For python lists, construct a Series
    lon=pl.Series(lon),
)
Code
(
    GT(rest_loc.head(5), auto_align=True)
    .tab_header(
        title=md('Zomato Restaurants Coordinates')
    )
    .fmt_number(columns=['lat','lon'], decimals=4, use_seps=False)
    .cols_width(cases={'name': '200%', 'lat': '90%', 'lon': '90%'})
    .tab_source_note(source_note=md('<br>*Source: Shan Singh*'))
)
Zomato Restaurants Coordinates
name | lat | lon
Sahakara Nagar, Bangalore, Karnataka, India | 13.0621 | 77.5801
Kaggadasapura, Bangalore, Karnataka, India | 12.9847 | 77.6791
Infantry Road, Bangalore, Karnataka, India | 12.9810 | 77.6021
CV Raman Nagar, Bangalore, Karnataka, India | 12.9851 | 77.6631
JP Nagar, Bangalore, Karnataka, India | 12.9097 | 77.5866

Source: Shan Singh
Table 2: Zomato restaurants coordinates from Singh, S (2024) Geospatial Data Science in Python

We have now found the latitude and longitude of each location listed in the dataset using geopy. These coordinates are used to plot the maps.

Code
pl.Series(rest_loc.select('lat')).is_null().sum()
0
Code
pl.Series(rest_loc.select('lat')).is_nan().sum()
2
Code
rest_loc.filter(pl.col('lat').is_nan())
shape: (2, 3)
name lat lon
str f64 f64
"Sadashiv Nagar, Bangalore, Kar… NaN NaN
"Rammurthy Nagar, Bangalore, Ka… NaN NaN
Code
rest_loc = rest_loc.drop_nans()

Where are most restaurants located in Bangalore?

Code
rest_locations = pl.Series(df.select('location')).value_counts(sort=True, name='total')
Code
rest_locations = rest_locations.rename({'location':'name', 'total':'count'})
Code
(
    GT(rest_locations.head(), auto_align=True)
    .tab_header(
        title=md('Zomato Restaurants Count')
    )
    .cols_width(cases={'name': '200%',})
    .tab_source_note(source_note=md('<br>*Source: Shan Singh*'))
)
Zomato Restaurants Count
name | count
BTM, Bangalore, Karnataka, India | 5124
HSR, Bangalore, Karnataka, India | 2523
Koramangala 5th Block, Bangalore, Karnataka, India | 2504
JP Nagar, Bangalore, Karnataka, India | 2235
Whitefield, Bangalore, Karnataka, India | 2144

Source: Shan Singh
Table 3: Zomato restaurants count from Singh, S (2024) Geospatial Data Science in Python

These are the locations with the most restaurants.

Let's create a heatmap of these results to make them easier to read.

To perform spatial analysis, we need the latitude and longitude of every location, so let's join both dataframes to get the geographical coordinates.

Code
beng_rest_locations = rest_locations.join(rest_loc, on='name')
Code
(
    GT(beng_rest_locations.head(), auto_align=True)
    .tab_header(
        title=md('Zomato Restaurants Count & coordinates')
    )
    .cols_width(cases={'name': '200%',})
    .tab_source_note(source_note=md('<br>*Source: Shan Singh*'))
)
Zomato Restaurants Count & coordinates
name | count | lat | lon
BTM, Bangalore, Karnataka, India | 5124 | 12.9163603 | 77.604733
HSR, Bangalore, Karnataka, India | 2523 | 12.90056335 | 77.64947470503677
Koramangala 5th Block, Bangalore, Karnataka, India | 2504 | 12.9348429 | 77.6189768
JP Nagar, Bangalore, Karnataka, India | 2235 | 12.9096941 | 77.5866067
Whitefield, Bangalore, Karnataka, India | 2144 | 12.9696365 | 77.7497448

Source: Shan Singh
Table 4: Zomato restaurants count and coordinates from Singh, S (2024) Geospatial Data Science in Python

To show this on a map as a heatmap, we first need to create a base map so that we can draw the heatmap on top of it.

Code
def Generate_basemap():
    basemap = folium.Map(location=[12.97 , 77.59], zoom_start=11)
    return basemap
Code
# Geographic heat maps are used to identify where something occurs, and demonstrate areas of high and low density...
from folium.plugins import HeatMap
Code
basemap = Generate_basemap()
Code
beng_rest_locations = beng_rest_locations.to_pandas()
Code
HeatMap(beng_rest_locations[['lat', 'lon' , 'count']]).add_to(basemap)
<folium.plugins.heat_map.HeatMap at 0x3058e0da0>
Code
basemap
Figure 2: Zomato Restaurants Heatmap
Note

You can interact with the above map by zooming in or out.

The majority of the restaurants are located in the city centre area.
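
One rough way to put numbers on "city centre" is to measure how far the busiest locations are from the centre of the base map. The sketch below uses the haversine formula with the coordinates from Table 4 and assumes the map centre (12.97, 77.59) is a fair proxy for the city centre. The helper haversine_km is a name introduced here for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


centre = (12.97, 77.59)  # centre of the base map, assumed to be the city centre
top_locations = {        # coordinates copied from Table 4
    'BTM': (12.9163603, 77.604733),
    'HSR': (12.90056335, 77.64947470503677),
    'Koramangala 5th Block': (12.9348429, 77.6189768),
    'JP Nagar': (12.9096941, 77.5866067),
    'Whitefield': (12.9696365, 77.7497448),
}

distances = {
    name: haversine_km(*centre, *pos) for name, pos in top_locations.items()
}
for name, km in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f'{name}: {km:.1f} km')
```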

Performing Marker Cluster Analysis

Code
from folium.plugins import FastMarkerCluster
Code
basemap = Generate_basemap()
Code
FastMarkerCluster(beng_rest_locations[['lat', 'lon' , 'count']]).add_to(basemap)
<folium.plugins.fast_marker_cluster.FastMarkerCluster at 0x30a2cd280>
Code
basemap
Figure 3: Zomato Marker Cluster Map
Note

You can interact with the above map by zooming in or out.

Mapping markers for all Bangalore locations

Plotting markers on the map:

Folium provides the folium.Marker() class for plotting markers on a map.

Just pass the latitude and longitude of the location, optionally set the popup and tooltip, and add the marker to the map.

Plotting markers is a two-step process:

  1. Create a base map on which your markers will be placed.
  2. Add your markers to it.
Code
m = Generate_basemap()
Code
# Add points to the map
for index, row in beng_rest_locations.iterrows():
    folium.Marker(location=[row['lat'], row['lon']], popup=row['count']).add_to(m)
Code
m
Figure 4: Zomato Restaurants Marker Map
Note

You can interact with the above map by zooming in or out.

Rate field cleaning

To analyse where the restaurants with a high average rating are located, we first need to clean the ‘rate’ feature.

Code
(
    df.filter(
        pl.col('rate').str.contains('^([^0-9]*)$')
    )
    .select('rate')
    .unique()
    .to_dicts()
)
[{'rate': '-'}, {'rate': 'NEW'}]
Code
pl.Series(df.select('rate')).is_null().sum()
7754
Code
# approximately 15% of the rating values are missing
pl.Series(df.select('rate')).is_null().sum()/pl.Series(df.select('rate')).len()*100
14.999226245744351
Code
df = (
    df.drop_nulls(subset='rate')
        .with_columns(
            pl.col('rate').replace(['NEW', '-',], ['0', '0'])
        )
        .with_columns(
            rating=pl.col('rate').str.replace('/5', '')
        )
        .with_columns(
            pl.col('rating').str.strip_chars()
        )
        .cast({'rating': pl.Float32})
)
Code
df.select('rating').unique().to_dicts()
[{'rating': 2.4000000953674316},
 {'rating': 2.299999952316284},
 {'rating': 0.0},
 {'rating': 3.5},
 {'rating': 4.099999904632568},
 {'rating': 4.400000095367432},
 {'rating': 4.699999809265137},
 {'rating': 4.900000095367432},
 {'rating': 2.700000047683716},
 {'rating': 3.9000000953674316},
 {'rating': 3.799999952316284},
 {'rating': 3.4000000953674316},
 {'rating': 3.0},
 {'rating': 2.5999999046325684},
 {'rating': 3.299999952316284},
 {'rating': 4.199999809265137},
 {'rating': 2.200000047683716},
 {'rating': 4.0},
 {'rating': 4.5},
 {'rating': 2.5},
 {'rating': 3.5999999046325684},
 {'rating': 3.700000047683716},
 {'rating': 2.0999999046325684},
 {'rating': 4.800000190734863},
 {'rating': 3.200000047683716},
 {'rating': 2.799999952316284},
 {'rating': 4.300000190734863},
 {'rating': 2.9000000953674316},
 {'rating': 2.0},
 {'rating': 4.599999904632568},
 {'rating': 3.0999999046325684},
 {'rating': 1.7999999523162842}]

Highest-rated restaurants

Code
df.select('name','rate','votes','location','dish_liked','rating').sort('rating', descending=True).head()
shape: (5, 6)
name rate votes location dish_liked rating
str str i64 str str f32
"Byg Brewski Brewing Company" "4.9/5" 16345 "Sarjapur Road, Bangalore, Karn… "Cocktails, Dahi Kebab, Rajma C… 4.9
"Byg Brewski Brewing Company" "4.9/5" 16345 "Sarjapur Road, Bangalore, Karn… "Cocktails, Dahi Kebab, Rajma C… 4.9
"Byg Brewski Brewing Company" "4.9/5" 16345 "Sarjapur Road, Bangalore, Karn… "Cocktails, Dahi Kebab, Rajma C… 4.9
"Belgian Waffle Factory" "4.9/5" 1746 "Brigade Road, Bangalore, Karna… "Coffee, Berryblast, Nachos, Ch… 4.9
"Belgian Waffle Factory" "4.9/5" 1746 "Brigade Road, Bangalore, Karna… "Coffee, Berryblast, Nachos, Ch… 4.9
Code
grp_df = (
    df.group_by('location').agg(pl.col('rating').mean(), pl.col('name').count())
        .rename({'location':'name', 'rating':'avg_rating', 'name':'count'})
)
Code
grp_df
shape: (92, 3)
name avg_rating count
str f32 u32
"Brookefield, Bangalore, Karnat… 3.374697 581
"Thippasandra, Bangalore, Karna… 3.095396 152
"Electronic City, Bangalore, Ka… 3.04191 964
"Koramangala 1st Block, Bangalo… 3.263946 965
"Koramangala 3rd Block, Bangalo… 3.978755 193
"RT Nagar, Bangalore, Karnataka… 3.278125 64
"Jalahalli, Bangalore, Karnatak… 3.486956 23
"Commercial Street, Bangalore, … 3.109709 309
"Banaswadi, Bangalore, Karnatak… 3.362927 499
"Koramangala 5th Block, Bangalo… 3.901511 2381

Let's consider only those locations that have more than 400 restaurant listings.

Code
temp_df = grp_df.filter(pl.col('count')>400)
Code
temp_df.shape
(35, 3)
Code
temp_df
shape: (35, 3)
name avg_rating count
str f32 u32
"Brookefield, Bangalore, Karnat… 3.374697 581
"Electronic City, Bangalore, Ka… 3.04191 964
"Koramangala 1st Block, Bangalo… 3.263946 965
"Bannerghatta Road, Bangalore, … 3.271675 1324
"HSR, Bangalore, Karnataka, Ind… 3.484063 2128
"Richmond Road, Bangalore, Karn… 3.688013 634
"Koramangala 7th Block, Bangalo… 3.747846 1089
"Frazer Town, Bangalore, Karnat… 3.56488 578
"Banaswadi, Bangalore, Karnatak… 3.362927 499
"Koramangala 5th Block, Bangalo… 3.901511 2381
Code
rest_loc
shape: (91, 3)
name lat lon
str f64 f64
"Sahakara Nagar, Bangalore, Kar… 13.062147 77.580061
"Kaggadasapura, Bangalore, Karn… 12.984671 77.679091
"Infantry Road, Bangalore, Karn… 12.981016 77.602133
"CV Raman Nagar, Bangalore, Kar… 12.985099 77.663117
"JP Nagar, Bangalore, Karnataka… 12.909694 77.586607
"Seshadripuram, Bangalore, Karn… 12.993188 77.575342
"Jakkur, Bangalore, Karnataka, … 13.078474 77.606894
"Bommanahalli, Bangalore, Karna… 12.908945 77.623904
"Kammanahalli, Bangalore, Karna… 13.009346 77.637709
"Nagawara, Bangalore, Karnataka… 13.042279 77.624858

Let's join both dataframes so that we get the coordinates as well.

Code
ratings_locations = temp_df.join(rest_loc, on='name')
Code
ratings_locations
shape: (35, 5)
name avg_rating count lat lon
str f32 u32 f64 f64
"JP Nagar, Bangalore, Karnataka… 3.412929 1849 12.909694 77.586607
"Koramangala 4th Block, Bangalo… 3.814351 864 12.932778 77.629405
"Whitefield, Bangalore, Karnata… 3.384171 1693 12.969637 77.749745
"Bannerghatta Road, Bangalore, … 3.271675 1324 12.951856 77.604011
"Jayanagar, Bangalore, Karnatak… 3.61525 1718 12.939904 77.582638
"Ulsoor, Bangalore, Karnataka, … 3.541396 901 12.977879 77.62467
"Frazer Town, Bangalore, Karnat… 3.56488 578 12.998683 77.615525
"Indiranagar, Bangalore, Karnat… 3.652168 1936 12.996298 77.545278
"Koramangala 6th Block, Bangalo… 3.662465 1111 12.939025 77.623848
"Kammanahalli, Bangalore, Karna… 3.499809 525 13.009346 77.637709
Code
basemap = Generate_basemap()
Code
ratings_locations = ratings_locations.to_pandas()
Code
HeatMap(ratings_locations[['lat', 'lon' , 'avg_rating']]).add_to(basemap)
<folium.plugins.heat_map.HeatMap at 0x30a39bcb0>
Code
basemap
Figure 5: Highest-rated Zomato Restaurants Heatmap
Note

You can interact with the above map by zooming in or out.
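
The average ratings used as heatmap weights all fall in a narrow band, so the weights barely differ from one location to the next. One option, sketched below as an assumption rather than as part of the original analysis, is to rescale the averages to the range 0 to 1 before building the [lat, lon, weight] rows. The helper min_max_scale is a name introduced here for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly onto [0, 1]; a constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# A few (lat, lon, avg_rating) rows copied from the joined table above.
rows = [
    (12.909694, 77.586607, 3.412929),  # JP Nagar
    (12.932778, 77.629405, 3.814351),  # Koramangala 4th Block
    (12.951856, 77.604011, 3.271675),  # Bannerghatta Road
]

weights = min_max_scale([r[2] for r in rows])
heat_points = [[lat, lon, w] for (lat, lon, _), w in zip(rows, weights)]
print(heat_points)
```

The resulting list has the same [lat, lon, weight] layout that is passed to HeatMap in this article. Keep in mind that the lowest-rated location gets a weight of 0 and fades from the map, so a small floor value may be preferable.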

Conclusions

Python, with its powerful libraries and ease of use, has become an indispensable tool for geospatial analysis. By leveraging the capabilities of libraries like DuckDB, Polars, geopy, and Folium, data scientists can effectively explore and analyze geospatial data, gain valuable insights, and make informed decisions.

In this article, we have shown a brief overview of geospatial analysis in Python.

References

Contact

Jesus LM
Economist & Data Scientist