
Overview
Diving into the world of data science and machine learning, one of the fundamental skills you’ll encounter is the art of reading data.
If you already have some experience with it, you’re probably familiar with JSON (JavaScript Object Notation) files.
Think of how NoSQL databases like MongoDB love to store data in JSON, or how REST APIs often respond in the same format.
However, JSON, while perfect for storage and exchange, isn’t quite ready for in-depth analysis in its raw form.
This is where we transform it into something more analytically friendly – a tabular format.
Environment settings
Show code
import pandas as pd

# Create two simple JSON-like records (Python dictionaries)
simple_json = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
}
simple_json_2 = {
    'name': 'Taylor',
    'city': 'Chicago',
    'income': 120000,
}
simple_json_list = [
    simple_json,
    simple_json_2
]
Manipulating JSON files
Show code
pd.json_normalize(simple_json)
Show code
pd.json_normalize(simple_json_2)
Show code
pd.json_normalize(simple_json_list)
|   | name   | city    | income |
|---|--------|---------|--------|
| 0 | David  | London  | 80000  |
| 1 | Taylor | Chicago | 120000 |
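In real projects the records usually arrive as raw JSON text (from a file or an API response body) rather than as ready-made Python dictionaries. A minimal sketch of parsing such text with the standard json module before normalizing (the payload string here is made up):

```python
import json

import pandas as pd

# A raw JSON payload, e.g. the body of an API response (made-up data)
payload = '''
[
    {"name": "David", "city": "London", "income": 80000},
    {"name": "Taylor", "city": "Chicago", "income": 120000}
]
'''

# json.loads turns the text into Python lists/dicts that json_normalize accepts
records = json.loads(payload)
df = pd.json_normalize(records)
print(df.shape)  # (2, 3)
```

For JSON stored on disk, `json.load(open(path))` plays the same role before the `json_normalize` call.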
If we only want to bring specific fields into the resulting pandas DataFrame, json_normalize() offers no parameter to select which fields to transform.
Therefore, a small preprocessing step is needed in which we filter the JSON down to just the keys of interest.
Show code
# Fields to include
fields = ['name', 'city']

# Filter the JSON data down to those fields
filtered_json_list = [
    {key: value for key, value in item.items() if key in fields}
    for item in simple_json_list
]
pd.json_normalize(filtered_json_list)
|   | name   | city    |
|---|--------|---------|
| 0 | David  | London  |
| 1 | Taylor | Chicago |
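An alternative to preprocessing the JSON, assuming the data is small enough to flatten entirely, is to normalize first and then subset the resulting DataFrame:

```python
import pandas as pd

simple_json_list = [
    {'name': 'David', 'city': 'London', 'income': 80000},
    {'name': 'Taylor', 'city': 'Chicago', 'income': 120000},
]

fields = ['name', 'city']

# Flatten everything, then keep only the columns of interest
df = pd.json_normalize(simple_json_list)[fields]
print(df.shape)  # (2, 2)
```

Preprocessing the JSON first is the better choice for large payloads, since it avoids materializing columns we will immediately throw away.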
Multi-level JSONs contain nested JSON objects at different depths.
The procedure is the same as before, but in this case we can choose how many levels to flatten.
By default, json_normalize() expands every level, generating new columns whose names concatenate the names of all the nested levels.
Show code
# Create two nested JSON records
multiple_levels_json = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
    'skills': {
        'python': 'advanced',
        'SQL': 'advanced',
        'GCP': 'mid'
    },
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': True,
        'data analyst': False,
    }
}
multiple_levels_json_2 = {
    'name': 'Taylor',
    'city': 'Chicago',
    'income': 120000,
    'skills': {
        'python': 'mid',
        'SQL': 'advanced',
        'GCP': 'beginner'
    },
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': False,
        'data analyst': True
    }
}
multiple_level_json_list = [
    multiple_levels_json,
    multiple_levels_json_2
]
Normalizing this list as-is, we would get a table with three columns under the field skills:
- skills.python
- skills.SQL
- skills.GCP
and four columns under the field roles:
- roles.project manager
- roles.data engineer
- roles.data scientist
- roles.data analyst
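As a sketch of that default behaviour, normalizing just one of the records above yields exactly those dotted column names:

```python
import pandas as pd

# One of the records defined above
record = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
    'skills': {'python': 'advanced', 'SQL': 'advanced', 'GCP': 'mid'},
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': True,
        'data analyst': False,
    },
}

# With no extra arguments, every nested level is expanded
df = pd.json_normalize(record)
print(list(df.columns))
# ['name', 'city', 'income', 'skills.python', 'skills.SQL', 'skills.GCP',
#  'roles.project manager', 'roles.data engineer', 'roles.data scientist',
#  'roles.data analyst']
```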
However, imagine we only want to flatten the top level.
We can do so by setting the max_level parameter to 0:
Show code
pd.json_normalize(multiple_level_json_list, max_level=0)
|   | name   | city    | income | skills | roles |
|---|--------|---------|--------|--------|-------|
| 0 | David  | London  | 80000  | {'python': 'advanced', 'SQL': 'advanced', 'GCP... | {'project manager': False, 'data engineer': Fa... |
| 1 | Taylor | Chicago | 120000 | {'python': 'mid', 'SQL': 'advanced', 'GCP': 'b... | {'project manager': False, 'data engineer': Fa... |
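The character used when concatenating nested level names is also configurable, via the sep parameter. A small sketch swapping the default dot for an underscore:

```python
import pandas as pd

record = {
    'name': 'David',
    'skills': {'python': 'advanced', 'SQL': 'advanced'},
}

# sep controls the character that joins nested level names
df = pd.json_normalize(record, sep='_')
print(list(df.columns))  # ['name', 'skills_python', 'skills_SQL']
```

This is handy when the flattened columns will be written to a system where dots in column names are awkward, such as SQL tables.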
Another common pattern is a JSON where a field holds a plain list instead of a nested object. Let's rebuild our records so that skills is a list of strings:
Show code
nested_json = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
    'skills': ['python', 'SQL', 'GCP'],
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': True,
        'data analyst': False,
    }
}
nested_json_2 = {
    'name': 'Taylor',
    'city': 'Chicago',
    'income': 120000,
    'skills': ['python', 'SQL', 'PowerBI', 'Looker'],
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': False,
        'data analyst': True
    }
}
nested_json_list = [
    nested_json,
    nested_json_2
]
When a field contains a list like this, the record_path parameter tells json_normalize() which list to explode into one row per element, while meta names the top-level fields to repeat alongside each row:
Show code
pd.json_normalize(nested_json_list, record_path=['skills'], meta=['name','city'])
|   | 0       | name   | city    |
|---|---------|--------|---------|
| 0 | python  | David  | London  |
| 1 | SQL     | David  | London  |
| 2 | GCP     | David  | London  |
| 3 | python  | Taylor | Chicago |
| 4 | SQL     | Taylor | Chicago |
| 5 | PowerBI | Taylor | Chicago |
| 6 | Looker  | Taylor | Chicago |
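One practical caveat with record_path and meta: if a meta field is missing from some records, json_normalize() raises a KeyError by default; passing errors='ignore' fills the gaps with NaN instead. A sketch with a made-up record that lacks the city key:

```python
import pandas as pd

# Made-up records; the first one has no 'city' key
people = [
    {'name': 'David', 'skills': ['python', 'SQL']},
    {'name': 'Taylor', 'city': 'Chicago', 'skills': ['python']},
]

# errors='ignore' fills missing meta fields with NaN instead of raising
df = pd.json_normalize(people, record_path=['skills'],
                       meta=['name', 'city'], errors='ignore')
print(df.shape)  # (3, 3)
```

Both of David's skill rows end up with NaN in the city column, so downstream code should be prepared to handle those missing values.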
Conclusion
In summary, transforming JSON data into a tabular DataFrame with Python’s pandas library is straightforward and effective.
JSON remains one of the most common formats for modern data storage and exchange, notably in NoSQL databases and REST APIs. However, its raw, nested form poses real challenges for in-depth analysis, which is exactly what json_normalize() helps us overcome.