
Overview
Diving into the world of data science and machine learning, one of the fundamental skills you’ll encounter is the art of reading data.
If you already have some experience with it, you’re probably familiar with JSON (JavaScript Object Notation) files.
Think of how NoSQL databases like MongoDB love to store data in JSON, or how REST APIs often respond in the same format.
However, JSON, while perfect for storage and exchange, isn’t quite ready for in-depth analysis in its raw form.
This is where we transform it into something more analytically friendly – a tabular format.
Environment settings
Show code
import pandas as pd

# Create two simple JSON-like records (Python dictionaries)
simple_json = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
}
simple_json_2 = {
    'name': 'Taylor',
    'city': 'Chicago',
    'income': 120000,
}
simple_json_list = [
    simple_json,
    simple_json_2
]
Manipulating JSON files
Show code
pd.json_normalize(simple_json)
Show code
pd.json_normalize(simple_json_2)
Show code
pd.json_normalize(simple_json_list)
|   | name   | city    | income |
|---|--------|---------|--------|
| 0 | David  | London  | 80000  |
| 1 | Taylor | Chicago | 120000 |
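In real projects the records usually arrive as raw JSON text (from a file or an API response body) rather than as ready-made Python dictionaries. A minimal sketch of parsing such text with the standard json module before normalizing (the payload string here is made up):

```python
import json

import pandas as pd

# A raw JSON payload, e.g. the body of an API response (made-up data)
payload = '''
[
    {"name": "David", "city": "London", "income": 80000},
    {"name": "Taylor", "city": "Chicago", "income": 120000}
]
'''

# json.loads turns the text into Python lists/dicts that json_normalize accepts
records = json.loads(payload)
df = pd.json_normalize(records)
print(df.shape)  # (2, 3)
```

For JSON stored on disk, `json.load(open(path))` plays the same role before the `json_normalize` call.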
If we only want to bring specific fields into the resulting pandas DataFrame, json_normalize() offers no parameter to select which fields to transform.
Therefore, a small preprocessing step is needed in which we filter the JSON down to just the keys of interest.
Show code
# Fields to include
fields = ['name', 'city']

# Filter the JSON data down to those fields
filtered_json_list = [
    {key: value for key, value in item.items() if key in fields}
    for item in simple_json_list
]
pd.json_normalize(filtered_json_list)
|   | name   | city    |
|---|--------|---------|
| 0 | David  | London  |
| 1 | Taylor | Chicago |
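An alternative to preprocessing the JSON, assuming the data is small enough to flatten entirely, is to normalize first and then subset the resulting DataFrame:

```python
import pandas as pd

simple_json_list = [
    {'name': 'David', 'city': 'London', 'income': 80000},
    {'name': 'Taylor', 'city': 'Chicago', 'income': 120000},
]

fields = ['name', 'city']

# Flatten everything, then keep only the columns of interest
df = pd.json_normalize(simple_json_list)[fields]
print(df.shape)  # (2, 2)
```

Preprocessing the JSON first is the better choice for large payloads, since it avoids materializing columns we will immediately throw away.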
Multi-level JSONs contain nested JSON objects at different depths.
The procedure is the same as before, but in this case we can choose how many levels to flatten.
By default, json_normalize() expands every level, generating new columns whose names concatenate the names of all the nested levels.
Show code
# Create two nested JSON records
multiple_levels_json = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
    'skills': {
        'python': 'advanced',
        'SQL': 'advanced',
        'GCP': 'mid'
    },
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': True,
        'data analyst': False,
    }
}
multiple_levels_json_2 = {
    'name': 'Taylor',
    'city': 'Chicago',
    'income': 120000,
    'skills': {
        'python': 'mid',
        'SQL': 'advanced',
        'GCP': 'beginner'
    },
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': False,
        'data analyst': True
    }
}
multiple_level_json_list = [
    multiple_levels_json,
    multiple_levels_json_2
]
Normalizing this list as-is, we would get a table with three columns under the field skills:
- skills.python
- skills.SQL
- skills.GCP
and four columns under the field roles:
- roles.project manager
- roles.data engineer
- roles.data scientist
- roles.data analyst
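As a sketch of that default behaviour, normalizing just one of the records above yields exactly those dotted column names:

```python
import pandas as pd

# One of the records defined above
record = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
    'skills': {'python': 'advanced', 'SQL': 'advanced', 'GCP': 'mid'},
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': True,
        'data analyst': False,
    },
}

# With no extra arguments, every nested level is expanded
df = pd.json_normalize(record)
print(list(df.columns))
# ['name', 'city', 'income', 'skills.python', 'skills.SQL', 'skills.GCP',
#  'roles.project manager', 'roles.data engineer', 'roles.data scientist',
#  'roles.data analyst']
```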
However, imagine we only want to flatten the top level.
We can do so by setting the max_level parameter to 0:
Show code
pd.json_normalize(multiple_level_json_list, max_level=0)
|   | name   | city    | income | skills | roles |
|---|--------|---------|--------|--------|-------|
| 0 | David  | London  | 80000  | {'python': 'advanced', 'SQL': 'advanced', 'GCP... | {'project manager': False, 'data engineer': Fa... |
| 1 | Taylor | Chicago | 120000 | {'python': 'mid', 'SQL': 'advanced', 'GCP': 'b... | {'project manager': False, 'data engineer': Fa... |
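The character used when concatenating nested level names is also configurable, via the sep parameter. A small sketch swapping the default dot for an underscore:

```python
import pandas as pd

record = {
    'name': 'David',
    'skills': {'python': 'advanced', 'SQL': 'advanced'},
}

# sep controls the character that joins nested level names
df = pd.json_normalize(record, sep='_')
print(list(df.columns))  # ['name', 'skills_python', 'skills_SQL']
```

This is handy when the flattened columns will be written to a system where dots in column names are awkward, such as SQL tables.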
Another common pattern is a JSON where a field holds a plain list instead of a nested object. Let's rebuild our records so that skills is a list of strings:
Show code
nested_json = {
    'name': 'David',
    'city': 'London',
    'income': 80000,
    'skills': ['python', 'SQL', 'GCP'],
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': True,
        'data analyst': False,
    }
}
nested_json_2 = {
    'name': 'Taylor',
    'city': 'Chicago',
    'income': 120000,
    'skills': ['python', 'SQL', 'PowerBI', 'Looker'],
    'roles': {
        'project manager': False,
        'data engineer': False,
        'data scientist': False,
        'data analyst': True
    }
}
nested_json_list = [
    nested_json,
    nested_json_2
]
When a field contains a list like this, the record_path parameter tells json_normalize() which list to explode into one row per element, while meta names the top-level fields to repeat alongside each row:
Show code
pd.json_normalize(nested_json_list, record_path=['skills'], meta=['name','city'])
|   | 0       | name   | city    |
|---|---------|--------|---------|
| 0 | python  | David  | London  |
| 1 | SQL     | David  | London  |
| 2 | GCP     | David  | London  |
| 3 | python  | Taylor | Chicago |
| 4 | SQL     | Taylor | Chicago |
| 5 | PowerBI | Taylor | Chicago |
| 6 | Looker  | Taylor | Chicago |
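One practical caveat with record_path and meta: if a meta field is missing from some records, json_normalize() raises a KeyError by default; passing errors='ignore' fills the gaps with NaN instead. A sketch with a made-up record that lacks the city key:

```python
import pandas as pd

# Made-up records; the first one has no 'city' key
people = [
    {'name': 'David', 'skills': ['python', 'SQL']},
    {'name': 'Taylor', 'city': 'Chicago', 'skills': ['python']},
]

# errors='ignore' fills missing meta fields with NaN instead of raising
df = pd.json_normalize(people, record_path=['skills'],
                       meta=['name', 'city'], errors='ignore')
print(df.shape)  # (3, 3)
```

Both of David's skill rows end up with NaN in the city column, so downstream code should be prepared to handle those missing values.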
Conclusion
In summary, transforming JSON data into a tabular DataFrame with Python’s pandas library is straightforward and effective.
JSON remains one of the most common formats for modern data storage and exchange, notably in NoSQL databases and REST APIs. However, its raw, nested form poses real challenges for in-depth analysis, which is exactly what json_normalize() helps us overcome.