
Converting Nested JSON to CSV

Vinay NP · Originally published at askvinay.com · 4 min read

The first time I came across JSON, I was really happy. It is a very light and fluffy object representation in plain text. The beauty was that there were no new or extra specs; existing concepts of lists, objects, strings, numbers, etc. were taken and put together in its own clean way. I use JSON widely when I am working with APIs. In my playbook it is second only to single word/number replies.

Apart from becoming a standard for data interchange between clients and servers, JSON has also become popular for intermediate representation and storage of data. This is pretty evident when working with reporting tools and frameworks. I say intermediate because the origin might be some XML or data from a DB, and the actual consumption format might be a chart, graph, or CSV.

JSON allows expression of hierarchical, well-formed, and structured data by nesting objects and arrays within one another, up to multiple levels deep. Often, this sort of representation is difficult or nearly impossible to express in row/column-based data structures like CSV. However, for reporting and dashboards it is almost always required to convert such structured data to flatter key/value data for interpretation and export. This lets data analysts import the data into tools like Excel and work with it.

Here I am going to discuss converting multiple nested JSON objects, which may or may not contain similar elements, to CSV for use with tools like Excel or OpenOffice Calc. The script is written in Python 2.7.

Let's take a valid multi-level JSON and start off...

{
 "fruit":[
  {
   "name":"Apple",
   "binomial name":"Malus domestica",
   "major_producers":[
    "China", 
    "United States", 
    "Turkey"
   ],
   "nutrition":{
    "carbohydrates":"13.81g",
    "fat":"0.17g",
    "protein":"0.26g"
   }
  },
  {
   "name":"Orange",
   "binomial name":"Citrus x sinensis",
   "major_producers":[
    "Brazil", 
    "United States", 
    "India"
   ],
   "nutrition":{
    "carbohydrates":"11.75g",
    "fat":"0.12g",
    "protein":"0.94g"
   }
  },
  {
   "name":"Mango",
   "binomial name":"Mangifera indica",
   "major_producers":[
     "India", 
    "China", 
    "Thailand"
   ],
   "nutrition":{
    "carbohydrates":"15g",
    "fat":"0.38g",
    "protein":"0.82g"
   }
  }
 ]
}

This is a valid JSON document containing three fruit objects. Each fruit object has multiple valid data structures nested within it.

Each object has to be parsed individually and reduced to a flatter structure before putting it all back together and writing it out as CSV. This is also essential for determining the headers. A few key points for doing this are...

Deducing key/value pairs for each object
I've visualized this structure as a tree, and the key and value for each pair are decided as follows...
Value: This is the leaf node.
Key: This is the concatenation of all parent node names with '_' as the separator. In the case of arrays, the immediate parent of a value is its index.

Taking the Apple object as an example, the tree can be visualized as...
JSON object as a tree

Therefore, the key/value pairs become...
Fruit_Name -> Apple
Fruit_Major Producers_0 -> China
and so on. The same thing is done for all fruit objects. Note that the index of the outer fruit array is dropped here; otherwise the key would be Fruit_0_Name -> Apple instead of Fruit_Name -> Apple.

After this exercise we are left with a much flatter set of key/value pairs derived from the original JSON object. The same has to be done for every object in the original data to obtain the reduced data.
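The traversal described above can be sketched roughly as follows. This is a minimal illustration in Python 3 syntax, not the actual script from the repository; the function name and sample data are my own, and since it flattens a single fruit object, the question of the outer array index does not arise here.

```python
def flatten(obj, parent_key=""):
    """Recursively reduce a nested structure to flat key/value pairs.

    Keys are the concatenation of all parent node names joined by '_';
    for arrays, the element's index serves as its immediate parent.
    """
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = parent_key + "_" + key if parent_key else key
            items.update(flatten(value, new_key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            new_key = parent_key + "_" + str(index) if parent_key else str(index)
            items.update(flatten(value, new_key))
    else:
        items[parent_key] = obj  # leaf node: record the accumulated key
    return items

apple = {
    "name": "Apple",
    "major_producers": ["China", "United States", "Turkey"],
    "nutrition": {"carbohydrates": "13.81g"},
}
flat = flatten(apple)
# e.g. flat["major_producers_0"] is "China",
#      flat["nutrition_carbohydrates"] is "13.81g"
```

The recursion bottoms out at leaf values, so each leaf gets exactly one key built from the path that led to it.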

Determining Headers
The headers are determined by collecting the keys of all objects together and removing duplicates among them. This ensures that every key across all objects is covered. The headers are then sorted alphabetically so that similar keys stay next to each other, making the CSV file easier to read. However, this also means that the original ordering of the JSON is not maintained. This is a current limitation; I am working on it and will update this post when it is fixed.

Some key/value pair(s) present in one object may be absent from another. In that situation, the value is left empty wherever the key does not exist.
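Assuming each object has already been flattened, header derivation might look like this sketch (the `flat_objects` sample data here is illustrative, not from the article's fruit example):

```python
# Hypothetical flattened objects; note the second one lacks 'nutrition_fat'.
flat_objects = [
    {"name": "Apple", "nutrition_fat": "0.17g"},
    {"name": "Orange", "nutrition_protein": "0.94g"},
]

# Union of all keys across objects (the set removes duplicates),
# then sorted so similar keys end up next to each other in the CSV.
headers = set()
for obj in flat_objects:
    headers.update(obj.keys())
headers = sorted(headers)
# ['name', 'nutrition_fat', 'nutrition_protein']
```

Using a set makes de-duplication automatic, at the cost of losing the original key order, which is why the alphabetical sort is applied afterwards.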

I am also working on writing multi-row, pretty headers without the '_'. If done, this could make reading and using the CSV even easier. It would also enable Excel's Merge and Center functionality and allow for easier filter/formula application.

Writing the CSV
Finally, after reducing all the original objects to flat objects as discussed above, the headers are written out, followed by each object as a row. The csv.DictWriter class is used to write the CSV, which ensures that the rows are written as per the headers and in the same order.
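A minimal sketch of that final step (the file name and sample rows are illustrative; this uses Python 3 file handling, whereas the original script targets Python 2.7, where the file would be opened with mode "wb" instead):

```python
import csv

headers = ["name", "nutrition_fat", "nutrition_protein"]
rows = [
    {"name": "Apple", "nutrition_fat": "0.17g"},
    {"name": "Orange", "nutrition_protein": "0.94g"},
]

with open("fruits.csv", "w", newline="") as f:
    # restval="" leaves a cell empty when a row is missing that key,
    # matching the behaviour described above for absent key/value pairs.
    writer = csv.DictWriter(f, fieldnames=headers, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

DictWriter looks up each row's values by the `fieldnames` list, so every row comes out aligned with the header regardless of the order of keys in the dict.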

All code is available on GitHub. I am yet to set up an online demo where you can upload a JSON file and get a CSV for download. I will update this post once it is live.

Please drop me a mail if you have any suggestions or come across any bugs in the code.


Discussion

Kyle Fisher

Great article, Vinay... I like the way you laid this all out. Reminds me of a script I wrote last year. Your explanation is much better than mine, admittedly haha.

Have any of you guys (or gals) on here heard of 4DIQ's Flow?

The framework has this whole new concept of 'generic data' where the whole idea of conversion is approached totally differently. Seems to be the quickest / most interesting way to convert files that I've come across as of recent.

Recommend checking it out! Below is a link to the blog post I found about it. It has to do with JSON to CSV as well, but the more I play with the program, the more it seems conversions are just a small piece of what it does.

Check it: flow-analytics.com/blog/how-to-nor...

uli bethke

You can fully automate the JSON to CSV conversion process with Flexter, our free JSON converter. Flexter is an ETL tool for JSON and XML. It automates the conversion of JSON to a database, text, or Hadoop. We have written a blog post (including a video) that shows how easy it is to automatically convert your JSON files to CSV: sonra.io/2018/03/16/converting-fhi.... No manual coding needed. In this blog post we convert FHIR JSON, which is based on an industry data standard in healthcare.

When you convert your JSON files it also provides a diagram of the target model and the data lineage. It can handle JSON of any complexity.

Try for yourself: jsonconverter.sonra.io/

avneeshjain108

Hi Vinay,

Can I get the code for

{
 "MANGO": {
  "key1": "fruit",
  "key2": "yellow",
  "key3": "Juice",
  "key4": "Summer"
 },
 "KIWI": {
  "key1": "fruit",
  "key2": "Green",
  "key3": ["Juice", "eat"],
  "key4": "Summer"
 }
}

Partho Sarathi

The idea to sort the headers is great. I'll put it as an option in my JSON to CSV Converter

cyr1l

Hi, another online tool to do this: extendsclass.com/json-to-csv.html

Mohamed Boudra

I like the script! I built a similar tool wrapped in a nice web interface; you can find it here: json to csv.

beingnvn

Can you please email me back at nvnsngh1008@hotmail.com? I am having some trouble converting my JSON file to CSV. Or are you available on WhatsApp so that I can contact you?

Jaimie Malvi

Nice article. I use these tools to do the same.

jsonformatter.org
codebeautify.org/json-to-csv

etyety

{
"account_action_id": "20ee523e-32fe-456d-b891-5f29735b99cc",
"type": "scan",
"scans": [{
"name": "analyzer_results/862394810259/i-094db43f8f1ad67bc/20ee523e-32fe-456d-b891-5f29735b99cc_862394810259_i-094db43f8f1ad67bc.json",
"result": {
"metadata": {
"time": "2019-05-13T09:55:00.683231",
"version": "public"
},
"status": "success",
"main": {
"issues": {
"passwords_login": [{
"username": "myuser1",
"password": "qw**",
"source": "common",
"is_cache": true
}, {
"username": "myuser2",
"password": "12**",
"source": "common",
"is_cache": true
}, {
"username": "myuser10",
"password": "mi**",
"source": "leaked",
"is_cache": true
}
],
"passwords_auth_user": [{
"username": "myuser10",
"password": "mi**",
"source": "leaked",
"is_cache": true,
"files": ["/etc/httpd/.htpasswd"]
}, {
"username": "myuser23",
"password": "he**",
"source": "common",
"is_cache": true,
"files": ["/etc/apache2/.htpasswd"]
}, {
"username": "myuser22",
"password": "ab**",
"source": "common",
"is_cache": true,
"files": ["/etc/nginx/.htpasswd"]
}
],
"passwords_shell_history": [{
"cmd": "swaks",
"username": "you@example.com",
"password": "ab**",
"source": "common",
"cmdline": "swaks --to someone@gmail.com --from \"you@example.com\" --server mail.example.com --auth LOGIN --auth-user \"you@example.com\" --auth-password \"ab**\" -tls"
}, {
"cmd": "curl",
"username": "username",
"password": "pa**",
"source": "common",
"cmdline": "/bin/curl -u username:pa** example.com"
}, {
"cmd": "mailx",
"username": "user1@gmail.com",
"password": "1q**",
"source": "common",
"cmdline": "mailx -v -s \"This is the subject\" -S smtp=\"mail.example.com:587\" -S smtp-auth=login -S smtp-auth-user=\"user1@gmail.com\" -S smtp-auth-password=\"1q**\" -S ssl-verify=ignore ismail@gmail.com"
}, {
"cmd": "wget",
"username": "doctorstrange",
"password": "la**",
"source": "leaked",
"cmdline": "wget --user doctorstrange --password la** example.com/"
}
],
"additional_data": {
"last_brute_force": [{
"log_time": "May 11 04:44:15",
"username": "packer",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:44:28",
"username": "postgres",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:44:42",
"username": "deploy",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:44:58",
"username": "vagrant",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:45:13",
"username": "bot",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:45:29",
"username": "hdfs",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:45:45",
"username": "ts",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:46:01",
"username": "system",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:46:17",
"username": "tsbot",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:46:33",
"username": "server",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:47:06",
"username": "guest",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:47:22",
"username": "minecraft",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:47:39",
"username": "dev",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:47:55",
"username": "test",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:48:12",
"username": "www",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:48:46",
"username": "linux",
"source_ipv4": "52.174.51.107"
}, {
"log_time": "May 11 04:49:03",
"username": "nagios",
"source_ipv4": "52.174.51.107"
}
],
"last_successful_logins": [],
"last_failed_logins": [{
"log_time": "May 11 11:11:11",
"username": "pi",
"source_ipv4": "96.81.74.60"
}, {
"log_time": "May 11 11:11:54",
"username": "bananapi",
"source_ipv4": "96.81.74.60"
}, {
"log_time": "May 11 11:12:39",
"username": "vyos",
"source_ipv4": "96.81.74.60"
}, {
"log_time": "May 11 11:13:23",
"username": "xbian",
"source_ipv4": "96.81.74.60"
}, {
"log_time": "May 11 11:14:07",
"username": "vyatta",
"source_ipv4": "96.81.74.60"
}, {
"log_time": "May 11 16:37:28",
"username": "admin",
"source_ipv4": "37.114.183.184"
}, {
"log_time": "May 11 17:01:04",
"username": "support",
"source_ipv4": "89.197.161.164"
}, {
"log_time": "May 11 20:38:00",
"username": "pi",
"source_ipv4": "119.194.14.3"
}, {
"log_time": "May 11 20:38:00",
"username": "pi",
"source_ipv4": "119.194.14.3"
}, {
"log_time": "May 12 06:14:37",
"username": "admin",
"source_ipv4": "82.209.236.212"
}
],
"users_with_password": ["myuser1", "myuser2", "myuser10"]
},
"vip": {
"status": "$disabled"
},
"vulnerabilities": {
"status": "$disabled"
},
"clamav": {
"status": "$disabled"
},
"files_info": {
"status": "$disabled"
}
},
"asset": {
"instance": {
"AmiLaunchIndex": 0,
"ImageId": "ami-03803ff616e2d4872",
"InstanceId": "i-094db43f8f1ad67bc",
"InstanceType": "t2.nano",
"KeyName": "testing-master-key",
"LaunchTime": "2019-05-06 13:29:27+00:00",
"Monitoring": {
"State": "disabled"
},
"Placement": {
"AvailabilityZone": "us-east-1d",
"GroupName": "",
"Tenancy": "default"
},
"PrivateDnsName": "ip-172-31-18-161.ec2.internal",
"PrivateIpAddress": "172.31.18.161",
"ProductCodes": [],
"PublicDnsName": "ec2-52-91-99-231.compute-1.amazonaws.com",
"PublicIpAddress": "52.91.99.231",
"State": {
"Code": 16,
"Name": "running"
},
"StateTransitionReason": "",
"SubnetId": "subnet-298db763",
"VpcId": "vpc-7c47c106",
"Architecture": "x86_64",
"BlockDeviceMappings": [{
"DeviceName": "/dev/sda1",
"Ebs": {
"AttachTime": "2019-05-06 13:29:28+00:00",
"DeleteOnTermination": true,
"Status": "attached",
"VolumeId": "vol-0578befce5167531f"
}
}
],
"ClientToken": "",
"EbsOptimized": false,
"EnaSupport": true,
"Hypervisor": "xen",
"NetworkInterfaces": [{
"Association": {
"IpOwnerId": "amazon",
"PublicDnsName": "ec2-52-91-99-231.compute-1.amazonaws.com",
"PublicIp": "52.91.99.231"
},
"Attachment": {
"AttachTime": "2019-05-06 13:29:27+00:00",
"AttachmentId": "eni-attach-061d00b5ba9c32462",
"DeleteOnTermination": true,
"DeviceIndex": 0,
"Status": "attached"
},
"Description": "",
"Groups": [{
"GroupName": "ssh-sg-new",
"GroupId": "sg-0d42f9a8e8dc09b73"
}
],
"Ipv6Addresses": [],
"MacAddress": "0a:73:4e:7c:73:fc",
"NetworkInterfaceId": "eni-014c00856452e56fe",
"OwnerId": "862394810259",
"PrivateDnsName": "ip-172-31-18-161.ec2.internal",
"PrivateIpAddress": "172.31.18.161",
"PrivateIpAddresses": [{
"Association": {
"IpOwnerId": "amazon",
"PublicDnsName": "ec2-52-91-99-231.compute-1.amazonaws.com",
"PublicIp": "52.91.99.231"
},
"Primary": true,
"PrivateDnsName": "ip-172-31-18-161.ec2.internal",
"PrivateIpAddress": "172.31.18.161"
}
],
"SourceDestCheck": false,
"Status": "in-use",
"SubnetId": "subnet-298db763",
"VpcId": "vpc-7c47c106"
}
],
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SecurityGroups": [{
"GroupName": "ssh-sg-new",
"GroupId": "sg-0d42f9a8e8dc09b73"
}
],
"SourceDestCheck": false,
"Tags": [{
"Key": "Name",
"Value": "scan_me_too"
}, {
"Key": "system_test",
"Value": "recent_files"
}, {
"Key": "system_test2",
"Value": "passwords"
}
],
"VirtualizationType": "hvm",
"CpuOptions": {
"CoreCount": 1,
"ThreadsPerCore": 1
},
"CapacityReservationSpecification": {
"CapacityReservationPreference": "open"
},
"HibernationOptions": {
"Configured": false
}
},
"volumes": [{
"Attachments": [{
"AttachTime": "2019-05-06 13:29:28+00:00",
"Device": "/dev/sda1",
"InstanceId": "i-094db43f8f1ad67bc",
"State": "attached",
"VolumeId": "vol-0578befce5167531f",
"DeleteOnTermination": true
}
],
"AvailabilityZone": "us-east-1d",
"CreateTime": "2019-05-06 13:29:28.446000+00:00",
"Encrypted": false,
"Size": 8,
"SnapshotId": "snap-00a79a6d42de763df",
"State": "in-use",
"VolumeId": "vol-0578befce5167531f",
"Iops": 100,
"VolumeType": "gp2"
}
],
"snapshots": [],
"security_groups": [{
"Description": "Managed by Terraform",
"GroupName": "ssh-sg-new",
"IpPermissions": [{
"FromPort": 22,
"IpProtocol": "tcp",
"IpRanges": [{
"CidrIp": "0.0.0.0/0"
}
],
"Ipv6Ranges": [],
"PrefixListIds": [],
"ToPort": 22,
"UserIdGroupPairs": []
}
],
"OwnerId": "862394810259",
"GroupId": "sg-0d42f9a8e8dc09b73",
"IpPermissionsEgress": [{
"IpProtocol": "-1",
"IpRanges": [{
"CidrIp": "0.0.0.0/0"
}
],
"Ipv6Ranges": [],
"PrefixListIds": [],
"UserIdGroupPairs": []
}
],
"VpcId": "vpc-7c47c106"
}
],
"interfaces": [{
"id": "eni-014c00856452e56fe",
"vpc": "vpc-7c47c106",
"private_ips": ["172.31.18.161"],
"public_ips": ["52.91.99.231"]
}
],
"public_ports": {
"ports": ["22"],
"security_groups": ["sg-0d42f9a8e8dc09b73"]
},
"cpu": {},
"memory": {},
"disks": [{
"size": "7.7G",
"used": "1.8G",
"available": "5.9G",
"percent": "24%"
}
],
"os": {
"distro": "ubuntu",
"release": "16.04",
"kernel": null
},
"container_info": {
"container_id": "main"
}
}
}
}
}
]
}