Spark and JSON


JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. We can define complex nested structures using JSON file format. Its a semi structured data format and in this blog post, we will look at the JSON support in Apache Spark.

Pooja

JSON to DataFrame

Spark DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. Its very easy to read a JSON file and construct Spark dataframes.
spark.read.json("example.json")

JSON file above should have one json object per line. If the json object span multiple lines, we can use the below:
spark.read.json(path="example.json", multiLine=True)

We can also convert json string into Spark DataFrame. We can load JSON lines or an RDD of Strings storing JSON objects (one object per record) and returns the result as a DataFrame.

import json
from pyspark.sql.functions import col,array_contains, explode
person = '''{
"firstName": "John",
"lastName": "Smith",
"isAlive": "True",
"age": 27,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
},
{
"type": "mobile",
"number": "123 456-7890"
}
],
"children": [],
"spouse": null
}'''


rdd = sc.parallelize([person])
df = spark.read.json(rdd)
df.show()

DataFrame to JSON

We can get back the JSON objects using toJSON() method on dataframe.

df.toJSON().collect()
Output:
['{"address":{"city":"New York","postalCode":"10021-3100","state":"NY","streetAddress":"21 2nd Street"},"age":27,"children":[],"firstName":"John","isAlive":"True","lastName":"Smith","phoneNumbers":[{"number":"212 555-1234","type":"home"},{"number":"646 555-4567","type":"office"},{"number":"123 456-7890","type":"mobile"}]}']

That was quite simple. 🙂 If you look closely through the JSON output, you will notice it is missing the spouse column. So, if you have all the values in the column null, the column is not included in the JSON output.
We can get around this problem by doing something like below:

df.fillna('null').toJSON().collect()


Output:
['{"address":{"city":"New York","postalCode":"10021-3100","state":"NY","streetAddress":"21 2nd Street"},"age":27,"children":[],"firstName":"John","isAlive":"True","lastName":"Smith","phoneNumbers":[{"number":"212 555-1234","type":"home"},{"number":"646 555-4567","type":"office"},{"number":"123 456-7890","type":"mobile"}],"spouse":"null"}']

We can also choose save the DataFrame to JSON file

df.write.format('json').save(os.path.join(tempfile.mkdtemp(), 'person_data'))

DataFrame JSON Column Transformation

We can get JSON representation of the any StructField using jsonValue()

from pyspark.sql.types import StructField, StringType, ArrayType, StructType
field = StructField("addresses", ArrayType(StructType([StructField('House', StringType(), True), StructField('State', StringType(), True)])), True)
field.jsonValue()

A DataFrame column having JSON string values can be converted into nested structure using from_json()

data = [(1, '''{"addresses":[{"House":78,"State":"Belfast"}]}''')]
schema = StructType([StructField("addresses", ArrayType(StructType([StructField('House', LongType(), True), StructField('State', StringType(), True)])), True)])
df = spark.createDataFrame(data, ("key", "value"))
df.select(from_json(df.value, schema).alias("json")).collect()


Output:
[Row(json=Row(addresses=[Row(House=78, State='Belfast')]))]

We can do the reverse using to_json() and convert nested structure into JSON string.

data = [Row(json=Row(addresses=[Row(House=78, State='Belfast')]))]
df = spark.createDataFrame(data)
df.select(to_json(df.json).alias('json_string')).show(truncate=False)

With existing tools, users often engineer complex pipelines to read and write JSON data sets within analytical systems. Spark, the unified analytical engine, facilitates reading and analysing JSON data and above are some of the Spark library support to deal with the JSON data.

2 Comments

  1. First, thanks to summarize all the cases. I have a question about from_json(), if the json string does not have fixed schema. Is there a way in spark python to convert json string value to a struct?
    Aaron

    1. Pyspark can infer schema from json string. So, if you want to get the struct schema of a json string, you can get parse that the json string to dataframe as shown in the post. Then get the schema using below :
      dataframe.schema
      I hope this answers your question.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.