

Lecture 11: Working with JSON Data in Spark🔗


Two types of JSON notation:

  • Line Delimited JSON (one complete record per line)

  • Multi Line JSON (a single record spread across multiple lines, typically inside a JSON array)
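
For example, five employee records in line-delimited form look like this, with one complete JSON object per line:

{"name": "Manish", "age": 20, "salary": 20000}
{"name": "Nikita", "age": 25, "salary": 21000}
{"name": "Pritam", "age": 16, "salary": 22000}
{"name": "Prantosh", "age": 35, "salary": 25000}
{"name": "Vikash", "age": 67, "salary": 40000}

Multi-line JSON spreads each record over several lines, typically inside a single JSON array, as in the equivalent file below: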

[
{
  "name": "Manish",
  "age": 20,
  "salary": 20000
},
{
  "name": "Nikita",
  "age": 25,
  "salary": 21000
},
{
  "name": "Pritam",
  "age": 16,
  "salary": 22000
},
{
  "name": "Prantosh",
  "age": 35,
  "salary": 25000
},
{
  "name": "Vikash",
  "age": 67,
  "salary": 40000
}
]

Line-delimited JSON is more efficient in terms of performance because the reader knows that each line contains exactly one JSON record, whereas with multi-line JSON it has to keep track of where one record ends and the next one starts.
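
A minimal PySpark sketch of reading both layouts (the file names here are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-demo").getOrCreate()

# Line-delimited JSON is the default: one record per line, no extra options needed.
line_df = spark.read.format("json").load("employees_line_delimited.json")

# Multi-line JSON must be flagged explicitly, otherwise Spark expects one record per line.
multi_df = spark.read.format("json") \
    .option("multiLine", "true") \
    .load("employees_multi_line.json")

line_df.show()
multi_df.show()

Both calls return the same five-row DataFrame; only the parsing mode differs.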

Different number of keys in each line🔗


What happens here is that the record with the extra key gets a value for that column, while the column is null for all the other records.
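
A sketch of the same behaviour with a hypothetical extra key (the bonus field and file name are made up for illustration). Suppose the line-delimited file contains:

{"name": "Manish", "age": 20, "salary": 20000}
{"name": "Nikita", "age": 25, "salary": 21000, "bonus": 5000}
{"name": "Pritam", "age": 16, "salary": 22000}

Reading it as usual:

df = spark.read.format("json").load("employees_extra_key.json")
df.show()

# Roughly:
# +---+-----+------+------+
# |age|bonus|  name|salary|
# +---+-----+------+------+
# | 20| null|Manish| 20000|
# | 25| 5000|Nikita| 21000|
# | 16| null|Pritam| 22000|
# +---+-----+------+------+

The inferred schema contains the union of all keys, and bonus is null for every record that does not carry it.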

Multiline Incorrect JSON🔗

We don't wrap the records in a list (JSON array) here; the file is just a sequence of dictionaries separated by commas:

{
  "name": "Manish",
  "age": 20,
  "salary": 20000
},
{
  "name": "Nikita",
  "age": 25,
  "salary": 21000
},
{
  "name": "Pritam",
  "age": 16,
  "salary": 22000
},
{
  "name": "Prantosh",
  "age": 35,
  "salary": 25000
},
{
  "name": "Vikash",
  "age": 67,
  "salary": 40000
}
When we process this JSON, Spark reads only the first dictionary as a record; the rest of the file is not processed.
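
A sketch of reading such a file (the file name is a placeholder):

df = spark.read.format("json") \
    .option("multiLine", "true") \
    .load("incorrect_multiline.json")

df.show()
# Only the first dictionary comes back as a row:
# +---+------+------+
# |age|  name|salary|
# +---+------+------+
# | 20|Manish| 20000|
# +---+------+------+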


Corrupted Records🔗

We don't need to define _corrupt_record in the schema; when Spark infers the schema it adds the column on its own.

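A minimal sketch, assuming a line-delimited file in which one line is malformed JSON (the broken line and file name are invented for illustration):

{"name": "Manish", "age": 20, "salary": 20000}
{"name": "Nikita", "age": 25, "salary": 21000
{"name": "Pritam", "age": 16, "salary": 22000}

df = spark.read.format("json").load("employees_corrupt.json")
df.printSchema()   # includes an automatically added _corrupt_record string column
df.show(truncate=False)
# The malformed line lands in the _corrupt_record column with null in the regular
# columns for that row; well-formed rows have null in _corrupt_record instead.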