

Lecture 11: Working with JSON Data in Spark🔗


Two types of JSON notation:

  • Line Delimited JSON (one complete record per line)

  • Multi Line JSON (a single record spread across multiple lines, typically inside a JSON array)
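
For example, five employee records in line-delimited form look like this, with one complete JSON object per line:

{"name": "Manish", "age": 20, "salary": 20000}
{"name": "Nikita", "age": 25, "salary": 21000}
{"name": "Pritam", "age": 16, "salary": 22000}
{"name": "Prantosh", "age": 35, "salary": 25000}
{"name": "Vikash", "age": 67, "salary": 40000}

Multi-line JSON spreads each record over several lines, typically inside a single JSON array, as in the equivalent file below: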

[
{
  "name": "Manish",
  "age": 20,
  "salary": 20000
},
{
  "name": "Nikita",
  "age": 25,
  "salary": 21000
},
{
  "name": "Pritam",
  "age": 16,
  "salary": 22000
},
{
  "name": "Prantosh",
  "age": 35,
  "salary": 25000
},
{
  "name": "Vikash",
  "age": 67,
  "salary": 40000
}
]

Line-delimited JSON is more efficient in terms of performance because the reader knows that each line contains exactly one JSON record, whereas with multi-line JSON it has to keep track of where one record ends and the next one starts.
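
A minimal PySpark sketch of reading both layouts (the file names here are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-demo").getOrCreate()

# Line-delimited JSON is the default: one record per line, no extra options needed.
line_df = spark.read.format("json").load("employees_line_delimited.json")

# Multi-line JSON must be flagged explicitly, otherwise Spark expects one record per line.
multi_df = spark.read.format("json") \
    .option("multiLine", "true") \
    .load("employees_multi_line.json")

line_df.show()
multi_df.show()

Both calls return the same five-row DataFrame; only the parsing mode differs.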

Different number of keys in each line🔗


What happens here is that the record with the extra key gets a value for that column, while the column is null for all the other records.
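
A sketch of the same behaviour with a hypothetical extra key (the bonus field and file name are made up for illustration). Suppose the line-delimited file contains:

{"name": "Manish", "age": 20, "salary": 20000}
{"name": "Nikita", "age": 25, "salary": 21000, "bonus": 5000}
{"name": "Pritam", "age": 16, "salary": 22000}

Reading it as usual:

df = spark.read.format("json").load("employees_extra_key.json")
df.show()

# Roughly:
# +---+-----+------+------+
# |age|bonus|  name|salary|
# +---+-----+------+------+
# | 20| null|Manish| 20000|
# | 25| 5000|Nikita| 21000|
# | 16| null|Pritam| 22000|
# +---+-----+------+------+

The inferred schema contains the union of all keys, and bonus is null for every record that does not carry it.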

Multiline Incorrect JSON🔗

We don't wrap the records in a list (JSON array) here; the file is just a sequence of dictionaries separated by commas:

{
  "name": "Manish",
  "age": 20,
  "salary": 20000
},
{
  "name": "Nikita",
  "age": 25,
  "salary": 21000
},
{
  "name": "Pritam",
  "age": 16,
  "salary": 22000
},
{
  "name": "Prantosh",
  "age": 35,
  "salary": 25000
},
{
  "name": "Vikash",
  "age": 67,
  "salary": 40000
}
When we process this JSON, Spark reads only the first dictionary as a record; the rest of the file is not processed.
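
A sketch of reading such a file (the file name is a placeholder):

df = spark.read.format("json") \
    .option("multiLine", "true") \
    .load("incorrect_multiline.json")

df.show()
# Only the first dictionary comes back as a row:
# +---+------+------+
# |age|  name|salary|
# +---+------+------+
# | 20|Manish| 20000|
# +---+------+------+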


Corrupted Records🔗

We don't need to define _corrupt_record in the schema; when Spark infers the schema it adds the column on its own.

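A minimal sketch, assuming a line-delimited file in which one line is malformed JSON (the broken line and file name are invented for illustration):

{"name": "Manish", "age": 20, "salary": 20000}
{"name": "Nikita", "age": 25, "salary": 21000
{"name": "Pritam", "age": 16, "salary": 22000}

df = spark.read.format("json").load("employees_corrupt.json")
df.printSchema()   # includes an automatically added _corrupt_record string column
df.show(truncate=False)
# The malformed line lands in the _corrupt_record column with null in the regular
# columns for that row; well-formed rows have null in _corrupt_record instead.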