Spark | JSON Schema inference - apache-spark

I have some JSON data with a variable number of columns and differing data types. Spark's JSON schema inference has configuration options mentioned in the docs.
As far as I understand, Spark runs a Spark job to infer the schema. This job runs as several tasks, and each task uses samplingRatio to take a sample of the data for schema inference.
Is it reasonable to use a samplingRatio below 1.0 when the number of columns varies?
Are there any other configurations or tricks applicable to my use case?
I'd appreciate any help with this.
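For reference, a minimal sketch (assuming a SparkSession named spark and a placeholder input path) of how samplingRatio is passed to the JSON reader, and of the infer-once-then-reuse alternative that is usually safer when the set of columns varies:

// Infer the schema from roughly 30% of the input records instead of all of them.
// With a variable set of columns, a low ratio risks missing rarely occurring
// fields, so the inferred schema may lack columns that only appear in unsampled records.
val sampled = spark.read
  .option("samplingRatio", "0.3")
  .json("/path/to/json")

// Safer alternative for ragged JSON: infer once over the full data,
// keep the resulting schema, and pass it explicitly on later reads.
val fullSchema = spark.read.json("/path/to/json").schema
val df = spark.read.schema(fullSchema).json("/path/to/json")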

Related

Generate Spark schema code/persist and reuse schema

I am implementing some Spark Structured Streaming transformations from a Parquet data source. In order to read the data into a streaming DataFrame, one has to specify the schema (it cannot be inferred automatically). The schema is really complex, and writing it out by hand would be very tedious and error-prone.
Can you suggest a workaround? Currently I create a batch DataFrame beforehand (from the same data source), let Spark infer the schema, and then save the schema to a Scala object and use it as input for the Structured Streaming reader.
I don't think this is a reliable or well-performing solution. Please suggest how to generate the schema code automatically, or how to persist the schema in a file and reuse it.
From the docs:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
You could also open a shell, read one of the parquet files with automatic schema inference enabled, and save the schema to JSON for later reuse. You only have to do this one time, so it might be faster / more efficient than doing the similar-sounding workaround you're using now.
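A rough sketch of that one-time step (assuming a SparkSession named spark and placeholder paths): the inferred StructType can be serialized through its JSON representation and rebuilt later with DataType.fromJson.

import org.apache.spark.sql.types.{DataType, StructType}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// One-off: infer the schema from the existing Parquet data with a batch read...
val inferred = spark.read.parquet("/data/source").schema
// ...and persist its JSON representation for later reuse.
Files.write(Paths.get("/tmp/source-schema.json"), inferred.json.getBytes(StandardCharsets.UTF_8))

// Later, in the streaming job: load the JSON and rebuild the StructType.
val schemaJson = new String(Files.readAllBytes(Paths.get("/tmp/source-schema.json")), StandardCharsets.UTF_8)
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val stream = spark.readStream.schema(schema).parquet("/data/source")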

How to chain multiple jobs in Apache Spark

I would like to know whether there is a way to chain jobs in Spark, so that the output RDD (or other format) of the first job is passed as input to the second job.
Is there any API for this in Apache Spark? Is this even an idiomatic approach?
From what I found, there is a way to spin up another process through the YARN client, for example Spark - Call Spark jar from java with arguments, but this assumes that you save the output to some intermediate storage between jobs.
There are also runJob and submitJob on SparkContext, but are they a good fit for this?
Use the same RDD definition to define the input/output of your jobs; you should then be able to chain them.
The other option is to use DataFrames instead of RDDs and figure out the schema at run time.
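As a hedged illustration of the DataFrame route (placeholder paths and column names, spark assumed to be an existing SparkSession): chain the stages inside one application by feeding the first stage's DataFrame into the second, with a cache at the hand-off so the second action does not recompute the first lineage.

import spark.implicits._

val firstOutput = spark.read.parquet("/data/input")
  .filter($"status" === "ok")
  .cache()                                        // materialized on the first action, reused afterwards

firstOutput.write.parquet("/data/intermediate")   // first job

val secondOutput = firstOutput
  .groupBy($"category")
  .count()
secondOutput.write.parquet("/data/result")        // second job reuses the cached input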

Spark - High task deserialization time

I am running a job using Spark SQL with some complex queries (group by 7 fields, partition by 5 fields with rank, etc.). When I run the job on a large dataset (1 TB in Parquet), the task deserialization time is very high for one of the stages, but the logs only say that it is reading data from Parquet files (on S3). Can anyone help me understand why this is happening? I can tell that jar size is not the issue, since I don't see this in the other stages.
If I have to use Kryo serialization, how would I use it with a Dataset? (I am not using any custom objects.)
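For the Kryo part of the question, a minimal sketch (the Event type below is illustrative, not from the original): Kryo is enabled application-wide through spark.serializer, while Datasets of primitives and case classes keep using Spark's built-in encoders unless a type is explicitly opted in via Encoders.kryo.

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Enable Kryo at the RDD/shuffle level for the whole application.
val spark = SparkSession.builder()
  .appName("kryo-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Opt a specific type into a Kryo-backed encoder inside a Dataset
// (it is then stored as a single binary column).
case class Event(id: Long, payload: String)       // illustrative type
implicit val eventEncoder: Encoder[Event] = Encoders.kryo[Event]
val ds = spark.createDataset(Seq(Event(1L, "a"), Event(2L, "b")))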

Kafka spark streaming dynamic schema

I'm struggling with Kafka Spark Streaming and a dynamic schema.
I'm consuming from Kafka (KafkaUtils.createDirectStream). Each message/JSON field can be nested, and each field appears in some messages and not in others.
The only approach I have found so far is:
Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)
case class MyTyp(column1: Option[Any], column2: Option[Any], ...)
I'm not sure whether this will cover fields that may or may not appear, as well as nested fields.
Any confirmation, other ideas, or general help would be appreciated.
After a long integration effort and many trials, I found two ways to handle consuming from Kafka without a schema: 1) run each message through an "editing/validation" lambda function, which is not my favorite; or 2) in Spark, on each micro-batch, obtain the flattened schema, intersect it with the needed columns, and use Spark SQL to query the frame for the needed data. The second approach worked for me.
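A sketch of the second approach under some assumptions: jsonStream is a DStream[String] of raw JSON payloads from KafkaUtils.createDirectStream, and the column names below are placeholders for the fields the job needs (nested fields would additionally have to be flattened, e.g. by selecting their full paths).

import org.apache.spark.sql.{Encoders, SparkSession}

val wanted = Seq("event", "userId", "value")      // placeholder column names

jsonStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val spark = SparkSession.builder().getOrCreate()
    // Infer the schema for this micro-batch only; fields absent from the
    // batch simply do not show up in df.columns.
    val df = spark.read.json(spark.createDataset(rdd)(Encoders.STRING))
    // Intersect the wanted columns with what the batch actually contains,
    // then query just those through Spark SQL.
    val present = wanted.filter(df.columns.contains(_))
    df.selectExpr(present: _*).createOrReplaceTempView("batch")
    spark.sql("SELECT * FROM batch").show()
  }
}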

Apache Spark: Python function serialized automatically

I was going through the Apache Spark documentation. The Spark docs for Python say the following:
...We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference...
I don't fully understand what this means. Does it have something to do with the RDD type?
What does it mean in the context of Spark?
The serialization is necessary when using PySpark because the function you define locally needs to be executed remotely on each of the worker nodes. This concept isn't really related to the RDD type.
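A minimal sketch of what gets captured (written in Scala to match the other snippets here; PySpark does the equivalent with pickled Python functions): a driver-side variable referenced by a function passed to an RDD operation travels with the serialized closure to the executors.

val threshold = 10                                // driver-side variable
val numbers = spark.sparkContext.parallelize(1 to 100)
// The filter function references `threshold`, so Spark serializes the
// closure, threshold included, and ships it to every executor.
val above = numbers.filter(x => x > threshold)
println(above.count())                            // the filtering itself runs on the executors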