Why does Spark's CSV read generate three jobs? - apache-spark

I tried a simple example on Spark 2.1 cloudera2:
val flightData2015 = spark
.read
.option("inferSchema", "true")
.option("header", "true")
.csv("/2015-summary.csv")
But when I checked the Spark shell UI, I found that it generated three jobs.
I think every action should correspond to a job, am I right? I did some experiments and found that every option can generate a job. Does an option act like an action? Please help me understand this situation.

#yuxh, it's because of the defaultMinPartitions, which has been set to 3. It reflects the parallelism when a Spark job is executed. You can change it globally in yarn-site.xml, or dynamically for a specific job by issuing sqlContext.setConf("spark.sql.shuffle.partitions", "your value")

Related

Starting a Spark Session without executors

I have a use case where I need to use some of Spark's APIs without actually performing any data processing. For example: I want to read the schema of some Hive table with spark.table(table_name).schema.
I want the process to be fast and lightweight. Specifically, I want to avoid the relatively long wait time to get the resources when starting. Is there a way to get a limited Spark Session with just the driver JVM and no executors at all?
The best I managed is this, but I wanted to see if I can make it even lighter:
spark = (
    SparkSession
    .builder
    .enableHiveSupport()
    .master("local[1]")
    .config("spark.executor.instances", "1")
    .config("spark.executor.cores", "1")
    .config("spark.executor.memory", "450m")
    .config("spark.executor.memoryOverhead", "0")
    .config("spark.shuffle.service.enabled", "false")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
)
Just to clear up your line of thought:
In local mode, the driver and executors are created in a single JVM. But there are no real executors; there are just N cores for the Spark app to use.
So you are good with local[1], but you need not set these executor parameters.
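A minimal sketch of the trimmed-down builder this answer points to, assuming PySpark is installed and a Hive metastore is reachable. The names LOCAL_CONF and build_local_session are illustrative and not from the original post; the only setting kept from the question is the one that does not concern executors.

```python
# Settings kept from the question. The spark.executor.* entries are dropped
# because local mode runs everything inside the driver JVM.
LOCAL_CONF = {
    "spark.ui.enabled": "false",  # skip starting the web UI
}


def build_local_session(conf=None):
    """Build a driver-only session on a single local core."""
    # Imported lazily so this module loads even where PySpark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[1]").enableHiveSupport()
    for key, value in (conf or LOCAL_CONF).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (table_name is a placeholder):
# spark = build_local_session()
# print(spark.table(table_name).schema)
```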

Is spark.read.load() an action or transformation? It takes time with this statement alone

I tried just loading the data using the code below, and it looks like, without any other action on it, it takes a lot of time. The bigger the file, the more time it takes.
print("STARTED")
biglog_df = spark.read.format("csv") \
.option("header",True) \
.option("inferSchema",True) \
.option("path","bigLog.txt") \
.load()
print("DONE STARTING")
It took around 20 seconds to print "DONE STARTING" when the file is 4 GB, while it took more than a minute to get to "DONE STARTING" when the file is 25 GB. Does this mean that Spark is trying to load the data? So, is load an action?
The load operation is not lazily evaluated if you set the inferSchema option to True. In this case, Spark launches a job to scan the file and infer the column types.
You can avoid this behavior by providing the schema when reading the file.
You can observe this behavior with this test:
Open a new interactive session in pyspark.
Open Spark UI > Pyspark Session > Jobs.
Then run:
df = (
spark.read.format("csv")
.option("header", True)
.option("inferSchema", True)
.option("path", "s3a://first-street-climate-risk-statistics-for-noncommercial-use/01_DATA/Climate_Risk_Statistics/v1.3/Zip_level_risk_FEMA_FSF_v1.3.csv")
.load()
)
You will notice that jobs will be launched to scan (part of) the file to infer the schema.
If you load the file and provide the schema:
import json
from pyspark.sql.types import StructType
json_schema = '{"fields":[{"metadata":{},"name":"zipcode","nullable":true,"type":"integer"},{"metadata":{},"name":"count_property","nullable":true,"type":"integer"},{"metadata":{},"name":"count_fema_sfha","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fema_sfha","nullable":true,"type":"double"},{"metadata":{},"name":"count_fs_risk_2020_5","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fs_risk_2020_5","nullable":true,"type":"double"},{"metadata":{},"name":"count_fs_risk_2050_5","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fs_risk_2050_5","nullable":true,"type":"double"},{"metadata":{},"name":"count_fs_risk_2020_100","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fs_risk_2020_100","nullable":true,"type":"double"},{"metadata":{},"name":"count_fs_risk_2050_100","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fs_risk_2050_100","nullable":true,"type":"double"},{"metadata":{},"name":"count_fs_risk_2020_500","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fs_risk_2020_500","nullable":true,"type":"double"},{"metadata":{},"name":"count_fs_risk_2050_500","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fs_risk_2050_500","nullable":true,"type":"double"},{"metadata":{},"name":"count_fs_fema_difference_2020","nullable":true,"type":"integer"},{"metadata":{},"name":"pct_fs_fema_difference_2020","nullable":true,"type":"double"},{"metadata":{},"name":"avg_risk_score_all","nullable":true,"type":"double"},{"metadata":{},"name":"avg_risk_score_2_10","nullable":true,"type":"double"},{"metadata":{},"name":"avg_risk_fsf_2020_100","nullable":true,"type":"double"},{"metadata":{},"name":"avg_risk_fsf_2020_500","nullable":true,"type":"double"},{"metadata":{},"name":"avg_risk_score_sfha","nullable":true,"type":"double"},{"metadata":{},"name":"avg_risk_score_no_sfha","nullable":true,"type":"double"},{"metadata":{},"name":"count_floodfactor1","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor2","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor3","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor4","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor5","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor6","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor7","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor8","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor9","nullable":true,"type":"integer"},{"metadata":{},"name":"count_floodfactor10","nullable":true,"type":"integer"}],"type":"struct"}'
schema = StructType.fromJson(json.loads(json_schema))
df = (
spark.read.format("csv")
.schema(schema)
.option("header", True)
.option("path", "s3a://first-street-climate-risk-statistics-for-noncommercial-use/01_DATA/Climate_Risk_Statistics/v1.3/Zip_level_risk_FEMA_FSF_v1.3.csv")
.load()
)
Spark will launch no jobs, because the schema is already supplied and nothing needs to be inferred from the file.
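For a shorter way to supply a schema, the reader's schema method also accepts a DDL-formatted string instead of a StructType. The sketch below builds such a string in plain Python. The helper to_ddl is hypothetical, the column list is abbreviated to the first four columns of the file used in this answer (in practice every column would be listed in file order), and the Spark call is left as a comment because it needs a live session.

```python
def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {dtype.upper()}" for name, dtype in columns)


# First four columns of the CSV, in file order; extend to cover all columns.
columns = [
    ("zipcode", "int"),
    ("count_property", "int"),
    ("count_fema_sfha", "int"),
    ("pct_fema_sfha", "double"),
]
ddl = to_ddl(columns)
print(ddl)

# With a live session (path is a placeholder):
# df = spark.read.format("csv").schema(ddl).option("header", True).load(path)
```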
As already explained by #rodrigo,
the CSV option inferSchema implies a pass over the whole CSV file to infer the schema.
You can change this behavior by providing the schema yourself (if you want to create it by hand, maybe with a case class if you are on Scala) or by using the samplingRatio option, which indicates what fraction of the file you want to scan, so that setting up your DataFrame is faster.
All the interesting behaviors are explained in the documentation, which you can find here:
DataFrame reader documentation with options for CSV file reading
biglog_df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .option("samplingRatio", 0.01)
    .option("path", "bigLog.txt")
    .load()
)
BRegards

Why does Spark Structured Streaming not allow changing the number of input sources?

I would like to build a Spark streaming pipeline that reads from multiple Kafka topics (that vary in number over time). I intended to stop the streaming job, add or remove topics, and start the job again whenever I needed to update the topics, using one of the two options outlined in the Spark Structured Streaming + Kafka Integration Guide:
# Subscribe to multiple topics
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1,topic2") \
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
# Subscribe to a pattern
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribePattern", "topic.*") \
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Upon further investigation, I noticed the following point in the Spark Structured Streaming Programming Guide and am trying to understand why changing the number of input sources is "not allowed":
Changes in the number or type (i.e. different source) of input sources: This is not allowed.
Definition of "Not Allowed" (also from Spark Structured Streaming Programming Guide):
The term not allowed means you should not do the specified change as the restarted query is likely to fail with unpredictable errors. sdf represents a streaming DataFrame/Dataset generated with sparkSession.readStream.
My understanding is that Spark Structured Streaming implements its own checkpointing mechanism:
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.
Can someone please explain why changing the number of sources is "not allowed"? I assume that would be one of the benefits of the checkpointing mechanism.
Steps to add a new input source to an existing running streaming job:
Stop the currently running streaming job in which the model is running.
hdfs dfs -get output/checkpoints/<model_name>offsets <local_directory>/offsets
There will be 3 files (since the last 3 offsets are saved by Spark) in the directory. Sample format of a single file:
v1
{ "batchWatermarkMs":0,"batchTimestampMs":1578463128395,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{ "logOffset":0}
{ "logOffset":0}
Each {"logOffset":batchId} line represents a single input source.
To add a new input source, add a "-" line at the end of each file in the directory.
Sample updated file:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1578463128395,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
{"logOffset":0}
-
If you want to add more than one input source, add as many "-" lines as there are new input sources.
hdfs dfs -put -f <local_directory>/offsets output/checkpoints/<model_name>offsets
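The manual edit in these steps can be scripted. The sketch below is plain Python meant to run against the local copy fetched with hdfs dfs -get, before the put. The function name add_source_placeholders is hypothetical, and it assumes every non-hidden file in the directory is an offsets file in the format shown above.

```python
from pathlib import Path


def add_source_placeholders(offsets_dir, new_sources=1):
    """Append one "-" line per new input source to every offsets file."""
    patched = []
    for path in sorted(Path(offsets_dir).iterdir()):
        # Skip sub-directories and hidden files such as local checksum files.
        if not path.is_file() or path.name.startswith("."):
            continue
        # Drop trailing newlines so each marker lands on its own line.
        text = path.read_text().rstrip("\n")
        path.write_text(text + "\n-" * new_sources)
        patched.append(path.name)
    return patched


# Usage (the directory is a placeholder for <local_directory>/offsets):
# add_source_placeholders("offsets", new_sources=2)
```

Keeping an untouched copy of the original offsets directory makes it possible to restore the checkpoint if the restarted query rejects the edited files.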
The best way to do what you want is to run your readStreams in multiple threads.
I'm doing this, reading 40 tables at the same time. To do this I followed this article:
https://cm.engineering/multiple-spark-streaming-jobs-in-a-single-emr-cluster-ca86c28d1411.
Here is a quick summary of what I did after reading it. I set up my code structure with a main function, an executor, and a trait holding my Spark session, which is shared by all jobs.
1. Two lists of the topics that I want to read.
So, in Scala I create two lists. The first list holds the topics that I always want to read, and the second is a dynamic list to which I can add new topics when I stop my job.
2. Pattern matching to run the jobs.
I have two different kinds of jobs: the ones I run for the tables I always read, and dynamic jobs that I run for specific topics. In other words, if I want to add a new topic and create a new job for it, I add that job to the pattern matching. In the code below, I want to run specific jobs for the Cars and Ships tables, and all other tables that I put in the list will run the same replication table job.
var tables = specifcTables ++ dynamicTables
tables.map(table => {
  table._1 match {
    case "CARS" => new CarsJob
    case "SHIPS" => new ShipsReplicationJob
    case _ => new ReplicationJob
  }
})
After this, I pass the result of this pattern matching to a createJobs function that instantiates each of these jobs, and I pass its result to a startFutureTasks function, which puts each of these jobs in a different thread.
startFutureTasks(createJobs(tables))
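The same dispatch-then-thread pattern can be sketched in Python with the standard library. Everything here is a stand-in: the three job functions only return a label, where a real job would start a streaming query and block on it, and the names mirror the Scala snippet rather than any actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def cars_job(table):
    return f"cars job for {table}"


def ships_replication_job(table):
    return f"ships replication job for {table}"


def replication_job(table):
    return f"replication job for {table}"


# Tables with a dedicated job; everything else falls back to replication_job.
SPECIFIC_JOBS = {"CARS": cars_job, "SHIPS": ships_replication_job}


def create_jobs(tables):
    """Pair each table with the job function that should handle it."""
    return [(SPECIFIC_JOBS.get(table, replication_job), table) for table in tables]


def start_future_tasks(jobs):
    """Run every job in its own thread and collect results in submit order."""
    with ThreadPoolExecutor(max_workers=max(len(jobs), 1)) as pool:
        futures = [pool.submit(job, table) for job, table in jobs]
        return [future.result() for future in futures]


tables = ["CARS", "SHIPS"] + ["ORDERS"]  # specific list + dynamic list
results = start_future_tasks(create_jobs(tables))
print(results)
```

Adding a topic then means appending its name to the dynamic list, and adding an entry to SPECIFIC_JOBS only when it needs its own job.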
I hope I've helped. Thanks !

Dynamic resource allocation for spark applications not working

I am new to Spark and trying to figure out how dynamic resource allocation works. I have a Spark Structured Streaming application that reads a million records at a time from Kafka and processes them. My application always starts with 3 executors and never increases the number of executors.
It takes 5-10 minutes to finish the processing. I thought it would increase the number of executors (up to 10) and try to finish the processing sooner, but that is not happening. What am I missing here? How is this supposed to work?
I have set below properties in Ambari for Spark
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.initialExecutors = 3
spark.dynamicAllocation.maxExecutors = 10
spark.dynamicAllocation.minExecutors = 3
spark.shuffle.service.enabled = true
Below is what my submit command looks like
/usr/hdp/3.0.1.0-187/spark2/bin/spark-submit --class com.sb.spark.sparkTest.sparkTest --master yarn --deploy-mode cluster --queue default sparkTest-assembly-0.1.jar
Spark code
//read stream
val dsrReadStream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", brokers) //kafka brokers
.option("startingOffsets", startingOffsets) // start point to read
.option("maxOffsetsPerTrigger", maxoffsetpertrigger) // no. of records per batch
.option("failOnDataLoss", "true")
/****
Logic to validate format of loglines. Writing invalid log lines to kafka and store valid log lines in 'dsresult'
****/
//write stream
val dswWriteStream =dsresult.writeStream
.outputMode(outputMode) // file write mode, default append
.format(writeformat) // file format ,default orc
.option("path",outPath) //hdfs file write path
.option("checkpointLocation", checkpointdir) //checkpoint location
.option("maxRecordsPerFile", 999999999)
.trigger(Trigger.ProcessingTime(triggerTimeInMins))
Just to clarify further,
spark.streaming.dynamicAllocation.enabled=true
works only for the DStreams API. See Jira
Also, if you set
spark.dynamicAllocation.enabled=true
and run a structured streaming job, the batch dynamic allocation algorithm kicks in, which may not be optimal. See Jira
Dynamic Resource Allocation does not work with Spark Streaming
Refer to this link

How to validate orc vectorization is working within spark application?

I have enabled the configurations listed below in my Spark streaming
application, but I am unable to see any performance benefit after setting these parameters.
Does anyone know a way to validate whether vectorization is enabled and working as expected?
Note: I am using Spark 2.3 and have converted all the data in my application
to native ORC format, version 1.4.
sparkSqlCtx.setConf("spark.sql.orc.filterPushdown", "true")
sparkSqlCtx.setConf("spark.sql.orc.enabled", "true")
sparkSqlCtx.setConf("spark.sql.hive.convertMetastoreOrc", "true")
sparkSqlCtx.setConf("spark.sql.orc.char.enabled", "true")
sparkSqlCtx.setConf("spark.sql.orc.impl","native")
sparkSqlCtx.setConf("spark.sql.orc.enableVectorizedReader","true")
You need to set it as below
spark.sql("set spark.sql.orc.impl=native")
You can confirm with
spark.sql("set spark.sql.orc.impl").show
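Confirming the setting shows what was configured, not what the reader actually does. One further check, offered as a sketch: the physical plan printed by df.explain() reports a Batched flag on file scans, and for ORC that flag is expected to be true when the vectorized reader is in use. The helper below is hypothetical and only inspects plan text; the two sample lines are hand-written to resemble explain output, not captured from a real run.

```python
def scans_are_batched(plan_text):
    """Return True if the plan has file scans and all of them are batched."""
    scans = [line for line in plan_text.splitlines() if "FileScan" in line]
    return bool(scans) and all("Batched: true" in line for line in scans)


# Hand-written lines shaped like explain() output; real plans will differ.
vectorized = "*(1) FileScan orc [id#0L] Batched: true, Format: ORC"
row_based = "*(1) FileScan orc [id#0L] Batched: false, Format: ORC"
print(scans_are_batched(vectorized), scans_are_batched(row_based))

# With a live session (internal accessor, may change between versions):
# plan_text = df._jdf.queryExecution().simpleString()
```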
