I have a use case where I need to use some of Spark's APIs without actually performing any data processing. For example: I want to read the schema of some Hive table with spark.table(table_name).schema.
I want the process to be fast and lightweight. Specifically, I want to avoid the relatively long wait time to get the resources when starting. Is there a way to get a limited Spark Session with just the driver JVM and no executors at all?
The best I managed is this, but I wanted to see if I can make it even lighter:
spark = (
SparkSession
.builder
.enableHiveSupport()
.master("local[1]")
.config("spark.executor.instances", "1")
.config("spark.executor.cores", "1")
.config("spark.executor.memory", "450m")
.config("spark.executor.memoryOverhead", "0")
.config("spark.shuffle.service.enabled", "false")
.config("spark.dynamicAllocation.enabled", "false")
.config("spark.ui.enabled", "false")
)
Just to clarify your line of thought:
In local mode, the driver and the executor are created in a single JVM. There are no real executors; there are just N cores for the Spark app to use.
So you are good with local[1], but you do not need to set those executor params.
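For illustration, a trimmed-down session along those lines might look roughly like this (a sketch in Scala; the table name some_db.some_table is just a placeholder, and this is only an assumption of what a minimal driver-only setup could look like):

import org.apache.spark.sql.SparkSession

// Driver-only local session: one core, no separate executor processes, UI disabled.
val spark = SparkSession.builder()
  .master("local[1]")                  // driver and executor share a single JVM
  .config("spark.ui.enabled", "false") // skip starting the web UI
  .enableHiveSupport()                 // needed to look up Hive table metadata
  .getOrCreate()

// Schema lookup only; no Spark job is triggered.
val schema = spark.table("some_db.some_table").schema
println(schema.treeString)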
I have Spark set up in standalone mode on a single node with 2 cores and 16GB of RAM to make some rough POCs.
I want to load data from a SQL source using val df = spark.read.format("jdbc")...option("numPartitions", n).load(). When I tried to measure the time taken to read a table for different numPartitions values by calling df.rdd.count, I saw that the time was the same regardless of the value I gave. I also noticed on the Spark context web UI that the number of active executors was 1, even though I set SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1 in my spark-env.sh file.
I have 2 questions:
Do the numPartitions actually created depend on the number of executors?
How do I start spark-shell with multiple executors in my current setup?
Thanks!
The number of partitions doesn't depend on your number of executors. Although there is a best practice (a few partitions per core), it is not determined by the number of executor instances.
In the case of reading from JDBC, to parallelize the read you need a partition column, e.g.:
spark.read("jdbc")
.option("url", url)
.option("dbtable", "table")
.option("user", user)
.option("password", password)
.option("numPartitions", numPartitions)
.option("partitionColumn", "<partition_column>")
.option("lowerBound", 1)
.option("upperBound", 10000)
.load()
That will parallelize the read into numPartitions separate queries against the database, each returning roughly 10,000/numPartitions rows (based on the lowerBound and upperBound values).
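To make that concrete, with numPartitions = 4, lowerBound = 1 and upperBound = 10000, Spark generates roughly the following four queries (illustrative only; the exact boundary handling depends on the Spark version, and the bounds only control the split points, they do not filter rows):

SELECT ... FROM table WHERE <partition_column> < 2501 OR <partition_column> IS NULL
SELECT ... FROM table WHERE <partition_column> >= 2501 AND <partition_column> < 5001
SELECT ... FROM table WHERE <partition_column> >= 5001 AND <partition_column> < 7501
SELECT ... FROM table WHERE <partition_column> >= 7501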
About your second question, you can find all of the Spark configuration options here: https://spark.apache.org/docs/latest/configuration.html (e.g. spark2-shell --num-executors, or the configuration --conf spark.executor.instances).
Specifying the number of executors means dynamic allocation will be off, so be aware of that.
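For example (a sketch only, with the host and resource numbers as assumptions): on YARN you could start the shell with

spark-shell --master yarn --num-executors 2 --executor-cores 1

whereas in your standalone setup the executor count follows from the workers, so something like

spark-shell --master spark://<master-host>:7077 --conf spark.executor.cores=1 --conf spark.cores.max=2

should give you one executor per worker with SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1.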
I am new to Spark and trying to figure out how dynamic resource allocation works. I have a Spark Structured Streaming application which tries to read a million records at a time from Kafka and process them. My application always starts with 3 executors and never increases the number of executors.
It takes 5-10 minutes to finish the processing. I thought it would increase the number of executors (up to 10) and try to finish the processing sooner, but that is not happening. What am I missing here? How is this supposed to work?
I have set the below properties in Ambari for Spark:
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.initialExecutors = 3
spark.dynamicAllocation.maxExecutors = 10
spark.dynamicAllocation.minExecutors = 3
spark.shuffle.service.enabled = true
Below is what my submit command looks like:
/usr/hdp/3.0.1.0-187/spark2/bin/spark-submit --class com.sb.spark.sparkTest.sparkTest --master yarn --deploy-mode cluster --queue default sparkTest-assembly-0.1.jar
Spark code
//read stream
val dsrReadStream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", brokers) //kafka bokers
.option("startingOffsets", startingOffsets) // start point to read
.option("maxOffsetsPerTrigger", maxoffsetpertrigger) // no. of records per batch
.option("failOnDataLoss", "true")
/****
Logic to validate format of loglines. Writing invalid log lines to kafka and store valid log lines in 'dsresult'
****/
//write stream
val dswWriteStream = dsresult.writeStream
.outputMode(outputMode) // file write mode, default append
.format(writeformat) // file format ,default orc
.option("path",outPath) //hdfs file write path
.option("checkpointLocation", checkpointdir) location
.option("maxRecordsPerFile", 999999999)
.trigger(Trigger.ProcessingTime(triggerTimeInMins))
.start() // start the streaming query
Just to clarify further:
spark.streaming.dynamicAllocation.enabled=true
works only for the DStreams API. See Jira.
Also, if you set
spark.dynamicAllocation.enabled=true
and run a structured streaming job, the batch dynamic allocation algorithm kicks in, which may not be very optimal. See Jira
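So for a Structured Streaming job, a common workaround (a sketch based on your submit command; the executor sizes below are placeholder assumptions, not recommendations) is to disable dynamic allocation and pin the executor count explicitly:

/usr/hdp/3.0.1.0-187/spark2/bin/spark-submit \
  --class com.sb.spark.sparkTest.sparkTest \
  --master yarn --deploy-mode cluster --queue default \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 10 --executor-cores 2 --executor-memory 4g \
  sparkTest-assembly-0.1.jar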
Dynamic Resource Allocation does not work with Spark Streaming
Refer to this link.
I tried a simple example on Spark 2.1 (cloudera2):
val flightData2015 = spark
.read
.option("inferSchema", "true")
.option("header", "true")
.csv("/2015-summary.csv")
but when I checked the Spark shell UI, I found that it generated three jobs.
I think every action should relate to a job, am I right? I did some experiments and found that every option can generate a job. Does option act like an action? Please help me understand this situation.
@yuxh, it's because of defaultMinPartitions, which has been set to 3. It reflects the parallelism when a Spark job is executed. You can change it in yarn-site.xml globally, or dynamically for a specific job by issuing sqlContext.setConf("spark.sql.shuffle.partitions", "your value").
I need to write my final dataframe to HDFS and an Oracle database.
Currently, once saving to HDFS is done, it starts writing to the RDBMS. Is there any way to use Java threads to save the same dataframe to HDFS as well as the RDBMS in parallel?
finalDF.write().option("numPartitions", "10").jdbc(url, exatable, jdbcProp);
finalDF.write().mode("OverWrite").insertInto(hiveDBWithTable);
Thanks.
Cache finalDF before writing to HDFS and the RDBMS. Then make sure that enough executors are available for writing simultaneously. If the number of partitions in finalDF is p and the cores per executor is c, then you need a minimum of ceil(p/c) + ceil(10/c) executors.
df.show and df.write are actions. Actions occur sequentially in Spark. So the answer is no, it is not possible by default unless threads are used.
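A minimal sketch of the threaded approach (Scala Futures; it assumes finalDF, url, exatable, jdbcProp and hiveDBWithTable are already defined as in the question, and that enough executors are available as described above):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

finalDF.cache()  // keep the DataFrame around so both writers reuse it
finalDF.count()  // force materialization of the cache before the concurrent writes

// Kick off both writes on separate threads; Spark's scheduler is thread-safe.
val jdbcWrite = Future {
  finalDF.write.option("numPartitions", "10").jdbc(url, exatable, jdbcProp)
}
val hiveWrite = Future {
  finalDF.write.mode("overwrite").insertInto(hiveDBWithTable)
}

// Block until both writes have finished.
Await.result(jdbcWrite, Duration.Inf)
Await.result(hiveWrite, Duration.Inf)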
We can use the below code to append dataframe values to a table:
DF.write
.mode("append")
.format("jdbc")
.option("driver", driverProp)
.option("url", urlDbRawdata)
.option("dbtable", TABLE_NAME)
.option("user", userName)
.option("password", password)
.option("numPartitions", maxNumberDBPartitions)
.option("batchsize",batchSize)
.save()
I've got an application that orchestrates batch job executions, and I want to create a SparkSession per job execution, especially in order to get a clean separation of registered temp views, functions, etc.
So this would lead to thousands of SparkSessions per day, each of which will only live for the duration of a job (from a few minutes up to several hours). Is there any argument not to do this?
I am aware of the fact that there is only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does this mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed using those sessions?
This shows how multiple sessions can be built with different configurations.
Use
spark1.clearActiveSession();
spark1.clearDefaultSession();
To clear the sessions.
SparkSession spark1 = SparkSession.builder()
.master("local[*]")
.appName("app1")
.getOrCreate();
Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
df.show();
spark1.clearActiveSession();
spark1.clearDefaultSession();
SparkSession spark2 = SparkSession.builder()
.master("local[*]")
.appName("app2")
.getOrCreate();
Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
df2.show();
For your questions:
The Spark context saves the RDDs in memory for quicker processing.
If there is a lot of data, the saved tables or RDDs are moved to disk.
A session can access a table if it has been saved as a view at any point.
It is better to do multiple spark-submits for your jobs with unique IDs instead of having different configs.
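For example, launching each job execution as its own spark-submit with a unique identifier might look roughly like this (a sketch only; the class name, jar and JOB_ID variable are placeholders, not from your setup):

spark-submit \
  --master yarn --deploy-mode cluster \
  --name "batch-job-${JOB_ID}" \
  --class com.example.BatchJob \
  batch-job.jar "${JOB_ID}"

Each submit then gets its own driver JVM with its own SparkContext and SparkSession, so temp views, functions and cached data cannot leak between job executions.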