Spark load from Elasticsearch: number of executors and partitions - apache-spark

I'm trying to load data from an Elasticsearch index into a dataframe in Spark. My machine has 12 CPUs with 1 core each. I'm using PySpark in a Jupyter notebook with the following Spark config:
pathElkJar = currentUserFolder+"/elasticsearch-hadoop-"+connectorVersion+"/dist/elasticsearch-spark-20_2.11-"+connectorVersion+".jar"
spark = SparkSession.builder \
.appName("elastic") \
.config("spark.jars",pathElkJar) \
.enableHiveSupport() \
.getOrCreate()
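For reference, es_reader is presumably created along these lines; the data source name comes from the es-hadoop connector, while the host, port and index below are placeholders:
es_reader = spark.read \
.format("org.elasticsearch.spark.sql") \
.option("es.nodes", "localhost") \
.option("es.port", "9200") \
.option("es.resource", "my-index")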
Now whether I do:
df = es_reader.load()
or:
df = es_reader.load(numPartitions=12)
I get the same output from the following prints:
print('Master: {}'.format(spark.sparkContext.master))
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))
print('Number of executors:{}'.format(spark.sparkContext._conf.get('spark.executor.instances')))
print('Partitioner: {}'.format(df.rdd.partitioner))
print('Partitions structure: {}'.format(df.rdd.glom().collect()))
Master: local[*]
Number of partitions: 1
Number of executors: None
Partitioner: None
I was expecting 12 partitions, which I can only obtain by doing a repartition() on the dataframe. Furthermore, I thought that the number of executors would by default equal the number of CPUs. But even by doing the following:
spark.conf.set("spark.executor.instances", "12")
I can't manually set the number of executors. It is true that I have 1 core for each of the 12 CPUs, but how should I go about it?
I modified the configuration after creating the Spark session (without restarting, this obviously leads to no changes), specifying the number of executors as follows:
spark = SparkSession.builder \
.appName("elastic") \
.config("spark.jars",pathElkJar) \
.config("spark.executor.instances", "12") \
.enableHiveSupport() \
.getOrCreate()
I now correctly get 12 executors. Still, I don't understand why this doesn't happen automatically, and the number of partitions when loading the dataframe is still 1. I would expect it to be 12, the same as the number of executors, am I right?

The problem regarding the executors and partitioning arose from the fact that I was using Spark in local mode, which allows for one executor at most. Using YARN or another resource manager such as Mesos solved the problem.
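For illustration, a minimal sketch of the same session built against YARN instead of local mode. It is also worth noting that the es-hadoop connector typically creates one input partition per Elasticsearch shard, so a single-shard index still yields a single partition after the load:
spark = SparkSession.builder \
.appName("elastic") \
.master("yarn") \
.config("spark.jars", pathElkJar) \
.config("spark.executor.instances", "12") \
.enableHiveSupport() \
.getOrCreate()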

Related

PySpark + Dataproc - Can't get more than X executors and X GB of RAM per executor

I use a Dataproc cluster to lemmatize strings using Spark NLP.
My cluster has 5 worker nodes + 1 master; each worker node has 16 CPUs and 64 GB of RAM.
Doing some maths, my ideal Spark config is:
spark.executor.instances = 14
spark.executor.cores = 5
spark.executor.memory = 19G
With that conf, I maximize the usage of the machines and leave enough room for the ApplicationMaster and off-heap memory.
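For context, a rough sketch of the arithmetic behind those numbers (the per-node headroom values are assumptions, not from the original post):
# Rough executor-sizing arithmetic; headroom values are assumptions.
nodes = 5
usable_cores = 16 - 1                                 # leave 1 core per node for YARN/OS daemons
executors_per_node = usable_cores // 5                # 5 cores per executor -> 3 executors per node
executor_instances = nodes * executors_per_node - 1   # reserve 1 executor slot for the ApplicationMaster -> 14
executor_memory_gb = (64 - 7) // executors_per_node   # ~19 GB each, leaving room for overhead and the OS
print(executor_instances, executor_memory_gb)         # 14, 19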
However, when creating the SparkSession with
spark = SparkSession \
.builder \
.appName('perf-test-extract-skills') \
.config("spark.default.parallelism", "140") \
.config("spark.driver.maxResultSize", "19G") \
.config("spark.executor.memoryOverhead", "1361m") \
.config("spark.driver.memoryOverhead", "1361m") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.dynamicAllocation.enabled", "false") \
.config("spark.executor.instances", "14") \
.config("spark.executor.cores", "5") \
.config("spark.executor.memory", "19G") \
.config("spark.kryoserializer.buffer.max", "2000M") \
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.26.0,com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0') \
.getOrCreate()
I can only get 10 workers with 10 GiB of RAM each, as shown in the screenshot below:
I tried setting yarn.nodemanager.resource.memory-mb to 64000 to let YARN manage up to 64 GB of RAM per node, but the result is the same: I can't go beyond 10 workers with 10 GB of RAM each.
Also, when I check the values in the "Environment" tab, everything looks OK and the values are set according to my SparkSession config, meaning that the master made the request but it could not be fulfilled?
Is there something I forgot, or is my maths wrong?
EDIT: I managed to increase the number of executors with the new SparkSession I shared above. I can now get 14 executors, but each executor is still using 10 GB of RAM when it should use 19 GB.
Here is one of my executors: is it using 19 GB of RAM? I don't really understand the meaning of the different "memory" columns.
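For what it's worth, the "Storage Memory" column in the executors tab usually reports Spark's unified storage/execution region rather than the full heap. A rough sketch, assuming Spark's default spark.memory.fraction of 0.6 and its ~300 MB of reserved memory:
heap_mb = 19 * 1024                           # spark.executor.memory = 19G
reserved_mb = 300                             # memory Spark reserves for itself
unified_mb = (heap_mb - reserved_mb) * 0.6    # default spark.memory.fraction
print(round(unified_mb / 1024, 1))            # ~11.2 GB, in the same ballpark as the ~10 GB shown in the UI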

Dataproc Didn't Process Big Data in Parallel Using pyspark

I launched a Dataproc cluster in GCP with one master node and 3 worker nodes. Every node has 8 vCPUs and 30 GB of memory.
I developed a PySpark job that reads one CSV file from GCS. The CSV file is about 30 GB in size.
df_raw = (
spark
.read
.schema(schema)
.option('header', 'true')
.option('quote', '"')
.option('multiline', 'true')
.csv(infile)
)
df_raw = df_raw.repartition(20, "Product")
print(df_raw.rdd.getNumPartitions())
Here is how I launched the pyspark into dataproc:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION}
I got a partition count of only 1.
I attached the nodes usage image here for your reference.
It seems it used only one vCore on one worker node.
How to make this in parallel with multiple partitions and using all nodes and more vCores?
I tried repartitioning to 20, but it still only used one vCore on one worker node, as shown below:
PySpark's default number of shuffle partitions is 200, so I was surprised to see Dataproc didn't use all available resources for this kind of task.
This isn't a Dataproc issue, but a pure Spark/PySpark one.
In order to parallelize your data, it needs to be split into multiple partitions: a number larger than the number of executors (total worker cores) you have, e.g. roughly 2x or 3x that number.
There are various ways to do this e.g.:
Split the data into files or folders, parallelize the list of files/folders, and work on each one (or use a database that already does this and preserves the partitioning when read into Spark).
Repartition your data after you get a Spark DF, e.g. read the number of executors, multiply it by N, and repartition to that many partitions, as in the sketch after the code line below. When you do this, you must choose columns which divide your data well, i.e. into many parts rather than only a few, e.g. by day or by customer ID, not by a status ID.
df = df.repartition(num_partitions, 'partition_by_col1', 'partition_by_col2')
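For example, a minimal sketch of deriving the partition count from the cluster's parallelism (the factor of 3 and the column names are placeholders):
num_partitions = spark.sparkContext.defaultParallelism * 3   # roughly total executor cores * 3
df = df.repartition(num_partitions, "partition_by_col1", "partition_by_col2")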
The code runs on the master node and the parallel stages are distributed amongst the worker nodes, e.g.
df = (
df.withColumn(...).select(...)...
.write.format(...).save(...)
)
Since Spark functions are lazy, they only run when you reach a step like write or collect which causes the DF to be evaluated.
You might want to try increasing the number of executors by passing Spark configuration via the --properties flag of the Dataproc command line, e.g. something like:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION} \
--properties=spark.executor.instances=5
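As a quick sanity check from inside the job, you can print the setting the driver actually received (a minimal sketch; the key is only present if it was set explicitly):
print(spark.sparkContext.getConf().get("spark.executor.instances", "not set"))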

Pyspark GroupBy and count too slow

I am running PySpark on a Dataproc cluster with 4 nodes, each node having 2 cores and 8 GB of RAM.
I have a dataframe with a column containing a list of words. I exploded this column and counted the number of occurrences using:
df.groupBy("exploded_col").count()
Before exploding, there were ~78 million rows.
But running the above code takes too long (more than 4 hours). Why is Spark taking an unusually long time? I'm still new to Spark, so I'm not fully aware of the appropriate settings for dealing with data this large.
I have the following settings for the Spark session:
spark = SparkSession.builder \
.appName("Spark NLP Licensed") \
.master("yarn") \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 20)
spark.conf.set("spark.num.executors", 100)
spark.conf.set("spark.executor.cores", 1)
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
I even set "spark.sql.shuffle.partitions" to 2001, but that didn't work either.
Please help.
The main reason for the poor performance is that groupBy usually causes a data shuffle between the executors. You can use the built-in Spark function countDistinct in this manner:
from pyspark.sql.functions import countDistinct
df.agg(countDistinct("exploded_col"))
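A minimal usage sketch of both aggregations (using the import above and assuming df already holds the exploded column):
df.agg(countDistinct("exploded_col")).show()   # one row: the number of distinct words
df.groupBy("exploded_col").count().show()      # one row per word with its occurrence count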

Pyspark crashing on Dataproc cluster for small dataset

I am running a Jupyter notebook on a GCP Dataproc cluster consisting of 3 worker nodes and 1 master node of type "n1-standard-2" (2 cores, 7.5 GB RAM) for my data science project. The dataset consists of ~0.4 million rows. I have called a groupBy function with a grouping column containing only 10 unique values, so the output dataframe should consist of just 10 rows!
It's surprising that it crashes every time I call grouped_df.show() or grouped_df.toPandas(), where grouped_df is obtained after calling the groupBy() and sum() functions.
This should be a cakewalk for Spark, which was originally built for processing large datasets. I am attaching the Spark config that I am using, which I have defined in a function.
builder = SparkSession.builder \
.appName("Spark NLP Licensed") \
.master("local[*]") \
.config("spark.driver.memory", "40G") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.kryoserializer.buffer.max", "2000M") \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
.config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
.config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
return builder.getOrCreate()
This is the error I am getting. Please help.
Setting the master URL in setMaster() helped. Now I can load data as large as 20 GB and perform groupBy() operations on the cluster as well.
Thanks @mazaneicha.
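For reference, a sketch of what that change might look like in the builder above, pointing the master at YARN instead of local[*] (the 4G driver memory is just an illustrative value that fits a 7.5 GB master node, not something from the original post):
builder = SparkSession.builder \
.appName("Spark NLP Licensed") \
.master("yarn") \
.config("spark.driver.memory", "4G") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.kryoserializer.buffer.max", "2000M") \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1")
spark = builder.getOrCreate()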

Spark off heap memory leak on Yarn with Kafka direct stream

I am running Spark Streaming 1.4.0 on YARN (Apache Hadoop distribution 2.6.0) with Java 1.8.0_45 and a Kafka direct stream. I am also using Spark with Scala 2.11 support.
The issue I am seeing is that both the driver and executor containers gradually increase their physical memory usage until YARN kills the container. I have configured up to 192 MB of heap and 384 MB of off-heap space in my driver, but it eventually runs out.
The heap memory appears to be fine, with regular GC cycles. No OutOfMemoryError is ever encountered in any of these runs.
In fact, I am not generating any traffic on the Kafka topics, yet this still happens. Here is the code I am using:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SimpleSparkStreaming extends App {
  val conf = new SparkConf()
  val ssc = new StreamingContext(conf, Seconds(conf.getLong("spark.batch.window.size", 1L)))
  ssc.checkpoint("checkpoint")
  val topics = Set(conf.get("spark.kafka.topic.name"))
  val kafkaParams = Map[String, String]("metadata.broker.list" -> conf.get("spark.kafka.broker.list"))
  val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  kafkaStream.foreachRDD(rdd => {
    rdd.foreach(x => {
      println(x._2)
    })
  })
  kafkaStream.print()
  ssc.start()
  ssc.awaitTermination()
}
I am running this on CentOS 7. The command used for spark-submit is the following:
./bin/spark-submit --class com.rasa.cloud.prototype.spark.SimpleSparkStreaming \
--conf spark.yarn.executor.memoryOverhead=256 \
--conf spark.yarn.driver.memoryOverhead=384 \
--conf spark.kafka.topic.name=test \
--conf spark.kafka.broker.list=172.31.45.218:9092 \
--conf spark.batch.window.size=1 \
--conf spark.app.name="Simple Spark Kafka application" \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 192m \
--executor-memory 128m \
--executor-cores 1 \
/home/centos/spark-poc/target/lib/spark-streaming-prototype-0.0.1-SNAPSHOT.jar
Any help is greatly appreciated
Regards,
Apoorva
Try increasing the number of executor cores. In your example the only core is dedicated to consuming the streaming data, leaving no cores to process the incoming data.
It could be a memory leak... Have you tried conf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")?
This is not a Kafka answer; it is isolated to Spark and how its cataloguing system is poor when it comes to consistent persistence and large operations. If you are consistently writing to a persistence layer (e.g. in a loop, re-persisting a DF after a large operation and then running it again) or running a large query (e.g. inputDF.distinct.count), the Spark job will begin placing some data into memory without efficiently removing the objects that are stale.
This means that over time an operation that initially ran quickly will steadily slow down until no memory remains available. To see this at home, spin up an AWS EMR cluster with a large DataFrame loaded into the environment and run the query below:
var iterator = 1
val endState = 15
var currentCount = 0L
while (iterator <= endState) {
  currentCount = inputDF.distinct.count
  println("The number of unique records are : " + currentCount)
  iterator = iterator + 1
}
While the job is running, watch the Spark UI's memory management. If the DF is sufficiently large for the session, you will start to notice the run-time degrading with each subsequent run, mainly because blocks are becoming stale but Spark is unable to identify when to clean them up.
The best solution I have found to this problem was to write my DF out to storage, clear the persistence layer, and load the data back in. It is a "sledge-hammer" approach, but for my business case it was the easiest solution to implement, and it cut the run-time of our large tables by roughly 90% (from 540 minutes down to around 40, with less memory).
The code I currently use is:
val interimDF = inputDF.action
interimDF.write.format(...).option("...","...").save("...")
spark.catalog.clearCache()
val reloadedDF = spark.read.format(...).option("...","...").load("...").persist
reloadedDF.count
Here is a derivative if you don't unpersist DFs in child sub-processes:
val interimDF = inputDF.action
interimDF.write.format(...).option("...","...").save("...")
for ((k, v) <- sc.getPersistentRDDs) {
  v.unpersist()
}
val reloadedDF = spark.read.format(...).option("...","...").load("...").persist
reloadedDF.count
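And a rough PySpark equivalent of the same write / clear-cache / reload pattern, where input_df, the transformation, and the output path are placeholders:
interim_df = input_df.distinct()                             # some expensive transformation
interim_df.write.mode("overwrite").parquet("/tmp/interim")   # materialize to storage (placeholder path)
spark.catalog.clearCache()                                   # drop stale cached blocks
interim_df = spark.read.parquet("/tmp/interim").persist()
interim_df.count()                                           # force the reload and cache it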
