Pyspark GroupBy and count too slow - apache-spark

I am running PySpark on a Dataproc cluster with 4 nodes, each node having 2 cores and 8 GB RAM.
I have a DataFrame with a column containing a list of words. I exploded this column and counted the number of occurrences using:
df.groupBy("exploded_col").count()
Before exploding, there were ~78 million rows.
But running the above code takes too long (more than 4 hours). Why is Spark taking an unusually long time? I'm still new to Spark, so I'm not fully aware of the appropriate settings for dealing with data this large.
I have the following settings for the Spark session:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("yarn") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 20)
spark.conf.set("spark.num.executors", 100)
spark.conf.set("spark.executor.cores", 1)
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
I even set "spark.sql.shuffle.partitions" to 2001, but that didn't work either.
Please help.

The main reason for the poor performance is that groupBy usually causes a data shuffle between the executors. You can use the built-in Spark function countDistinct like this:
from pyspark.sql.functions import countDistinct
df.agg(countDistinct("exploded_col"))
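Note that countDistinct returns a single number (how many distinct words there are), while the original groupBy("exploded_col").count() produces one count per word, so the two are not interchangeable. If per-word counts are what you need, the minimal sketch below keeps the groupBy but sizes the shuffle to the cluster; the partition number is an assumption (4 nodes x 2 cores = 8 cores, ~3 partitions per core), not a value from the post.
# On YARN the executor count is controlled by spark.executor.instances
# (spark.num.executors is not a recognized Spark property) and must be set
# before the session is created.
spark.conf.set("spark.sql.shuffle.partitions", 24)

word_counts = df.groupBy("exploded_col").count()
word_counts.orderBy("count", ascending=False).show(20)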

Related

Why my spark dataframe has only 1 partition?

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[4]").getOrCreate()
df = spark.read.csv("annual-enterprise-survey-2021-financial-year-provisional-size-bands-csv.csv")
df.createOrReplaceTempView("table")
sqldf = spark.sql('SELECT _c5 FROM table WHERE _c5 > "1000"')
print(sqldf.count())
print(df.rdd.getNumPartitions())
print(sqldf.rdd.getNumPartitions())
I am trying to see the effect of parallelism in Spark. How can I decide how many partitions I will have when running actions on my DataFrame? In the code above, the number of partitions printed is 1 in both cases, and in the UI it shows 1 task for the count job. Shouldn't Spark create 4 tasks (the number of cores on my local machine) and then do the count operation faster?
Partitions and workers are not mapped one to one, although they can be.
local[4] sets the number of worker threads used in local mode. To control the number of partitions of a DataFrame, you can use the repartition or coalesce functions.
For example, you can write
sqldf = sqldf.repartition(4)
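To confirm the effect, you can check the partition count and re-run the count (this reuses the sqldf from the question):
sqldf = sqldf.repartition(4)
print(sqldf.rdd.getNumPartitions())  # now prints 4
print(sqldf.count())  # the scan/count stage should now show 4 tasks in the UI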

Dataproc Didn't Process Big Data in Parallel Using pyspark

I launched a Dataproc cluster in GCP, with one master node and 3 worker nodes. Every node has 8 vCPUs and 30 GB of memory.
I developed a PySpark job which reads one CSV file from GCS. The CSV file is about 30 GB in size.
df_raw = (
    spark
    .read
    .schema(schema)
    .option('header', 'true')
    .option('quote', '"')
    .option('multiline', 'true')
    .csv(infile)
)
df_raw = df_raw.repartition(20, "Product")
print(df_raw.rdd.getNumPartitions())
Here is how I submitted the PySpark job to Dataproc:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION} \
I got a partition count of only 1.
I attached the node usage image here for your reference.
It seems only one vCore on one worker node was used.
How can I make this run in parallel, with multiple partitions, using all nodes and more vCores?
I tried repartitioning to 20, but it still only used one vCore on one worker node.
The PySpark default for shuffle partitions is 200, so I was surprised to see Dataproc didn't use all available resources for this kind of task.
This isn't a Dataproc issue, but a pure Spark/PySpark one.
To parallelize your data, it needs to be split into multiple partitions - a number larger than the number of executors (total worker cores) you have, e.g. roughly 2x or 3x that number.
There are various ways to do this, e.g.:
Split the data into files or folders, parallelize the list of files/folders, and work on each one (or use a data source that already does this and preserves the partitioning when Spark reads it).
Repartition your data after you get a Spark DataFrame, e.g. read the number of executors, multiply it by N, and repartition to that many partitions, as in the sketch below. When you do this, you must choose columns which divide your data well, i.e. into many parts, not just a few, e.g. by day or by customer ID, not by a status ID.
df = df.repartition(num_partitions, 'partition_by_col1', 'partition_by_col2')
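A minimal sketch of that second option, assuming the executor settings are readable from the SparkConf at runtime; the default values and the x3 multiplier are illustrative assumptions, and "Product" is the column from the question:
# Derive a partition count from the executor configuration.
conf = spark.sparkContext.getConf()
executor_cores = int(conf.get("spark.executor.cores", "1"))
executor_instances = int(conf.get("spark.executor.instances", "2"))
num_partitions = executor_cores * executor_instances * 3  # ~3 partitions per core

df_raw = df_raw.repartition(num_partitions, "Product")
print(df_raw.rdd.getNumPartitions())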
The code runs on the master node and the parallel stages are distributed amongst the worker nodes, e.g.
df = (
    df.withColumn(...).select(...)...
    .write(...)
)
Since Spark functions are lazy, they only run when you reach a step like write or collect which causes the DF to be evaluated.
You might want to try increasing the number of executors by passing Spark configuration via the --properties flag of the Dataproc command line, e.g.:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION} \
--properties=spark.executor.instances=5

Spark load from Elasticsearch: number of executor and partitions

I'm trying to load data from an Elasticsearch index into a DataFrame in Spark. My machine has 12 CPUs with 1 core each. I'm using PySpark in a Jupyter notebook with the following Spark config:
pathElkJar = currentUserFolder + "/elasticsearch-hadoop-" + connectorVersion + "/dist/elasticsearch-spark-20_2.11-" + connectorVersion + ".jar"
spark = SparkSession.builder \
    .appName("elastic") \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()
Now whether I do:
df = es_reader.load()
or:
df = es_reader.load(numPartitions=12)
I get the same output from the following prints:
print('Master: {}'.format(spark.sparkContext.master))
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))
print('Number of executors:{}'.format(spark.sparkContext._conf.get('spark.executor.instances')))
print('Partitioner: {}'.format(df.rdd.partitioner))
print('Partitions structure: {}'.format(df.rdd.glom().collect()))
Master: local[*]
Number of partitions: 1
Number of executors: None
Partitioner: None
I was expecting 12 partitions, which I can only obtain by calling repartition() on the DataFrame. Furthermore, I thought that the number of executors would default to the number of CPUs. But even after doing the following:
spark.conf.set("spark.executor.instances", "12")
I can't manually set the number of executors. It is true that I have 1 core for each of the 12 CPUs, but how should I go about it?
I modified the configuration when creating the Spark session (changing it afterwards without restarting obviously leads to no changes), specifying the number of executors as follows:
spark = SparkSession.builder \
    .appName("elastic") \
    .config("spark.jars", pathElkJar) \
    .config("spark.executor.instances", "12") \
    .enableHiveSupport() \
    .getOrCreate()
I now correctly get 12 executors. Still, I don't understand why this doesn't happen automatically, and why the number of partitions when loading the DataFrame is still 1. I would expect it to be 12, matching the number of executors; am I right?
The problem regarding the executors and partitioning arose from the fact that I was using Spark in local mode, which allows for at most one executor. Using YARN or another resource manager such as Mesos solved the problem.
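For reference, a sketch of the same session pointed at a cluster manager instead of local mode, so that spark.executor.instances can actually take effect (this assumes a YARN cluster is available; pathElkJar is the jar path from the question):
spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config("spark.jars", pathElkJar) \
    .config("spark.executor.instances", "12") \
    .enableHiveSupport() \
    .getOrCreate()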

Pyspark crashing on Dataproc cluster for small dataset

I am running a Jupyter notebook on a GCP Dataproc cluster consisting of 3 worker nodes and 1 master node of type n1-standard-2 (2 cores, 7.5 GB RAM) for my data science project. The dataset consists of ~0.4 million rows. I have called a groupBy function where the groupBy column has only 10 unique values, so the output DataFrame should consist of just 10 rows!
It's surprising that it crashes every time I call grouped_df.show() or grouped_df.toPandas(), where grouped_df is obtained after calling groupBy() and sum().
This should be a cakewalk for Spark, which was originally built for processing large datasets. I am attaching the Spark config I am using, which I have defined in a function.
builder = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("local[*]") \
    .config("spark.driver.memory", "40G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
    .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
return builder.getOrCreate()
This is the error I am getting. Please help.
Setting the master's URL in setMaster() helped. Now I can load data as large as 20 GB and perform groupBy() operations on the cluster as well.
Thanks @mazaneicha.
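For completeness, a sketch of the fix described above: the builder points at YARN instead of local[*] so the worker nodes are actually used. The 4G driver memory is an assumption sized to the n1-standard-2 master, not a value from the original post; the remaining config options from the question stay unchanged.
builder = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("yarn") \
    .config("spark.driver.memory", "4G") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1")
spark = builder.getOrCreate()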

Spark Cluster configuration

I'm using a Spark cluster with two nodes, each having two executors (each using 2 cores and 6 GB of memory).
Is this a good cluster configuration for faster execution of my Spark jobs?
I am fairly new to Spark and I am running a job on 80 million rows of data, which includes shuffle-heavy tasks like aggregations (count) and join operations (a self join on a DataFrame).
Bottlenecks:
It shows insufficient resources for my executors while reading the data.
Even on a smaller dataset, it takes a lot of time.
What should my approach be, and how can I remove these bottlenecks?
Any suggestion would be highly appreciated.
query = "(Select x, y, z from table) as df"
jdbcDF = spark.read.format("jdbc").option("url", mysqlUrl) \
    .option("dbtable", query) \
    .option("user", mysqldetails[2]) \
    .option("password", mysqldetails[3]) \
    .option("numPartitions", "1000") \
    .load()
This gives me a DataFrame for which jdbcDF.rdd.getNumPartitions() returns 1. Am I missing something here? I think I am not parallelizing my dataset.
There are different ways to improve the performance of your application. Below are some points which may help.
Try to reduce the number of records and columns you process. As you have mentioned you are new to Spark, you might not need all 80 million rows, so filter the rows down to whatever you require. Also, select only the columns you need rather than all of them.
If you use some data frequently, consider caching it so that subsequent operations read it from memory.
If you are joining two DataFrames and one of them is small enough to fit in memory, consider a broadcast join.
Increasing resources might not improve the performance of your application in all cases, but looking at your cluster configuration, it should help. It might be a good idea to add more resources and check the performance.
You can also use the Spark UI to monitor your application and see whether a few tasks are taking much longer than others; if so, you probably need to deal with skew in your data.
You can also consider partitioning your data based on the columns you use in your filter criteria; for the JDBC read itself, see the sketch below.
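On the JDBC read from the question: numPartitions alone does not split the read; Spark also needs partitionColumn, lowerBound, and upperBound to slice the query into parallel range scans. A sketch follows; the choice of "x" as the partition column and the bounds are placeholder assumptions (the column must be numeric, date, or timestamp and reasonably evenly distributed):
jdbcDF = spark.read.format("jdbc").option("url", mysqlUrl) \
    .option("dbtable", query) \
    .option("user", mysqldetails[2]) \
    .option("password", mysqldetails[3]) \
    .option("partitionColumn", "x") \
    .option("lowerBound", "1") \
    .option("upperBound", "80000000") \
    .option("numPartitions", "40") \
    .load()
print(jdbcDF.rdd.getNumPartitions())  # should now report up to 40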

Resources