Not able to set number of shuffle partition in pyspark - apache-spark

I know that by default the number of partitions for tasks is set to 200 in Spark, but I can't seem to change this. I'm running Jupyter with Spark 1.6.
I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:
from pyspark.sql.functions import *
sqlContext.sql("set spark.sql.shuffle.partitions=10")
test= sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()
The output confirms 200 partitions. The activity log shows it spinning up 200 tasks, which is overkill. It seems like line number 2 above is ignored. So I tried the following:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5)
and create a new cell:
print test.rdd.getNumPartitions()
print test.count()
The output shows 5 partitions, but the log shows 200 tasks being spun up for the count, with the repartition to 5 taking place afterwards. However, if I convert it to an RDD first and back to a DataFrame as follows:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5).rdd
and create a new cell:
print test.getNumPartitions()
print test.toDF().count()
The very first time I ran the new cell, it still ran with 200 tasks. However, the second time I ran the new cell, it ran with 5 tasks.
How can I make the code run with 5 tasks the very first time it runs?
Would you mind explaining why it behaves this way (I specify the number of partitions, but it still runs with the default setting)? Is it because the default Hive table was created using 200 partitions?

At the beginning of your notebook, do something like this:
from pyspark import SparkConf, SparkContext
sc.stop()  # stop the context the notebook created for you
conf = SparkConf().setAppName("test")
conf.set("spark.default.parallelism", "10")
sc = SparkContext(conf=conf)
When the notebook starts, a SparkContext has already been created for you, but you can still change the configuration and recreate it.
As for spark.default.parallelism, I understand it is what you need; take a look here:
Default number of partitions in RDDs returned by transformations like
join, reduceByKey, and parallelize when not set by user.

Related

Why does Spark crossJoin take so long for a tiny dataframe?

I'm trying to do the following crossJoin on two dataframes with 5 rows each, but Spark spawns 40000 tasks on my machine and it took 30 seconds to achieve the task. Any idea why that is happening?
df = spark.createDataFrame([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']]).toDF('a','b')
df = df.repartition(1)
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
You call .distinct() before the join. It requires a shuffle, so it repartitions the data based on the spark.sql.shuffle.partitions property value. Thus df.select('a').distinct() and df.select('b').distinct() result in new DataFrames with 200 partitions each, and 200 x 200 = 40000.
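The arithmetic above can be checked with a small sketch. This is plain Python, not Spark: the function name cross_join_tasks is made up for illustration, and it assumes the cross join schedules one task per pair of input partitions, as the answer describes.

```python
def cross_join_tasks(left_partitions, right_partitions):
    """Task count for a cross join, assuming one task per partition pair."""
    return left_partitions * right_partitions


# 200 is the default value of spark.sql.shuffle.partitions.
default_shuffle_partitions = 200

# Both sides went through distinct(), so both have 200 partitions.
print(cross_join_tasks(default_shuffle_partitions, default_shuffle_partitions))

# With the shuffle partitions set to 1 on both sides, a single task remains.
print(cross_join_tasks(1, 1))
```

Under that assumption, lowering the shuffle partition count shrinks the task count quadratically, which is why it matters so much for a 5-row input.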
Two things: it looks like you cannot directly control the number of partitions a DF is created with, so we can first create an RDD instead (where you can specify the number of partitions) and convert it to a DF. You can also set the shuffle partitions to '1'. Together these ensure you have just 1 partition during the whole execution and should speed things up.
Just note that this shouldn't be an issue at all for larger datasets, for which Spark is designed (for a dataset of this size it would be faster to get the same result without using Spark at all). So in the general case you won't really need to do things like this; instead, tune the number of partitions to your resources and data.
spark.conf.set("spark.default.parallelism", "1")
spark.conf.set("spark.sql.shuffle.partitions", "1")
df = spark.sparkContext.parallelize([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']], 1).toDF(['a','b'])
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
spark.conf.set sets the configuration for a single execution only. If you want more permanent changes, make them in the actual Spark conf file.

Unable to change number of partitions in Pyspark with Spark 3.0.1

I'm using Pyspark on Spark 3.0.1 on Windows 10 locally for testing and developing, and regardless of what I try the number of processes spawned is always 200 which is way too many for my small test cases.
I'm creating my Spark-SQL context like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_test").master("local")\
.config('spark.shuffle.partitions', '16')\
.config('spark.adaptive.enabled', 'True')\
.config("spark.adaptive.coalescePartitions.enabled", "True").getOrCreate()
Doing print(spark.sparkContext._conf.getAll()) later shows that the parameters have been correctly set (host censored by me):
[('spark.master', 'local'),
('spark.driver.host', '**************'),
('spark.app.name', 'pyspark_test'),
('spark.adaptive.enabled', 'True'),
('spark.rdd.compress', 'True'),
('spark.adaptive.coalescePartitions.enabled', 'True'),
('spark.driver.port', '58352'),
('spark.serializer.objectStreamReset', '100'),
('spark.submit.pyFiles', ''),
('spark.shuffle.partitions', '16'),
('spark.executor.id', 'driver'),
('spark.submit.deployMode', 'client'),
('spark.app.id', 'local-1602571079244')]
I'm executing the task using spark-submit in the console, so each SparkSession should be created new with the given config.
My code contains a groupBy, an inner join, and a write.csv at the end. The csv output is the main issue here.
When I do a coalesce(1) before writing the csv, it takes 3 minutes to collect 200 pieces of data into one; the output csv is 338 KB. In the Stages Overview I can see that it only runs 2 tasks in parallel while going through the 200 pieces. Without the coalesce, it just writes 200 separate csv files of 2 KB each, which also takes around 3 minutes.
My input data is two csv files with the sizes 3.8MB and 826KB.
I tried this with and without enabling adaptive optimization, but it feels like my settings are being ignored anyway.
I am aware of this related question but that was three and a half years ago on V1.6.
Also I did experiment with first creating a SparkContext, setting and getting a conf, stopping the SparkContext and using the conf for my SparkSession, but that didn't help either.
So my simple question is: Why is my setting of spark.shuffle.partitions being ignored and how do I fix this?
I do feel a bit stupid now.
I need to set spark.sql.shuffle.partitions and not spark.shuffle.partitions.
I was expecting Spark to throw an error on getting a setting that doesn't exist and when that didn't happen I thought it was okay.
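Since Spark accepts unknown keys silently, one way to catch this kind of typo is to check the keys yourself before building the session. Below is a minimal sketch in plain Python. The helper unknown_keys and the KNOWN_KEYS allow-list are made up for illustration and are not part of Spark; the allow-list only covers the settings discussed here. As far as I know the adaptive settings in the question have the same problem, since their Spark 3.x names are spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled.

```python
# Hypothetical allow-list: only the keys this thread is about, not all of Spark's.
KNOWN_KEYS = {
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.default.parallelism",
}


def unknown_keys(settings):
    """Return the keys in `settings` that are not in the allow-list, sorted."""
    return sorted(key for key in settings if key not in KNOWN_KEYS)


# The settings exactly as written in the question.
settings = {
    "spark.shuffle.partitions": "16",
    "spark.adaptive.enabled": "True",
    "spark.adaptive.coalescePartitions.enabled": "True",
}

for key in unknown_keys(settings):
    print("unrecognised key:", key)
```

All three keys from the question are flagged; adding the missing "sql." segment makes the list come back empty.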

DropDuplicates in PySpark gives stackoverflowerror

I have a PySpark program which reads a JSON file of around 350-400 MB and creates a dataframe out of it.
In my next step, I create a Spark SQL query using createOrReplaceTempView and select a few columns as required.
Once this is done, I filter my dataframe with some conditions. It was working fine until this point.
Now I needed to remove some duplicate values using a column. So I introduced dropDuplicates in the next step, and it suddenly started giving me a StackOverflowError.
Below is the sample code:
def create_some_df(initial_df):
    initial_df.createOrReplaceTempView('data')
    original_df = spark.sql('select c1,c2,c3,c4 from data')
    # Filter out some events
    original_df = original_df.filter(filter1condition)
    original_df = original_df.filter(filter2condition)
    original_df = original_df.dropDuplicates(['c1'])
    return original_df
It worked fine until I added dropDuplicates method.
I am using 3 node AWS EMR cluster c5.2xlarge
I am running PySpark using spark-submit command in YARN client mode
What I have tried
I tried adding persist and cache before calling filter, but it didn't help
EDIT - Some more details
I realise that the error appears when I invoke my write function after multiple transformations, i.e. at the first action.
If I have dropDuplicates in my transformation before I write, it fails with error.
If I do not have dropDuplicates in my transformation, write works fine.

Writing a dataframe to disk taking an unrealistically long time in Pyspark (Spark 2.1.1)

I'm running Pyspark on a single server with multiple CPUs. All other operations (reading, joining, filtering, custom UDFs) are executed quickly except for writing to disk. The dataframe I'm trying to save is of size around ~400 gb with 200 partitions.
sc.getConf().getAll()
The driver memory is 16g, and working directory has enough space (> 10TB)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
Wondering if anyone has run into the same issue. Also will changing any of the config parameters before pyspark is invoked help solve the issue?
Edits (a few clarifications):
When I said other operations were executed quickly, there was always an action after the transformation; in my case these were row counts. So all the operations were executed super fast. I still haven't figured out why writing takes such a ridiculous amount of time.
One of my colleagues brought up the fact that the disks in our server might have a limit on concurrent writing which might be slowing things down; still investigating this. Interested in knowing if others are seeing slow write times on a Spark cluster too. I have confirmation from one user regarding this on an AWS cluster.
All other operations (reading, joining, filtering, custom UDFs)
That is because these are transformations: they don't do anything until data has to be saved.
The dataframe I'm trying to save is of size around ~400 gb
(...)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
That just cannot work well. Even ignoring the part where you use a single machine, saving 400 GB with a single thread (!) is just hopeless. Even if it succeeds, it is no better than using a plain bash script.
Skipping over Spark: sequential writes of 400 GB will take a substantial amount of time, even on an average-size disk. And given multiple shuffles (join, repartition), data will be written to disk multiple times.
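To put a rough number on "a substantial amount of time", here is a back-of-the-envelope sketch in plain Python. The 150 MB/s sustained write speed is purely an assumed figure for illustration, not a measurement from the question, and write_seconds is a made-up helper name.

```python
def write_seconds(size_gb, throughput_mb_per_s):
    """Time to write `size_gb` sequentially at a fixed throughput (1 GB = 1024 MB)."""
    return size_gb * 1024 / throughput_mb_per_s


# Assumption: a single writer sustaining 150 MB/s; real disks will differ.
ASSUMED_THROUGHPUT_MB_PER_S = 150

seconds = write_seconds(400, ASSUMED_THROUGHPUT_MB_PER_S)
print("one pass over 400 GB: about %.0f minutes" % (seconds / 60))

# Each shuffle (join, repartition) writes the data again, so the cost multiplies.
print("three passes: about %.0f minutes" % (3 * seconds / 60))
```

This only covers raw disk time for one writer; serialization and the shuffles themselves come on top of it.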
After a lot of trial and error, I realized that the issue was due to the method I used to read the file from disk. I was using the built-in read.csv function, and when I switched over to the read function in the databricks-csv package the problem went away. I'm now able to write files to disk in a reasonable time. It's really strange; maybe it's a bug in 2.1.1, or the databricks csv package is really well optimized.
1. read.csv method
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("model") \
.config("spark.worker.dir", "xxxx") \
.getOrCreate()
df = spark.read.load("file.csv", format="csv", header = True)
df.write.csv("file_after_processing.csv")
2. Using the databricks-csv package
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')

Executing local (driver) code iteratively on Spark DataFrame items

I am using Spark, Dataframes and Python.
Let's say I have a quite large dataframe, with every Row containing some JPG images as binary data. I want to build some kind of browser to display every image sequentially.
I have a view function that takes a single row as input and does something like this:
def view(row):
    window = popup_window_that_display_image(row.image)
    waitKey()
    destroy_window(window)
The following code works fine with spark-submit option --master local[*]:
df = load_and_compute_dataframe(context, some_arguments)
df.foreach(view)
Obviously, the view function cannot run on remote Spark executors. So the above code fails in yarn-client mode.
I can use the following code to work in yarn-client mode:
df = load_and_compute_dataframe(context, some_arguments)
data = df.limit(10).collect()
for x in data:
    view(x)
The drawback is that I can only collect a few items. The data is too large to get more than 10 or 100 items at once.
So my questions are:
Is there a way to have some DF/RDD operation execute locally on the driver instead of on the executors?
Is there something that allows me to collect 10 items from a DF, starting from the 11th? Should I try to add an "ID" column to my DF and iterate over it (ugly)?
Is there any other way to achieve this result?
Thanks for the help!
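One possible direction for the second question, offered as a sketch rather than a tested solution: PySpark DataFrames have a toLocalIterator() method that streams rows to the driver instead of collecting them all at once, and the rows can then be consumed in fixed-size batches. The batches helper below is plain Python and made up for illustration; a range stands in for df.toLocalIterator() so the snippet runs without a cluster.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in for df.toLocalIterator(); a real DataFrame is assumed, not shown.
fake_rows = range(25)

for chunk in batches(fake_rows, 10):
    # In the real program, each item in `chunk` would be passed to view().
    print(len(chunk))
```

Because the helper only pulls items as they are needed, at most one batch is held in the list at a time, whatever the size of the source.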

Resources