Spark repartition / executor inconsistencies: command line vs Jupyter - apache-spark

I wasn't really sure what to title this question -- happy to take a suggestion for a better summary.
I'm beating my head against the wall trying to figure out why a dead-simple Spark job works fine from Jupyter, but from the command line is left with insufficient executors to progress.
What I'm trying to do: I have a large amount of data (<1TB) from which I need to extract a small amount of data (~1GB) and save it as Parquet.
Problem I have: when my dead-simple code is run from the command line, I only get as many executors as I have final partitions, which is ideally one given how small the output is. The exact same code works just fine in Jupyter, on the same cluster, where it spreads >10k tasks across my entire cluster. The command-line version never progresses. Since it doesn't produce any logs beyond reporting the lack of progress, I'm not sure where else to dig.
I have tried both python3 mycode.py and spark-submit mycode.py with lots of variations, to no avail. My cluster has dynamicAllocation configured.
import findspark
findspark.init('/usr/lib/spark/')
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.read.parquet(<datapath>).select(<fields>)
subset = [<list of items>]
spark.sparkContext.broadcast(subset)  # note: the broadcast handle is never captured or used
data.filter(field.isin(subset)).coalesce(1).write.parquet("output")
** edit: the original version of this post mistakenly had repartition(1) instead of coalesce(1).
In this case, when run from the command line, my process gets only one executor.
In my logs, the only real hint I get is
WARN TaskSetManager: Stage 1 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
which makes sense given the lack of resources being allocated.
I have tried to manually force the number of executors using spark-submit runtime settings. In that case, it starts with my initial setting and then immediately scales the executors back down until there is only one left and nothing progresses.
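For reference, the kind of overrides I tried looked roughly like this (the flag values are illustrative rather than the exact ones I used, and --master yarn is an assumption about the cluster manager):

spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  mycode.py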
Any ideas? thanks.

I ended up phoning a friend on this one...
The code that ran fine in JupyterHub, but not via the command line, was essentially:
read parquet,
filter on some small field,
coalesce(1),
write parquet.
I had assumed that coalesce(1) and repartition(1) should have the same results -- even though coalesce(N) and repartition(N) do not -- given that they all go to one partition.
According to my friend, because coalesce(1) avoids a shuffle, Spark can collapse the entire upstream read-and-filter into that single partition, so the whole job runs as one task -- which was the behavior I saw. repartition(1) forces a shuffle, so the upstream stages keep their parallelism and only the final write happens in one partition. By changing it to repartition(1), everything works fine.
I still have no idea why it works fine in JupyterHub -- having done >20 experiments -- and never on the command line -- also >20 experiments.
But, if you want to reduce your data lake to a data puddle this way, use repartition(1), or repartition(n) where n is small, instead of coalesce.
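A minimal sketch of the working version (the path, field name, and subset are placeholders carried over from the question):

# repartition(1) inserts a shuffle, so the read/filter stages keep their
# parallelism and only the final write runs as a single task.
(data
    .filter(data["field"].isin(subset))   # "field" is a placeholder column name
    .repartition(1)
    .write.parquet("output"))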

Related

Spark generates different execution plans for reading Orc table

We have a PySpark script that basically joins 3 large tables and aggregates the results. When running the script from a Jupyter notebook, everything is fine and fast - it completes without errors in a reasonable amount of time.
Then we try to launch the script as part of our pipeline (it runs in Airflow, but I don't think that matters), and it fails most of the time. We have a lot of lost executors and retries, and it usually fails at the end, although a couple of times it did succeed.
I've noticed in the Spark UI that the amount of data the 'pipeline' script reads in one of the stages is larger than what the 'notebook' one reads (something like 700GB instead of 500GB), and it writes more shuffle data. That surprised me, as the scripts are exactly the same (loaded from Git), and they use the same tables and the same dates. I've checked the execution plans and found only one difference - for one of the tables, the 'notebook' script generates a 'Scan hive' action, whereas the 'pipeline' generates 'FileScan orc'. The other two tables are accessed in the same way by both environments - one of them with 'Scan hive' and the other with 'FileScan orc'. That also puzzles me a bit, but as long as the 'notebook' version runs fine, I don't mind.
The 'pipeline' environment is more powerful than the 'notebook' one: it has more executors with more memory. They run in the same Kubernetes cluster. The Spark version is 3.0.1 for both of them. All the tables are read using spark.read(table_name). The tables are in Orc format and are stored in HDFS on-prem.
I'm not sure if this is the source of the problem, but apparently there are some differences in the configuration between those environments, so I would like to know why the operation is different. I would be grateful for any clue on how Spark decides which method to use.
We tried increasing the executor memory settings, increasing the number of executors, playing with the number of partitions, setting spark.sql.hive.convertMetastoreOrc, and so on - nothing helps, it still fails most of the time.
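One way to check whether the read path is the cause (a sketch, not the asker's code; the table name is a placeholder): pin spark.sql.hive.convertMetastoreOrc to the same value in both environments and compare the plans.

# "false" keeps the Hive SerDe path ('Scan hive'); "true" uses Spark's
# native ORC reader ('FileScan orc'). Set it before the table is first resolved.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

df = spark.table("my_orc_table")   # placeholder table name
df.explain()                       # compare the plan between notebook and pipeline runs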

Writing a dataframe to disk taking an unrealistically long time in Pyspark (Spark 2.1.1)

I'm running PySpark on a single server with multiple CPUs. All other operations (reading, joining, filtering, custom UDFs) execute quickly, except for writing to disk. The dataframe I'm trying to save is around 400 GB with 200 partitions.
sc.getConf().getAll()
The driver memory is 16g, and the working directory has enough space (>10TB).
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
Wondering if anyone has run into the same issue. Also will changing any of the config parameters before pyspark is invoked help solve the issue?
Edits (a few clarifications):
When I say the other operations were executed quickly, I mean there was always an action after each transformation; in my case those actions were row counts. So all those operations executed super fast. I still haven't figured out why writing takes such a ridiculous amount of time.
One of my colleagues brought up the fact that the disks in our server might have a limit on concurrent writes, which might be slowing things down; I'm still investigating this. I'm interested to know whether others are seeing slow write times on a Spark cluster too. I have confirmation from one user regarding this on an AWS cluster.
All other operations (reading, joining, filtering, custom UDFs)
These are fast because they are transformations - they don't do anything until the data has to be saved.
The dataframe I'm trying to save is of size around ~400 gb
(...)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
That just cannot work well. Even ignoring the part where you use a single machine, saving 400GB with a single thread (!) is just hopeless. Even if it succeeds, it is no better than using a plain bash script.
Skipping over Spark - sequential writes of 400GB will take a substantial amount of time, even on an average disk. And given multiple shuffles (join, repartition), the data will be written to disk multiple times.
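If a single output file isn't strictly required, a minimal sketch of the parallel alternative (the path and partition count are illustrative, not from the question):

# Write with many partitions so all cores participate; Spark produces one
# output file per partition, which can be merged afterwards if one CSV is needed.
df.repartition(200).write.mode("overwrite").csv("/data/out_csv")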
After a lot of trial and error, I realized that the issue was due to the method I used to read the file from disk. I was using the built-in read.csv function, and when I switched over to the read function in the databricks-csv package, the problem went away. I'm now able to write files to disk in a reasonable time. It's really strange; maybe it's a bug in 2.1.1, or the databricks-csv package is really optimized.
1. read.csv method
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("model") \
    .config("spark.worker.dir", "xxxx") \
    .getOrCreate()

df = spark.read.load("file.csv", format="csv", header=True)
df.write.csv("file_after_processing.csv")
2. Using the databricks-csv package
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')

Spark write to CSV fails even after 8 hours

I have a dataframe with roughly 200-600 GB of data that I am reading, manipulating, and then writing to CSV using the Spark shell (Scala) on an Elastic MapReduce cluster.
Here's how I'm writing to CSV:
result.persist.coalesce(20000).write.option("delimiter",",").csv("s3://bucket-name/results")
The result variable is created through a mix of columns from some other dataframes:
var result=sources.join(destinations, Seq("source_d","destination_d")).select("source_i","destination_i")
Now, I am able to read the CSV data it is based on in roughly 22 minutes. In the same program, I'm also able to write another (smaller) dataframe to CSV in 8 minutes. However, for this result dataframe it takes 8+ hours and still fails... saying one of the connections was closed.
I'm running this job on 13 x c4.8xlarge instances on EC2, with 36 cores each and 60 GB of RAM, so I thought I'd have the capacity to write to CSV, especially after 8 hours.
Many stages required retries or had failed tasks, and I can't figure out what I'm doing wrong or why it's taking so long. I can see from the Spark UI that it never even got to the write-CSV stage and was busy with persist stages, but without the persist call it was still failing after 8 hours. Any ideas? Help is greatly appreciated!
Update:
I've ran the following command to repartition the result variable into 66K partitions:
val r2 = result.repartition(66000) // confirmed with getNumPartitions
r2.write.option("delimiter",",").csv("s3://s3-bucket/results")
However, even after several hours, the jobs are still failing. What am I doing wrong still?
note, I'm running spark shell via spark-shell yarn --driver-memory 50G
Update 2:
I've tried running the write with a persist first:
r2.persist(StorageLevel.MEMORY_AND_DISK)
But many stages failed, returning Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3, or saying Connection from ip-172-31-48-180.ec2.internal/172.31.48.180:7337 closed.
(Screenshots attached in the original post: the Executors page, Spark web UI pages for a node returning a shuffle error and for a node returning an EC2 connection-closed error, and the overall Job Summary page.)
I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas?
It is a FetchFailedException, i.e. a failure to fetch a shuffle block.
Since you are able to deal with small files and it only fails on huge data...
I strongly suspect there are not enough partitions.
First, verify/print sources.rdd.getNumPartitions, destinations.rdd.getNumPartitions, and result.rdd.getNumPartitions.
You need to repartition after the data is loaded in order to spread the data (via a shuffle) across the other nodes in the cluster. This will give you the parallelism you need for faster processing without failures.
Furthermore, to verify the other configurations that are applied,
print all the configs like this, and adjust them to the correct values as needed:
sc.getConf.getAll
Also have a look at:
SPARK-5928
Spark TaskRunner FetchFailedException - possible reasons: OOM or container memory limits
Repartition both sources and destinations before joining, with a number of partitions such that each partition is about 10MB-128MB (try to tune this); there is no need to make it 20000 (IMHO that's too many).
Then join on those two columns and write, without repartitioning again (i.e. the output partitions should be the same as the repartitioning done before the join).
If you still have trouble, try the same thing after converting both dataframes to RDDs (there are some differences between the APIs, especially regarding repartitioning, key-value RDDs, etc.).
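A sketch of that partitioning strategy, written in PySpark syntax for brevity (the question itself uses the Scala shell; the dataframe and column names follow the question, and the partition count of 2000 is illustrative):

# Repartition both sides on the join keys before the join, sized so that
# each partition lands roughly in the 10MB-128MB range.
n = 2000  # illustrative; pick n so that (total data size / n) falls in that range
s = sources.repartition(n, "source_d", "destination_d")
d = destinations.repartition(n, "source_d", "destination_d")

result = s.join(d, ["source_d", "destination_d"]).select("source_i", "destination_i")

# Write without repartitioning again; the output keeps the pre-join partitioning.
result.write.option("delimiter", ",").csv("s3://bucket-name/results")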

Spark on localhost

For testing purposes, while I don't have a production cluster, I am using Spark locally:
from pyspark import SparkConf, SparkContext

print('Setting SparkContext...')
sconf = SparkConf()
sconf.setAppName('myLocalApp')
sconf.setMaster('local[*]')
sc = SparkContext(conf=sconf)
print('Setting SparkContext...OK!')
Also, I am using a very, very small dataset, consisting of only 20 rows in a PostgreSQL database (~2kb).
Also(!), my code is quite simple as well, only grouping the 20 rows by a key and applying a trivial map operation:
params = [object1, object2]
rdd = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)) \
    .groupByKey() \
    .mapValues(lambda value: self.__data_interpolation(value, params))

def __data_interpolation(self, data, params):
    # TODO: only for testing
    return data
What bothers me is that the whole execution takes about 5 minutes!!
Inspecting the Spark UI, I see that most of the time was spent in Stage 6: byKey method. (Stage 7, collect() method was also slow...)
Some info:
These numbers make no sense to me... Why do I need 22 tasks, executing for 54 seconds, to process less than 1 KB of data?
Could it be a network issue, trying to figure out the IP address of localhost?
I don't know... Any clues?
It appears the main reason for the slow performance in your code snippet is the use of groupByKey(). The issue with groupByKey is that it ends up shuffling all of the key-value pairs, resulting in a lot of data being transferred unnecessarily. A good reference explaining this issue is Avoid GroupByKey.
To work around this issue, you can:
Try using reduceByKey, which should be faster (more info is also included in the Avoid GroupByKey link above).
Use DataFrames (instead of RDDs), as DFs include performance optimizations (and the DF GroupBy statement is faster than the RDD version). As well, since you're using Python, you can avoid the Python-to-JVM issues that come with PySpark RDDs. More information on this can be found in PySpark Internals.
By the way, reviewing the Spark UI diagram above, the #22 refers to the task # within the DAG (not the number of tasks executed).
HTH!
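A minimal sketch of both suggestions in PySpark (df and the key columns a, b, c come from the question; the per-key reduction is a placeholder count, since the real __data_interpolation logic isn't shown):

from operator import add

# Suggestion 1: if the per-key work can be expressed as a reduction,
# reduceByKey combines values map-side before the shuffle.
pair_rdd = df.rdd.map(lambda x: ((x.a, x.b, x.c), 1))
counts = pair_rdd.reduceByKey(add)          # placeholder reduction: rows per key

# Suggestion 2: the DataFrame equivalent, which benefits from Catalyst/Tungsten.
df_counts = df.groupBy("a", "b", "c").count()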
I suppose the "postgresql" is the key to solve that puzzle.
keyBy is probably the first operation that really uses the data so it's execution time is bigger as it needs to get the data from external database. You can verify it by adding at the beginning:
df.cache()
df.count() # to fill the cache
df.rdd.keyBy....
If I am right, you need to optimize the database. It may be:
Network issue (slow network to DB server)
Complicated (and slow) SQL on this database (try it using postgre shell)
Some authorization difficulties on DB server
Problem with JDBC driver you use
From what I have seen happening in my system while running spark:
When we run a Spark job, it internally creates map and reduce tasks and runs them. In your case, to process the data you have, it created 22 such tasks. The bigger the size of the data, the bigger that number may be.
Hope this helps.

Spark job out of RAM (java.lang.OutOfMemoryError), even though there's plenty. xmx too low?

I'm getting java.lang.OutOfMemoryError with my Spark job, even though only 20% of the total memory is in use.
I've tried several configurations:
1x n1-highmem-16 + 2x n1-highmem-8
3x n1-highmem-8
My dataset consists of 1.8M records, read from a local JSON file on the master node. The entire dataset in JSON format is 7GB. The job I'm trying to execute involves a simple computation followed by a reduceByKey. Nothing extraordinary. The job runs fine on my single home computer with only 32GB of RAM (xmx28g), although it requires some caching to disk.
The job is submitted through spark-submit, locally on the server (SSH).
Stack trace and Spark config can be viewed here: https://pastee.org/sgda
The code
val rdd = sc.parallelize(Json.load()) // load everything
  .map(fooTransform)                  // apply some trivial transformation
  .flatMap(_.bar.toSeq)               // flatten results
  .map(c => (c, 1))                   // count
  .reduceByKey(_ + _)
  .sortBy(_._2)
log.v(rdd.collect.map(toString).mkString("\n"))
The root of the problem is that you should try to offload more I/O to the distributed tasks instead of shipping it back and forth between the driver program and the worker tasks. While it may not always be obvious which calls are driver-local and which ones describe a distributed action, rules of thumb include avoiding parallelize and collect unless you absolutely need all of the data in one place. The amount of data you can Json.load() and then parallelize maxes out at whatever the largest available machine type can hold, whereas calls like sc.textFile theoretically scale to hundreds of TBs or even PBs without problem.
The short-term fix in your case would be to try passing spark-submit --conf spark.driver.memory=40g ... or something in that range. Dataproc defaults allocate less than a quarter of the machine to driver memory because commonly the cluster must support running multiple concurrent jobs, and also needs to leave enough memory on the master node for the HDFS namenode and the YARN resource manager.
Longer term you might want to experiment with how you can load the JSON data as an RDD directly, instead of loading it in a single driver and using parallelize to distribute it, since this way you can dramatically speed up the input reading time by having tasks load the data in parallel (and also getting rid of the warning Stage 0 contains a task of very large size which is likely related to the shipping of large data from your driver to worker tasks).
Similarly, instead of collect and then finishing things up on the driver program, you can do things like sc.saveAsTextFile to save in a distributed manner, without ever bottlenecking through a single place.
Reading the input with sc.textFile assumes line-separated JSON, which you can then parse inside a map task, or you can try using sqlContext.read.json. For debugging purposes, instead of using collect() it's often enough to just call take(10) to take a peek at some records without shipping all of them to the driver.
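A minimal sketch of that approach, in PySpark syntax for brevity (the original job is Scala, but the API calls are the same; the paths and the 'bar' field name are placeholders taken from the question's code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the JSON in parallel instead of parsing it on the driver and parallelize()-ing it.
# spark.read.json assumes line-delimited JSON; the path is a placeholder.
records = spark.read.json("hdfs:///data/records.json")

# Equivalent of the map/flatMap/reduceByKey pipeline, expressed on the RDD.
counts = (records.rdd
    .flatMap(lambda r: r.bar or [])        # 'bar' is the placeholder array field
    .map(lambda c: (c, 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda kv: kv[1]))

# Save in a distributed manner instead of collect()-ing everything to the driver.
counts.saveAsTextFile("hdfs:///data/bar_counts")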
