spark wholeTextFiles fails for large data - apache-spark

I use pyspark version 1.5.0 with Cloudera 5.5.0. All scripts are running fine except when I use sc.wholeTextFiles. Using this command gives an error:
Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max
However, I can't find the property spark.kryoserializer.buffer.max in the Spark web UI; it is not present under the Environment tab. The only "kryo" on this page is the value org.apache.spark.serializer.KryoSerializer for the property spark.serializer.
Why can't I see this property, and how can I fix the problem?
EDIT
It turns out that the Kryo error was caused by printing to the shell. Without the printing, the error is actually java.io.IOException: Filesystem closed!
The script now works correctly for a small portion of the data, but running it on all of the data (about 500GB, 10,000 files) returns this error.
I tried to pass in the property --conf "spark.yarn.executor.memoryOverhead=2000", and it seems to allow a slightly larger part of the data to be read, but it still ultimately fails on the full data. It takes 10-15 minutes of running before the error appears.
The RDD is big, but the error is produced even when only doing .count() on it.

You should pass this property when submitting the job; that is why it is not in the Cloudera UI.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html
In your case: --conf "spark.kryoserializer.buffer.max=64m" (for example).
Also, I'm not certain, but if you increase the Kryo buffer you may also need to increase the Akka frame size (spark.akka.frameSize).
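As a minimal sketch in PySpark (the input path and the memoryOverhead value are illustrative, not taken from the question), you can also set these properties on the SparkConf before the SparkContext is created, which makes them show up under the Environment tab:

# Minimal sketch for Spark 1.x: path and values are illustrative.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("wholeTextFiles-example")
        .set("spark.kryoserializer.buffer.max", "64m")       # raise the Kryo buffer ceiling
        .set("spark.yarn.executor.memoryOverhead", "2000"))  # extra off-heap headroom per executor
sc = SparkContext(conf=conf)

# wholeTextFiles yields (path, content) pairs; count() avoids pulling contents to the driver
print(sc.wholeTextFiles("hdfs:///path/to/files").count())

Passing the same keys with --conf at spark-submit time is equivalent; the point is that they must be in place before the driver starts.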

Related

PySpark job only partly running

I have a PySpark script that I am running locally with spark-submit in Docker. In my script I call toPandas() on a PySpark DataFrame, and afterwards I perform various manipulations of the DataFrame, finishing with a call to to_csv() to write the results to a local CSV file.
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before this method call and afterwards, but only the log entries before the call show up in the spark-submit console output. I thought that maybe this is because the rest of the code is run in a separate executor process by Spark, so the logs don't show on the console. If this is true, how can I see my application logs for the executor? I have enabled the event log with spark.eventLog.enabled=true, but this seems to show only internal events, not my actual application log statements.
Whether or not that assumption about executor logs is true, I don't see the CSV file written to the path I expect (/tmp). Furthermore, the history server says No completed applications found! when I start it configured to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application, and the only completed job listed there is for my toPandas() call.
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
When you use toPandas() to convert your Spark DataFrame into a pandas DataFrame, it is actually a heavy action, because it pulls all the records to the driver.
Remember that Spark is a distributed computing engine and does its computation in parallel. Your data is therefore distributed across different nodes, which is completely different from a pandas DataFrame: pandas works on a single machine, while Spark works on a cluster. You can check this post: "Why does python dataFrames' are localted only in the same machine?"
Back to your post, actually it covers 2 questions:
Why there are no logs after toPandas(): As mentioned above, Spark is a distributed computing engine. The event log only saves the job details that appear in the Spark computation DAG. Other, non-Spark logs are not saved there; if you really need those logs, you need to use an external library such as Python's logging module to collect them on the driver.
Why there is no CSV saved in the /tmp dir: You mention that the event log shows an incomplete application rather than a failed one, so I believe your DataFrame is so large that the collection has not finished and your pandas transformations have not even started. You can try collecting only a few records, say df.limit(20).toPandas(), to see whether it works. If it does, the DataFrame you are converting to pandas is simply so large that it takes a long time. If it does not, perhaps you can share more of the error traceback.
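Here is a minimal sketch of both ideas together, assuming a locally created SparkSession and illustrative names (my_app, the stand-in DataFrame, /tmp/sample.csv) rather than the asker's actual script: driver-side logging with the standard logging module, plus a bounded toPandas() via limit().

# Minimal sketch: names, the stand-in DataFrame, and the output path are illustrative.
import logging
from pyspark.sql import SparkSession

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("my_app")

spark = SparkSession.builder.appName("toPandas-sample").getOrCreate()
df = spark.range(0, 1000000).withColumnRenamed("id", "value")  # stand-in for the real DataFrame

log.info("collecting a small sample to the driver")
sample_pdf = df.limit(20).toPandas()             # only 20 rows cross over to the driver
log.info("collected %d rows", len(sample_pdf))
sample_pdf.to_csv("/tmp/sample.csv", index=False)
log.info("wrote /tmp/sample.csv")

Everything after toPandas() is plain pandas code, so it runs in the driver process, and these log lines and the CSV write happen on the machine where spark-submit was launched.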

PySpark: Job aborts due to stage failure, but resetting max size isn't recognized

I'm attempting to display a dataframe in PySpark after reading the files in using a function/subroutine. Reading the files in works fine, but it's the display that's not working. Actually, due to lazy evaluation, that may not be true.
I get this error
SparkException: Job aborted due to stage failure: Total size of serialized results of 29381 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
so I did what was suggested here: https://forums.databricks.com/questions/66/how-do-i-work-around-this-error-when-using-rddcoll.html
sqlContext.setConf("spark.driver.maxResultSize", "8g")
sqlContext.getConf("spark.driver.maxResultSize")
However, the bizarre part is that I get the same error back when I re-run the display(df) command.
It's like Spark is just ignoring my commands.
I've tried increasing the number of workers and making both the worker type and driver type larger, but neither of these fixed anything.
How can I get this to work? or is this a bug in Databricks/Spark?
It all depends on your code and how it is partitioned relative to the cluster size. Increasing spark.driver.maxResultSize is the first option to get past the error, but eventually you should look for a permanent fix by modifying the code or the design. Please avoid collecting large amounts of data to the driver node.
OR
You need to change this parameter in the cluster configuration. Go into the cluster settings, under Advanced select Spark, and paste in spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you. Using 0 is not recommended; you should instead optimize the job by repartitioning.
For more details, refer to "Spark Configuration - Application Properties".
Hope this helps. Do let us know if you have any further queries.
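As a hedged sketch of the second point, assuming you control session creation (the 8g value is only an example): spark.driver.maxResultSize is a core Spark setting that has to be in place when the driver starts, which is likely why setting it with sqlContext.setConf() on the running cluster had no visible effect.

# Minimal sketch: the value is illustrative; on Databricks, set this in the
# cluster's Spark config instead, since getOrCreate() just returns the
# already-running session with its original configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("maxResultSize-example")
         .config("spark.driver.maxResultSize", "8g")
         .getOrCreate())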

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

I'm trying to analyze a 500 MB dataset in Databricks. The data is stored in an Excel file. The first thing I did was install the Spark Excel package com.crealytics.spark.excel from Maven (latest version, 0.11.1).
These are the parameters of the cluster:
Then I executed the following code in Scala notebook:
val df_spc = spark.read
.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.load("dbfs:/FileStore/tables/test.xlsx")
But I got an error about the Java heap size, and then another error: "java.io.IOException: GC overhead limit exceeded". Then I executed this code again, and after 5 minutes of running got yet another error:
The spark driver has stopped unexpectedly and is restarting. Your
notebook will be automatically reattached.
I do not understand why this happens. The dataset is quite small for distributed computing, and the cluster size should be sufficient to process this data. What should I check to solve it?
I also got stuck in the same situation, where I was unable to process a 35,000-record xlsx file.
These are the workarounds I tried:
With the free Azure subscription and the 14-day pay-as-you-go mode, you can only process xlsx files with a smaller number of records. In my case, with the trial version, I had to cut it down to 25 records.
I also downgraded the worker type to Standard_F4S (8 GB memory, 4 cores, 0.5 DBU), with a 1-worker configuration.
I also added the options below:
sqlContext.read.format("com.crealytics.spark.excel")
  .option("location", "filename here...")
  .option("useHeader", "true")
  .option("treatEmptyValueAsNulls", "true")
  .option("maxRowsInMemory", 20)
  .option("inferSchema", "true")
  .load("filename here...")
I had this same issue. We reached out to Databricks, who provided us with this answer:
"In the past we were able to address this issue by simply restarting a cluster that has been up for a long period of time.
This issue occurs due to the fact that JVMs reuse memory locations too many times and start misbehaving."

Memory difference between pyspark and spark?

I have been trying to get a PySpark job to work which creates an RDD from a bunch of binary files, and then uses a flatMap operation to process the binary data into a bunch of rows. This has led to a bunch of out-of-memory errors, and after playing around with memory settings for a while I decided to get the simplest thing possible working, which is just counting the number of files in the RDD.
This also fails with an OOM error. So I opened up both spark-shell and PySpark and ran the commands in the REPL/shell with default settings; the only additional parameter was --master yarn. The spark-shell version works, while the PySpark version shows the same OOM error.
Is there that much overhead to running PySpark? Or is this a problem with binaryFiles being new? I am using Spark version 2.2.0.2.6.4.0-91.
The difference:
Scala will load records as a PortableDataStream - this means the process is lazy and, unless you call toArray on the values, won't load the data at all.
Python will call the Java backend, but it loads the data as a byte array. This part is eager-ish and therefore might fail on both sides.
Additionally, PySpark will use at least twice as much memory - one copy for Java and one for Python.
Finally, binaryFiles (same as wholeTextFiles) is very inefficient and doesn't perform well if individual input files are large. In a case like this it is better to implement a format-specific Hadoop input format.
Since you are reading multiple binary files with binaryFiles(), and starting from Spark 2.1 the minPartitions argument of binaryFiles() is ignored, you can:
1. Try to repartition the input files based on the following:
rdd = sc.binaryFiles(<path to the binary files>, minPartitions=<n>).repartition(<m>)
2. You may try reducing the partition size to 64 MB or less, depending on the size of your data, using the configs below (see the sketch after this list):
spark.files.maxPartitionBytes, default 128 MB
spark.files.openCostInBytes, default 4 MB
spark.default.parallelism
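A minimal sketch of suggestion 2, with an illustrative path and values (not the asker's): set the file-partitioning configs on the SparkConf before the context is created, then repartition the RDD explicitly after reading.

# Minimal sketch: the path, the 64 MB target, and the parallelism of 200 are illustrative.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("binaryFiles-partitioning")
        .set("spark.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB per partition
        .set("spark.files.openCostInBytes", str(4 * 1024 * 1024))     # estimated cost of opening a file
        .set("spark.default.parallelism", "200"))
sc = SparkContext(conf=conf)

# minPartitions is ignored from Spark 2.1 onward, so repartition explicitly instead
rdd = sc.binaryFiles("hdfs:///path/to/binary/files").repartition(200)
print(rdd.count())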

spark repartition / executor inconsistencies commandline vs jupyter

I wasn't really sure what to title this question -- happy for a suggested better summary
I'm beating my head trying to figure out why a dead simple spark job works fine from Jupyter, but from the command line is left with insufficient executors to progress.
What I'm trying to do: I have a large amount of data (<1TB) from which I need to extract a small amount of data (~1GB) and save as parquet.
Problem I have: when my dead-simple code is run from the command line, I only get as many executors as I have final partitions, which is ideally one given that the output is small. The exact same code works just fine in Jupyter, on the same cluster, where it fans out >10k tasks across my entire cluster. The command-line version never progresses. Since it doesn't produce any logs beyond reporting lack of progress, I'm not sure where else to dig.
I have tried both python3 mycode.py and spark-submit mycode.py with lots of variations to no avail. My cluster has dynamicAllocation configured.
import findspark
findspark.init('/usr/lib/spark/')
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.read.parquet(<datapath>).select(<fields>)
subset = [<list of items>]
spark.sparkContext.broadcast(subset)
data.filter(col(<field>).isin(subset)).coalesce(1).write.parquet("output")
** edit: original version mistakenly had repartition(1) instead of coalesce.
In this case, run from the command line, my process will get one executor.
In my logs, the only real hint I get is
WARN TaskSetManager: Stage 1 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
which makes sense given the lack of resources being allocated.
I have tried to manually force the number of executors using spark-submit runtime settings. In that case, it will start with my initial settings and then immediately start bringing them down until there is only one and nothing progresses.
Any ideas? thanks.
I ended up phoning a friend on this one...
the code that was running fine in JupyterHub, but not via the command line, was essentially:
read parquet,
filter on some small field,
coalesce(1)
write parquet
I had assumed that coalesce(1) and repartition(1) should have the same results -- even though coalesce(N) and repartition(N) do not -- given that they all go to one partition.
According to my friend, Spark sometimes optimizes coalesce(1) to a single task, which was the behavior I saw. By changing it to repartition(1), everything works fine.
I still have no idea why it works fine in JupyterHub -- having done >20 experiments -- and never on the command line -- also >20 experiments.
But, if you want to take your data lake to a data puddle this way, use repartition(1) or repartition(n), where n is small, instead of coalesce.
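For reference, a minimal sketch of the working variant (the path, field, and subset values are placeholders, not the asker's): repartition(1) inserts a shuffle, so the read and filter still run in parallel across the cluster, and only the final write goes to a single partition.

# Minimal sketch: placeholders throughout; only the coalesce -> repartition change matters.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

subset = ["a", "b", "c"]                      # small list of wanted values
data = spark.read.parquet("<datapath>").select("<fields>")
(data.filter(col("<field>").isin(subset))
     .repartition(1)                          # unlike coalesce(1), keeps the upstream stage parallel
     .write.parquet("output"))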
