Spark partitionBy on write.save brings all data to driver? - apache-spark

So basically I have a python spark job that reads some simple json files, and then tries to write them as orc files partitioned by one field. The partition is not very balanced, as some keys are really big, and other really small.
I had memory issues when doing something like this:
events.write.mode('append').partitionBy("type").save("s3n://mybucket/tofolder"), format="orc")
Adding memory to the executors didn't seem to have any effect, but I solved it increasing the driver memory. Does this mean that all the data is being send to the driver for it to write? Can't each executor write its own partition? Im using Spark 2.0.1

Even if you partition dataset and then write it on storage there is no possibility that records are sent to the driver. You should look at logs of memory issues (if they occur on driver on or executors) to figure out exact reason of failing.
Probably your driver has too low memory to handle this write because of previous computations. Try decreasing spark.ui.retainedJobs and spark.ui.retainedStages to save memory on old jobs and stages metadata. If this won't help, connect to driver with jvisualvm to find job/stage than consumes large heap fragments and try to optimize.

Related

How to tell if spark session will be able to hold data size in dataframe?

Intend to read data from an Oracle DB with pyspark (running in local mode) and store locally as parquet. Is there a way to tell whether a spark session dataframe will be able to hold the amount of data from the query (which will be the whole table, ie. select * from mytable)? Are there common solutions for if the data would not be able to fit in a dataframe?
* Saw a similar question here, but was a little confused by the discussion in the comments
As you are running on local, So I assume it is not on a cluster. You can not say exactly how much memory would require? However, you can go close to it. You check your respective table size that how much disk space it's using. Suppose you mytable has occupied 1GB of Hard disk then spark would be required RAM more than that, because Spark's engine required some memory for its own processing. Try to have 2GB extra, for safer side than actual table size.
To check you table size in Oracle, You can use below query:
select segment_name,segment_type,bytes/1024/1024 MB
from dba_segments
where segment_type='TABLE' and segment_name='<yourtablename>';
It will give you a result in MB.
To configure JVM related parameter in Apache-Spark you can check this.
It doesn't matter how big the table is if you are running spark in a distributed manner. You would need to worry about the memory if:-
You are reading the data in the driver and then doing a broadcast.
Caching the dataframe for some computation.
Usually for your spark application a DAG gets generated and if you are using JDBC source then the workers will read the data directly and use the shuffle space and off-heap to disk for memory intensive computation.

setting tuning parameters of a spark job

I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of cores, memory and cores when I have a relatively smaller operation to do as if I give maximum resources, it is going to be underutilised .
For instance,
if I have to just do a merge job (read files from hdfs and write one single huge file back to hdfs using coalesce) for about 60-70 GB (assume each file is of 128 mb in size which is the block size of HDFS) of data(in avro format without compression), what would be the ideal memory, no of executor and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't understand the concept of how much memory will be used up by the entire job provided there are no joins, aggregations etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data combining it and writing it out, then you will need very little memory per cpu because the dataset is never fully materialized before writing it out. If you're doing joins/group-by/other aggregate operations all of those will require much ore memory. The exception to this rule is that spark isn't really tuned for large files and generally is much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answers is to run your job with the default parameters and see what blows up.

Does Apache Spark cache RDD in node-level or cluster-level?

I know that Apache Spark persist method saves RDDs in memory and that if there is not enough memory space, it stores the remaining partitions of the RDD in the filesystem (disk). What I can't seem to understand is the following:
Imagine we have a cluster and we want to persist an RDD. Suppose node A does not have a lot of memory space and that node B does. Let's suppose now that after running the persist command, node A runs out of memory. The question now is:
Does Apache Spark search for more memory space in node B and try to store everything in memory?
Or given that there is not enough space in node A, Spark stores the remaining partitions of the RDD in the disk of node A even if there some memory space available in node B?
Thanks for your answers.
Normally Spark doesn't search for the free space. Data is cached locally on the executor responsible for a particular partition.
The only exception is the case when you use replicated persistence mode - in that case additional copy will be place on another node.
The closest thing I could find is this To cache or not to cache. I had plenty of situations when data was mildly skewed and was getting memory related exceptions/failures when trying to cache/persist into RAM, one way around it was to use StorageLevels like MEMORY_AND_DISK, but obviously it was taking longer to cache and than read those partitions.
Also in Spark UI you can find the information about executors and how much of their memory is used for caching, you can experiment and monitor how it behaves.

OOM loading data from parquet

we have an Apache Spark 1.4.0 cluster and we would like to load data from a set of 350 parquet file from HDFS. Currently, when we try to run our program, we get a "OutOfMemory Error" driver side.
Profiling both an Executor and the driver we noticed that during the operation the executor memory remains constant when the driver memory constantly increase.
For each parquet file we load data as follows:
sqlContext.read().format(PARQUET_OUT_TYPE).load("path").toJavaRDD(mappingFunction)
and, after that, we join the RDDs by "union" and then we coalesce them
partitions.reduce((r1,r2) -> r1.union(r2).coalesce(PARTITION_COUNT))
What looks really strange to me is that the executor memory remain constant during the loading phase (when I expect to see it increase 'cause of the data read by the node) and the driver memory constantly increase (when I expect to see it remain constant because that should not be loaded in the driver memory).
Is there something wrong with the way we load data? Could you please explain me how to read data from parquet in parallel?
Thanks
the OOM was caused by the parquet metadata not by the data.
Thanks

spark streaming failed batches

I see some failed batches in my spark streaming application because of memory related issues like
Could not compute split, block input-0-1464774108087 not found
, and I was wondering if there is a way to re process those batches on the side without messing with the current running application, just in general , does not have to be the same exact exception.
Thanks in advance
Pradeep
This may happen in cases where your data ingestion rate into spark is higher than memory allocated or can be kept. You can try changing StorageLevel to MEMORY_AND_DISK_SER so that when it is low on memory Spark can spill data to disk. This will prevent your error.
Also, I don't think this error means that any data was lost while processing, but that input block which was added by your block manager just timed out before processing started.
Check similar question on Spark User list.
Edit:
Data is not lost, it was just not present where the task was expecting it to be. As per Spark docs:
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.

Resources