PySpark collect() causing memory to shoot up to 80 GB - apache-spark

I have a Spark job that reads a CSV file and does a bunch of joins and column renames.
The file size is in the MB range.
x = info_collect.collect()

The size of x in Python is around 100 MB, yet I get a memory crash; checking Ganglia, the memory goes up to 80 GB.
I have no idea why collecting 100 MB can cause memory to spike like that.
Could someone please advise?
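A minimal sketch of two common ways to avoid materializing the whole result on the driver at once (this is not from the original post; process and the output path are hypothetical placeholders):

# Option 1: stream rows to the driver one partition at a time instead of collect().
for row in info_collect.toLocalIterator():
    process(row)  # placeholder for whatever is done with each row

# Option 2: skip the driver entirely and write the result to storage.
info_collect.write.mode("overwrite").parquet("/tmp/info_collect_result")  # hypothetical path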

Related

PySpark OOM for multiple data files

I want to process several independent CSV files of similar size (about 100 MB each) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
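For reference, a minimal local SparkSession matching these settings could be built like this (a sketch, assuming the configs are set programmatically rather than in spark-defaults.conf):

from pyspark.sql import SparkSession

# Sketch: single-core local session with the memory settings quoted above.
# Note: spark.driver.memory only takes effect if set before the JVM starts
# (e.g. via spark-submit --driver-memory or spark-defaults.conf).
spark = (SparkSession.builder
         .master("local[1]")
         .config("spark.driver.memory", "20g")
         .config("spark.executor.memory", "2g")
         .getOrCreate())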
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one CSV (note that I used 35 different window functions):
from pyspark.sql import Window, functions as f

logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving averages. I used 35 different moving averages (i.e. 35 windows like w, one per value of i).
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... no aggregation, no sort
logData.write.parquet("res.pr")
This works great. However, I had two issues when scaling this job:
When I increased the number of window functions to 50, the job OOMed. I'm not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I ran the job on 2 CSV files, it also OOMed. Again it's not clear why nothing is spilled to disk, since the window functions are essentially partitioned by CSV file, so they are independent.
The question is: why doesn't PySpark spill to disk in these two cases to prevent OOM, and how can I hint Spark to do it?
If your machine cannot run all of these at once, you can process them in sequence and write out the data of each bulk of files before loading the next bulk (see the sketch below).
I'm not sure if this is what you mean, but you can try hinting Spark to write some of the data to disk instead of keeping it all in RAM with:
df.persist(StorageLevel.MEMORY_AND_DISK)
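Putting both suggestions together, a minimal sketch might look like the following (the file list and process_one_file are hypothetical placeholders, not from the thread):

from pyspark import StorageLevel

# Sketch: process the CSVs one at a time and let cached blocks spill to disk when RAM is tight.
csv_files = ["TypeA.csv", "TypeB.csv"]          # hypothetical list of inputs
for path in csv_files:
    df = spark.read.csv(path, header=False, schema=schema)
    df = df.persist(StorageLevel.MEMORY_AND_DISK)   # allow Spark to keep blocks on disk
    result = process_one_file(df)                   # placeholder for the window computations above
    result.write.mode("overwrite").parquet(path + ".parquet")
    df.unpersist()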
Update, in case it helps:
In theory, you could process all these 600 files on a single machine; Spark should spill to disk when memory is not enough. But there are some points to consider:
The logic involves window aggregation, which results in a heavy shuffle. You need to check whether the OOM happened in the map phase or the reduce phase. The map phase processes each partition of a file and writes shuffle output to local files; the reduce phase then has to fetch that shuffle output from all map tasks. It's clear that in your case you can't hold the data of all map tasks at once.
So it's highly likely that the OOM happened in the map phase. If so, it means the memory per core can't process even a single partition of a file. Be aware that Spark makes a rough estimation of memory usage and spills when it thinks it should; since the estimation isn't accurate, OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128MB)
Usually, a 128 MB input partition needs roughly 2 GB of usable memory, i.e. about a 4 GB executor, because only a fraction of the executor memory is available for execution:
usable (execution + storage) memory ~= (spark.executor.memory - 300 MB reserved) * spark.memory.fraction (default 0.6),
with spark.executor.memoryOverhead (default 10% of executor memory) added on top of the JVM heap for the container.
You can post all your configs (shown in the Spark UI's Environment tab) for further investigation.
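As a hedged illustration of that tuning (the concrete values are examples, not recommendations from the answer):

# Sketch: shrink input partitions so each map task holds less data in memory.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB per input partition

# Rough arithmetic behind the "128 MB input ~ 4 GB executor" rule of thumb:
executor_memory_gb = 4
usable_gb = (executor_memory_gb - 0.3) * 0.6   # (heap - 300 MB reserved) * spark.memory.fraction
print(usable_gb)                               # ~2.2 GB of unified execution + storage memory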

How does it work when the action result size is bigger than the machine's memory?

The storage and memory size of the machine on which the PySpark collect() action runs is 1 GB, but my resulting data is 4 GB (stored in 4 partitions of 1 GB each). How is my 4 GB result going to be returned as output?
Your job will probably crash with an OOM error.
You can either write the result to HDFS and read it from there instead of calling collect() (collect is bad practice),
or you can give more memory to your driver machine (the driver stores the collected data).
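A hedged sketch of both options (the paths, DataFrame name, and memory value are made up for illustration):

# Option 1: write the result to distributed storage instead of collecting it on the driver.
result_df.write.mode("overwrite").parquet("hdfs:///user/me/result")   # hypothetical path
# ...and read it back later, still distributed across executors:
result_df = spark.read.parquet("hdfs:///user/me/result")

# Option 2: if you really must collect, give the driver enough memory at submit time, e.g.:
#   spark-submit --driver-memory 8g my_job.py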

How does Apache Spark process data that does not fit into memory?

I have written a Spark program to count the records in a 2 GB file on a machine with 1 GB of storage memory, and it ran successfully.
But my question is: since a 2 GB file cannot fit into 1 GB of memory, how does Spark still process the file and return the count?
Just because you have a 2 GB file on disk does not mean it will occupy the same amount of memory in RAM; it may take less or more. Another point is how your file is stored on disk (row format or columnar format). If it is stored in ORC format, for example, it already has precomputed details about the data.
I suggest you check the executor and task memory details in the Spark UI to understand how many stages, executors, and tasks were used to complete the DAG.
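As a small illustration of why this works (the path is hypothetical), a plain count never materializes the whole file at once; each task streams through its own input partition and only the per-partition counts travel back to the driver:

# Sketch: counting a file larger than executor memory.
# Each task reads and counts its own input partition (roughly 128 MB by default),
# so only partial counts, not rows, are sent back to the driver.
df = spark.read.csv("hdfs:///data/big_2gb_file.csv", header=True)   # hypothetical path
print(df.count())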

Very small batch processing with Spark

We are working on a project where we need to process some datasets which are very small, in fact less than 100 rows each in CSV format. There are around 20-30 such jobs that process these kinds of datasets. But the load can grow in the future and could reach the big data category. Is it fine to start with Spark for this extra-small load, so that the system remains scalable tomorrow? Or should we write a normal program in Java/C# for now that runs on a schedule, and switch to Spark in the future if the load of some of these tasks becomes really high?
Thanks in advance.
Absolutely fine. One thing to remember before running a job is to check the data size and allocate memory accordingly.
Say you have 10 cores and 50 GB of RAM, and initially your CSV files are 3 KB or 1 MB in size. Giving 50 GB of RAM and 10 cores to 1 MB files is the wrong approach.
Before you trigger the job, be careful about how much memory and how many executors you allocate.
For the CSV files above (a few MB of data), at most 2 cores and 5 GB of RAM will get the job done. As the data size grows, you can increase the cores and memory.
This can be set when you launch the shell (here I am using PySpark with YARN as the resource manager), for example:
pyspark --master yarn --num-executors <N> --executor-memory <512M to 2G>
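Equivalently, a sketch with illustrative values, requesting the same small allocation from inside a PySpark script:

from pyspark.sql import SparkSession

# Sketch: deliberately small allocation for tiny inputs (values are illustrative).
spark = (SparkSession.builder
         .master("yarn")
         .config("spark.executor.instances", "2")
         .config("spark.executor.cores", "1")
         .config("spark.executor.memory", "512m")
         .getOrCreate())

df = spark.read.csv("small_dataset.csv", header=True)   # hypothetical file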
Thank you.

Spark partitionBy on write.save brings all data to driver?

So basically I have a Python Spark job that reads some simple JSON files and then tries to write them out as ORC files partitioned by one field. The partitioning is not very balanced, as some keys are really big and others really small.
I had memory issues when doing something like this:
events.write.mode('append').partitionBy("type").save("s3n://mybucket/tofolder", format="orc")
Adding memory to the executors didn't seem to have any effect, but I solved it by increasing the driver memory. Does this mean that all the data is being sent to the driver for it to write? Can't each executor write its own partition? I'm using Spark 2.0.1.
Even if you partition a dataset and then write it to storage, there is no possibility that the records are sent to the driver. You should look at the logs of the memory issues (whether they occur on the driver or on the executors) to figure out the exact reason for the failure.
Probably your driver has too little memory to handle this write because of previous computations. Try decreasing spark.ui.retainedJobs and spark.ui.retainedStages to save the memory spent on metadata for old jobs and stages. If this doesn't help, connect to the driver with jvisualvm to find the job/stage that consumes large heap fragments and try to optimize it.
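A hedged sketch of that tuning (the values are illustrative; events is the DataFrame from the question):

from pyspark.sql import SparkSession

# Sketch: trim driver-side UI metadata before the heavy partitioned write.
spark = (SparkSession.builder
         .config("spark.ui.retainedJobs", "100")
         .config("spark.ui.retainedStages", "100")
         .getOrCreate())

events.write.mode("append").partitionBy("type").save("s3n://mybucket/tofolder", format="orc")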
