First, I join two DataFrames: the first DF is filtered from the second DF and is about 8 MB (260,000 records); the second DF comes from a file that is roughly 2 GB (37,000,000 records). Then I call
joinedDF.javaRDD().saveAsTextFile("hdfs://xxx:9000/users/root/result");
and I also tried
joinedDF.write().mode(SaveMode.Overwrite).json("hdfs://xxx:9000/users/root/result");
I am a bit confused since I get an exception:
ERROR TaskSetManager: Total size of serialized results of 54 tasks
(1034.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
As far as I know, saveAsTextFile should output directly from the workers, so why do I get an exception related to the driver?
I know about the option to increase spark.driver.maxResultSize, and I set it to unlimited, but it does not help since my driver has only 4.8 GB of memory in total.
EDIT:
DataFrame df1 = table.as("A");
DataFrame df2 = table.withColumnRenamed("id", "key").filter("value = 'foo'");
joinedDF = df1.join(df2.as("B"), col("A.id").startsWith(col("B.key")), "right_outer");
I tried a broadcast variable too; the change is in df2:
DataFrame df2 = sc.broadcast(table.withColumnRenamed("id", "key").filter("value = 'foo'")).getValue();
Found the answer in the related post https://stackoverflow.com/a/29602918/5957143
To summarize @kuujo's answer:
saveAsTextFile does not send the data back to the driver. Rather, it
sends the result of the save back to the driver once it's complete.
That is, saveAsTextFile is distributed. The only case where it's not
distributed is if you only have a single partition or you've
coalesced your RDD back to a single partition before calling
saveAsTextFile.
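In other words (a minimal sketch in Scala notation, reusing joinedDF from the question; the output paths are placeholders):
// each partition is written in parallel, directly by the executor that holds it
joinedDF.rdd.saveAsTextFile("hdfs://xxx:9000/users/root/result_distributed")
// with a single partition, one task writes everything and produces a single part file
joinedDF.rdd.coalesce(1).saveAsTextFile("hdfs://xxx:9000/users/root/result_single")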
Related
I'm trying to construct a DataFrame from a list of data, then write it as parquet files:
dataframe = None
while True:
    data_list = get_data_list()  # this function returns a list of data, about 1 million rows
    rdd = sparkContext.parallelize(data_list, 20)
    if dataframe is not None:
        dataframe = dataframe.union(sparkSession.createDataFrame(data=rdd))
    else:
        dataframe = sparkSession.createDataFrame(data=rdd)
    if some_judgement:
        break
dataframe.write.parquet('...')
But I found that the driver would fail with java.lang.OutOfMemoryError: Java heap space after a few cycles. If I increase the driver memory or decrease the number of cycles in the loop, this exception stops occurring. So I guess that even though I created an RDD, the data is still stored in the driver. So when will the data be sent to the executors? I want to decrease the memory usage of the driver.
Can you check the logs and see where the exception is happening (either at the driver or an executor)? If it is happening at the driver, can you increase the driver memory to 8 or 10 GB and see if the job succeeds?
I would also suggest setting higher values for the memory overhead parameters:
spark.driver.memoryOverhead
spark.executor.memoryOverhead
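For example, via spark-submit (the values and the application name below are only placeholders to illustrate where these settings go):
spark-submit \
  --driver-memory 10g \
  --conf spark.driver.memoryOverhead=1g \
  --conf spark.executor.memoryOverhead=2g \
  your_application.py
Note that the driver-side memory settings generally have to be supplied at submit time (or in spark-defaults.conf); setting them from inside an already running application has no effect on the driver.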
I have a DataFrame with 1 integer column and 1B rows. So ideally the size of the DataFrame should be 1B * 4 bytes ≈ 4 GB. This appears to be correct when I cache the DataFrame and check the size: it is around 4 GB.
Now, if I try to broadcast the same DataFrame to join with another DataFrame, I get an error: Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 14 GB
Why does the size of a broadcast DataFrame increase? I have seen this in other cases as well, where a 300 MB DataFrame shows up as a 3 GB broadcast DataFrame in the Spark UI SQL tab.
Any reasoning or help is appreciated.
The size increases in memory if the DataFrame is broadcast across your cluster. How much it increases depends on how many workers you have, because Spark needs to copy your DataFrame to every worker to handle your subsequent operations.
Do not broadcast big DataFrames; only broadcast small ones, for use in join operations.
As per the documentation:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
It may also be more of a bug, according to this issue: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-37321
I am using Spark 3.1.1 and joining two DataFrames with file sizes of 8.6 GB and 25.2 MB respectively, without applying any filter. Spark automatically uses a BroadcastHashJoin for this, although spark.sql.autoBroadcastJoinThreshold defaults to 10 MB.
How does 25.2 MB become 8.1 MB without any filter being applied, so that it is eligible for broadcast?
val df1 = spark.read
.option("header",true)
.csv("s3a://data/staging/received/data/spark/3/KernelVersionOutputFiles.csv")
.withColumn("Pid",substring(rand(),3,4).cast("bigint"))
val df2 = spark.read
.option("header",true)
.csv("s3a://data/staging/received/data/spark/3/ForumTopics.csv")
.withColumn("Cid",substring(rand(),3,4).cast("bigint"))
val df3 = df2.coalesce(1)
val joinDf = df1.join(df3, df1("Pid") === df3("Cid"))
val cnt = joinDf.count()
The DAG looks like this:
Spark applies a broadcast join because the 25 MB of data in the CSV ("size of files read") comes out below 10 MB once serialized by Spark ("data size").
The amount shown as "size of files read" is fairly accurate, because Spark can compute the statistics directly on the data files. However, the "data size" shown in the DAG suffers from the inaccuracy of the SizeEstimator.
There it says:
"Estimate the number of bytes that the given object takes up on the JVM heap. The estimate includes space taken up by objects referenced by the given object, their references, and so on and so forth.
This is useful for determining the amount of heap space a broadcast variable will occupy on each executor or the amount of space each object will take when caching objects in deserialized form. This is not the same as the serialized size of the object, which will typically be much smaller."
If you want to get the actual size of your 25 MB CSV file, you could cache it and check the "Storage" tab in the WebUI.
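A minimal sketch of that check (the file path is a placeholder):
val df = spark.read.option("header", "true").csv("/path/to/your_25mb_file.csv") // placeholder path
df.cache()  // marks the DataFrame for caching; nothing is materialized yet
df.count()  // forces an action so the data is actually read and cached
// the "Storage" tab of the Spark UI now shows the real in-memory size of the cached data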
In my test case, although I also left autoBroadcastJoinThreshold at its default of 10 MB, Spark applied a broadcast join. The estimate was 66 MB for a JSON file of 14 MB on disk. When I cached it, it showed a size of 3.5 MB, which is obviously lower than the 10 MB threshold.
The following picture shows my test case (similar to yours):
The following screenshot shows the actual size of the data, which is only 3.5MB:
Another reference on this is given by Microsoft here.
I'm running a Spark job on AWS Glue. The job transforms the data and saves the output to Parquet files, partitioned by date (year, month, day directories). The job must be able to handle terabytes of input data and uses hundreds of executors, each with a 5.5 GB memory limit.
The input covers over 2 years of data. The output Parquet files for each date should be as big as possible, optionally split into 500 MB chunks. Creating many small files for each day is not wanted.
A few tested approaches:
Repartitioning by the same columns as used in the write results in out-of-memory errors on the executors:
df = df.repartition(*output_partitions)
(df
.write
.partitionBy(output_partitions)
.parquet(output_path))
Repartitioning with an additional column holding a random value results in many small output files being written (their number corresponding to the spark.sql.shuffle.partitions value):
df = df.repartition(*output_partitions, "random")
(df
.write
.partitionBy(output_partitions)
.parquet(output_path))
Setting the number of partitions in the repartition call, for example to 10, gives 10 quite big output files, but I'm afraid it will cause out-of-memory errors once the actual data (TBs in size) is loaded:
df = df.repartition(10, *output_partitions, "random")
(df
.write
.partitionBy(output_partitions)
.parquet(output_path))
(df in the code snippets is a regular Spark DataFrame)
I know I can limit the output file size with the maxRecordsPerFile write option. But this limits the output created from a single in-memory partition, so I would first need the partitions to be created by date.
So the question is how to repartition the data in memory to:
split it over multiple executors to prevent out-of-memory errors,
save the output for each day to a limited number of big Parquet files,
write the output files in parallel (using as many executors as possible)?
I've read those sources but did not find a solution:
https://mungingdata.com/apache-spark/partitionby/
https://stackoverflow.com/a/42780452
https://stackoverflow.com/a/50812609
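For reference, combining the ideas above (repartitioning on the date columns and capping file size with maxRecordsPerFile) would look roughly like the sketch below, written in Scala notation with placeholder names and values; it only restates the options already discussed and is not a verified fix for the memory problem:
import org.apache.spark.sql.functions.col
val outputPartitions = Seq("year", "month", "day")  // placeholder partition columns
df.repartition(outputPartitions.map(col): _*)       // df is the DataFrame from the question
  .write
  .option("maxRecordsPerFile", 5000000)             // placeholder value; tune towards ~500 MB files
  .partitionBy(outputPartitions: _*)
  .parquet(outputPath)                              // outputPath is a placeholder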
I am trying to do a broadcast join on two tables. The size of the smaller table will vary based upon the parameters but the size of the larger table is close to 2TB.
What I have noticed is that if I don't set spark.sql.autoBroadcastJoinThreshold to 10G, some of these operations do a SortMergeJoin instead of a broadcast join. But the size of the smaller table shouldn't be this big at all. I wrote the smaller table to an S3 folder and it took only 12.6 MB of space.
I did some operations on the smaller table so that the shuffle size appears on the Spark History Server, and the size in memory seemed to be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the smaller table it takes a long time to broadcast, leading me to think that the table might not be just 150 MB in size.
What would be a good way to figure out the actual size that Spark sees and uses to decide whether it crosses the value defined by spark.sql.autoBroadcastJoinThreshold?
Look at the SQL tab in the Spark UI. There you will see the DAG of each job plus the statistics that Spark collects.
For each DataFrame, it will show the size as Spark sees it.
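If you want the same number programmatically, the optimizer's size estimate (which is what gets compared against spark.sql.autoBroadcastJoinThreshold) can be read from the query plan. A sketch, assuming Spark 2.3+; queryExecution and optimizedPlan are developer-level APIs and the helper name is just for illustration, so treat the result as an estimate:
import org.apache.spark.sql.DataFrame
// returns the optimizer's estimated size (in bytes) of the DataFrame's plan,
// i.e. the figure Spark compares against spark.sql.autoBroadcastJoinThreshold
def estimatedPlanSizeInBytes(df: DataFrame): BigInt =
  df.queryExecution.optimizedPlan.stats.sizeInBytes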
BTW, you don't have to set spark.sql.autoBroadcastJoinThreshold to a high number to force Spark to use a broadcast join.
You can simply wrap the small df with org.apache.spark.sql.functions.broadcast(df) and it will force a broadcast only on that specific join.
As mentioned in this question: DataFrame join optimization - Broadcast Hash Join
import org.apache.spark.sql.functions.broadcast
val employeesDF = employeesRDD.toDF
val departmentsDF = departmentsRDD.toDF
// materializing the department data
val tmpDepartments = broadcast(departmentsDF.as("departments"))
import context.implicits._
employeesDF.join(broadcast(tmpDepartments),
$"depId" === $"id", // join by employees.depID == departments.id
"inner").show()