Spark SQL createDataFrame() raising OutOfMemory exception - apache-spark

Does createDataFrame() build the whole DataFrame in memory?
How do I create a large DataFrame (> 1 million rows) and persist it for later queries?

To persist it for later queries:
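import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel
val sc: SparkContext = ...
val hc = new HiveContext(sc)
val df: DataFrame = myCreateDataFrameCode()
  .coalesce(8)
  .persist(StorageLevel.MEMORY_ONLY_SER)
df.show()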
This coalesces the DataFrame to 8 partitions and then persists it in serialized form. I can't say what number of partitions is best; perhaps even 1. Check the StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which keeps serialized partitions in memory and spills the ones that don't fit to disk.
In answer to the first question: yes, I think Spark needs to create the whole DataFrame in memory before persisting it. If you're getting an OutOfMemory error, that's probably the key roadblock. You don't say how you're creating the DataFrame. Perhaps there's a workaround, such as creating it in smaller pieces, persisting each piece with MEMORY_AND_DISK_SER, and then combining the pieces.

Related

How to avoid re-evaluation of each transformation on pyspark data frame again and again

I have a spark data frame. I'm doing multiple transformations on the data frame. My code looks like this:
df = df.withColumn ........
df2 = df.filter......
df = df.join(df1 ...
df = df.join(df2 ...
I have 30+ transformations like this. I'm also aware that a data frame can be persisted. So if I have transformations like this:
df1 = df.filter.....some condition
df2 = df.filter.... some condition
df3 = df.filter... some other condition
I persist the data frame "df" in the above case.
The problem is that Spark takes too long to run (8+ minutes), or sometimes fails with a Java heap space error.
But if, after some 10+ transformations, I save to a table (a persistent Hive table) and read from that table in the next line, it takes around 3+ minutes to complete. It does not help if I save to an intermediate in-memory table instead.
Cluster size is not the issue either.
# some transformations
df.write.mode("overwrite").saveAsTable("test")
df = spark.sql("select * from test")
# some transformations ---------> 3 minutes
# some transformations
df.createOrReplaceTempView("test")
df.count()  # action so that the view is materialized
df = spark.sql("select * from test")
# some more transformations --------> 8 minutes
I looked at the Spark SQL plan (I still do not completely understand it). It looks like Spark is re-evaluating the same data frame again and again.
What am I doing wrong? I don't want to have to write to an intermediate table.
Edit: I'm working on azure databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
Edit2: The issue is the long RDD lineage. My Spark application gets slower and slower as the lineage grows.
You should use caching.
Try using
df.cache()
df.count()
The count() is an action that forces Spark to cache all the data.
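Since Edit2 points at lineage growth, checkpointing is another option worth trying alongside cache(): it writes the data out and returns a data frame whose plan no longer contains the earlier transformations. The following is a minimal sketch, not a tested fix for this workload. The helper name truncate_lineage and the directory argument are hypothetical; setCheckpointDir and DataFrame.checkpoint are the PySpark calls it relies on.

```python
def truncate_lineage(spark, df, checkpoint_dir):
    """Materialize df and cut its lineage (hypothetical helper).

    spark          -- an active SparkSession
    df             -- the data frame after a batch of transformations
    checkpoint_dir -- a directory on reliable storage (assumed path)
    """
    # Checkpoint files are written under this directory.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    # eager=True runs the job immediately; the returned data frame reads
    # from the checkpoint files instead of replaying earlier steps.
    return df.checkpoint(eager=True)
```

It could be called after every ten or so transformations, for example df = truncate_lineage(spark, df, "/tmp/checkpoints"), where the path is only a placeholder.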

How to auto calculate numRepartition while using spark dataframe write

When I write a DataFrame to a partitioned Hive Parquet table like this:
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
it creates a lot of blocks in HDFS, each holding only a small amount of data.
I understand why: each Spark task creates a block and writes its data to it.
I also understand that more blocks improve Hadoop performance up to a point, and that performance degrades once the number of blocks passes a threshold.
Does anyone have a good way to set numPartition automatically?
val numPartition: Int = ??? // auto calc based on df size or something
df.repartition(numPartition).write
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")
First of all, why do you want an extra repartition step when you are already using partitionBy("key")? Your data will already be partitioned by the key.
Generally, you can repartition by a column value. That is a common scenario, and it helps with operations such as reduceByKey or filtering on the column value. For example:
import spark.implicits._ // assumes a SparkSession named spark; needed for toDF and $"..."
val birthYears = List(
(2000, "name1"),
(2000, "name2"),
(2001, "name3"),
(2000, "name4"),
(2001, "name5")
)
val df = birthYears.toDF("year", "name")
df.repartition($"year")
By default, Spark creates 200 partitions for shuffle operations, so up to 200 files/blocks (small ones, if there is little data) will be written to HDFS.
Set the number of partitions created after a shuffle to suit your data with the following configuration:
spark.conf.set("spark.sql.shuffle.partitions", <number of partitions>)
For example, with spark.conf.set("spark.sql.shuffle.partitions", "5") Spark will create 5 partitions, and 5 files will be written to HDFS.
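Neither answer shows how to derive the number from the data size, which is what the question asked. One simple heuristic is to divide an estimate of the total data size by a target file size. The sketch below assumes a 128 MiB target (a common HDFS block size) and leaves open how total_bytes is obtained, for instance from the size of the input files. The function name is hypothetical.

```python
import math

def estimate_num_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Partition count so that each output file is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1
    # Round up so no partition exceeds the target, and never return 0.
    return max(1, math.ceil(total_bytes / target_file_bytes))
```

The result can be passed to df.repartition(n) or used as the value of spark.sql.shuffle.partitions. Note that with partitionBy("key") each task can still write one file per key value it holds, so the final file count may be higher than n.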

How to reliably write and restore partitioned data

I am looking for a way to write and restore a partitioned dataset. For the purpose of this question I can accept either a partitioned RDD:
val partitioner: org.apache.spark.Partitioner = ???
rdd.partitionBy(partitioner)
or a Dataset[Row] / DataFrame:
df.repartition($"someColumn")
The goal is to avoid a shuffle when the data is restored. For example:
spark.range(n).withColumn("foo", lit(1))
.repartition(m, $"id")
.write
.partitionBy("id")
.parquet(path)
shouldn't require a shuffle for:
spark.read.parquet(path).repartition(m, $"id")
I thought about writing the partitioned Dataset to Parquet, but I believe Spark doesn't use this information when reading it back.
I can only work with disk storage, not a database or data grid.
This can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch: saving directly to Parquet files won't work; only saveAsTable works.
Dataset<Row> parquet =...;
parquet.write()
.bucketBy(1000, "col1", "col2")
.partitionBy("col3")
.saveAsTable("tableName");
sparkSession.read().table("tableName");
Another approach, for Spark core, is to use a custom RDD, e.g. see https://github.com/apache/spark/pull/4449. That is, after reading the RDD from HDFS you set the partitioner back up yourself. It is a bit hacky and not supported natively, so it needs to be adjusted for every Spark version.

Why mapPartitionsWithIndex cause a shuffle in Spark?

I'm new to Spark. I'm investigating shuffling in a test application, and I don't know why the mapPartitionsWithIndex method causes a shuffle in my program. As you can see in the picture, my initial RDD has two 16 MB partitions, and the shuffle write is about 49.8 MB.
I know that map, mapPartitions, and mapPartitionsWithIndex are not shuffling transformations like groupByKey, but I see them causing a shuffle in Spark as well. Why?
I think you are performing a join or group operation after mapPartitionsWithIndex, and that is what causes the shuffle.
You can confirm this by modifying your code.
Current code:
val rdd = inputRDD1.mapPartitionsWithIndex(....)
val outRDD = rdd.join(inputRDD2)
Modified code (without the join, there should be no shuffle):
val rdd = inputRDD1.mapPartitionsWithIndex(....)
println(rdd.count)

How to duplicate RDD into multiple RDDs?

Is it possible to duplicate an RDD into two or more RDDs?
I want to use the cassandra-spark driver to save an RDD into a Cassandra table and, in addition, carry on with more calculations (and eventually save that result to Cassandra as well).
RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.keyBy(...)
base.saveToCassandra(ks,table)
val processed = base.map(...).reduceByKey(...)
processed.saveToCassandra(ks,processedTable)
val analyzed = base.map(...).join(suspectsRDD).reduceByKey(...)
analyzed.saveAsTextFile("./path/to/save")
