I have an application which submits multiple Spark applications in parallel in client mode. What guidelines do I need to follow when writing the Spark application so that driver memory is not exhausted?
The operations I am doing in the spark are as follows:
val df1 : read data from file to dataframe
val df2 : do sum(col4) sum(col5) on df1
val df3 : sort df2 on col2
val df4 : df3.limit(threshold)
val result : fill in Null spaces with literals
save(write) result into file
I am creating multiple data frames here. Are all those data frames brought back to the driver program? Would doing multiple operations in one step make it more memory efficient on the client (driver)?
I read https://spark.apache.org/docs/latest/programming-guide.html but did not find an answer to my question.
from pyspark.sql import SparkSession
spark= SparkSession.builder.master("local[4]").getOrCreate()
df = spark.read.csv("annual-enterprise-survey-2021-financial-year-provisional-size-bands-csv.csv")
df.createOrReplaceTempView("table")
sqldf = spark.sql('SELECT _c5 FROM table WHERE _c5 > "1000"')
print(sqldf.count())
print(df.rdd.getNumPartitions())
print(sqldf.rdd.getNumPartitions())
I am trying to see the effect of parallelism in Spark. How can I control how many partitions I have when I run actions on my dataframe? In the code above, the number of partitions printed is 1 for both dataframes, and the UI shows 1 task for the count job. Shouldn't Spark create 4 tasks (the number of cores on my local machine) and so do the count operation faster?
Partitions and workers are not mapped one to one, although they can be.
local[4] sets the number of worker threads, not the number of partitions. To set the number of partitions of a dataframe, use the repartition or coalesce function.
For example you can write
sqldf = sqldf.repartition(4)
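To make the distinction between worker threads and partitions concrete, here is a toy model in plain Python. It does not call Spark; the helper names `repartition` and `count` are hypothetical stand-ins, and the thread pool only plays the role of the local[4] worker threads. The point it sketches is that the number of tasks follows the number of partitions, so a single partition gives a single task no matter how many threads are available.

```python
from concurrent.futures import ThreadPoolExecutor


def repartition(rows, num_partitions):
    # Toy round-robin split; Spark's repartition shuffles data differently.
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count(partitions, worker_threads):
    # One task per partition: each task counts the rows of its partition.
    with ThreadPoolExecutor(max_workers=worker_threads) as pool:
        partial_counts = list(pool.map(len, partitions))
    # Return the total row count and the number of tasks that ran.
    return sum(partial_counts), len(partial_counts)


rows = list(range(1000))

# A single partition: only one task runs, even with 4 worker threads.
total_one, tasks_one = count([rows], worker_threads=4)

# Four partitions: four tasks, which the 4 threads can pick up in parallel.
total_four, tasks_four = count(repartition(rows, 4), worker_threads=4)

print(total_one, tasks_one)
print(total_four, tasks_four)
```

In this sketch the result of the count is the same in both cases; only the number of tasks changes.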
I have a spark data frame. I'm doing multiple transformations on the data frame. My code looks like this:
df = df.withColumn ........
df2 = df.filter......
df = df.join(df1 ...
df = df.join(df2 ...
Now I have around 30+ transformations like this. I'm also aware of persisting a data frame. So if I have some transformations like this:
df1 = df.filter.....some condition
df2 = df.filter.... some condition
df3 = df.filter... some other condition
I'm persisting the data frame "df" in the above case.
Now the problem is that Spark is taking too long to run (8+ minutes), or sometimes it fails with a Java heap space error.
But if, after some 10+ transformations, I save to a table (a persistent Hive table) and read from that table in the next line, it takes around 3+ minutes to complete. It does not help if I save to an intermediate in-memory table instead.
Cluster size is not the issue either.
# some transformations
df.write.mode("overwrite").saveAsTable("test")
df = spark.sql("select * from test")
# some transformations ---------> 3 minutes
# some transformations
df.createOrReplaceTempView("test")
df.count() #action statement for view to be created
df = spark.sql("select * from test")
# some more transformations --------> 8 minutes
I looked at the Spark SQL plan (I still do not completely understand it). It looks like Spark is re-evaluating the same dataframe again and again.
What am I doing wrong? I don't want to have to write it to an intermediate table.
Edit: I'm working on azure databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
Edit2: The issue is the long RDD lineage. My Spark application gets slower and slower as the RDD lineage grows.
You should use caching.
Try using
df.cache()
df.count()
count is an action, so it forces the whole DataFrame to be computed and stored in the cache.
Also I recommend you take a look at this and this
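The effect of caching on a long lineage can be sketched with a toy model in plain Python. This is not Spark code: `LazyFrame`, `load` and the `reads` counter are hypothetical names used only for illustration. Each "frame" just remembers how to recompute itself from its parent, so without a cache every action walks the whole lineage back to the source.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # how to rebuild this frame from its parent
        self._use_cache = False
        self._cached = None

    def map(self, fn):
        return LazyFrame(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return LazyFrame(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Only marks the frame; nothing is computed until an action runs.
        self._use_cache = True
        return self

    def collect(self):
        # The "action": runs the lineage unless a cached result exists.
        if self._use_cache and self._cached is not None:
            return self._cached
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


reads = {"count": 0}


def load():
    # Pretend this is an expensive read of the source data.
    reads["count"] += 1
    return list(range(10))


# Without caching: every action re-reads the source.
base = LazyFrame(load).map(lambda x: x * 2)
multiples_of_four = base.filter(lambda x: x % 4 == 0).collect()
above_ten = base.filter(lambda x: x > 10).collect()
reads_without_cache = reads["count"]

# With caching: one action fills the cache, later actions reuse it.
reads["count"] = 0
cached = LazyFrame(load).map(lambda x: x * 2).cache()
cached.collect()  # plays the role of df.count() after df.cache()
cached_multiples = cached.filter(lambda x: x % 4 == 0).collect()
cached_above_ten = cached.filter(lambda x: x > 10).collect()
reads_with_cache = reads["count"]

print(reads_without_cache, reads_with_cache)
```

In this model the uncached frame is rebuilt from the source once per action, while the cached one is built once and then reused by every later filter.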
I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
I am running Spark 2.0.1 with a local SparkSession. I have a CSV dataset that I am saving as a parquet file on my disk like so:
val df0 = spark
.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("inferSchema", false)
.load("SomeFile.csv")
val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)
df.write
.mode(SaveMode.Overwrite)
.format("parquet")
.option("inferSchema", false)
.save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte. This should group multiple numerocarte to same partition. I don't want to do partitionBy("numerocarte") at the write time because I don't want one partition per card. It would be millions of them.
After that in another script I read this SomeFile.parquet parquet file and do some operations on it. In particular I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = spark.read
.format("parquet")
.option("header", true)
.option("inferSchema", false)
.load("SomeFile.parquet")
val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))
df2.withColumn("NewColumnName",
sum(col("dollars")).over(w))
After read I can see that the repartition worked as expected and DataFrame df2 has 42 partitions and in each of them are different cards.
Questions:
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
If it knows, then there will be no shuffle in the window function. True?
If it does not know, it will do a shuffle in the window function. True?
If it does not know, how do I tell Spark the data is already partitioned by the right column?
How can I check a partitioning key of DataFrame? Is there a command for this? I know how to check number of partitions but how to see partitioning key?
When I print number of partitions in a file after each step, I have 42 partitions after read and 200 partitions after withColumn which suggests that Spark repartitioned my DataFrame.
If I have two different tables repartitioned with the same column, would the join use that information?
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you saved data that had been shuffled does not mean it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you have loaded the data, but you can check queryExecution for a Partitioner.
In practice:
If you want to support efficient pushdowns on the key, use partitionBy method of DataFrameWriter.
If you want a limited support for join optimizations use bucketBy with metastore and persistent tables.
See How to define partitioning of DataFrame? for detailed examples.
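Why bucketing by a key can remove the need for a shuffle is easier to see in a toy model. The sketch below is plain Python, not Spark: `bucket_of`, `write_bucketed` and `sum_per_key` are hypothetical helpers, and CRC32 is only a stand-in for whatever hash function Spark uses internally. Because a key always hashes to the same bucket, a per-key aggregate can be computed inside each bucket without moving rows between buckets.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Stand-in hash (assumption): Spark uses its own hash, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows, key, num_buckets):
    # Every row goes to the bucket chosen by the hash of its key.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def sum_per_key(bucket, key, value):
    # Aggregates one bucket in isolation, without looking at other buckets.
    totals = defaultdict(int)
    for row in bucket:
        totals[row[key]] += row[value]
    return dict(totals)


rows = [
    {"userid": "a", "amount": 10},
    {"userid": "b", "amount": 5},
    {"userid": "a", "amount": 30},
    {"userid": "c", "amount": 7},
    {"userid": "b", "amount": 20},
]

buckets = write_bucketed(rows, "userid", num_buckets=4)
per_bucket = [sum_per_key(b, "userid", "amount") for b in buckets]

# No key appears in two buckets, so the partial results never collide.
merged = {}
for totals in per_bucket:
    merged.update(totals)

print(merged)
```

The same reasoning is why the number of buckets and the bucketing column have to be recorded somewhere (in Spark's case, the metastore): a reader that does not know them cannot assume the rows of one key sit together.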
I am answering my own question to record what worked, for future reference.
Following suggestion of #user8371915, bucketBy works!
I am saving my DataFrame df:
df.write
.bucketBy(250, "userid")
.saveAsTable("myNewTable")
Then when I need to load this table:
val df2 = spark.sql("SELECT * FROM myNewTable")
val w = Window.partitionBy("userid")
val df3 = df2.withColumn("newColumnName", sum(col("someColumn")).over(w))
df3.explain
I confirm that when I do window functions on df2 partitioned by userid there is no shuffle! Thanks #user8371915!
Some things I learned while investigating it
myNewTable looks like a normal parquet file but it is not. You could read it normally with spark.read.format("parquet").load("path/to/myNewTable"), but the DataFrame created this way will not keep the original partitioning! You must use a spark.sql select to get a correctly partitioned DataFrame.
You can look inside the table with spark.sql("describe formatted myNewTable").collect.foreach(println). This will tell you what columns were used for bucketing and how many buckets there are.
Window functions and joins that take advantage of partitioning often also require a sort. You can sort the data in your buckets at write time using .sortBy(), and the sort will also be preserved in the Hive table. df.write.bucketBy(250, "userid").sortBy("somColumnName").saveAsTable("myNewTable")
When working in local mode the table myNewTable is saved to a spark-warehouse folder in my local Scala SBT project. When saving in cluster mode with mesos via spark-submit, it is saved to hive warehouse. For me it was located in /user/hive/warehouse.
When doing spark-submit you need to add to your SparkSession two options: .config("hive.metastore.uris", "thrift://addres-to-your-master:9083") and .enableHiveSupport(). Otherwise the hive tables you created will not be visible.
If you want to save your table to specific database, do spark.sql("USE your database") before bucketing.
Update 05-02-2018
I encountered some problems with spark bucketing and creation of Hive tables. Please refer to question, replies and comments in Why is Spark saveAsTable with bucketBy creating thousands of files?
Does it create the whole dataFrame in Memory?
How do I create a large dataFrame (> 1 million Rows) and persist it for later queries?
To persist it for later queries:
val sc: SparkContext = ...
val hc = new HiveContext( sc )
val df: DataFrame = myCreateDataFrameCode().
coalesce( 8 ).persist( StorageLevel.MEMORY_ONLY_SER )
df.show()
This will coalesce the DataFrame to 8 partitions before persisting it in serialized form. I can't say what number of partitions is best, perhaps even "1". Check the StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which keeps partitions in memory and spills the ones that do not fit to disk.
In answer to the first question, yes I think Spark will need to create the whole DataFrame in memory before persisting it. If you're getting 'OutOfMemory', that's probably the key roadblock. You don't say how you're creating it. Perhaps there's some workaround, like creating and persisting it in smaller pieces, persisting to memory_and_disk with serialization, and then combining the pieces.
I am creating a demo that uses Spark SQL (data frames) and Spark Streaming. I am no Spark expert by any means, so I need some help!
I load about 1 million objects from a DB into a Spark DataFrame, and I run SQL queries to match some of its fields against the live data from Spark Streaming.
For example,
SELECT *
FROM Person
WHERE Person.name='stream.name' AND Person.age='stream.age' AND ... etc
stream.xxx is a java string which I extract from spark streaming RDD into a string.
Now, the problem is that with a dataframe of 1 million rows and several columns, the SQL query above can take some time to execute even if the DF is persisted in memory. I had an idea where I would break up the Person table into zip code regions (each dataframe contains the Person rows from 1 region) and process each Spark stream RDD against each dataframe. This should reduce query times.
I am not sure how I would do the partitioning, though. Here's some sample code.
// Setup Spark Stream with receiver
JavaReceiverInputDStream<String> transaction = jssc.receiverStream(new TransactionStreamReceiver(StorageLevel.MEMORY_AND_DISK()));
DataFrame person1 = //load logic omitted
DataFrame person2 = //load logic omitted
DataFrame person3 = //load logic omitted
// Break up Person table for faster processing
transaction.foreachRDD(new TransactionProcessingFunction(person1, sqlContext, window,1));
transaction.foreachRDD(new TransactionProcessingFunction(person2, sqlContext, window,2));
transaction.foreachRDD(new TransactionProcessingFunction(person3, sqlContext, window,2));
I assumed that each worker node would process one foreachRDD call, but this is not the case. Is there any way I can assign each worker to run each foreachRDD in parallel?
EDIT: The TransactionProcessingFunction class is essentially just a for loop that iterates over the stream data, runs the query above and shows some results.
I will try to get the necromancer silver medal by directing you to this page to see how to load a SQL table inside the stream. Once you create a table object, you could repartition() the RDD or simply let the Spark environment do its job.