Spark: Apply multiple transformations without recalculating or caching - apache-spark

Is it possible to take the output of a transformation (RDD/Dataframe) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?
Long version
Consider the following case.
I have a very large dataset that doesn't fit in memory. Now I do some transformations on it which prepare the data to be worked on efficiently (grouping, filtering, sorting, ...):
DATASET --(TF1: transformation with group by, etc)--> DF1
DF1 --(TF2: more_transformations_some_columns)--> output
DF1 --(TF3: more_transformations_other_columns)--> output2
I was wondering if there is any way (or planned in dev) to tell Spark that, after TF1, it must reuse the same results (at partition level, without caching everything!) to serve both TF2 and TF3.
This can be conceptually imagined as a cache() at each partition, with an automatic unpersist() once the partition has been consumed by the subsequent transformations.
I searched for a long time but couldn't find any way of doing it.
My attempt:
DF1 = spark.read()... .groupBy().agg()...
DF2 = DF1.select("col1").cache() # col1 fits in mem
DF3 = DF1.select("col1", transformation(other_cols)).write()... # Force evaluation of col1
Unfortunately, DF3 cannot guess that it could do the caching of col1 along the way. So apparently it isn't possible to ask Spark to cache only a few columns, which would already alleviate the problem.
Any ideas?

I don't think it is possible to cache just some of the columns,
but will this solve your problem?
DF1 = spark.read()... .groupBy().agg()...
DF3 = DF1.select("col1", transformation(other_cols)).cache()
DF3.write()
DF2 = DF3.select("col1")
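If memory (rather than recomputation) is the main constraint, another option worth mentioning is persisting DF1 with a disk-only storage level, so both branches reuse the materialized result without keeping the whole dataset in RAM. A minimal sketch, where the input path, column names and aggregation are placeholders for your actual TF1:
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for TF1, the expensive grouping step from the question.
DF1 = (spark.read.parquet("/path/to/dataset")   # hypothetical input
            .groupBy("key")
            .agg(F.collect_list("value").alias("values")))

# Materialize DF1 on executor disks only: neither branch recomputes TF1,
# and executor memory is not used to hold the whole intermediate result.
DF1.persist(StorageLevel.DISK_ONLY)

DF1.select("key").write.mode("overwrite").parquet("/out/tf2")                                # TF2
DF1.select("key", F.size("values").alias("n")).write.mode("overwrite").parquet("/out/tf3")   # TF3

DF1.unpersist()
This still serializes the whole intermediate result to disk, so whether it beats simply recomputing TF1 depends on how expensive TF1 is relative to the extra I/O.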

Related

Spark caching - when to cache after foreachbatch (Spark Streaming)

I'm currently reading from a Kafka topic using Spark Streaming. Then, in foreachBatch (df), I do some transformations. I first filter the batch df by an id (df_filtered; I can apply this filter n times), then create a dataframe based on that filtered df (new_df_filtered, because the data comes as a JSON message and I want to convert it to a normal column structure by providing the schema), and finally write to 2 sinks.
Here's a sample of the code:
def sink_process(self, df: DataFrame, current_ids: list):
    df.repartition(int(os.environ.get("SPARK_REPARTITION_NUMBER")))
    df.cache()
    for id in current_ids:
        df_filtered = self.df_filter_by_id(df, id)  # this returns the new dataframe with the schema; uses a .where and then a .createDataFrame
        first_row = df_filtered.take(1)  # making sure that this filter action returns any data
        if first_row:
            df_filtered.cache()
            self.sink_process(df_filtered, id)
            df_filtered.unpersist()
    df.unpersist()
My question is where I should cache this data for optimal performance. Right now I cache the batch before applying any transformations, which I have come to realise isn't really doing anything at that point, since the data is only cached when the first action occurs. So following this logic, I'm only really caching this df when I reach that .take, right? But at that point, I'm also caching the filtered df. The idea behind caching the batch data before the filter was that, if I had a lot of different ids, I wouldn't fetch the data every time I applied the filter, but I might have gotten this all wrong.
Can anyone please help clarify what would be the best approach? Maybe cache only df_filtered, which is the one that is going to be used for the different sinks?
Thanks
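For what it's worth, a minimal sketch of the alternative the question itself suggests (caching only df_filtered, the DataFrame that both sinks actually reuse); the sink paths and write modes here are purely hypothetical:
def write_to_sinks(df_filtered, id):
    df_filtered.cache()                                         # reused by both writes below
    df_filtered.write.mode("append").parquet(f"/sink_a/{id}")   # hypothetical sink 1
    df_filtered.write.mode("append").parquet(f"/sink_b/{id}")   # hypothetical sink 2
    df_filtered.unpersist()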

Need to release the memory used by unused spark dataframes

I am not caching or persisting the Spark dataframe. If I have to do many additional things in the same session, aggregating and modifying the content of the dataframe as part of the process, when and how would the initial dataframe be released from memory?
Example:
I load a dataframe DF1 with 10 million records. Then I do some transformation on the dataframe which creates a new dataframe DF2. Then there is a series of 10 steps I perform on DF2. Through all of this, I do not need DF1 anymore. How can I be sure that DF1 is no longer in memory, hampering performance? Is there any approach by which I can directly remove DF1 from memory? Or does DF1 get automatically removed based on a Least Recently Used (LRU) approach?
That's not how Spark works. Dataframes are lazy: the only things stored in memory are the structures and the list of transformations you have applied to your dataframes. The data is not stored in memory (unless you cache it and apply an action).
Therefore, I do not see any problem in your question.
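To make that concrete, a small sketch (the source path and column name are illustrative): DF1 only occupies executor memory if it was explicitly cached and materialized, in which case unpersist() releases it; otherwise there is nothing to free.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("/data/source")           # hypothetical source; nothing is read yet
df2 = df1.withColumn("flag", df1["amount"] > 0)    # still lazy: only the plan is kept

# Only if df1 had been cached would it hold executor memory:
# df1.cache(); df1.count()    # materializes the cached blocks
# df1.unpersist()             # explicitly frees them again

del df1   # drops the Python reference; the plan behind df2 is unaffected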
Prompted by a question from A Pantola in the comments, I'm returning here to post a better answer to this question. Note there are MANY possible correct answers on how to optimize RAM usage, and the right one will depend on the work being done!
First, write the dataframe to DBFS, something like this:
spark.createDataFrame(data=[('A',0)],schema=['LETTERS','NUMBERS'])\
.repartition("LETTERS")\
.write.partitionBy("LETTERS")\
.parquet(f"/{tmpdir}",mode="overwrite")
Now,
df = spark.read.parquet(f"/{tmpdir}")
Assuming you don't set up any caching on the above df, then each time Spark finds a reference to df it will re-read the Parquet files in parallel and compute whatever is specified.
Note that the above solution will minimize RAM usage, but may require more CPU on every read. It also carries the cost of writing out the Parquet files in the first place.
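For example, continuing from the df defined above, both of the computations below simply re-scan the Parquet files on each action; nothing from the original lineage is held in executor memory (the filter and output path are illustrative):
df = spark.read.parquet(f"/{tmpdir}")

# Each action triggers its own parallel scan of the Parquet files.
df.groupBy("LETTERS").count().show()
df.filter(df["NUMBERS"] > 0).write.mode("overwrite").parquet(f"/{tmpdir}_filtered")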

Spark non-deterministic results after repartition

Is there some way to get deterministic results from dataframe repartition without sorting? In the below code, I get different results while doing the same operation.
from pyspark.sql.functions import rand, randn
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.range(0, 100000)
# repartition dataframe to 5 partitions
df2 = df.repartition(5).persist()
df2.head(5)
Out[1]: [Row(id=5324), Row(id=5389), Row(id=6209), Row(id=7640), Row(id=8090)]
df2.unpersist()
df3 = df.repartition(5).persist()
df3.head(5)
Out[2]: [Row(id=1019), Row(id=652), Row(id=2287), Row(id=470), Row(id=1348)]
Spark Version - 2.4.5
This non-deterministic behaviour is expected. Here's why...
.repartition(num) does a round-robin repartitioning when no columns are passed inside the function. This does not guarantee that a specific row will always be in a particular partition.
.head(n) returns the first n rows from the first partition(s) of the dataframe.
If you want an order, you need to use orderBy!
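For example, sorting before taking rows makes the result reproducible across runs (at the cost of the extra sort):
# Deterministic across runs: order first, then take rows.
df.repartition(5).orderBy("id").head(5)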
According to this JIRA, repartitioning (by default) involves a local sort, and is fully deterministic. From the PR notes:
In this PR, we propose ... performing a local sort before partitioning, after we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.
The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named spark.sql.execution.sortBeforeRepartition to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.
head(n), on the other hand, is not deterministic (unless you apply orderBy, which again repartitions the dataset), but that isn't your concern, right?

Select specific columns in a PySpark dataframe to improve performance

Working with Spark dataframes imported from Hive, sometimes I end up with several columns that I don't need. Supposing that I don't want to filter them with
df = SqlContext.sql('select cols from mytable')
and I'm importing the entire table with
df = SqlContext.table(mytable)
does a select and subsequent cache improve performance/decrease memory usage, like
df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()
or is it just a waste of time? I will do lots of operations and data manipulations on df, like avg, withColumn, etc.
IMO it makes sense to filter them beforehand:
df = SqlContext.sql('select col_1, col_2, col_3 from mytable')
so you won't waste resources...
If you can't do it this way, then do it as you did...
It is certainly a good practice, but it is rather unlikely to result in a performance boost unless you try to pass data through a Python RDD or do something similar. If certain columns are not required to compute the output, the optimizer should automatically infer the projections and push them as early as possible in the execution plan.
It is also worth noting that using df.count() after df.cache() will be useless most of the time (if not always). In general, count is rewritten by the optimizer as
SELECT SUM(1) FROM table
so what is typically requested from the source is:
SELECT 1 FROM table
Long story short there is nothing useful to cache here.
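One way to check this (a sketch, assuming the Hive table mytable exists) is to look at the physical plan: the scan node should list only the columns that are actually referenced, even though the whole table was "imported":
df = sqlContext.table("mytable")

# The scan should read only col_1, even without an explicit select.
df.groupBy("col_1").count().explain()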

Inconsistent Persistence of DataFrames in Spark 1.5

We recently switched to Spark 1.5.0 from 1.4.1 and have noticed some inconsistent behavior in persisting DataFrames.
df1 = sqlContext.read.parquet("df1.parquet")
df1.count()
161,100,982
df2 = sqlContext.read.parquet("df2.parquet")
df2.count()
67,498,706
join_df = df1.join(df2, "id")
join_df.count()
160,608,147
join_df.write.parquet("join.parquet")
join_parquet = sqlContext.read.parquet("join.parquet")
join_parquet.count()
67,698,892
join_df.write.json("join.json")
join_json = sqlContext.read.json("join.json")
join_json.count()
67,695,663
The first major issue is the large discrepancy between the count of the join DataFrame and the count of the persisted join DataFrame (160,608,147 vs. 67,698,892). Secondly, persisting the same DataFrame into 2 different formats yields different results.
Does anyone have any idea on what could be going on here?
I had a similar problem in Spark 1.6.1:
When I created 2 different DataFrames from a single RDD and persisted them to 2 Parquet files, I found that the rows in the Parquet files were not the same.
Later I found out that somewhere in the definition of the pipeline of operations for this RDD I had an rdd.reduceByKey() operation that returned non-deterministic results across multiple calls.
Apparently a different reduceByKey() evaluation was performed for each DataFrame, which resulted in small differences in some rows.
It might be something similar in your case.
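As a toy illustration of how reduceByKey() can behave this way (not the original pipeline): reduceByKey assumes the reduce function is associative and commutative, and a function that is not (subtraction here) combines partial results in an order that can change between runs, so the output can differ:
rdd = sc.parallelize([("k", 1), ("k", 2), ("k", 3), ("k", 4)], 4)

# Subtraction is neither associative nor commutative, so the merged value
# depends on the order in which partition results are combined.
rdd.reduceByKey(lambda a, b: a - b).collect()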

Resources