Spark caching - when to cache after foreachBatch (Spark Streaming)

I'm currently reading from a Kafka topic using Spark Streaming. Then, inside foreachBatch (df), I do some transformations. I first filter the batch df by an id (df_filtered; I can do this filter any number of times), then create a dataframe from that filtered df (new_df_filtered, because the data comes as a JSON message and I want to convert it to a normal column structure by providing the schema), and finally write it to 2 sinks.
Here's a sample of the code:
def sink_process(self, df: DataFrame, current_ids: list):
    df.repartition(int(os.environ.get("SPARK_REPARTITION_NUMBER")))
    df.cache()
    for id in current_ids:
        df_filtered = self.df_filter_by_id(df, id)  # this returns the new dataframe with the schema. Uses a .where and then a .createDataFrame
        first_row = df_filtered.take(1)  # making sure that this filter action returns any data
        if first_row:
            df_filtered.cache()
            self.sink_process(df_filtered, id)
            df_filtered.unpersist()
    df.unpersist()
My question is where I should cache this data for optimal performance. Right now I cache the batch before applying any transformations, which I have come to realise is not really doing anything at that point, as the data is only cached when the first action occurs. So following this logic, I'm only really caching this df when I reach that .take, right? But at that point, I'm also caching the filtered df. The idea behind caching the batch data before the filter was that, if I had a lot of different ids, I wouldn't be fetching the data every time I did the filter, but I might have gotten this all wrong.
Can anyone please help clarify what would be the best approach? Maybe only caching df_filtered, which is going to be used for the different sinks?
Thanks

Related

Cache() in Pyspark Dataframe

I have a dataframe and I need to apply several transformations to it. I thought of performing all of them on the same dataframe. If I need to use cache, should I cache the dataframe after every operation performed on it?
df=df.selectExpr("*","explode(area)").select("*","col.*").drop(*['col','area'])
df.cache()
df=df.withColumn('full_name',f.concat(f.col('first_name'),f.lit(' '),f.col('last_name'))).drop('first_name','last_name')
df.cache()
df=df.withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", "")).withColumn("date_type", to_date("cleaned_map", "ddMMyyyy")).drop('date','cleaned_map')
df.cache()
df=df.filter(df.date_type.isNotNull())
df.show()
Should I add cache() like this, or is caching once enough?
Also, I want to know: if I use multiple dataframes instead of one for the above code, should I include cache at every transformation? Thanks a lot!
The answer is simple: whether you write df = df.cache() or df.cache(), both mark the same underlying RDD for caching. Once you perform any further operation, it creates a new RDD, and that new RDD is evidently not cached. Having said that, it's up to you which DF/RDD you want to cache(). Also, try to avoid unnecessary caching, as the data will be persisted in memory.
Below is the source code for the RDD cache() method from the PySpark documentation:
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY_SER}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY_SER)
    return self
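To make the behaviour described in this answer concrete, here is a toy pure-Python model. It is not Spark code: the class LazyDataset and the helper read_source are hypothetical names invented for this sketch. It assumes only what the answer states: cache() just sets a flag, the data is materialized when the first action runs, and a transformation returns a new object that is not itself cached.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD/DataFrame: lazy, recomputed per action unless cached."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self.is_cached = False
        self._materialized: Optional[List[int]] = None

    def cache(self) -> "LazyDataset":
        # Only sets a flag; nothing is computed or stored yet.
        self.is_cached = True
        return self

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # A transformation returns a NEW dataset that is not marked as cached.
        return LazyDataset(lambda: [fn(x) for x in self._rows()])

    def _rows(self) -> List[int]:
        if self.is_cached and self._materialized is not None:
            return self._materialized
        rows = self._compute()
        if self.is_cached:
            # First action on a cached dataset: keep the result for later reuse.
            self._materialized = rows
        return rows

    def collect(self) -> List[int]:
        # An action: triggers the actual computation.
        return list(self._rows())


counter = {"reads": 0}


def read_source() -> List[int]:
    # Hypothetical expensive read (e.g. from Kafka or a file).
    counter["reads"] += 1
    return [1, 2, 3]


base = LazyDataset(read_source).cache()
reads_after_cache_call = counter["reads"]  # still 0: cache() alone reads nothing

doubled = base.map(lambda x: x * 2)
first = doubled.collect()   # first action: reads the source once, fills base's cache
second = doubled.collect()  # doubled is recomputed, but base is served from the cache
```

In this model the source is read only once across both actions, while doubled itself is recomputed each time because only base was marked as cached. Real Spark adds storage levels, eviction and partition-level behaviour that this sketch ignores.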

What is the fastest way to get a large number of time ranges using Apache Spark?

I have about 100 GB of time series data in Hadoop. I'd like to use Spark to grab all data from 1000 different time ranges.
I have tried this using Apache Hive by creating an extremely long SQL statement that has about 1000 'OR BETWEEN X AND Y OR BETWEEN Q AND R' statements.
I have also tried using Spark. In this technique I've created a dataframe that has the time ranges in question and loaded that into spark with:
spark_session.createDataFrame()
and
df.registerTempTable()
With this, I'm doing a join with the newly created timestamp dataframe and the larger set of timestamped data.
This query is taking an extremely long time and I'm wondering if there's a more efficient way to do this.
Especially if the data is not partitioned or ordered in any special way, either you or Spark will need to scan all of it no matter what.
I would define a predicate given the set of time ranges:
import org.apache.spark.rdd.RDD
import scala.collection.immutable.Range

val ranges: List[Range] = ??? // load your ranges here

def matches(timestamp: Int): Boolean = {
  // This is not efficient, a better data structure than a List
  // should be used, but this is just an example
  // exists (not contains) checks whether any range includes the timestamp
  ranges.exists(_.contains(timestamp))
}

val data: RDD[(Int, T)] = ??? // load the data in an RDD; T stands for your value type
// x._1 is the timestamp, the first element of each tuple
val filtered = data.filter(x => matches(x._1))
You can do the same with DataFrame/DataSet and UDFs.
This works well if the set of ranges is provided in the driver. If instead it comes from a table, like the 100 GB of data, first collect it back to the driver, if it is not too big.
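The comment in the predicate above says a better data structure than a List should be used. One possible choice, sketched here in plain Python rather than Scala, is to merge overlapping ranges once and then binary-search the sorted start points. The names merge_ranges and RangeIndex are hypothetical, and the sketch assumes integer timestamps and inclusive (start, end) ranges.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple


def merge_ranges(ranges: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort inclusive (start, end) ranges and merge the overlapping ones."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


class RangeIndex:
    """Answers 'is this timestamp inside any range?' in O(log n) per lookup."""

    def __init__(self, ranges: Iterable[Tuple[int, int]]):
        self._merged = merge_ranges(ranges)
        self._starts = [start for start, _ in self._merged]

    def matches(self, timestamp: int) -> bool:
        # Find the last merged range whose start is <= timestamp.
        i = bisect_right(self._starts, timestamp) - 1
        return i >= 0 and timestamp <= self._merged[i][1]


index = RangeIndex([(10, 20), (15, 30), (100, 110)])
rows = [(9, "a"), (25, "b"), (50, "c"), (110, "d")]
filtered = [row for row in rows if index.matches(row[0])]
```

With about 1000 ranges, each lookup then costs roughly ten comparisons instead of a scan over the whole list. The same predicate could presumably be used inside an RDD filter or a UDF, as long as the index is small enough to ship to the executors.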
Your Spark job goes through the whole 100 GB dataset to select the relevant data.
I don't think there is a big difference between using SQL and the DataFrame API, as under the hood a full scan happens anyway.
I would consider restructuring your data so that it is optimised for your specific queries.
In your case, partitioning by time can give quite a significant improvement (for example, a Hive table with partitioning).
If you search on the same field that was used for partitioning, the Spark job will only look into the relevant partitions.

Spark: Apply multiple transformations without recalculating or caching

Is it possible to take the output of a transformation (RDD/Dataframe) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?
Long version
Consider the case.
I have a very large dataset that doesn't fit in memory. Now I do some transformations on it which prepare the data to be worked on efficiently (grouping, filtering, sorting....):
DATASET --(TF1: transformation with group by, etc)--> DF1
DF1 --(TF2: more_transformations_some_columns)--> output
DF1 --(TF3: more_transformations_other_columns)--> output2
I was wondering if there is any way (or planned in dev) to tell Spark that, after TF1, it must reuse the same results (at partition level, without caching everything!) to serve both TF2 and TF3.
This can be conceptually imagined as a cache() at each partition, with automatic unpersist() when the partition was consumed by the further transformations.
I searched for a long time but couldn't find any way of doing it.
My attempt:
DF1 = spark.read()... .groupBy().agg()...
DF2 = DF1.select("col1").cache() # col1 fits in mem
DF3 = DF1.select("col1", transformation(other_cols)).write()... # Force evaluation of col1
Unfortunately, DF3 cannot guess that it could do the caching of col1. So apparently it isn't possible to ask Spark to cache only a few columns. That would already alleviate the problem.
Any ideas?
I don't think it is possible to cache just some of the columns,
but will this solve your problem?
DF1 = spark.read()... .groupBy().agg()...
DF3 = DF1.select("col1", transformation(other_cols)).cache()
DF3.write()
DF2 = DF3.select("col1")

reducebykey and aggregatebykey in spark Dataframe

I am using Spark 2.0 to read data from a parquet file.
val Df = sqlContext.read.parquet("c:/data/parquet1")
val dfSelect = Df.select("id", "Currency", "balance")
val dfSumForeachId = dfSelect.groupBy("id").sum("balance")
val total = dfSumForeachId.agg(sum("sum(balance)")).first().getDouble(0)
Is using the action first() on a dataframe the best way to get the total balance value?
Also, in Spark 2.0, is it fine to use groupBy? Does it have the same performance issue as groupByKey on an RDD, i.e. does it need to shuffle the whole data over the network and then perform the aggregation, or is the aggregation performed locally first, like reduceByKey in earlier versions of Spark?
Thanks
Getting the data by using first is a perfectly valid way of getting the data. That said, doing:
val total = dfSelect.agg(sum("balance")).first().getDouble(0)
would probably give you better performance for getting the total.
groupByKey and reduceByKey work exactly the same as in previous versions, for the same reasons: groupByKey makes no assumption about the operation you want to perform, and therefore cannot know how to do partial aggregations as reduceByKey does.
When you do a DataFrame groupBy and sum, you are actually doing a reduceByKey with the + operation, and the second aggregation you did is a reduce with +. That said, the DataFrame does it more efficiently because, knowing exactly what is being done, it can perform many optimizations such as whole-stage code generation.
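The difference between the two strategies can be illustrated with a small pure-Python simulation. This is not Spark code: the partitions are plain lists, the function names are hypothetical, and "shuffled" simply counts the records that would have to leave their partition.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, float]

# Two hypothetical partitions of (id, balance) records.
partitions: List[List[Record]] = [
    [("a", 1.0), ("b", 2.0), ("a", 3.0)],
    [("a", 4.0), ("b", 5.0), ("b", 6.0)],
]


def group_then_sum(parts: List[List[Record]]) -> Tuple[Dict[str, float], int]:
    """groupByKey style: every record crosses the shuffle, then values are summed."""
    shuffled = [record for part in parts for record in part]
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}, len(shuffled)


def reduce_with_partial_sums(parts: List[List[Record]]) -> Tuple[Dict[str, float], int]:
    """reduceByKey style: sum per key inside each partition first, then merge."""
    shuffled: List[Record] = []
    for part in parts:
        local: Dict[str, float] = defaultdict(float)
        for key, value in part:
            local[key] += value
        # Only one partial sum per key leaves each partition.
        shuffled.extend(local.items())
    totals: Dict[str, float] = defaultdict(float)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_totals, grouped_shuffled = group_then_sum(partitions)
reduced_totals, reduced_shuffled = reduce_with_partial_sums(partitions)
```

Both functions return the same totals, but the second one moves four partial sums instead of six raw records. On real data with many records per key, that gap is what makes partial aggregation cheaper.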

spark streaming with aggregation

I am trying to understand spark streaming in terms of aggregation principles.
Spark DataFrames are based on mini-batches, and computations are done on the mini-batch that arrived within a specific time window.
Let's say we have data coming in as:
Window_period_1[Data1, Data2, Data3]
Window_period_2[Data4, Data5, Data6]
..
Then the first computation will be done for Window_period_1 and then for Window_period_2. If I need to use the new incoming data along with historic data, say a kind of groupBy between Window_period_new and the data from Window_period_1 and Window_period_2, how would I do that?
Another way of seeing the same thing: let's say I have a requirement where a few data frames are already created (df1, df2, df3), and I need to run an aggregation which involves data from df1, df2, df3 and Window_period_1, Window_period_2, and all new incoming streaming data. How would I do that?
Spark allows you to store state in an RDD (with checkpoints). So, even after a restart, the job will restore its state from the checkpoint and continue streaming.
However, we faced performance problems with checkpoints (especially after restoring state), so it is worth implementing state storage using some external source (like HBase).
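The idea of carrying state from one mini-batch to the next can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's state API: RunningTotals is a hypothetical class, the state lives in a dictionary in memory, and the initial values stand in for aggregates computed from historic data. In a real job that dictionary would be replaced by checkpointed state or an external store, as described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


class RunningTotals:
    """Keeps per-key totals across micro-batches (in memory only in this sketch)."""

    def __init__(self, initial: Optional[Dict[str, float]] = None):
        # Seed the state with aggregates from historic data, if there are any.
        self.totals: Dict[str, float] = defaultdict(float, initial or {})

    def update(self, batch: Iterable[Tuple[str, float]]) -> Dict[str, float]:
        # Fold the new batch into the existing state and return a snapshot.
        for key, value in batch:
            self.totals[key] += value
        return dict(self.totals)


# Historic aggregate for key "x", followed by two incoming micro-batches.
state = RunningTotals({"x": 10.0})
after_first = state.update([("x", 1.0), ("y", 2.0)])
after_second = state.update([("y", 3.0)])
```

Each call to update sees the totals accumulated by all earlier batches plus the historic seed, which is the behaviour the question asks for. Durability and recovery after a restart are exactly the parts this sketch leaves out.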
