Spark Broadcasting Alternatives - apache-spark

Our application uses a long-running Spark context (much like the Spark REPL) to let users perform tasks online. We use Spark broadcasts heavily to process dimensional data. As is common practice, we broadcast the dimension tables and use the dataframe API to join the fact table with the other dimension tables. One of the dimension tables is quite big: about 100k records and roughly 15 MB in memory (Kryo-serialized it is only a few MB smaller).
We see that every Spark job on the denormalized dataframe causes all the dimensions to be broadcast over and over again. The bigger table takes ~7 seconds every time it is broadcast. We are trying to find a way to have the dimension tables broadcast only once per context lifespan. We have tried broadcasting via both SQLContext and SparkContext.
Are there any other alternatives to Spark broadcasting? Or is there a way to reduce the in-memory footprint of the dataframe (compression/serialization, etc.; post-Kryo it is still 15 MB)?

Possible Alternative
We use the Ignite Spark integration to load a large amount of data at the start of the job and keep mutating it as needed.
In embedded mode you can start Ignite when the Spark context boots and kill it at the end.
You can read more about it here:
https://ignite.apache.org/features/igniterdd.html
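For reference, a rough Scala sketch of what the embedded IgniteRDD setup can look like (the cache name, key/value types and sample data are illustrative; the exact IgniteContext API may differ between Ignite versions, so treat this as a sketch rather than a recipe):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object IgniteDimCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-dim-cache"))

    // Start embedded Ignite nodes alongside the Spark executors.
    val ignite = new IgniteContext(sc, () => new IgniteConfiguration())

    // "dimensions" is an illustrative cache name; load the table once at startup.
    val dimCache: IgniteRDD[String, String] = ignite.fromCache[String, String]("dimensions")
    dimCache.savePairs(sc.parallelize(Seq("1" -> "US", "2" -> "EU")))

    // Later jobs read (and mutate) the shared cache instead of re-broadcasting it.
    println(dimCache.count())

    // Tear the embedded nodes down together with the Spark context.
    ignite.close(true)
    sc.stop()
  }
}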

Finally we were able to find a stopgap solution until Spark supports pinning of RDDs in a later version; this is apparently not addressed even in v2.1.0.
The solution relies on RDD mapPartitions; below is a brief summary of the approach (see the sketch that follows):
1- Collect the dimension table records as a map of key-value pairs and broadcast it using the Spark context. You can possibly use RDD.keyBy to build the pairs.
2- Map the fact rows using the RDD mapPartitions method: for each fact row, collect the dimension ids it references, look up the corresponding dimension records in the broadcast map, and yield a new fact row with the dimension ids denormalized.
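A minimal, self-contained Scala sketch of that approach (the Fact/DenormFact row shapes and the sample data are invented for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  // Illustrative row shapes; the real schemas come from your fact and dimension tables.
  case class Fact(factId: Long, dimId: Long, amount: Double)
  case class DenormFact(factId: Long, dimName: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup"))

    // 1- Collect the dimension table once as a key -> record map and broadcast it.
    val dimRDD = sc.parallelize(Seq(1L -> "US", 2L -> "EU"))
    val dimBroadcast = sc.broadcast(dimRDD.collectAsMap())

    // 2- Denormalize the fact rows with mapPartitions, resolving ids in the broadcast map.
    val facts = sc.parallelize(Seq(Fact(10L, 1L, 9.5), Fact(11L, 2L, 3.0)))
    val denormalized = facts.mapPartitions { rows =>
      val dims = dimBroadcast.value
      rows.map(f => DenormFact(f.factId, dims.getOrElse(f.dimId, "unknown"), f.amount))
    }

    denormalized.collect().foreach(println)
    sc.stop()
  }
}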

Related

Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins: I have to perform around 30 to 36 joins in Spark.
Building the execution plan was consuming more and more time, so I cached intermediate stages of the plan using df.localCheckpoint().
Is this a good way to do it? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but it can equally be applied to a Spark app with many steps, e.g. joins.
Quoting from https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs, as the entire set of transformations to a DataFrame or RDD has to be recomputed when fault tolerance occurs or for each additional transformation that is applied on top of an RDD or DataFrame.
localCheckpoint() is not "reliable": it stores the checkpointed data on the executors rather than in fault-tolerant storage.
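For illustration only, a hedged Scala sketch of truncating the lineage every few joins in a long chain (the dataframes and the checkpoint interval are made up):

import org.apache.spark.sql.SparkSession

object LocalCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("many-joins").getOrCreate()
    import spark.implicits._

    // Stand-ins for the ~30 dataframes being joined.
    var acc = Seq((1, "a"), (2, "b")).toDF("id", "v0")
    for (i <- 1 to 30) {
      val next = Seq((1, s"x$i"), (2, s"y$i")).toDF("id", s"v$i")
      acc = acc.join(next, "id")
      // Every few joins, materialize the intermediate result and cut the lineage
      // so the planner does not re-analyze the whole chain on each step.
      // Note: localCheckpoint() keeps blocks on the executors and is NOT fault tolerant.
      if (i % 10 == 0) acc = acc.localCheckpoint()
    }

    acc.explain()
    spark.stop()
  }
}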
Caching is definitely a strategy to optimize your performance. In general, given that your data size and the resources of your Spark application remain unchanged, there are three points to consider when you want to optimize your join operations:
Data skewness: most of the time, when I try to find out why a join takes so long, data skewness turns out to be one of the reasons. In fact, not only joins: any transformation needs an even data distribution so that you don't end up with one skewed partition holding most of the data and everything waiting on that single task. Make sure your data is well distributed.
Data broadcasting: when we join, data shuffling is inevitable. In some cases we use a relatively small dataframe as a reference to filter the data in a very big dataframe, and shuffling the big dataframe is then a very expensive operation. Instead, we can broadcast the small dataframe to every node and avoid the costly shuffle (see the sketch after this list).
Keep your joining data as lean as possible: as mentioned in point 2, data shuffling is inevitable when you join. Therefore, keep your dataframes as lean as possible: remove unnecessary rows and columns to reduce the amount of data that has to be moved across the network during the shuffle.
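To make point 2 concrete, a small hedged sketch of an explicit broadcast join (the column names and data are invented):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()
    import spark.implicits._

    // Illustrative data: a large fact-like table and a small reference table.
    val bigDF   = Seq((1, 100.0), (2, 50.0), (1, 75.0)).toDF("dim_id", "amount")
    val smallDF = Seq((1, "US"), (2, "EU")).toDF("dim_id", "region")

    // The broadcast hint ships smallDF to every executor, so bigDF is not shuffled.
    val joined = bigDF.join(broadcast(smallDF), "dim_id")
    joined.explain() // the plan should show a BroadcastHashJoin instead of a SortMergeJoin

    spark.stop()
  }
}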

Spark 2.4.6 + JDBC Reader: when predicate pushdown is set to false, is data read in parallel by Spark from the engine?

I am trying to extract data from a big table in SAP HANA, around 1.5 TB in size, and the best way is to run the extraction in parallel across nodes and threads. Spark JDBC is the perfect candidate for the task, but in order to actually extract in parallel it requires the partition column, lower/upper bound and number of partitions options to be set. To make the extraction easier, I considered adding an artificial partition column based on the row_number() function and using MIN() and MAX() as the lower/upper bounds respectively. Then the operations team would only be required to provide the number of partitions.
The problem is that HANA runs out of memory, and it is very likely that row_number() is too costly on the engine. I can only imagine that over 100 threads run the same query during every fetch to apply the WHERE filters and retrieve their corresponding chunk.
So my question is: if I disable the predicate pushdown option, how does Spark behave? Does only one executor read the data, with the filters then applied on the Spark side? Or does it do some magic to split the fetching on the DB side?
What could you suggest for extracting such a big table using the available JDBC reader?
Thanks in advance.
Before executing your primary query from Spark, run a pre-ingestion query to fetch the size of the dataset being loaded, i.e. the MIN() and MAX() you mentioned.
Assuming the data is uniformly distributed between the min and max keys, you can partition it across executors in Spark by providing the min, the max and the number of partitions.
You don't need (or want) to change your primary data source by adding additional columns to support data ingestion in this case.
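A hedged Scala sketch of wiring those pre-computed bounds into the JDBC reader (the connection URL, table and column names, credentials and output path are placeholders, and the key column is assumed to be numeric):

import java.util.Properties

import org.apache.spark.sql.SparkSession

object JdbcParallelReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hana-extract").getOrCreate()

    val url = "jdbc:sap://hana-host:30015"   // placeholder connection string
    val props = new Properties()
    props.setProperty("user", "USER")        // placeholder credentials
    props.setProperty("password", "PASSWORD")

    // Pre-ingestion query: fetch the bounds of an existing numeric key column.
    val bounds = spark.read
      .jdbc(url, "(SELECT MIN(ID) AS LO, MAX(ID) AS HI FROM BIG_TABLE) b", props)
      .collect()(0)

    // Main read: Spark issues one bounded range query per partition, in parallel.
    val df = spark.read.jdbc(
      url,
      "BIG_TABLE",
      "ID",               // partition column: an existing key, no synthetic row_number()
      bounds.getLong(0),  // lower bound
      bounds.getLong(1),  // upper bound
      100,                // number of partitions, supplied by the operations team
      props)

    df.write.parquet("/placeholder/output/path")
    spark.stop()
  }
}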

Spark dataframe creation from already distributed in-memory data sets

I am new to the Spark community. Please ignore if this question doesn't make sense.
My PySpark dataframe spends only a fraction of the time (milliseconds) on the sort itself, but moving the data is much more expensive (> 14 seconds).
Explanation:
I have a huge collection of Arrow RecordBatches which is equally distributed across my worker nodes' memories (in plasma_store). Currently, I am collecting all those RecordBatches back on my master node, merging them, and converting them to a single Spark dataframe. Then I apply the sort function on that dataframe.
A Spark dataframe is a cluster-distributed data collection.
So my question is:
Is it possible to create a Spark dataframe from all those already distributed Arrow RecordBatches in the workers' memories, so that the data stays in the respective workers' memories (instead of bringing it all to the master node, merging it, and only then creating a distributed dataframe)?
Thanks!
Yes, you can store the data in the Spark cache; whenever you access the data again, it is served from the cache rather than recomputed from the source.
Please use the links below to understand more about caching:
https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
Where is df.cache() stored?
https://unraveldata.com/to-cache-or-not-to-cache/
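As a general illustration of the caching suggestion (not specific to Arrow or plasma_store; names, data and storage level are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()
    import spark.implicits._

    val df = Seq((3, "c"), (1, "a"), (2, "b")).toDF("key", "value")

    // persist() keeps the partitions on the executors where they were computed,
    // so later actions such as the sort reuse them instead of going back to the source.
    val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()               // materialize the cache
    cached.orderBy("key").show() // served from the cached partitions

    spark.stop()
  }
}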

Batch processing job (Spark) with lookup table that's too big to fit into memory

I'm trying to write a batch job to process a couple of hundred terabytes that currently sit in an HBase database (in an EMR cluster in AWS), all in a single large table. For every row I'm processing, I need to fetch additional data from a lookup table (a simple integer-to-string mapping) that is in a second HBase table. We'd be doing 5-10 lookups per row.
My current implementation uses a Spark job that distributes partitions of the input table to its workers, in the following shape:
// newHBaseConfig() and convertScanToString() are helpers defined elsewhere in our code.
Configuration hBaseConfig = newHBaseConfig();
hBaseConfig.set(TableInputFormat.SCAN, convertScanToString(scan));
hBaseConfig.set(TableInputFormat.INPUT_TABLE, tableName);

// One RDD partition per region of the HBase input table.
JavaPairRDD<ImmutableBytesWritable, Result> table =
    sparkContext.newAPIHadoopRDD(hBaseConfig, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);

table.map(val -> {
    // some preprocessing
}).foreachPartition(p -> {
    p.forEachRemaining(row -> {
        // code that does the lookup
    });
});
The problem is that the lookup table is too big to fit in the workers' memory. They all need access to all parts of the lookup table, but their access pattern would significantly benefit from a cache.
Am I right in thinking that I cannot use a simple map as a broadcast variable because it'd need to fit into memory?
Spark uses a shared nothing architecture, so I imagine there won't be an easy way to share a cache across all workers, but can we build a simple LRU cache for every individual worker?
How would I implement such a local worker cache that gets the data from the lookup table in HBase on a cache miss? Can I somehow distribute a reference to the second table to all workers?
I'm not set on my choice of technology, apart from HBase as the data source. Is there a framework other than Spark which could be a better fit for my use case?
You have a few options for dealing with this requirement:
1- Use RDD or Dataset joins
You can load both of your HBase tables as Spark RDDs or Datasets and then join them on your lookup key.
Spark will split both RDDs into partitions and shuffle content around so that rows with the same keys end up on the same executors.
By managing the number of partitions within Spark you should be able to join two tables of arbitrary size (a rough outline follows).
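A rough Scala outline of option 1, with the actual HBase loading elided and small in-memory RDDs standing in for both tables:

import org.apache.spark.{SparkConf, SparkContext}

object RddJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-join"))

    // Stand-ins for the two HBase tables: the main table keyed by the lookup id,
    // and the integer-to-string lookup table itself.
    val mainRows = sc.parallelize(Seq((1, "row-a"), (2, "row-b"), (1, "row-c")))
    val lookup   = sc.parallelize(Seq((1, "label-1"), (2, "label-2")))

    // The second argument sets the number of shuffle partitions, and therefore
    // bounds how much data any single task has to hold at once.
    val joined = mainRows.join(lookup, 200)

    joined.collect().foreach(println)
    sc.stop()
  }
}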
2- Broadcast a resolver instance
Instead of broadcasting a map, you can broadcast a resolver instance that performs an HBase lookup backed by a temporary LRU cache. Each executor gets a copy of this instance, manages its own cache, and you can invoke it within your foreachPartition() code.
Beware: the resolver instance needs to implement Serializable, so you will have to declare the cache, the HBase connections and the HBase Configuration properties as transient so they are initialized on each executor.
I run such a setup in Scala on one of the projects I maintain: it works and can be more efficient than a straight Spark join if you know your access patterns and manage your cache efficiently (a rough sketch is given below).
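A rough Scala sketch of such a resolver; the HBase call is stubbed out, and the cache size, thread-safety and connection handling are assumptions you would need to adapt:

// Serializable resolver shipped to the executors; the heavyweight members are
// @transient and re-created lazily after deserialization.
class LookupResolver(lookupTableName: String, maxEntries: Int = 10000) extends Serializable {

  // Small LRU cache backed by an access-ordered LinkedHashMap (illustrative, not thread-safe).
  @transient private lazy val cache =
    new java.util.LinkedHashMap[Int, String](16, 0.75f, true) {
      override def removeEldestEntry(e: java.util.Map.Entry[Int, String]): Boolean =
        size() > maxEntries
    }

  // Stub standing in for a real HBase Get against `lookupTableName`; a real version
  // would also hold a @transient lazy HBase Connection built from a transient Configuration.
  @transient private lazy val fetchFromHBase: Int => String = id => s"value-$id"

  def resolve(id: Int): String = {
    val hit = cache.get(id)
    if (hit != null) hit
    else {
      val value = fetchFromHBase(id) // cache miss: go to HBase and remember the result
      cache.put(id, value)
      value
    }
  }
}

// Usage inside the job, e.g. within foreachPartition (extractId is hypothetical):
//   val resolver = new LookupResolver("lookup_table")
//   rdd.foreachPartition(_.foreach(row => resolver.resolve(extractId(row))))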
3- Use the HBase Spark connector to implement your lookup logic
Apache HBase has recently incorporated improved HBase-Spark connectors.
The documentation is pretty sparse right now; you need to look at the JIRA tickets and the documentation of the previous incarnation of these tools, Cloudera's SparkOnHBase, but the last unit test in the test suite looks pretty much like what you want.
I have no experience with this API though.

Spark containers killed by YARN during group by

I have a data set extracted from HBase, which is a long form of a wide table, i.e. it has rowKey, columnQualifier and value columns. To get a kind of pivot, I need to group by rowKey, which is a string UUID, into a collection and build an object out of that collection. The problem is that the only group-by I manage to perform is counting the number of elements per group; other group-bys fail because containers are killed for exceeding YARN memory limits. I experimented a lot with the memory sizes, including overhead, partitioning with and without sorting, etc. I even went up to a high number of partitions, about 10,000, but the job dies all the same. I tried both DataFrame groupBy with collect_list, as well as Dataset groupByKey with mapGroups.
The code works on a small data set but not on the larger one. The data set is about 500 GB in Parquet files. The data is not skewed, as the largest group in the group-by has only 50 elements. Thus, by all means known to me, the partitions should easily fit in memory, since the aggregated data per rowKey is not really large. The keys and values are mostly strings and they are not long.
I am using Spark 2.0.2; the above computations were all done in Scala.
You're probably running into the dreaded groupByKey shuffle. Please read this Databricks article on avoiding groupByKey, which details the underlying differences between the two functions.
If you don't want to read the article, the short story is this: though groupByKey and reduceByKey produce the same results, groupByKey shuffles ALL of the data, while reduceByKey tries to minimize data shuffling by reducing first, a bit like MapReduce combiners if you're familiar with that concept (see the sketch below).
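To make the difference concrete, a small Scala sketch along the lines of the classic word-count example (the data is made up):

import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-vs-group"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // groupByKey: every (key, value) pair is shuffled before anything is combined.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey: values are combined within each partition first (map-side),
    // so far less data crosses the network.
    val viaReduce = pairs.reduceByKey(_ + _)

    println(viaGroup.collect().toMap == viaReduce.collect().toMap) // same result
    sc.stop()
  }
}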
