Distribute by vs Cluster by in Spark SQL

I have recently started working on Spark. We always use cluster by to optimize tables before joining, but I wanted to know: is there any scenario where we would prefer distribute by over the cluster by clause?

The only difference between cluster by and distribute by is this: distribute by only repartitions the data based on the expression, while cluster by first repartitions the data and then sorts it by the key within each partition.
The equivalent representations of cluster by and distribute by in the DataFrame API are as follows:
distribute by
df.repartition(2, $"key")
cluster by
df.repartition(2, $"key").sortWithinPartitions($"key")
Both involve a shuffle operation, except that cluster by additionally sorts within each partition.
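To see the two clauses side by side in SQL itself, here is a minimal PySpark sketch (the temp view t and column key are made up for illustration); CLUSTER BY key behaves like DISTRIBUTE BY key followed by SORT BY key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distribute-vs-cluster").getOrCreate()
spark.range(100).withColumnRenamed("id", "key").createOrReplaceTempView("t")

# DISTRIBUTE BY only shuffles rows so that equal keys land in the same partition
distributed = spark.sql("SELECT key FROM t DISTRIBUTE BY key")

# CLUSTER BY does the same shuffle and then sorts by key within each partition
clustered = spark.sql("SELECT key FROM t CLUSTER BY key")

# which is equivalent to DISTRIBUTE BY followed by SORT BY
also_clustered = spark.sql("SELECT key FROM t DISTRIBUTE BY key SORT BY key")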

Related

PySpark Parallelism

I am new to Spark and am trying to read data from a Parquet file and, after some transformations, return it to a web UI in a paginated way. Everything works, no issue there.
Now I want to improve the performance of my application. After some searching on Google and Stack Overflow I found out about PySpark parallelism.
What I know is that:
PySpark parallelism works by default and creates parallel tasks based on the number of cores the system has.
Also, for this to work, the data should be partitioned.
Please correct me if my understanding is not right.
Questions/doubts:
I am reading data from one Parquet file, so my data is not partitioned, and using the .repartition() method on my dataframe is expensive. So how should I use PySpark parallelism here?
Also, I could not find any simple implementation of PySpark parallelism that explains how to use it.
In a Spark cluster, one core reads one partition, so if you are on a multi-node Spark cluster you need to leave some memory for the existing resource manager, such as YARN.
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
You can use repartition and specify the number of partitions:
df.repartition(n)
where n is the number of partitions. Repartition is for parallelism; it will be less expensive than processing your single file without any partitioning. A sketch is shown below.
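A minimal sketch of that idea, assuming a placeholder file path and column name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pagination").getOrCreate()

# One parquet file may still be split into several partitions, depending on
# its size and spark.sql.files.maxPartitionBytes.
df = spark.read.parquet("/path/to/data.parquet")    # placeholder path
print(df.rdd.getNumPartitions())

# Repartition only if the current count is too low to keep all cores busy;
# the one-time shuffle is usually paid back by the parallel work that follows.
n = spark.sparkContext.defaultParallelism
df = df.repartition(n)

result = df.groupBy("some_column").count()          # placeholder transformation
result.show()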

Batch processing job (Spark) with lookup table that's too big to fit into memory

I'm trying to write a batch job to process a couple of hundred terabytes that currently sit in an HBase database (in an EMR cluster in AWS), all in a single large table. For every row I'm processing, I need to get additional data from a lookup table (a simple integer-to-string mapping) that is in a second HBase table. We'd be doing 5-10 lookups per row.
My current implementation uses a Spark job that's distributing partitions of the input table to its workers, in the following shape:
Configuration hBaseConfig = newHBaseConfig();
hBaseConfig.set(TableInputFormat.SCAN, convertScanToString(scan));
hBaseConfig.set(TableInputFormat.INPUT_TABLE, tableName);
JavaPairRDD<ImmutableBytesWritable, Result> table = sparkContext.newAPIHadoopRDD(hBaseConfig, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
table.map(val -> {
    // some preprocessing
}).foreachPartition(p -> {
    p.forEachRemaining(row -> {
        // code that does the lookup
    });
});
The problem is that the lookup table is too big to fit in the workers' memory. They all need access to all parts of the lookup table, but their access pattern would significantly benefit from a cache.
Am I right in thinking that I cannot use a simple map as a broadcast variable because it'd need to fit into memory?
Spark uses a shared-nothing architecture, so I imagine there won't be an easy way to share a cache across all workers, but can we build a simple LRU cache for every individual worker?
How would I implement such a local worker cache that gets the data from the lookup table in HBase on a cache miss? Can I somehow distribute a reference to the second table to all workers?
I'm not set on my choice of technology, apart from HBase as the data source. Is there a framework other than Spark which could be a better fit for my use case?
You have a few options for dealing with this requirement:
1 - Use RDD or Dataset joins
You can load both of your HBase tables as Spark RDDs or Datasets and then do a join on your lookup key (see the sketch below).
Spark will split both RDDs into partitions and shuffle content around so that rows with the same keys end up on the same executors.
By managing the number of partitions within Spark, you should be able to join two tables of arbitrary size.
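A rough PySpark sketch of option 1; load_main_table and load_lookup_table are hypothetical placeholders for however you read the two HBase tables into DataFrames:

# load_main_table / load_lookup_table are hypothetical: replace them with
# however you turn the two HBase tables into DataFrames.
main_df = load_main_table(spark)       # columns: row_key, lookup_id, ...
lookup_df = load_lookup_table(spark)   # columns: lookup_id, label

# Size the shuffle to the data; both sides get hash-partitioned on the join key.
spark.conf.set("spark.sql.shuffle.partitions", 2000)

resolved = main_df.join(lookup_df, on="lookup_id", how="left")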
2 - Broadcast a resolver instance
Instead of broadcasting a map, you can broadcast a resolver instance that does the HBase lookup and keeps a temporary LRU cache. Each executor gets a copy of this instance and can manage its own cache, and you can invoke it within your foreachPartition() code.
Beware: the resolver instance needs to implement Serializable, so you will have to declare the cache, the HBase connections and the HBase Configuration properties as transient so that they are initialized on each executor.
I run such a setup in Scala on one of the projects I maintain: it works and can be more efficient than the straight Spark join if you know your access patterns and manage your cache efficiently.
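A rough PySpark analogue of option 2, just to show the shape of the pattern (the answer's setup is Scala, where the connection and cache would be @transient; here open_hbase_connection and hbase_get are placeholders for your HBase client calls, and main_df is the table being processed):

from collections import OrderedDict

class LookupResolver:
    """Broadcast-friendly resolver: the connection and cache are created
    lazily on each executor, never shipped from the driver."""

    def __init__(self, hbase_host, table_name, cache_size=100000):
        self.hbase_host = hbase_host
        self.table_name = table_name
        self.cache_size = cache_size
        self._conn = None      # created per executor on first use
        self._cache = None     # simple per-executor LRU

    def _ensure_init(self):
        if self._cache is None:
            self._cache = OrderedDict()
            self._conn = open_hbase_connection(self.hbase_host)  # placeholder call

    def resolve(self, key):
        self._ensure_init()
        if key in self._cache:
            self._cache.move_to_end(key)              # mark as recently used
            return self._cache[key]
        value = hbase_get(self._conn, self.table_name, key)      # placeholder call
        self._cache[key] = value
        if len(self._cache) > self.cache_size:
            self._cache.popitem(last=False)           # evict least recently used
        return value

resolver_bc = spark.sparkContext.broadcast(LookupResolver("hbase-host", "lookup"))

def process_partition(rows):
    resolver = resolver_bc.value
    for row in rows:
        label = resolver.resolve(row.lookup_id)       # lookup_id is a placeholder column
        # ... per-row processing goes here ...

main_df.rdd.foreachPartition(process_partition)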
3 - Use the HBase Spark connector to implement your lookup logic
Apache HBase has recently incorporated an improved HBase Spark connector.
The documentation is pretty sparse right now; you need to look at the JIRA tickets and the documentation of the previous incarnation of these tools, Cloudera's SparkOnHBase, but the last unit test in the test suite looks pretty much like what you want.
I have no experience with this API, though.

How to force Spark Dataframe to be split across all the worker nodes?

I want to create a small dataframe with just 10 rows. And I want to force this dataframe to be distributed to two worker nodes. My cluster has only two worker nodes. How do I do that?
Currently, whenever I create such a small dataframe, it gets persisted in only one worker node.
I know Spark is built for big data and this question does not make much sense. However, conceptually, I just wanted to know whether it is at all feasible to force a Spark dataframe to be split across all the worker nodes (given a very small dataframe with only 10-50 rows).
Or is it completely impossible and we have to rely on the Spark master for this dataframe distribution?
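For what it's worth, a minimal PySpark sketch of the usual approach is to repartition the small dataframe into at least as many partitions as there are executors; Spark still decides where each task runs, so node placement is not guaranteed:

df = spark.range(10)                # a tiny 10-row dataframe
df = df.repartition(2)              # ask for two partitions, ideally one per worker
print(df.rdd.getNumPartitions())    # 2
df.count()                          # runs two tasks the scheduler can place on both nodes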

Is there a way to control the distribution of spark partitions across nodes in a cluster?

I have an 8 node cluster and I load two dataframes from a jdbc source like this:
positionsDf = spark.read.jdbc(
    url=connStr,
    table=positionsSQL,
    column="PositionDate",
    lowerBound=41275,
    upperBound=42736,
    numPartitions=128 * 3,
    properties=props
)
positionsDf.cache()

varDatesDf = spark.read.jdbc(
    url=connStr,
    table=datesSQL,
    column="PositionDate",
    lowerBound=41275,
    upperBound=42736,
    numPartitions=128 * 3,
    properties=props
)
varDatesDf.cache()
res = varDatesDf.join(positionsDf, on='PositionDate').count()
I can see from the Storage tab of the application UI that the partitions are evenly distributed across the cluster nodes. However, what I can't tell is how they are distributed across the nodes. Ideally, both dataframes would be distributed in such a way that the joins are always local to the node, or even better, local to the executors.
In other words, will the positionsDf dataframe partition that contains records with PositionDate="01 Jan 2016" be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"? Will they be on the same node? Or is it just random?
Is there any way to see what partitions are on which node?
Does spark distribute the partitions created using a column key like this in a deterministic way across nodes? Will they always be node/executor local?
will the positionsDF dataframe partition that contains records with PositionDate="01 Jan 2016", be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"
In general it won't be. Even if the data were co-partitioned (it is not here), co-partitioning doesn't imply co-location.
Is there any way to see what partitions are on which node?
This relation doesn't have to be fixed over time; a task can, for example, be rescheduled. You can use various RDD tricks (TaskContext) or the database logs, but it is not reliable.
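A sketch of the TaskContext trick, purely for inspection (it reuses positionsDf from the question; the partition-to-host mapping it prints can change between runs):

import socket
from pyspark import TaskContext

def tag_partition(rows):
    # Record which partition ran in this task and on which host.
    pid = TaskContext.get().partitionId()
    host = socket.gethostname()
    yield (pid, host, sum(1 for _ in rows))

print(positionsDf.rdd.mapPartitions(tag_partition).collect())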
would be distributed in such a way that the joins are always local to the node, or even better local to the executors.
The scheduler has its own internal optimizations, and low-level APIs allow you to set node preferences, but this type of thing is not controllable in Spark SQL.

Spark streaming RDD partitions

In Spark Streaming, is it possible to assign specific RDD partitions to specific nodes in the cluster (for data locality)?
For example, I get a stream of events [a,a,a,b,b,b] and have a 2 node Spark cluster.
I want all a's to always go to Node 1 and all b's to always go to Node 2.
Thanks!
This is possible by specifying a custom partitioner for your RDD. The RangePartitioner will partition your RDD based on ranges of keys, but you can implement any partitioning logic you want with a custom partitioner. It is generally useful/important for partitions to be relatively balanced, and depending on your input data, doing something like this could cause problems (e.g. stragglers), so be careful.
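As a rough illustration in PySpark (a plain RDD stands in for the stream here, and sc is an existing SparkContext): partitionBy with a custom partition function routes keys to fixed partitions, though which node each partition runs on is still up to the scheduler:

# partitionBy needs (key, value) pairs, so key the records first.
events = sc.parallelize(["a", "a", "a", "b", "b", "b"]).map(lambda e: (e, e))

def route(key):
    # Send every "a" to partition 0 and everything else to partition 1.
    return 0 if key == "a" else 1

routed = events.partitionBy(2, route)
print(routed.glom().collect())   # [[('a', 'a'), ...], [('b', 'b'), ...]]

In a streaming job the same partitionBy call would go inside a transform() on the DStream, but the caveat above still applies: a partitioner fixes which partition a key goes to, not which physical node runs it.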
