Broadcast a csv? - apache-spark

Assume I'm creating a spark dataset from a shared store of data as follows:
Dataset<Row> item = spark.read().option("delimiter", "|").option("header","true").csv(fName).cache();
Is there a way to tell Spark to broadcast item to all nodes, such that no shuffle is needed to use it? I have a bunch of little lookup tables and I'd like to see if broadcasting them helps avoid shuffles.

You can use two approaches:
collect() the given Dataset and broadcast it manually. You said those files are small, so it's possible. But it will only work inside UDFs / strongly typed operators like map, not with the standard built-in functions.
Example:
import org.apache.spark.sql.functions.udf

val items = item.as[MyCaseClass].collect()
val itemsBcV = spark.sparkContext.broadcast(items)

// later, inside a UDF
val funnyUDF = udf { (x: String) =>
  val valueFromBroadcast = itemsBcV.value
  // ... processing against the broadcast value ...
  valueFromBroadcast.length // placeholder return value
}
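A hedged usage sketch of applying the UDF (the DataFrame someBigTable and the column someColumn are made-up names for illustration):
import org.apache.spark.sql.functions.col
val withLookup = someBigTable.withColumn("lookedUp", funnyUDF(col("someColumn")))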
Preferred: don't broadcast manually; just add a broadcast hint in your processing.
First, import org.apache.spark.sql.functions._
For example:
someBigTable.join(broadcast(item), "id")
In pure SQL syntax it is:
item.createOrReplaceTempView("item")
select /*+ BROADCAST(item) */ * from bigTable join item on bigTable.id = item.id
Spark will manage broadcasting this table itself and use the quicker broadcast hash join instead of a shuffled hash join or sort merge join.
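If you want to confirm that the hint took effect, a minimal sketch (reusing someBigTable and item from the examples above) is to inspect the physical plan:
import org.apache.spark.sql.functions.broadcast
val joined = someBigTable.join(broadcast(item), "id")
joined.explain() // the physical plan should show BroadcastHashJoin rather than SortMergeJoin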

Related

How to join efficiently 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 when trying to efficiently join 2 Spark DataFrames where:
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join, it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row, Window
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in an efficient plan: it recognizes that the 2 DataFrames are already partitioned by the join key and avoids re-shuffling them. The 2 DataFrames are not only partitioned the same way, but also colocated.
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it is repartitioning the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before, there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This, however, results in the same inefficient plan, since Spark's query optimizer simply merges the filter back into the join.
So I finally thought that maybe I could force Spark to process the join the way I want by adding a dummy cache step:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything, it's just to suggest to Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually be reusing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.

Apache spark custom log unfiltered data (LazyLogging)

I'm filtering a column to comply with some validations, and I can do the filtering with Spark built-in functions,
but I need to log the invalid data with a proper message (I am using LazyLogging). Is there any way to do this without a custom UDF, so that I keep Spark's optimizations?
For example, keeping only the names that are at most 20 characters:
df.filter(length($"name") <= lit(20))
In this scenario, how can I log the names that are longer than 20 characters without a custom UDF?
If the result of the filter operation is small enough to fit into your driver, you can collect it and write it to your default logger.
import org.apache.spark.sql.functions.{length, lit}
val logCollection = df.filter(length($"name") > lit(20)).collect()
logCollection.foreach(row => logger.info(row.toString))
As an alternative, you can create a separate stream by applying another writeStream to send the names to a database, the console, etc. Just keep in mind that when you do this, you actually create multiple streaming queries within your SparkSession, each consuming the data independently:
val originalDf = df.[...]
val logDf = df.filter(length($"name") > lit(20))
val originalQuery = originalDf.writeStream.[...].start() // keep logic as is
val logQuery = logDf.writeStream.format("console").[...].start()
spark.streams.awaitAnyTermination()

Is it necessary to broadcast an object member in Spark?

Say I have an object and I need to perform some operations on one of its members, arr.
object A {
  val arr = (0 to 1000000).toList

  def main(args: Array[String]): Unit = {
    // ...init spark context
    val rdd: RDD[Int] = ...
    rdd.map(arr.contains(_)).saveAsTextFile...
  }
}
What is the difference between broadcasted arr and not broadcasted?
i.e.
val arrBr = sc.broadcast(arr)
rdd.map(arrBr.value.contains(_))
and
rdd.map(arr.contains(_))
In my opinion, the object A is a singleton object, so it will be transferred through the nodes in Spark.
Is it necessary to use broadcast in this scenario?
In the case
rdd.map(arr.contains(_))
arr is serialized and shipped with each task,
while in
val arrBr = sc.broadcast(arr)
rdd.map(arrBr.value.contains(_))
this is only done once per executor.
Therefore you should use broadcast when dealing with large data structures.
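To make the difference concrete, here is a minimal, self-contained sketch of the broadcast variant (the RDD contents and output path are made up for illustration):
// assumes an existing SparkContext sc
val arr = (0 to 1000000).toList
val arrBr = sc.broadcast(arr) // shipped once per executor, not once per task
val rdd = sc.parallelize(1 to 100)
rdd.map(x => arrBr.value.contains(x))
  .saveAsTextFile("/tmp/contains-flags") // hypothetical output path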
Just two additional things to mention besides Raphael's answer, which is correct. You must always consider the size of the variable that you broadcast: it shouldn't be too large, otherwise Spark will have difficulty distributing it efficiently across the cluster. In your case it is roughly:
4 B x 1,000,000 = 4,000,000 B ~ 4 MB
which is already around the default broadcast block size of 4 MB, controlled by spark.broadcast.blockSize (the chunk size Spark uses when distributing broadcast data).
Another factor in deciding whether or not to use broadcast is joins, where you want to avoid shuffling. By broadcasting a DataFrame, its keys are available immediately on every node, which avoids retrieving data from other nodes (shuffling).

Deleting specific column in Cassandra from Spark

I was able to delete a specific column with the RDD API with:
sc.cassandraTable("books_ks", "books")
.deleteFromCassandra("books_ks", "books",SomeColumns("book_price"))
I am struggling to do this with the Dataframe API.
Can someone please share an example?
You cannot delete via the DF API, and it's unnatural via the RDD API. RDDs and DFs are immutable, meaning no modification. You can filter them to cut them down, but this generates a new RDD / DF.
Having said that, what you can do is filter out the rows that you wish to delete and then build a C* client to carry out that deletion:
// imports for Spark and C* connection
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf
spark.setCassandraConf("Test Cluster", CassandraConnectorConf.ConnectionHostParam.option("localhost"))
val df = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "books_ks", "table" -> "books")).load()
val dfToDelete = df.filter($"price" < 3).select($"price");
dfToDelete.show();
// import for C* client
import com.datastax.driver.core._
// build a C* client (part of the dependency of the scala driver)
val clusterBuilder = Cluster.builder().addContactPoints("127.0.0.1");
val cluster = clusterBuilder.build();
val session = cluster.connect();
// loop over everything that you filtered in the DF and delete specified row.
for(price <- dfToDelete.collect())
session.execute("DELETE FROM books_ks.books WHERE price=" + price.get(0).toString);
A few warnings: this won't work well if you're trying to delete a large portion of the rows. Using collect here means that the work is done in Spark's driver program, which is a single point of failure and a bottleneck.
A better way to do this would be to either a) define a DF UDF to carry out the delete, the benefit being that you get parallelization (see the sketch below), or b) drop to the RDD level and do the delete as you've shown above.
Moral of the story, just because it can be done, doesn't mean it should be done.
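That said, if you do want the parallelization mentioned in option a), one hedged sketch (using foreachPartition rather than a UDF, and reusing dfToDelete from above) is to push the deletes to the executors via the connector's CassandraConnector:
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector(spark.sparkContext.getConf)
dfToDelete.rdd.foreachPartition { rows =>
  connector.withSessionDo { session =>
    // each executor opens its own session and deletes the rows in its partition
    rows.foreach { row =>
      session.execute("DELETE FROM books_ks.books WHERE price=" + row.get(0).toString)
    }
  }
}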

How to get an Iterator of Rows using Dataframe in SparkSQL

I have a SparkSQL application which returns a large number of rows that will not fit in memory, so I cannot use the collect function on the DataFrame. Is there a way I can get all these rows as an Iterable instead of getting the entire result as a list?
I am executing this SparkSQL application using yarn-client.
Generally speaking, transferring all the data to the driver looks like a pretty bad idea, and most of the time there is a better solution out there, but if you really want to go with this you can use the toLocalIterator method on an RDD:
val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
Actually, you can just use df.toLocalIterator; here is the reference in the Spark source code:
/**
* Return an iterator that contains all of [[Row]]s in this Dataset.
*
* The iterator will consume as much memory as the largest partition in this Dataset.
*
* Note: this results in multiple Spark jobs, and if the input Dataset is the result
* of a wide transformation (e.g. join with different partitioners), to avoid
* recomputing the input Dataset should be cached first.
*
* @group action
* @since 2.0.0
*/
def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
withNewExecutionId {
queryExecution.executedPlan.executeToIterator().map(boundEnc.fromRow).asJava
}
}
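A minimal usage sketch of this method (assuming an existing DataFrame df, and caching it first per the note above since reading it this way runs multiple Spark jobs):
import scala.collection.JavaConverters._
df.cache()
val rows = df.toLocalIterator().asScala
rows.foreach { row =>
  // only about one partition's worth of rows is held on the driver at a time
  println(row)
}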
