How to use a broadcast collection in a udf? - apache-spark

How do I use a broadcast collection in a Spark SQL 1.6.1 UDF? The UDF should be called from the main SQL statement as shown below:
sqlContext.sql("""Select col1,col2,udf_1(key) as value_from_udf FROM table_a""")
udf_1() should look up the key in a small broadcast collection and return the corresponding value to the main SQL query.

Here's a minimal reproducible example in PySpark, illustrating the use of broadcast variables to perform lookups, with a lambda function registered as a UDF and called from a SQL statement.
# Create dummy data and register as table
df = sc.parallelize([
    (1, "a"),
    (2, "b"),
    (3, "c")]).toDF(["num", "let"])
df.registerTempTable('table')

# Create broadcast variable from local dictionary
myDict = {1: "y", 2: "x", 3: "z"}
broadcastVar = sc.broadcast(myDict)
# Alternatively, if your dict is a key-value rdd,
# you can do sc.broadcast(rddDict.collectAsMap())

# Create lookup function and apply it
sqlContext.registerFunction("lookup", lambda x: broadcastVar.value.get(x))
sqlContext.sql('select num, let, lookup(num) as test from table').show()
+---+---+----+
|num|let|test|
+---+---+----+
|  1|  a|   y|
|  2|  b|   x|
|  3|  c|   z|
+---+---+----+
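Since the question targets Spark SQL 1.6.1 and the query in it reads like Scala, here is a hedged Scala sketch of the same pattern: broadcast a small Map on the driver and close over it in a UDF registered for SQL. The map contents, key type, and table name below are illustrative only.
// Broadcast a small driver-side map and look it up from a SQL-registered UDF.
// Map contents, key type, and table name are made up for illustration.
val smallLookup = Map("k1" -> "v1", "k2" -> "v2")
val broadcastLookup = sc.broadcast(smallLookup)

sqlContext.udf.register("udf_1", (key: String) => broadcastLookup.value.get(key).orNull)

sqlContext.sql("""SELECT col1, col2, udf_1(key) AS value_from_udf FROM table_a""")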

Related

Is there a way to use a map/dict in Pyspark to avoid CASE WHEN condition equals pairs?

I have a problem in Pyspark: creating a column in a new dataframe based on the values of another column.
It's tedious, and it doesn't seem like good practice to me, to use a long chain of
CASE
WHEN column_a = 'value_1' THEN 'value_x'
WHEN column_a = 'value_2' THEN 'value_y'
...
WHEN column_a = 'value_289' THEN 'value_xwerwz'
END
In cases like this, in Python, I'm used to using a dict or, even better, a configparser file, and avoiding the if/else chain: I just pass the key and Python returns the desired value, with a fallback option playing the role of the ELSE clause.
The problem, as I see it, is that we are not handling a single row but all of them in one command, so using a dict/map/configparser seems unavailable. I thought about looping over a dict, but that seems slow and a waste of computation, since we would repeat all the conditions.
I'm still looking for such an approach; if I find one, I'll post it here. Probably a lot of people already use something like this and I just don't know it yet. But if there is no other way, fine; a long list of WHEN/THEN conditions just isn't a choice I'd like to make.
Thank you
I tried to use a dict and searched for solutions like this.
You could create a function which converts a dict into a Spark F.when, e.g.:
import pyspark.sql.functions as F

def create_spark_when(column, conditions, default):
    when = None
    for key, value in conditions.items():
        current_when = F.when(F.col(column) == key, value)
        if when is None:
            when = current_when.otherwise(default)
        else:
            when = current_when.otherwise(when)
    return when
df = spark.createDataFrame([(0,), (1,), (2,)])
df.show()

my_conditions = {1: "a", 2: "b"}
my_default = "c"

df.withColumn(
    "my_column",
    create_spark_when("_1", my_conditions, my_default),
).show()
Output:
+---+
| _1|
+---+
|  0|
|  1|
|  2|
+---+
+---+---------+
| _1|my_column|
+---+---------+
|  0|        c|
|  1|        a|
|  2|        b|
+---+---------+
Another option is to create a dataframe out of the dictionary and perform a join. This would work:
Creating a Dataframe:
dict={"value_1": "value_x", "value_2": "value_y"}
dict_df=spark.createDataFrame([(k,v) for k,v in dict.items()], ["key","value"])
Performing the join:
import pyspark.sql.functions as F

df.alias("df1")\
    .join(F.broadcast(dict_df.alias("df2")), F.col("column_a") == F.col("key"))\
    .selectExpr("df1.*", "df2.value as newColumn")\
    .show()
We can broadcast the dict_df as it is small.
Alternatively, you can use a UDF - but that is not recommended.

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

I am using Spark 2.4.5 and I need to calculate a sentiment score from a token-list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of scores of the tokens and another column with the mean score (sum of scores / word count) of each record. If any token in the list (df1) is not in the dictionary (df2), it scores zero.
The DataFrames look like this:
df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
|                ID|               MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003|                 [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+
df2.show(5)
+-----+----------+
|score|      word|
+-----+----------+
| 1.68|abandonado|
| 3.18|    abejas|
|  2.8|    aborto|
| 2.46| abrasador|
| 8.13|    abrazo|
+-----+----------+
The new columns to add in df1, should look like this:
+------------------+---------------------+
|         MeanScore|            ScoreList|
+------------------+---------------------+
|              2.95|[3.10, 2.50, 1.28,...|
|              2.15|[1.15, 3.50, 2.75,...|
|              2.75|[4.20, 1.00, 1.75,...|
|              3.25|[3.25, 2.50, 3.20,...|
|              3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+
I have reviewed some options using .join, but joining on columns with different data types gives an error.
I have also tried converting the DataFrames to RDDs and calling a function:
def map_words_to_values(review_words, dict_df):
    return [dict_df[word] for word in review_words if word in dict_df]

RDD1 = swRemoved.rdd.map(list)
RDD2 = Dict_df.rdd.map(list)
reviewsRDD_dict_values = RDD1.map(lambda tuple: (tuple[0], map_words_to_values(tuple[1], RDD2)))
reviewsRDD_dict_values.take(3)
But with this option I get the error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I have found some examples that score text using the afinn library, but it doesn't work with Spanish text.
I want to use native PySpark functions instead of UDFs to avoid hurting performance, if possible, but I'm a beginner in Spark and would like to find the Spark way to do this.
Welcome to SO. You could do this by first joining on an array_contains(MeaningfulWords, word) condition, then using groupBy with the aggregations first, collect_list, and mean (Spark 2.4+).
df1.show()
#+------------------+----------------------------+
#|ID                |MeaningfulWords             |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien]               |
#+------------------+----------------------------+
df2.show()
#+-----+---------+
#|score|     word|
#+-----+---------+
#| 1.68|     casa|
#|  2.8|  alejado|
#| 1.03|     buen|
#| 3.68|    gusto|
#| 0.68| clientes|
#|  2.1|contentos|
#| 2.68|    servi|
#| 1.18|    resto|
#| 1.98|     bien|
#+-----+---------+
from pyspark.sql import functions as F

df1.join(df2, F.expr("array_contains(MeaningfulWords,word)"), 'left')\
   .groupBy("ID")\
   .agg(F.first("MeaningfulWords").alias("MeaningfullWords"),
        F.collect_list("score").alias("ScoreList"),
        F.mean("score").alias("MeanScore"))\
   .show(truncate=False)
#+------------------+----------------------------+-----------------------+------------------+
#|ID                |MeaningfullWords            |ScoreList              |MeanScore         |
#+------------------+----------------------------+-----------------------+------------------+
#|abcde00000qMQ00003|[resto, bien]               |[1.18, 1.98]           |1.58              |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68]      |1.8200000000000003|
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975            |
#+------------------+----------------------------+-----------------------+------------------+

spark aggregate set of events per key including their change timestamps

For a dataframe of:
+----+--------+-------------------+----+
|user|      dt|         time_value|item|
+----+--------+-------------------+----+
| id1|20200101|2020-01-01 00:00:00|   A|
| id1|20200101|2020-01-01 10:00:00|   B|
| id1|20200101|2020-01-01 09:00:00|   A|
| id1|20200101|2020-01-01 11:00:00|   B|
+----+--------+-------------------+----+
I want to capture all the unique items (i.e. collect_set), but retain each item's own time_value.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.functions.collect_set
import org.apache.spark.sql.types.TimestampType
val timeFormat = "yyyy-MM-dd HH:mm"
val dx = Seq(("id1", "20200101", "2020-01-01 00:00", "A"), ("id1", "20200101","2020-01-01 10:00", "B"), ("id1", "20200101","2020-01-01 9:00", "A"), ("id1", "20200101","2020-01-01 11:00", "B")).toDF("user", "dt","time_value", "item").withColumn("time_value", unix_timestamp(col("time_value"), timeFormat).cast(TimestampType))
dx.show
dx.groupBy("user", "dt").agg(collect_set("item")).show
+----+--------+-----------------+
|user|      dt|collect_set(item)|
+----+--------+-----------------+
| id1|20200101|           [B, A]|
+----+--------+-----------------+
does not retain the time_value information of when the signal switched from A to B. How can I keep the time_value information for each item in the set?
Would it be possible to have the collect_set within a window function to achieve the desired result? Currently, I can only think of:
use a window function to determine pairs of events
filter to change events
aggregate
which needs to shuffle multiple times. Alternatively, a UDF would be possible (collect_list(sort_array(struct(time_value, item)))) but that also seems rather clumsy.
Is there a better way?
I would indeed use window functions to isolate the change points; I don't think there are alternatives:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, lead, lit}

val win = Window.partitionBy($"user", $"dt").orderBy($"time_value")
dx
  .orderBy($"time_value")
  .withColumn("item_change_post", coalesce(lag($"item", 1).over(win) =!= $"item", lit(false)))
  .withColumn("item_change_pre", lead($"item_change_post", 1).over(win))
  .where($"item_change_pre" or $"item_change_post")
  .show()
+----+--------+-------------------+----+----------------+---------------+
|user|      dt|         time_value|item|item_change_post|item_change_pre|
+----+--------+-------------------+----+----------------+---------------+
| id1|20200101|2020-01-01 09:00:00|   A|           false|           true|
| id1|20200101|2020-01-01 10:00:00|   B|            true|          false|
+----+--------+-------------------+----+----------------+---------------+
then use something like groupBy($"user",$"dt").agg(collect_list(struct($"time_value",$"item")))
I don't think that multiple shuffles occur, because you always partition/group by the same keys.
You can try to make it more efficient by aggregating your initial dataframe to the min/max time_value for each item, then do the same as above.
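Here is a rough sketch of those last two steps, reusing the dx dataframe from above and assuming the filtered dataframe from the previous snippet is bound to a value called changes; the pre-aggregation variant is only a sketch of the idea described above.
import org.apache.spark.sql.functions.{collect_list, struct, min, max}

// `changes` is assumed to be the filtered dataframe shown above
// (the rows where item_change_pre or item_change_post is true).
val grouped = changes
  .groupBy($"user", $"dt")
  .agg(collect_list(struct($"time_value", $"item")).as("item_changes"))

// Possible pre-aggregation mentioned above: collapse each (user, dt, item)
// to its earliest/latest time_value before isolating the change points.
val compacted = dx
  .groupBy($"user", $"dt", $"item")
  .agg(min($"time_value").as("first_seen"), max($"time_value").as("last_seen"))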

create column with a running total in a Spark Dataset

Suppose we have a Spark Dataset with two columns, say Index and Value, sorted by the first column (Index).
((1, 100), (2, 110), (3, 90), ...)
We'd like to have a Dataset with a third column with a running total of the values in the second column (Value).
((1, 100, 100), (2, 110, 210), (3, 90, 300), ...)
Any suggestions on how to do this efficiently, with one pass through the data? Or are there any canned CDF-type functions out there that could be used for this?
If need be, the Dataset can be converted to a Dataframe or an RDD to accomplish the task, but it will have to remain a distributed data structure. That is, it cannot be simply collected and turned to an array or sequence, and no mutable variables are to be used (val only, no var).
but it will have to remain a distributed data structure.
Unfortunately what you've said you seek to do isn't possible in Spark. If you are willing to repartition the data set to a single partition (in effect consolidating it on a single host) you could easily write a function to do what you wish, keeping the incremented value as a field.
Since Spark functions don't share state across the network when they execute, there's no way to create the shared state you would need to keep the data set completely distributed.
If you're willing to relax your requirement and allow the data to be consolidated and read through in a single pass on one host then you may do what you wish by repartitioning to a single partition and applying a function. This does not pull the data onto the driver (keeping it in HDFS/the cluster) but does still compute the output serially, on a single executor. For example:
package com.github.nevernaptitsa

import java.io.Serializable

import org.apache.spark.sql.{Encoders, SparkSession}

object SparkTest {

  class RunningSum extends Function[Int, Tuple2[Int, Int]] with Serializable {
    private var runningSum = 0
    override def apply(v1: Int): Tuple2[Int, Int] = {
      runningSum += v1
      (v1, runningSum)
    }
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .appName("runningSumTest")
      .master("local[*]")
      .getOrCreate()
    import session.implicits._

    session.createDataset(Seq(1, 2, 3, 4, 5))
      .repartition(1)
      .map(new RunningSum)
      .show(5)

    session.createDataset(Seq(1, 2, 3, 4, 5))
      .map(new RunningSum)
      .show(5)
  }
}
The two statements here show different output, the first providing the correct output (serial, because repartition(1) is called), and the second providing incorrect output because the result is computed in parallel.
Results from first statement:
+---+---+
| _1| _2|
+---+---+
|  1|  1|
|  2|  3|
|  3|  6|
|  4| 10|
|  5| 15|
+---+---+
Results from second statement:
+---+---+
| _1| _2|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  9|
+---+---+
A colleague suggested the following, which relies on the RDD.mapPartitionsWithIndex() method.
(To my knowledge, the other data structures do not provide this kind of reference to their partitions' indices.)
val data = sc.parallelize(1 to 5)  // sc is the SparkContext

val partialSums = data.mapPartitionsWithIndex { (i, values) =>
  Iterator((i, values.sum))
}.collect().toMap  // will in general have size other than data.count

val cumSums = data.mapPartitionsWithIndex { (i, values) =>
  val prevSums = (0 until i).map(partialSums).sum
  values.scanLeft(prevSums)(_ + _).drop(1)
}
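As a quick sanity check, collecting the result should reproduce the running totals, since sc.parallelize(1 to 5) keeps the elements in order across partition indices:
// Expected output: 1, 3, 6, 10, 15 regardless of how the range was partitioned
println(cumSums.collect().mkString(", "))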

Does GraphFrames api support creation of Bipartite graphs?

Does GraphFrames api support creation of Bipartite graphs in the current version?
Current version: 0.1.0
Spark version : 1.6.1
As pointed out in the comments to this question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both have more than enough flexibility to let you create bipartite graphs. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex/object types. And while that works with RDDs, it's not going to work for DataFrames. A row in a DataFrame has a fixed schema -- it can't sometimes contain a price column and sometimes not. It can have a price column that's sometimes null, but the column has to exist in every row.
Instead, the solution for GraphFrames seems to be that you need to define a DataFrame that's essentially a linear sub-type of both types of objects in your bipartite graph -- it has to contain all of the fields of both types of objects. This is actually pretty easy -- a join with full_outer is going to give you that. Something like this:
val players = Seq(
  (1, "dave", 34),
  (2, "griffin", 44)
).toDF("id", "name", "age")

val teams = Seq(
  (101, "lions", "7-1"),
  (102, "tigers", "5-3"),
  (103, "bears", "0-9")
).toDF("id", "team", "record")
You could then create a super-set DataFrame like this:
val teamPlayer = players.withColumnRenamed("id", "l_id").join(
    teams.withColumnRenamed("id", "r_id"),
    $"r_id" === $"l_id", "full_outer"
  ).withColumn("l_id", coalesce($"l_id", $"r_id"))
  .drop($"r_id")
  .withColumnRenamed("l_id", "id")
teamPlayer.show
+---+-------+----+------+------+
| id|   name| age|  team|record|
+---+-------+----+------+------+
|101|   null|null| lions|   7-1|
|102|   null|null|tigers|   5-3|
|103|   null|null| bears|   0-9|
|  1|   dave|  34|  null|  null|
|  2|griffin|  44|  null|  null|
+---+-------+----+------+------+
You could possibly do it a little cleaner with structs:
val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
    teams.select($"id" as "r_id", struct($"team", $"record") as "team"),
    $"l_id" === $"r_id",
    "full_outer"
  ).withColumn("l_id", coalesce($"l_id", $"r_id"))
  .drop($"r_id")
  .withColumnRenamed("l_id", "id")
tpStructs.show
+---+------------+------------+
| id|      player|        team|
+---+------------+------------+
|101|        null| [lions,7-1]|
|102|        null|[tigers,5-3]|
|103|        null| [bears,0-9]|
|  1|   [dave,34]|        null|
|  2|[griffin,44]|        null|
+---+------------+------------+
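To turn this vertex DataFrame into an actual bipartite graph, you would pair it with an edges DataFrame whose src and dst columns reference the combined id column and pass both to the GraphFrame constructor. A sketch, with made-up player-to-team edges:
import org.graphframes.GraphFrame

// Hypothetical edges: each player points at the team they play for.
// src/dst values must match the id column of the vertex DataFrame (tpStructs).
val edges = Seq(
  (1, 101, "playsFor"),
  (2, 102, "playsFor")
).toDF("src", "dst", "relationship")

val g = GraphFrame(tpStructs, edges)
g.vertices.show()
g.edges.show()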
I'll also point out that more or less the same solution would work in GraphX with RDDs. You could always create a vertex via joining two case classes that don't share any traits:
case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
  (1L, Player("date", 34)),
  (2L, Player("griffin", 44))
))

case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
  (101L, Team("lions", "7-1")),
  (102L, Team("tigers", "5-3")),
  (103L, Team("bears", "0-9"))
))

playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(date,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))
With all respect to the previous answer, this seems like a more flexible way to handle it -- without having to share a trait between the combined objects.
