Is there a difference between distinct() and reduceByKey() in Spark? - apache-spark

I have an RDD with a type like this: RDD[((String), SomeDTO)]
This RDD comes from a union, and I can be sure that elements with the same key have the same value. So if I want to deduplicate all the elements of the RDD, what is the difference between the two methods below?
// first
context.union(Array(rdd1, rdd2)).distinct()
// second
context.union(Array(rdd1, rdd2)).reduceByKey((_, curr) => curr)
I'm a beginner with Spark; the only difference I know of is that distinct() runs more slowly.

Referring to the source code https://github.com/apache/spark/blob/5d45a415f3a29898d92380380cfd82bfc7f579ea/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L449 , distinct also follows the reduceByKey approach, so you should be alright. distinct would not be slower than reduceByKey.
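For reference, this is roughly what the linked implementation does. A simplified sketch (not verbatim Spark source) of deduplicating via reduceByKey, using the RDDs from the question:

// distinct internally maps each element to a (value, null) pair, collapses
// duplicates with reduceByKey, and maps the original values back out.
val unioned = context.union(Array(rdd1, rdd2))
val deduped = unioned.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)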

Related

Choose between map+inner loop and flatMapValues+reduceByKey

I have data like the following in a pair RDD, and I would like to collect a map with the username as key and the sum of each list as value. The number of users is very large, say 100m+, and the lists are <1k in size. There are two choices I can think of: mapToPair and sum the list with a simple for loop inside mapToPair, or flatMapValues the list to create <user, value> pairs and then reduceByKey. Which way is better?
Seq(
("user1",List(8,2,....)),
("user2",List(1,12,.....)),
...
("userN",List(99,5,...))
)
I would guess rdd.mapValues(_.sum) would be faster because you iterate over the elements once instead of twice (once to flatten, once to reduce).
But the best answer would be to just test it and see.
The best tip I can think of, though, is to try to work with DataFrames or Datasets (Spark SQL) to begin with. If you end up with a flattened DataFrame you can call df.groupBy($"user").agg(F.sum($"value")), or if you have a DataFrame shaped like the RDD you described you can just use the aggregate SQL function.
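To make the comparison concrete, here is a hedged sketch of the options on a pair RDD shaped like the Seq above; pairRdd, flatDf and the column names are placeholders, not taken from the question.

// Option 1: sum each user's list locally in a single pass
val perUser1 = pairRdd.mapValues(_.sum)
// Option 2: flatten to (user, value) pairs first, then reduce by key
val perUser2 = pairRdd.flatMapValues(identity).reduceByKey(_ + _)
// DataFrame alternative, assuming an already flattened (user, value) DataFrame flatDf
import org.apache.spark.sql.functions.sum
val perUserDf = flatDf.groupBy("user").agg(sum("value"))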

Spark Dataframe map function

val df1 = Seq(("Brian", 29, "0-A-1234")).toDF("name", "age", "client-ID")
val df2 = Seq(("1234", "555-5555", "1234 anystreet")).toDF("office-ID", "BusinessNumber", "Address")
I'm trying to run a function on each row of a DataFrame (in streaming). This function will contain a combination of Scala code and Spark DataFrame API code. For example, I want to take the 3 features from df and use them to filter a second DataFrame called df2. My understanding is that a UDF can't accomplish this. Right now I have all the filtering code working just fine, but without the ability to apply it to each row of df.
My goal is to be able to do something like
df.select("ID","preferences").map(row => ( //filter df2 using row(0), row(1) and row(3) ))
The dataframes can't be joined; there is no joinable relationship between them.
Although I'm using Scala, an answer in Java or Python would probably be fine.
I'm also fine with alternative ways of accomplishing this. If I could extract the data from the rows into separate variables (keep in mind this is streaming), that's also fine.
My understanding is that a UDF can't accomplish this.
It is correct, but neither can map (local Datasets seem to be an exception; see Why does this Spark code make NullPointerException?). Nested logic like this can be expressed only using joins:
If both Datasets are streaming it has to be an equijoin. That means that even though:
The dataframes can't be joined; there is no joinable relationship between them.
you have to derive one in some way that approximates the filter condition well.
If one Dataset is not streaming, you can brute force things with crossJoin followed by filter, but that is of course hardly recommended.
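As a rough illustration of that non-streaming fallback, a sketch using the df and df2 from the question; the predicate is only a placeholder for the real per-row filtering logic, and spark is an assumed SparkSession.

import spark.implicits._   // assumed SparkSession named spark
// Brute force: pair every row of df with every row of the static df2,
// then keep the combinations that satisfy the row-level condition.
val paired = df.crossJoin(df2)
val result = paired.filter($"client-ID".contains($"office-ID"))   // placeholder predicate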

Difference between reduce and reduceByKey in Apache Spark

What is the difference between reduce and reduceByKey in Apache Spark in terms of their functionality?
Why is reduceByKey a transformation while reduce is an action?
This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate on the specific part that makes the two different. However, refer to that answer for a bit more detail on the internals of reduceByKey.
Basically, reduce must pull the entire dataset down into a single location because it is reducing to one final value. reduceByKey, on the other hand, produces one value for each key. And since this operation can be run on each machine locally first, the result can remain an RDD and have further transformations applied to it.
Note, however, that there is also a reduceByKeyLocally you can use to automatically pull the Map down to a single location.
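A small sketch of the difference in practice, assuming an existing SparkContext named sc:

val nums  = sc.parallelize(Seq(1, 2, 3, 4))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// reduce is an action: it collapses the whole RDD to one value on the driver
val total: Int = nums.reduce(_ + _)              // 10
// reduceByKey is a transformation: one value per key, still an RDD
val perKey = pairs.reduceByKey(_ + _)            // RDD[(String, Int)]
perKey.collect()                                 // Array((a,3), (b,3))
// reduceByKeyLocally also reduces per key but returns a local Map to the driver
val localMap = pairs.reduceByKeyLocally(_ + _)   // Map(a -> 3, b -> 3)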
Please go through this official documentation link.
reduce is an action which aggregates the elements of the dataset using a function func (which takes two arguments and returns one); we can also use reduce on single RDDs (for more info please click HERE).
reduceByKey, when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V (for more info please click HERE).
This is from the Qt Assistant:
reduce(f): Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently reduces partitions locally.
reduceByKey(func, numPartitions=None, partitionFunc=): Merge the values for each key using an associative and commutative reduce function.

What's the difference between explode function and operator?

What's the difference between the explode function and the explode operator?
spark.sql.functions.explode
The explode function creates a new row for each element in the given array or map column (of a DataFrame).
val signals: DataFrame = spark.read.json(signalsJson)
signals.withColumn("element", explode($"data.datapayload"))
explode creates a Column.
See functions object and the example in How to unwind array in DataFrame (from JSON)?
Dataset<Row> explode / flatMap operator (method)
The explode operator is almost the same as the explode function.
From the scaladoc:
explode returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
ds.flatMap(_.words.split(" "))
Please note that (again quoting the scaladoc):
Deprecated (Since version 2.0.0) use flatMap() or select() with functions.explode() instead
See Dataset API and the example in How to split multi-value column into separate rows using typed Dataset?
Despite explode being deprecated (so we could translate the main question into the difference between the explode function and the flatMap operator), the difference is that the former is a function while the latter is an operator. They have different signatures, but can give the same results. That often leads to discussions about which is better, and it usually boils down to personal preference or coding style.
One could also say that flatMap (i.e. the explode operator) is more Scala-ish, given how ubiquitous flatMap is in Scala programming (mainly hidden behind for-comprehensions).
flatMap performs much better than explode, as flatMap requires much less data shuffling.
If you are processing big data (>5 GB), the performance difference can be seen clearly.
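To see both styles side by side, a minimal sketch on a made-up Dataset; the case class and the SparkSession named spark are assumptions.

import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._                 // assumed SparkSession named spark
case class Sentence(id: Long, text: String)
val ds = Seq(Sentence(1, "hello world")).toDS()
// explode function: works on a Column, the other columns are carried along
val viaFunction = ds.withColumn("word", explode(split($"text", " ")))
// flatMap operator: typed, takes a plain Scala function, yields Dataset[String]
val viaFlatMap = ds.flatMap(_.text.split(" "))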

Apache Spark Transformations: groupByKey vs reduceByKey vs aggregateByKey

These three Apache Spark transformations are a little confusing. Is there any way I can determine when to use which one, and when to avoid one?
I think the official guide explains it well enough.
I will highlight the differences (you have an RDD of type (K, V)):
1. if you need to keep the values, then use groupByKey
2. if you don't need to keep the values, but you do need some aggregated info about each group (the items of the original RDD which have the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is a particular case of aggregateByKey)
2.1 if you can provide an operation which takes (V, V) as input and returns V, so that all the values of the group can be reduced to one single value of the same type, then use reduceByKey. As a result you will have an RDD of the same (K, V) type.
2.2 if you cannot provide this aggregation operation, then use aggregateByKey. That happens when you reduce the values to another type, so you will have (K, V2) as a result.
In addition to @Hlib's answer, I would like to add a few more points.
groupByKey() just groups your dataset based on a key.
reduceByKey() is something like grouping + aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...).
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregate result of type y, for example (1,2),(1,4) as input and (1,"six") as output.
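A hedged sketch of the three transformations side by side, assuming a SparkContext named sc:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 4)))
// groupByKey: keeps every value for the key -> RDD[(String, Iterable[Int])]
val grouped = pairs.groupByKey()
// reduceByKey: (V, V) => V, the result type stays (K, V) -> RDD[(String, Int)]
val summed = pairs.reduceByKey(_ + _)
// aggregateByKey: the result can be a different type, e.g. a (sum, count)
// accumulator turned into an average -> RDD[(String, Double)]
val averaged = pairs
  .aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),        // fold a value into the accumulator
    (a, b)   => (a._1 + b._1, a._2 + b._2))      // merge two accumulators
  .mapValues { case (sum, count) => sum.toDouble / count }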
