Spark: What does groupBy return?

What is the return value of groupBy and agg in Spark?
(This was one of the confusing parts of pandas that I never quite got, and I guess it is similar here with Spark.)
df.groupBy("col1").agg(max("col2").alias("col2_max"))
Even though the result looks like a regular DataFrame when you call .show() on it, I believe it is not one, because if you chain another .agg after the initial .agg, things get weird.
So what does groupBy return, and what does agg return?

According to the Spark documentation, the DataFrame.groupBy method returns a GroupedData object, which has aggregation methods such as agg, count, sum, avg, etc. The agg method (and the other ones) return a DataFrame.
For further details, review the following documentation links: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy and http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
Hope this helps
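For illustration, here is a minimal PySpark sketch (with hypothetical toy data) that makes the return types visible:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["col1", "col2"])

grouped = df.groupBy("col1")                           # a GroupedData object, not a DataFrame
result = grouped.agg(F.max("col2").alias("col2_max"))  # agg() returns a DataFrame again

print(type(grouped))  # <class 'pyspark.sql.group.GroupedData'>
print(type(result))   # <class 'pyspark.sql.dataframe.DataFrame'>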

Related

Is there any difference between distinct() and reduceByKey() in Spark?

I have an RDD of type RDD[((String), SomeDTO)].
This RDD comes from a union, and I can be sure that elements with the same key have the same value. If I want to deduplicate all elements of the RDD, what is the difference between the two methods below?
// first
context.union(Array(rdd1, rdd2)).distinct()
// second
context.union(Array(rdd1, rdd2)).reduceByKey((_, curr) => curr)
I'm a beginner with Spark; the only difference I've noticed is that distinct() runs more slowly.
Referring to the source code https://github.com/apache/spark/blob/5d45a415f3a29898d92380380cfd82bfc7f579ea/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L449 , distinct itself follows the reduceByKey approach, so you should be alright; distinct would not be slower than reduceByKey.
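For illustration, a minimal sketch of the same idea the linked source follows (in PySpark rather than Scala, with toy data): distinct() keys every element with a dummy value, reduces by key, and drops the dummy.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([("k1", "v1"), ("k2", "v2")])
rdd2 = sc.parallelize([("k1", "v1"), ("k3", "v3")])
unioned = sc.union([rdd1, rdd2])

# Roughly what distinct() does internally: key every element with a dummy value,
# reduce by key (either side can be kept, since duplicates are identical), drop the dummy.
deduped = (unioned.map(lambda x: (x, None))
                  .reduceByKey(lambda a, _: a)
                  .keys())
print(sorted(deduped.collect()))  # [('k1', 'v1'), ('k2', 'v2'), ('k3', 'v3')]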

PySpark: combine aggregate and window functions

I am working with legacy Spark SQL code like this:
SELECT
column1,
max(column2),
first_value(column3),
last_value(column4)
FROM
tableA
GROUP BY
column1
ORDER BY
columnN
I am rewriting it in PySpark as below
df.groupBy(column1).agg(max(column2), first(column3), last(column4)).orderBy(columnN)
When I'm comparing the two outcomes I can see differences in the fields generated by the first_value/first and last_value/last functions.
Are they behaving in a non-deterministic way when used outside of Window functions?
Can groupBy aggregates be combined with Window functions?
This behaviour is possible when you have a wide table and you don't specify an ordering for the remaining columns. What happens under the hood is that Spark takes the first() or last() row, whichever is available to it as the first condition-matching row on the heap; because no ordering is specified for the remaining columns, the Spark SQL and PySpark runs may pick up different rows.
In terms of Window functions, you can use partitionBy(F.col('column_name')) in your Window, which works somewhat like a groupBy: it groups the data according to a partitioning column. However, without specifying an ordering for all columns, you may run into the same non-determinism. Hope this helps!
For completeness' sake, I recommend having a look at the PySpark docs for the first() and last() functions here: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.first
In particular, the following note sheds light on why your behaviour was non-deterministic:
Note The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
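If you need deterministic first/last semantics inside a plain groupBy (rather than a window), one common workaround is to aggregate over structs that carry the ordering column. A sketch using the question's column names, with df standing in for tableA:
import pyspark.sql.functions as F

# min/max over structs compare field by field, so putting columnN first makes the
# aggregation pick the row with the smallest (resp. largest) columnN deterministically.
deterministic = (df.groupBy("column1")
                   .agg(F.max("column2").alias("max_col2"),
                        F.min(F.struct("columnN", "column3")).getField("column3").alias("first_col3"),
                        F.max(F.struct("columnN", "column4")).getField("column4").alias("last_col4")))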
Definitely!
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# With orderBy alone the default frame stops at the current row, so max()/last() would be
# running values; extend the frame to the whole partition to mirror the SQL aggregates.
partition = (Window.partitionBy("column1")
             .orderBy("columnN")
             .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
data = (data.withColumn("max_col2", F.max(F.col("column2")).over(partition))
            .withColumn("first_col3", F.first(F.col("column3")).over(partition))
            .withColumn("last_col4", F.last(F.col("column4")).over(partition)))
data.show(10, False)

Spark Dataframe map function

val df1 = Seq(("Brian", 29, "0-A-1234")).toDF("name", "age", "client-ID")
val df2 = Seq(("1234", "555-5555", "1234 anystreet")).toDF("office-ID", "BusinessNumber", "Address")
I'm trying to run a function on each row of a dataframe (in streaming). This function will contain a combination of Scala code and Spark DataFrame API code. For example, I want to take the 3 features from df and use them to filter a second dataframe called df2. My understanding is that a UDF can't accomplish this. I have all the filtering code working just fine, but without the ability to apply it to each row of df.
My goal is to be able to do something like
df.select("ID","preferences").map(row => ( //filter df2 using row(0), row(1) and row(3) ))
The dataframes can't be joined; there is no joinable relationship between them.
Although I'm using Scala, an answer in Java or Python would probably be fine.
I'm also fine with alternative ways of accomplishing this. If I could extract the data from the rows into separate variables (keep in mind this is streaming), that's also fine.
My understanding is that a UDF can't accomplish this.
That is correct, but neither can map (local Datasets seem to be an exception; see Why does this Spark code make NullPointerException?). Nested logic like this can be expressed only using joins:
If both Datasets are streaming, it has to be an equijoin. That means that even though:
The dataframes can't be joined; there is no joinable relationship between them.
you have to derive one in some way that approximates the filter condition well.
If one Dataset is not streaming, you can brute-force things with crossJoin followed by filter, but that is of course hardly recommended.
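For the non-streaming case, a minimal PySpark sketch of that crossJoin-plus-filter fallback, reusing df/df2 from the question; the matching condition is hypothetical (the client-ID ending with the office-ID, mirroring "0-A-1234" vs "1234"):
# Pair every row of df with every row of df2, then keep only the pairs that satisfy
# the (non-equi) condition you would otherwise have applied row by row.
candidates = df.crossJoin(df2)
matched = candidates.filter(df["client-ID"].endswith(df2["office-ID"]))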

What's the difference between explode function and operator?

What's the difference between explode function and explode operator?
org.apache.spark.sql.functions.explode
The explode function creates a new row for each element in the given array or map column (in a DataFrame).
val signals: DataFrame = spark.read.json(signalsJson)
signals.withColumn("element", explode($"data.datapayload"))
explode creates a Column.
See functions object and the example in How to unwind array in DataFrame (from JSON)?
Dataset<Row> explode / flatMap operator (method)
The explode operator is almost identical to the explode function.
From the scaladoc:
explode returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
ds.flatMap(_.words.split(" "))
Please note that (again quoting the scaladoc):
Deprecated (Since version 2.0.0) use flatMap() or select() with functions.explode() instead
See Dataset API and the example in How to split multi-value column into separate rows using typed Dataset?
Despite explode being deprecated (so we could rephrase the main question as the difference between the explode function and the flatMap operator), the difference is that the former is a function while the latter is an operator. They have different signatures, but can give the same results. That often leads to discussions about which is better, and it usually boils down to personal preference or coding style.
One could also say that flatMap (i.e. the explode operator) is more Scala-ish, given how ubiquitous flatMap is in Scala programming (mainly hidden behind for-comprehensions).
flatMap performs much better than explode, as flatMap requires much less data shuffling.
If you are processing big data (>5 GB), the performance difference is clearly visible.

PySpark SQL: consolidating .withColumn calls

I have an RDD that I've converted into a Spark SQL DataFrame. I want to do a number of transformations of columns with UDFs, which ends up looking something like this:
df = df.withColumn("col1", udf1(df.col1))\
.withColumn("col2", udf2(df.col2))\
...
...
.withColumn("newcol", udf(df.oldcol1, df.oldcol2))\
.drop(df.oldcol1).drop(df.oldcol2)\
...
etc.
Is there is a more concise way to express this (both the repeated withColumn and drop calls)?
You can collect all the operations into a list of expressions:
from pyspark.sql.functions import col

exprs = [udf1(col("col1")).alias("col1"),
         udf2(col("col2")).alias("col2"),
         ...
         udfn(col("coln")).alias("coln")]
And then unpack them inside a select:
df = df.select(*exprs)
Taking this approach, you execute the UDFs over your df and rename the resulting columns in a single select. Note that my answer is almost exactly like this one; however, the question was quite different from mine, which is why I decided to answer it rather than flag it as a duplicate.
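For illustration, a sketch (the udf* names and columns mirror the question and are otherwise hypothetical) showing that a single select also covers the derived column and the drops, since any column not listed is simply omitted:
from pyspark.sql.functions import col

exprs = [
    udf1(col("col1")).alias("col1"),
    udf2(col("col2")).alias("col2"),
    # the derived column replaces oldcol1/oldcol2 ...
    udf(col("oldcol1"), col("oldcol2")).alias("newcol"),
    # ... which are "dropped" simply by not selecting them
]
df = df.select(*exprs)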
