Issue with Spark Window over with Group By - apache-spark

I want to populate an aggregate over a window that has a different grain than the SELECT's GROUP BY.
Using Spark SQL from Scala.
SELECT c1, c2, c3, max(c4), max(c5),
       max(c4) OVER (PARTITION BY c1, c2, c3),
       avg(c5) OVER (PARTITION BY c1, c2, c3)
FROM temp_view
GROUP BY c1, c2, c3
I am getting an error saying that c4 and c5 are not part of the GROUP BY and that I should either add them or wrap them in first().

As I said in a comment, GROUP BY and PARTITION BY serve a similar purpose in some respects: with GROUP BY, all aggregations work over the grouping columns only, and the same is true of PARTITION BY. The major difference is that GROUP BY reduces the number of records and the SELECT list may only contain the grouping columns and aggregate expressions, whereas PARTITION BY does not reduce the number of records: it adds an extra aggregated column, and the SELECT list can contain any number of columns.
In your query, you use columns c1, c2, c3 in the GROUP BY but apply max(c4) and avg(c5) with PARTITION BY, which is why you get the error.
For your use case, you can use either of the queries below:
SELECT c1, c2, c3, max(c4), max(c5)
FROM temp_view
GROUP BY c1, c2, c3
OR
SELECT c1, c2, c3,
       max(c4) OVER (PARTITION BY c1, c2, c3),
       avg(c5) OVER (PARTITION BY c1, c2, c3)
FROM temp_view
Below is an example that should give you a clear picture.
scala> spark.sql("""SELECT * from table""").show()
+---+----------------+-------+------+
| ID| NAME|COMPANY|SALARY|
+---+----------------+-------+------+
| 1| Gannon Chang| ABC|440993|
| 2| Hashim Morris| XYZ| 49140|
| 3| Samson Le| ABC|413890|
| 4| Brandon Doyle| XYZ|384118|
| 5| Jacob Coffey| BCD|504819|
| 6| Dillon Holder| ABC|734086|
| 7|Salvador Vazquez| NGO|895082|
| 8| Paki Simpson| BCD|305046|
| 9| Laith Stewart| ABC|943750|
| 10| Simon Whitaker| NGO|561896|
| 11| Denton Torres| BCD| 10442|
| 12|Garrison Sellers| ABC| 53024|
| 13| Theodore Bolton| TTT|881521|
| 14| Kamal Roberts| TTT|817422|
+---+----------------+-------+------+
//In the SELECT list you can only use columns that appear in the GROUP BY (plus aggregates)
scala> spark.sql("""SELECT COMPANY, max(SALARY) from table group by COMPANY""").show()
+-------+-----------+
|COMPANY|max(SALARY)|
+-------+-----------+
| NGO| 895082|
| BCD| 504819|
| XYZ| 384118|
| TTT| 881521|
| ABC| 943750|
+-------+-----------+
//It gives an error if you select all columns, or any column that is not in the GROUP BY
scala> spark.sql("""SELECT *, max(SALARY) from table group by COMPANY""").show()
org.apache.spark.sql.AnalysisException: expression 'table.`ID`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [COMPANY#94], [ID#92, NAME#93, COMPANY#94, SALARY#95L, max(SALARY#95L) AS max(SALARY)#213L]
+- SubqueryAlias table
+- Relation[ID#92,NAME#93,COMPANY#94,SALARY#95L] parquet
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:187)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:220)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:220)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:220)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
... 49 elided
//But you can select all columns with partition by
scala> spark.sql("""SELECT *, Max(SALARY) over (PARTITION BY COMPANY) as Max_Salary from table""").show()
+---+----------------+-------+------+----------+
| ID| NAME|COMPANY|SALARY|Max_Salary|
+---+----------------+-------+------+----------+
| 7|Salvador Vazquez| NGO|895082| 895082|
| 10| Simon Whitaker| NGO|561896| 895082|
| 5| Jacob Coffey| BCD|504819| 504819|
| 8| Paki Simpson| BCD|305046| 504819|
| 11| Denton Torres| BCD| 10442| 504819|
| 2| Hashim Morris| XYZ| 49140| 384118|
| 4| Brandon Doyle| XYZ|384118| 384118|
| 13| Theodore Bolton| TTT|881521| 881521|
| 14| Kamal Roberts| TTT|817422| 881521|
| 1| Gannon Chang| ABC|440993| 943750|
| 3| Samson Le| ABC|413890| 943750|
| 6| Dillon Holder| ABC|734086| 943750|
| 9| Laith Stewart| ABC|943750| 943750|
| 12|Garrison Sellers| ABC| 53024| 943750|
+---+----------------+-------+------+----------+
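If you really do need both grains in one statement, note that window functions are evaluated after the GROUP BY, so the window has to be built on top of the grouped aggregates. Below is a sketch against the same temp_view; the coarser c1, c2 grain and the alias names are placeholders, and avg(max(c5)) averages the per-group maxima rather than the raw c5 values.
// Sketch only: the inner max()/avg() arguments are GROUP BY aggregates,
// and the OVER clause re-aggregates them at the coarser c1, c2 grain.
spark.sql("""
  SELECT c1, c2, c3, max(c4) AS max_c4, max(c5) AS max_c5,
         max(max(c4)) OVER (PARTITION BY c1, c2) AS max_c4_by_c1_c2,
         avg(max(c5)) OVER (PARTITION BY c1, c2) AS avg_c5_by_c1_c2
  FROM temp_view
  GROUP BY c1, c2, c3
""").show()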

Related

Calculate Spark column value depending on another row value on the same column

I'm working on Apache Spark 2.3.0 (cloudera4) and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1| d2 |
+---+---+----+
| 1| | 2.0|
| 2| |-4.0|
| 3| | 6.0|
| 4|3.0| |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1| d2 | r |
+---+---+----+----+
| 1| | 2.0| 7.0|
| 2| |-4.0| 5.0|
| 3| | 6.0| 9.0|
| 4|3.0| | 3.0|
+---+---+----+----+
In other words, thinking of it iteratively: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? I need a value computed from one row in order to calculate the value for another row.
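For reference, the input DataFrame used in the answer below can be reproduced like this in spark-shell (a sketch, treating the blanks in the question as nulls):
// In spark-shell; elsewhere add: import spark.implicits._
val df = Seq[(Int, Option[Double], Option[Double])](
  (1, None, Some(2.0)),
  (2, None, Some(-4.0)),
  (3, None, Some(6.0)),
  (4, Some(3.0), None)
).toDF("id", "d1", "d2")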
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of a column. Other than that, I'm creating a couple of helper columns to get the max id and to get the ordering right.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, sum, when}
// outside spark-shell, also: import spark.implicits._ for the $ syntax

val result = df
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show
+---+----+----+---+
| id| d1| d2| r|
+---+----+----+---+
| 1|null| 2.0|7.0|
| 2|null|-4.0|5.0|
| 3|null| 6.0|9.0|
| 4| 3.0|null|3.0|
+---+----+----+---+

how to combine rows in a data frame by id

I have a data frame:
+---------+---------------------+
| id| Name|
+---------+---------------------+
| 1| 'Gary'|
| 1| 'Danny'|
| 2| 'Christopher'|
| 2| 'Kevin'|
+---------+---------------------+
I need to collect all the Name values for each id. Please tell me how to get from the above to:
+---------+------------------------+
| id| Name|
+---------+------------------------+
| 1| ['Gary', 'Danny']|
| 2| ['Kevin','Christopher']|
+---------+------------------------+
You can use groupBy together with the collect functions. Depending on what you need, collect into a list or a set:
df.groupBy(col("id")).agg(collect_list(col("Name")))
if you want to keep duplicate values, or
df.groupBy(col("id")).agg(collect_set(col("Name")))
if you want unique values only.
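Putting that together, a self-contained sketch in spark-shell (the sample data mirrors the question):
// In spark-shell (spark.implicits._ is already in scope there)
import org.apache.spark.sql.functions.{col, collect_list, collect_set}

val names = Seq((1, "Gary"), (1, "Danny"), (2, "Christopher"), (2, "Kevin")).toDF("id", "Name")

names.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(false)  // keeps duplicates
names.groupBy(col("id")).agg(collect_set(col("Name")).alias("Name")).show(false)   // unique values only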
Use groupBy and collect_list functions for this case.
from pyspark.sql.functions import *
df.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(10,False)
#+---+------------------------+
#|id |Name |
#+---+------------------------+
#|1 |['Gary', 'Danny'] |
#|2 |['Kevin', 'Christopher']|
#+---+------------------------+
If you are working with a pandas DataFrame instead, the equivalent is:
df.groupby('id')['Name'].apply(list)

Crossjoin between two dataframes that is dependent on a common column

A crossJoin can be done as follows:
import pandas as pd
import numpy as np  # used in the per-user example further down
from datetime import timedelta
date_today = pd.Timestamp.today().normalize()  # not shown in the question; any reference date works
df1 = pd.DataFrame({'subgroup': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'dates': pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes, each containing 4 rows; in the end, I get 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
df1 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'subgroup':['A','B','C','D','A','B','D','E']})
df2 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'dates':np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),np.array(pd.date_range(date_today+timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user crossJoin should be a dataframe with 32 rows. Is this possible in pyspark and how can this be done?
A cross join is a join that multiplies rows, because the join key does not identify rows uniquely (in our case the join key is trivial, or there is no join key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

df1 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant (trivial) column to see the equivalence with a cross join:
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])
With Spark 2+ we get an error, because it detects that we are trying to do a cross join (Cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
If your joining key (user here) is not a column that uniquely identifies rows, you will also get a multiplication of rows, but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
Note: Using a self join followed by a filter usually means you should be using window functions instead.

inner join not working in DataFrame using Spark 2.1

My data set:
The emp dataframe looks like this:
emp.show()
+---+-----+------+----------+-------------+
| ID| NAME|salary|department| date|
+---+-----+------+----------+-------------+
| 1| sban| 100.0| IT| 2018-01-10|
| 2| abc| 200.0| HR| 2018-01-05|
| 3| Jack| 100.0| SALE| 2018-01-05|
| 4| Ram| 100.0| IT|2018-01-01-06|
| 5|Robin| 200.0| IT| 2018-01-07|
| 6| John| 200.0| SALE| 2018-01-08|
| 7| sban| 300.0| Director| 2018-01-01|
+---+-----+------+----------+-------------+
2 - Then I group by name and take its max salary; say the resulting dataframe is grpEmpByName:
val grpByName = emp.select(col("name")).groupBy(col("name")).agg(max(col("salary")).alias("max_salary"))
grpByName.select("*").show()
+-----+----------+
| name|max_salary|
+-----+----------+
| Jack| 100.0|
|Robin| 200.0|
| Ram| 100.0|
| John| 200.0|
| abc| 200.0|
| sban| 300.0|
+-----+----------+
3 - Then I try to join:
val joinedBySalarywithMaxSal = emp.join(grpEmpByName, col("emp.salary") === col("grpEmpByName.max_salary") , "inner")
It throws:
18/02/08 21:29:26 INFO CodeGenerator: Code generated in 13.667672 ms
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`grpByName.max_salary`' given input columns: [NAME, department, date, ID, salary, max_salary, NAME];;
'Join Inner, (salary#2 = 'grpByName.max_salary)
:- Project [ID#0, NAME#1, salary#2, department#3, date#4]
: +- MetastoreRelation default, emp
+- Aggregate [NAME#44], [NAME#44, max(salary#45) AS max_salary#25]
+- Project [salary#45, NAME#44]
+- Project [ID#43, NAME#44, salary#45, department#46, date#47]
+- MetastoreRelation default, emp
I do not understand why it is not working, because when I check:
grpByName.select(col("max_salary")).show()
+----------+
|max_salary|
+----------+
| 100.0|
| 200.0|
| 100.0|
| 200.0|
| 200.0|
| 300.0|
+----------+
Thanks in advance.
The dot notation is used to refer to nested structures inside a table, not to refer to the table itself.
Call the col method defined on the DataFrame instead, like this:
emp.join(grpEmpByName, emp.col("salary") === grpEmpByName.col("max_salary"), "inner")
You can see an example here.
Furthermore, note that joins are inner by default, so you should just be able to write the following:
emp.join(grpEmpByName, emp.col("salary") === grpEmpByName.col("max_salary"))
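If you prefer qualified, dot-style references like in the question, one option (a sketch, reusing the names from the question) is to alias the DataFrames first so the qualifiers can be resolved:
// Aliases make "e.salary" and "g.max_salary" resolvable, and also help
// disambiguate the NAME column that exists on both sides after the join.
import org.apache.spark.sql.functions.col

val joined = emp.alias("e")
  .join(grpByName.alias("g"), col("e.salary") === col("g.max_salary"), "inner")
joined.select(col("e.NAME"), col("e.salary"), col("g.max_salary")).show()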
I am not sure, but I hope this can help:
val joinedBySalarywithMaxSal = emp.join(grpEmpByName, emp.col("salary") === grpEmpByName.col("max_salary"), "inner")

Combine multiple datasets to single dataset without using unionAll function in Apache Spark sql

I have the following datasets:
Dataset 1:
+----------+------------------+---------+-----+------+
|      Time|           address|     Date|value|sample|
+----------+------------------+---------+-----+------+
|8:00:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2014|    1|     0|
|8:31:27 AM|AAbbbbbbbbbbbbbbbb|12/9/2014|    1|     0|
+----------+------------------+---------+-----+------+
Dataset 2:
+-----------+------------------+---------+------+-----+
|       Time|          Location|     Date|sample|value|
+-----------+------------------+---------+------+-----+
| 8:45:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016|     5|    0|
| 9:15:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016|     5|    0|
+-----------+------------------+---------+------+-----+
I am using the following unionAll() function to combine both ds1 and ds2,
Dataset<Row> joined = dataset1.unionAll(dataset2).distinct();
Is there a better way to combine ds1 and ds2, since the unionAll() function is deprecated in Spark 2.x?
You can use union() to combine the two dataframes/datasets
df1.union(df2)
Output:
+----------+------------------+---------+-----+------+
| Time| address| Date|value|sample|
+----------+------------------+---------+-----+------+
|8:00:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2014| 1| 0|
|8:31:27 AM|AAbbbbbbbbbbbbbbbb|12/9/2014| 1| 0|
|8:45:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016| 5| 0|
|9:15:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016| 5| 0|
+----------+------------------+---------+-----+------+
Note that union() itself keeps duplicate rows; call distinct() afterwards, as in your original code, if you need deduplication.
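One more thing to watch out for with the data in the question: union() matches columns by position, not by name, and the two datasets list sample and value in a different order (and name the second column differently). A sketch in Scala of aligning them first:
// Align dataset2 to dataset1's layout before the union (column names follow the question).
val ds2Aligned = dataset2
  .withColumnRenamed("Location", "address")
  .select("Time", "address", "Date", "value", "sample")

val combined = dataset1
  .select("Time", "address", "Date", "value", "sample")
  .union(ds2Aligned)
  .distinct()  // drop duplicates explicitly if needed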
Hope this helps!
