inner join not working in DataFrame using Spark 2.1 - apache-spark

My data set:
The emp DataFrame looks like this:
emp.show()
+---+-----+------+----------+-------------+
| ID| NAME|salary|department| date|
+---+-----+------+----------+-------------+
| 1| sban| 100.0| IT| 2018-01-10|
| 2| abc| 200.0| HR| 2018-01-05|
| 3| Jack| 100.0| SALE| 2018-01-05|
| 4| Ram| 100.0| IT|2018-01-01-06|
| 5|Robin| 200.0| IT| 2018-01-07|
| 6| John| 200.0| SALE| 2018-01-08|
| 7| sban| 300.0| Director| 2018-01-01|
+---+-----+------+----------+-------------+
2 - Then I group by name and take the max salary; say the resulting DataFrame is grpEmpByName:
val grpByName = emp.select(col("name")).groupBy(col("name")).agg(max(col("salary")).alias("max_salary"))
grpByName.select("*").show()
+-----+----------+
| name|max_salary|
+-----+----------+
| Jack| 100.0|
|Robin| 200.0|
| Ram| 100.0|
| John| 200.0|
| abc| 200.0|
| sban| 300.0|
+-----+----------+
3 - Then I try to join:
val joinedBySalarywithMaxSal = emp.join(grpEmpByName, col("emp.salary") === col("grpEmpByName.max_salary") , "inner")
It throws:
18/02/08 21:29:26 INFO CodeGenerator: Code generated in 13.667672 ms
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`grpByName.max_salary`' given input columns: [NAME, department, date, ID, salary, max_salary, NAME];;
'Join Inner, (salary#2 = 'grpByName.max_salary)
:- Project [ID#0, NAME#1, salary#2, department#3, date#4]
: +- MetastoreRelation default, emp
+- Aggregate [NAME#44], [NAME#44, max(salary#45) AS max_salary#25]
+- Project [salary#45, NAME#44]
+- Project [ID#43, NAME#44, salary#45, department#46, date#47]
+- MetastoreRelation default, emp
I am not getting why it's not working, since when I check
grpByName.select(col("max_salary")).show()
+----------+
|max_salary|
+----------+
| 100.0|
| 200.0|
| 100.0|
| 200.0|
| 200.0|
| 300.0|
+----------+
Thanks in advance.

The dot notation is used to refer to nested structures inside a table, not to refer to the table itself.
Call the col method defined on the DataFrame instead, like this:
emp.join(grpEmpByName, emp.col("salary") === grpEmpByName.col("max_salary"), "inner")
You can see an example here.
Furthermore, note that joins are inner by default, so you should just be able to write the following:
emp.join(grpEmpByName, emp.col("salary") === grpEmpByName.col("max_salary"))
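For reference, the intent of the join (keep each employee row whose salary equals a group's maximum) can be sketched in plain Scala collections, outside Spark; the data here is illustrative:

```scala
// Illustrative records: (name, salary)
val emp = List(("sban", 100.0), ("sban", 300.0), ("Jack", 100.0), ("Robin", 200.0))

// Step 1: max salary per name (analogue of groupBy("name").agg(max("salary")))
val maxByName: Map[String, Double] =
  emp.groupBy(_._1).map { case (name, rows) => name -> rows.map(_._2).max }

// Step 2: inner-join back, keeping rows that hit their own group's max
val joined = emp.filter { case (name, salary) => maxByName(name) == salary }
```

Note that the original join condition compares only salaries, so a row can also match another employee's maximum; joining on both name and salary (as the sketch does) avoids that.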

I am not sure; hope this can help:
val joinedBySalarywithMaxSal = emp.join(grpEmpByName, emp.col("salary") === grpEmpByName.col("max_salary") , "inner")

Related

Equality Filter in Spark Structured Streaming [duplicate]

I want to select rows where a column equals a certain value. I am doing this in Scala and having a little trouble.
Here's my code:
df.select(df("state")==="TX").show()
This returns the state column with boolean values instead of just TX.
I've also tried
df.select(df("state")=="TX").show()
but this doesn't work either.
I had the same issue, and the following syntax worked for me:
df.filter(df("state")==="TX").show()
I'm using Spark 1.6.
There is another simple, SQL-like option. With Spark 1.6 the following should also work:
df.filter("state = 'TX'")
This is a new way of specifying SQL-like filters. For a full list of supported operators, check out this class.
You should be using where; select is a projection that returns the output of the statement, which is why you get boolean values. where is a filter that keeps the structure of the DataFrame but only keeps rows where the filter condition holds.
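The select-versus-where distinction has a close analogue in plain Scala collections: mapping a predicate over a list yields booleans, while filtering with it keeps the matching elements.

```scala
val states = List("TX", "CA", "TX", "NY")

// Analogue of select(df("state") === "TX"): a projection of the predicate
val projected = states.map(_ == "TX")

// Analogue of filter(df("state") === "TX"): rows where the predicate holds
val kept = states.filter(_ == "TX")
```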
Along the same line though, per the documentation, you can write this in 3 different ways
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
peopleDf($"age" > 15)
To get the negation, do this ...
df.filter(not( ..expression.. ))
eg
df.filter(not($"state" === "TX"))
df.filter($"state" like "T%") for pattern matching
df.filter($"state" === "TX") or df.filter("state = 'TX'") for equality
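SQL LIKE patterns such as "T%" correspond to anchored regular expressions ('%' matches any run of characters, '_' a single character). A small plain-Scala sketch of that translation (the likeToRegex helper is illustrative, not a Spark API):

```scala
// Convert a SQL LIKE pattern to an anchored regex (sketch: handles % and _ only)
def likeToRegex(pattern: String): String =
  "^" + pattern.flatMap {
    case '%' => ".*"
    case '_' => "."
    case c   => java.util.regex.Pattern.quote(c.toString)
  } + "$"

val matchesTx = "TX".matches(likeToRegex("T%"))   // "TX" starts with T
val matchesCa = "CA".matches(likeToRegex("T%"))   // "CA" does not
```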
Worked on Spark V2.*
import sqlContext.implicits._
df.filter($"state" === "TX")
if it needs to be compared against a variable (e.g., myVar):
import sqlContext.implicits._
df.filter($"state" === myVar)
Note: import sqlContext.implicits._ is required for the $ syntax.
We can chain multiple filter/where conditions on a DataFrame.
For example:
table1_df
.filter($"Col_1_name" === "buddy") // check for equal to string
.filter($"Col_2_name" === "A")
.filter(not($"Col_2_name".contains(" .sql"))) // filter out strings that are not relevant
.filter("Col_2_name is not null") // no null filter
.take(5).foreach(println)
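Chained filter calls like the ones above compose as a logical AND; the same holds for plain Scala collections:

```scala
val rows = List(("buddy", "A"), ("buddy", "B"), ("other", "A"))

// Chained filters...
val chained = rows.filter(_._1 == "buddy").filter(_._2 == "A")

// ...are equivalent to one filter with a conjunction
val combined = rows.filter(r => r._1 == "buddy" && r._2 == "A")
```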
Here is a complete example using Spark 2.2+, taking the data in as JSON...
val myjson = "[{\"name\":\"Alabama\",\"abbreviation\":\"AL\"},{\"name\":\"Alaska\",\"abbreviation\":\"AK\"},{\"name\":\"American Samoa\",\"abbreviation\":\"AS\"},{\"name\":\"Arizona\",\"abbreviation\":\"AZ\"},{\"name\":\"Arkansas\",\"abbreviation\":\"AR\"},{\"name\":\"California\",\"abbreviation\":\"CA\"},{\"name\":\"Colorado\",\"abbreviation\":\"CO\"},{\"name\":\"Connecticut\",\"abbreviation\":\"CT\"},{\"name\":\"Delaware\",\"abbreviation\":\"DE\"},{\"name\":\"District Of Columbia\",\"abbreviation\":\"DC\"},{\"name\":\"Federated States Of Micronesia\",\"abbreviation\":\"FM\"},{\"name\":\"Florida\",\"abbreviation\":\"FL\"},{\"name\":\"Georgia\",\"abbreviation\":\"GA\"},{\"name\":\"Guam\",\"abbreviation\":\"GU\"},{\"name\":\"Hawaii\",\"abbreviation\":\"HI\"},{\"name\":\"Idaho\",\"abbreviation\":\"ID\"},{\"name\":\"Illinois\",\"abbreviation\":\"IL\"},{\"name\":\"Indiana\",\"abbreviation\":\"IN\"},{\"name\":\"Iowa\",\"abbreviation\":\"IA\"},{\"name\":\"Kansas\",\"abbreviation\":\"KS\"},{\"name\":\"Kentucky\",\"abbreviation\":\"KY\"},{\"name\":\"Louisiana\",\"abbreviation\":\"LA\"},{\"name\":\"Maine\",\"abbreviation\":\"ME\"},{\"name\":\"Marshall Islands\",\"abbreviation\":\"MH\"},{\"name\":\"Maryland\",\"abbreviation\":\"MD\"},{\"name\":\"Massachusetts\",\"abbreviation\":\"MA\"},{\"name\":\"Michigan\",\"abbreviation\":\"MI\"},{\"name\":\"Minnesota\",\"abbreviation\":\"MN\"},{\"name\":\"Mississippi\",\"abbreviation\":\"MS\"},{\"name\":\"Missouri\",\"abbreviation\":\"MO\"},{\"name\":\"Montana\",\"abbreviation\":\"MT\"},{\"name\":\"Nebraska\",\"abbreviation\":\"NE\"},{\"name\":\"Nevada\",\"abbreviation\":\"NV\"},{\"name\":\"New Hampshire\",\"abbreviation\":\"NH\"},{\"name\":\"New Jersey\",\"abbreviation\":\"NJ\"},{\"name\":\"New Mexico\",\"abbreviation\":\"NM\"},{\"name\":\"New York\",\"abbreviation\":\"NY\"},{\"name\":\"North Carolina\",\"abbreviation\":\"NC\"},{\"name\":\"North Dakota\",\"abbreviation\":\"ND\"},{\"name\":\"Northern Mariana Islands\",\"abbreviation\":\"MP\"},{\"name\":\"Ohio\",\"abbreviation\":\"OH\"},{\"name\":\"Oklahoma\",\"abbreviation\":\"OK\"},{\"name\":\"Oregon\",\"abbreviation\":\"OR\"},{\"name\":\"Palau\",\"abbreviation\":\"PW\"},{\"name\":\"Pennsylvania\",\"abbreviation\":\"PA\"},{\"name\":\"Puerto Rico\",\"abbreviation\":\"PR\"},{\"name\":\"Rhode Island\",\"abbreviation\":\"RI\"},{\"name\":\"South Carolina\",\"abbreviation\":\"SC\"},{\"name\":\"South Dakota\",\"abbreviation\":\"SD\"},{\"name\":\"Tennessee\",\"abbreviation\":\"TN\"},{\"name\":\"Texas\",\"abbreviation\":\"TX\"},{\"name\":\"Utah\",\"abbreviation\":\"UT\"},{\"name\":\"Vermont\",\"abbreviation\":\"VT\"},{\"name\":\"Virgin Islands\",\"abbreviation\":\"VI\"},{\"name\":\"Virginia\",\"abbreviation\":\"VA\"},{\"name\":\"Washington\",\"abbreviation\":\"WA\"},{\"name\":\"West Virginia\",\"abbreviation\":\"WV\"},{\"name\":\"Wisconsin\",\"abbreviation\":\"WI\"},{\"name\":\"Wyoming\",\"abbreviation\":\"WY\"}]"
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
scala> df.show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
// equals matching
scala> df.filter(df("abbreviation") === "TX").show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
// or using lit
scala> df.filter(df("abbreviation") === lit("TX")).show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
//not expression
scala> df.filter(not(df("abbreviation") === "TX")).show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
only showing top 20 rows
Let's create a sample dataset and do a deep dive into exactly why OP's code didn't work.
Here's our sample data:
val df = Seq(
("Rockets", 2, "TX"),
("Warriors", 6, "CA"),
("Spurs", 5, "TX"),
("Knicks", 2, "NY")
).toDF("team_name", "num_championships", "state")
We can pretty print our dataset with the show() method:
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Warriors| 6| CA|
| Spurs| 5| TX|
| Knicks| 2| NY|
+---------+-----------------+-----+
Let's examine the results of df.select(df("state")==="TX").show():
+------------+
|(state = TX)|
+------------+
| true|
| false|
| true|
| false|
+------------+
It's easier to understand this result by simply appending a column - df.withColumn("is_state_tx", df("state")==="TX").show():
+---------+-----------------+-----+-----------+
|team_name|num_championships|state|is_state_tx|
+---------+-----------------+-----+-----------+
| Rockets| 2| TX| true|
| Warriors| 6| CA| false|
| Spurs| 5| TX| true|
| Knicks| 2| NY| false|
+---------+-----------------+-----+-----------+
The other code OP tried (df.select(df("state")=="TX").show()) returns this error:
<console>:27: error: overloaded method value select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
df.select(df("state")=="TX").show()
^
The === operator is defined in the Column class. The Column class doesn't define a == operator and that's why this code is erroring out.
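The compile error comes down to return types: Scala's universal == always yields a Boolean, so Spark defines a separate === method on Column that returns a new Column expression instead. A stripped-down plain-Scala model of that design (MiniColumn is a made-up stand-in, not Spark's Column):

```scala
// A made-up stand-in for Spark's Column: === builds an expression, it doesn't evaluate one
case class MiniColumn(expr: String) {
  def ===(other: Any): MiniColumn = MiniColumn(s"($expr = $other)")
}

val state = MiniColumn("state")

val expr: MiniColumn = state === "TX"   // a new expression: (state = TX)
val bool: Boolean    = state == "TX"    // universal equality: false, the types differ
```

select accepts Column arguments, which is why df("state") === "TX" type-checks there while df("state") == "TX" (a Boolean) does not.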
Here's the accepted answer that works:
df.filter(df("state")==="TX").show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
As other posters have mentioned, the === method takes an argument with an Any type, so this isn't the only solution that works. This works too for example:
df.filter(df("state") === lit("TX")).show
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
The Column equalTo method can also be used:
df.filter(df("state").equalTo("TX")).show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
It's worthwhile studying this example in detail. Scala's syntax seems magical at times, especially when methods are invoked without dot notation. It's hard for the untrained eye to see that === is a method defined on the Column class!
In Spark 2.4
To compare with one value:
df.filter(lower(trim($"col_name")) === "<value>").show()
To compare with a collection of values:
df.filter($"col_name".isInCollection(Seq("value1", "value2"))).show()

Spark SQL : Why am I seeing 3 jobs instead of one single job in the Spark UI?

As per my understanding, there should be one job for each action in Spark.
But often I see more than one job triggered for a single action.
I was trying to test this by doing a simple aggregation on a dataset to get the maximum score in each category (here the "subject" field).
While examining the Spark UI, I can see there are 3 "jobs" executed for the groupBy operation, while I was expecting just one.
Can anyone help me understand why there are 3 instead of just 1?
students.show(5)
+----------+--------------+----------+----+-------+-----+-----+
|student_id|exam_center_id| subject|year|quarter|score|grade|
+----------+--------------+----------+----+-------+-----+-----+
| 1| 1| Math|2005| 1| 41| D|
| 1| 1| Spanish|2005| 1| 51| C|
| 1| 1| German|2005| 1| 39| D|
| 1| 1| Physics|2005| 1| 35| D|
| 1| 1| Biology|2005| 1| 53| C|
| 1| 1|Philosophy|2005| 1| 73| B|
// Task : Find Highest Score in each subject
val highestScores = students.groupBy("subject").max("score")
highestScores.show(10)
+----------+----------+
| subject|max(score)|
+----------+----------+
| Spanish| 98|
|Modern Art| 98|
| French| 98|
| Physics| 98|
| Geography| 98|
| History| 98|
| English| 98|
| Classics| 98|
| Math| 98|
|Philosophy| 98|
+----------+----------+
only showing top 10 rows
== Physical Plan ==
*(2) HashAggregate(keys=[subject#12], functions=[max(score#15)])
+- Exchange hashpartitioning(subject#12, 1)
+- *(1) HashAggregate(keys=[subject#12], functions=[partial_max(score#15)])
+- *(1) FileScan csv [subject#12,score#15] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/lab/SparkLab/files/exams/students.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<subject:string,score:int>
I think only #3 does the actual "job" (it executes a plan, which you'll see if you open Details for the query on the SQL tab). The other two are preparatory steps:
#1 queries the NameNode to build an InMemoryFileIndex to read your CSV, and
#2 samples the dataset to execute .groupBy("subject").max("score"), which internally requires a sortByKey (here are more details on that).
I would suggest checking the physical plan:
highestScores.explain()
You might see something like:
*(2) HashAggregate(keys=[subject#9], functions=[max(score#12)], output=[subject#9, max(score)#51])
+- Exchange hashpartitioning(subject#9, 2)
+- *(1) HashAggregate(keys=[subject#9], functions=[partial_max(score#12)], output=[subject#9, max#61])
[Map stage] Stage #1 performs the local (partial) aggregation, after which the shuffle happens using hashpartitioning(subject). Note that the hash partitioner uses the group-by column.
[Reduce stage] Stage #2 merges the output of stage #1 to get the final max(score).
The third job is actually used to print the top 10 records for show(10).
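The two-stage aggregation in the plan (a partial max per partition, then a merge after the exchange) can be mimicked with plain Scala collections; the data here is illustrative:

```scala
// Two input "partitions" of (subject, score) rows
val partitions = List(
  List(("Math", 41), ("Math", 98), ("Spanish", 51)),
  List(("Spanish", 98), ("Math", 77))
)

// Stage 1 (map side): partial max per key within each partition
val partials: List[Map[String, Int]] =
  partitions.map(_.groupBy(_._1).map { case (k, v) => k -> v.map(_._2).max })

// Stage 2 (reduce side, after the exchange): merge partial maxima into the final max
val finalMax: Map[String, Int] =
  partials.flatten.groupBy(_._1).map { case (k, v) => k -> v.map(_._2).max }
```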

Issue with Spark Window over with Group By

I want to compute an aggregate over a window whose grain differs from the select's group by.
Using Scala SQL:
Select c1,c2,c3,max(c4),max(c5),
Max(c4) over (partition by c1,c2,c3),
Avg(c5) over (partition by c1,c2,c3)
From temp_view
Group by c1,c2,c3
I am getting an error saying:
c4 and c5 are not part of the group by; add them to the group by or use first().
As I said in a comment, GROUP BY and PARTITION BY serve a similar purpose in some respects: in both cases, the aggregation works over the listed columns only. The major difference is that GROUP BY reduces the number of records, and the select list may only use the group-by columns (plus aggregates), whereas PARTITION BY does not reduce the number of records: it adds an extra aggregated column, and the select list may use any number of columns.
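The cardinality difference can be sketched with plain Scala collections: a group-by collapses to one row per key, while a window-style aggregate keeps every row and attaches the aggregate to it. Data here is illustrative:

```scala
val rows = List(("ABC", 440993), ("ABC", 413890), ("XYZ", 49140))

// GROUP BY analogue: one output row per key
val grouped: Map[String, Int] =
  rows.groupBy(_._1).map { case (k, v) => k -> v.map(_._2).max }

// PARTITION BY analogue: same row count, aggregate appended to each row
val windowed: List[(String, Int, Int)] =
  rows.map { case (company, salary) => (company, salary, grouped(company)) }
```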
For your issue: you are using columns c1, c2, c3 in the GROUP BY while applying max(c4) and avg(c5) with PARTITION BY, which is why you get the error.
For your use case, you can use either of the queries below:
Select c1,c2,c3,max(c4),max(c5)
From temp_view
Group by c1,c2,c3
OR
Select c1,c2,c3,
Max(c4) over (partition by c1,c2,c3),
Avg(c5) over (partition by c1,c2,c3)
From temp_view
Below is an example that should give you a clear picture:
scala> spark.sql("""SELECT * from table""").show()
+---+----------------+-------+------+
| ID| NAME|COMPANY|SALARY|
+---+----------------+-------+------+
| 1| Gannon Chang| ABC|440993|
| 2| Hashim Morris| XYZ| 49140|
| 3| Samson Le| ABC|413890|
| 4| Brandon Doyle| XYZ|384118|
| 5| Jacob Coffey| BCD|504819|
| 6| Dillon Holder| ABC|734086|
| 7|Salvador Vazquez| NGO|895082|
| 8| Paki Simpson| BCD|305046|
| 9| Laith Stewart| ABC|943750|
| 10| Simon Whitaker| NGO|561896|
| 11| Denton Torres| BCD| 10442|
| 12|Garrison Sellers| ABC| 53024|
| 13| Theodore Bolton| TTT|881521|
| 14| Kamal Roberts| TTT|817422|
+---+----------------+-------+------+
//You can only select columns that are in the group by
scala> spark.sql("""SELECT COMPANY, max(SALARY) from table group by COMPANY""").show()
+-------+-----------+
|COMPANY|max(SALARY)|
+-------+-----------+
| NGO| 895082|
| BCD| 504819|
| XYZ| 384118|
| TTT| 881521|
| ABC| 943750|
+-------+-----------+
//It will give an error if you select all columns, or any column not in the group by
scala> spark.sql("""SELECT *, max(SALARY) from table group by COMPANY""").show()
org.apache.spark.sql.AnalysisException: expression 'table.`ID`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [COMPANY#94], [ID#92, NAME#93, COMPANY#94, SALARY#95L, max(SALARY#95L) AS max(SALARY)#213L]
+- SubqueryAlias table
+- Relation[ID#92,NAME#93,COMPANY#94,SALARY#95L] parquet
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:187)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:220)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:220)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:220)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
... 49 elided
//But you can select all columns with partition by
scala> spark.sql("""SELECT *, Max(SALARY) over (PARTITION BY COMPANY) as Max_Salary from table""").show()
+---+----------------+-------+------+----------+
| ID| NAME|COMPANY|SALARY|Max_Salary|
+---+----------------+-------+------+----------+
| 7|Salvador Vazquez| NGO|895082| 895082|
| 10| Simon Whitaker| NGO|561896| 895082|
| 5| Jacob Coffey| BCD|504819| 504819|
| 8| Paki Simpson| BCD|305046| 504819|
| 11| Denton Torres| BCD| 10442| 504819|
| 2| Hashim Morris| XYZ| 49140| 384118|
| 4| Brandon Doyle| XYZ|384118| 384118|
| 13| Theodore Bolton| TTT|881521| 881521|
| 14| Kamal Roberts| TTT|817422| 881521|
| 1| Gannon Chang| ABC|440993| 943750|
| 3| Samson Le| ABC|413890| 943750|
| 6| Dillon Holder| ABC|734086| 943750|
| 9| Laith Stewart| ABC|943750| 943750|
| 12|Garrison Sellers| ABC| 53024| 943750|
+---+----------------+-------+------+----------+

Compare two dataset and get what fields are changed

I am working on Spark using Java, where I download data from an API and compare it with MongoDB data; the downloaded JSON has 15-20 fields but the database has 300 fields.
Now my task is to compare the downloaded JSONs with the MongoDB data, and get whichever fields changed relative to the past data.
Sample data set
Downloaded data from API
StudentId,Name,Phone,Email
1,tony,123,a#g.com
2,stark,456,b#g.com
3,spidy,789,c#g.com
Mongodb data
StudentId,Name,Phone,Email,State,City
1,tony,1234,a#g.com,NY,Nowhere
2,stark,456,bg#g.com,NY,Nowhere
3,spidy,789,c#g.com,OH,Nowhere
I can't use except, because of the difference in column counts.
Expected output
StudentId,Name,Phone,Email,Past_Phone,Past_Email
1,tony,1234,a#g.com,1234, //phone number only changed
2,stark,456,b#g.com,,bg#g.com //Email only changed
3,spidy,789,c#g.com,,
Consider your data to be in 2 DataFrames. We can create temporary views for them, as shown below:
api_df.createOrReplaceTempView("api_data")
mongo_df.createOrReplaceTempView("mongo_data")
Next we can use Spark SQL. Here, we join these views on the StudentId column and then use a case statement on top of them to compute the past phone number and email.
spark.sql("""
select a.*
, case when a.Phone = b.Phone then '' else b.Phone end as Past_phone
, case when a.Email = b.Email then '' else b.Email end as Past_Email
from api_data a
join mongo_data b
on a.StudentId = b.StudentId
order by a.StudentId""").show()
Output:
+---------+-----+-----+-------+----------+----------+
|StudentId| Name|Phone| Email|Past_phone|Past_Email|
+---------+-----+-----+-------+----------+----------+
| 1| tony| 123|a#g.com| 1234| |
| 2|stark| 456|b#g.com| | bg#g.com|
| 3|spidy| 789|c#g.com| | |
+---------+-----+-----+-------+----------+----------+
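The case-when comparison generalizes to any number of shared columns. A plain-Scala sketch of the per-field diff, representing each record as a map from column name to value (data here is illustrative):

```scala
val apiRow   = Map("Phone" -> "123", "Email" -> "a#g.com")
val mongoRow = Map("Phone" -> "1234", "Email" -> "a#g.com", "State" -> "NY")

// For each column present in both records, keep the past value only if it changed
val changed: Map[String, String] =
  apiRow.collect {
    case (col, newValue) if mongoRow.get(col).exists(_ != newValue) =>
      col -> mongoRow(col)
  }
```

Columns that exist only in the database (like State here) are simply ignored, which matches the 15-20 vs 300 field situation in the question.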
Please find the sample source code below. Here I am taking only the phone number condition as an example.
val list = List((1,"tony",123,"a#g.com"), (2,"stark",456,"b#g.com"),
(3,"spidy",789,"c#g.com"))
val df1 = list.toDF("StudentId","Name","Phone","Email")
.select('StudentId as "StudentId_1", 'Name as "Name_1",'Phone as "Phone_1",
'Email as "Email_1")
df1.show()
val list1 = List((1,"tony",1234,"a#g.com","NY","Nowhere"),
(2,"stark",456,"bg#g.com", "NY", "Nowhere"),
(3,"spidy",789,"c#g.com","OH","Nowhere"))
val df2 = list1.toDF("StudentId","Name","Phone","Email","State","City")
.select('StudentId as "StudentId_2", 'Name as "Name_2", 'Phone as "Phone_2",
'Email as "Email_2", 'State as "State_2", 'City as "City_2")
df2.show()
val df3 = df1.join(df2, df1("StudentId_1") ===
df2("StudentId_2")).where(df1("Phone_1") =!= df2("Phone_2"))
df3.withColumnRenamed("Phone_1", "Past_Phone").show()
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
|StudentId_1|Name_1|Past_Phone|Email_1|StudentId_2|Name_2|Phone_2|Email_2|State_2| City_2|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
| 1| tony| 123|a#g.com| 1| tony| 1234|a#g.com| NY|Nowhere|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
We have :
df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
df2.show
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
After the join:
val jn = df2.join(df1, df1("StudentId_1") === df2("StudentId_2"))
Then:
val ans = jn.withColumn("Past_Phone", when(jn("Phone_2").notEqual(jn("Phone_1")), jn("Phone_1")).otherwise("")).withColumn("Past_Email", when(jn("Email_2").notEqual(jn("Email_1")), jn("Email_1")).otherwise(""))
Reference : Spark: Add column to dataframe conditionally
Next :
ans.select(ans("StudentId_2") as "StudentId",ans("Name_2") as "Name",ans("Phone_2") as "Phone",ans("Email_2") as "Email",ans("Past_Email"),ans("Past_Phone")).show
+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone| Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
| 1| tony| 1234| a#g.com| | 123|
| 2|stark| 456|bg#g.com| b#g.com| |
| 3|spidy| 789| c#g.com| | |
+---------+-----+-----+--------+----------+----------+

Combine multiple datasets to single dataset without using unionAll function in Apache Spark sql

I have my datasets as follows.
Dataset 1:
+----------+------------------+---------+-----+------+
|      Time|           address|     Date|value|sample|
+----------+------------------+---------+-----+------+
|8:00:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2014|    1|     0|
|8:31:27 AM|AAbbbbbbbbbbbbbbbb|12/9/2014|    1|     0|
+----------+------------------+---------+-----+------+
Dataset 2:
+----------+------------------+---------+------+-----+
|      Time|          Location|     Date|sample|value|
+----------+------------------+---------+------+-----+
|8:45:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016|     5|    0|
|9:15:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016|     5|    0|
+----------+------------------+---------+------+-----+
I am using the following unionAll() call to combine ds1 and ds2:
Dataset<Row> joined = dataset1.unionAll(dataset2).distinct();
Is there a better way to combine ds1 and ds2, since the unionAll() function is deprecated in Spark 2.x?
You can use union() to combine the two DataFrames/Datasets:
df1.union(df2)
Output:
+----------+------------------+---------+-----+------+
| Time| address| Date|value|sample|
+----------+------------------+---------+-----+------+
|8:00:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2014| 1| 0|
|8:31:27 AM|AAbbbbbbbbbbbbbbbb|12/9/2014| 1| 0|
|8:45:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016| 5| 0|
|9:15:00 AM|AAbbbbbbbbbbbbbbbb|12/9/2016| 5| 0|
+----------+------------------+---------+-----+------+
Note that union() itself does not remove duplicate rows; chain .distinct() after it, as in your original code, if you need de-duplication.
Hope this helps!
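One caveat worth noting: union() (like unionAll()) matches columns by position, and the two datasets above list value and sample in different orders, so a positional union would silently swap them; Spark 2.3+ adds unionByName for this case. The hazard in plain Scala terms, with rows as tuples in each dataset's own column order:

```scala
// ds1 columns: (value, sample); ds2 columns: (sample, value)
val ds1 = List((1, 0))          // value = 1, sample = 0
val ds2 = List((5, 0))          // sample = 5, value = 0

// Positional union: ds2's sample lands in ds1's value slot
val positional = ds1 ++ ds2

// By-name union: reorder ds2's fields to ds1's layout first
val byName = ds1 ++ ds2.map { case (sample, value) => (value, sample) }
```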
