How to pass a DataFrame to the isin operator in a Spark DataFrame - apache-spark

I want to pass a DataFrame which holds a set of values to a new query, but it fails.
1) Here I am selecting a particular column so that I can pass it to isin in the next query:
scala> val managerIdDf=finalEmployeesDf.filter($"manager_id"!==0).select($"manager_id").distinct
managerIdDf: org.apache.spark.sql.DataFrame = [manager_id: bigint]
2) My sample data:
scala> managerIdDf.show
+----------+
|manager_id|
+----------+
|     67832|
|     65646|
|      5646|
|     67858|
|     69062|
|     68319|
|     66928|
+----------+
3) When I execute the final query, it fails:
scala> finalEmployeesDf.filter($"emp_id".isin(managerIdDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.DataFrame [manager_id: bigint]
I also tried converting to a List and a Seq, but those generate errors as well. For example, when I convert to a Seq and re-run the query, it throws:
scala> val seqDf=managerIdDf.collect.toSeq
seqDf: Seq[org.apache.spark.sql.Row] = WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
scala> finalEmployeesDf.filter($"emp_id".isin(seqDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class scala.collection.mutable.WrappedArray$ofRef WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
I also referred to this post, but in vain. I am trying this type of query to solve subqueries on Spark DataFrames. Can anyone help?

An alternative approach uses DataFrames, temp views, and free-format Spark SQL. Don't worry about the logic; it's just a convention and an alternative to your initial approach that should equally suffice:
val df2 = Seq(
  ("Peter", "Doe", Seq(("New York", "A000000"), ("Warsaw", null))),
  ("Bob", "Smith", Seq(("Berlin", null))),
  ("John", "Jones", Seq(("Paris", null)))
).toDF("firstname", "lastname", "cities")
df2.createOrReplaceTempView("persons")
val res = spark.sql("""select *
                       from persons
                       where firstname
                       not in (select firstname
                               from persons
                               where lastname <> 'Doe')""")
res.show
or
val list = List("Bob", "Daisy", "Peter")
val res2 = spark.sql("select firstname, lastname from persons")
.filter($"firstname".isin(list:_*))
res2.show
or
val query = s"select * from persons where firstname in (${list.map ( x => "'" + x + "'").mkString(",") })"
val res3 = spark.sql(query)
res3.show
or
df2.filter($"firstname".isin(list: _*)).show
or
val list2 = df2.select($"firstname").rdd.map(r => r(0).asInstanceOf[String]).collect.toList
df2.filter($"firstname".isin(list2: _*)).show
In your case specifically:
val seqDf = managerIdDf.rdd.map(r => r(0).asInstanceOf[Long]).collect.toList
finalEmployeesDf.filter($"emp_id".isin(seqDf: _*)).select("*").show

You cannot pass a DataFrame to isin. isin requires a set of values that it will filter against.
If you want an example, you can check my answer here
As per the question update, you can make the following change:
.isin(seqDf)
to
.isin(seqDf: _*)
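For completeness: the seqDf collected in the question is a Seq of Rows, so the values also need to be extracted (as the first answer does) before expanding them with : _*. A minimal sketch, assuming manager_id is a bigint as shown:
// Collect the distinct manager ids as plain Longs rather than Rows
val managerIds = managerIdDf.collect.map(_.getLong(0)).toSeq
// Expand the Seq into varargs with : _* so isin receives individual values
finalEmployeesDf.filter($"emp_id".isin(managerIds: _*)).show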

Related

How to JOIN 3 RDD's using Spark Scala

I want to join 3 tables using Spark RDDs. I achieved my objective using Spark SQL, but when I tried to join them using RDDs I did not get the desired results. Below is my query using Spark SQL and its output:
scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id").
filter(col("df3.mov_title")==="Annie Hall").select($"df1.act_fname",$"df1.act_lname",$"df2.role").show(false)
+---------+---------+-----------+
|act_fname|act_lname|role       |
+---------+---------+-----------+
|Woody    |Allen    |Alvy Singer|
+---------+---------+-----------+
Now I have created the paired RDDs for the three datasets, as below:
scala> val actPairedRdd=actRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3))))
scala> actPairedRdd.take(5).foreach(println)
(101,(James,Stewart,M))
(102,(Deborah,Kerr,F))
(103,(Peter,OToole,M))
(104,(Robert,De Niro,M))
(105,(F. Murray,Abraham,M))
scala> val movieCastPairedRdd=movieCastRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2))))
movieCastPairedRdd: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[318] at map at <console>:29
scala> movieCastPairedRdd.foreach(println)
(101,(901,John Scottie Ferguson))
(102,(902,Miss Giddens))
(103,(903,T.E. Lawrence))
(104,(904,Michael))
(105,(905,Antonio Salieri))
(106,(906,Rick Deckard))
scala> val moviePairedRdd=movieRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3),p(4),p(5),p(6))))
moviePairedRdd: org.apache.spark.rdd.RDD[(String, (String, String, String, String, String, String))] = MapPartitionsRDD[322] at map at <console>:29
scala> moviePairedRdd.take(2).foreach(println)
(901,(Vertigo,1958,128,English,1958-08-24,UK))
(902,(The Innocents,1961,100,English,1962-02-19,SW))
Here actPairedRdd and movieCastPairedRdd are linked with each other, and movieCastPairedRdd and moviePairedRdd are linked, since they share a common column.
Now when I join all three datasets I am not getting any data:
scala> actPairedRdd.join(movieCastPairedRdd).join(moviePairedRdd).take(2).foreach(println)
I am getting blank records. So where am I going wrong? Thanks in advance.
JOINs like this with RDDs are painful; that's another reason why DataFrames are nicer.
You get no data because a pair RDD joins on the K of (K, V), and the keys of the last RDD have nothing in common with the keys produced by the first join. The keys 101, 102, ... will join, but there is no commonality with 901, 902, ... You need to shift the keys around, as in this more limited example:
val rdd1 = sc.parallelize(Seq(
  (101, ("James", "Stewart", "M")),
  (102, ("Deborah", "Kerr", "F")),
  (103, ("Peter", "OToole", "M")),
  (104, ("Robert", "De Niro", "M"))
))
val rdd2 = sc.parallelize(Seq(
  (101, (901, "John Scottie Ferguson")),
  (102, (902, "Miss Giddens")),
  (103, (903, "T.E. Lawrence")),
  (104, (904, "Michael"))
))
val rdd3 = sc.parallelize(Seq(
  (901, ("Vertigo", 1958)),
  (902, ("The Innocents", 1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
returns:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
You will need to strip out the data you want via a map; I leave that to you. Joins are INNER by default.
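For illustration, one possible strip-out map over rdd5 that keeps only the names, the role, and the movie title; a sketch following the nested tuple structure shown above, not part of the original answer:
// Flatten the nested join result to (act_fname, act_lname, role, mov_title)
val flattened = rdd5.map { case (_, ((title, _), (_, ((fname, lname, _), (_, role))))) =>
  (fname, lname, role, title)
}
flattened.collect.foreach(println)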

Spark-Scala Try Select Statement

I'm trying to incorporate a Try().getOrElse() statement in my select statement for a Spark DataFrame. The project I'm working on is going to be applied to multiple environments. However, each environment is a little different in terms of the naming of the raw data for ONLY one field. I do not want to write several different functions to handle each different field. Is there an elegant way to handle exceptions like the one below in a DataFrame select statement?
import scala.util.Try

val dfFilter = dfRaw
  .select(
    Try($"some.field.nameOption1").getOrElse($"some.field.nameOption2"),
    $"some.field.abc",
    $"some.field.def"
  )
dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data, but I'd expect the getOrElse statement to catch that exception.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?
val selectedColumns = if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1" else $"some.field.nameOption2"
val dfFilter = dfRaw
  .select(selectedColumns, ...)
So I'm revisiting this question after a year. I believe this solution is much more elegant to implement. I'd welcome anyone else's thoughts:
// Generate a fake DataFrame
import scala.util.Try

val df = Seq(
  ("1234", "A", "AAA"),
  ("1134", "B", "BBB"),
  ("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")
// Extract the column names
val columns = df.columns
// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")
// Let's then "try" to select all of the columns
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)
// Let's reduce the DF again...should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id  |name|nameAlt|
+----+----+-------+
|1234|A   |AAA    |
|1134|B   |BBB    |
|2353|C   |CCC    |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id  |name|
+----+----+
|1234|A   |
|1134|B   |
|2353|C   |
+----+----+

Spark aggregate rows with custom function

To make it simple, let's assume we have a dataframe containing the following data:
+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|info1     |info2     |
|firstName1|lastName1|myInfo1   |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2   |
+----------+---------+----------+----------+
How can I merge all rows, grouping by (firstName, lastName), and keep in the Phone and Address columns only the data starting with "my", to get the following:
+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1   |myInfo2   |
+----------+---------+----------+----------+
Maybe I should use the agg function with a custom UDAF? But how can I implement it?
Note: I'm using Spark 2.2 along with Scala 2.11.
You can use groupBy with the collect_set aggregation function and a udf to pick out the first string that starts with "my":
import org.apache.spark.sql.functions._
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)
df.groupBy("firstName ", "lastName")
.agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
.show(false)
which should give you
+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
I hope the answer is helpful
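One caveat, not covered in the answer above: head will throw if a group has no value starting with "my". A hedged variant that returns null in that case could look like this:
// Same approach, but tolerate groups where nothing starts with "my"
def mySafeUdf = udf((array: Seq[String]) => array.find(_.startsWith("my")).orNull)

df.groupBy("firstName", "lastName")
  .agg(mySafeUdf(collect_set("Phone")).as("Phone"), mySafeUdf(collect_set("Address")).as("Address"))
  .show(false)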
If only two columns are involved, filtering and a join can be used instead of a UDF:
val df = List(
  ("firstName1", "lastName1", "info1", "info2"),
  ("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
  ("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))
val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
.select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)
Output:
+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
For many columns, when only one row is expected per group, a construction like this can be used:
val columnsForSearch = List("Phone", "Address")
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)
Output is the same.
UDF with two parameters example:
val twoParamFunc = (firstName: String, Phone: String) => firstName + ": " + Phone
val twoParamUDF = udf(twoParamFunc)
df.select(twoParamUDF($"firstName", $"Phone")).show(false)

Filling null values with the mean of the column in HiveQL and Spark

I am using HiveQL in Spark and would like to fill null values with the mean of the column.
I am using the code below:
StringBuilder query = new StringBuilder("select `ts0` as ts ");
String[] cols = dataFrame.columns();
for (String col : cols) {
    String trimmedCol = col.trim();  // assuming trimmedCol is the trimmed column name
    query.append(",`" + col + "` as " + trimmedCol);
}
I think I should use a "case" expression when there is a null value. Can anyone guide me on how to do the above?
You could try the following:
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("na_test.csv")
scala> df.show()
scala> df.na.fill(10.0,Seq("age"))
scala> df.na.fill(10.0,Seq("age")).show
scala> df.na.replace("age", Map(35 -> 61, 24 -> 12)).show()
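Since the question asks specifically about filling nulls with the column mean, here is a minimal sketch of that step, assuming a numeric age column as in the answer above (not part of the original answer):
import org.apache.spark.sql.functions.mean

// Compute the column mean on the driver, then use it as the fill value
val meanAge = df.select(mean("age")).first.getDouble(0)
df.na.fill(meanAge, Seq("age")).show()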

Issue with DataFrame na() fill methods and ambiguous reference

I'm using Spark 1.3.1 where joining two dataframes repeats the column(s) being
joined. I'm left outer joining two data frames and want to send the
resulting dataframe to the na().fill() method to convert nulls to known
values based on the data type of the column. I've built a map of
"table.column" -> "value" and pass that to the fill method. But I get
exception instead of success :(. What are my options? I see that there is a dataFrame.withColumnRenamed method but I can only rename one column. I have joins that involve more than one column. Do I just have to ensure that there is a unique set of column names, regardless of table aliases in the dataFrame where I apply the na().fill() method?
Given:
scala> val df1 = sqlContext.jsonFile("people.json").as("df1")
df1: org.apache.spark.sql.DataFrame = [first: string, last: string]
scala> val df2 = sqlContext.jsonFile("people.json").as("df2")
df2: org.apache.spark.sql.DataFrame = [first: string, last: string]
I can join them together with
val df3 = df1.join(df2, df1("first") === df2("first"), "left_outer")
And I have a map that converts data type to value.
scala> val map = Map("df1.first"->"unknown", "df1.last" -> "unknown",
"df2.first" -> "unknown", "df2.last" -> "unknown")
But executing fill(map) results in exception.
scala> df3.na.fill(map)
org.apache.spark.sql.AnalysisException: Reference 'first' is ambiguous,
could be: first#6, first#8.;
Here is what I came up with. In my original example, there is nothing interesting left in df2 after the join, so I changed this to be the classical department / employee example.
department.json
{"department": 2, "name":"accounting"}
{"department": 1, "name":"engineering"}
people.json
{"department": 1, "first":"Bruce", "last": "szalwinski"}
And now I can join the dataframes, build the map, and replace nulls with unknowns.
scala> val df1 = sqlContext.jsonFile("department.json").as("df1")
df1: org.apache.spark.sql.DataFrame = [department: bigint, name: string]
scala> val df2 = sqlContext.jsonFile("people.json").as("df2")
df2: org.apache.spark.sql.DataFrame = [department: bigint, first: string, last: string]
scala> val df3 = df1.join(df2, df1("department") === df2("department"), "left_outer")
df3: org.apache.spark.sql.DataFrame = [department: bigint, name: string, department: bigint, first: string, last: string]
scala> val map = Map("first" -> "unknown", "last" -> "unknown")
map: scala.collection.immutable.Map[String,String] = Map(first -> unknown, last -> unknown)
scala> val df4 = df3.select("df1.department", "df2.first", "df2.last").na.fill(map)
df4: org.apache.spark.sql.DataFrame = [department: bigint, first: string, last: string]
scala> df4.show()
+----------+-------+----------+
|department|  first|      last|
+----------+-------+----------+
|         2|unknown|   unknown|
|         1|  Bruce|szalwinski|
+----------+-------+----------+
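As for the withColumnRenamed option mentioned in the question, several columns can be renamed with a fold before the join so that every name stays unique. A hedged sketch (the df2_ prefix is my own, not from the original answer):
// Prefix every column of df2 so the joined schema has unique names
val df2Renamed = df2.columns.foldLeft(df2)((d, c) => d.withColumnRenamed(c, "df2_" + c))
val joined = df1.join(df2Renamed, df1("department") === df2Renamed("df2_department"), "left_outer")
joined.na.fill(Map("df2_first" -> "unknown", "df2_last" -> "unknown")).show()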
