Using Spark find completeness from multiple datasets - apache-spark

I have a requirement to verify certain data present in multiple datasets (or CSVs) using Spark 2. This is defined as matching two or more keys across all datasets and generating a report of all matching and non-matching keys from all datasets.
For example, there are four datasets and each dataset shares two matching keys with the others, meaning all the datasets need to be matched on the two defined matching keys. Let's say userId,userName are the two matching keys in the datasets below:
Dataset A: userId,userName,age,contactNumber
Dataset B: orderId,orderDetails,userId,userName
Dataset C: departmentId,userId,userName,departmentName
Dataset D: userId,userName,address,pin
Dataset A:
userId,userName,age,contactNumber
1,James,29,1111
2,Ferry,32,2222
3,Lotus,21,3333
Dataset B:
orderId,orderDetails,userId,userName
DF23,Chocholate,1,James
DF45,Gifts,3,Lotus
Dataset C:
departmentId,userId,userName,departmentName
N99,1,James,DE
N100,2,Ferry,AI
Dataset D:
userId,userName,address,pin
1,James,minland street,cvk-dfg
I need to generate a report like the following (or similar):
------------------------------------------
userId,userName,status
------------------------------------------
1,James,MATCH
2,Ferry,MISSING IN B, MISSING IN D
3,Lotus,MISSING IN B, MISSING IN C, MISSING IN D
I have tried joining the datasets as follows:
Dataset A-B:
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH
2,Ferry,32,2222,,,,,Missing IN Left
3,Lotus,21,3333,DF45,Gifts,3,Lotus,MATCH
Dataset C-D:
departmentId,userId,userName,departmentName,userId,userName,address,pin,status
N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH
N100,2,Ferry,AI,,,,,Missing IN Right
Dataset AB-CD:
Joining criteria: userId and userName of A with C, userId and userName of B with D
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status,departmentId,userId,userName,departmentName,userId,userName,address,pin,status,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH,N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH,MATCH
2,Ferry,32,2222,,,,,Missing IN Left,N100,2,Ferry,AI,,,,,Missing IN Right,Missing IN Right
No row comes back for userId 3.
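(For reference, CSV versions of such datasets could be loaded into DataFrames roughly like this; the file paths are hypothetical:)
val dfA = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/datasetA.csv")
val dfB = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/datasetB.csv")
// ...and likewise for datasets C and D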

If data is defined as:
val dfA = Seq((1, "James", 29, 1111), (2, "Ferry", 32, 2222),(3, "Lotus", 21, 3333)).toDF("userId,userName,age,contactNumber".split(","): _*)
val dfB = Seq(("DF23", "Chocholate", 1, "James"), ("DF45", "Gifts", 3, "Lotus")).toDF("orderId,orderDetails,userId,userName".split(","): _*)
val dfC = Seq(("N99", 1, "James", "DE"), ("N100", 2, "Ferry", "AI")).toDF("departmentId,userId,userName,departmentName".split(","): _*)
val dfD = Seq((1, "James", "minland street", "cvk-dfg")).toDF("userId,userName,address,pin".split(","): _*)
Define keys:
import org.apache.spark.sql.functions._
val keys = Seq("userId", "userName")
Combine Datasets:
val dfs = Map("A" -> dfA, "B" -> dfB, "C" -> dfC, "D" -> dfD)
val combined = dfs.map {
case (key, df) => df.withColumn("df", lit(key)).select("df", keys: _*)
}.reduce(_ unionByName _)
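For clarity, combined is just a long table tagging each key pair with the dataset it came from (expected contents, an assumption based on the inputs above; row order may differ):
combined.show()
// +---+------+--------+
// | df|userId|userName|
// +---+------+--------+
// |  A|     1|   James|
// |  A|     2|   Ferry|
// |  A|     3|   Lotus|
// |  B|     1|   James|
// |  B|     3|   Lotus|
// |  C|     1|   James|
// |  C|     2|   Ferry|
// |  D|     1|   James|
// +---+------+--------+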
Pivot and convert the result to booleans:
val pivoted = combined
.groupBy(keys.head, keys.tail: _*)
.pivot("df", dfs.keys.toSeq)
.count()
val result = dfs.keys.foldLeft(pivoted)(
(df, c) => df.withColumn(c, col(c).isNotNull.alias(c))
)
// +------+--------+----+-----+-----+-----+
// |userId|userName| A| B| C| D|
// +------+--------+----+-----+-----+-----+
// | 1| James|true| true| true| true|
// | 3| Lotus|true| true|false|false|
// | 2| Ferry|true|false| true|false|
// +------+--------+----+-----+-----+-----+
Use the resulting boolean matrix to generate the final report.
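For example, the status column from the question can be derived from those flags like this (a sketch, assuming the result and dfs values defined above, with dataset A treated as the reference):
val report = result.withColumn(
  "status",
  when(dfs.keys.toSeq.map(col).reduce(_ && _), lit("MATCH"))
    .otherwise(concat_ws(", ",
      dfs.keys.toSeq.filter(_ != "A").map(c => when(!col(c), lit(s"MISSING IN $c"))): _*))
).select("userId", "userName", "status")

report.show(false)
// Expected (an assumption based on the data above; row order may differ):
// 1,James  -> MATCH
// 2,Ferry  -> MISSING IN B, MISSING IN D
// 3,Lotus  -> MISSING IN B, MISSING IN C, MISSING IN D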
This can become pretty expensive when the datasets grow large. If you know that one dataset contains all possible keys, and you don't require exact results (a Bloom filter can report a key as present when it is not, so some missing keys may go unreported), you can use a Bloom filter.
Here we can use dfA as a reference:
val expectedNumItems = dfA.count
val fpp = 0.00001
val key = struct(keys map col: _*).cast("string").alias("key")
val filters = dfs.filterKeys(_ != "A").mapValues(df => {
val f = df.select(key).stat.bloomFilter("key", expectedNumItems, fpp);
udf((s: String) => f.mightContain(s))
})
filters.foldLeft(dfA.select(keys map col: _*)){
case (df, (c, f)) => df.withColumn(c, f(key))
}.show
// +------+--------+-----+-----+-----+
// |userId|userName| B| C| D|
// +------+--------+-----+-----+-----+
// | 1| James| true| true| true|
// | 2| Ferry|false| true|false|
// | 3| Lotus| true|false|false|
// +------+--------+-----+-----+-----+

Related

Spark DataFrame - how to convert a single column into multiple columns

I need to convert a single column into multiple columns. I did the following:
val list = List("a", "b", "c", "d")
import spark.implicits._
val df = list.toDF("id")
df.show()
import spark.implicits._
val transpose = list.zipWithIndex.map {
case (_, index) => col("data").getItem(index).as(s"col_${index}")
}
df.select(collect_list($"id").as("data")).select(transpose: _*).show()
output:
+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|
+-----+-----+-----+-----+
| a| b| c| d|
+-----+-----+-----+-----+
This converts it, but the problem is that the transpose expressions rely on the original data (the list). If I apply any filter to df, it still shows 4 columns because the original list has 4 elements. How can I shorten this list?
Adding more info
df.filter($"id" =!="a" ).select(collect_list($"id").as("data")).select(transpose: _*).show()\
if apply filter condition and show command
+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|
+-----+-----+-----+-----+
| b| c| d| null|
+-----+-----+-----+-----+
which is wrong; it should show 3 columns, not 4.
You could do it with pivot:
val df = List("a", "b", "c", "d").toDF("id")
val dfFiltered = df.filter($"id"=!="a")
dfFiltered
.groupBy().pivot($"id").agg(first($"id"))
.toDF((0 until dfFiltered.count().toInt).map(i => s"col_$i"):_*)
.show()
+-----+-----+-----+
|col_0|col_1|col_2|
+-----+-----+-----+
| b| c| d|
+-----+-----+-----+
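Note that pivot without explicit values uses the sorted distinct values, so col_0..col_n follow that order. If a specific order matters, the values can be passed explicitly (a sketch; collecting the ids triggers an extra job):
val ids = dfFiltered.orderBy($"id").collect().map(_.getString(0)).toSeq
dfFiltered
  .groupBy().pivot("id", ids).agg(first($"id"))
  .toDF(ids.indices.map(i => s"col_$i"): _*)
  .show()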
I did a trick by trimming the transpose expressions based on the DataFrame's row count.
Let me know if it helps:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TransposeV2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val list = List("a", "b", "c", "d")
    val df = list.toDF("id")
    df.show()

    // one column expression per element of the original list
    val transpose = list.zipWithIndex.map {
      case (_, index) => col("data").getItem(index).as(s"col_$index")
    }
    df.select(collect_list($"id").as("data")).select(transpose: _*).show()

    // keep only as many column expressions as the filtered DataFrame has rows
    val dfInterim = df.filter($"id" =!= "a")
    val finalElements: Int = dfInterim.count().toInt
    dfInterim.select(collect_list($"id").as("data")).select(transpose.take(finalElements): _*).show()
  }
}

Spark Aggregating multiple columns (possible to array) from join output

I have the datasets below:
Table1
Table2
Now I would like to get the dataset below. I've tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into XML. I will be doing this conversion using map.
Any help would be appreciated.
Joining alone is not enough to get the desired output. You are probably missing something, and the last element of each nested array might be the departmentid. Assuming the last element of each nested array is the departmentid, I've generated the output the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list
case class department(id: Integer, deptname: String)
case class employee(employeid:Integer, empname:String, departmentid:Integer)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val department_df = Seq(department(1, "physics")
,department(2, "computer") ).toDF()
val emplyoee_df = Seq(employee(1, "A", 1)
,employee(2, "B", 1)
,employee(3, "C", 2)
,employee(4, "D", 2)).toDF()
val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname").
rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp").
groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: if I break down the last DataFrame transformation into multiple steps, it will probably become clear how the output is generated.
Left outer join between department_df and emplyoee_df:
val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
| 1| physics| 2| B|
| 1| physics| 1| A|
| 2|computer| 4| D|
| 2|computer| 3| C|
+---+--------+---------+-------+
Creating an array from some of the column values of the df1 DataFrame:
val df2 = df1.rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
| 1| physics|[2, B, 1]|
| 1| physics|[1, A, 1]|
| 2|computer|[4, D, 2]|
| 2|computer|[3, C, 2]|
+---+--------+---------+
Create a new list aggregating multiple arrays using the df2 DataFrame:
val result = df2.groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq(
(1,"Physics"),
(2,"Computer"),
(3,"Maths")
)).toDF("ID","Dept")
val schema = List(
StructField("EMPID", IntegerType, true),
StructField("EMPNAME", StringType, true),
StructField("DeptID", IntegerType, true)
)
val data = Seq(
Row(1,"A",1),
Row(2,"B",1),
Row(3,"C",2),
Row(4,"D",2) ,
Row(5,"E",null)
)
val df_emp = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
val newdf = df_emp
  .withColumn("CONC", array($"EMPID", $"EMPNAME", $"DeptID"))
  .groupBy($"DeptID")
  .agg(expr("collect_list(CONC) as emplist"))

df.join(newdf, df.col("ID") === df_emp.col("DeptID"))
  .select($"ID", $"Dept", $"emplist")
  .show()
+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+
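Note that this is an inner join, so department 3 (Maths, which has no employees) and employee 5 (whose DeptID is null) are dropped. If departments without employees should be kept, a left join would do it (a sketch):
df.join(newdf, df.col("ID") === newdf.col("DeptID"), "left")
  .select($"ID", $"Dept", $"emplist")
  .show()
// Expected (an assumption): Maths appears with a null emplist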

spark dynamically create struct/json per group

I have a spark dataframe like
+-----+---+---+---+------+
|group| a| b| c|config|
+-----+---+---+---+------+
| a| 1| 2| 3| [a]|
| b| 2| 3| 4|[a, b]|
+-----+---+---+---+------+
val df = Seq(("a", 1, 2, 3, Seq("a")),("b", 2, 3,4, Seq("a", "b"))).toDF("group", "a", "b","c", "config")
How can I add an additional column i.e.
df.withColumn("select_by_config", <<>>).show
as a struct or JSON which combines a number of columns (specified by config), similar to a Hive named struct / Spark struct / JSON column? Note that this struct is specific per group and not constant for the whole dataframe; it is specified in the config column.
I can imagine that a df.map could do the trick, but the serialization overhead does not seem efficient. How can this be achieved with SQL-only expressions? Maybe as a map-type column?
Edit:
A possible but really clumsy solution for 2.2 is:
val df = Seq((1,"a", 1, 2, 3, Seq("a")),(2, "b", 2, 3,4, Seq("a", "b"))).toDF("id", "group", "a", "b","c", "config")
df.show
import spark.implicits._
final case class Foo(id:Int, c1:Int, specific:Map[String, Int])
df.map(r => {
val config = r.getAs[Seq[String]]("config")
print(config)
val others = config.map(elem => (elem, r.getAs[Int](elem))).toMap
Foo(r.getAs[Int]("id"), r.getAs[Int]("c"), others)
}).show
Are there any better ways to solve the problem for 2.2?
If you use a recent build (Spark 2.4.0 RC 1 or later), a combination of higher-order functions should do the trick. Create a map of columns:
import org.apache.spark.sql.functions.{
array, col, expr, lit, map_from_arrays, map_from_entries
}
val cols = Seq("a", "b", "c")
val dfm = df.withColumn(
"cmap",
map_from_arrays(array(cols map lit: _*), array(cols map col: _*))
)
and transform the config:
dfm.withColumn(
"config_mapped",
map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))"))
).show
// +-----+---+---+---+------+--------------------+----------------+
// |group| a| b| c|config| cmap| config_mapped|
// +-----+---+---+---+------+--------------------+----------------+
// | a| 1| 2| 3| [a]|[a -> 1, b -> 2, ...| [a -> 1]|
// | b| 2| 3| 4|[a, b]|[a -> 2, b -> 3, ...|[a -> 2, b -> 3]|
// +-----+---+---+---+------+--------------------+----------------+
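If a JSON string per row is preferred over a map column, to_json can be applied on top (a sketch, continuing from dfm above):
import org.apache.spark.sql.functions.to_json

dfm.withColumn(
  "select_by_config",
  to_json(map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))")))
).select("group", "select_by_config").show(false)
// Expected (an assumption): {"a":1} for group a and {"a":2,"b":3} for group b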

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply the PySpark SQL functions hash algorithm to every row in two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. if a column contains 'APPLE' and 'Apple', they are considered two different values, so I want to change the case in both dataframes to either upper or lower. I am able to achieve this only for the dataframe headers but not for the dataframe values. Please help.
#Code for Dataframe column headers
self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Assuming df is your dataframe, this should do the job:
from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn(col, F.lower(F.col(col)))
Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val fields = sourceDF.schema.fields
val stringFields = fields.filter(f => f.dataType == StringType)
val nonStringFields = fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))
val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)
Now the types are also correct when you have non-string fields (i.e. numeric fields).
If you know that each column is of String type, use one of the other answers - they are correct in that case :)
Python code in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
You can generate the expressions using a list comprehension:
from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]
And then just apply it to your existing dataframe:
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+

An error about Dataset.filter in Spark SQL

I want to filter the dataset so that it only contains records which can be found in MySQL.
Here is the Dataset:
dataset.show()
+---+-----+
| id| name|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
And here is the table in MySQL:
+---+-----+
| id| name|
+---+-----+
| 1| a|
| 3| c|
| 4| d|
+---+-----+
This is my code (running in spark-shell):
import java.util.Properties
case class App(id: Int, name: String)
val data = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c")))
val dataFrame = data.map { case (id, name) => App(id, name) }.toDF
val dataset = dataFrame.as[App]
val url = "jdbc:mysql://ip:port/tbl_name"
val table = "my_tbl_name"
val user = "my_user_name"
val password = "my_password"
val properties = new Properties()
properties.setProperty("user", user)
properties.setProperty("password", password)
dataset.filter((x: App) =>
0 != sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count).show()
But I get a java.lang.NullPointerException:
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
I have tested:
val x = App(1, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count
val y = App(5, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + y.id.toString), properties).count
and I get the correct results, 1 and 0.
What's the problem with filter?
What's the problem with filter?
You get an exception because you're trying to execute an action (count on a DataFrame) inside a transformation (filter). Neither nested actions nor nested transformations are supported in Spark.
The correct solution is, as usual, either a join on compatible data structures, a lookup using a local data structure, or a query issued directly against the external system (without using Spark data structures).
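A sketch of the join-based approach (assuming the same url, table and properties values as above):
// Load the MySQL table once and keep only the ids that also exist in it
val mysqlDF = sqlContext.read.jdbc(url, table, properties)
dataset.join(mysqlDF, Seq("id"), "left_semi").show()
// Expected (an assumption): only the rows with id 1 and 3 remain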
