SPARK DataFrame: Remove MAX value in a group - apache-spark

My data is like:
id | val
----------------
a1 | 10
a1 | 20
a2 | 5
a2 | 7
a2 | 2
I am trying to delete row that has MAX(val) in the group if I group on "id".
Result should be like:
id | val
----------------
a1 | 10
a2 | 5
a2 | 2
I am using SPARK DataFrame and SQLContext. I need some way like:
DataFrame df = sqlContext.sql("SELECT * FROM jsontable WHERE (id, val) NOT IN (SELECT is,MAX(val) from jsontable GROUP BY id)");
How can I do that?

You can do that using dataframe operations and Window functions. Assuming you have your data in the dataframe df1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val maxOnWindow = max(col("val")).over(Window.partitionBy(col("id")))
val df2 = df1
.withColumn("max", maxOnWindow)
.where(col("val") < col("max"))
.select("id", "val")
In Java, the equivalent would be something like:
import org.apache.spark.sql.functions.Window;
import static org.apache.spark.sql.functions.*;
Column maxOnWindow = max(col("val")).over(Window.partitionBy("id"));
DataFrame df2 = df1
.withColumn("max", maxOnWindow)
.where(col("val").lt(col("max")))
.select("id", "val");
Here's a nice article about window functions: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Below is the Java implementation of Mario's scala code:
DataFrame df = sqlContext.read().json(input);
DataFrame dfMaxRaw = df.groupBy("id").max("val");
DataFrame dfMax = dfMaxRaw.select(
dfMaxRaw.col("id").as("max_id"), dfMaxRaw.col("max(val)").as("max_val")
);
DataFrame combineMaxWithData = df.join(dfMax, df.col("id")
.equalTo(dfMax.col("max_id")));
DataFrame finalResult = combineMaxWithData.filter(
combineMaxWithData.col("id").equalTo(combineMaxWithData.col("max_id"))
.and(combineMaxWithData.col("val").notEqual(combineMaxWithData.col("max_val")))
);

Here is how to do this using RDD and a more Scala-flavored approach:
// Let's first get the data in key-value pair format
val data = sc.makeRDD( Seq( ("a",20), ("a", 1), ("a",8), ("b",3), ("b",10), ("b",9) ) )
// Next let's find the max value from each group
val maxGroups = data.reduceByKey( Math.max(_,_) )
// We join the max in the group with the original data
val combineMaxWithData = maxGroups.join(data)
// Finally we filter out the values that agree with the max
val finalResults = combineMaxWithData.filter{ case (gid, (max,curVal)) => max != curVal }.map{ case (gid, (max,curVal)) => (gid,curVal) }
println( finalResults.collect.toList )
>List((a,1), (a,8), (b,3), (b,9))

Related

Dataframe Comparision in Spark: Scala

I have two dataframes in spark/scala in which i have some common column like salary,bonus,increment etc.
i need to compare these two dataframes's columns and anything changes like in first dataframe salary is 3000 and in second dataframe salary is 5000 then i need to insert 5000-3000=2000 in new dataframe as salary, and if in first dataframe salary is 5000 and in second dataframe salary is 3000 then i need to insert 5000+3000=8000 in new dataframe as salary, and if salary is same in both the dataframe then need to insert from second dataframe.
val columns = df1.schema.fields.map(_.salary)
val salaryDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
salaryDifferences.map(diff => {if(diff.count > 0) diff.show})
I tried above query but its giving column and value where any difference is there i need to also check if diff is negative or positive and based to that i need to perform logic.can anyone please give me a hint how can i implement this and insert record in 3rd dataframe,
Join the Dataframes and use nested when and otherwise clause.
Also find comments in the code
import org.apache.spark.sql.functions._
object SalaryDiff {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
val df1 = List(("1", "5000"), ("2", "3000"), ("3", "5000")).toDF("id", "salary") // First dataframe
val df2 = List(("1", "3000"), ("2", "5000"), ("3", "5000")).toDF("id", "salary") // Second dataframe
val df3 = df1 // Is your 3rd tables
.join(
df2
, df1("id") === df2("id") // Join both dataframes on id column
).withColumn("finalSalary", when(df1("salary") < df2("salary"), df2("salary") - df1("salary")) // 5000-3000=2000 check
.otherwise(
when(df1("salary") > df2("salary"), df1("salary") + df2("salary")) // 5000+3000=8000 check
.otherwise(df2("salary")))) // insert from second dataframe
.drop(df1("salary"))
.drop(df2("salary"))
.withColumnRenamed("finalSalary","salary")
.show()
}
}

How to iterate through dataframe without converting to dataset in spark?

I have a dataframe through which I want to iterate, but I dont want to convert dataframe to dataset.
We have to convert spark scala code to pyspark and pyspark does not support dataset.
I have tried the following code with by converting to dataset
data in file:
abc,a
mno,b
pqr,a
xyz,b
val a = sc.textFile("<path>")
//creating dataframe with column AA,BB
val b = a.map(x => x.split(",")).map(x =>(x(0).toString,x(1).toString)).toDF("AA","BB")
b.registerTempTable("test")
case class T(AA:String, BB: String)
//creating dataset from dataframe
val d = b.as[T].collect
d.foreach{ x=>
var m = spark.sql(s"select * from test where BB = '${x.BB}'")
m.show()
}
Without converting to dataset it gives error i.e. with
val d = b.collect
d.foreach{ x=>
var m = spark.sql(s"select * from test where BB = '${x.BB}'")
m.show()
}
it gives error:
error: value BB is not member of org.apache.spark.sql.ROW
You cannot loop dataframe as you have given in the above code. Use dataframe's rdd.collect to loop dataframe.
import spark.implicits._
val df = Seq(("abc","a"), ("mno","b"), ("pqr","a"),("xyz","b")).toDF("AA", "BB")
df.registerTempTable("test")
df.rdd.collect.foreach(x => {
val BBvalue = x.mkString(",").split(",")(1)
var m = spark.sql(s"select * from test where BB = '$BBvalue'")
m.show()
})
Inside the loop I used mkString to convert an rdd row to string and then split the column values with comma and use the index of column for accessing the value. For example, in the above code I have used (1) which means, column BB column index is 2.
Please let me know if you have any questions.

Spark load files collection in batch and find the line from each file with additional info from file level

I have the files collection specified with comma separator, like:
hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
and I'm loading the files with Apache Spark, all in once with:
val input = sc.textFile(files)
Also, I have additional information associated with each file - the unique ID, for example:
File ID
--------------------------------------------------
hdfs://user/cloudera/date=2018-01-15 | 12345
hdfs://user/cloudera/date=2018-01-16 | 09245
hdfs://user/cloudera/date=2018-01-17 | 345hqw4
and so on
As the output, I need to receive the DataFrame with the rows, where each row will contain the same ID, as the ID of the file from which this line was read.
Is it possible to pass this information in some way to Spark in order to be able to associate with the lines?
Core sql approach with UDF (the same thing you can achieve with join if you represent File -> ID mapping as Dataframe):
import org.apache.spark.sql.functions
val inputDf = sparkSession.read.text(".../src/test/resources/test")
.withColumn("fileName", functions.input_file_name())
def withId(mapping: Map[String, String]) = functions.udf(
(file: String) => mapping.get(file)
)
val mapping = Map(
"file:///.../src/test/resources/test/test1.txt" -> "id1",
"file:///.../src/test/resources/test/test2.txt" -> "id2"
)
val resutlDf = inputDf.withColumn("id", withId(mapping)(inputDf("fileName")))
resutlDf.show(false)
Result:
+-----+---------------------------------------------+---+
|value|fileName |id |
+-----+---------------------------------------------+---+
|row1 |file:///.../src/test/resources/test/test1.txt|id1|
|row11|file:///.../src/test/resources/test/test1.txt|id1|
|row2 |file:///.../src/test/resources/test/test2.txt|id2|
|row22|file:///.../src/test/resources/test/test2.txt|id2|
+-----+---------------------------------------------+---+
text1.txt:
row1
row11
text2.txt:
row2
row22
This could help (not tested)
// read single text file into DataFrame and add 'id' column
def readOneFile(filePath: String, fileId: String)(implicit spark: SparkSession): DataFrame = {
val dfOriginal: DataFrame = spark.read.text(filePath)
val dfWithIdColumn: DataFrame = dfOriginal.withColumn("id", lit(fileId))
dfWithIdColumn
}
// read all text files into DataFrame
def readAllFiles(filePathIdsSeq: Seq[(String, String)])(implicit spark: SparkSession): DataFrame = {
// create empty DataFrame with expected schema
val emptyDfSchema: StructType = StructType(List(
StructField("value", StringType, false),
StructField("id", StringType, false)
))
val emptyDf: DataFrame = spark.createDataFrame(
rowRDD = spark.sparkContext.emptyRDD[Row],
schema = emptyDfSchema
)
val unionDf: DataFrame = filePathIdsSeq.foldLeft(emptyDf) { (intermediateDf: DataFrame, filePathIdTuple: (String, String)) =>
intermediateDf.union(readOneFile(filePathIdTuple._1, filePathIdTuple._2))
}
unionDf
}
References
spark.read.text(..) method
Create empty DataFrame

Spark aggregate rows with custom function

To make it simple, let's assume we have a dataframe containing the following data:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|info1 |info2 |
|firstName1|lastName1|myInfo1 |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2 |
+----------+---------+----------+----------+
How can I merge all rows grouping by (firstName,lastName) and keep in the columns Phone and Address only data starting by "my" to get the following :
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1 |myInfo2 |
+----------+---------+----------+----------+
Maybe should I use agg function with a custom UDAF? But how can I implement it?
Note: I'm using Spark 2.2 along with Scala 2.11.
You can use groupBy and collect_set aggregation function and use a udf function to filter in the first string that starts with "my"
import org.apache.spark.sql.functions._
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)
df.groupBy("firstName ", "lastName")
.agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
.show(false)
which should give you
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
I hope the answer is helpful
If only two columns involved, filtering and join can be used instead of UDF:
val df = List(
("firstName1", "lastName1", "info1", "info2"),
("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))
val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
.select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)
Output:
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
For many columns, when only one row expected, such construction can be used:
val columnsForSearch = List("Phone", "Address")
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)
Output is the same.
UDF with two parameters example:
val twoParamFunc = (firstName: String, Phone: String) => firstName + ": " + Phone
val twoParamUDF = udf(twoParamFunc)
df.select(twoParamUDF($"firstName", $"Phone")).show(false)

Spark how to use a UDF with a Join

I'd like to use a specific UDF with using Spark
Here's the plan:
I have a table A(10 million rows) and a table B(15 millions rows)
I'd like to use an UDF comparing one element of the table A and one of the table B
Is it possible
Here's a a sample of my code. At some point i also need to say that my UDF compare must be greater than 0,9:
DataFrame dfr = df
.select("name", "firstname", "adress1", "city1","compare(adress1,adress2)")
.join(dfa,df.col("adress1").equalTo(dfa.col("adress2"))
.and((df.col("city1").equalTo(dfa.col("city2"))
...;
Is it possible ?
Yes, you can. However it will be slower than normal operators, as Spark will be not able to do predicate pushdown
Example:
val udf = udf((x : String, y : String) => { here compute similarity; });
val df3 = df1.join(df2, udf(df1.field1, df2.field1) > 0.9)
For example:
val df1 = Seq (1, 2, 3, 4).toDF("x")
val df2 = Seq(1, 3, 7, 11).toDF("q")
val udf = org.apache.spark.sql.functions.udf((x : Int, q : Int) => { Math.abs(x - q); });
val df3 = df1.join(df2, udf(df1("x"), df2("q")) > 1)
You can also directly return boolean from User Defined Function

Resources