Column value comparison in a Spark data frame - apache-spark

I have a data frame which contains a huge number of records. In that DF a record can be repeated multiple times, and every time it gets updated the last-updated field holds the date on which the record was modified.
We have a group of columns on which we want to compare rows with the same id. During this comparison we want to capture which fields/columns changed from the previous record to the current record, and store that in an "updated_columns" column of the updated record. Then compare the second record to the third record, identify the updated columns, and capture them in the "updated_columns" field of the third record; continue the same way until the last record of that id, and do the same for every id that has more than one entry.
Initially we grouped the columns, created a hash out of that group of columns, and compared it against the hash of the next row. This helps me identify the records that have updates, but I also want the columns that were updated.
Here I am sharing some data showing the expected outcome, i.e. how the final data should look after adding the updated columns (use columns Col1, Col2, Col3, Col4 and Col5 for the comparison between two rows):
I want to do this in an efficient way. Has anyone tried something like this?
Looking for help!
~Krish.

A window can be used.
The idea is to group the data by ID, sort it by LAST-UPDATED, copy the values of the previous row (if it exists) into the current row and then compare the copied data with the current values.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, col, lag, udf}
import spark.implicits._

val data = ... //the dataframe has the columns ID,Col1,Col2,Col3,Col4,Col5,LAST_UPDATED,IS_DELETED
val fieldNames = data.schema.fieldNames.dropRight(2) //1 keep everything except LAST_UPDATED and IS_DELETED
val columns = fieldNames.map(f => col(f))
val windowspec = Window.partitionBy("ID").orderBy("LAST_UPDATED") //2
def compareArrayUdf() = ... //3
val result = data
  .withColumn("cur", array(columns: _*)) //4
  .withColumn("prev", lag($"cur", 1).over(windowspec)) //5
  .withColumn("updated_columns", compareArrayUdf()($"cur", $"prev")) //6
  .drop("cur", "prev") //7
  .orderBy("LAST_UPDATED")
Remarks:
1. create a list of all fields to compare; all fields except the last two (LAST_UPDATED and IS_DELETED) are used
2. create a window that is partitioned by ID, with each partition sorted by LAST_UPDATED
3. create a udf that compares two arrays and maps the discovered differences to the field names (code below)
4. create a new column that contains all values that should be compared
5. create a new column that contains all values of the previous row (using the lag function) that should be compared. The previous row is the row with the same ID and the largest LAST_UPDATED that is smaller than the current one. This field can be null
6. compare the two new columns and put the result into updated_columns
7. drop the two intermediate columns created in steps 4 and 5
The compareArrayUdf is:
import scala.collection.mutable

def compareArray(cur: mutable.WrappedArray[String], prev: mutable.WrappedArray[String]): String = {
  if (prev == null || cur == null) return ""
  val res = new StringBuilder
  for (i <- cur.indices) {
    if (!cur(i).contentEquals(prev(i))) {
      if (res.nonEmpty) res.append(",")
      res.append(fieldNames(i))
    }
  }
  res.toString()
}
def compareArrayUdf() = udf[String, mutable.WrappedArray[String], mutable.WrappedArray[String]](compareArray)
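As a quick sanity check, here is a hypothetical input you could plug in for the "val data = ..." placeholder above (the rows are made up for illustration; column names follow the schema comment). Only Col2 and Col5 change between the two versions of ID "1":

// Hypothetical data: two versions of the same ID where Col2 and Col5 were modified.
val data = Seq(
  ("1", "a", "b", "c", "d", "e", "2021-01-01", "N"),
  ("1", "a", "B", "c", "d", "E", "2021-01-02", "N")
).toDF("ID", "Col1", "Col2", "Col3", "Col4", "Col5", "LAST_UPDATED", "IS_DELETED")

Running the pipeline above on this input should give an empty updated_columns for the first row (it has no previous version) and "Col2,Col5" for the second row.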

You can join your DataFrame or DataSet to itself, joining the rows where the id is the same in both rows and where the version of the left row is i and the version of the right row is i+1. Here's an example
import spark.implicits._

case class T(id: String, version: Int, data: String)

val data = Seq(T("1", 1, "d1-1"), T("1", 2, "d1-2"), T("2", 1, "d2-1"), T("2", 2, "d2-2"), T("2", 3, "d2-3"), T("3", 1, "d3-1"))
val ds = data.toDS
val joined = ds.as("ds1").join(ds.as("ds2"), $"ds1.id" === $"ds2.id" && (($"ds1.version" + 1) === $"ds2.version"))
And then you can reference the columns in the new DataFrame/Dataset like $"ds1.data" and $"ds2.data", etc.
To find the rows where the data changed from one version to another, you can do
joined.filter($"ds1.data" =!= $"ds2.data")
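To also capture which columns changed (as in the original question), here is a minimal sketch using built-in functions; the list of compared columns below is an assumption you would replace with your own:

import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

// For every compared column, emit its name when the value differs between the two versions;
// concat_ws skips the nulls and joins the remaining names with commas.
val comparedColumns = Seq("data") // replace with the real columns to compare
val updatedColumns = concat_ws(",",
  comparedColumns.map(c => when(col(s"ds1.$c") =!= col(s"ds2.$c"), lit(c))): _*)

val withChanges = joined.select(
  col("ds2.id"), col("ds2.version"), col("ds2.data"),
  updatedColumns.as("updated_columns"))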

Related

Every value is an Object... want to delete the first column of a pandas dataframe

If I try to print any value of the data frame I'm getting an Object (id value). I just want the value in the data frame.
I've tried removing the first column (df[0]), but that removes the date column...
If you want to display the values of a specific column of a DataFrame:
df['Start Time'].values.tolist()
To reset the index, you can use:
df = df.reset_index(drop=True)
And to change the type of a specific column (to int/float/str ...):
df['column_name'] = df['column_name'].astype(int)

Spark drop duplicates and select row with max value

I'm trying to drop duplicates based on column1 and select the row with the max value in column2. Column2 has "year" (2019, 2020, etc.) as values and it is of type "String". The solution I have is converting column2 into an integer and selecting the max value.
Dataset<Row> ds ; //The dataset with column1,column2(year), column3 etc.
Dataset<Row> newDs = ds.withColumn("column2Int", col("column2").cast(DataTypes.IntegerType));
newDs = newDs.groupBy("column1").max("column2Int"); // drops all other columns
This approach drops all other columns in the original dataset 'ds' when I do the "group by", so I have to do a join between 'ds' and 'newDs' to get back all the original columns. Also, casting the String column to Integer looks like an inefficient workaround.
Is it possible to drop the duplicates and get the row with the larger string value from the original dataset itself?
This is a classic de-duplication problem and you'll need to use a Window + rank + filter combo for this.
I'm not very familiar with the Java syntax, but the sample code should look something like the code below.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

Dataset<Row> df = ???;

WindowSpec windowSpec = Window.partitionBy("column1").orderBy(functions.desc("column2Int"));

Dataset<Row> result =
    df.withColumn("column2Int", functions.col("column2").cast(DataTypes.IntegerType))
        .withColumn("rank", functions.rank().over(windowSpec))
        .where("rank == 1")
        .drop("rank");

result.show(false);
Overview of what happens:
1. The cast integer column is added to the df for later sorting.
2. Subsections/windows (partitions) are formed in your dataset based on the value of column1.
3. Within each of these subsections/windows/partitions the rows are sorted on the column cast to int, in descending order since you want the max.
4. Ranks (like row numbers) are assigned to the rows in each partition/window.
5. Filtering keeps all rows where the rank is 1 (the max value, since the ordering was descending).
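For reference, a rough Scala sketch of the same window + rank + filter approach (assuming ds is the dataset from the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, rank}
import org.apache.spark.sql.types.IntegerType

// Partition by column1, sort each partition by the integer-cast year in descending order,
// keep only the top-ranked row per partition and drop the helper columns.
val windowSpec = Window.partitionBy("column1").orderBy(desc("column2Int"))

val result = ds
  .withColumn("column2Int", col("column2").cast(IntegerType))
  .withColumn("rank", rank().over(windowSpec))
  .where(col("rank") === 1)
  .drop("rank", "column2Int")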

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
from pyspark.sql import Row

vals = [
    Row(ID=1, VAL=None),
    Row(ID=2, VAL=None),
    Row(ID=3, VAL=None),
    Row(ID=4, VAL=None),
    Row(ID=5, VAL=None),
    Row(ID=6, VAL=None),
    Row(ID=7, VAL=None),
    Row(ID=8, VAL=None),
    Row(ID=9, VAL=None),
    Row(ID=10, VAL=None)
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket" and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code, and you can map it to your solution.
Using a window function over a single partition we can generate a row_number() sequence for each row of the dataframe and store it, say, in a column row_num.
Next, your "rules" can be represented as another little dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on the row number, adding the new column:
df1.alias('df1').join(
    df2.alias('df2'),
    on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num'))
).select('df1.*', 'df2.label')
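For reference, the complete flow sketched in Scala (the PySpark version is analogous); df is assumed to be the dataframe from the question, and the row-number ranges and labels below are assumptions matching its example:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
import spark.implicits._

// 1. Number the rows over a single window (no partitioning, so all data goes to
//    one partition; acceptable here because the dataframe is tiny).
val numbered = df.withColumn("row_num", row_number().over(Window.orderBy("ID")))

// 2. The "rules" dataframe: which row-number ranges receive which label.
val rules = Seq((1, 3, "lets"), (4, 6, "bucket"), (7, 10, "this"))
  .toDF("min_row_num", "max_row_num", "label")

// 3. Range-join on the row number and pick up the label as the new VAL.
val result = numbered
  .join(rules, col("row_num").between(col("min_row_num"), col("max_row_num")))
  .select(col("ID"), col("label").as("VAL"))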

Spark: filter out all rows based on key/value

I have an RDD, x, in which I have two fields: id, value. If a row has a particular value, I want to take the id and filter out all rows with that id.
For example if I have:
id1,value1
id1,value2
and I want to filter out all ids where any row with that id has the value value1, then I would expect both rows to be filtered out. But currently only the first row is filtered out, because it has a value of value1.
I've tried something like
val filter = x.filter(row => (set contains row.value))
This filters out all rows with a particular value, but leaves the other rows with the same id still in the RDD.
You have to apply a filter function to each RDD row, and the function after the => should check whether the row (viewed as a sequence of values) contains the token at the index you care about. You might have to adjust the index of the token, but it should look something like this (whether you use contains or !contains depends on whether you want to filter in or filter out):
val filteredRDD = rawRDD
  .filter(rowItem => !(rowItem.map(_.toString).toSeq.contains(rowItem(0).toString)))
or even something like:
val filteredRDD = rawRDD.filter(rowItem => !(rowItem._2 contains rowItem._1))
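For the behaviour described in the question (drop every id that ever appears with value1), a minimal sketch, assuming x is an RDD of (id, value) pairs and that the set of offending ids is small enough to collect to the driver:

// Collect the ids that have the unwanted value, broadcast them,
// and then drop every row whose id is in that set.
val badIds = x.filter { case (_, value) => value == "value1" }
  .keys
  .collect()
  .toSet

val badIdsBroadcast = sc.broadcast(badIds)
val filtered = x.filter { case (id, _) => !badIdsBroadcast.value.contains(id) }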

efficiently get joined and not joined data of a dataframe against another dataframe

I have two dataframes, let's say A and B. They have different schemas.
I want to get the records from dataframe A which join with B on a key, and I also want the records which didn't get joined.
Can this be done in a single query?
Going over the same data twice will reduce the performance. DataFrame A is much bigger in size than B.
Dataframe B's size will be around 50 GB-100 GB.
Hence I can't broadcast B in that case.
I am okay with getting a single dataframe C as a result, which can have a partition column "Joined" with values "Yes" or "No", signifying whether the data in A got joined with B or not.
What if A has duplicates? I don't want them.
I was thinking that I'll do a reduceByKey later on the C dataframe. Any suggestions around that?
I am using Hive tables to store the data in ORC file format on HDFS.
Writing code in Scala.
Yes, you just need to do a left-outer join:
import sqlContext.implicits._
import org.apache.spark.sql.functions.udf

val A = sc.parallelize(List(("id1", 1234), ("id1", 1234), ("id3", 5678))).toDF("id1", "number")
val B = sc.parallelize(List(("id1", "Hello"), ("id2", "world"))).toDF("id2", "text")

val joined = udf((id: String) => id match {
  case null => "No"
  case _    => "Yes"
})

val C = A
  .distinct
  .join(B, 'id1 === 'id2, "left_outer")
  .withColumn("joined", joined('id2))
  .drop('id2)
  .drop('text)
This will yield a dataframe C:[id1: string, number: int, joined: string] that looks like this:
[id1,1234,Yes]
[id3,5678,No]
Note that I have added a distinct to filter out duplicates in A, and that the last column in C indicates whether or not the row was joined.
EDIT: Following a remark from the OP, I have added the drop lines to remove the columns that came from B.
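As a side note, the UDF isn't strictly required; here is a sketch of the same "joined" flag using the built-in when/otherwise (reusing A and B from above):

import org.apache.spark.sql.functions.when

// id2 is null exactly when the row from A found no match in B.
val C2 = A
  .distinct
  .join(B, $"id1" === $"id2", "left_outer")
  .withColumn("joined", when($"id2".isNull, "No").otherwise("Yes"))
  .drop("id2")
  .drop("text")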
