I have two dataframes that each have a List column. The dataframes are identical except that the order of the elements in the list differs between them.
e.g. Schema: (id text, name List<text>)
df1: (5,WrappedArray(abc, pqr, xyz))
df2: (5,WrappedArray(abc, xyz, pqr))
When I use intersect I don't get this record in the results. How can I get the intersection of such records?
I think you are right that the easiest way would be to sort the list column.
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._  // for the $"" column syntax

// UDF that returns the array with its elements sorted
val sortListFunc = udf((inputList: WrappedArray[String]) => {
  inputList.sorted
})

val df1Sorted = df1
  .withColumn("name_sorted", sortListFunc(col("name")))
  .select($"id", $"name_sorted".as("name"))
val df2Sorted = df2
  .withColumn("name_sorted", sortListFunc(col("name")))
  .select($"id", $"name_sorted".as("name"))
Then you should be able to join or intersect.
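For completeness, a minimal sketch of that last step, assuming the df1Sorted and df2Sorted DataFrames defined above:
// The list elements are now in a canonical order, so the rows compare
// equal and intersect returns the record with id 5.
val common = df1Sorted.intersect(df2Sorted)
common.show()
On recent Spark versions the built-in sort_array function could also replace the UDF.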
I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
Importantly, I want to match on the combination of both columns, not on each column individually.
My approach was this:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# 1. Get all combinations from df1 and broadcast them
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
df_combinations_bc = sc.broadcast(df_combinations)

# 2. Define udf that checks whether a (city, postcode) pair is among the combinations
def combination_in_vx(ort, plz):
    for arr_el in df_combinations_bc.value:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False

combination_in_vx = udf(combination_in_vx, BooleanType())

# 3. Flag the matching rows and keep only them
df_tmp = df_2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know about a better solution here? Thank you very much!
You can do a left semi join using the two columns. This keeps the rows in df2 where the combination of values in the two specified columns also exists in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
I have two dataframes:
Dataframe 1
Dataframe 2
The ID column is not unique in the two tables. I want to compare all the columns in both tables except the IDs and print the unique rows.
Expected output
I tried the 'isin' function, but it is not working. Each dataframe has 150000 rows and I removed duplicates in both tables. Please advise how to do that.
You can use df.append to combine the two dataframes (or pd.concat([df1, df2], ignore_index=True) on newer pandas versions), then use df.duplicated, which flags every row that occurs more than once. Inverting that mask leaves the rows that are unique to one of the dataframes:
df3 = df1.append(df2, ignore_index=True)
df4 = df3.duplicated(subset=['Team', 'name', 'Country', 'Token'], keep=False)
df_unique = df3[~df4]
I have two tables A and B with hundreds of columns. I am trying to apply a left outer join on the two tables, but they have different keys.
I created a new column in B with the same key name as in A and was then able to apply the left outer join. However, how do I join both tables if I am unable to make the column names consistent?
This is what I have tried:
a = spark.table('a').rdd
b = spark.table('b')
b = b.withColumn("acct_id",col("id"))
b = b.rdd
a.leftOuterJoin(b).collect()
If you already have DataFrames, why are you creating RDDs from them? Is there any specific need?
Try the command below on the DataFrames:
a.join(b, a.column_name==b.column_name, 'left').show()
Here are a few commands you can use to investigate your dataframe:
##Get column names of dataframe
a.columns
##Get column names with their datatype of dataframe
a.dtypes
##What is the type of object (eg. dataframe, rdd etc.)
type(a)
DataFrames are faster than RDDs, and you already have DataFrames, so I suggest:
a = spark.table('a')
b = spark.table('b').withColumn("acct_id", col("id"))
result = a.join(b, a["id"] == b["acct_id"], "left")
What I would like to do is:
Join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B.
I tried something like what I put below with different quotation marks, but it is still not working. I feel that in PySpark there should be a simple way to do this.
A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)
I know you could write
A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")
to do this, but I would like to do it more like the pseudocode above.
Your pseudocode is basically correct. This slightly modified version would work if the id column existed in both DataFrames (the aliases make the qualified references like "A.*" resolvable):
A_B = A.alias("A").join(B.alias("B"), on="id").select("A.*", "B.b1", "B.b2")
From the docs for pyspark.sql.DataFrame.join():
If on is a string or a list of strings indicating the name of the join
column(s), the column(s) must exist on both sides, and this performs
an equi-join.
Since the keys are different, you can just use withColumn() (or withColumnRenamed()) to create a column with the same name in both DataFrames:
A_B = A.alias("A").withColumn("id", col("a_id"))\
    .join(B.alias("B").withColumn("id", col("b_id")), on="id")\
    .select("A.*", "B.b1", "B.b2")
If your DataFrames have long complicated names, you could also use alias() to make things easier:
A_B = long_data_frame_name1.alias("A").withColumn("id", col("a_id"))\
.join(long_data_frame_name2.alias("B").withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
Try this solution:
A_B = A.alias('A').join(B.alias('B'), col('B.id') == col('A.id'))\
    .select([col('A.' + xx) for xx in A.columns] + [col('B.other1'), col('B.other2')])
The two lists inside select() do the trick of selecting all columns from A and two columns from B:
[col('A.' + xx) for xx in A.columns] : all the columns of A
[col('B.other1'), col('B.other2')] : the two chosen columns of B
I think the easier solution is just to join table A to table B with only the columns you want from B. Here is sample code to do this:
joined_tables = table_A.join(table_B.select('id', 'col1', 'col2', 'col3'), ['id'])
The code above keeps all columns from table_A and only "col1", "col2", "col3" from table_B; note that the join key 'id' has to be part of the select so the join can be performed.
I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there is a key column id to identify which row to update). df2 has only the columns with updates populated; the rest of the columns are null. What I want to do is to update the rows in df1 with the corresponding rows from dataframe df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in column in df1 should get replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30
You can also merge the two dataframes with a full outer join and then use case/when or coalesce to pick the value for each column. See the link below for an explanation, and a rough sketch of the idea after it.
Spark incremental loading overwrite old record
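As a rough sketch of that idea (this is not taken from the linked answer; it uses a left join from df1, since df2 only contains edits, and handles the "~" convention from the question explicitly):
import org.apache.spark.sql.functions.{col, lit, when}

// Suffix df2's columns so the two sides can be told apart after the join
val df2R = df2.columns.foldLeft(df2)((d, c) => d.withColumnRenamed(c, c + "_u"))
val joined = df1.join(df2R, col("id") === col("id_u"), "left")

// For every non-key column: keep df1's value when df2 has no update,
// null it out when the update is "~", otherwise take df2's value.
val merged = df1.columns.filterNot(_ == "id").foldLeft(joined) { (d, c) =>
  d.withColumn(c,
    when(col(c + "_u").isNull, col(c))
      .when(col(c + "_u") === "~", lit(null))
      .otherwise(col(c + "_u")))
}.select(df1.columns.map(col): _*)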
I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
val idsToEdits = df2.rdd.map { row =>
  (row(0),
    row.getValuesMap[AnyVal](row.schema.fieldNames.filterNot(colName => row.isNullAt(row.fieldIndex(colName))))
      .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr = sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row

val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) { case (rowSeq, (idx, newValue)) =>
        rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}
Finally, apply that function to the RDD derived from df1 and convert the result back to a dataframe.
val updatedDF = spark.createDataFrame(df1.rdd.map(editRow), df1.schema)
It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// Build one expression per non-key column of df1
val cols = df1.schema.filterNot(x => x.name == "id").map { x =>
  if (x.dataType == StringType) {
    doLogicUdf(col(x.name), col(x.name + "2"))
  } else {
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name))
  }
} :+ col("id")

// Suffix every df2 column with "2" so the two sides can be told apart after the join
val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)

// A left join keeps the df1 rows that have no matching edits in df2
df1.join(df2Renamed, col("id") === col("id2"), "left").select(cols: _*)
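The doLogicUdf itself is not defined in the answer; as a hypothetical sketch that follows the rules from the question (a null update keeps the old value, "~" nullifies it, anything else replaces it), it could look something like this:
import org.apache.spark.sql.functions.udf

// Hypothetical string-column merge rule: no update -> keep the old value,
// "~" -> null out the column, otherwise -> take the updated value.
val doLogicUdf = udf((oldValue: String, newValue: String) =>
  newValue match {
    case null => oldValue
    case "~"  => null
    case v    => v
  })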