Merging 2 dataframes in pandas on multiple keys

I have df1 and df2 and I want to make a left join using pandas.
I also tried this:
data_2 = pd.merge(df1, df2, on=['var1', 'var2', 'var3'])
but it is not really what I want to do.
I wrote the following join in SQL just to show what I really want to do (please notice that the two dataframes have different column names):
create df3 as
select a.* , b.*
from df1 as a left join df2 as b
on a.id=b.id_var
and a.speciality=b.speciality
and upcase(a.global_name)= upcase(b.product_name)
How can I do it using pandas?

The pandas equivalent adds a temporary upper-cased key to both frames, merges on it, and then drops it:
(df1.assign(upcase=df1.global_name.str.upper())
    .merge(df2.assign(upcase=df2.product_name.str.upper()),
           left_on=['id', 'speciality', 'upcase'],
           right_on=['id_var', 'speciality', 'upcase'],
           how='left')
    .drop('upcase', axis=1)
)
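
To see the case-insensitive left join on concrete rows, here is a minimal sketch. The sample values and the price column are made up for illustration; only the column names id, speciality, global_name, id_var and product_name come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the key column names come from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "speciality": ["cardio", "neuro", "derma"],
    "global_name": ["Aspirin", "ibuprofen", "Zinc"],
})
df2 = pd.DataFrame({
    "id_var": [1, 2],
    "speciality": ["cardio", "neuro"],
    "product_name": ["ASPIRIN", "Ibuprofen"],
    "price": [10.0, 12.5],  # made-up extra column
})

df3 = (
    # Temporary upper-cased key on the left side.
    df1.assign(upcase=df1["global_name"].str.upper())
    .merge(
        # Same temporary key on the right side.
        df2.assign(upcase=df2["product_name"].str.upper()),
        left_on=["id", "speciality", "upcase"],
        right_on=["id_var", "speciality", "upcase"],
        how="left",
    )
    # The helper key is no longer needed after the merge.
    .drop(columns="upcase")
)
print(df3)
```

The third row of df1 has no match in df2, so its df2 columns come back as missing values, which is what a SQL left join would give.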

Related

How do I give col names for reduce way of merging data frames

I have two dataframes, df1 and df2:
dfs = [df1, df2]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Serial_Nbr'), dfs)
I want to select only one column from df1, apart from the merge column Serial_Nbr, while doing the merge.
How do I do this?

Filter the columns of df1 before merging (add the column you want to keep next to Serial_Nbr in the list):
dfs=[df1[['Serial_Nbr']],df2]
Or, if there are only 2 DataFrames, drop reduce:
df_final = pd.merge(df1[['Serial_Nbr']], df2, on='Serial_Nbr')
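
A small runnable sketch of the same idea, keeping the merge key plus one extra column from df1. The Model, Owner and Price columns and all values are hypothetical, since the question does not name any column other than Serial_Nbr.

```python
from functools import reduce

import pandas as pd

# Hypothetical data: "Model", "Owner" and "Price" are made-up columns.
df1 = pd.DataFrame({
    "Serial_Nbr": [1, 2, 3],
    "Model": ["A", "B", "C"],
    "Owner": ["x", "y", "z"],
})
df2 = pd.DataFrame({"Serial_Nbr": [2, 3, 4], "Price": [20, 30, 40]})

# Keep only the merge key plus the one wanted column from df1.
dfs = [df1[["Serial_Nbr", "Model"]], df2]

# reduce merges the frames pairwise, left to right, on the shared key.
df_final = reduce(lambda left, right: pd.merge(left, right, on="Serial_Nbr"), dfs)
print(df_final)
```

Because pd.merge defaults to an inner join, only the serial numbers present in every frame survive.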

How to merge specific column from another dataframe in Python Pandas?

I have two dataframes, df1 and df2. df1 has the columns 'id', 'name', 'rol' and df2 has 'id', 'sal', 'add', 'deg'.
I have to merge only the 'sal' and 'deg' columns from df2 into df1.
I have successfully merged all columns from df2 into df1,
but now I just need to add two columns on the basis of the common column "id".
I am using Python 3.7.
df_right = pd.merge(df1,df2,how='right',on='id')
How can I merge only these two columns ('sal' and 'deg') from df2 on the basis of 'id'?

Just slice df2 before you merge, like so:
pd.merge(left=df1, right=df2[['id', 'sal', 'deg']], how='right', on='id')
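
As a quick check of what the sliced merge returns, here is a sketch with invented rows. The column names follow the question, while the values are assumptions.

```python
import pandas as pd

# Made-up rows; column names are the ones given in the question.
df1 = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"], "rol": ["dev", "ops"]})
df2 = pd.DataFrame({
    "id": [1, 2, 3],
    "sal": [100, 200, 300],
    "add": ["p", "q", "r"],
    "deg": ["BSc", "MSc", "PhD"],
})

# Slice df2 down to the key and the two wanted columns, then merge.
merged = pd.merge(left=df1, right=df2[["id", "sal", "deg"]], how="right", on="id")
print(merged)
```

With how='right' every id in df2 is kept, so id 3 appears with missing name and rol. If the goal is to keep exactly the rows of df1, how='left' may be the better fit.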

Difference Between two Data frames

Is there any way to subtract the values of two existing dataframes with common headers in Java?
For example:
DF1
|H0|H1|H2|H3|
|00|01|02|03|
|04|05|06|07|
|08|09|10|11|
DF2
|H0|H1|H2|H3|H4|
|01|02|03|04|12|
|05|06|07|08|13|
|09|11|12|13|14|
Subtraction example:
DF2 - DF1
|H0|H1|H2|H3|H4|
|01|01|01|01|12|
|01|01|01|01|13|
|01|01|01|01|14|

Pandas data frame merge select columns

I want to get data only from df2 (all columns) by comparing the 'no' field in both df1 and df2.
My 3-line code is below; with it I get all columns from both df1 and df2 and cannot trim the fields from df1. How can I achieve this?
I have 2 pandas dataframes like below:
df1:
no,name,salary
1,abc,100
2,def,105
3,abc,110
4,def,115
5,abc,120
df2:
no,name,salary,dept,addr
1,abc,100,IT1,ADDR1
2,abc,101,IT2,ADDR2
3,abc,102,IT3,ADDR3
4,abc,103,IT4,ADDR4
5,abc,104,IT5,ADDR5
6,abc,105,IT6,ADDR6
7,abc,106,IT7,ADDR7
8,abc,107,IT8,ADDR8
df1 = pd.read_csv("D:\\data\\data1.csv")
df2 = pd.read_csv("D:\\data\\data2.csv")
resDF = pd.merge(df1, df2, on='no' , how='inner')

I think you need to select only the no column from df1; then the on and how parameters are not necessary, because merge defaults to an inner join on the common columns:
resDF = pd.merge(df1[['no']], df2)
Or use boolean indexing, filtering with isin:
resDF = df2[df2['no'].isin(df1['no'])]
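
Both variants can be tried on the data from the question. This sketch reads the two tables from inline strings instead of the CSV files, which is the only assumption added here.

```python
import io

import pandas as pd

# The two tables from the question, inlined instead of read from disk.
df1 = pd.read_csv(io.StringIO("""no,name,salary
1,abc,100
2,def,105
3,abc,110
4,def,115
5,abc,120
"""))
df2 = pd.read_csv(io.StringIO("""no,name,salary,dept,addr
1,abc,100,IT1,ADDR1
2,abc,101,IT2,ADDR2
3,abc,102,IT3,ADDR3
4,abc,103,IT4,ADDR4
5,abc,104,IT5,ADDR5
6,abc,105,IT6,ADDR6
7,abc,106,IT7,ADDR7
8,abc,107,IT8,ADDR8
"""))

# Variant 1: merge on the only shared column, "no" (inner join by default).
res_merge = pd.merge(df1[["no"]], df2)

# Variant 2: keep the df2 rows whose "no" also occurs in df1.
res_isin = df2[df2["no"].isin(df1["no"])]

print(res_merge)
print(res_isin)
```

One difference worth knowing: if df1 contained duplicate no values, the merge would repeat the matching df2 rows, while the isin filter would return each df2 row at most once.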

How to merge edits from one dataframe into another dataframe in Spark?

I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). df2 has only the columns with updates populated; the other columns are null. What I want to do is update the rows in df1 with the corresponding rows from df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in column in df1 should get replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30

You can also use case or coalesce with a full outer join to merge the two dataframes. See the link below for an explanation.
Spark incremental loading overwrite old record

I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
val idsToEdits=df2.rdd.map{row=>
(row(0),
row.getValuesMap[AnyVal](row.schema.fieldNames.filterNot(colName=>row.isNullAt(row.fieldIndex(colName))))
.map{case (k,v)=> (row.fieldIndex(k),if(v=="~") null else v)} )
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr=sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow:Row=>Row={ row =>
idsToEditsBr
.value
.get(row(0))
.map{edits => Row.fromSeq(edits.foldLeft(row.toSeq){case (rowSeq,
(idx,newValue))=>rowSeq.updated(idx,newValue)})}
.getOrElse(row)
}
Finally, apply that function to the RDD derived from df1 and convert the result back to a dataframe.
val updatedDF=spark.createDataFrame(df1.rdd.map(editRow),df1.schema)
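
The per-row rule that editRow applies can also be sketched outside Spark. The following is plain Python over lists of dicts, not Spark code; the apply_edits helper and its parameters are hypothetical names introduced only for this illustration, and the rows are the ones from the question's example.

```python
def apply_edits(rows, edits, key="id", null_marker="~"):
    """Return new rows with the edits applied; both inputs are lists of dicts."""
    # Index the edits by key so each row can look up its edit directly.
    edits_by_key = {edit[key]: edit for edit in edits}
    updated = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not mutated
        for column, value in edits_by_key.get(row[key], {}).items():
            if column == key or value is None:
                continue  # null in the edit: keep the original value
            # Tilde nullifies the column; anything else replaces the value.
            new_row[column] = None if value == null_marker else value
        updated.append(new_row)
    return updated


df1_rows = [
    {"id": 1, "name": "Greg", "age": 18},
    {"id": 2, "name": "Kate", "age": 25},
    {"id": 3, "name": "Chris", "age": 30},
]
df2_rows = [
    {"id": 1, "name": "Gregory", "age": None},
    {"id": 2, "name": "~", "age": 26},
]

result = apply_edits(df1_rows, df2_rows)
for row in result:
    print(row)
```

Rows without a matching edit, such as id 3, pass through unchanged, which mirrors the getOrElse(row) fallback in the Scala version.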

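It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType
// Suffix every df2 column with "2" so the names do not clash after the join.
val df2Renamed = df2.select(df2.columns.map(c => col(c).alias(c + "2")): _*)
// doLogicUdf is a placeholder for your own per-column logic (e.g. the tilde handling).
val cols = df1.schema.filterNot(f => f.name == "id").map { f =>
if (f.dataType == StringType) {
doLogicUdf(col(f.name), col(f.name + "2"))
} else {
when(col(f.name + "2").isNotNull, col(f.name + "2")).otherwise(col(f.name))
}
} :+ col("id")
// A left join keeps the df1 rows that have no edit in df2; doLogicUdf must then accept a null second argument.
df1.join(df2Renamed, col("id") === col("id2"), "left").select(cols: _*)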
