I have two dataframes, df1 and df2.
In the first dataframe (df1) I have a list of search values in Col2, keyed by Col1:
Col1 Col2
A1 val1, val2
B2 val4, val1
C3 val2, val5
I have another dataframe (df2) with a list of items for each value:
value items
val1 apples, oranges
val2 honey, mustard
val3 banana, milk
val4 biscuit
val5 chocolate
I want to iterate through the first dataframe and use each value in Col2 as a key to look up items in the second dataframe.
Expected output:
A1 apples, oranges, honey, mustard
B2 biscuit, apples, oranges
C3 honey, mustard, chocolate
I am able to load the values into a dataframe and iterate through the first one:
final_list = []  # list to hold all the values
for index, row in df1.iterrows():
    for i in row['Col2'].split(', '):
        print(i)
I just need help fetching the items from the second dataframe.
I would appreciate any help. Thanks.
The idea is to use a lambda function with split and a dictionary lookup:
d = df2.set_index('value')['items'].to_dict()
df1['Col2'] = df1['Col2'].apply(lambda x: ', '.join(d[y] for y in x.split(', ') if y in d))
print (df1)
Col1 Col2
0 A1 apples, oranges, honey, mustard
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
If the items values are lists, the solution changes to flatten them:
d = df2.set_index('value')['items'].to_dict()
f = lambda x: ', '.join(z for y in x.split(', ') if y in d for z in d[y])
df1['Col2'] = df1['Col2'].apply(f)
print (df1)
Col1 Col2
0 A1 apples, oranges
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
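For completeness, here is a minimal, self-contained sketch of the first variant (string items), assuming the sample data and column names exactly as shown in the question:
import pandas as pd

# Sample data as given in the question
df1 = pd.DataFrame({'Col1': ['A1', 'B2', 'C3'],
                    'Col2': ['val1, val2', 'val4, val1', 'val2, val5']})
df2 = pd.DataFrame({'value': ['val1', 'val2', 'val3', 'val4', 'val5'],
                    'items': ['apples, oranges', 'honey, mustard',
                              'banana, milk', 'biscuit', 'chocolate']})

# Build the lookup dictionary and map each comma-separated key to its items
d = df2.set_index('value')['items'].to_dict()
df1['Col2'] = df1['Col2'].apply(
    lambda x: ', '.join(d[y] for y in x.split(', ') if y in d))
print(df1)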
Related
The question title might be confusing, but here is an example of what I intend to do.
Below is the main dataframe with the request data:
d = {'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4'],'B':[-1,5,6,7000],'ExtD':['CA','CB','CC','CD']}
df = pd.DataFrame(data=d)
df
Now, the response might be keyed on either the ID or the ID2 column and looks like this:
d = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3]}
df2 = pd.DataFrame(data=d)
df2
where RetID could be either ID or ID2 from the request, along with the additional data C. Once the response is received, I need to merge it back with the original dataframe to pull in ExtD.
The solution I have come up with is:
df2 = df2.merge(df[['ID', 'ExtD']], 'left', left_on=['RetID'], right_on=['ID'])
df2 = df2.merge(df[['ID2', 'ExtD']], 'left', left_on=['RetID'], right_on=['ID2'], suffixes=('_d1', '_d2'))
df2.rename({'ExtD_d1': 'ExtD'}, axis=1, inplace=True)
df2.loc[df2['ExtD'].isnull(), 'ExtD'] = df2['ExtD_d2']
df2.drop(['ID2', 'ExtD_d2'], axis=1, inplace=True)
So the expected output is:
res = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD']}
df2= pd.DataFrame(data=res)
df2
EDIT 2: updated with a tweaked requirement.
res = {'RetID':['A1','A2','B1','B2'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD'],'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4']}
Is there an efficient way to do this? There might be more than 2 IDs (ID, ID2, ID3) and more than one column to join from the request dataframe. TIA.
EDIT: Fixed the typo.
Use melt to transform your first dataframe, then merge with the second:
tmp = df.melt('ExtD', value_vars=['ID', 'ID2'], value_name='RetID')
df2 = df2.merge(tmp[['ExtD', 'RetID']])
>>> df2
RetID C ExtD
0 A1 1.3 CA
1 A2 5.4 CB
2 B1 4.5 CA
3 B2 1.3 CB
>>> tmp
ExtD variable RetID
0 CA ID A1
1 CB ID A2
2 CC ID A3
3 CD ID A4
4 CA ID2 B1
5 CB ID2 B2
6 CC ID2 B3
7 CD ID2 B4
Update
What if I need to merge ID and ID2 columns as well?
df2 = df2.merge(df[['ID', 'ID2', 'ExtD']], on='ExtD')
>>> df2
RetID C ExtD ID ID2
0 A1 1.3 CA A1 B1
1 A2 5.4 CB A2 B2
2 B3 4.5 CC A3 B3
3 B4 1.3 CD A4 B4
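For the broader requirement in the question (more than two ID columns and more than one column to carry over from the request), the same melt pattern extends naturally; a sketch, where ID3 and ExtD2 are hypothetical extra columns:
id_cols = ['ID', 'ID2', 'ID3']     # all request key columns (ID3 is hypothetical)
carry_cols = ['ExtD', 'ExtD2']     # all request columns to carry over (ExtD2 is hypothetical)

tmp = df.melt(id_vars=carry_cols, value_vars=id_cols, value_name='RetID')
df2 = df2.merge(tmp[carry_cols + ['RetID']], on='RetID', how='left')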
Hello, how can I keep only the rows where the value differs between the two dataframes?
Notice that I can have id1 or id2 or both, as below.
d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)
Expected Output
d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)
First merge on all of the columns using the left_on and right_on parameters, then filter out the rows flagged as both by the merge indicator and remove the missing values by reshaping with stack and unstack:
df = pd.merge(F1, F2, left_on=['id1','id2','VAL1'],
              right_on=['id1','id2','VAL2'], how="outer", indicator=True)
df = (df[df['_merge'] != 'both']
        .set_index(['id1','id2'])
        .drop('_merge', axis=1)
        .stack()
        .unstack()
        .reset_index())
print (df)
id1 id2 VAL1 VAL2
0 X02 Y4 3 1
1 X22 Y1 1 3
F1.merge(F2, how='left', left_on=['id1','id2'], right_on=['id1','id2'])\
  .query("VAL1 != VAL2")
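Note that inside query, a comparison against NaN evaluates as not equal, so rows of F1 that have no match in F2 (where VAL2 ends up NaN after the left merge) are also kept by this one-liner; if that is not wanted, they can be dropped afterwards with .dropna(subset=['VAL2']).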
I have a dataframe such as:
col1 col2 col3 ID
A 23 AZ ER1 ID1
B 12 ZE EZ1 ID2
C 13 RE RE1 ID3
I parsed the ID column in order to get some information: in short, for each ID I get a dictionary of information. Here is the result of the code:
for i in dataframe['ID']:
    name = function(i, ranks=True)
    print(name)
{'species': 'rabbit', 'genus': 'unis', 'subfamily': 'logomorphidae', 'family': 'lego', 'no rank': 'info, nothing', 'superkingdom': 'eucoryote'}
{'species': 'dog', 'genus': 'Rana', 'subfamily': 'Alphair', 'family': 'doggidae', 'no rank': 'dsDNA , no stage', 'superkingdom': 'eucaryote'}
{'species': 'duck', 'subfamily': 'duckinae', 'family': 'duckidae'}
...
As you can see, it returns a dictionary. You can also see that for ID1 and ID2 I get 6 pieces of information (species, genus, subfamily, family, no rank, superkingdom), while for ID3 I only get 3.
The idea is, instead of just printing the dict contents, to add them directly into the dataframe and get:
col1 col2 col3 ID species genus subfamily family no rank superkingdom
A 23 AZ ER1 ID1 rabbit unis logomorphidae lego info, nothing, eucaryote
B 12 ZE EZ1 ID2 dog Rana Alphair doggidae dsDNA , no stage eucaryote
C 13 RE RE1 ID3 duck None duckinae duckidae None None
Do you have an idea how to do this with pandas?
Thanks for your help.
Store your output in a dict of dicts, making it easy to create a DataFrame and join it back.
d = {}
for i in dataframe['ID']:
    d[i] = function(i, ranks=True)

dataframe.merge(pd.DataFrame.from_dict(d, orient='index'), left_on='ID', right_index=True)
Output:
col1 col2 col3 ID species genus subfamily family no rank superkingdom
A 23 AZ ER1 ID1 rabbit unis logomorphidae lego info, nothing eucoryote
B 12 ZE EZ1 ID2 dog Rana Alphair doggidae dsDNA , no stage eucaryote
C 13 RE RE1 ID3 duck NaN duckinae duckidae NaN NaN
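A minimal, self-contained sketch of the same pattern, using a stand-in lookup (the lookup function and its sample data below are hypothetical, standing in for whatever call returns the per-ID dictionary):
import pandas as pd

dataframe = pd.DataFrame({'col1': ['A', 'B', 'C'],
                          'ID': ['ID1', 'ID2', 'ID3']})

# Hypothetical stand-in for the real per-ID call; it returns dicts whose
# keys can differ from one ID to the next, as in the question.
def lookup(i, ranks=True):
    data = {'ID1': {'species': 'rabbit', 'genus': 'unis'},
            'ID2': {'species': 'dog', 'family': 'doggidae'},
            'ID3': {'species': 'duck'}}
    return data[i]

d = {i: lookup(i, ranks=True) for i in dataframe['ID']}

# from_dict(orient='index') turns each inner dict into a row and fills the
# missing keys with NaN; merging on the index attaches the rows back by ID.
out = dataframe.merge(pd.DataFrame.from_dict(d, orient='index'),
                      left_on='ID', right_index=True)
print(out)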
I have a text file containing two columns, let's say col1 and col2.
col1 Col2
A20 A19
A120 A117
A120 A118
A120 B19
A120 B20
.
.
.
B40 A205
and so on.
I want to filter the above columns so that I only get those entries where an A and a B appear side by side, like:
col1 col2
A120 B20
B40 A205
I've tried using pd.DataFrame.sort but it doesn't return the required output.
Any help will be highly appreciated.
Use str indexing with boolean indexing to check whether the first characters are not equal:
df = df[df['col1'].str[0] != df['Col2'].str[0]]
print (df)
col1 Col2
3 A120 B19
4 A120 B20
5 B40 A205
If multiple starting letters are possible and you need to test for A and B only:
print (df)
col1 Col2
0 A20 C19 <-changed sample data
1 A120 A117
2 A120 A118
3 A120 B19
4 A120 B20
5 B40 A205
a = df['col1'].str[0]
b = df['Col2'].str[0]
m1 = a.isin(['A','B'])
m2 = b.isin(['A','B'])
m3 = a != b
df = df[m1 & m2 & m3]
print (df)
col1 Col2
3 A120 B19
4 A120 B20
5 B40 A205
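For completeness, a minimal runnable sketch of the first approach, assuming the two columns are read from the text file with whitespace-separated values (the filename data.txt is an assumption):
import pandas as pd

# Assumed filename; the file has a header row "col1 Col2" and
# whitespace-separated values, as shown in the question.
df = pd.read_csv('data.txt', sep=r'\s+')

# Keep only rows whose two entries start with different letters
df = df[df['col1'].str[0] != df['Col2'].str[0]]
print(df)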
I have a large dataset of three columns in the following format:
col1 col2 col3
------------------
a1 1 i1
a1 1 i2
a1 2 i3
a3 2 i4
a3 1 i5
a2 3 i6
a2 3 i7
a2 1 i8
I wrote the following:
val dataset2 = dataset.groupBy("col1", "col2").agg(collect_list("col3").as("col3"))
  .sort("col1", "col2")
  .groupBy("col1").agg(collect_list("col2"), collect_list("col3"))
  .toDF("col1", "col2", "col3").as[(String, Array[String], Array[String])]
To get the distinct values of col2 from the resultant dataset I wrote the following:
dataset2.select("col3").distinct().show()
The above code works fine for a small dataset, but for a large dataset I got the following type of result (just to illustrate the inconsistency in the resultant dataset):
col1 col2 col3
-----------------------------------
a1 [1, 2] [[i1, i2], [i3]]
a2 [3, 1] [[i6, i7], [i8]]
a3 [2, 1] [[i4], [i5]]
Since I did sort("col1", "col2"), the output should be
col1 col2 col3
-----------------------------------
a1 [1, 2] [[i1, i2], [i3]]
a2 [1, 3] [[i8], [i6, i7]]
a3 [1, 2] [[i5], [i4]]
col2 should be in sorted order, and the values of col2 and col3 should stay consistent by their array index. For example, the last row of the above dataset would be
col2 col3
-------------------------
[1, 2] [[i5], [i4]]
but not
col2 col3
-------------------------
[1, 2] [[i4], [i5]]
How can I achieve my goal?
Combine the records into structs and use sort_array; structs compare field by field, so sorting the collected array orders it by col2 while keeping each col3 aligned with its col2:
dataset
  .groupBy($"col1")
  .agg(sort_array(collect_list(struct($"col2", $"col3"))).alias("data"))
  .select($"col1", $"data.col2", $"data.col3")
Credits go to user6910411 for this answer.
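For anyone doing the same in PySpark rather than Scala, here is a rough equivalent sketch under the same assumptions, pre-aggregating col3 per (col1, col2) as in the question and then applying the struct/sort_array idea:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [("a1", 1, "i1"), ("a1", 1, "i2"), ("a1", 2, "i3"),
        ("a3", 2, "i4"), ("a3", 1, "i5"),
        ("a2", 3, "i6"), ("a2", 3, "i7"), ("a2", 1, "i8")]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Pack (col2, col3-list) pairs into structs and sort the collected array;
# structs compare field by field, so the result is ordered by col2.
result = (df.groupBy("col1", "col2")
            .agg(F.collect_list("col3").alias("col3"))
            .groupBy("col1")
            .agg(F.sort_array(F.collect_list(F.struct("col2", "col3"))).alias("data"))
            .select("col1", "data.col2", "data.col3"))
result.show(truncate=False)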