Sorting data based on column entries

I have a text file containing two columns, say col1 and Col2:
col1 Col2
A20 A19
A120 A117
A120 A118
A120 B19
A120 B20
.
.
.
B40 A205
and so on.
I want to filter the rows above so that I keep only those entries where one column starts with A and the other starts with B, like:
col1 Col2
A120 B20
B40 A205
I've tried using pd.DataFrame.sort but it doesn't return the required output.
Any help will be highly appreciated.

Use str indexing together with boolean indexing to keep only the rows where the first characters of the two columns differ:
df = df[df['col1'].str[0] != df['Col2'].str[0]]
print (df)
col1 Col2
3 A120 B19
4 A120 B20
5 B40 A205
If other starting letters are possible and you need to test only for A and B:
print (df)
col1 Col2
0 A20 C19 <-changed sample data
1 A120 A117
2 A120 A118
3 A120 B19
4 A120 B20
5 B40 A205
a = df['col1'].str[0]
b = df['Col2'].str[0]
m1 = a.isin(['A','B'])
m2 = b.isin(['A','B'])
m3 = a != b
df = df[m1 & m2 & m3]
print (df)
col1 Col2
3 A120 B19
4 A120 B20
5 B40 A205
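The two steps above can be combined into one self-contained sketch. It assumes the text file is whitespace-separated with a header row; the `io.StringIO` object is only a stand-in for the real file, and the names `first1`, `first2` and `mask` are illustrative, not part of the original answer.

```python
import io

import pandas as pd

# Stand-in for the text file from the question (assumed whitespace-separated).
text = """col1 Col2
A20 C19
A120 A117
A120 A118
A120 B19
A120 B20
B40 A205
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

first1 = df["col1"].str[0]  # first character of each col1 entry
first2 = df["Col2"].str[0]  # first character of each Col2 entry

# Keep rows where both entries start with A or B and the two letters differ.
mask = first1.isin(["A", "B"]) & first2.isin(["A", "B"]) & (first1 != first2)
result = df[mask]
print(result)
```

For a real file, replacing `io.StringIO(text)` with the file path should be enough, provided the layout matches the assumption above.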

Related

Pandas Dataframe merge on two different key to get original data

The question title might be confusing, but here is an example of what I intend to do.
Below is the main dataframe with the request data:
d = {'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4'],'B':[-1,5,6,7000],'ExtD':['CA','CB','CC','CD']}
df = pd.DataFrame(data=d)
df
Now, the response might be based on either the ID or the ID2 column and looks like this:
d = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3]}
df2 = pd.DataFrame(data=d)
df2
where RetID could be either ID or ID2 from the request, along with the additional data C. Once the response is received, I need to merge it back with the original dataframe to get the ExtD data.
The solution I have come up with is:
df2 = df2.merge(df[['ID','ExtD',]],'left',left_on=['RetID'],right_on=['ID'])
df2 = df2.merge(df[['ID2','ExtD']],'left',left_on=['RetID'],right_on=['ID2'],suffixes = ('_d1','_d2'))
df2.rename({'ExtD_d1':'ExtD'},axis=1,inplace=True)
df2.loc[df2['ExtD'].isnull(),'ExtD'] = df2['ExtD_d2']
df2.drop({'ID2','ExtD_d2'},axis=1,inplace=True)
so expected output is,
res = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD']}
df2= pd.DataFrame(data=res)
df2
EDIT2: updated requirement tweak.
res = {'RetID':['A1','A2','B1','B2'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD'],'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4']}
Is there an efficient way to do this? There might be more than two IDs (ID, ID2, ID3) and more than one column to join from the request dataframe. TIA.
EDIT: Fixed the typo.

Use melt to transform your first dataframe then merge with the second:
tmp = df.melt('ExtD', value_vars=['ID', 'ID2'], value_name='RetID')
df2 = df2.merge(tmp[['ExtD', 'RetID']])
>>> df2
RetID C ExtD
0 A1 1.3 CA
1 A2 5.4 CB
2 B1 4.5 CA
3 B2 1.3 CB
>>> tmp
ExtD variable RetID
0 CA ID A1
1 CB ID A2
2 CC ID A3
3 CD ID A4
4 CA ID2 B1
5 CB ID2 B2
6 CC ID2 B3
7 CD ID2 B4
Update
What if I need to merge ID and ID2 columns as well?
df2 = df2.merge(df[['ID', 'ID2', 'ExtD']], on='ExtD')
>>> df2
RetID C ExtD ID ID2
0 A1 1.3 CA A1 B1
1 A2 5.4 CB A2 B2
2 B3 4.5 CC A3 B3
3 B4 1.3 CD A4 B4
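As a minimal end-to-end sketch of the melt approach, using the request and response data from the question: the names `key_cols`, `payload_cols`, `lookup` and `merged` are illustrative, and `how="left"` is an assumption made here so that responses matching no key are kept (the answer above uses the default inner join).

```python
import pandas as pd

# Request data from the question.
df = pd.DataFrame({
    "ID": ["A1", "A2", "A3", "A4"],
    "ID2": ["B1", "B2", "B3", "B4"],
    "B": [-1, 5, 6, 7000],
    "ExtD": ["CA", "CB", "CC", "CD"],
})
# Response data: RetID may come from either ID or ID2.
df2 = pd.DataFrame({"RetID": ["A1", "A2", "B3", "B4"], "C": [1.3, 5.4, 4.5, 1.3]})

key_cols = ["ID", "ID2"]   # extend with further key columns such as "ID3"
payload_cols = ["ExtD"]    # columns to bring back from the request

# Long format: one row per (request row, key column) pair.
lookup = df.melt(id_vars=payload_cols, value_vars=key_cols, value_name="RetID")

# A single merge on RetID now covers every key column at once.
merged = df2.merge(lookup[payload_cols + ["RetID"]], on="RetID", how="left")
print(merged)
```

If the same value can appear in more than one key column, the merge returns one row per match, so deduplication may be needed in that case.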

Search (row values) data from another dataframe

I have two dataframes, df1 and df2 respectively.
In one dataframe I have a list of search values (Actually Col1)
Col1 Col2
A1 val1, val2
B2 val4, val1
C3 val2, val5
I have another dataframe where I have a list of items
value items
val1 apples, oranges
val2 honey, mustard
val3 banana, milk
val4 biscuit
val5 chocolate
I want to iterate through the first DF and use each val as a key to look up items in the second DF.
Expected output:
A1 apples, oranges, honey, mustard
B2 biscuit, apples, oranges
C3 honey, mustard, chocolate
I am able to add the values into dataframe and iterate through 1st DF
for index, row in df1.iterrows():
    # list to hold all the values
    final_list = []
    # split the comma-separated search values of the current row
    values = row['Col2'].split(', ')
    for value in values:
        print(value)
I just need help to fetch values from the second dataframe.
Would appreciate any help. Thanks.

The idea is to use a lambda function that splits each value and looks up the parts in a dictionary:
d = df2.set_index('value')['items'].to_dict()
df1['Col2'] = df1['Col2'].apply(lambda x: ', '.join(d[y] for y in x.split(', ') if y in d))
print (df1)
Col1 Col2
0 A1 apples, oranges, honey, mustard
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
If the items values are lists rather than strings, the solution changes to flatten them:
d = df2.set_index('value')['items'].to_dict()
f = lambda x: ', '.join(z for y in x.split(', ') if y in d for z in d[y])
df1['Col2'] = df1['Col2'].apply(f)
print (df1)
Col1 Col2
0 A1 apples, oranges, honey, mustard
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
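For reference, here is a self-contained sketch of the dictionary lookup on the sample data from the question, for the case where items holds plain strings. The helper name `expand` and the new column name `items` are illustrative choices; the original answer overwrites Col2 instead.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Col1": ["A1", "B2", "C3"],
    "Col2": ["val1, val2", "val4, val1", "val2, val5"],
})
df2 = pd.DataFrame({
    "value": ["val1", "val2", "val3", "val4", "val5"],
    "items": ["apples, oranges", "honey, mustard", "banana, milk", "biscuit", "chocolate"],
})

# Map each search value to its items string.
lookup = df2.set_index("value")["items"].to_dict()

def expand(cell):
    # Split on commas, strip whitespace, and skip keys missing from the lookup.
    keys = [key.strip() for key in cell.split(",")]
    return ", ".join(lookup[key] for key in keys if key in lookup)

df1["items"] = df1["Col2"].apply(expand)
print(df1)
```

Stripping each key makes the lookup tolerant of inconsistent spacing after the commas.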

match multiple columns within the same row

Table 1. I have a table that looks like this:
X Y Z
1 a p
2 a p
6 b p
7 c p
9 c p
Table 2. I have a different table that looks like this:
       Col1  Col2  Col3  Col4
Row1         p     p     p
Row2         a     b     c
Row3   1
Row4   2
Row5   3
Row6   4
Row7   5
Row8   6
Row9   7
Row10  8
Row11  9
I want to mark "TRUE" in Table 2 wherever a row of Table 1 matches the cell's Col1 value and its column's two header values. As a result, for example:
       Col1  Col2  Col3  Col4
Row1         p     p     p
Row2         a     b     c
Row3   1     TRUE
Row4   2     TRUE
Row5   3
Row6   4
Row7   5
Row8   6           TRUE
Row9   7                 TRUE
Row10  8
Row11  9                 TRUE
Here is what I have tried so far. This is the formula for Col2 Row3:
=IFERROR(IF(AND(AND(MATCH(Col1Row3,X:X,0), MATCH(Col2Row1,Z:Z,0)), MATCH(Col2Row2,Y:Y,0)), "TRUE", ""),"")
I think it's not working because I am not containing the matches within the same row. How can I achieve my result?
Also, I do not want to specify a specific row in the formula because I have thousands of rows in Table 1, and Table 2 has to select values among those thousands of rows.

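Use COUNTIFS, which counts only the rows of Table 1 where all three criteria hold in the same row. The formula below assumes Table 1 is in columns F:H and Table 2 starts in cell A1; enter it in B3 and fill across and down: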
=IF(COUNTIFS($F:$F,$A3,$G:$G,B$2,$H:$H,B$1),TRUE,"")
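To illustrate why this works where the nested MATCH calls did not, here is a rough Python sketch of the same logic: a cell is TRUE only if a single Table 1 row satisfies all three criteria at once. The names `table1`, `col_headers` and `grid` are hypothetical and only mirror the sample tables.

```python
# Table 1 rows as (X, Y, Z) tuples, taken from the question.
table1 = [(1, "a", "p"), (2, "a", "p"), (6, "b", "p"), (7, "c", "p"), (9, "c", "p")]
rows = set(table1)

# Header pairs (Row1, Row2) for Col2..Col4 of Table 2.
col_headers = [("p", "a"), ("p", "b"), ("p", "c")]

# For each Col1 number, test whether one Table 1 row matches all three values.
# This mirrors COUNTIFS: every criterion is checked against the same row.
grid = {
    number: [(number, y, z) in rows for (z, y) in col_headers]
    for number in range(1, 10)
}

for number, flags in grid.items():
    print(number, ["TRUE" if flag else "" for flag in flags])
```

The original MATCH-based formula checked each criterion against its whole column independently, so it could report a match even when the three values came from different rows.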

If a row in a column matches a row in another column, paste a value in another column

I've been very frustrated trying to figure this out. I have an excel file like this:
Col Col2 Col3 Col4 Col5
gene5 6 (empty column) gene1 this_is_gene1
gene1 4 (empty column) gene2 this_is_gene2
gene3 4 (empty column) gene3 this_is_gene3
gene2 3 (empty column) gene4 this_is_gene4
gene4 3 (empty column) gene5 this_is_gene5
gene5 3 (empty column) gene6 this_is_gene6
If any value in column 1 is present in column 4, I want it to then paste the information from column 5 into Column 3, like the following:
Col Col2 Col3 Col4 Col5
gene5 6 this_is_gene6 gene1 this_is_gene1
gene1 4 this_is_gene4 gene2 this_is_gene2
gene3 4 this_is_gene4 gene3 this_is_gene3
gene2 3 this_is_gene3 gene4 this_is_gene4
gene4 3 this_is_gene4 gene5 this_is_gene5
gene5 3 this_is_gene5 gene6 this_is_gene6
Any help? I've played around with =VLOOKUP, but it seems like that only works on a static value (instead of values within a whole column.)

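VLOOKUP should work for you. Insert =VLOOKUP(A2; $D$2:$E$7; 2; FALSE) into the first cell of your empty column (C2) and fill down, assuming your table starts in cell A1 and has Col, Col2 etc. as headers. The lookup range has to cover every row of columns 4 and 5 and be absolute so that it does not shift when filled down. Depending on your locale, the argument separator may be a comma instead of a semicolon.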

The last two lines in your expected results do not match previous ones. Sometimes your lookup is col1 in col4 (return col5) and other times it is "gene"&col2 lookup in col4 (return col5).
'either,
=VLOOKUP("gene"&B2, D:E, 2, FALSE)
'or,
=VLOOKUP(A2, D:E, 2, FALSE)
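Conceptually, an exact-match VLOOKUP over columns D:E behaves like a dictionary lookup. The sketch below shows the second variant (look up column 1 in column 4, return column 5) on the sample data; the names `lookup_table`, `col_a` and `col_c` are hypothetical, and returning an empty string for a missing key is an assumption here, since VLOOKUP itself returns #N/A.

```python
# Columns 4 and 5 of the sheet as a key -> description mapping.
lookup_table = {
    "gene1": "this_is_gene1",
    "gene2": "this_is_gene2",
    "gene3": "this_is_gene3",
    "gene4": "this_is_gene4",
    "gene5": "this_is_gene5",
    "gene6": "this_is_gene6",
}

# Column 1 of the sheet.
col_a = ["gene5", "gene1", "gene3", "gene2", "gene4", "gene5"]

# Exact-match lookup for every row; missing keys yield "" in this sketch.
col_c = [lookup_table.get(gene, "") for gene in col_a]

for gene, description in zip(col_a, col_c):
    print(gene, description)
```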

Looking for the Max Sum, based on Criteria and Unique Values

Col1 Col2 Col3
a 3 x
b 2 x
c 2 x
a 1 x
b 3 x
c 1 y
a 2 y
b 1 y
c 3 y
Using the table above, can anyone give me a formula to find the maximum, over the unique values in Col1, of the sum of Col2 for rows where Col3 = x?
(The answer should be 5; it would be 4 for Col3 = y.)

Create a PivotTable with Col3 as FILTERS (select x), Col1 for ROWS and Sum of Col2 for VALUES. Uncheck Show grand totals for Columns and then for whichever column contains Sum of Col2 take the maximum, say:
=MAX(F:F)

Well it's not ideal but it works:
In column D, enter an array formula that sums Col2 over the rows sharing the current row's Col1 and Col3 values:
in D2: =MAX(IF($C$2:$C$10=C2,SUM(IF($A$2:$A$10=A2,IF($C$2:$C$10=C2,$B$2:$B$10)))))
Change the ranges obviously.
Then in E2 put this: =MAX(IF($C$2:$C$10=C2,$D$2:$D$10))
These are both array formulas, so after entering them you must press CTRL-SHIFT-ENTER, not just Enter.
Then drag both formulas down.
There may be a way to combine these, but my array formula knowledge is limited.
Here are the results:
Col1  Col2  Col3  D (sum of Col2 per Col1 within Col3)  E (max of D per Col3)
a     3     x     4                                     5
b     2     x     5                                     5
c     2     x     2                                     5
a     1     x     4                                     5
b     3     x     5                                     5
c     1     y     4                                     4
a     2     y     2                                     4
b     1     y     1                                     4
c     3     y     4                                     4
If you don't use CTRL-SHIFT-ENTER you will get 18 and 5 all the way down.
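For comparison, the same calculation can be sketched in plain Python: sum Col2 per Col1 value among the rows that meet the Col3 criterion, then take the largest total. The function name `max_group_sum` is hypothetical.

```python
from collections import defaultdict

# (Col1, Col2, Col3) rows from the question.
rows = [
    ("a", 3, "x"), ("b", 2, "x"), ("c", 2, "x"),
    ("a", 1, "x"), ("b", 3, "x"), ("c", 1, "y"),
    ("a", 2, "y"), ("b", 1, "y"), ("c", 3, "y"),
]

def max_group_sum(rows, criterion):
    """Return the largest per-Col1 sum of Col2 among rows where Col3 == criterion."""
    totals = defaultdict(int)
    for key, value, flag in rows:
        if flag == criterion:
            totals[key] += value
    # default=None covers the case where no row matches the criterion.
    return max(totals.values(), default=None)

print(max_group_sum(rows, "x"))
print(max_group_sum(rows, "y"))
```

This is the same two-stage idea as the PivotTable and the array formulas: group and sum first, then take the maximum of the group totals.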
