How to map two dataframes on a condition when they have different numbers of rows - python-3.x

I have two dataframes that need to be mapped (or joined?) based on some condition. These are the dataframes:
df_1
img_names img_array
0 1_rel 253
1 1_rel_right 255
2 1_rel_top 250
3 4_rel 180
4 4_rel_right 182
5 4_rel_top 189
6 7_rel 217
7 7_rel_right 183
8 7_rel_top 196
df_2
List_No time
0 1 38
1 4 23
2 7 32
After mapping I would like to get the following dataframe:
df_3
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
Basically, each row of df_2 is repeated 3 times to match the number of rows in df_1, and the mapping (if we can call it that) is done on the number split off from each value in df_1's img_names column. The values in img_names may differ, but each of them always starts with a number (1, 4, 7 in this case) followed by an underscore. So I need to split off that leading number in each row and match it against the values of List_No.
I hope the example above is clear.
Thank you.

Looks like you can just extract the digit parts and merge:
df_1['List_No'] = df_1['img_names'].str.split('_').str[0].astype(int)
df_3 = df_1.merge(df_2, on='List_No')
Output:
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
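A minimal, self-contained sketch of the extract-and-merge idea, in case the leading number can have more than one digit. The sample rows (including the hypothetical `12_rel_top`) and the `how`/`validate` arguments are assumptions added for illustration, not part of the original answer:

```python
import pandas as pd

# Small sample in the shape of df_1 / df_2; "12_rel_top" is a made-up row
# used only to show that multi-digit prefixes are handled.
df_1 = pd.DataFrame({
    "img_names": ["1_rel", "1_rel_right", "4_rel", "12_rel_top"],
    "img_array": [253, 255, 180, 189],
})
df_2 = pd.DataFrame({"List_No": [1, 4, 12], "time": [38, 23, 32]})

# Take the run of digits before the first underscore as the join key.
key = df_1["img_names"].str.extract(r"^(\d+)_", expand=False).astype(int)

# how="left" keeps every row of df_1 in its original order;
# validate="many_to_one" raises if df_2 has duplicate List_No values.
df_3 = df_1.assign(List_No=key).merge(
    df_2, on="List_No", how="left", validate="many_to_one"
)
print(df_3)
```

With `how="left"`, a row of df_1 whose number is missing from df_2 would be kept with NaN in `time` rather than dropped.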

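An alternative to @QuangHoang's answer (which I believe you should pick, as it is more robust). This uses the map method, and assumes every List_No extracted from df_1 is present in df_2 (anything missing becomes NaN). Note that map looks values up in the index of the Series it is given, so df_2 is indexed by List_No first:
df_1.assign(
    List_No=df_1.img_names.str.extract(r"^(\d+)", expand=False).astype(int),
    time=lambda x: x.List_No.map(df_2.set_index("List_No")["time"]),
)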
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
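A short sketch of the map-based variant on a reduced sample. The `lookup` name is hypothetical; the point it illustrates is that `Series.map` matches against the index of the Series passed in, so the lookup table is keyed by List_No here:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "img_names": ["1_rel", "4_rel_top", "7_rel"],
    "img_array": [253, 189, 217],
})
df_2 = pd.DataFrame({"List_No": [1, 4, 7], "time": [38, 23, 32]})

# Series.map looks values up in the *index* of the Series it is given,
# so the lookup table is indexed by List_No rather than by 0, 1, 2.
lookup = df_2.set_index("List_No")["time"]

out = df_1.assign(
    # Leading digits of each name become the integer key.
    List_No=df_1["img_names"].str.extract(r"^(\d+)", expand=False).astype(int),
    # The lambda sees the List_No column created on the line above.
    time=lambda x: x["List_No"].map(lookup),
)
print(out)
```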

Related

I would like to find consecutive numbers in column A and column B in python (pandas)

I would like to find consecutive numbers in column A and column B in Python. Column A should be ascending while column B should be descending. I am attaching an example file.
Input file
nucleotide Pos_A Pos_B
Connection_Pos20_Pos102 20 102
Connection_Pos19_Pos102 19 102
Connection_Pos20_Pos101 20 101
Connection_Pos18_Pos102 18 102
Connection_Pos19_Pos101 19 101
Connection_Pos20_Pos100 20 100
Connection_Pos17_Pos102 17 102
Connection_Pos18_Pos101 18 101
Connection_Pos19_Pos100 19 100
Connection_Pos20_Pos99 20 99
Connection_Pos16_Pos102 16 102
Connection_Pos17_Pos101 17 101
Connection_Pos18_Pos100 18 100
Connection_Pos19_Pos99 19 99
Connection_Pos20_Pos98 20 98
Connection_Pos15_Pos102 15 102
Connection_Pos16_Pos101 16 101
Connection_Pos17_Pos100 17 100
Connection_Pos18_Pos99 18 99
Connection_Pos19_Pos98 19 98
Connection_Pos20_Pos97 20 97
Connection_Pos14_Pos102 14 102
Connection_Pos15_Pos101 15 101
Connection_Pos16_Pos100 16 100
Output:
nucleotide  Pos_A  Pos_B  Consecutive ID  Consecutive Number (Size)
Connection_Pos20_Pos102  20  102  101  1
Connection_Pos19_Pos102  19  102  100  2
Connection_Pos20_Pos101  20  101  100  2
Connection_Pos18_Pos102  18  102  99  3
Connection_Pos19_Pos101  19  101  99  3
Connection_Pos20_Pos100  20  100  99  3
Connection_Pos17_Pos102  17  102  98  4
Connection_Pos18_Pos101  18  101  98  4
Connection_Pos19_Pos100  19  100  98  4
Connection_Pos20_Pos99  20  99  98  4
Connection_Pos16_Pos102  16  102  97  5
Connection_Pos17_Pos101  17  101  97  5
Connection_Pos18_Pos100  18  100  97  5
Connection_Pos19_Pos99  19  99  97  5
Connection_Pos20_Pos98  20  98  97  5
Connection_Pos15_Pos102  15  102  96  6
Connection_Pos16_Pos101  16  101  96  6
Connection_Pos17_Pos100  17  100  96  6
Connection_Pos18_Pos99  18  99  96  6
Connection_Pos19_Pos98  19  98  96  6
Connection_Pos20_Pos97  20  97  96  6
Connection_Pos14_Pos102  14  102  95  7
Connection_Pos15_Pos101  15  101  95  7
Connection_Pos16_Pos100  16  100  95  7
Connection_Pos17_Pos99  17  99  95  7
Connection_Pos18_Pos98  18  98  95  7
Connection_Pos19_Pos97  19  97  95  7
Connection_Pos20_Pos96  20  96  95  7
For Consecutive ID, if Pos_B's shifted difference != 1, then we want to subtract 1, so we mark those indexes as -1 with mul(-1) and cumsum them:
df['ID'] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]
For Consecutive Number, if Pos_A's shifted difference != -1, then we want to add 1, so we mark those indexes as 1 and cumsum again:
df['Number'] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()
Result:
nucleotide Pos_A Pos_B ID Number
0 Connection_Pos20_Pos102 20 102 101 1
1 Connection_Pos19_Pos102 19 102 100 2
2 Connection_Pos20_Pos101 20 101 100 2
3 Connection_Pos18_Pos102 18 102 99 3
4 Connection_Pos19_Pos101 19 101 99 3
5 Connection_Pos20_Pos100 20 100 99 3
6 Connection_Pos17_Pos102 17 102 98 4
7 Connection_Pos18_Pos101 18 101 98 4
8 Connection_Pos19_Pos100 19 100 98 4
9 Connection_Pos20_Pos99 20 99 98 4
10 Connection_Pos16_Pos102 16 102 97 5
11 Connection_Pos17_Pos101 17 101 97 5
12 Connection_Pos18_Pos100 18 100 97 5
13 Connection_Pos19_Pos99 19 99 97 5
14 Connection_Pos20_Pos98 20 98 97 5
15 Connection_Pos15_Pos102 15 102 96 6
16 Connection_Pos16_Pos101 16 101 96 6
17 Connection_Pos17_Pos100 17 100 96 6
18 Connection_Pos18_Pos99 18 99 96 6
19 Connection_Pos19_Pos98 19 98 96 6
20 Connection_Pos20_Pos97 20 97 96 6
21 Connection_Pos14_Pos102 14 102 95 7
22 Connection_Pos15_Pos101 15 101 95 7
23 Connection_Pos16_Pos100 16 100 95 7
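To make the shift/compare/cumsum chain easier to follow, here is a sketch that splits it into named steps on the first six rows of the sample. The intermediate names (`step_b`, `breaks_b`, `breaks_a`) are introduced here for illustration only, and `iloc[0]` is used instead of `[0]` so the sketch does not depend on the index labels:

```python
import pandas as pd

# First six rows of the sample data.
df = pd.DataFrame({
    "Pos_A": [20, 19, 20, 18, 19, 20],
    "Pos_B": [102, 102, 101, 102, 101, 100],
})

# Previous row minus current row; the first row has no predecessor (NaN).
step_b = df["Pos_B"].shift().sub(df["Pos_B"])
# True wherever Pos_B did not drop by exactly 1, i.e. a new run starts.
# NaN != 1 is True, so the first row also counts as a run start.
breaks_b = step_b.ne(1)
# Each run start lowers the ID by one, counting down from the first Pos_B.
df["ID"] = df["Pos_B"].iloc[0] - breaks_b.cumsum()

# Same idea for Pos_A, which should rise by exactly 1 within a run.
breaks_a = df["Pos_A"].shift().sub(df["Pos_A"]).ne(-1)
df["Number"] = breaks_a.cumsum()
print(df)
```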
Compute a run counter for each column separately, then groupby both counters and number the groups with ngroup:
s1 = df.Pos_A.diff().le(0).cumsum()
s2 = df.Pos_B.diff().ge(0).cumsum()
df['out'] = df.groupby([s1,s2]).ngroup()+1
Out[452]:
0 1
1 2
2 2
3 3
4 3
5 3
6 4
7 4
8 4
9 4
10 5
11 5
12 5
13 5
14 5
15 6
16 6
17 6
18 6
19 6
20 6
21 7
22 7
23 7
24 7
25 7
26 7
27 7
dtype: int64
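A runnable sketch of the ngroup approach on the first six rows of the sample, with the two counters shown explicitly. One assumption worth stating: `ngroup` numbers groups in sorted key order by default, which coincides with order of appearance here only because both counters never decrease:

```python
import pandas as pd

# First six rows of the sample data.
df = pd.DataFrame({
    "Pos_A": [20, 19, 20, 18, 19, 20],
    "Pos_B": [102, 102, 101, 102, 101, 100],
})

# Counter that increases whenever Pos_A fails to rise (a run of A ends).
s1 = df["Pos_A"].diff().le(0).cumsum()
# Counter that increases whenever Pos_B fails to fall (a run of B ends).
s2 = df["Pos_B"].diff().ge(0).cumsum()

# Rows sharing both counters belong to the same run; ngroup() is 0-based.
df["out"] = df.groupby([s1, s2]).ngroup() + 1
print(df)
```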

How to extract lines from a file when the second column of the file matches the values in another file

I have two files.
file 1:
4
14
18
45
53
60
64
102
106
158
162
file2:
28 1 2
54 1 2
90 1 1
103 1 1
155 1 17
191 1 1
235 1 1
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
529 57 3
542 53 1
560 58 6
562 164 25
568 164 5
I want to extract the lines from file2 whose second column matches a value in file 1.
So the expected output will be:
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
Many of the solutions I found online use Python or Perl, but I want to do this with a Linux command. Any ideas?
This should do it?
awk 'FNR==NR{a[$0]++};FNR!=NR{if($2 in a){print}}' file1 file2
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
Explanation:
We hand awk both files (the order matters here).
While awk is reading the first file (FNR==NR), it stores each line as a key in the array a (a[$0]++).
Once it reaches the second file, it checks whether the value in that file's second column ($2) is a key in the array; if so, it prints the line.
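For readers who want to see the same two-pass logic spelled out, here is a rough Python equivalent of the awk one-liner. It is only an illustration of the logic; the function name and the shortened sample lines are made up for this sketch:

```python
def filter_by_second_column(key_lines, data_lines):
    """Keep data lines whose second whitespace-separated field is a key."""
    # First pass (awk's FNR==NR block): remember every value from file1.
    wanted = {line.strip() for line in key_lines if line.strip()}
    # Second pass: keep lines whose second field is one of those values.
    kept = []
    for line in data_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] in wanted:
            kept.append(line)
    return kept


# Shortened stand-ins for file1 and file2.
file1 = ["4", "18", "45"]
file2 = ["28 1 2", "245 4 1", "377 18 1", "529 57 3"]

result = filter_by_second_column(file1, file2)
print("\n".join(result))
```

Using a set for the keys mirrors the awk array: membership tests stay cheap however long file1 is.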

Filter Values in Python of a Pandas Dataframe of a large array with multiple conditions

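I have a dataset that I need to filter per group, where the groups come from a groupby() on a second column: within each group, I want to keep every row from the first one that exceeds a value onwards, including later rows that fall back below it. Here is my attempt: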
df2 = df.groupby(['UWI']).[df.DIP > 85].reset_index(drop = True)
where I have a dataframe that looks like this:
UWI DIP
0 17 70
1 17 80
2 17 90
3 17 80
4 17 83
5 2 62
6 2 75
7 2 87
8 2 91
I want the returned dataframe to look like this:
UWI DIP
0 17 90
1 17 80
2 17 83
3 2 87
4 2 91
This is a large dataframe so efficiency would be appreciated.
IIUC, you can use cummax on the condition within each group:
df[df.DIP.gt(85).groupby(df['UWI']).cummax()]
UWI DIP
2 17 90
3 17 80
4 17 83
7 2 87
8 2 91
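A step-by-step sketch of why the cumulative maximum works here, using the sample data from the question. The 0/1 conversion and the intermediate names `exceeded` and `keep` are choices made for this sketch to keep each step visible:

```python
import pandas as pd

df = pd.DataFrame({
    "UWI": [17, 17, 17, 17, 17, 2, 2, 2, 2],
    "DIP": [70, 80, 90, 80, 83, 62, 75, 87, 91],
})

# 1 where the threshold is exceeded, 0 elsewhere.
exceeded = df["DIP"].gt(85).astype(int)
# Within each UWI the running maximum switches to 1 at the first
# exceedance and stays there, so later rows below 85 are kept too.
keep = exceeded.groupby(df["UWI"]).cummax().astype(bool)

result = df[keep].reset_index(drop=True)
print(result)
```

The operation is vectorised and makes a single pass per group, which should suit a large dataframe.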

Adding rows that match a criteria in another column in Excel

This is some sample data:
Polling_Booth INC SAD BSP PS_NO
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1
4 356 355 25 2
5 290 333 9 2
6 144 143 4 3
7 327 196 1 4
8 370 235 1 5
And this is what I'm trying to achieve
Polling_Booth INC SAD BSP PS_NO OP_INC OP_SAD OP_BSP
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1 437 871 16
4 356 355 25 2
5 290 333 9 2 646 688 34
6 144 143 4 3 144 143 4
7 327 196 1 4 327 196 1
8 370 235 1 5 370 235 1
This is achieved by adding up the rows that have the same PS_NO. This is what I have tried:
=if(E2=E3,sum(B2,B3),0) #same for all the rows
Any help would be much appreciated. Thanks
You could get it to look like your table by adding another condition that checks whether the row is the last occurrence of its PS_NO in column E, and setting the result to an empty string if it is not:
=IF(COUNTIF($E$2:$E2,$E2)=COUNTIF($E$2:$E$10,$E2),SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
If the data is sorted by PS_NO, you can do it more simply with:
=IF($E3<>$E2,SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
which I think is what you were trying in your question.

How to divide 1 column into 5 segments with pandas and python?

I have a list with 1 column and 50 rows.
I want to divide it into 5 segments, with each segment becoming a column of a dataframe. I do not want the NaN values to appear (Figure2). How can I solve that?
Like this:
df = pd.DataFrame(result_list)
AWA=df[:10]
REM=df[10:20]
S1=df[20:30]
S2=df[30:40]
SWS=df[40:50]
result = pd.concat([AWA, REM, S1, S2, SWS], axis=1)
result
Figure2
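You can use numpy's reshape function; order='F' fills the result column by column, so each run of 10 values becomes one column:
import numpy as np
import pandas as pd
result_list = list(range(50))
pd.DataFrame(np.reshape(result_list, (10, 5), order='F'))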
Out:
0 1 2 3 4
0 0 10 20 30 40
1 1 11 21 31 41
2 2 12 22 32 42
3 3 13 23 33 43
4 4 14 24 34 44
5 5 15 25 35 45
6 6 16 26 36 46
7 7 17 27 37 47
8 8 18 28 38 48
9 9 19 29 39 49
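If you would rather keep the slicing approach and the segment names from the question, the NaN values can also be avoided by resetting each slice's index before concatenating, since concat with axis=1 aligns rows by index label. A sketch under that assumption (the list comprehension is just one way to write it):

```python
import pandas as pd

result_list = list(range(50))
df = pd.DataFrame(result_list)  # a single column labelled 0

names = ["AWA", "REM", "S1", "S2", "SWS"]

# Slice 10 rows per segment, then drop the original index labels
# (10..19, 20..29, ...) so every segment is indexed 0..9.
parts = [
    df[0].iloc[i * 10:(i + 1) * 10].reset_index(drop=True).rename(name)
    for i, name in enumerate(names)
]

# With identical indexes, the segments line up side by side without NaN.
result = pd.concat(parts, axis=1)
print(result)
```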
