Replacing the first column values according to the second column pattern - python-3.x

How can I use regex to replace values in a DataFrame, here in column '5', according to the pattern in column '1'? Column '5' consists only of ones for now. However, I would like to start changing this column whenever the pattern 34444 appears in column '1'. The program is then supposed to replace the ones with 11111, 22222, 33333, etc., incrementing after each occurrence of the pattern, until the end of the file.
Sample of the file:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 1 3 138.998480 12.596951 0.223780
22 12 1 4 138.333252 11.884713 -0.281429
23 13 1 4 139.498084 13.356891 -0.480091
24 14 1 4 139.710930 11.981460 0.697098
25 15 1 4 138.452807 13.136061 0.990663
Expected result:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 2 3 138.998480 12.596951 0.223780
22 12 2 4 138.333252 11.884713 -0.281429
23 13 2 4 139.498084 13.356891 -0.480091
24 14 2 4 139.710930 11.981460 0.697098
25 15 2 4 138.452807 13.136061 0.990663

Yeah, if you really want re, there is a way. But I doubt it would really be more efficient than a for-loop.
1. re.finditer
import pandas as pd
import numpy as np
import re
# represent column '1' as a string of digits
arr1 = df['1'].values
str1 = "".join([str(i) for i in arr1])
ans = np.ones(len(str1), dtype=int)
# when a pattern is found, increase all later elements by 1
for match in re.finditer('34444', str1):
    e = match.end()
    ans[e:] += 1
# replace column '5'
df['5'] = ans
# Output
df[['0', '5', '1']]
Out[50]:
0 5 1
11 1 1 1
12 2 1 1
13 3 1 1
14 4 1 1
15 5 1 1
16 6 1 3
17 7 1 4
18 8 1 4
19 9 1 4
20 10 1 4
21 11 2 3
22 12 2 4
23 13 2 4
24 14 2 4
25 15 2 4
2. naïve for-loop
This checks the array directly, element by element. Compared with re.finditer, no type casting is involved, but an explicit for-loop is written. The same output is obtained. Please benchmark it yourself if efficiency becomes relevant, say, if tens of millions of rows were involved.
arr1 = df['1'].values
ans = np.ones(len(arr1), dtype=int)
n = len(arr1)
for i, el in enumerate(arr1):
    # termination: fewer than 5 elements left, so no full pattern can fit
    if i > n - 5:
        break
    # ignore non-3 elements
    if el != 3:
        continue
    # if the pattern is found, increase all later elements by 1
    if np.all(arr1[i+1:i+5] == 4):
        ans[i+5:] += 1
df['5'] = ans
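If you want to skip the string round-trip entirely, here is a vectorized sketch of the same idea (an addition, not from the answers above; it assumes NumPy >= 1.20 for sliding_window_view and the df from the question):
import numpy as np

arr1 = df['1'].values
pattern = np.array([3, 4, 4, 4, 4])
# compare every length-5 window of column '1' against the pattern
windows = np.lib.stride_tricks.sliding_window_view(arr1, len(pattern))
hits = np.flatnonzero((windows == pattern).all(axis=1))
# every row after the end of a match gets its counter increased by 1
ans = np.ones(len(arr1), dtype=int)
for h in hits:
    ans[h + len(pattern):] += 1
df['5'] = ans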

Related

Reassigning multiple columns with the same array

I've been breaking my head over this simple thing. I know we can assign a single value to multiple columns using .loc. But how do I assign the same array to multiple columns?
I know I can do this. Let's say we have a dataframe df in which I wish to replace some columns with the array arr:
import random
import pandas as pd

df = pd.DataFrame({'a': [random.randint(1, 25) for i in range(5)],
                   'b': [random.randint(1, 25) for i in range(5)],
                   'c': [random.randint(1, 25) for i in range(5)]})
>>df
a b c
0 14 8 5
1 10 25 9
2 14 14 8
3 10 6 7
4 4 18 2
arr = [i for i in range(5)]
# Suppose I wish to replace columns `a` and `b` with the array `arr`
df['a'], df['b'] = [arr for j in range(2)]
Desired output:
a b c
0 0 0 5
1 1 1 9
2 2 2 8
3 3 3 7
4 4 4 2
Or I could also do this with a loop-wise assignment. But is there a more efficient way, without repetition or loops?
Let's try with assign:
cols = ['a', 'b']
df.assign(**dict.fromkeys(cols, arr))
a b c
0 0 0 5
1 1 1 9
2 2 2 8
3 3 3 7
4 4 4 2
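Note that assign returns a new DataFrame, so write df = df.assign(...) if you want to keep the result. A rough in-place sketch (an addition, assuming the df, arr and cols defined above):
import numpy as np

# stack the same column vector side by side and write it into both columns at once
df[cols] = np.column_stack([arr] * len(cols))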
I used a chained assignment: df.a = df.b = arr
df = pd.DataFrame({'a': [random.randint(1, 25) for i in range(5)],
                   'b': [random.randint(1, 25) for i in range(5)],
                   'c': [random.randint(1, 25) for i in range(5)]})
arr = [i for i in range(5)]
df
a b c
0 2 8 18
1 17 15 25
2 6 5 17
3 12 15 25
4 10 10 6
df.a = df.b = arr
df
a b c
0 0 0 18
1 1 1 25
2 2 2 17
3 3 3 25
4 4 4 6

How to take the mean of 3 values before the flag changes from 0 to 1 - python

I have a dataframe with columns A, B and flag. I want to calculate the mean of the 2 values before the flag changes from 0 to 1, and record the value when the flag changes from 0 to 1 and the value when the flag changes from 1 to 0.
# Input dataframe
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
# Expected output
df_out = pd.DataFrame({'A_mean_before_flag_change': [5.5],
                       'B_mean_before_flag_change': [5],
                       'A_value_before_change_flag': [7],
                       'B_value_before_change_flag': [6]})
I try to create a more general solution:
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]})
print (df)
A B flag
0 1 1 0
1 3 3 0
2 4 4 0
3 7 6 0
4 8 8 1
5 11 11 1
6 1 1 1
7 15 19 0
8 20 20 0
9 15 15 1
10 16 16 0
11 87 87 1
First create groups by a mask of 0 values that are followed by 1 values of flag:
m1 = df['flag'].eq(0) & df['flag'].shift(-1).eq(1)
df['g'] = m1.iloc[::-1].cumsum()
print (df)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
9 15 15 1 1
10 16 16 0 1
11 87 87 1 0
then filter out groups with size less than N:
N = 4
df1 = df[df['g'].map(df['g'].value_counts()).ge(N)].copy()
print (df1)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
Then filter the last N rows per group:
df2 = df1.groupby('g').tail(N)
And aggregate last with mean:
d = {'mean': '_mean_before_flag_change', 'last': '_value_before_change_flag'}
df3 = (df2.groupby('g')[['A', 'B']]
          .agg(['mean', 'last'])
          .sort_index(axis=1, level=1)
          .rename(columns=d))
df3.columns = df3.columns.map(''.join)
print (df3)
A_value_before_change_flag B_value_before_change_flag \
g
2 20 20
3 7 6
A_mean_before_flag_change B_mean_before_flag_change
g
2 11.75 12.75
3 3.75 3.50
I'm assuming that this needs to work for cases with more than one rising edge and that the consecutive values and averages get appended to the output lists:
# the first step is to extract the rising and falling edges using diff(), identify sections and length
df['flag_diff'] = df.flag.diff().fillna(0)
df['flag_sections'] = (df.flag_diff != 0).cumsum()
df['flag_sum'] = df.flag.groupby(df.flag_sections).transform('sum')
# then you can get the relevant indices by checking for the rising edges
rising_edges = df.index[df.flag_diff==1.0]
val_indices = [i-1 for i in rising_edges]
avg_indices = [(i-2,i-1) for i in rising_edges]
# and finally iterate over the relevant sections
df_out = pd.DataFrame()
df_out['A_mean_before_flag_change'] = [df.A.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['B_mean_before_flag_change'] = [df.B.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['A_value_before_change_flag'] = [df.A.loc[idx] for idx in val_indices]
df_out['B_value_before_change_flag'] = [df.B.loc[idx] for idx in val_indices]
df_out['length'] = [df.flag_sum.loc[idx] for idx in rising_edges]
df_out.index = rising_edges
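For the original input with a single 0 -> 1 change, a much smaller sketch already gives the expected one-row output (an addition, not part of the answers above; it assumes the column names from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})

rise = df.index[df['flag'].diff() == 1][0]   # index of the first 0 -> 1 transition
df_out = pd.DataFrame({'A_mean_before_flag_change': [df['A'].iloc[rise - 2:rise].mean()],
                       'B_mean_before_flag_change': [df['B'].iloc[rise - 2:rise].mean()],
                       'A_value_before_change_flag': [df['A'].iloc[rise - 1]],
                       'B_value_before_change_flag': [df['B'].iloc[rise - 1]]})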

Pandas - Fill N rows of a specific column with an integer value and increment the integer thereafter

I have a dataframe to which I added, say, a column named col_1. I want to add integer values to that column, starting from the first row, that increment after every 4th row. So the new resulting column should have values as such:
col_1
1
1
1
1
2
2
2
2
The current approach I have is a very brute force one:
for x in range(len(df)):
    if x <= 3:
        df['col_1'][x] = 1
    if x > 3 and x <= 7:
        df['col_1'][x] = 2
This might work for something small but when moving to something larger it will chew up a lot of time.
If there is a default RangeIndex you can use integer division and add 1:
df['col_1'] = df.index // 4 + 1
Or for a general solution use a helper array created from the length of the DataFrame:
df['col_1'] = np.arange(len(df)) // 4 + 1
For a repeating 1 and 2 pattern, also take modulo 2:
df = pd.DataFrame({'a':range(20, 40)})
df['col_1'] = (np.arange(len(df)) // 4) % 2 + 1
print (df)
a col_1
0 20 1
1 21 1
2 22 1
3 23 1
4 24 2
5 25 2
6 26 2
7 27 2
8 28 1
9 29 1
10 30 1
11 31 1
12 32 2
13 33 2
14 34 2
15 35 2
16 36 1
17 37 1
18 38 1
19 39 1
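Coming back to the original 1,1,1,1,2,2,2,2,... requirement, an np.repeat-based sketch (an addition; it only assumes the block length n) gives the same result as the integer-division approach:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(20, 40)})
n = 4
# repeat each block label n times, then trim to the length of the DataFrame
df['col_1'] = np.repeat(np.arange(1, len(df) // n + 2), n)[:len(df)]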

Check if values from a given list exist in multiple columns and count the number of columns

I have the below df:
B C D E
2 2 4 11
11 0 5 3
12 10 1 11
5 9 7 15
First I want the unique values from the whole df, like below:
[0,1,2,3,4,5,7,9,10,11,12,15]
then I want the final output:
value   number of columns the value exists in
0 1
1 1
2 2
3 1
4 1
5 1
7 1
9 1
10 1
11 2
12 1
15 1
That means, for each value, in how many columns it is available.
I want that output.
Using python you can do something like this:
# your input df as a list of lists
df = [[2, 11, 12, 5], [2, 0, 10, 9], [4, 5, 1, 7], [11, 3, 11, 15]]
# remove duplicates in each list
dfU = [list(set(l)) for l in df]
# sort each list (not required for this approach)
for l in dfU:
    l.sort()
# the requested unique list
flatList = [item for sublist in df for item in sublist]
uniqueList = list(set(flatList))
print(uniqueList)
# output as a list of lists
output = []
for num in uniqueList:
    cnt = 0
    for idx in range(len(dfU)):
        if dfU[idx].count(num) > 0:
            cnt += 1
    output.append([num, cnt])
print(output)
Side note, the count function is computationally expensive, so it would be better to do a linear scan along all sorted columns.
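As a sketch of that idea (an addition, reusing the list-of-lists df from the snippet above), keeping each column as a set makes each membership test O(1) and avoids the repeated count calls:
# one set per column, so `num in s` is a hash lookup rather than a scan
cols_as_sets = [set(col) for col in df]
output = [[num, sum(num in s for s in cols_as_sets)]
          for num in sorted(set().union(*cols_as_sets))]
print(output)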
Use DataFrame.melt to reshape, remove duplicates by both columns, and count by GroupBy.size with Series.reset_index for a DataFrame:
df1 = (df.melt(value_name='value')
         .drop_duplicates()
         .groupby('value')
         .size()
         .reset_index(name='count'))
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1
Details:
print (df.melt(value_name='value'))
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
14 E 11
15 E 15
The duplicate 11 at index 14 is removed:
print (df.melt(value_name='value').drop_duplicates())
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
15 E 15
If you want a pure Python solution:
from collections import Counter
L = sorted(Counter([y for x in df.T.values for y in set(x)]).items())
df1 = pd.DataFrame(L, columns=['value','count'])
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1
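Another compact pandas sketch (an addition, not one of the answers above) counts the per-column unique values directly with value_counts:
import pandas as pd

df = pd.DataFrame({'B': [2, 11, 12, 5], 'C': [2, 0, 10, 9],
                   'D': [4, 5, 1, 7], 'E': [11, 3, 11, 15]})
# collect each column's unique values, then count how many columns contributed each value
counts = (pd.Series([v for c in df.columns for v in df[c].unique()])
            .value_counts()
            .sort_index()
            .rename_axis('value')
            .reset_index(name='count'))
print(counts)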

Understanding J From

In J:
a =: 2 3 $ 1 2 3 4 5 6
Gives:
1 2 3
4 5 6
Which is a 2 3 shaped array.
If I do:
0 1 { a
I (noting that 0 1 is a 2 shaped list) expected to have back:
1 2 3 4 5 6
But got the following instead:
1 2 3
4 5 6
Reading the documentation I was expecting the shape of the index to kinda govern the shape of the answer.
Can someone clarify what I am missing here?
Higher-dimensional arrays may help make this clear. An array with n dimensions has items with n-1 dimensions. When you select an item from ({) a three-dimensional array, your result is a two-dimensional array:
1 { i. 5 3 4
12 13 14 15
16 17 18 19
20 21 22 23
When you select multiple items from an array, the items are assembled into a new array, using each atom of x to select an item of y. This might be where you picked up the idea that the shape of x affects the shape of the result.
2 1 0 2 { 'set'
test
$ 2 1 0 2
4
$ 'test'
4
The dimensions of the result are equal to the dimensions of x plus the dimensions of the items of y. So, if you have a two-dimensional x taking two-dimensional items from a three-dimensional y, you will have a four-dimensional result:
(2 2 $ 1 1 0 1) { i. 5 3 4
12 13 14 15
16 17 18 19
20 21 22 23
12 13 14 15
16 17 18 19
20 21 22 23
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
$ (2 2 $ 1 1 0 1) { i. 5 3 4
2 2 3 4
One final note: the monadic Ravel (,) will reduce the result to a list (one-dimensional array).
, 0 1 { 2 3 $ 1 2 3 4 5 6
1 2 3 4 5 6
, i. 2 2 2 2
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
From ({) selects the items of a noun. For 2 3 $ 1 2 3 4 5 6 the items are the two rows because items are the components that make up the noun.
[ a=. 2 3 $ 1 2 3 4 5
1 2 3
4 5 1
0 { a
1 2 3
If you just had 1 2 3 then the items would be the individual atoms.
[ b=. 1 2 3
1 2 3
0 { b
1
If you used 1 3 $ 1 2 3 then there is only one item and the result would be
[ c=. 1 3 $ 1 2 3
1 2 3
0 { c
1 2 3
The number of items can be found with Tally (#), and is the lead dimension of the Shape ($) of the noun.
$ a
2 3
$ b
3
$ c
1 3
# a
2
# b
3
# c
1
