How to extract data from data frame when value of column change - python-3.x

I want to extract part of the data frame when value change from 0 to 1.
logic1: when value change from 0 to 1, start to save data until value again change to 0. (also points before 1 and after 1)
logic2: when value change from 0 to 1, start to save data until value again change to 0. (don't need to save points before 1 and after 1)
only save data when the first time value of flag change from 0 to 1, after this if again value change from 0 to 1 don't need to do anything
df=pd.DataFrame({'value':[3,4,7,8,11,1,15,20,15,16,87],'flag':[0,0,0,1,1,1,0,0,1,1,0]})
Desired output:
df_out_1=pd.DataFrame({'value':[7,8,11,1,15]})
Desired output:
df_out_2=pd.DataFrame({'value':[8,11,1]})

Idea is get consecutive groups of 1 and 0 consecutive groups to s, filter only 1 groups and get first 1 group by compare by minimal value:
df = df.reset_index(drop=True)
s = df['flag'].ne(df['flag'].shift()).cumsum()
m = s.eq(s[df['flag'].eq(1)].min())
df2 = df.loc[m, ['value']]
print (df2)
value
3 8
4 11
5 1
And then filter values with aff and remove 1 to default RangeIndex:
df1 = df.loc[(df2.index + 1).union(df2.index - 1), ['value']]
print (df1)
value
2 7
3 8
4 11
5 1
6 15

Related

Get the difference between two dates when string value changes

I want to get the number of days between the change of string values (ie., the symbol column) in one column, grouped by their respective id. I want a separate column for datediff like the one below.
id date symbol datediff
1 2022-08-26 a 0
1 2022-08-27 a 0
1 2022-08-28 a 0
2 2022-08-26 a 0
2 2022-08-27 a 0
2 2022-08-28 a 0
2 2022-08-29 b 3
3 2022-08-29 c 0
3 2022-08-30 b 1
For id = 1, datediff = 0 since symbol stayed as a. For id = 2, datediff = 3 since symbol changed after 3 days from a to b. Hence, what I'm looking for is a code that computes the difference in which the id changes it's symbol.
I am currently using this code:
df['date'] = pd.to_datetime(df['date'])
diff = ['0 days 00:00:00']
for st,i in zip(df['symbol'],df.index):
if i > 0:#cannot evaluate previous from index 0
if df['symbol'][i] != df['symbol'][i-1]:
diff.append(df['date'][i] - df['data_ts'][i-1])
else:
diff.append('0 days 00:00:00')
The output becomes:
id date symbol datediff
1 2022-08-26 a 0
1 2022-08-27 a 0
1 2022-08-28 a 0
2 2022-08-26 a 0
2 2022-08-27 a 0
2 2022-08-28 a 0
2 2022-08-29 b 1
3 2022-08-29 c 0
3 2022-08-30 b 1
It also computes the difference between two different ids. But I want the computation to be separate from different ids.
I only see questions about difference of dates when values changes, but not when string changes. Thank you!
IIUC: my solution works with the assumption that the symbols within one id ends with a single changing symbol, if there is any (as in the example given in the question).
First use df.groupby on id and symbol and get the minimum date for each combination. Then, find the difference between the dates within each id. This gives the datediff. Finally, merge the findings with the original dataframe.
df1 = df.groupby(['id', 'symbol'], sort=False).agg({'date': np.min}).reset_index()
df1['datediff'] = abs(df1.groupby('id')['date'].diff().dt.days.fillna(0))
df1 = df1.drop(columns='date')
df_merge = pd.merge(df, df1, on=['id', 'symbol'])

pandas - show column name + sum in which the sum is higher than zero

I read my dataframe in with:
dataframe = pd.read_csv("testFile.txt", sep = "\t", index_col= 0)
I got a dataframe like this:
cell 17472131 17472132 17472133 17472134 17472135 17472136
cell_0 1 0 1 0 1 0
cell_1 0 0 0 0 1 0
cell_2 0 1 1 1 0 0
cell_3 1 0 0 0 1 0
with pandas I would like to get all the column names in which the sum of the column is > 1 and the total sum.
So I would like:
17472131 2
17472133 2
17472135 3
I figured out how to get the sums of each column with
dataframe.sum(axis=0)
but this also returns the columns with a sum lower than 2.. is there a way to only show the columns with a higher value than i.e. 1?
One pretty neat way is to use lambda function in loc:
df.set_index('cell').sum().loc[lambda x: x>1]
Output:
17472131 2
17472133 2
17472135 3
dtype: int64
Details: df.sum returns a pd.Series and we can use lambda x: x>1 to produce as boolean series which loc use boolean indexing to select only True parts of the pd.Series.

Pandas remove group if difference between first and last row in group exceeds value

I have a dataframe df:
df = pd.DataFrame({})
df['X'] = [3,8,11,6,7,8]
df['name'] = [1,1,1,2,2,2]
X name
0 3 1
1 8 1
2 11 1
3 6 2
4 7 2
5 8 2
For each group within 'name' and want to remove that group if the difference between the first and last row of that group is smaller than a specified value d_dif in absolute way:
For example, when d_dif= 5, I want to get:
X name
0 3 1
1 8 1
2 11 1
If your data is increasingly in X, you can use groupby().transform() and np.ptp
threshold = 5
ranges = df.groupby('name')['X'].transform(np.ptp)
df[ranges > threshold]
If you only care about first and last, then transform just first and last:
threshold = 5
groups = df.groupby('name')['X']
ranges = groups.transform('last') - groups.transform('first')
df[ranges.abs() > threshold]

if specific value/string occurs in the entire dataframe I want to sum its index values

i have a dataframe in which I need to find a specific image name in the entire dataframe and sum its index values every time they are found. SO my data frame looks like:
c 1 2 3 4
g
0 180731-1-61.jpg 180731-1-61.jpg 180731-1-61.jpg 180731-1-61.jpg
1 1209270004-2.jpg 180609-2-31.jpg 1209270004-2.jpg 1209270004-2.jpg
2 1209270004-1.jpg 180414-2-38.jpg 180707-1-31.jpg 1209050002-1.jpg
3 1708260004-1.jpg 1209270004-2.jpg 180609-2-31.jpg 1209270004-1.jpg
4 1108220001-5.jpg 1209270004-1.jpg 1108220001-5.jpg 1108220001-2.jpg
I need to find the 1209270004-2.jpg in entire dataframe. And as it is found at index 1 and 3 I want to add the index values so it should be
1+3+1+1=6.
I tried the code:
img_fname = '1209270004-2.jpg'
df2 = df1[df1.eq(img_fname).any(1)]
sum = int(np.sum(df2.index.values))
print(sum)
I am getting the answer of sum 4 i.e 1+3=4. But it should be 6.
If the string occurence is only once or twice or thrice or four times like for eg 180707-1-31 is in column 3. then the sum should be 45+45+3+45 = 138. Which signifies that if the string is not present in the dataframe take vallue as 45 instead the index value.
You can multiple boolean mask by index values and then sum:
img_fname = '1209270004-1.jpg'
s = df1.eq(img_fname).mul(df1.index.to_series(), 0).sum()
print (s)
1 2
2 4
3 0
4 3
dtype: int64
out = np.where(s == 0, 45, s).sum()
print (out)
54
If dataset does not have many columns, this can also work with your original question
df1 = pd.DataFrame({"A":["aa","ab", "cd", "ab", "aa"], "B":["ab","ab", "ab", "aa", "ab"]})
s = 0
for i in df1.columns:
s= s+ sum(df1.index[df1.loc[:,i] == "ab"].tolist())
Input :
A B
0 aa ab
1 ab ab
2 cd ab
3 ab aa
4 aa ab
Output :11
Based on second requirement:

Pandas Pivot Table Conditional Counting

I have a simple dataframe:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0]})
df
id value
0 a 0
1 a 15
2 a 20
3 b 30
4 b 0
And I want a pivot table with the number of values greater than zero.
I tried this:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:len(x>0))
But returned this:
value
id
a 3
b 2
What I need:
value
id
a 2
b 1
I read lots of solutions with groupby and filter. Is it possible to achieve this only with pivot_table command? If it is not, which is the best approach?
Thanks in advance
UPDATE
Just to make it clearer why I am avoinding filter solution. In my real and complex df, I have other columns, like this:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0],'other':[2,3,4,5,6]})
df
id other value
0 a 2 0
1 a 3 15
2 a 4 20
3 b 5 30
4 b 6 0
I need to sum the column 'other', but when i filter I got this:
df=df[df['value']>0]
raw = pd.pivot_table(df, index='id',values=['value','other'],aggfunc={'value':len,'other':sum})
other value
id
a 7 2
b 5 1
Instead of:
other value
id
a 9 2
b 11 1
Need sum for count Trues created by condition x>0:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:(x>0).sum())
print (raw)
value
id
a 2
b 1
As #Wen mentioned, another solution is:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id',values='value',aggfunc=len)
You can filter the dataframe before pivoting:
pd.pivot_table(df.loc[df['value']>0], index='id',values='value',aggfunc='count')

Resources