How to find the index of a row for a particular value in a particular column and then create a new column with that starting point? - python-3.x

Example of dataframe
What I'm trying to do with my dataframe:
Locate the first 0 value in a certain column (G in the example photo).
Create a new column (Time) with the value 0 lining up on the same row as that first 0 in column G.
Then, for each row after the 0, increase (Time) by (1/60) until the end of the data,
and decrease it by (1/60) for each row before the 0, back to the beginning of the data.
What is the best method to achieve this?
Any advice would be appreciated. Thank you.

Pretty straightforward:
1. identify the index of the row that contains the value you are looking for
2. then construct an array that starts negative, is zero at that index row, and ends at the value for the end of the series
import numpy as np
import pandas as pd

df = pd.DataFrame({"Time": np.full(25, 0), "G": [i if i > 0 else 0 for i in range(10, -15, -1)]})
# find the first index where the value is zero
idx = df[df["G"] == 0].index[0]
intv = round(1/60, 3)
# construct a numpy array from a range of values (start, stop, increment)
df["Time"] = np.arange(-idx*intv, (len(df)-idx)*intv, intv)
df.loc[idx, "Time"] = 0 # just remove any rounding at zero point
print(df.to_string(index=False))
Output:
Time G
-0.170 10
-0.153 9
-0.136 8
-0.119 7
-0.102 6
-0.085 5
-0.068 4
-0.051 3
-0.034 2
-0.017 1
0.000 0
0.017 0
0.034 0
0.051 0
0.068 0
0.085 0
0.102 0
0.119 0
0.136 0
0.153 0
0.170 0
0.187 0
0.204 0
0.221 0
0.238 0
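Since intv is rounded to three decimals, np.arange accumulates that rounding across rows. If exact spacing matters, the Time column can instead be derived directly from the integer index offsets; a minimal sketch of the same idea, reusing the construction above:

```python
import pandas as pd

df = pd.DataFrame({"G": [i if i > 0 else 0 for i in range(10, -15, -1)]})
idx = df[df["G"] == 0].index[0]      # first row where G is 0
df["Time"] = (df.index - idx) / 60   # exact offsets: ..., -1/60, 0, 1/60, ...
```

This guarantees Time is exactly 0 at the zero row with no rounding cleanup needed afterwards.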

Related

Pandas time series - need to extract row value based on multiple conditionals based on other columns

I have a time series dataframe with the below columns. I am trying to figure out:
If df['PH'] ==1, then I need find the previous date where df['pivot_low_1'] == 1 and extract the value of df['low'] for that date. So, for 2010-01-12 where df['PH'] ==1, I would need to identify the previous non-zero df['pivot_low_1'] == 1 on 2010-01-07 and get df['low'] == 1127.00000.
low pivot_low_1 PH
date
2010-01-04 1114.00000 1 0
2010-01-05 1125.00000 0 0
2010-01-06 1127.25000 0 0
2010-01-07 1127.00000 1 0
2010-01-08 1131.00000 0 0
2010-01-11 1137.75000 0 0
2010-01-12 1127.75000 1 1
2010-01-13 1129.25000 0 0
2010-01-14 1138.25000 0 0
2010-01-15 1127.50000 1 0
2010-01-18 1129.50000 0 0
2010-01-19 1126.25000 0 0
2010-01-20 1125.25000 0 0
2010-01-21 1108.50000 0 0
2010-01-22 1086.25000 1 0
2010-01-25 1089.75000 0 0
2010-01-26 1081.00000 0 0
2010-01-27 1078.50000 0 0
2010-01-28 1074.25000 0 0
2010-01-29 1066.50000 1 1
2010-02-01 1068.00000 0 0
Since you want a new column in the same dataframe, but the output corresponds to only certain rows, I will fill every other row with NaN values:
import numpy as np
import pandas as pd

data = pd.read_csv('file.csv')
data.columns = ['low', 'pivot_low_1', 'PH']
count = 0
l = list()
new = list()
for index, row in data.iterrows():
    if row['pivot_low_1'] == 1:
        l.append(count)
    if (row['PH'] == 1) and (row['pivot_low_1'] == 1):
        new.append(data.iloc[l[len(l) - 2]].low)
    elif row['PH'] == 1:
        new.append(data.iloc[l[len(l) - 1]].low)
    elif row['PH'] == 0:
        new.append(np.nan)
    count += 1
data['new'] = new
data
The output is as shown in this image: https://imgur.com/a/IqowZHZ. Hope this helps.
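For longer frames the same lookup can be vectorized instead of looping with iterrows. A sketch of that idea using where/shift/ffill, on the first rows of the sample data (the column name new is mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "low": [1114.0, 1125.0, 1127.25, 1127.0, 1131.0, 1137.75, 1127.75],
    "pivot_low_1": [1, 0, 0, 1, 0, 0, 1],
    "PH": [0, 0, 0, 0, 0, 0, 1],
})

pivots = df["low"].where(df["pivot_low_1"].eq(1))  # low at pivot rows, NaN elsewhere
last_pivot = pivots.ffill()                        # most recent pivot low, including this row
prev_pivot = pivots.shift().ffill()                # most recent pivot low strictly before this row
df["new"] = np.where(df["PH"].eq(1),
                     np.where(df["pivot_low_1"].eq(1), prev_pivot, last_pivot),
                     np.nan)
```

On the 2010-01-12 row (the last one above) this picks up 1127.0 from 2010-01-07, matching the loop's l[len(l) - 2] case.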

Google Sheets – countifs across rows

I have a schedule that my team fills out daily in a Google Sheet. On a separate tab, I would like a running count per day per schedule code per agent.
Linking a sample spreadsheet here. In this example, I'm trying to input a countif that returns
2019-01-27 T 5 6 0 4
2019-01-27 C 3.5 0 0 7
2019-01-27 LC 0 0 0 0
2019-01-27 S 0 0 0 0
2019-01-27 L 0.5 0 0 1
2019-01-27 M 0.5 0 0 1
2019-01-27 SP 0 0 0 0
2019-01-27 U 0 0 0 0
2019-01-27 MCX 2 0 0 2
2019-01-27 OCX 0 0 0 0
2019-01-27 TR 0 0 0 0
But I cannot for the life of me get a countifs function to work. Any help is much appreciated!
https://docs.google.com/spreadsheets/d/1gp0ZrcYLJfEnUHxgxagAl99X_MCjEIdvwFyfSdGngSE/edit?usp=sharing
Combine INDIRECT with MATCH:
=COUNTIF(INDIRECT("'Mon 1/27'!F"&MATCH(D$1,'Mon 1/27'!$A$1:$A$5,0)&":AG"&MATCH(D$1,'Mon 1/27'!$A$1:$A$5,0)),$B2)
Here is how it works:
MATCH(D$1,'Mon 1/27'!$A$1:$A$5,0) finds the row number of the agent in D$1 by searching column A of the 'Mon 1/27' sheet.
INDIRECT("'Mon 1/27'!F"&MATCH(D$1,'Mon 1/27'!$A$1:$A$5,0)&":AG"&MATCH(D$1,'Mon 1/27'!$A$1:$A$5,0)) returns a range that always spans columns F to AG, but on the row number returned in step 1, i.e. F3:AG3, F4:AG4, and so on.
COUNTIF then just counts the criterion ($B2) within the range from step 2.
Hope this helps.
IMPORTANT: In the expected output you posted, the MCX result for Barack Obama is 2, but my formula gets 4. Are you sure your output is right?

build function which takes values from above row in pandas dataframe

I have the following dataframe:
I want to build a function to apply on column 'c' that subtracts column 'd' from column 'u' and adds the value from the row above in column 'c',
so that the table will look as follows:
For example, in row number 2 the calculation will be: 44.37 - 0 + 149.77 = 194.14.
In row number 4 the calculation will be: 11.09 - 6.45 + 210.78 = 215.42.
And so on.
I tried to build the function using iloc with a while loop, or with shift, but neither worked, as I got these errors:
("'numpy.float64' object has no attribute 'iloc'", 'occurred at index 0')
("'numpy.float64' object has no attribute 'shift'", 'occurred at index 0')
Any idea how to make this function work would be great.
Thanks!
You can apply direct subtraction of the columns and use a cumulative sum to add up the values:
d u
0 0.000 149.75
1 0.000 44.37
2 0.000 16.64
3 6.450 11.09
4 77.345 5.54
5 64.520 16.40
df1['c'] = (df1['u'] - df1['d']).cumsum()
Out:
d u c
0 0.000 149.75 149.750
1 0.000 44.37 194.120
2 0.000 16.64 210.760
3 6.450 11.09 215.400
4 77.345 5.54 143.595
5 64.520 16.40 95.475
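The one-liner above assumes df1 already holds the d and u columns; as a self-contained check with the data shown (the new column is named c, as in the output):

```python
import pandas as pd

df1 = pd.DataFrame({"d": [0.000, 0.000, 0.000, 6.450, 77.345, 64.520],
                    "u": [149.75, 44.37, 16.64, 11.09, 5.54, 16.40]})
df1["c"] = (df1["u"] - df1["d"]).cumsum()  # running total of u minus d
```

Each row's c is the previous c plus (u - d) for that row, which is exactly the recurrence the question describes, without any loop.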

Python/Pandas Subtract Only if Value is not 0

I'm starting with data that looks something like this, but with a lot more rows:
Location Sample a b c d e f g h i
1 w 14.6 0 0 0 0 0 0 0 16.8
2 x 0 13.6 0 0 0 0 0 0 16.5
3 y 0 0 15.5 0 0 0 0 0 16.9
4 z 0 0 0 0 14.3 0 0 0 15.7
...
The data is indexed by the first two columns. I need to subtract the values in column i from each of the values in a - h, adding a new column to the right of the data frame for each original column. However, if there is a zero in the original column, I want it to stay zero instead of subtracting. For example, if my code worked I would have the following columns added to the data frame on the right:
Location Sample ... a2 b2 c2 d2 e2 f2 g2 h2
1 w ... -2.2 0 0 0 0 0 0 0
2 x ... 0 -2.9 0 0 0 0 0 0
3 y ... 0 0 -1.4 0 0 0 0 0
4 z ... 0 0 0 0 -1.4 0 0 0
...
I'm trying to use where in pandas to only subtract the value in column i if the value in the current column is not zero, using the following code:
import pandas as pd

normalizer = "i"
columns = list(df.columns.values)
for column in columns:
    if column == normalizer:
        continue
    newcol = column + "2"
    df[newcol] = df.where(df[column] == 0,
                          df[column] - df[normalizer], axis=0)
I'm using a for loop because the number of columns will not always be the same, and the column that is being subtracted will have a different name using different data sets.
I'm getting this error: "ValueError: Wrong number of items passed 9, placement implies 1".
I think the subtraction is causing the issue, but I can't figure out how to change it to make it work. Any assistance would be greatly appreciated.
Thanks in advance.
Method 1 (pretty fast: roughly 3 times faster than method 2)
1. Select the columns that are relevant.
2. Do the subtraction.
3. Elementwise multiplication with a 0/1 matrix constructed before the subtraction. Each element in (subdf > 0) is 0 if it was originally 0 and 1 otherwise.
ith_col = df["i"]
subdf = df.iloc[:, 2:-1] # a - h columns
df_temp = subdf.sub(ith_col, axis=0).multiply(subdf > 0).add(0)
df_temp.columns = ['a2', 'b2', 'c2', 'd2', 'e2', 'f2', 'g2', 'h2'] # rename columns
df_desired = pd.concat([df, df_temp], axis=1)
Note that in this method the zeros come out negative (-0.0), since multiplying a negative difference by 0 keeps the sign bit. Thus we have the extra add(0) at the end to normalize them. Yes, a 0 can be negative. :P
Method 2 (more readable)
1. Find the greater-than-0 part with a condition.
2. Select the columns that are relevant.
3. Subtract.
4. Fill in 0.
ith_col = df["i"]
df[df > 0].iloc[:,2:-1].sub(ith_col, axis=0).fillna(0)
The second method is pretty similar to #Wen's answer. Credits to him :P
Speed comparison of two methods (tested on Python 3 and pandas 0.20)
%timeit subdf.sub(ith_col, axis=0).multiply(subdf > 0).add(0)
688 µs ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df[df > 0].iloc[:,2:-1].sub(ith_col, axis=0).fillna(0)
2.97 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Reference:
DataFrame.multiply performs elementwise multiplication with another data frame.
Using mask + fillna
df.iloc[:,2:-1]=df.iloc[:,2:-1].mask(df.iloc[:,2:-1]==0).sub(df['i'],0).fillna(0)
df
Out[116]:
Location Sample a b c d e f g h i
0 1 w -2.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.8
1 2 x 0.0 -2.9 0.0 0.0 0.0 0.0 0.0 0.0 16.5
2 3 y 0.0 0.0 -1.4 0.0 0.0 0.0 0.0 0.0 16.9
3 4 z 0.0 0.0 0.0 0.0 -1.4 0.0 0.0 0.0 15.7
Update
normalizer = ['i','Location','Sample']
df.loc[:,~df.columns.isin(normalizer)]=df.loc[:,~df.columns.isin(normalizer)].mask(df.loc[:,~df.columns.isin(normalizer)]==0).sub(df['i'],0).fillna(0)
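As a self-contained check of the mask + fillna idea on two of the sample rows (only columns a, b and i here; the 2-suffixed names follow the question's convention):

```python
import pandas as pd

df = pd.DataFrame({"Location": [1, 2], "Sample": ["w", "x"],
                   "a": [14.6, 0.0], "b": [0.0, 13.6], "i": [16.8, 16.5]})
vals = df[["a", "b"]]
# NaN-out the zeros, subtract i row-wise, then restore the zeros
new = vals.mask(vals == 0).sub(df["i"], axis=0).fillna(0).add_suffix("2")
df = df.join(new)
```

Using join with add_suffix keeps the original columns intact, like the concat in Method 1 above, rather than overwriting them in place.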

delete specific rows from csv using pandas

I have a csv file in the format shown below:
I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just 10% of the rows that have steering value as 0.
df = pd.read_csv(filename, header=None, names = ["center", "left", "right", "steering", "throttle", 'break', 'speed'])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
However, I get the following error:
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1104, in mtrand.RandomState.choice
(numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0
Can you guys help me?
sample DataFrame built with @andrew_reece's code
In [9]: df
Out[9]:
center left right steering throttle brake
0 center_54.jpg left_75.jpg right_39.jpg 1 0 0
1 center_20.jpg left_81.jpg right_49.jpg 3 1 1
2 center_34.jpg left_96.jpg right_11.jpg 0 4 2
3 center_98.jpg left_87.jpg right_34.jpg 0 0 0
4 center_67.jpg left_12.jpg right_28.jpg 1 1 0
5 center_11.jpg left_25.jpg right_94.jpg 2 1 0
6 center_66.jpg left_27.jpg right_52.jpg 1 3 3
7 center_18.jpg left_50.jpg right_17.jpg 0 0 4
8 center_60.jpg left_25.jpg right_28.jpg 2 4 1
9 center_98.jpg left_97.jpg right_55.jpg 3 3 0
.. ... ... ... ... ... ...
90 center_31.jpg left_90.jpg right_43.jpg 0 1 0
91 center_29.jpg left_7.jpg right_30.jpg 3 0 0
92 center_37.jpg left_10.jpg right_15.jpg 1 0 0
93 center_18.jpg left_1.jpg right_83.jpg 3 1 1
94 center_96.jpg left_20.jpg right_56.jpg 3 0 0
95 center_37.jpg left_40.jpg right_38.jpg 0 3 1
96 center_73.jpg left_86.jpg right_71.jpg 0 1 0
97 center_85.jpg left_31.jpg right_0.jpg 3 0 4
98 center_34.jpg left_52.jpg right_40.jpg 0 0 2
99 center_91.jpg left_46.jpg right_17.jpg 0 0 0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0 43 # NOTE: 43 zeros
1 18
2 15
4 12
3 12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1 18
2 15
4 12
3 12
0 4 # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that steering column has numeric dtype! If it's a string (object) then you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE: ^ ^
after that you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100, 6)
The first few rows of sample output:
df.head()
center left right steering throttle brake
0 center_72.jpg left_26.jpg right_59.jpg 3 3 0
1 center_75.jpg left_68.jpg right_26.jpg 0 0 2
2 center_29.jpg left_8.jpg right_88.jpg 0 1 0
3 center_22.jpg left_26.jpg right_23.jpg 1 0 0
4 center_88.jpg left_0.jpg right_56.jpg 4 1 0
5 center_93.jpg left_18.jpg right_15.jpg 0 0 0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
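A quick, seeded sanity check of that mask on synthetic data (the column name and value probabilities here are made up for the test):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"steering": rng.choice([0, 1, 2], size=1000, p=[0.5, 0.3, 0.2])})
keep = (df["steering"] != 0) | (rng.random(len(df)) < 0.1)
reduced = df[keep]
# every non-zero row survives; roughly 10% of the zero rows remain
```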
Edit: That said, I tried your example code and it worked as well. My guess is the error is coming from the fact that your df.query() statement is returning an empty dataframe, which probably means that the "steering" column does not contain any zeros, or alternatively that the column was read as strings rather than numeric. Try converting the column to integer before running the above snippet.
