How to iterate over columns and save each iteration's result in a result DataFrame in Python - python-3.x

I have one DataFrame with multiple columns, and I need to run the same calculation on every column. Is there a way to do this? I have many columns, so I cannot handle them one by one.
df=pd.DataFrame({r'A':[1,24,69,67],r'A\0001\delta':[1,46,454,67],r'A\0002\delta':[1,46,454,67],r'A\00100\delta':[1,46,70,67]})
I want to calculate:
diff=df[r'A\0001\delta'].diff()
If the diff is greater than 60, save the row in a result DataFrame.
I want to do the same for more than 100 columns and collect the matching rows in the result DataFrame.

At least one diff greater than 60 in a row:
>>> df.loc[df.diff().gt(60).any(axis=1)]
    A  A\0001\delta  A\0002\delta  A\00100\delta
2  69           454           454             70
All diffs greater than 60 in a row:
>>> df.loc[df.diff().gt(60).all(axis=1)]
Empty DataFrame
Columns: [A, A\0001\delta, A\0002\delta, A\00100\delta]
Index: []
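To also see which columns triggered the match, here is a minimal sketch built on the same diff().gt(60) mask. The names threshold, exceeds, triggered and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    r'A': [1, 24, 69, 67],
    r'A\0001\delta': [1, 46, 454, 67],
    r'A\0002\delta': [1, 46, 454, 67],
    r'A\00100\delta': [1, 46, 70, 67],
})

threshold = 60  # limit taken from the question

# Boolean frame: True where the row-to-row difference exceeds the threshold
exceeds = df.diff().gt(threshold)

# Rows where at least one column exceeded the threshold
result = df.loc[exceeds.any(axis=1)]

# For each column with at least one hit, the row labels where it happened
triggered = {
    col: df.index[exceeds[col]].tolist()
    for col in df.columns
    if exceeds[col].any()
}

print(result)
print(triggered)
```

Because the mask is computed for the whole frame at once, this should scale to 100+ columns without an explicit per-column loop.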

Related

How to return index of a row 60 seconds before current row

I have a large (>32 M rows) Pandas dataframe.
In column 'Time_Stamp' I have a Unix timestamp in seconds. These values are not linear, there are gaps, and some timestamps can be duplicated (ex: 1, 2, 4, 6, 6, 9,...).
I would like to set column 'Result' of current row to the index of the row that is 60 seconds before current row (closest match if there are no rows exactly 60 seconds before current row, and if more than one match, take maximum of all matches).
I've tried this to first get the list of indexes, but it always returns an empty list:
df.index[df['Time_Stamp'] <= df.Time_Stamp-60].tolist()
I cannot use a for loop due to the large number of rows.
Edit 20.01.2020:
Based on comment below, I'm adding a sample dataset, and instead of returning the index I want to return the column Value:
In [2]: df
Out[2]:
   Time_Stamp  Value
0           1    2.4
1           2    3.1
2           4    6.3
3           6    7.2
4           6    6.1
5           9    6.0
So with the help of ALollz, I managed to achieve what I wanted in the end. Here's my code:
import pandas as pd
# gap in seconds between the current row and the row to look up (60, per the question)
Time_gap = 60
# make a copy of the dataframe
df2 = df[['Time_Stamp','Value']].copy()
# add Time_gap to Time_Stamp in df2
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap
# sort df2 on Time_Stamp
df2.sort_values(by='Time_Stamp', ascending=True, inplace=True)
df2 = df2.reset_index(drop=True)
# for each row of df, take the first df2 row whose shifted Time_Stamp is at or after it
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
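As a rough illustration of the merge_asof approach on the sample data, here is a small sketch. It assumes a gap of 3 instead of 60 so that the six sample rows give varied matches, and it renames the looked-up column to a hypothetical Value_before to avoid the _x/_y suffixes.

```python
import pandas as pd

df = pd.DataFrame({
    "Time_Stamp": [1, 2, 4, 6, 6, 9],
    "Value": [2.4, 3.1, 6.3, 7.2, 6.1, 6.0],
})

time_gap = 3  # assumption: small gap chosen only for this toy data

# Shift the timestamps forward by the gap. df is already sorted on
# Time_Stamp, so the shifted copy is sorted too (merge_asof needs that).
lagged = df.assign(Time_Stamp=df["Time_Stamp"] + time_gap)
lagged = lagged.rename(columns={"Value": "Value_before"})

# direction="forward": for each row of df, take the first lagged row whose
# shifted timestamp is >= the current timestamp, i.e. the earliest original
# row at or after (Time_Stamp - time_gap).
out = pd.merge_asof(df, lagged, on="Time_Stamp", direction="forward")
print(out)
```

Switching direction to "backward" or "nearest" changes which neighbour counts as the closest match, so it is worth checking which one fits the intended definition.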

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
list = [55,6,77,75,9,127,13]
finallist = pd.DataFrame(list)
finallist.columns = ['Numbers']
The line below gives me the average of rows 0:2 in the Numbers column. So selecting the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected .iloc[0:2].shift(1) to shift the mean function down 1 row but still apply to 2 total rows, but I got a value of NaN.
Here's a screenshot of my output:
What's happening in your shift(1) approach is that you're actually shifting the values in your data "down" by one row, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evaluates to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
   Numbers  Averages
0       55      55.0
1        6      30.5
2       77      41.5
3       75      76.0
4        9      42.0
5      127      68.0
6       13      70.0
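If the average of A1:A2 should sit on the first row, as when dragging an Excel formula down, one possible variant is to shift the rolling result up by one row. This is a sketch of that assumption about the desired layout, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Numbers": [55, 6, 77, 75, 9, 127, 13]})

# rolling(2).mean() at row i averages rows i-1 and i;
# shift(-1) moves each result up one row, so row i holds mean(row i, row i+1).
df["Averages"] = df["Numbers"].rolling(2).mean().shift(-1)
print(df)
```

The last row has no following row to pair with, so its average is NaN.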

Compare row with all other previous string in one column and change value of another column in Python

I have a csv file named namelist.csv, it includes:
Index  String        Size  Name
1      AAA123000DDD  10    One
2      AAA123DDDQQQ  20    One
3      AAA123000DDD  25    One
4      AAA123D       20    One
5      ABA           15    One
6      FFFrrrSSSBBB  60    Two
7      FFFrrrSSSBBB  30    Two
8      FFFrrrSS      50    Two
9      AAA12         70    Two
I want to compare the rows in column String within each Name group: if the string in a row matches, or is a substring of, the string in any row above it, then remove that earlier row and add its Size value to the Size of the current row.
Example: take the 3rd row, AAA123000DDD. Comparing it with the 1st and 2nd rows shows that it matches the 1st row, so the 1st row is removed and its Size is added to the Size of the 3rd row.
Then the table will look like:
Index  String        Size  Name
2      AAA123DDDQQQ  20    One
3      AAA123000DDD  35    One
4      AAA123D       20    One
...
The final result will be:
Index  String        Size  Name
3      AAA123000DDD  35    One
4      AAA123D       40    One
5      ABA           15    One
8      FFFrrrSS      140   Two
9      AAA12         70    Two
I am thinking of using pandas groupby on the Name column, but I don't know how to apply the comparison on the String column and the sum on the Size column.
I am new to Python, so I would really appreciate any help.
Assuming each String maps to a single Name, here's how you would do the aggregation. I kept Name so that it also shows in the final DataFrame.
df_group = df.groupby(['String', 'Name'])['Size'].sum().reset_index()
Edit:
To match the substrings (assuming, as the example above suggests, that a substring will not match multiple strings), you can make a mapping of substrings to full strings and then group by the full string column as before:
all_strings = set(df['String'])
substring_dict = dict()
# map each String to a string that contains it (the last match found wins)
for row in df.itertuples():
    for item in all_strings:
        if row.String in item:
            substring_dict[row.String] = item
def match_substring(x):
    return substring_dict[x]
df['full_strings'] = df.String.apply(match_substring)
df_group = df.groupby(['full_strings', 'Name'])['Size'].sum().reset_index()
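For comparison, the rule as stated in the question (within each Name group, a row absorbs every earlier remaining row whose String contains it) can be sketched with a plain loop per group. This is a different technique from the mapping above, and collapse_group is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "String": ["AAA123000DDD", "AAA123DDDQQQ", "AAA123000DDD", "AAA123D",
               "ABA", "FFFrrrSSSBBB", "FFFrrrSSSBBB", "FFFrrrSS", "AAA12"],
    "Size": [10, 20, 25, 20, 15, 60, 30, 50, 70],
    "Name": ["One", "One", "One", "One", "One", "Two", "Two", "Two", "Two"],
})

def collapse_group(group):
    """Hypothetical helper: fold earlier rows into the later row they contain."""
    kept = []
    for row in group.to_dict("records"):
        # Earlier rows whose String contains (or equals) the current String
        absorbed = [k for k in kept if row["String"] in k["String"]]
        row["Size"] += sum(k["Size"] for k in absorbed)
        # Drop the absorbed rows and keep the current one
        kept = [k for k in kept if row["String"] not in k["String"]]
        kept.append(row)
    return kept

records = []
for _, group in df.groupby("Name", sort=False):
    records.extend(collapse_group(group))

result = pd.DataFrame(records, columns=df.columns)
print(result)
```

On the sample data this reproduces the final table from the question. The nested comparison is quadratic per group, so it is only a starting point for large files.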

How do you search through a row (Row 1) of a CSV file, but search through the next row (Row 2) at the same time?

Imagine a dataframe with three columns and a certain number of rows. The first column holds random values, the second holds names, and the third holds ages.
I want to go through every row of this dataframe and find where value 1 appears in the first column. At the same time, when value 1 does appear, I want to know whether value 2 appears in the same column in the next row.
If this is the case, copy the first row's Value, Name and Age into an empty dataframe. Every time this condition is met, copy that row into the dataframe.
EmptyDataframe = pd.DataFrame(columns=['Name','Age'])
csvfile = pd.DataFrame(columns=['Value', 'Name', 'Age'])
row_for_csv_dataframe = next(csvfile.iterrows())
for index, row_for_csv_dataframe in csvfile.iterrows():
    if row_for_csv_dataframe['Value'] == '1':
        # How to code this:
        # if the NEXT row after row_for_csv_dataframe finds the 'Value' == 2
        # then copy 'Age' and 'Name' from row_for_csv_dataframe into the empty DataFrame.
        pass
Assuming you have a dataframe data like this:
   Value  Name  Age
0      1  Anne   10
1      2  Bert   20
2      3  Caro   30
3      2  Dora   40
4      1  Emil   50
5      1  Flip   60
6      2  Gabi   70
You could do something like this, although this is probably not the most efficient:
iterator1 = data.iterrows()
iterator2 = data.iterrows()
# advance the second iterator so it is always one row ahead of the first
next(iterator2)
for current, following in zip(iterator1, iterator2):
    if current[1].Value == 1 and following[1].Value == 2:
        print(current[1].Value, current[1].Name, current[1].Age)
And would get this result:
1 Anne 10
1 Flip 60
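A vectorized alternative, sketched here on the same sample data, compares each Value with the next one via shift(-1) and avoids iterating row by row. The names mask and matches are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Value": [1, 2, 3, 2, 1, 1, 2],
    "Name": ["Anne", "Bert", "Caro", "Dora", "Emil", "Flip", "Gabi"],
    "Age": [10, 20, 30, 40, 50, 60, 70],
})

# shift(-1) lines up each row with the Value of the row below it;
# the last row gets NaN there, which never equals 2.
mask = data["Value"].eq(1) & data["Value"].shift(-1).eq(2)

# Rows with Value 1 that are immediately followed by a row with Value 2
matches = data.loc[mask]
print(matches)
```

The result is already a DataFrame, so there is no need to append to an empty one inside a loop.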

How to select bunch of rows

I have a dataframe with multiple columns. I want to select each run of rows where column B has consecutive 1s and, if any value of column A in that run equals 0.04, keep the run and extract the start and end values of column A for it.
Here is my dataframe
Here is my desired output:
Label the consecutive groups with .diff().abs().cumsum().bfill(), then filter out the groups that do not satisfy the conditions x['B'].eq(1).any() and x['A'].eq(0.04).any().
Finally, group by the consecutivity column again and use agg with first and last to extract the first and last values of each remaining group.
df['temp'] = df.B.diff().abs().cumsum().bfill()
df.groupby('temp').filter(lambda x: (x['B'].eq(1).any() and x['A'].eq(0.04).any()))\
.groupby('temp').agg({'A':['first','last']})
Out:
          A
      first  last
temp
3.0   344.0  39.9
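Since the original data is not shown, here is a small sketch of the same steps on made-up values. The frame below is purely hypothetical, and the final aggregation selects column A directly so the result has flat first/last columns.

```python
import pandas as pd

# Hypothetical data: three runs of B == 1, two of which contain A == 0.04
df = pd.DataFrame({
    "A": [0.5, 0.04, 0.9, 1.2, 2.0, 3.0, 4.0, 7.0, 0.04, 8.5],
    "B": [1, 1, 1, 0, 0, 1, 1, 0, 1, 1],
})

# Label runs of identical consecutive B values: the label increases by one
# every time B changes; bfill() fills the NaN that diff() leaves on row 0.
df["temp"] = df["B"].diff().abs().cumsum().bfill()

# Keep only runs where B is 1 and at least one A equals 0.04
kept = df.groupby("temp").filter(
    lambda x: bool(x["B"].eq(1).any() and x["A"].eq(0.04).any())
)

# First and last A value of each remaining run
result = kept.groupby("temp")["A"].agg(["first", "last"])
print(result)
```

Comparing floats with eq(0.04) works here because the values are exact literals; for computed values a tolerance-based check may be safer.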
