Average data points in a range while condition is met in a Pandas DataFrame - python-3.x

I have a very large dataset with over 400,000 rows and growing. I understand that you are not supposed to use iterrows to modify a pandas DataFrame, but I'm a little lost on what to do in this case, since I'm not sure I could use .loc or some rolling filter to modify the DataFrame in the way I need. I'm trying to figure out whether I can take a DataFrame and average each range of rows while a condition is met. For example:
Condition  Temp.  Pressure
1          8      20
1          7      23
1          8      22
1          9      21
0          4      33
0          3      35
1          9      21
1          11     20
1          10     22
While the condition is == 1, the output dataframe would look like this:
Condition  Avg. Temp.  Avg. Pressure
1          8           21.5
1          10          21
Has anyone attempted something similar who can put me on the right path? I was thinking of using something like this:
import pandas as pd

df = pd.read_csv(csv_file)
start = None
for index, row in df.iterrows():
    if row['Condition'] == 1:
        if start is None:
            start = index            # start index = first row of the current run of 1s
    elif start is not None:          # end index reached: calculate rolling average of the range
        window = index - start
        new_df = df.rolling(window).mean()
        start = None
I know my code isn't great, and I know I could brute-force it along the lines shown above, but as I said the data has a lot of rows and keeps growing, so I need something efficient.

TRY:
result = df.groupby((df.Condition != df.Condition.shift()).cumsum()).apply(
    lambda x: x.rolling(len(x)).mean().dropna()).reset_index(drop=True)
print(result.loc[result.Condition.eq(1)]) # filter by required condition
OUTPUT:
Condition Temp. Pressure
0 1.0 8.0 21.5
2 1.0 10.0 21.0
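A note on an alternative (not part of the answer above): since rolling(len(x)).mean().dropna() simply collapses each consecutive block to its mean, a plain groupby + mean over the same run-length key gives the same result and avoids the per-group rolling. A minimal sketch:

# group key: increments every time Condition changes value
grp = df.Condition.ne(df.Condition.shift()).cumsum().rename('block')

# mean of each consecutive block, keeping only the blocks where Condition == 1
result = (df.groupby(grp).mean()
            .loc[lambda x: x.Condition.eq(1)]
            .reset_index(drop=True))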

Related

Calculating growth index of stock performance which is reset every four years

I am analysing a stock index in 4-year cycles and would like to start with a base of 1 at the beginning of the first year and then calculate the returns on top, so that I get a column in the dataframe that goes 1, 1.02, 1.03, 1.025...
The indexed-return calculation would be (todaysValue / yesterdaysValue) * yesterdaysIndexValue.
The df looks like this:
Datetimeindex Stockindex CycleYear Diff Daynumber
01.01.1968 96.47 0 1 1
...
03.01.1972 101.67 0 1 1
...
06.09.2022 3908.19 2 0 699
07.09.2022 3979.87 2 0 700
08.09.2022 4006.18 2 0 701
I would now like to add a column df['Retindex'] that starts every 4 years at 1 and calculates the indexed-returns until the end of year 4.
I have created the column so that it has a 1 at the start of each cycle:
df['Retindex'] = df['Daynumber'].loc[df['Daynumber'] == 1]
Then I tried creating the rest of the index with this:
for id in df[df['Retindex'].isnull()].index:
    df.loc[id, 'Retindex'] = (df['Stockindex'] / df['Stockindex'].shift().loc[id]) \
                             * df['Retindex'].shift().loc[id]
Here I am getting the error: "ValueError: Incompatible indexer with Series"
I have tried other ways as well but I am unfortunately not progressing on this. Can anyone help?
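There is no answer in this thread, but a vectorized sketch of the idea (an assumption on my part, using Daynumber == 1 as the marker for the first day of each 4-year cycle) would avoid both the row loop and the incompatible-indexer error:

# cycle id: increments at each row where Daynumber restarts at 1
cycle = (df['Daynumber'] == 1).cumsum()

# daily ratio todaysValue / yesterdaysValue, forced to 1.0 on the first day of each cycle
ratio = (df['Stockindex'] / df['Stockindex'].shift()).where(df['Daynumber'] != 1, 1.0)

# cumulative product of the ratios within each cycle gives the indexed return
df['Retindex'] = ratio.groupby(cycle).cumprod()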

How can I find the highest value between rows every time that they meet a certain condition?

I have been struggling with a problem with my DataFrame, built in pandas, which currently looks like this:
MyDataFrame:
Index Status Value
0 A 10
1 A 8
2 A 5
3 B 9
4 B 5
5 A 1
6 B 2
7 A 3
8 A 5
9 A 1
The desired output would be:
Index Status Value
0 A 10
1 B 9
2 A 1
3 B 2
4 A 5
So far I have tried to use range and while conditions to filter. However, if I put in a conditional like:
for i in range(len(Status)):
    if Status[i] == "A":
        print(Value[i])
    if Status[i] == "B":
        break
The code above is more an example of what I have been trying in order to reach my goal; I tried to use .iloc and range with while, but maybe in the wrong way.
The desired output isn't printed.
One thing that complicates this filtering process is that MyDataFrame changes every time I run the script, since it is built from a different underlying dataset each time.
I believe I'm missing something simple, but it has been almost a week and I can't figure it out.
Thanks in advance for all your answers and support.
Let us try using shift with cumsum to create the groupby key, then it is groupby + agg:
out = df.groupby(df.Status.ne(df.Status.shift()).cumsum()).agg({'Status':'first','Value':'max'})
Out[14]:
Status Value
Status
1 A 10
2 B 9
3 A 1
4 B 2
5 A 5
Very close to #BEN_YO:
grp = (df['Status'] != df['Status'].shift()).cumsum()
df.loc[df.groupby(grp)['Value'].idxmax()]
Output:
Status Value
Index
0 A 10
3 B 9
5 A 1
6 B 2
8 A 5
Create groups using shift and inequality with cumsum, then group by and find the index of the max value of 'Value' with idxmax, and filter the dataframe using loc.
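If the renumbered 0..4 index shown in the desired output is also wanted, a reset_index can be chained onto the answer above (a small addition, not part of the original answer):

out = df.loc[df.groupby(grp)['Value'].idxmax()].reset_index(drop=True)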

Setting a value to a cell in a pandas dataframe

I have the following pandas dataframe:
K = pd.DataFrame({"A":[1,2,3,4], "B":[5,6,7,8]})
Then I set the cell in the first row and first column to 11:
K.iloc[0]["A"] = 11
And when I check the dataframe again, I see that the value assignment is done and K.iloc[0]["A"] is equal to 11. However, when I add a column to this dataframe and do the same operation for a cell in the new column, the value assignment is not successful:
K["C"] = 0
K.iloc[0]["C"] = 11
So, when I check the dataframe again, the value of K.iloc[0]["C"] is still zero. I would appreciate it if somebody could tell me what is going on here and how I can resolve this issue.
For simplicity, I would do the operations in a different order and use loc:
K['C'] = 0
K.loc[0, ['A', 'C']] = 11
When you use K.iloc[0]["C"], you first take the first row, so you have a copy of a slice of your dataframe, and then you take the column C. So you change the copy of the slice, not the original dataframe.
That your first call, K.iloc[0]["A"] = 11, worked fine was in some sense just luck.
The good habit is to use loc in "one shot", so you operate on the original dataframe rather than on a copy of a slice:
K.loc[0,"C"] = 11
Be careful: iloc and loc are different functions, even if they seem quite similar here.
If the index is the default RangeIndex, it is possible to use DataFrame.loc, but it sets values by label 0 (which here is the same as position 0):
K['C'] = 0
K.loc[0, ["A", "C"]] = 11
print (K)
A B C
0 11 5 11
1 2 6 0
2 3 7 0
3 4 8 0
The reason why your solution failed can be found in the docs:
This can work at times, but it is not guaranteed to, and therefore should be avoided:
dfc['A'][0] = 111
A solution with DataFrame.iloc is possible by getting the positions of the columns with Index.get_indexer:
print (K.columns.get_indexer(["A", "C"]))
[0 2]
K['C'] = 0
K.iloc[0, K.columns.get_indexer(["A", "C"])] = 11
print (K)
A B C
0 11 5 11
1 2 6 0
2 3 7 0
3 4 8 0
loc should work:
K.loc[0, 'C'] = 11
This version of loc assigns directly to the dataframe K. (The chained form K.loc[0]['C'] = 11 is another chained assignment and, like the iloc version above, is not guaranteed to modify K.)

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
import statistics
import pandas as pd

numbers = [55, 6, 77, 75, 9, 127, 13]
finallist = pd.DataFrame(numbers)
finallist.columns = ['Numbers']
The line below gives me the average of rows 0:2 of the Numbers column, so selecting the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected .iloc[0:2].shift(1) to shift the mean function down 1 row but still apply it to 2 rows in total, but I got a value of NaN.
What's happening in your shift(1) approach is that you're shifting the values in your selection "down" by one position, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evaluates to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
Numbers Averages
0 55 55.0
1 6 30.5
2 77 41.5
3 75 76.0
4 9 42.0
5 127 68.0
6 13 70.0
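One caveat worth noting (my addition, not part of the answer above): rolling(2) looks backwards, so row 1 holds the average of rows 0 and 1, whereas the Excel formula dragged down from B1 averages the current row and the next one. If that forward-looking alignment is wanted, a simple sketch is to shift the backward result up by one:

# average of the current row and the next row, mirroring =AVERAGE(A1:A2) dragged down
df['Averages'] = df['Numbers'].rolling(2).mean().shift(-1)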

Efficiently concatenate a large number of columns

I am trying to concatenate a large number of columns containing integers into one string.
Basically, starting from:
df = pd.DataFrame({'id':[1,2,3,4],'a':[0,1,2,3], 'b':[4,5,6,7], 'c':[8,9,0,1]})
To obtain:
id join
0 1 481
1 2 592
2 3 603
3 4 714
I found several methods to do this (here and here):
Method 1:
# conc holds the result
conc['glued'] = ''
i = 1
while i < len(df.columns):
    conc['glued'] = conc['glued'] + df[df.columns[i]].values.astype(str)
    i = i + 1
This method works, but it is a bit slow (45 min on my "test" case of 18,000 rows x 40,000 columns). I am concerned by the loop over the columns, as this program will eventually be applied to tables of 600,000 columns and I am afraid it will be too slow.
Method 2a
conc['join']=[''.join(row) for row in df[df.columns[1:]].values.astype(str)]
Method 2b
conc['apply'] = df[df.columns[1:]].apply(lambda x: ''.join(x.astype(str)), axis=1)
Both of these methods are 10 times more efficient than the previous one, they iterate over rows (which is what I want), and they work perfectly on my "debug" table df. But when I apply either of them to my "test" table of 18k x 40k, it leads to a MemoryError (I already have 60% of my 32 GB of RAM occupied after reading the corresponding csv file).
I can copy my DataFrame without exceeding the memory, but curiously, applying this method makes the code crash.
Do you see how I can fix and improve this code to use efficient row-based iteration? Thank you!
Appendix:
Here is the code I use on my test case:
geno_reader = pd.read_csv(genotype_file,header=0,compression='gzip', usecols=geno_columns_names)
fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})
I should use the chunksize option to read this file, but I haven't yet really understood how to use the chunks after reading.
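For reference, a minimal sketch of how chunksize could be used here (an assumption on my part, reusing the names from this appendix; the block size of 1,000 rows is arbitrary): read_csv with chunksize returns an iterator of smaller DataFrames, so the per-row join from Method 2 can be applied block by block and the small results concatenated at the end.

pieces = []
for chunk in pd.read_csv(genotype_file, header=0, compression='gzip',
                         usecols=geno_columns_names, chunksize=1000):
    # same per-row join as Method 2, but only on a small block of rows at a time
    calls = chunk[chunk.columns[1:]].apply(
        lambda x: ''.join(x.astype(int).astype(str)), axis=1)
    pieces.append(pd.DataFrame({'SampID': chunk['SampID'], 'Calls': calls}))
fimpute_geno = pd.concat(pieces, ignore_index=True)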
Method 1:
fimpute_geno['Calls'] = ''
for i in range(1, len(geno_reader.columns)):
    fimpute_geno['Calls'] = fimpute_geno['Calls'] \
        + geno_reader[geno_reader.columns[i]].values.astype(int).astype(str)
This works in 45 min.
There is a rather ugly piece of code in there, the .astype(int).astype(str): I don't know why pandas doesn't recognize my integers and treats them as floats.
Method 2:
fimpute_geno['Calls'] = geno_reader[geno_reader.columns[1:]] \
    .apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)
This leads to a MemoryError.
Here's something to try. It would require that you convert your columns to strings, though. Your sample frame:
b c id
0 4 8 1
1 5 9 2
2 6 0 3
3 7 1 4
then
#you could also do this conc[['b','c','id']] for the next two lines
conc.ix[:,'b':'id'] = conc.ix[:,'b':'id'].astype('str')
conc['join'] = np.sum(conc.ix[:,'b':'id'],axis=1)
Would give
a b c id join
0 0 4 8 1 481
1 1 5 9 2 592
2 2 6 0 3 603
3 3 7 1 4 714
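A note for current pandas versions (my addition, not part of the answer above): .ix has since been removed, so the same string-sum idea can be written with plain column selection. A sketch on the sample frame above:

cols = ['b', 'c', 'id']
conc[cols] = conc[cols].astype(str)
conc['join'] = conc[cols].sum(axis=1)   # summing string columns concatenates them
# or, without converting the columns in place:
# conc['join'] = conc[['b', 'c', 'id']].astype(str).agg(''.join, axis=1)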
