Computing 10d rolling average on a descending date column in pandas [duplicate] - python-3.x

Suppose I have a time series:
In [137]: import numpy as np
In [138]: rng = pd.date_range('1/10/2011', periods=10, freq='D')
In [139]: ts = pd.Series(np.arange(len(rng)), index=rng)
In [140]: ts
Out[140]:
2011-01-10 0
2011-01-11 1
2011-01-12 2
2011-01-13 3
2011-01-14 4
2011-01-15 5
2011-01-16 6
2011-01-17 7
2011-01-18 8
2011-01-19 9
Freq: D, dtype: int64
If I use one of the rolling_* functions, for instance rolling_sum, I can get the behavior I want for backward looking rolling calculations:
In [157]: pd.rolling_sum(ts, window=3, min_periods=0)
Out[157]:
2011-01-10 0
2011-01-11 1
2011-01-12 3
2011-01-13 6
2011-01-14 9
2011-01-15 12
2011-01-16 15
2011-01-17 18
2011-01-18 21
2011-01-19 24
Freq: D, dtype: float64
But what if I want to do a forward-looking sum? I've tried something like this:
In [161]: pd.rolling_sum(ts.shift(-2, freq='D'), window=3, min_periods=0)
Out[161]:
2011-01-08 0
2011-01-09 1
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
Freq: D, dtype: float64
But that's not exactly the behavior I want. What I am looking for as an output is:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
i.e., I want the sum of the "current" day plus the next two days. My current solution is not sufficient because I care about what happens at the edges. I know I could solve this manually by setting up two additional columns that are shifted by 1 and 2 days respectively and then summing the three columns, but there's got to be a more elegant solution.
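For reference, a sketch of that manual workaround (shifted copies added together, with values that run off the end treated as zero) reproduces the desired edge behaviour:
ts + ts.shift(-1).fillna(0) + ts.shift(-2).fillna(0)  # current day plus the next two days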

Why not just do it on the reversed Series (and reverse the answer):
In [11]: pd.rolling_sum(ts[::-1], window=3, min_periods=0)[::-1]
Out[11]:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
Freq: D, dtype: float64

I struggled with this, then found an easy way using shift.
If you want a rolling sum for the next 10 periods, try:
df['NewCol'] = df['OtherCol'].shift(-10).rolling(10, min_periods = 0).sum()
We use shift so that "OtherCol" shows up 10 rows ahead of where it normally would be, then we do a rolling sum over the previous 10 rows. Because we shifted, the previous 10 rows are actually the future 10 rows of the unshifted column. :)
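Note that, if I'm reading the offsets right, this form sums the next 10 rows excluding the current one. If the current row should be included (as in the three-day example above), shift by window - 1 instead; a sketch for a window of 3, using the ts from the question:
ts.shift(-(3 - 1)).rolling(3, min_periods=1).sum()  # current day plus the next two days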

Pandas added a feature that enables forward-looking rolling windows; you have to upgrade to pandas 1.1.0 or later to get it.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
ts.rolling(window=indexer, min_periods=1).sum()
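For what it's worth, the same indexer object should work with any rolling aggregation, not just sum; for example, a forward-looking three-day mean (a sketch):
fwd_indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
ts.rolling(window=fwd_indexer, min_periods=1).mean()  # forward-looking 3-day mean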

Maybe you can try the bottleneck module. When ts is large, bottleneck is much faster than pandas:
import bottleneck as bn
result = bn.move_sum(ts[::-1], window=3, min_count=1)[::-1]
bottleneck also has other moving-window functions, such as move_max, move_argmin, and move_rank.

Try this one for a rolling window of 3:
window = 3
ts.rolling(window).sum().shift(-window + 1)
(Note that with this approach the last window - 1 values come out as NaN rather than the partial sums shown in the question, since the shift pulls in positions past the end of the series.)

Related

Sum all elements in a column in pandas

I have data in one column of a Python DataFrame:
1-2 3-4 8-9
4-5 6-2
3-1 4-2 1-4
I need to sum all the numbers within each row of that column.
I tried to apply the logic below, but it doesn't work for a list of lists.
lst = []
str = '5-7 6-1 6-3'
str2 = str.split(' ')
for ele in str2:
    lst.append(ele.split('-'))
print(lst)
sum(lst)
Can anyone please let me know the simplest method?
My expected result should be:
27
17
15
I think we can do a split
df.col.str.split(' |-').map(lambda x : sum(int(y) for y in x))
Out[149]:
0 27
1 17
2 15
Name: col, dtype: int64
Or
pd.DataFrame(df.col.str.split(' |-').tolist()).astype(float).sum(1)
Out[156]:
0 27.0
1 17.0
2 15.0
dtype: float64
Using pd.Series.str.extractall:
df = pd.DataFrame({"col":['1-2 3-4 8-9', '4-5 6-2', '3-1 4-2 1-4']})
print(df["col"].str.extractall(r"(\d+)")[0].astype(int).groupby(level=0).sum())
0 27
1 17
2 15
Name: 0, dtype: int32
Use .str.extractall and sum on a level:
df['data'].str.extractall(r'(\d+)').astype(int).sum(level=0)
Output:
0
0 27
1 17
2 15
A for loop works fine here, and should be performant, since we are dealing with strings:
Using @HenryYik's sample data:
df.assign(sum_=[sum(int(n) for n in ent if n.isdigit())
                for ent in df.col])
Out[1329]:
col sum_
0 1-2 3-4 8-9 27
1 4-5 6-2 17
2 3-1 4-2 1-4 15
I'd hazard that it will be faster to pull the data out and do the work in plain Python before handing the result back to the pandas dataframe.
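A sketch of that idea, assuming the column is named col as in the sample data above:
# do the string work in plain Python, then hand the result back to pandas
sums = [sum(int(tok) for tok in s.replace('-', ' ').split()) for s in df['col']]
df['sum_'] = sums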

For loop and if-else combination takes a lot of time in Pandas (data manipulation)

I have two datasets, each with about half a million observations. I wrote the code below and it never seems to finish executing. I would like to know if there is a better way of doing it. Appreciate any inputs.
Below are sample formats of my dataframes. Both dataframes share a set of 'sid' values, meaning all the 'sid' values in df2 will have a match in df1's 'sid' values. The 'tid' values, and consequently the 'rid' values (which are a combination of the 'sid' and 'tid' values), may not appear in both sets.
The task is simple. I would like to create the 'tv' column in df2. Wherever the 'rid' in df2 matches an 'rid' in df1, the 'tv' column in df2 takes the corresponding 'tv' value from df1. If it does not match, the 'tv' value in df2 will be the median 'tv' value for the matching 'sid' subset in df1.
In fact my original task includes creating a few more columns like 'tv' in df2 (based on their values in df1; these columns exist in df1).
I believe that because my code contains a for loop combined with an if-else statement and multiple assignment statements, it is taking forever to execute. Appreciate any inputs.
df1
sid tid rid tv
0 0 0 0-0 9
1 0 1 0-1 8
2 0 3 0-3 4
3 1 5 1-5 2
4 1 7 1-7 3
5 1 9 1-9 14
6 1 10 1-10 24
7 1 11 1-11 13
8 2 14 2-14 2
9 2 16 2-16 5
10 3 17 3-17 6
11 3 18 3-18 8
12 3 20 3-20 5
13 3 21 3-21 11
14 4 23 4-23 6
df2
sid tid rid
0 0 0 0-0
1 0 2 0-2
2 1 3 1-3
3 1 6 1-6
4 1 9 1-9
5 2 10 2-10
6 2 12 2-12
7 3 1 3-1
8 3 15 3-15
9 3 1 3-1
10 4 19 4-19
11 4 22 4-22
rids = [rid.split('-') for rid in df1.rid]
for r in df2.rid:
    s, t = r.split('-')
    if [s, t] in rids:
        df2.loc[df2.rid == r, 'tv'] = df1.loc[df1.rid == r, 'tv'].values
    else:
        df2.loc[df2.rid == r, 'tv'] = df1.loc[df1.sid == int(s), 'tv'].median()
The expected df2 shall be as follows:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
You can left-merge df2 with a subset of df1 on 'rid' (only the 'tv' column is needed, though you could also pass df1 without taking a subset), then calculate the per-sid medians and fill the missing values:
out = df2.merge(df1[['rid', 'tv']], on='rid', how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
OR
Since you said that:
all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values
So you can also left-merge them on ['sid', 'rid'] and then fillna() the missing 'tv' values with the per-sid median of df1's 'tv' column, mapped in with the map() method:
out = df2.merge(df1, on=['sid', 'rid'], how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out = out.drop('tid_y', axis=1).rename(columns={'tid_x': 'tid'})
out
output of out:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
Here is a suggestion without any loops, based on dictionaries:
matching_values = dict(zip(df1['rid'], df1['tv']))
median_values = df1.groupby('sid')['tv'].median().to_dict()

matched = df2['rid'].isin(df1['rid'])
df2.loc[matched, 'tv'] = df2.loc[matched, 'rid'].replace(matching_values)
df2.loc[~matched, 'tv'] = df2.loc[~matched, 'sid'].replace(median_values)
This should do the trick. The logic here is that we first build two dictionaries, one keyed by the 'rid' values with the matching 'tv' values, and one keyed by the 'sid' values with the median 'tv' values. We then fill the 'tv' column in df2 from the 'rid' keys for the exact matches and from the 'sid' keys for the rest, letting .replace() swap the keys for the correct 'tv' values.
Don't use for loops in pandas; they are known to be slow, and you miss out on all the internal optimizations that have been made.
Try to use the split-apply-combine pattern:
split df1 by 'sid' to calculate the medians: df1.groupby('sid')['tv'].median()
join df2 on df1: df2.join(df1.set_index('rid'), on='rid')
fill the NaN values with the medians calculated in step 1.
(Haven't tested the code.)
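A minimal runnable sketch of those three steps (joining only the 'tv' column, to avoid clashing with the 'sid' and 'tid' columns that exist in both frames):
medians = df1.groupby('sid')['tv'].median()             # step 1: per-sid medians
out = df2.join(df1.set_index('rid')['tv'], on='rid')    # step 2: bring in matching 'tv' by 'rid'
out['tv'] = out['tv'].fillna(out['sid'].map(medians))   # step 3: fill the gaps with the medians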

How to find the index again after pivoting a dataframe?

I created a dataframe from a csv file containing data on the number of deaths by year (running from 1946 to 2021) and month (within year):
dataD = pd.read_csv('MY_FILE.csv', sep=',')
The first rows (out of 902...) of the output are:
dataD
Year Month Deaths
0 2021 2 55500
1 2021 1 65400
2 2020 12 62800
3 2020 11 64700
4 2020 10 56900
As expected, the dataframe contains an index numbered 0,1,2, ... and so on.
Now, I pivot this dataframe in order to have only 1 row by year and months in column, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
My question is:
What do I have to change in the previous pivoting code in order to get back an index numbered 0, 1, 2, ... etc. when I output the pivoted data? I think I need to specify index=*** in order to make the pivot instruction run, but afterwards I would like to recover an index "as usual" (if I can say), exactly like in my first dataframe dataD.
Is that possible?
You can reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
Note that the "Month" here might look like the index name but is actually the columns' name (dataDW.columns.name). You can unset it if preferred:
dataDW.columns.name = None
Which then gives you:
Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0

generate normalized discrete values for feature engineering

There is a dataframe with one column that stores discrete values (shown below). I would like to create another column storing the normalized values; for instance, for 4050, the corresponding entry would be 4. Are there any efficient ways to do that instead of writing my own function? Does sklearn have any functions for generating such normalized values?
Based on your comment:
there are around 20 different values, and the range is from 1000 to 9999, so I would like to use every 1000 as a category
This isn't really normalization in the strict sense of the word. However, to do that, you can easily use floor division (//):
df['new_column'] = df['values']//1000
For example:
>>> df
values
0 2021
1 8093
2 9870
3 4508
4 2645
5 1441
6 8888
7 8921
8 7292
9 8571
df['new_column'] = df['values']//1000
>>> df
values new_column
0 2021 2
1 8093 8
2 9870 9
3 4508 4
4 2645 2
5 1441 1
6 8888 8
7 8921 8
8 7292 7
9 8571 8

How to split rows in pandas with special condition of date?

I have a DataFrame like:
Code Date sales
1 2/2013 10
1 3/2013 11
2 3/2013 12
2 4/2013 14
...
I want to convert it into a DataFrame with a timeline, code, and sales of each type of item:
Date Code Sales1 Code Sales2
2/2013 1 10 NA NA
3/2013 1 11 2 12
4/2013 NA NA 2 14
....
or, in a simpler way:
Date Code Sales1 Date Code Sales2 .....
2/2013 1 10 3/2013 2 12
3/2013 1 11 4/2013 2 14
or, even simpler, splitting it into many small DataFrames.
IIUC, using concat with the groupby result:
df.index = df.groupby('Code').cumcount()  # create the key for concat
pd.concat([x for _, x in df.groupby('Code')], axis=1)
Out[392]:
Code Date sales Code Date sales
0 1 2/2013 10 2 3/2013 12
1 1 3/2013 11 2 4/2013 14
Actually, I was being silly splitting the data that way; I rethought it and solved the problem with pivot_table:
pd.pivot_table(df, values=['sales'], index=['Code'], columns=['Date'])
and the result should look like:
sales
Date 2/2013 3/2013 4/2013 ....
Code
1 10 11 NaN
2 NaN 12 14
...
