Pandas shift with DatetimeIndex takes too long to run - python-3.x

I have a run-time issue when shifting a large dataframe with a datetime index.
Example using generated dummy data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,
                   'col3': list(np.random.randint(0,100000,14*10**5)),
                   'col2': list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3 = df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'], keep='first', inplace=True)
If I shift without using the datetime index, it only takes about 12 s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use the datetime index, which my situation requires, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) up to shift(6) and have to merge them back onto the original data by col2 and col3, so I use a for loop and merge.
Is there any solution for this? Thanks in advance; any response is much appreciated.
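For context, a rough sketch of the loop I use now (the merge part is omitted; it is illustrative, not my exact code):
# Repeat the slow freq-based shift for each lag 1..6,
# then merge every result back onto the original data by col3 and col2.
lags = {}
for i in range(1, 7):
    lags[i] = (df.set_index('col2')
                 .groupby('col3')['col1']
                 .shift(i, freq='M', fill_value=0))   # ~40 minutes per lag
# ... followed by six merges on ['col3', 'col2'] ...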
Ben's answer solves it:
%%time
tmp = (df1[['col1', 'col3', 'col2']]
       .assign(col2=lambda x: x['col2'] + MonthEnd(2))
       .set_index(['col3', 'col2'])
       .add_suffix(f'_{2}')
       .fillna(0)
       .reindex(pd.MultiIndex.from_frame(df1[['col3', 'col2']]))
       .reset_index())
Wall time: 5.94 s
It also works applied to the loop:
%%time
res = (pd.concat([df1.assign(col2=lambda x: x['col2'] + MonthEnd(i))
                     .set_index(['col3', 'col2'])
                     .add_suffix(f'_{i}')
                  for i in range(0, 7)],
                 axis=1)
         .fillna(0)
         .reindex(pd.MultiIndex.from_frame(df1[['col3', 'col2']]))
         .reset_index())
Wall time: 1min 44s
Actually, my real data already uses MonthEnd(0), so I only loop over range(1,7). I also apply it to multiple columns, so I skip the astype and use reindex because I need a left merge; see the sketch below.
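Roughly, the adapted version looks like this (a sketch; the value columns are placeholders for my real columns, and MonthEnd comes from pandas.tseries.offsets):
from pandas.tseries.offsets import MonthEnd

value_cols = ['col1']   # several value columns in the real data
res = (pd.concat([df1[value_cols + ['col3', 'col2']]
                     .assign(col2=lambda x: x['col2'] + MonthEnd(i))
                     .set_index(['col3', 'col2'])
                     .add_suffix(f'_{i}')
                  for i in range(1, 7)],
                 axis=1)
         .fillna(0)
         # keep only the rows of the original data, like a left merge
         .reindex(pd.MultiIndex.from_frame(df1[['col3', 'col2']]))
         .reset_index())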

The two operations are slightly different, and the results are not the same, because your data (at least the dummy data here) is not ordered, especially if you have missing dates for some col3 values. That said, the time difference seems enormous, so I think you should approach it a bit differently.
One way is to add X MonthEnds to col2 for X from 0 to 6, concat all of them, set_index on col3 and col2, and add_suffix to keep track of the "shift" value. Then fillna and convert the dtype back to the original one. The rest is mostly cosmetic, depending on your needs.
from pandas.tseries.offsets import MonthEnd
res = (
    pd.concat([
        df.assign(col2=lambda x: x['col2'] + MonthEnd(i))
          .set_index(['col3', 'col2'])
          .add_suffix(f'_{i}')
        for i in range(0, 7)],
        axis=1)
    .fillna(0)
    # depends on your original data
    .astype(df['col1'].dtype)
    # if you want a left merge ordered like original df
    # .reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
    # if you want col2 and col3 back as columns
    # .reset_index()
)
Note that concat does an outer join by default, so you end up with months that were not in your original data, and col1_0 is actually the original data (with my random numbers).
print(res.head(10))
col1_0 col1_1 col1_2 col1_3 col1_4 col1_5 col1_6
col3 col2
0 2020-01-31 7 0 0 0 0 0 0
2020-02-29 8 7 0 0 0 0 0
2020-03-31 2 8 7 0 0 0 0
2020-04-30 3 2 8 7 0 0 0
2020-05-31 4 3 2 8 7 0 0
2020-06-30 12 4 3 2 8 7 0
2020-07-31 13 12 4 3 2 8 7
2020-08-31 0 13 12 4 3 2 8
2020-09-30 0 0 13 12 4 3 2
2020-10-31 0 0 0 13 12 4 3

This is an issue with groupby + shift. The problem is that if you specify an axis other than 0 or a frequency, it falls back to a very slow Python loop over the groups. If neither of those is specified, it can use a much faster path, which is why you see an order-of-magnitude difference in performance.
The relevant code for DataFrameGroupBy.shift is:
def shift(self, periods=1, freq=None, axis=0, fill_value=None):
    """..."""
    if freq is not None or axis != 0:
        return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))
Previously this issue extended to specifying a fill_value as well.
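To make the two paths concrete (the calls are the ones from the question; the timings in the comments are the ones reported there):
# Fast path: vectorized per-group shift, no freq -- about 12 s on the dummy data above.
fast = df.groupby('col3')['col1'].shift(2, fill_value=0)

# Slow path: freq triggers the apply fallback, i.e. a Python-level loop over the
# roughly 100,000 col3 groups -- about 40 minutes on the same data.
slow = df.set_index('col2').groupby('col3')['col1'].shift(2, freq='M', fill_value=0)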

Related

Sum all elements in a column in pandas

I have data in one column of a pandas dataframe.
1-2 3-4 8-9
4-5 6-2
3-1 4-2 1-4
I need to sum all the numbers in each cell of that column.
I tried applying the logic below, but it doesn't work for a list of lists.
lst = []
str = '5-7 6-1 6-3'
str2 = str.split(' ')
for ele in str2:
    lst.append(ele.split('-'))
print(lst)
sum(lst)  # TypeError: cannot add an int and a list
Can anyone please let me know the simplest method?
My expected result should be:
27
17
15
I think we can do a split
df.col.str.split(' |-').map(lambda x : sum(int(y) for y in x))
Out[149]:
0 27
1 17
2 15
Name: col, dtype: int64
Or
pd.DataFrame(df.col.str.split(' |-').tolist()).astype(float).sum(1)
Out[156]:
0 27.0
1 17.0
2 15.0
dtype: float64
Using pd.Series.str.extractall:
df = pd.DataFrame({"col":['1-2 3-4 8-9', '4-5 6-2', '3-1 4-2 1-4']})
print (df["col"].str.extractall("(\d+)")[0].astype(int).groupby(level=0).sum())
0 27
1 17
2 15
Name: 0, dtype: int32
Use .str.extractall and sum on a level:
df['data'].str.extractall(r'(\d+)').astype(int).sum(level=0)
Output:
0
0 27
1 17
2 15
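As a side note, passing level= to sum is deprecated in newer pandas (1.3+) and removed in 2.0; the equivalent groupby spelling of the same operation is:
df['data'].str.extractall(r'(\d+)').astype(int).groupby(level=0).sum()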
A for loop works fine here, and should be performant, since we are dealing with strings:
Using #HenryYik's sample data:
df.assign(sum_=[sum(int(n) for n in ent if n.isdigit())
                for ent in df.col])
Out[1329]:
col sum_
0 1-2 3-4 8-9 27
1 4-5 6-2 17
2 3-1 4-2 1-4 15
I'd hazard that it will be faster to pull the data out and work in plain Python before going back to the pandas dataframe.

Doubts about pandas filtering data row by row

How can I solve this issue in pandas? I have a dataframe of the following form:
datetime64ns           type(int)   datetime64ns(analysis)
2019-02-02T10:02:05    4
2019-02-02T10:02:01    3
2019-02-02T10:02:02    4           2019-02-02T10:02:02
2019-02-02T10:02:04    3           2019-02-02T10:02:04
The goal is the following:
# pseudocode
for all the rows:
    if datetime(analysis) exists and type == 4:
        set a new column type4 = 1 for that row
    elif datetime(analysis) exists and type == 2:
        set a new column type2 = 1 for that row
The idea is to use this afterwards for a group-by count. I'm sure it is possible because I managed to do it in the past, but I lost my .py file. Thanks for your attention.
Need this?
df = pd.concat([df, pd.get_dummies(df['type(int)'].mask(
df['datetime64ns(analysis)'].isna()).astype('Int64')).add_prefix('type')], 1)
OUTPUT:
datetime64ns type(int) datetime64ns(analysis) type3 type4
0 2019-02-02T10:02:05 4 NaN 0 0
1 2019-02-02T10:02:01 3 NaN 0 0
2 2019-02-02T10:02:02 4 2019-02-02T10:02:02 0 1
3 2019-02-02T10:02:04 3 2019-02-02T10:02:04 1 0
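If the final goal is the group-by count mentioned in the question, the dummy columns can simply be summed; a minimal sketch, assuming the df produced above:
# number of rows of each type for which the analysis datetime exists
counts = df[['type3', 'type4']].sum()
print(counts)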

Combining the respective columns from 2 separate DataFrames using pandas

I have 2 large DataFrames with the same set of columns but different values. I need to combine the values of the respective columns (A and B here; there may be more in the actual data) into single values in the same columns (see the required output below). I have a quick way of doing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement it strictly with pandas. The criteria here are first readability of the code, then time complexity.
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)

conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
The criteria here are first readability of the code
Another simple way is using add and radd
df1.astype(str).add(df2.astype(str).radd('-'))
A B
0 1-10 5-50
1 2-20 4-40
2 3-30 3-30
3 4-40 2-20
4 5-50 1-10
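If the exact '_' separator from the required output is wanted, another pandas-only spelling is plain elementwise string addition; a short sketch:
required = df1.astype(str) + '_' + df2.astype(str)
print(required)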

Selective multiplication of a pandas dataframe

I have a pandas Dataframe and Series of the form
df = pd.DataFrame({'Key': [2345, 2542, 5436, 2468, 7463],
                   'Segment': [0] * 5,
                   'Values': [2, 4, 6, 6, 4]})
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 4
2 5436 0 6
3 2468 0 6
4 7463 0 4
s = pd.Series([5436, 2345])
print (s)
0 5436
1 2345
dtype: int64
In the original df, I want to multiply the 3rd column (Values) by 7, except for the rows whose keys are present in the series. So my final df should look like:
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
What would be the best way to achieve this in Python 3.x?
Use DataFrame.loc with Series.isin to filter the Values column, with the condition inverted for non-membership, and multiply by a scalar:
df.loc[~df['Key'].isin(s), 'Values'] *= 7
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
Another method could be using numpy.where():
df['Values'] *= np.where(~df['Key'].isin([5436, 2345]), 7, 1)
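A closely related spelling uses Series.where and the series s directly instead of the hard-coded list; a sketch, assuming the original unmodified df:
# keep Values where Key is in s, otherwise multiply by 7
df['Values'] = df['Values'].where(df['Key'].isin(s), df['Values'] * 7)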

Reindexing multi-index dataframes in Pandas

I am trying to reindex one multi-index dataframe based on another multi-index dataframe. For singly-indexed dfs, this works:
index1 = range(3, 7)
index2 = range(1, 11)
values = [np.random.random() for x in index1]
df = pd.DataFrame(values, index=index1, columns=["values"])
print(df)
print(df.reindex(index2, fill_value=0))
Output:
values
3 0.458003
4 0.945828
5 0.783369
6 0.784599
values
1 0.000000
2 0.000000
3 0.458003
4 0.945828
5 0.783369
6 0.784599
7 0.000000
8 0.000000
9 0.000000
10 0.000000
New rows are added based on index2, and their values are set to 0. This is what I expect.
Now, let's try something similar for a multi-index df:
data_dict = {
    "scan": 1,
    "x": [2, 3, 5, 7, 8, 9],
    "y": [np.random.random() for x in range(1, 7)]
}
index1 = ["scan", "x"]
df = pd.DataFrame.from_dict(data_dict).set_index(index1)
print(df)
index2 = list(range(4, 13))
print(df.reindex(index2, level="x").fillna(0))
Output:
y
scan x
1 2 0.771531
3 0.451761
5 0.434075
7 0.135785
8 0.309137
9 0.838330
y
scan x
1 5 0.434075
7 0.135785
8 0.309137
9 0.838330
What gives? The output is different from the input: the first two rows (x=2 and x=3) have been dropped. But the new index values, whether intermediate (e.g., 4) or larger (e.g., 10 and up), have not been added. What am I missing?
The actual dataframes have 6 index levels and tens to hundreds of rows, but I think this code captures the problem. I spent a little time looking at df.realign, df.join, and a lot of time scouring SO, but I haven't found a solution. Apologies if it's a duplicate!
Let me suggest a workaround:
print(df.reindex(pd.MultiIndex.from_product(
    [df.index.get_level_values(0).unique(), index2],
    names=['scan', 'x'])).fillna(0))
y
scan x
1 4 0.000000
5 0.718190
6 0.000000
7 0.612991
8 0.609323
9 0.991806
10 0.000000
11 0.000000
12 0.000000
Building on #Sergey's workaround, here's what I ended up with. I expanded the example to have more levels, more closely replicating my own data.
Generate a df:
from datetime import datetime

data_dict = {
    "sample": "A",
    "scan": 1,
    "meas_time": datetime.now(),
    "x": [2, 3, 5, 7, 8, 9],
    "y": [np.random.random() for x in range(1, 7)]
}
index1 = ["sample", "scan", "meas_time", "x"]
df = pd.DataFrame.from_dict(data_dict).set_index(index1)
print(df)
Try to reindex:
index2 = range(4, 13)
print(df.reindex(labels=index2, level="x").fillna(0))
Implementing Sergey's workaround:
df.reindex(
    pd.MultiIndex.from_product(
        [df.index.get_level_values("sample").unique(),
         df.index.get_level_values("scan").unique(),
         df.index.get_level_values("meas_time").unique(),
         index2],
        names=["sample", "scan", "meas_time", "x"])
).fillna(0)
Notes: if .unique() isn't included, a multiple (product?!?) of the dataframe is calculated for each level. This is likely why my kernel crashed previously; I wasn't including .unique().
This seems like really odd pandas behavior. I also found a workaround which involved chaining .reset_index().set_index("x").reindex("blah").set_index([list]). I'd really like to know why reindexing is treated the way it is.
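For data with many index levels, the same idea can be written once as a small helper instead of listing every level by hand; a sketch under that assumption (the helper name is made up):
def reindex_level(df, level, new_labels, fill_value=0):
    # keep the unique labels of every other level, replace only `level`
    labels = [new_labels if name == level
              else df.index.get_level_values(name).unique()
              for name in df.index.names]
    full = pd.MultiIndex.from_product(labels, names=df.index.names)
    return df.reindex(full, fill_value=fill_value)

# e.g. reindex_level(df, "x", range(4, 13))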
