Python - Create copies of rows based on column value and increase date by number of iterations - python-3.x

I have a dataframe in Python:
md
Out[94]:
Key_ID ronDt multidays
0 Actuals-788-8AA-0001 2017-01-01 1.0
11 Actuals-788-8AA-0012 2017-01-09 1.0
20 Actuals-788-8AA-0021 2017-01-16 1.0
33 Actuals-788-8AA-0034 2017-01-25 1.0
36 Actuals-788-8AA-0037 2017-01-28 1.0
... ... ...
55239 Actuals-789-8LY-0504 2020-02-12 1.0
55255 Actuals-788-T11-0001 2018-08-23 8.0
55257 Actuals-788-T11-0003 2018-09-01 543.0
55258 Actuals-788-T15-0001 2019-02-20 368.0
55259 Actuals-788-T15-0002 2020-02-24 2.0
I want to create an additional record for each multiday and increase the date (ronDt) by the number of times that record was duplicated.
For example:
row[0] would repeat one time with the new date reading 2017-01-02.
row[55255] would be repeated 8 times with the corresponding dates ranging from 2018-08-24 - 2018-08-31.
When I did this in VBA, I used loops, and in Alteryx I used multirow functions. What is the best way to achieve this in Python? Thanks.

Here's a way to do it in pandas:
# build the full run of dates for each row (periods must be an integer)
df['datecol'] = df.apply(lambda x: pd.date_range(start=x['ronDt'], periods=int(x['multidays']), freq='D'), axis=1)
# expand the date lists into one row per date
df = df.explode('datecol').drop('ronDt', axis=1)
# rename the columns
df.rename(columns={'datecol': 'ronDt'}, inplace=True)
print(df)
Key_ID multidays ronDt
0 Actuals-788-8AA-0001 1.0 2017-01-01
1 Actuals-788-8AA-0012 1.0 2017-01-09
2 Actuals-788-8AA-0021 1.0 2017-01-16
3 Actuals-788-8AA-0034 1.0 2017-01-25
4 Actuals-788-8AA-0037 1.0 2017-01-28
.. ... ... ...
8 Actuals-788-T15-0001 368.0 2020-02-20
8 Actuals-788-T15-0001 368.0 2020-02-21
8 Actuals-788-T15-0001 368.0 2020-02-22
9 Actuals-788-T15-0002 2.0 2020-02-24
9 Actuals-788-T15-0002 2.0 2020-02-25

# Get the duplication count for each row, which corresponds to the multidays column
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'multidays'})
# Assuming ronDt is stored as str, convert it to datetime,
# then add the multidays offset to ronDt
df['ronDt_new'] = pd.to_datetime(df['ronDt']) + pd.to_timedelta(df['multidays'], unit='d')
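An alternative sketch (assuming `multidays` always holds whole numbers) that avoids `apply` entirely: repeat each row with `Index.repeat`, then offset `ronDt` by each copy's position within its original row.

```python
import pandas as pd

# Two illustrative rows from the question's dataframe
df = pd.DataFrame({
    "Key_ID": ["Actuals-788-8AA-0001", "Actuals-788-T11-0001"],
    "ronDt": pd.to_datetime(["2017-01-01", "2018-08-23"]),
    "multidays": [1.0, 8.0],
})

# Repeat each row `multidays` times, then shift ronDt by the copy's
# position within its original row (0, 1, 2, ...); .to_numpy() avoids
# index alignment on the duplicated index.
out = df.loc[df.index.repeat(df["multidays"].astype(int))].copy()
offset = out.groupby(level=0).cumcount().to_numpy()
out["ronDt"] = out["ronDt"] + pd.to_timedelta(offset, unit="D")
out = out.reset_index(drop=True)
```

Like the `explode` answer above, this yields `multidays` rows per record, starting at the original date.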

Related

how to update rows based on previous row of dataframe python

I have time series data, given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data; I have added a simplified version with just the two columns {price, amount}. I am trying to transform it relative to the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative changes of each product based on the time index. If the previous date does not exist for a given product, I add NaN.
Is there a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
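For completeness, a self-contained reproduction of the `groupby(...).diff()` one-liner on the question's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-01", "2019-11-02", "2019-11-03",
                            "2019-11-04", "2019-11-05"]),
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 10, 25, 40, 50],
    "amount": [20, 20, 15, 50, 60],
})

# Within each product, subtract the previous row; the first row of
# each group has no predecessor, so it becomes NaN.
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
```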

Fill missing value in different columns of dataframe using mean or median of last n values

I have a dataframe which contains time series data. What I want to do is efficiently fill all the missing values in different columns by substituting a median value computed over a timedelta of, say, N minutes. E.g. if for a column I have data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23, and 10:24.
One way I can do this is:
for each column in the dataframe:
    find the indices with NaN values
    for each such index:
        extract all values using between_time(index - timedelta, index + timedelta)
        find the median of the extracted values
        set the value at that index to the median
This runs two nested loops and is not very efficient. Is there an efficient way to do it?
Thanks
IIUC you can resample your time column, then fillna with rolling window set to center:
# dummy data setup
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print (df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
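Note the answer fills with a rolling mean while the question asks for a median; swapping in `.median()` is a one-word change. A sketch under the same dummy-data assumptions:

```python
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2  # look +/- n minutes around each gap
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i:02d}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15).astype(float)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)

# Regularize to one row per minute (gaps become NaN), then fill each
# NaN with the median of a centered window of 2n+1 minutes.
s = df.set_index("time").resample("60s").asfreq()
filled = s.fillna(s.rolling(n * 2 + 1, min_periods=1, center=True).median())
```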

Iterate over rows in a data frame create a new column then adding more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output I need takes the quantity difference between two consecutive dates, averages it over 24 hours, and creates 23 columns in which each hourly quantity difference is added to the previous column, as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate with a loop but it's not working; my code is as below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity')) / 24
    for j in range(24):
        df[i, [1+j]] = df[i, [j]] * (1 + diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s = df.set_index('Date').resample('1 H').interpolate()
s = pd.pivot_table(s, index=s.index.date,
                   columns=s.groupby(s.index.date).cumcount(), aggfunc='mean')
s.columns = s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly.
for loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty) / 24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the columns required.
i.e.
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...
There are other approaches which will work way better.
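One such approach (a sketch, not taken from the answers above): compute the per-hour step once with `diff`, then build all 23 columns in a single NumPy broadcast instead of looping.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2019-04-25", "2019-04-26", "2019-04-27"]),
                   "Quantity": [100, 148, 124]})

# Hourly step toward the next day's quantity; the last row has no
# successor, so its step is 0.
step = df["Quantity"].diff(-1).mul(-1).div(24).fillna(0)

# Quantity + k * step for k = 1..23, built in one broadcast.
hours = np.arange(1, 24)
vals = df["Quantity"].to_numpy()[:, None] + step.to_numpy()[:, None] * hours
out = pd.concat([df, pd.DataFrame(vals, columns=[f"Hour-{k}" for k in hours],
                                  index=df.index)], axis=1)
```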

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply by the values from this lookup dataframe to create a column in another dataframe, which has a datetime index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to data-2 depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
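A third variant (with hypothetical miniature data, the lookup held as a Series keyed by hour): map the hour index through the lookup without a join.

```python
import pandas as pd

# Hypothetical two-hour slice of the lookup and the main frame
lookup = pd.Series([1.109248, 1.102435], index=[0, 1])
idx = pd.to_datetime(["2015-08-09 00:00:00", "2015-08-09 01:00:00"])
df = pd.DataFrame({"data-2": [2502, 1988]}, index=idx)

# Index.map looks up each row's hour in the lookup Series; .to_numpy()
# avoids index alignment when multiplying back into the frame.
df["data-3"] = df["data-2"] * df.index.hour.map(lookup).to_numpy()
```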

Pandas Cumulative Sum of Difference Between Value Counts in Two Dataframe Columns

The charts below show my basic challenge: subtract NUMBER OF STOCKS WITH DATA END from NUMBER OF STOCKS WITH DATA START. The difficulty is that the date ranges of the two series do not match, so I need to merge both sets onto a common date range, perform the subtraction, and save the results to a new comma-separated value file.
Input data in file named 'meta.csv' contains 3187 lines. Fields per line are data for ticker, start, & end. Head and tail as shown here:
0000 ticker,start,end
0001 A,1999-11-18,2016-12-27
0002 AA,2016-11-01,2016-12-27
0003 AAL,2005-09-27,2016-12-27
0004 AAMC,2012-12-13,2016-12-27
0005 AAN,1984-09-07,2016-12-27
...
3183 ZNGA,2011-12-16,2016-12-27
3184 ZOES,2014-04-11,2016-12-27
3185 ZQK,1990-03-26,2015-09-09
3186 ZTS,2013-02-01,2016-12-27
3187 ZUMZ,2005-05-06,2016-12-27
Python code and console output:
import pandas as pd
df = pd.read_csv('meta.csv')
s = df.groupby('start').size().cumsum()
e = df.groupby('end').size().cumsum()
#s.plot(title='NUMBER OF STOCKS WITH DATA START',
#       grid=True, style='k.')
#e.plot(title='NUMBER OF STOCKS WITH DATA END',
#       grid=True, style='k.')
print(s.head(5))
print(s.tail(5))
print(e.tail(5))
OUT:
start
1962-01-02 11
1962-11-19 12
1970-01-02 30
1971-08-06 31
1972-06-01 54
dtype: int64
start
2016-07-05 3182
2016-10-04 3183
2016-11-01 3184
2016-12-05 3185
2016-12-08 3186
end
2016-12-08 544
2016-12-15 545
2016-12-16 546
2016-12-21 547
2016-12-27 3186
dtype: int64
[Chart output, produced when the plot lines above are uncommented, not shown.]
I want to create one population file with the date and number of stocks with active data which should have a head and tail shown as follows:
date,num_stocks
1962-01-02,11
1962-11-19,12
1970-01-02,30
1971-08-06,31
1972-06-01,54
...
2016-12-08,2642
2016-12-15,2641
2016-12-16,2640
2016-12-21,2639
2016-12-27,2639
The ultimate goal is to be able to plot the number of stocks with data over any specified date range by reading the population file.
To align the dates with their respective counts, I'd take the difference of pd.Series.value_counts:
df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
1984-09-07 1.0
1990-03-26 1.0
1999-11-18 1.0
2005-05-06 1.0
2005-09-27 1.0
2011-12-16 1.0
2012-12-13 1.0
2013-02-01 1.0
2014-04-11 1.0
2015-09-09 -1.0
2016-11-01 1.0
2016-12-27 -9.0
dtype: float64
Thanks to the crucial tip provided by piRSquared, I solved the challenge with this code:
import pandas as pd
df = pd.read_csv('meta.csv')
x = df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
x.iloc[-1] = 0
r = x.cumsum()
r.to_csv('pop.csv')
z = pd.read_csv('pop.csv', index_col=0, header=None)
z.plot(title='NUMBER OF STOCKS WITH DATA', legend=None,
       grid=True, style='k.')
'pop.csv' file head/tail:
1962-01-02 11.0
1962-11-19 12.0
1970-01-02 30.0
1971-08-06 31.0
1972-06-01 54.0
...
2016-12-08 2642.0
2016-12-15 2641.0
2016-12-16 2640.0
2016-12-21 2639.0
2016-12-27 2639.0
[Chart not shown.]
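The steps above can be condensed into a minimal, self-contained sketch (hypothetical three-ticker data): +1 on each start date, -1 on each end date, and a cumulative sum gives the active count per date.

```python
import pandas as pd

# Hypothetical mini meta.csv: each stock's first and last data date
df = pd.DataFrame({"ticker": ["A", "B", "C"],
                   "start": ["2016-01-01", "2016-01-01", "2016-02-01"],
                   "end":   ["2016-03-01", "2016-02-01", "2016-03-01"]})

# +1 per start date, -1 per end date, aligned on the union of dates,
# then accumulated into the number of active stocks per date.
delta = df["start"].value_counts().sub(df["end"].value_counts(), fill_value=0)
active = delta.sort_index().cumsum()
```

As in the accepted fix, you may want to zero the final date's decrement (the `x.iloc[-1] = 0` line above) so stocks whose data runs through the last date still count as active on it.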
