Using condition in function while generating values for dataframe - python-3.x

I have to create a dataframe having columns start_date and end_date where end_date > start_date using a function which randomly generates date values.
I tried something like this:
Project = pd.DataFrame({'Name': np.random.choice(['Starbucks','Macdonalds', 'KFC', 'Maruti',
'Honda','Mercedes', 'BMW', 'Reebok','Nike','Lee'],10),
'Start_Date':Project.apply(lambda row: gen_datetime(), axis = 1),
'End_Date': Project.apply(lambda row: gen_datetime() where('End_Date' > 'Start_Date' ), axis = 1)})
I don't know how to use the condition statement:
def gen_datetime(min_year=2017, max_year=datetime.now().year):
    start = date(min_year, 10, 28)
    years = max_year - min_year + 1
    end = start + timedelta(days=365 * years)
    for i in range(10):
        random_date = start + (end - start) * random.random()
    return random_date

The idea is to generate a random end date from each start date by adding a random timedelta:
from datetime import date, datetime, timedelta
import numpy as np
import pandas as pd

N = 10
shift_end_date = 20

def gen_datetime(min_year=2017, max_year=datetime.now().year):
    start = date(min_year, 10, 28)
    years = max_year - min_year + 1
    end = start + timedelta(days=365 * years)
    # leave room at the end of the range for the shifted end dates
    dates = pd.date_range(start, end - timedelta(shift_end_date))
    return np.random.choice(dates, N)

names = ['Starbucks','Macdonalds', 'KFC', 'Maruti',
         'Honda','Mercedes', 'BMW', 'Reebok','Nike','Lee']
Project = pd.DataFrame({'Name': np.random.choice(names, N),
                        'Start_Date': gen_datetime()})
# add between 1 and shift_end_date - 1 days to each start date
days = pd.to_timedelta(np.random.randint(1, shift_end_date, size=N), unit='d')
Project['End_Date'] = Project['Start_Date'] + days
print(Project)
Name Start_Date End_Date
0 Maruti 2018-07-31 2018-08-13
1 KFC 2017-11-20 2017-11-21
2 Maruti 2018-07-22 2018-07-23
3 Reebok 2018-05-13 2018-05-15
4 KFC 2018-08-16 2018-08-29
5 Starbucks 2018-03-18 2018-03-23
6 Reebok 2018-02-13 2018-03-03
7 Lee 2018-04-26 2018-05-10
8 Reebok 2018-09-11 2018-09-15
9 Honda 2018-05-15 2018-05-19
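Improved solution: the function returns both the start and end arrays and uses the origin parameter of to_datetime (needs pandas 0.20.1+). Sorting each random pair gives End_Date >= Start_Date, so the two dates can coincide when both draws are equal: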
from datetime import datetime
import numpy as np
import pandas as pd

N = 10

def gen_datetime(min_year=2017, max_year=datetime.now().year):
    start = pd.Timestamp(min_year, 10, 28)
    years = max_year - min_year + 1
    end = 365 * years
    # get a random 2D array of day offsets from the start date, sorted within each pair
    d = np.sort(np.random.randint(end, size=[2, N]), axis=0)
    # convert to datetimes with the origin parameter
    a = pd.to_datetime(d[0], unit='D', origin=start)
    b = pd.to_datetime(d[1], unit='D', origin=start)
    # return both arrays together
    return a, b

# extract the output to 2 variables
start, end = gen_datetime()
names = ['Starbucks','Macdonalds', 'KFC', 'Maruti',
         'Honda','Mercedes', 'BMW', 'Reebok','Nike','Lee']
Project = pd.DataFrame({'Name': np.random.choice(names, N),
                        'Start_Date': start,
                        'End_Date': end}, columns=['Name','Start_Date','End_Date'])
print(Project)
Name Start_Date End_Date
0 Reebok 2017-11-20 2018-06-28
1 Nike 2018-06-12 2018-07-23
2 Reebok 2018-04-26 2018-07-06
3 BMW 2018-02-20 2018-07-14
4 Starbucks 2018-04-02 2018-09-10
5 Starbucks 2017-12-14 2018-03-29
6 Lee 2018-05-17 2018-09-13
7 Macdonalds 2017-11-01 2018-08-20
8 Reebok 2018-04-09 2018-06-27
9 Macdonalds 2018-02-21 2018-10-07
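Sorting two random draws can produce identical start and end dates when both draws coincide. If End_Date must be strictly later than Start_Date, one option is to draw a start offset and then a gap of at least one day. The following is only a sketch: the helper name gen_date_pairs and its parameters (origin, total_days, seed) are made up for illustration and do not come from the answer above.

```python
import numpy as np
import pandas as pd


def gen_date_pairs(n, origin="2017-10-28", total_days=730, seed=None):
    """Return (start, end) DatetimeIndex objects with end strictly after start."""
    rng = np.random.default_rng(seed)
    origin = pd.Timestamp(origin)
    # Start offsets leave room for at least one later day inside the window.
    start_off = rng.integers(0, total_days - 1, size=n)
    # Gap of at least 1 day, capped per row so the end stays inside the window.
    gap = rng.integers(1, total_days - start_off)
    start = origin + pd.to_timedelta(start_off, unit="D")
    end = start + pd.to_timedelta(gap, unit="D")
    return start, end


names = ["Starbucks", "Macdonalds", "KFC", "Maruti", "Honda",
         "Mercedes", "BMW", "Reebok", "Nike", "Lee"]
start, end = gen_date_pairs(10, seed=0)
project = pd.DataFrame({
    "Name": np.random.default_rng(0).choice(names, 10),
    "Start_Date": start,
    "End_Date": end,
})
print(project)
```

Because the gap is always at least one day, the check (project["End_Date"] > project["Start_Date"]).all() should hold for any seed.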

Related

Perform a specific calculation in a dataframe based on number of days in each month

I have a dataframe, shown below, that holds each customer's number of transactions over the last 3 months.
id age_days n_month3 n_month2 n_month1
1 201 60 15 30
2 800 0 15 5
3 800 0 0 10
4 100 0 0 0
5 600 0 6 5
6 800 0 0 15
7 500 10 10 30
8 200 0 0 0
9 500 0 0 0
From the above I would like to derive a column called recency as shown in the explanations
Explanation:
month3 is the current month
m3 = number of days of current month
m2 = number of days of previous month
m1 = number of days of the month of 2 months back from current month
if df["n_month3"] != 0:
    recency = m3 / df["n_month3"]
elif df["n_month2"] != 0:
    recency = m3 + (m2 / df["n_month2"])
elif df["n_month1"] != 0:
    recency = m3 + m2 + (m1 / df["n_month1"])
else:
    if df["age_days"] <= (m3 + m2 + m1):
        recency = df["age_days"]
    else:
        recency = (m3 + m2 + m1) + 1
Expected output:
Let's say the current month is April; then
m3 = 30
m2 = 31
m1 = 28
id age_days n_month3 n_month2 n_month1 recency
1 201 60 15 30 (m3/60) = 30/60 = 0.5
2 800 0 15 5 m3 + (m2/15) = 30 + 31/15 = 32
3 800 0 0 10 m3 + m2 + m1/10 = 30 + 31 + 28/10
4 100 0 0 0 m3+m2+m1+1 = 90
5 600 0 6 5 m3 + (m2/6) = 30 + 31/6
6 800 0 0 15 m3 + m2 + m1/15 = 30 + 31 + 28/15
7 500 10 10 30 (m3/10) = 30/10 = 3
8 10 0 0 0 10(age_days)
9 500 0 0 0 m3+m2+m1+1 = 90
I am having trouble defining m3, m2 and m1 dynamically based on the current month.
Here is one way to do it (as of this answer, current month is April 2022):
from calendar import monthrange
from datetime import datetime
m3, m2, m1 = (
    monthrange(datetime.now().year, datetime.now().month - i)[1] for i in range(0, 3)
)
print(m3) # 30 [days in April]
print(m2) # 31 [days in March]
print(m1) # 28 [days in February]
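Note that datetime.now().month - i drops to 0 or below when the current month is January or February, and monthrange raises an error for such month values. A small helper that counts months continuously across years avoids this. This is a sketch only; the function name last_month_lengths and its today parameter are illustrative and not part of the answer above.

```python
from calendar import monthrange
from datetime import date


def last_month_lengths(n, today=None):
    """Days in the current month and the n - 1 months before it, newest first."""
    today = today or date.today()
    # Number the months continuously so subtraction rolls over year boundaries.
    index = today.year * 12 + (today.month - 1)
    lengths = []
    for i in range(n):
        year, month0 = divmod(index - i, 12)
        lengths.append(monthrange(year, month0 + 1)[1])
    return lengths


m3, m2, m1 = last_month_lengths(3, today=date(2022, 4, 15))
print(m3, m2, m1)  # 30 31 28
print(last_month_lengths(3, today=date(2022, 1, 10)))  # [31, 31, 30]
```

The same call with n=12 covers a full year without treating the year boundary as a special case.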
To extend this to 12 months while handling the year boundary, you can do this (here m12 is the current month and m1 the oldest):
current_year = datetime.now().year
current_month = datetime.now().month
m12, m11, m10, m9, m8, m7, m6, m5, m4, m3, m2, m1 = [
    monthrange(current_year, current_month - i)[1] for i in range(0, current_month)
] + [monthrange(current_year - 1, 12 - i)[1] for i in range(0, 12 - current_month)]
print(m5) # 30 [days in September 2021]
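Once m3, m2 and m1 are known, the if/elif chain from the question can be applied to the whole dataframe at once. The sketch below uses numpy.select, which is one possible choice rather than something the answer prescribes. The sample rows are taken from the question's expected-output table, and the month lengths are hard-coded to the April example.

```python
import numpy as np
import pandas as pd

m3, m2, m1 = 30, 31, 28  # April, March, February (from the question's example)

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 8],
    "age_days": [201, 800, 800, 100, 10],
    "n_month3": [60, 0, 0, 0, 0],
    "n_month2": [15, 15, 0, 0, 0],
    "n_month1": [30, 5, 10, 0, 0],
})

# Replace zeros with NaN so no division by zero is ever evaluated.
n3 = df["n_month3"].replace(0, np.nan)
n2 = df["n_month2"].replace(0, np.nan)
n1 = df["n_month1"].replace(0, np.nan)
window = m3 + m2 + m1

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df["n_month3"] != 0,
    df["n_month2"] != 0,
    df["n_month1"] != 0,
    df["age_days"] <= window,
]
choices = [m3 / n3, m3 + m2 / n2, m3 + m2 + m1 / n1, df["age_days"]]
df["recency"] = np.select(conditions, choices, default=window + 1)
print(df)
```

numpy.select evaluates every choice for every row and then picks by condition, which is why the zero counts are masked before dividing.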

To find sum and percentage from columns of two different dataframe and append result in third dataframe

I have made 2 similar-looking dataframes, shown below:
df1:
date id email Count
4/22/2019 1 abc@xyz.com 10
4/22/2019 1 def@xyz.com 4
4/23/2019 1 abc@xyz.com 5
4/23/2019 1 def@xyz.com 10
df2:
date id Email_ID Count
4/22/2019 1 fgh@xyz.com 5
4/22/2019 1 ijk@xyz.com 6
4/23/2019 1 fgh@xyz.com 7
4/23/2019 1 ijk@xyz.com 8
I want to build a third dataframe, df3, that holds the per-date sum of the 'Count' column of each dataframe (df1 and df2) together with each one's share of the total, e.g. df1_count% = df1_count / (df1_count + df2_count) * 100. The output df3 should look something like this:
df3:
Count Count%
date df1_count df2_count df1_count% df2_count%
4/22/2019 14 11 56% 44%
4/23/2019 15 15 50% 50%
How can this be done with pandas? I can do it with a 'for' loop but not with pandas functionality; any leads will help.
Output as per the solution from @jezrael:
Count Count count% count%
df1_count df2_count df1_count% df2_count%
Date
4/22/2019 14 11 56% 44%
4/23/2019 15 15 50% 50%
Use concat with aggregation sum:
df = pd.concat([df1.groupby('date')['Count'].sum(),
df2.groupby('date')['Count'].sum()], axis=1, keys=('df1_count','df2_count'))
And then add new columns:
s = (df['df1_count'] + df['df2_count'])
df['df1_count%'] = df['df1_count'] / s * 100
df['df2_count%'] = df['df2_count'] / s * 100
df = df.reset_index()
print (df)
date df1_count df2_count df1_count% df2_count%
0 4/22/2019 14 11 56.0 44.0
1 4/23/2019 15 15 50.0 50.0
If the percentages are needed as strings, first round the values with Series.round to drop the decimals, then convert to strings and append '%':
s = (df['df1_count'] + df['df2_count'])
df['df1_count%'] = (df['df1_count'] / s * 100).round().astype(str) + '%'
df['df2_count%'] = (df['df2_count'] / s * 100).round().astype(str) + '%'
df = df.reset_index()
print (df)
date df1_count df2_count df1_count% df2_count%
0 4/22/2019 14 11 56.0% 44.0%
1 4/23/2019 15 15 50.0% 50.0%
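For reference, here is a self-contained sketch of the same steps using the sample data from the question (only the date and Count columns are reproduced). It also casts the rounded values to int before converting to strings, which is an addition to the answer, so the result reads 56% rather than 56.0%.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": ["4/22/2019", "4/22/2019", "4/23/2019", "4/23/2019"],
    "Count": [10, 4, 5, 10],
})
df2 = pd.DataFrame({
    "date": ["4/22/2019", "4/22/2019", "4/23/2019", "4/23/2019"],
    "Count": [5, 6, 7, 8],
})

# Per-date sums of both frames, side by side.
df3 = pd.concat(
    [df1.groupby("date")["Count"].sum(), df2.groupby("date")["Count"].sum()],
    axis=1,
    keys=("df1_count", "df2_count"),
)
total = df3["df1_count"] + df3["df2_count"]
for col in ("df1_count", "df2_count"):
    # Round, cast to int, then format as a percentage string.
    pct = (df3[col] / total * 100).round().astype(int)
    df3[col + "%"] = pct.astype(str) + "%"
print(df3.reset_index())
```

Rounding each share independently means the two percentages of a row are not guaranteed to add up to exactly 100 in general.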
EDIT:
df = pd.concat([df1.groupby('date')['Count'].sum(),
df2.groupby('date')['Count'].sum()], axis=1,
keys=('Count_df1_count','Count_df2_count'))
s = (df['Count_df1_count'] + df['Count_df2_count'])
df['Count%_df1_count%'] = (df['Count_df1_count'] / s * 100).round().astype(str) + '%'
df['Count%_df2_count%'] = (df['Count_df2_count'] / s * 100).round().astype(str) + '%'
df.columns = df.columns.str.split('_', expand=True, n=1)
print (df)
Count Count%
df1_count df2_count df1_count% df2_count%
date
4/22/2019 14 11 56.0% 44.0%
4/23/2019 15 15 50.0% 50.0%

Count the number of dates that fall between two dates

I have a data set like this:
ID date value_1 value_2 tech start_date last_date
ab 2017-06-01 3476.44 324 A 2015-05-04 2018-06-01
ab 2017-07-01 3556.65 332 A 2016-06-07 2018-07-01
ab 2017-08-01 3552.65 120 B 2016-01-08 2018-01-01
ab 2017-09-01 3201.66 987 C 2015-04-08 2018-04-01
bc 2017-10-01 3059.02 652 C 2015-06-09 2018-03-01
bc 2017-11-01 2853.37 345 C 2018-01-01 2018-08-01
bc 2017-12-01 2871.29 554 C 2015-10-01 2018-01-01
I want to keep the ID and the tech fixed and count how many dates fall between start_date and last_date.
Like:
ID count
ab 4
ab 4
ab 4
ab 4
bc 2
bc 2
bc 2
I built a function to do the count and then applied it in a groupby:
def count_c(data):
    d = {}
    d['count'] = np.sum(
        [x > data['start_date '] & x < data['last_date '] for x in data['date']])
    return pd.Series(d, index=['count'])
df_model1 = flag.groupby('date').apply(count_c)
Quite simple, actually: instead of a groupby function, use the datetime library and subtract the dates row by row.
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame(columns=['ID', 'date', 'value_1', 'value_2', 'tech', 'start_date', 'last_date']) # Your DataFrame
days_list = []
EDIT: The solution now counts the number of rows whose date falls between the start_date and last_date columns.
# the date strings in the question look like 2017-06-01, hence '%Y-%m-%d'
for i, row in df.iterrows():
    s_date = datetime.strptime(row['start_date'], '%Y-%m-%d')
    e_date = datetime.strptime(row['last_date'], '%Y-%m-%d')
    days = abs((e_date - s_date).days)
    days_list.append(days)
days_list = np.array(days_list)
df['Days'] = days_list
def dates(df):
    """
    :param df: DataFrame with 'date', 'start_date' and 'last_date' columns (str, yyyy-mm-dd)
    :return: number of rows whose date lies strictly between start_date and last_date
    """
    n = 0
    for _, ro in df.iterrows():
        y = datetime.strptime(ro['start_date'], '%Y-%m-%d')
        t = datetime.strptime(ro['last_date'], '%Y-%m-%d')
        d = datetime.strptime(ro['date'], '%Y-%m-%d')
        if y < d < t:
            n += 1
    return n
print(dates(df))
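The row-by-row loop can also be written without iterrows. The sketch below is an alternative, not the answer's method: it compares the three date columns directly and sums the matches per ID. Grouping by ID only is an assumption based on the expected output in the question; add "tech" to the grouping keys if the count should be per ID and tech. The value columns are omitted from the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["ab", "ab", "ab", "ab", "bc", "bc", "bc"],
    "date": ["2017-06-01", "2017-07-01", "2017-08-01", "2017-09-01",
             "2017-10-01", "2017-11-01", "2017-12-01"],
    "start_date": ["2015-05-04", "2016-06-07", "2016-01-08", "2015-04-08",
                   "2015-06-09", "2018-01-01", "2015-10-01"],
    "last_date": ["2018-06-01", "2018-07-01", "2018-01-01", "2018-04-01",
                  "2018-03-01", "2018-08-01", "2018-01-01"],
})
for col in ("date", "start_date", "last_date"):
    df[col] = pd.to_datetime(df[col])

# True where the row's date lies strictly between its start and last dates.
inside = (df["start_date"] < df["date"]) & (df["date"] < df["last_date"])
# Sum the matches per ID and broadcast the total back to every row.
df["count"] = inside.astype(int).groupby(df["ID"]).transform("sum")
print(df[["ID", "count"]])
```

With the sample data this gives 4 for every ab row and 2 for every bc row, matching the expected output in the question.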

day of Year values starting from a particular date

I have a dataframe with a date column. The duration is 365 days starting from 02/11/2017 and ending at 01/11/2018.
Date
02/11/2017
03/11/2017
05/11/2017
.
.
01/11/2018
I want to add an adjacent column called Day_Of_Year as follows:
Date Day_Of_Year
02/11/2017 1
03/11/2017 2
05/11/2017 4
.
.
01/11/2018 365
I apologize if it's a very basic question, but unfortunately I haven't been able to start with this.
I could use datetime(), but that would return values such as 1 for 1st january, 2 for 2nd january and so on.. irrespective of the year. So, that wouldn't work for me.
First convert column to_datetime and then subtract datetime, convert to days and add 1:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(pd.Timestamp('2017-11-02')).dt.days + 1
print (df)
Date Day_Of_Year
0 02/11/2017 1
1 03/11/2017 2
2 05/11/2017 4
3 01/11/2018 365
Or subtract by first value of column:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(df['Date'].iat[0]).dt.days + 1
print (df)
Date Day_Of_Year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
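A self-contained version of the subtraction approach, as a sketch that uses only the four dates shown in the question. It subtracts the earliest date instead of the first row, which is a small variation that also works when the column is not sorted. For contrast, it also computes the difference of calendar day-of-year values, which turns negative once the dates cross into the next year.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["02/11/2017", "03/11/2017", "05/11/2017", "01/11/2018"]})
dates = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Elapsed days since the earliest date, counted from 1.
df["Day_Of_Year"] = (dates - dates.min()).dt.days + 1

# Difference of calendar day-of-year values, shown only for contrast.
naive = dates.dt.dayofyear - dates.dt.dayofyear.iloc[0]

print(df)
print(naive.tolist())  # [0, 1, 3, -1]
```

Subtracting timestamps counts real elapsed days, so leap years and the year boundary need no special handling.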
Using strftime with '%j'
s=pd.to_datetime(df.Date,dayfirst=True).dt.strftime('%j').astype(int)
s-s.iloc[0]
Out[750]:
0 0
1 1
2 3
Name: Date, dtype: int32
#df['new']=s-s.iloc[0]
pandas has dayofyear. Put your column in the right format with pd.to_datetime and then use Series.dt.dayofyear. Lastly, use some modulo arithmetic to express everything relative to your original date:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['day of year'] = df['Date'].dt.dayofyear - df['Date'].dt.dayofyear[0] + 1
df['day of year'] = df['day of year'] + 365*((365 - df['day of year']) // 365)
Output
Date day of year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
But I'm doing essentially the same as Jezrael in more lines of code, so my vote goes to her/him

Loop through rows and columns to calculate a compounded rate in a new DataFrame

I have been struggling to run a loop through each row and column. When looping through each row, I want to calculate a compound rate of return.
There are two different DataFrames (df1 and df2), where df1 shows stock symbols and df2 show their respective prices. I am trying to build a new DataFrame (df3) based on the 'if statements' listed below.
If df1.row[1] == df1.row[0], then df3[1] = (df2.row[1]/df2.row[0]) * df3[0]
If df1.row[1] != df1.row[0], then df3[1] = df3[0]
First DataFrame = df1
Date 1 2 3 4 5
0 2000-12-05 PXX.TO MX.TO CAE.TO HRX.TO FR.TO
1 2000-12-06 PXX.TO MX.TO CAE.TO HRX.TO FR.TO
2 2000-12-07 FTS.TO MX.TO CAE.TO HRX.TO FR.TO
3 2000-12-08 FTS.TO MX.TO CAE.TO HRX.TO FR.TO
4 2000-12-09 FTS.TO G.TO CAE.TO HRX.TO TB.TO
5 2000-12-10 FTS.TO G.TO KYU.TO HRX.TO TB.TO
6 2000-12-11 FTS.TO G.TO KYU.TO HRX.TO TB.TO
7 2000-12-12 BAM-A.TO G.TO KYU.TO HRX.TO TB.TO
8 2000-12-13 BAM-A.TO PLI.TO KYU.TO HRX.TO TB.TO
9 2000-12-14 BAM-A.TO PLI.TO KYU.TO HRX.TO TB.TO
10 2000-12-15 BAM-A.TO PLI.TO KYU.TO HRX.TO TB.TO
Second DataFrame = df2
Date 1 2 3 4 5
0 2000-12-05 2.3 60.10 2.30 34.98 35.00
1 2000-12-06 2.35 60.70 2.38 35.43 35.01
2 2000-12-07 56.76 61.31 2.46 35.89 35.02
3 2000-12-08 57.33 61.92 2.54 36.35 35.04
4 2000-12-09 57.90 100.20 2.63 36.83 300.90
5 2000-12-10 58.48 101.00 69.56 37.30 304.18
6 2000-12-11 59.07 101.81 70.46 37.78 307.50
7 2000-12-12 4.50 102.62 71.37 38.27 310.85
8 2000-12-13 4.54 44.50 72.29 38.77 314.24
9 2000-12-14 4.57 45.39 73.23 39.27 317.66
10 2000-12-15 4.61 46.30 74.18 39.78 321.12
Desired Output = df3
Date 1 2 3 4 5
0 2000-12-05 1.0000 1.0000 1.0000 1.0000 1.0000
1 2000-12-06 1.0200 1.0100 1.0340 1.0129 1.0003
2 2000-12-07 1.0200 1.0201 1.0692 1.0260 1.0007
3 2000-12-08 1.0302 1.0303 1.1055 1.0393 1.0010
4 2000-12-09 1.0405 1.0303 1.1431 1.0528 1.0010
5 2000-12-10 1.0509 1.0385 1.1431 1.0664 1.0119
6 2000-12-11 1.0614 1.0469 1.1579 1.0802 1.0230
7 2000-12-12 1.0614 1.0552 1.1729 1.0941 1.0341
8 2000-12-13 1.0699 1.0552 1.1880 1.1083 1.0454
9 2000-12-14 1.0785 1.0763 1.2034 1.1226 1.0568
10 2000-12-15 1.0871 1.0979 1.2190 1.1371 1.0683
Below show the formulas for the values in df3 for column 1
df3.row[0] = 1
df3.row[1] = (2.35/2.30) * 1 = 1.0200
df3.row[2] = (56.76/56.76) * 1.0200 = 1.0200
df3.row[3] = (57.33/56.76) * 1.0200 = 1.0302
df3.row[4] = (57.90/57.33) * 1.0302 = 1.0405
df3.row[5] = (58.48/57.90) * 1.0405 = 1.0509
df3.row[6] = (59.07/58.48) * 1.0509 = 1.0614
df3.row[7] = (4.50/4.50) * 1.0614 = 1.0614
df3.row[8] = (4.54/4.50) * 1.0614 = 1.0699
df3.row[9] = (4.57/4.54) * 1.0699 = 1.0785
df3.row[10] = (4.61/4.57) * 1.0785 = 1.0871
Below is what I have so far. Not confident that this is the best approach.
StartFromDay = 1
NumOfHoldings = 10
df3 = pd.DataFrame(columns = np.arange(1,NumOfHoldings+1))
df3.index.names = ['Date']
for col in df1.columns:
    #First row should equal 1
    df3.iloc[0][col] == 1
    for i in range(StartFromDay, len(df1)):
        #first row of each column
        prevrow = df1.iloc[0][col]
        if df1.iloc[i][col] == prevrow:
            ###### If Statements to calculate compound return#######
Loops are slow, so we'll do it in a vectorized way. First, set the indexes appropriately:
df1.set_index('Date', inplace=True)
df2.set_index('Date', inplace=True)
Next, generate a boolean mask which is True wherever the symbol is the same:
same_stock = df1.iloc[1:].values == df1.iloc[:-1].values
We have to use values because the shifted series are not aligned on the index anymore.
And make a matrix with all the df2.row[1]/df2.row[0] values:
ret = df2.iloc[1:].values / df2.iloc[:-1].values
Next, replace the returns where the symbol changed:
ret[~same_stock] = 1 # pretend return is flat when symbol changed
Now create a DataFrame with the result:
simpret = pd.DataFrame(np.vstack(([1,1,1,1,1], ret)), df1.index)
df3 = simpret.cumprod()
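The same idea can be tried end to end on a small slice of the data. The sketch below keeps only the first four rows of columns 1 and 2 from the question and uses DataFrame.shift and DataFrame.where instead of raw NumPy arrays. That is a variation on the answer, not the answer's exact code, and the names symbols and prices are chosen here for readability.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-12-05", periods=4, name="Date")
symbols = pd.DataFrame({1: ["PXX.TO", "PXX.TO", "FTS.TO", "FTS.TO"],
                        2: ["MX.TO"] * 4}, index=dates)
prices = pd.DataFrame({1: [2.30, 2.35, 56.76, 57.33],
                       2: [60.10, 60.70, 61.31, 61.92]}, index=dates)

# True where the symbol is the same as on the previous day (first row is False).
same_stock = symbols.eq(symbols.shift())
# Day-over-day price ratio; the first row is NaN and is replaced just below.
ratio = prices / prices.shift()
# Treat the return as flat on the first day and whenever the symbol changed.
ret = ratio.where(same_stock, 1.0)
df3 = ret.cumprod()
print(df3.round(4))
```

Because shift keeps the index aligned, no conversion to .values is needed here, and the column labels carry over to df3.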
