I need to create a DataFrame with columns start_date and end_date, where end_date > start_date, using a function that randomly generates date values.
I tried something like this:
Project = pd.DataFrame({'Name': np.random.choice(['Starbucks', 'Macdonalds', 'KFC', 'Maruti',
                                                  'Honda', 'Mercedes', 'BMW', 'Reebok', 'Nike', 'Lee'], 10),
                        'Start_Date': Project.apply(lambda row: gen_datetime(), axis=1),
                        'End_Date': Project.apply(lambda row: gen_datetime() where ('End_Date' > 'Start_Date'), axis=1)})
I don't know how to use the condition statement:
def gen_datetime(min_year=2017, max_year=datetime.now().year):
    start = date(min_year, 10, 28)
    years = max_year - min_year + 1
    end = start + timedelta(days=365 * years)
    for i in range(10):
        random_date = start + (end - start) * random.random()
        return random_date
The idea is to generate a random end date from the start date by adding a random timedelta:
import numpy as np
import pandas as pd
from datetime import date, datetime, timedelta

N = 10
shift_end_date = 20

def gen_datetime(min_year=2017, max_year=datetime.now().year):
    start = date(min_year, 10, 28)
    years = max_year - min_year + 1
    end = start + timedelta(days=365 * years)
    # leave shift_end_date days of headroom so End_Date stays in range
    dates = pd.date_range(start, end - timedelta(shift_end_date))
    return np.random.choice(dates, N)

names = ['Starbucks', 'Macdonalds', 'KFC', 'Maruti',
         'Honda', 'Mercedes', 'BMW', 'Reebok', 'Nike', 'Lee']
Project = pd.DataFrame({'Name': np.random.choice(names, N),
                        'Start_Date': gen_datetime()})
# a random 1-19 day offset guarantees End_Date > Start_Date
days = pd.to_timedelta(np.random.randint(1, shift_end_date, size=N), unit='d')
Project['End_Date'] = Project['Start_Date'] + days
print(Project)
Name Start_Date End_Date
0 Maruti 2018-07-31 2018-08-13
1 KFC 2017-11-20 2017-11-21
2 Maruti 2018-07-22 2018-07-23
3 Reebok 2018-05-13 2018-05-15
4 KFC 2018-08-16 2018-08-29
5 Starbucks 2018-03-18 2018-03-23
6 Reebok 2018-02-13 2018-03-03
7 Lee 2018-04-26 2018-05-10
8 Reebok 2018-09-11 2018-09-15
9 Honda 2018-05-15 2018-05-19
Improved solution: the function returns both arrays, for start and end days, and uses the origin parameter of to_datetime (needs pandas 0.20.1+):
N = 10

def gen_datetime(min_year=2017, max_year=datetime.now().year):
    start = pd.Timestamp(min_year, 10, 28)
    years = max_year - min_year + 1
    end = 365 * years
    # get random day offsets from the start date, sorted so start <= end
    d = np.sort(np.random.randint(end, size=[2, N]), axis=0)
    # convert to datetime with the origin parameter
    a = pd.to_datetime(d[0], unit='D', origin=start)
    b = pd.to_datetime(d[1], unit='D', origin=start)
    # return both arrays together
    return a, b

# unpack the output into 2 variables
start, end = gen_datetime()

names = ['Starbucks', 'Macdonalds', 'KFC', 'Maruti',
         'Honda', 'Mercedes', 'BMW', 'Reebok', 'Nike', 'Lee']
Project = pd.DataFrame({'Name': np.random.choice(names, N),
                        'Start_Date': start,
                        'End_Date': end}, columns=['Name', 'Start_Date', 'End_Date'])
print(Project)
Name Start_Date End_Date
0 Reebok 2017-11-20 2018-06-28
1 Nike 2018-06-12 2018-07-23
2 Reebok 2018-04-26 2018-07-06
3 BMW 2018-02-20 2018-07-14
4 Starbucks 2018-04-02 2018-09-10
5 Starbucks 2017-12-14 2018-03-29
6 Lee 2018-05-17 2018-09-13
7 Macdonalds 2017-11-01 2018-08-20
8 Reebok 2018-04-09 2018-06-27
9 Macdonalds 2018-02-21 2018-10-07
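One caveat with the sorted random offsets above: np.sort only guarantees d[0] <= d[1], so a start/end pair can occasionally land on the same day. A minimal sketch (my tweak, not part of the original answer) that forces End_Date strictly after Start_Date by bumping the end offsets:

```python
import numpy as np
import pandas as pd

N = 10
start = pd.Timestamp(2017, 10, 28)
rng = np.random.default_rng(0)

# two arrays of random day offsets, sorted column-wise so d[0] <= d[1]
d = np.sort(rng.integers(0, 365, size=(2, N)), axis=0)
# add 1..20 days to the end offsets so ties become strict inequalities
d[1] += rng.integers(1, 21, size=N)

start_dates = pd.to_datetime(d[0], unit='D', origin=start)
end_dates = pd.to_datetime(d[1], unit='D', origin=start)
assert (end_dates > start_dates).all()
```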
I have a dataframe as shown below, which contains each customer's number of transactions for the last 3 months.
id age_days n_month3 n_month2 n_month1
1 201 60 15 30
2 800 0 15 5
3 800 0 0 10
4 100 0 0 0
5 600 0 6 5
6 800 0 0 15
7 500 10 10 30
8 200 0 0 0
9 500 0 0 0
From the above I would like to derive a column called recency, as shown in the explanation below.
Explanation:
month3 is the current month
m3 = number of days of current month
m2 = number of days of previous month
m1 = number of days of the month of 2 months back from current month
if df["n_month3"] != 0:
    recency = m3 / df["n_month3"]
elif df["n_month2"] != 0:
    recency = m3 + (m2 / df["n_month2"])
elif df["n_month1"] != 0:
    recency = m3 + m2 + (m1 / df["n_month1"])
else:
    if df["age_days"] <= (m3 + m2 + m1):
        recency = df["age_days"]
    else:
        recency = (m3 + m2 + m1) + 1
Expected output:
Let say current month is April, then
m3 = 30
m2 = 31
m1 = 28
id age_days n_month3 n_month2 n_month1 recency
1 201 60 15 30 (m3/60) = 30/60 = 0.5
2 800 0 15 5 m3 + (m2/15) = 30 + 31/15 = 32
3 800 0 0 10 m3 + m2 + m1/10 = 30 + 31 + 28/10
4 100 0 0 0 m3+m2+m1+1 = 90
5 600 0 6 5 m3 + (m2/6) = 30 + 31/6
6 800 0 0 15 m3 + m2 + m1/15 = 30 + 31 + 28/15
7 500 10 10 30 (m3/10) = 30/10 = 3
8 10 0 0 0 10(age_days)
9 500 0 0 0 m3+m2+m1+1 = 90
I am facing an issue with dynamically defining m3, m2 and m1 based on the current month.
Here is one way to do it (as of this answer, current month is April 2022):
from calendar import monthrange
from datetime import datetime
m3, m2, m1 = (
monthrange(datetime.now().year, datetime.now().month - i)[1] for i in range(0, 3)
)
print(m3) # 30 [days in April]
print(m2) # 31 [days in March]
print(m1) # 28 [days in February]
If you need to extend this to 12 months and deal with the year cut-off (the simple version above also breaks in January and February, when month - i drops below 1), you can do this:
current_year = datetime.now().year
current_month = datetime.now().month
m12, m11, m10, m9, m8, m7, m6, m5, m4, m3, m2, m1 = [
monthrange(current_year, current_month - i)[1] for i in range(0, current_month)
] + [monthrange(current_year - 1, 12 - i)[1] for i in range(0, 12 - current_month)]
print(m5) # 30 [days in September 2021]
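To then apply the question's recency rules without a row loop, one option (my sketch, not part of the original answer) is np.select, which checks the conditions in order exactly like the if/elif chain:

```python
import numpy as np
import pandas as pd

# month lengths as in the answer above (April 2022)
m3, m2, m1 = 30, 31, 28

# a few sample rows re-typed from the question
df = pd.DataFrame({'age_days': [201, 800, 800, 100],
                   'n_month3': [60, 0, 0, 0],
                   'n_month2': [15, 15, 0, 0],
                   'n_month1': [30, 5, 10, 0]})

# conditions are evaluated in order, like the if/elif chain
conditions = [df['n_month3'] != 0,
              df['n_month2'] != 0,
              df['n_month1'] != 0,
              df['age_days'] <= (m3 + m2 + m1)]
# divisions by zero yield inf here, but those entries are never selected
choices = [m3 / df['n_month3'],
           m3 + m2 / df['n_month2'],
           m3 + m2 + m1 / df['n_month1'],
           df['age_days']]
df['recency'] = np.select(conditions, choices, default=m3 + m2 + m1 + 1)
```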
I have made 2 identically structured dataframes, which look like below:
df1:
date id email Count
4/22/2019 1 abc@xyz.com 10
4/22/2019 1 def@xyz.com 4
4/23/2019 1 abc@xyz.com 5
4/23/2019 1 def@xyz.com 10
df2:
date id Email_ID Count
4/22/2019 1 fgh@xyz.com 5
4/22/2019 1 ijk@xyz.com 6
4/23/2019 1 fgh@xyz.com 7
4/23/2019 1 ijk@xyz.com 8
I want to make a dataframe3 which has the sum and the percentage of the 'Count' column of each dataframe (df1 and df2), calculating the individual percentage like df1_count% = (df1_count / (df1_count + df2_count)) * 100, grouped by date. Output df3 should be something like this below:
df3:
Count Count%
date df1_count df2_count df1_count% df2_count%
4/22/2019 14 11 56% 44%
4/23/2019 15 15 50% 50%
How can it be done with pandas? I am able to do it using a 'for' loop but not with pandas functionality; any leads will help.
Output as per @jezrael's solution:
Count Count count% count%
df1_count df2_count df1_count% df2_count%
Date
4/22/2019 14 11 56% 44%
4/23/2019 15 15 50% 50%
Use concat with sum aggregation:
df = pd.concat([df1.groupby('date')['Count'].sum(),
df2.groupby('date')['Count'].sum()], axis=1, keys=('df1_count','df2_count'))
And then add new columns:
s = (df['df1_count'] + df['df2_count'])
df['df1_count%'] = df['df1_count'] / s * 100
df['df2_count%'] = df['df2_count'] / s * 100
df = df.reset_index()
print (df)
date df1_count df2_count df1_count% df2_count%
0 4/22/2019 14 11 56.0 44.0
1 4/23/2019 15 15 50.0 50.0
If you need '%' appended to the values, first truncate the decimals with Series.round and convert to strings:
s = (df['df1_count'] + df['df2_count'])
df['df1_count%'] = (df['df1_count'] / s * 100).round().astype(str) + '%'
df['df2_count%'] = (df['df2_count'] / s * 100).round().astype(str) + '%'
df = df.reset_index()
print (df)
date df1_count df2_count df1_count% df2_count%
0 4/22/2019 14 11 56.0% 44.0%
1 4/23/2019 15 15 50.0% 50.0%
EDIT:
df = pd.concat([df1.groupby('date')['Count'].sum(),
df2.groupby('date')['Count'].sum()], axis=1,
keys=('Count_df1_count','Count_df2_count'))
s = (df['Count_df1_count'] + df['Count_df2_count'])
df['Count%_df1_count%'] = (df['Count_df1_count'] / s * 100).round().astype(str) + '%'
df['Count%_df2_count%'] = (df['Count_df2_count'] / s * 100).round().astype(str) + '%'
df.columns = df.columns.str.split('_', expand=True, n=1)
print (df)
Count Count%
df1_count df2_count df1_count% df2_count%
date
4/22/2019 14 11 56.0% 44.0%
4/23/2019 15 15 50.0% 50.0%
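The two-level header can also be built directly by passing tuples as the concat keys, instead of joining and re-splitting column names afterwards (a variation on the EDIT above; the data below is re-typed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['4/22/2019', '4/22/2019', '4/23/2019', '4/23/2019'],
                    'Count': [10, 4, 5, 10]})
df2 = pd.DataFrame({'date': ['4/22/2019', '4/22/2019', '4/23/2019', '4/23/2019'],
                    'Count': [5, 6, 7, 8]})

# tuple keys create the two-level columns directly
df = pd.concat([df1.groupby('date')['Count'].sum(),
                df2.groupby('date')['Count'].sum()],
               axis=1, keys=[('Count', 'df1_count'), ('Count', 'df2_count')])

s = df[('Count', 'df1_count')] + df[('Count', 'df2_count')]
df[('Count%', 'df1_count%')] = (df[('Count', 'df1_count')] / s * 100).round().astype(int).astype(str) + '%'
df[('Count%', 'df2_count%')] = (df[('Count', 'df2_count')] / s * 100).round().astype(int).astype(str) + '%'
print(df)
```

The extra .astype(int) drops the trailing '.0' so the values match the question's '56%' format.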
I have a data set like this:
ID date value_1 value_2 tech start_date last_date
ab 2017-06-01 3476.44 324 A 2015-05-04 2018-06-01
ab 2017-07-01 3556.65 332 A 2016-06-07 2018-07-01
ab 2017-08-01 3552.65 120 B 2016-01-08 2018-01-01
ab 2017-09-01 3201.66 987 C 2015-04-08 2018-04-01
bc 2017-10-01 3059.02 652 C 2015-06-09 2018-03-01
bc 2017-11-01 2853.37 345 C 2018-01-01 2018-08-01
bc 2017-12-01 2871.29 554 C 2015-10-01 2018-01-01
I want to keep the ID and the tech fixed, and count how many of the date values fall between start_date and last_date.
Like:
ID count
ab 4
ab 4
ab 4
ab 4
bc 2
bc 2
bc 2
I built a function to do the count and then apply it with a group by:
def count_c(data):
    d = {}
    d['count'] = np.sum(
        [(x > data['start_date']) & (x < data['last_date']) for x in data['date']])
    return pd.Series(d, index=['count'])

df_model1 = flag.groupby('date').apply(count_c)
Quite simple actually: instead of using a custom function, use the datetime library and subtract each date.
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame(columns=['ID', 'date', 'value_1', 'value_2', 'tech', 'start_date', 'last_date'])  # Your DataFrame

days_list = []
for i, row in df.iterrows():
    s_date = datetime.strptime(row['start_date'], '%m/%d/%y')
    e_date = datetime.strptime(row['last_date'], '%m/%d/%y')
    days = abs((e_date - s_date).days)
    days_list.append(days)

days_list = np.array(days_list)
df['Days'] = days_list

EDIT: the solution below counts the number of rows whose date falls between the start_date and last_date columns:

def dates(df):
    """
    :param df: DataFrame with 'date', 'start_date' and 'last_date' string columns (mm/dd/yy)
    :return: number of rows whose date falls between start_date and last_date
    """
    n = 0
    for _, ro in df.iterrows():
        y = datetime.strptime(ro['start_date'], '%m/%d/%y')
        t = datetime.strptime(ro['last_date'], '%m/%d/%y')
        d = datetime.strptime(ro['date'], '%m/%d/%y')
        if y < d < t:
            n += 1
    return n

print(dates(df))
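For reference, the per-ID count that the question's expected output shows can also be computed without iterrows, using Series.between plus a groupby transform (my sketch; a few rows re-typed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ab', 'ab', 'bc'],
    'date': ['06/01/17', '07/01/17', '10/01/17'],
    'start_date': ['05/04/15', '06/07/16', '06/09/15'],
    'last_date': ['06/01/18', '07/01/18', '03/01/18'],
})

# parse the string columns to datetimes
for c in ['date', 'start_date', 'last_date']:
    df[c] = pd.to_datetime(df[c], format='%m/%d/%y')

# flag rows whose date falls inside [start_date, last_date] ...
inside = df['date'].between(df['start_date'], df['last_date'])
# ... then count the flags per ID, broadcast back to every row
df['count'] = inside.groupby(df['ID']).transform('sum')
print(df[['ID', 'count']])
```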
I have a dataframe with a date column. The duration is 365 days starting from 02/11/2017 and ending at 01/11/2018.
Date
02/11/2017
03/11/2017
05/11/2017
.
.
01/11/2018
I want to add an adjacent column called Day_Of_Year as follows:
Date Day_Of_Year
02/11/2017 1
03/11/2017 2
05/11/2017 4
.
.
01/11/2018 365
I apologize if it's a very basic question, but unfortunately I haven't been able to start with this.
I could use datetime(), but that would return values such as 1 for 1st January, 2 for 2nd January and so on, irrespective of the year, so that wouldn't work for me.
First convert the column with to_datetime, then subtract the start date, convert to days and add 1:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(pd.Timestamp('2017-11-02')).dt.days + 1
print (df)
Date Day_Of_Year
0 02/11/2017 1
1 03/11/2017 2
2 05/11/2017 4
3 01/11/2018 365
Or subtract the first value of the column:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(df['Date'].iat[0]).dt.days + 1
print (df)
Date Day_Of_Year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
Using strftime with '%j' (note '%j' is day-of-year and resets each January, so this gives a zero-based offset that only holds within a single year):
s = pd.to_datetime(df.Date, dayfirst=True).dt.strftime('%j').astype(int)
s - s.iloc[0]
Out[750]:
0 0
1 1
2 3
Name: Date, dtype: int32
#df['new']=s-s.iloc[0]
Pandas has dayofyear. So put your column in the right format with pd.to_datetime and then use Series.dt.dayofyear. Lastly, use some modulo arithmetic to express everything in terms of your original date:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['day of year'] = df['Date'].dt.dayofyear - df['Date'].dt.dayofyear[0] + 1
df['day of year'] = df['day of year'] + 365*((365 - df['day of year']) // 365)
Output
Date day of year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
But I'm doing essentially the same as Jezrael in more lines of code, so my vote goes to her/him
I have been struggling to run a loop through each row and column. When looping through each row, I want to calculate a compound rate of return.
There are two different DataFrames (df1 and df2), where df1 shows stock symbols and df2 show their respective prices. I am trying to build a new DataFrame (df3) based on the 'if statements' listed below.
If df1.row[1] == df1.row[0], then df3[1] = (df2.row[1] / df2.row[0]) * df3[0]
If df1.row[1] != df1.row[0], then df3[1] = df3[0]
First DataFrame = df1
Date 1 2 3 4 5
0 2000-12-05 PXX.TO MX.TO CAE.TO HRX.TO FR.TO
1 2000-12-06 PXX.TO MX.TO CAE.TO HRX.TO FR.TO
2 2000-12-07 FTS.TO MX.TO CAE.TO HRX.TO FR.TO
3 2000-12-08 FTS.TO MX.TO CAE.TO HRX.TO FR.TO
4 2000-12-09 FTS.TO G.TO CAE.TO HRX.TO TB.TO
5 2000-12-10 FTS.TO G.TO KYU.TO HRX.TO TB.TO
6 2000-12-11 FTS.TO G.TO KYU.TO HRX.TO TB.TO
7 2000-12-12 BAM-A.TO G.TO KYU.TO HRX.TO TB.TO
8 2000-12-13 BAM-A.TO PLI.TO KYU.TO HRX.TO TB.TO
9 2000-12-14 BAM-A.TO PLI.TO KYU.TO HRX.TO TB.TO
10 2000-12-15 BAM-A.TO PLI.TO KYU.TO HRX.TO TB.TO
Second DataFrame = df2
Date 1 2 3 4 5
0 2000-12-05 2.3 60.10 2.30 34.98 35.00
1 2000-12-06 2.35 60.70 2.38 35.43 35.01
2 2000-12-07 56.76 61.31 2.46 35.89 35.02
3 2000-12-08 57.33 61.92 2.54 36.35 35.04
4 2000-12-09 57.90 100.20 2.63 36.83 300.90
5 2000-12-10 58.48 101.00 69.56 37.30 304.18
6 2000-12-11 59.07 101.81 70.46 37.78 307.50
7 2000-12-12 4.50 102.62 71.37 38.27 310.85
8 2000-12-13 4.54 44.50 72.29 38.77 314.24
9 2000-12-14 4.57 45.39 73.23 39.27 317.66
10 2000-12-15 4.61 46.30 74.18 39.78 321.12
Desired Output = df3
Date 1 2 3 4 5
0 2000-12-05 1.0000 1.0000 1.0000 1.0000 1.0000
1 2000-12-06 1.0200 1.0100 1.0340 1.0129 1.0003
2 2000-12-07 1.0200 1.0201 1.0692 1.0260 1.0007
3 2000-12-08 1.0302 1.0303 1.1055 1.0393 1.0010
4 2000-12-09 1.0405 1.0303 1.1431 1.0528 1.0010
5 2000-12-10 1.0509 1.0385 1.1431 1.0664 1.0119
6 2000-12-11 1.0614 1.0469 1.1579 1.0802 1.0230
7 2000-12-12 1.0614 1.0552 1.1729 1.0941 1.0341
8 2000-12-13 1.0699 1.0552 1.1880 1.1083 1.0454
9 2000-12-14 1.0785 1.0763 1.2034 1.1226 1.0568
10 2000-12-15 1.0871 1.0979 1.2190 1.1371 1.0683
Below are the formulas for the values in df3, column 1:
df3.row[0] = 1
df3.row[1] = (2.35/2.30) * 1 = 1.0200
df3.row[2] = (56.76/56.76) * 1.0200 = 1.0200
df3.row[3] = (57.33/56.76) * 1.0200 = 1.0302
df3.row[4] = (57.90/57.33) * 1.0302 = 1.0405
df3.row[5] = (58.48/57.90) * 1.0405 = 1.0509
df3.row[6] = (59.07/58.48) * 1.0509 = 1.0614
df3.row[7] = (4.50/4.50) * 1.0614 = 1.0614
df3.row[8] = (4.54/4.50) * 1.0614 = 1.0699
df3.row[9] = (4.57/4.54) * 1.0699 = 1.0785
df3.row[10] = (4.61/4.57) * 1.0785 = 1.0871
Below is what I have so far. I am not confident that this is the best approach.
StartFromDay = 1
NumOfHoldings = 10
df3 = pd.DataFrame(columns=np.arange(1, NumOfHoldings + 1))
df3.index.names = ['Date']

for col in df1.columns:
    # First row should equal 1
    df3.iloc[0][col] = 1
    for i in range(StartFromDay, len(df1)):
        # first row of each column
        prevrow = df1.iloc[0][col]
        if df1.iloc[i][col] == prevrow:
            ###### If statements to calculate compound return #######
Loops are slow, so we'll do it in a vectorized way. First, set the indexes appropriately:
df1.set_index('Date', inplace=True)
df2.set_index('Date', inplace=True)
Next, generate a boolean mask which is True wherever the symbol is the same:
same_stock = df1.iloc[1:].values == df1.iloc[:-1].values
We have to use values because the shifted series are not aligned on the index anymore.
And make a matrix with all the df2.row[1]/df2.row[0] values:
ret = df2.iloc[1:].values / df2.iloc[:-1].values
Next, replace the returns where the symbol changed:
ret[~same_stock] = 1 # pretend return is flat when symbol changed
Now create a DataFrame with the result:
simpret = pd.DataFrame(np.vstack(([1, 1, 1, 1, 1], ret)), index=df1.index)
df3 = simpret.cumprod()
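As a quick self-contained check of those steps on a made-up single-column example (symbols and prices taken from the question's column 1, shortened to four rows):

```python
import numpy as np
import pandas as pd

idx = pd.Index(['2000-12-05', '2000-12-06', '2000-12-07', '2000-12-08'], name='Date')
df1 = pd.DataFrame({1: ['PXX.TO', 'PXX.TO', 'FTS.TO', 'FTS.TO']}, index=idx)
df2 = pd.DataFrame({1: [2.30, 2.35, 56.76, 57.33]}, index=idx)

# True where the symbol is unchanged from the previous row
same_stock = df1.iloc[1:].values == df1.iloc[:-1].values
# day-over-day price ratios
ret = df2.iloc[1:].values / df2.iloc[:-1].values
# flat return on days where the symbol switched
ret[~same_stock] = 1

# prepend a row of ones and compound
simpret = pd.DataFrame(np.vstack((np.ones(ret.shape[1]), ret)), index=df1.index)
df3 = simpret.cumprod()
print(df3.round(4))
```

The day the symbol flips from PXX.TO to FTS.TO contributes a flat return, matching the df3.row[2] formula in the question.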