Replacing the day number for each value in a data frame column - python-3.x

I'm trying to replace the day number values within a datetime column using the values from another column.
This is my dataframe:
    ID  Code  Day_to_replace   Base_date
0  123   403              28  22/02/2013
1  456   402              21  22/03/2011
2  789   401              14  01/05/2017
and this is what I want to end up with:
    ID  Code  Day_to_replace   Base_date    New_Date
0  123   403              28  22/02/2013  28/02/2013
1  456   402              21  22/03/2011  21/03/2011
2  789   401              14  01/05/2017  14/05/2017
I can do this using a static value but can't work out how to use a value from another column to apply to each record.
newdf['New_Date'] = newdf['Base_date'].apply(lambda x: x.replace(day=1))
Thanks
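For reference, the example frame can be rebuilt like this (a minimal sketch of the data shown above):
import pandas as pd

df = pd.DataFrame({'ID': [123, 456, 789],
                   'Code': [403, 402, 401],
                   'Day_to_replace': [28, 21, 14],
                   'Base_date': ['22/02/2013', '22/03/2011', '01/05/2017']})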

First convert values to datetimes:
df['Base_date'] = pd.to_datetime(df['Base_date'], format='%d/%m/%Y')
Use DataFrame.apply with axis=1 to loop over the rows (a Python-level loop, so it can be slow on large frames):
df['New_Date'] = df.apply(lambda x: x['Base_date'].replace(day=x['Day_to_replace']), axis=1)
Or convert the datetimes to monthly periods and back to timestamps to get the first day of each month, then add day offsets (the day number minus 1) via to_timedelta:
df['New_Date'] = (df['Base_date'].dt.to_period('m').dt.to_timestamp() +
                  pd.to_timedelta(df['Day_to_replace'].sub(1), unit='d'))
Or build strings from the year-month part plus the day number and convert back to datetimes:
df['New_Date'] = pd.to_datetime(df['Base_date'].dt.strftime('%Y-%m-') +
                                df['Day_to_replace'].astype(str))
print(df)
    ID  Code  Day_to_replace  Base_date   New_Date
0  123   403              28 2013-02-22 2013-02-28
1  456   402              21 2011-03-22 2011-03-21
2  789   401              14 2017-05-01 2017-05-14
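Note that datetime.replace raises a ValueError if the day does not exist in the month (e.g. day 30 in February). A minimal sketch of one way to guard against that, clipping the day to each month's length first (assuming the same columns as above):
# clip the replacement day to the number of days in each Base_date month
safe_day = df['Day_to_replace'].clip(upper=df['Base_date'].dt.days_in_month)
df['New_Date'] = pd.to_datetime(df['Base_date'].dt.strftime('%Y-%m-') + safe_day.astype(str))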

Related

How to aggregate Python Pandas dataframe such that value of a variable corresponds to the row a variable is selected in aggfunc?

I have the following data
ID       DATE  AGE  COUNT
 1        NaT   16      1
 1 2021-06-06   19      2
 1 2020-01-05   20      3
 2        NaT   23      3
 2        NaT   16      3
 2 2019-02-04   36     12
I want to aggregate this so that DATE is the first valid (non-null) date per ID, with AGE extracted from the same row that date is selected from, and COUNT aggregated as the minimum. The output should be:
ID       DATE  AGE  COUNT
 1 2021-06-06   19      1
 2 2019-02-04   36      3
My code, which gives the error TypeError: Must provide 'func' or named aggregation **kwargs., is:
df_agg = pd.pivot_table(df, index=['ID'],
                        values=['DATE', 'AGE'],
                        aggfunc={'DATE': np.min, 'AGE': None, 'COUNT': np.min})
I don't want to use 'AGE': np.min since for ID=1 it would extract AGE=16, which is not what I want.
///////////// Edits ///////////////
Edits made to provide a more generic example.
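To make the answers below reproducible, the sample frame (with NaT for the missing dates) can be rebuilt as a quick sketch:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'DATE': pd.to_datetime([None, '2021-06-06', '2020-01-05',
                                           None, None, '2019-02-04']),
                   'AGE': [16, 19, 20, 23, 16, 36],
                   'COUNT': [1, 2, 3, 3, 3, 12]})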
You can try .first_valid_index():
x = df.loc[df.groupby("ID").apply(lambda x: x["DATE"].first_valid_index())]
print(x)
Prints:
   ID       DATE  AGE
1   1 2021-06-06   19
5   2 2019-02-04   36
EDIT: using .pivot_table(), you can extract "DATE"/"AGE" together as a list; for "COUNT" you can use np.min or "min". The second step is to explode the "DATE"/"AGE" list into separate columns:
df_agg = pd.pivot_table(
    df,
    index=["ID"],
    values=["DATE", "AGE", "COUNT"],
    aggfunc={
        "DATE": lambda x: df.loc[x.first_valid_index()][["DATE", "AGE"]].tolist(),
        "COUNT": "min",
    },
)
df_agg[["DATE", "AGE"]] = pd.DataFrame(df_agg["DATE"].apply(pd.Series))
print(df_agg)
Prints:
    COUNT        DATE  AGE
ID
1       1  2021-06-06   19
2       3  2019-02-04   36
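As a side note, apply(pd.Series) can be slow on large frames; the same expansion can be built from the lists directly (a sketch, assuming df_agg as constructed above):
df_agg[["DATE", "AGE"]] = pd.DataFrame(df_agg["DATE"].tolist(), index=df_agg.index)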
You can sort the values and drop duplicates (the sort_index is optional):
df.sort_values(['DATE']).drop_duplicates('ID').sort_index()
   ID       DATE  AGE
1   1 2021-06-06   19
5   2 2019-02-04   36
With groupby and transform:
df[df['DATE'] == df.groupby("ID")['DATE'].transform('min')]
Assuming you have an index, a simple solution would be:
def min_val(group):
    # keep the row with the smallest DATE in the group
    return group.loc[group.DATE.idxmin()]
df.groupby(['ID']).apply(min_val)
If you do not have an index you can use:
df.reset_index().groupby(['ID']).apply(min_val).drop(columns=['ID'])

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
         Day  Accumulative Number
0   9/1/2020                  100
1  11/1/2020                  102
2  18/1/2020                   98
3  11/2/2020                  105
4  24/2/2020                   95
5   6/3/2020                  120
6  13/3/2020                  100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that exact day, so I am trying to get the latest day before the 12th of each month. The result for the above would be:
         Day  Accumulative Number
0  11/1/2020                  102
1  11/2/2020                  105
2   6/3/2020                  120
What I managed to do so far is to just get the latest day of each month:
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")  # pd.datetime was removed in pandas 1.0
df = pd.read_csv("Accumulative.csv", quotechar="'", usecols=["Day", "Accumulative Number"],
                 index_col=False, parse_dates=["Day"], date_parser=dateparse,
                 na_values=['.', '??'])
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print(df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
         Day  Accumulative Number
1 2020-01-18                   98
2 2020-02-24                   95
3 2020-03-13                  100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
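For testing the answers below, the sample data can be set up like this (a minimal sketch, skipping the CSV step):
import pandas as pd

df = pd.DataFrame({'Day': ['9/1/2020', '11/1/2020', '18/1/2020', '11/2/2020',
                           '24/2/2020', '6/3/2020', '13/3/2020'],
                   'Accumulative Number': [100, 102, 98, 105, 95, 120, 100]})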
You can do groupby with factorize:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x: x.factorize()[0]) == n
df_sub = df[m].copy()
You can try filtering the dataframe where the days are less than 12, then take the last row of each group (grouped by month):
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
   .groupby([df['Day'].dt.year, df['Day'].dt.month], sort=False).last()
   .reset_index(drop=True))
         Day  Accumulative_Number
0 2020-01-11                  102
1 2020-02-11                  105
2 2020-03-06                  120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
         Day  Accumulative Number
1 2020-01-11                  102
3 2020-02-11                  105
5 2020-03-06                  120
Here's another way, expanding the date range to daily frequency (note that the forward fill reports the balance on the 11th itself, so the March date differs from the answers above):
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# forward-fill missing values (the accumulative column is assumed renamed to 'Number' here)
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
        Date  Number
0 2020-01-11   102.0
1 2020-02-11   105.0
2 2020-03-11   120.0

How to take only the maximum date value if there are two dates in a week in a dataframe

I have a dataframe called Data:
      Date  Value Frequency
06/01/2020    256         A
07/01/2020    235         A
14/01/2020     85         Q
16/01/2020    625         Q
22/01/2020    125         Q
Here it can be seen that 06/01/2020 and 07/01/2020 fall in the same week (Monday and Tuesday).
Therefore I want to take the maximum date from each week.
My final dataframe should look like this:
      Date  Value Frequency
07/01/2020    235         A
16/01/2020    625         Q
22/01/2020    125         Q
I want the maximum date from each week, as shown in the final dataframe example above.
I am new to Python and have been searching for an answer without success; please help.
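(To reproduce the examples below, a quick sketch of the frame:)
import pandas as pd

df = pd.DataFrame({'Date': ['06/01/2020', '07/01/2020', '14/01/2020',
                            '16/01/2020', '22/01/2020'],
                   'Value': [256, 235, 85, 625, 125],
                   'Frequency': ['A', 'A', 'Q', 'Q', 'Q']})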
First convert the column to datetimes with to_datetime, then use DataFrameGroupBy.idxmax to get the row with the maximum datetime per week (weeks built with Series.dt.strftime), and finally select those rows with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
print(df['Date'].dt.strftime('%Y-%U'))
0    2020-01
1    2020-01
2    2020-02
3    2020-02
4    2020-03
Name: Date, dtype: object
df = df.loc[df.groupby(df['Date'].dt.strftime('%Y-%U'))['Date'].idxmax()]
print(df)
        Date  Value Frequency
1 2020-01-07    235         A
3 2020-01-16    625         Q
4 2020-01-22    125         Q
If the format of the datetimes cannot be changed:
d = pd.to_datetime(df['Date'], dayfirst=True)
df = df.loc[d.groupby(d.dt.strftime('%Y-%U')).idxmax()]
print(df)
        Date  Value Frequency
1 07/01/2020    235         A
3 16/01/2020    625         Q
4 22/01/2020    125         Q
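Note that %U numbers weeks starting on Sunday; if ISO (Monday-start) weeks are wanted instead, a sketch using isocalendar (pandas >= 1.1) that gives the same grouping for this data:
d = pd.to_datetime(df['Date'], dayfirst=True)
iso = d.dt.isocalendar()  # DataFrame with year/week/day columns
df = df.loc[d.groupby([iso['year'], iso['week']]).idxmax()]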

How to get the total count for a certain date from a dataframe having a datetime column

I am new to pandas DataFrames.
From MySQL I have loaded the following dataset into a DataFrame. How do I get the total count for a particular date in Jupyter? Also, how do I set up a date picker widget in Jupyter so that selecting a date range in the calendar shows the total count for that range?
To be more specific:
1) Get the total count for today's date (by inputting only the date) from the RegistrationDate column
2) Get the total count for the last 7 days (by inputting only the date) from the RegistrationDate column
3) Get the total count by selecting a date range from the date picker widget, on the RegistrationDate column
    No     RegistrationDate
0    7  2019-07-23 12:23:25
1    9  2019-07-23 03:23:25
2   11  2019-07-23 08:10:10
3   13  2019-07-22 09:23:25
4   15  2019-07-22 04:01:02
5   17  2019-07-21 12:23:25
6   19  2019-07-20 12:23:25
7   21  2019-07-19 12:23:25
8   67  2019-06-04 12:23:25
9   68  2019-06-05 12:23:25
10  69  2019-06-06 12:23:25
First index by date
Set the index to 'RegistrationDate' using
df.set_index('RegistrationDate', inplace=True)
Objective 1
Get user input for the date:
today = input('Enter a date: ')  # e.g. 2019-07-22 04:01:02
count1 = df.loc[today]
will return the row with No = 15.
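To count all rows falling on a calendar day rather than matching one exact timestamp, partial string indexing on a DatetimeIndex works; a minimal sketch, assuming the index was set as above:
df.index = pd.to_datetime(df.index)      # make sure the index is a DatetimeIndex
count_today = len(df.loc['2019-07-23'])  # all rows on that calendar day -> 3 in the sample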
Objective 3
Ensure that df['RegistrationDate'] is datetime dtype (if you set it as the index in Objective 1, call df.reset_index() first):
df['RegistrationDate'] = pd.to_datetime(df['RegistrationDate'])
Get user inputs for the start and end dates:
start_date = input("start date:\t")
end_date = input("end date:\t")
Create a Boolean mask, making sure the input dates are datetime.datetime objects, datetime strings, or pd.Timestamp:
mask = (df['RegistrationDate'] > start_date) & (df['RegistrationDate'] <= end_date)
Re-assign this to a temp_df and sum the 'No' column:
temp_df = df.loc[mask]
total_in_range = temp_df['No'].sum()
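Objective 2 (last 7 days) follows the same masking pattern; a sketch, assuming the window should end at the current date:
today = pd.Timestamp.now().normalize()   # midnight today
week_ago = today - pd.Timedelta(days=7)
mask = (df['RegistrationDate'] > week_ago) & (df['RegistrationDate'] <= today)
last_7_days_count = mask.sum()           # number of registrations in the window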

Find occurrences of conditional value from one column and count values from another column in a dataframe

I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to group by the userIds and keep those where X is greater than 3 for more than 3 weeks.
I have tried using groupby and lambda in pandas but I am stuck:
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds  Week   X
    123    14   3
    123    15   4
    123    16   7
    123    17   2
    123    18   1
    456    14   4
    456    15   5
    456    16  11
    456    17   2
    456    18   6
The result I am aiming for is a dataframe containing user 456 and the number of weeks in which the condition occurred.
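(Rebuilding the sample for experimentation, a quick sketch:)
import pandas as pd

df = pd.DataFrame({'UserIds': [123]*5 + [456]*5,
                   'Week': [14, 15, 16, 17, 18]*2,
                   'X': [3, 4, 7, 2, 1, 4, 5, 11, 2, 6]})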
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1, 'UserIds'])]
Get the counts of values greater than 3 with an aggregate sum, then filter for counts greater than 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print(out)
   UserIds  count
0      456      4
