I have a CSV file with a publication date column, and the name of the file contains a date. I would like to compare the two.
name of my file: export_mb_cci_bayonne_2022_11_01
I did this:
import pandas as pd
import datetime as dt
from datetime import datetime
df = pd.DataFrame({'city': ['bayonne', 'anglet', 'biarritz'], 'Publication_date': ['2022-10-31T11:00:04.083Z', '2021-12-03T01:00:00.000Z', '2022-09-25T11:00:04.083Z']})
df[['Publication_date','time']] = df['Publication_date'].str.split('T',expand=True)
df['Publication_date'] = pd.to_datetime(df['Publication_date']).dt.date
dateFile = directory.split("export_mb_cci_bayonne_")[-1].split(".")[0]
dateFile_str = dateFile
fichierDate = datetime.strptime(dateFile_str, '%Y_%m_%d').date()
### INSERT COLUMNS ###
df['Date_file'] = fichierDate
I obtain this result:
city Publication_date time Date_file
0 bayonne 2022-10-31 11:00:04.083Z 2022-11-01
1 anglet 2021-12-03 01:00:00.000Z 2022-11-01
2 biarritz 2022-09-25 11:00:04.083Z 2022-11-01
I would like to obtain:
city Publication_date time Date_file Type
0 bayonne 2022-10-31 11:00:04.083Z 2022-11-01 new publication
1 anglet 2021-12-03 01:00:00.000Z 2022-11-01 old publication
2 biarritz 2022-09-25 11:00:04.083Z 2022-11-01 old publication
New publication = all publications between 2022-10-01 and 2022-10-31
Old publication = all publications before 2022-10-01
The problem is that I will have a new file each month, so my Date_file will change (for example, 2022-12-01 next month), and I will compare it with my new Publication_date column.
I tried to do the difference between column publication_date and Date_file and obtain this result:
df['diff']= df['Publication_date'] - df['Date_file']
city Publication_date time Date_file diff
0 bayonne 2022-10-31 11:00:04.083Z 2022-11-01 -1 days
1 anglet 2021-12-03 01:00:00.000Z 2022-11-01 -333 days
2 biarritz 2022-09-25 11:00:04.083Z 2022-11-01 -37 days
And now I am lost, I tried this but I know it is wrong...
if '-30 days' < df['diff'] <= '-1 days':
### INSERT COLUMNS ###
df['test'] = 'new publication'
else:
df['test'] = 'old publication'
I am new to Python and I am lost with the datetime format...
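One way to approach the classification described above, offered as a minimal sketch rather than a definitive answer: derive the month window from the file date instead of hard-coding it, then label the rows with numpy.where. The sketch assumes the file name always ends in YYYY_MM_DD and that "new" means published in the calendar month before the file date. The names file_name, window_start and window_end are illustrative and not from the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["bayonne", "anglet", "biarritz"],
    "Publication_date": [
        "2022-10-31T11:00:04.083Z",
        "2021-12-03T01:00:00.000Z",
        "2022-09-25T11:00:04.083Z",
    ],
})

# Assumption: the file name always ends with a YYYY_MM_DD date.
file_name = "export_mb_cci_bayonne_2022_11_01"
file_date = pd.to_datetime(file_name[-10:], format="%Y_%m_%d")

# Keep only the calendar date (first 10 characters) of each timestamp string.
pub_date = pd.to_datetime(df["Publication_date"].str[:10], format="%Y-%m-%d")

# Window = the calendar month before the file's month.
window_end = file_date.replace(day=1)                # exclusive upper bound
window_start = window_end - pd.DateOffset(months=1)  # inclusive lower bound

is_new = (pub_date >= window_start) & (pub_date < window_end)
df["Date_file"] = file_date.date()
df["Type"] = np.where(is_new, "new publication", "old publication")
print(df[["city", "Type"]])
```

Because the window is computed from file_date, next month's file (for example one ending in 2022_12_01) moves the window to November without any code change. Rows dated on or after window_end would also be labelled "old publication" here, so a third label may be needed if that case can occur.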
I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas, Python or do I have to use SQL logic in my script? Is there an easier way I am missing out in order to get the "balance" as per the 11th day of each month?
You can do groupby with factorize
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe to the rows where the day of the month is less than 12, then take the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range to every calendar day:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0
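The snippet above refers to df2 and a Number column that are not defined in the question. A self-contained sketch of the same reindex-and-forward-fill idea, assuming the question's sample data with the value column renamed to Number, could look like this:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Day": ["9/1/2020", "11/1/2020", "18/1/2020", "11/2/2020",
            "24/2/2020", "6/3/2020", "13/3/2020"],
    "Number": [100, 102, 98, 105, 95, 120, 100],
})
df2["Day"] = pd.to_datetime(df2["Day"], format="%d/%m/%Y")

# One value per recorded day, indexed by date.
daily = df2.set_index("Day")["Number"]

# Expand to every calendar day and carry the last known value forward.
all_days = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(all_days).ffill()

# Balance as of the 11th of each month.
balance_on_11th = daily[daily.index.day == 11]
print(balance_on_11th)
```

Note that this reports the balance as of the 11th, so March appears as 2020-03-11 with the value carried over from 2020-03-06, rather than the date of the last actual entry.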
I have a pandas dataframe with dates and users which looks like this-
date = ['1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user =['A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
I am trying to find all dates that are next to each other (e.g. Jan-1 and Jan-2) and convert them to a single date, so that both entries become the earlier of the two. There are over a million entries. The data comes from a scan that is triggered nightly and sometimes runs over into the next day.
Update-
I wanted to consolidate the date of the scan so that I can show the visualization properly. Right now the results have more entries on the day the scan starts but very few entries for the day into which the scan overflowed. There is a primary date and time stored, so I am not losing data. The User column is there because the scan reads a file with all the usernames, and the Date column stores the date when each one was scanned.
So far I was able to read the dataframe and then sort it based on the date to have the entries one after the other.
The output should look like the following -
Is there a Pythonic way of doing this?
One issue to consider is the case of multiple consecutive days and how you want to handle these. The following code sets the day to the first of the consecutive days in each block:
import pandas as pd
from datetime import timedelta
# prepend two dates to show multiple consecutive days "use-case"
date = ['12/31/2019','1/1/2020','1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user = ['Z','Z','A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
# first convert to datetime to allow date operations
df.Date = pd.to_datetime(df.Date)
# check if the date is one day after the row before (by shifting the Date column)
df['isConsecutive'] = (df.Date == df.Date.shift()+pd.DateOffset(1))
# get number of consecutive days in each block
df['numConsecutive'] = df.isConsecutive.groupby((~df.isConsecutive).cumsum()).cumsum()
# convert to timedelta
df.numConsecutive = df.numConsecutive.apply(lambda x: timedelta(days=x))
# take this as difference to Date
df['NewDate'] = df.Date - df.numConsecutive
print(df)
This returns:
Date User isConsecutive numConsecutive NewDate
0 2019-12-31 Z False 0 days 2019-12-31
1 2020-01-01 Z True 1 days 2019-12-31
2 2020-01-02 A True 2 days 2019-12-31
3 2020-01-09 B False 0 days 2020-01-09
4 2020-01-10 C True 1 days 2020-01-09
5 2020-01-17 B False 0 days 2020-01-17
6 2020-01-18 A True 1 days 2020-01-17
7 2020-01-24 A False 0 days 2020-01-24
8 2020-01-25 B True 1 days 2020-01-24
9 2019-05-17 C False 0 days 2019-05-17
10 2019-05-18 A True 1 days 2019-05-17
11 2019-05-24 A False 0 days 2019-05-24
12 2019-05-29 B False 0 days 2019-05-29
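If the same date can appear on many rows, which seems likely with over a million entries, the shift-based equality check above would treat a repeated date as a break. A possible variant, sketched here under the assumption that sorting the rows by date is acceptable, starts a new block only when the gap to the previous row exceeds one day and then gives every row the earliest date of its block. The names block and NewDate are illustrative.

```python
import pandas as pd

date = ['1/2/2020', '1/9/2020', '1/10/2020', '1/17/2020', '1/18/2020', '1/24/2020',
        '1/25/2020', '5/17/2019', '5/18/2019', '5/24/2019', '5/29/2019']
user = ['A', 'B', 'C', 'B', 'A', 'A', 'B', 'C', 'A', 'A', 'B']
df = pd.DataFrame({"Date": pd.to_datetime(date, format="%m/%d/%Y"), "User": user})

# Sort so that neighbouring rows are neighbouring dates.
df = df.sort_values("Date", ignore_index=True)

# A new block starts wherever the gap to the previous row is more than one day;
# repeated dates (gap of zero) stay in the current block.
block = df["Date"].diff().gt(pd.Timedelta(days=1)).cumsum()

# Every row in a block gets the earliest date of that block.
df["NewDate"] = df.groupby(block)["Date"].transform("min")
print(df)
```

As in the answer above, a run of three or more consecutive days collapses onto the first day of the run.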
I have a dataframe with a date column. The duration is 365 days starting from 02/11/2017 and ending at 01/11/2018.
Date
02/11/2017
03/11/2017
05/11/2017
.
.
01/11/2018
I want to add an adjacent column called Day_Of_Year as follows:
Date Day_Of_Year
02/11/2017 1
03/11/2017 2
05/11/2017 4
.
.
01/11/2018 365
I apologize if it's a very basic question, but unfortunately I haven't been able to start with this.
I could use datetime(), but that would return values such as 1 for 1st january, 2 for 2nd january and so on.. irrespective of the year. So, that wouldn't work for me.
First convert the column with to_datetime, then subtract the start date, convert to days and add 1:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(pd.Timestamp('2017-11-02')).dt.days + 1
print (df)
Date Day_Of_Year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
Or subtract the first value of the column:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(df['Date'].iat[0]).dt.days + 1
print (df)
Date Day_Of_Year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
Using strftime with '%j'
s=pd.to_datetime(df.Date,dayfirst=True).dt.strftime('%j').astype(int)
s-s.iloc[0]
Out[750]:
0 0
1 1
2 3
Name: Date, dtype: int32
#df['new']=s-s.iloc[0]
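One caveat with this approach, as far as I can tell: the day-of-year number restarts at 1 every January, so the difference turns negative for dates in the following year, which the three-row output above does not show. A small sketch comparing it with plain date subtraction on the question's four sample dates (Series.dt.dayofyear is used here as the integer equivalent of '%j'):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["02/11/2017", "03/11/2017", "05/11/2017", "01/11/2018"]})
dates = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Difference of day-of-year numbers: wraps around at the year boundary.
by_day_number = dates.dt.dayofyear - dates.dt.dayofyear.iloc[0]

# Difference of the dates themselves: keeps counting across years.
by_timedelta = (dates - dates.iloc[0]).dt.days

print(pd.DataFrame({"Date": dates,
                    "by_day_number": by_day_number,
                    "by_timedelta": by_timedelta}))
```

The last row is where the two columns diverge, which is why the subtraction-based answers add 1 to the timedelta in days instead of comparing day-of-year numbers.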
pandas has dayofyear. Put your column in the right format with pd.to_datetime and then apply Series.dt.dayofyear. Lastly, use some modulo arithmetic to express everything relative to your original date:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['day of year'] = df['Date'].dt.dayofyear - df['Date'].dt.dayofyear[0] + 1
df['day of year'] = df['day of year'] + 365*((365 - df['day of year']) // 365)
Output
Date day of year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
But I'm doing essentially the same as Jezrael in more lines of code, so my vote goes to her/him
I have a dataframe that looks like:
import pandas as pd
import datetime as dt
df= pd.DataFrame({'date':['2017-12-31','2017-12-31'],'type':['Asset','Liab'],'Amount':[100,-100],'Maturity Date':['2019-01-02','2018-01-01']})
df
I am trying to build a roll-off profile by checking if the 'Maturity Date' is greater than a 'date' in the future. I am trying to achieve something like:
#First Month
df1=df[df['Maturity Date']>'2018-01-31']
df1['date']='2018-01-31'
#Second Month
df2=df[df['Maturity Date']>'2018-02-28']
df2['date']='2018-02-28'
#third Month
df3=df[df['Maturity Date']>'2018-03-31']
df3['date']='2018-03-31'
#first quarter
qf1=df[df['Maturity Date']>'2018-06-30']
qf1['date']='2018-06-30'
#concatenate
df=pd.concat([df,df1,df2,df3,qf1])
df
I was wondering if there is a way to allow an arbitrarily long list of dates without repeating code.
I think you need numpy.tile to repeat the index, numpy.repeat to assign the new date column, and then boolean indexing to filter:
import numpy as np
d = '2017-12-31'
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])
#generate first month and next quarters
c1 = pd.date_range(d, periods=4, freq='M')
c2 = pd.date_range(c1[-1], periods=2, freq='Q')
#join together
c = c1.union(c2[1:])
#repeat rows by indexing with the repeated index
df1 = df.loc[np.tile(df.index, len(c))].copy()
#assign column by datetimes
df1['date'] = np.repeat(c, len(df))
#filter by boolean indexing
df1 = df1[df1['Maturity Date'] > df1['date']]
print (df1)
Amount Maturity Date date type
0 100 2019-01-02 2017-12-31 Asset
1 -100 2018-01-01 2017-12-31 Liab
0 100 2019-01-02 2018-01-31 Asset
0 100 2019-01-02 2018-02-28 Asset
0 100 2019-01-02 2018-03-31 Asset
0 100 2019-01-02 2018-06-30 Asset
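A similar result can be sketched with a cross join, which pairs every row with every reporting date before filtering. This is an alternative to the numpy.tile approach, not the method used above. It assumes pandas 1.2 or later (for how='cross') and uses a hand-written list of reporting dates, report_dates, purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2017-12-31", "2017-12-31"],
    "type": ["Asset", "Liab"],
    "Amount": [100, -100],
    "Maturity Date": ["2019-01-02", "2018-01-01"],
})
df["Maturity Date"] = pd.to_datetime(df["Maturity Date"])

# Hypothetical reporting dates; any list or date_range of any length works here.
report_dates = pd.to_datetime(
    ["2017-12-31", "2018-01-31", "2018-02-28", "2018-03-31", "2018-06-30"]
)

# Pair every position with every reporting date.
grid = df.drop(columns="date").merge(pd.DataFrame({"date": report_dates}), how="cross")

# Keep only positions that have not matured by the reporting date.
profile = (grid[grid["Maturity Date"] > grid["date"]]
           .sort_values(["date", "type"], ignore_index=True))
print(profile)
```

The cross join materialises len(df) * len(report_dates) rows before filtering, so for very large inputs it may be worth filtering in chunks.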
You could use a nifty tool in the Pandas arsenal called pd.merge_asof. It works similarly to pd.merge, except that it matches on "nearest" keys rather than equal keys. Furthermore, you can tell pd.merge_asof to look for nearest keys in only the backward or forward direction.
To make things interesting (and help check that things are working properly), let's add another row to df:
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
print(df)
# Amount Maturity Date date type
# 1 200 2018-03-15 2017-12-31 Asset
# 0 100 2019-01-02 2017-12-31 Asset
Now define some new dates:
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
.union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
# date
# 0 2018-01-31
# 1 2018-02-28
# 2 2018-03-31
# 3 2018-06-30
Now we can merge rows, matching nearest dates from result with Maturity Dates from df:
result = pd.merge_asof(result, df.drop('date', axis=1),
left_on='date', right_on='Maturity Date', direction='forward')
In this case we want to "match" dates with Maturity Dates which are greater, so we use direction='forward'.
Putting it all together:
import pandas as pd
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
.union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
result = pd.merge_asof(result, df.drop('date', axis=1),
left_on='date', right_on='Maturity Date', direction='forward')
result = pd.concat([df, result], axis=0)
result = result.sort_values(by=['Maturity Date', 'date'])
print(result)
yields
Amount Maturity Date date type
1 200 2018-03-15 2017-12-31 Asset
0 200 2018-03-15 2018-01-31 Asset
1 200 2018-03-15 2018-02-28 Asset
0 100 2019-01-02 2017-12-31 Asset
2 100 2019-01-02 2018-03-31 Asset
3 100 2019-01-02 2018-06-30 Asset
I am using a simple CSV file which contains data on calorie intake. It has 4 columns: cal, day, month, year. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I would therefore like to parse month, year, and day into a single column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)
My questions are: (1) How do I parse the data and (2) define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here) - but I can't make it work so far.
You can use the parse_dates parameter of read_csv and pass the column names as a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename the column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or, better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
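Recent pandas releases, to my knowledge, deprecate the column-combining forms of parse_dates, so an alternative worth knowing is to read the file normally and assemble the date afterwards. pd.to_datetime accepts a DataFrame whose columns are named year, month and day. A minimal sketch on the same sample data (csv_text stands in for the real file):

```python
import io
import pandas as pd

csv_text = """cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""

# After testing, replace io.StringIO(csv_text) with the file name.
df = pd.read_csv(io.StringIO(csv_text))

# Build one datetime column from the three integer columns.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Drop the parts and use the date as the index for time series work.
df = df.drop(columns=["year", "month", "day"]).set_index("date")
print(df)
print(df.index.dtype)
```

This also answers the second part of the question: the resulting column is a datetime64 column, so no separate type declaration is needed.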