When reading a CSV file, the date column contains month names (e.g. Jul-20 for July 2020), and with parse_dates=True pandas converts them to the first of the month (01-07-2020). How can I force pandas to convert them to the end of the month instead (i.e. 31-07-2020)?
Thanks
Try using MonthEnd from pandas.tseries.offsets:
from pandas.tseries.offsets import MonthEnd
import pandas as pd
df = pd.DataFrame({'month': pd.to_datetime(['2020-07-01', '2020-08-02'])})
print(df)
#        month
# 0 2020-07-01
# 1 2020-08-02
df['month_end'] = df['month'] + MonthEnd(1)
print(df)
#        month  month_end
# 0 2020-07-01 2020-07-31
# 1 2020-08-02 2020-08-31
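One caveat worth noting: MonthEnd(1) always moves forward, so a date that is already a month end jumps to the next month's end. If your data might already contain month-end dates, MonthEnd(0) rolls forward only when needed. A small sketch:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame({'month': pd.to_datetime(['2020-07-01', '2020-07-31'])})
# MonthEnd(1) always advances: 2020-07-31 becomes 2020-08-31
df['plus_one'] = df['month'] + MonthEnd(1)
# MonthEnd(0) rolls forward only if the date is not already a month end
df['rolled'] = df['month'] + MonthEnd(0)
print(df)
```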
You can use the built-in calendar and datetime modules and write your own parsing function to apply, achieving the desired result.
import calendar
import datetime
import pandas as pd
def parse_my_date(date):
    # 'Jul-20' -> abbreviated month name, two-digit year
    date = datetime.datetime.strptime(date, '%b-%y')
    # last day of that month (handles leap years)
    last_day = calendar.monthrange(date.year, date.month)[1]
    date += datetime.timedelta(days=last_day - 1)
    return date
df['date'] = df['date'].apply(parse_my_date)
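To see how this handles the 'Jul-20' values from the question, here is a quick standalone check (assuming abbreviated month names and two-digit years, hence the '%b-%y' format):

```python
import calendar
import datetime

# assuming the CSV holds values like 'Jul-20' (abbreviated month, 2-digit year)
date = datetime.datetime.strptime('Jul-20', '%b-%y')   # -> 2020-07-01
last_day = calendar.monthrange(date.year, date.month)[1]  # -> 31
month_end = date.replace(day=last_day)
print(month_end.date())
```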
Related
How do I convert Excel date format to number in Python? I'm importing a number of Excel files into Pandas dataframe in a loop and some values are formatted incorrectly in Excel. For example, the number column is imported as date and I'm trying to convert this date value into numeric.
Original: 1912-04-26 00:00:00
New:      4500
How do I convert the date value in original to the numeric value in new? I know this code can convert numeric to date, but is there any similar function that does the opposite?
df.loc[0]['Date']= xlrd.xldate_as_datetime(df.loc[0]['Date'], 0)
I tried to specify the data type when I read in the files and also tried to simply change the data type of the column to 'float' but both didn't work.
Thank you.
I found that the number is the count of days since Excel's effective day zero, 1899-12-30 (Excel's 1900 date system counts 1900-01-01 as day 1 and also includes the nonexistent date 1900-02-29, which shifts the base back two days).
The following code calculates how many days passed from that base date until the given date.
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame(
    {
        'date': ['1912-04-26 00:00:00'],
    }
)
print(df)
#                   date
# 0  1912-04-26 00:00:00
def date_to_int(given_date):
    given_date = datetime.strptime(given_date, '%Y-%m-%d %H:%M:%S')
    # Excel's effective day zero: 1900-01-01 minus two days = 1899-12-30
    base_date = datetime(1900, 1, 1) - timedelta(days=2)
    delta = given_date - base_date
    return delta.days
df['date'] = df['date'].apply(date_to_int)
print(df)
# date
#0 4500
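The same conversion can also be done without apply, using vectorized datetime arithmetic (1899-12-30 here is the same base date as datetime(1900, 1, 1) - timedelta(days=2) above):

```python
import pandas as pd

s = pd.Series(['1912-04-26 00:00:00'])
# subtract Excel's effective day zero and keep whole days
serial = (pd.to_datetime(s) - pd.Timestamp('1899-12-30')).dt.days
print(serial)
```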
I am trying to parse data from a CSV file, sort it by date, and write the sorted dataframe to a new CSV file.
Say we have a very simple csv file with date entries following the pattern day/month/year:
Date,Reference
15/11/2020,'001'
02/11/2020,'002'
10/11/2020,'003'
26/11/2020,'004'
23/10/2020,'005'
I read the csv into a Pandas dataframe. When I attempt to order the dataframe based on the dates in ascending order I expect the data to be ordered as follows:
23/10/2020,'005'
02/11/2020,'002'
10/11/2020,'003'
15/11/2020,'001'
26/11/2020,'004'
Sadly, this is not what I get.
If I attempt to convert the date to datetime and then sort, some date entries are parsed as month/day/year (e.g. 02/11/2020 becomes 2020-02-11, i.e. February 11), which messes up the ordering:
date reference
2020-02-11 '002'
2020-10-11 '003'
2020-10-23 '005'
2020-11-15 '001'
2020-11-26 '004'
If I sort without converting to datetime, then the ordering is also wrong:
date reference
02/11/2020 '002'
10/11/2020 '003'
15/11/2020 '001'
23/10/2020 '005'
26/11/2020 '004'
Here is my code:
import pandas as pd
df = pd.read_csv('order_dates.csv',
header=0,
names=['date', 'reference'],
dayfirst=True)
df.reset_index(drop=True, inplace=True)
# df.date = pd.to_datetime(df.date)
df.sort_values(by='date', ascending=True, inplace=True)
print(df)
df.to_csv('sorted.csv')
Why is sorting by date so hard? Can someone explain why the above sorting attempts fail?
Ideally, I would like the sorted.csv to have the date entries in the day/month/year format.
Try:
df.loc[:, 'date'] = pd.to_datetime(df.loc[:, 'date'], format='%d/%m/%Y')
What you can do is specify how the dates should be parsed while reading the CSV file. Note that dayfirst=True (not a format string) is what tells pandas to read day/month/year:
>>> df = pd.read_csv('filename.csv', parse_dates=['Date'], dayfirst=True).sort_values(by='Date')
This will read your dates from csv and give you this output where dates are sorted.
Date Reference
4 2020-10-23 '005'
1 2020-11-02 '002'
2 2020-11-10 '003'
0 2020-11-15 '001'
3 2020-11-26 '004'
What's left now is simply to change the formatting to the desired one:
>>> df['Date'] = df['Date'].dt.strftime('%d/%m/%Y')
Keep in mind, however, that this converts Date back to a string (object):
>>> df
Date Reference
4 23/10/2020 '005'
1 02/11/2020 '002'
2 10/11/2020 '003'
0 15/11/2020 '001'
3 26/11/2020 '004'
>>> df.dtypes
Date         object
Reference    object
dtype: object
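Putting the whole thread together, here is a minimal end-to-end sketch (using an in-memory CSV in place of the file) that parses day-first, sorts, and writes the dates back out in day/month/year form:

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "Date,Reference\n"
    "15/11/2020,'001'\n02/11/2020,'002'\n10/11/2020,'003'\n"
    "26/11/2020,'004'\n23/10/2020,'005'\n"
)
df = pd.read_csv(csv_data)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')  # parse day-first
df = df.sort_values('Date')
df['Date'] = df['Date'].dt.strftime('%d/%m/%Y')  # back to day/month/year strings
print(df.to_string(index=False))
```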
I use the following code to move data from one Excel file to another:
import pandas as pd
inventory=pd.read_excel('Original_File.xlsx', skiprows=3)
inventory.to_excel('New_File.xlsx', index=False)
How do I place today's date in the first column of every row that contains data in New_File.xlsx?
Like this:
import pandas as pd
from datetime import datetime
inventory=pd.read_excel('Original_File.xlsx', skiprows=3)
inventory.insert(0, 'today_date', datetime.today().strftime('%Y-%m-%d'))
inventory.to_excel('New_File.xlsx', index=False)
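If you want Excel to treat the column as an actual date rather than text, a variation is to insert a real Timestamp instead of a preformatted string (stand-in data below, since the original file isn't available):

```python
import pandas as pd

inventory = pd.DataFrame({'item': ['a', 'b'], 'qty': [1, 2]})  # stand-in data
# a real datetime value; normalize() drops the time-of-day part
inventory.insert(0, 'today_date', pd.Timestamp('today').normalize())
print(inventory.dtypes)
```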
I have a df with three timestamp columns, e.g.:
X                        ...
01/01/2013 12:00:20 AM   ...
I have been trying to convert these columns to datetime for some further analysis.
When I run:
df.dtypes
each of these columns comes back as object. Since I read the data in from a CSV, they should be string objects.
When converting them to DateTime I have been using:
df['X'] = pd.to_datetime(df['X'])
and
df['X'] = df['X'].astype('datetime64[ns]')
But in every case, the kernel just keeps running and I am not getting anywhere... I want to be able to use these dates and times to calculate the difference between timestamp columns in minutes and such.
Any help would be greatly appreciated. Thank You.
Here is a full example that works for me. You can try it out in your own setup:
import pandas as pd
df = pd.DataFrame([["1/1/2016 12:00:20 AM", "3/1/2016"],
                   ["6/15/2016 4:00:20 AM", "7/14/2016"],
                   ["7/14/2016 11:00:20 AM", "8/15/2016"],
                   ["8/7/2016 00:00:20 AM", "9/6/2016"]],
                  columns=['X', 'Y'])
print(df)
#convert one column
df['X'] = pd.to_datetime(df['X'])
print(df)
#convert all columns
df[df.columns] = df[df.columns].apply(pd.to_datetime)
print(df)
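If the kernel seems to hang on a large frame, the usual culprit is pandas inferring the date format row by row; passing an explicit format= makes parsing much faster. Once the columns are datetimes, the difference in minutes the question asks about is plain arithmetic. A sketch with hypothetical start/end columns:

```python
import pandas as pd

df = pd.DataFrame({'start': ['1/1/2016 12:00:20 AM'],
                   'end': ['1/1/2016 1:30:20 AM']})
fmt = '%m/%d/%Y %I:%M:%S %p'  # explicit format avoids slow per-row inference
df['start'] = pd.to_datetime(df['start'], format=fmt)
df['end'] = pd.to_datetime(df['end'], format=fmt)
# timestamp difference as minutes
df['minutes'] = (df['end'] - df['start']).dt.total_seconds() / 60
print(df['minutes'])
```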
I have sales data from Jan 2014 until last week, and the data refreshes every day.
I want to generate some insights automatically comparing against the latest week, for example how much sales decreased/increased from last week to this week, and which product is hot, etc.
I am confused about how to determine the latest week dynamically.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Product': ['EyeWear', 'Packs', 'Watches', 'Irons', 'Glasses'],
'Country':['USA','India','Africa','UK','India'],
'Revenue':[98,90,87,69,78],
'Date':['20140101','20140102','20140103','20140104','20140105']},
index=[1,2,3,4,5])
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['week'] = df['Date'].dt.isocalendar().week  # Series.dt.week is deprecated
df['YearMonth'] = df['Date'].dt.strftime('%Y%m')
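Rather than storing the latest week anywhere, one option is to derive it from the data on each refresh by comparing weekly periods. A sketch with a few stand-in rows (dates chosen so two fall in one week and one in the next):

```python
import pandas as pd

df = pd.DataFrame({'Product': ['EyeWear', 'Packs', 'Watches'],
                   'Revenue': [98, 90, 87],
                   'Date': pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-10'])})
weeks = df['Date'].dt.to_period('W')
latest_week = df[weeks == weeks.max()]    # rows from the most recent week
prev_week = df[weeks == weeks.max() - 1]  # rows from the week before
# week-over-week revenue change
print(latest_week['Revenue'].sum() - prev_week['Revenue'].sum())
```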