How to merge month, day and year columns into a date column? - python-3.x

The date is in separate columns
Month Day Year
8 12 1993
8 12 1993
8 12 1993
I want to merge them into one column
Date
8/12/1993
8/12/1993
8/12/1993
I tried
df_date = df.Timestamp((df_filtered.Year*10000+df_filtered.Month*100+df_filtered.Day).apply(str),format='%Y%m%d')
I get this error
AttributeError: 'DataFrame' object has no attribute 'Timestamp'

Using pd.to_datetime with astype(str)
1. as string type:
df['Date'] = pd.to_datetime(df['Month'].astype(str) + df['Day'].astype(str) + df['Year'].astype(str), format='%m%d%Y').dt.strftime('%m/%d/%Y')
Month Day Year Date
0 8 12 1993 08/12/1993
1 8 12 1993 08/12/1993
2 8 12 1993 08/12/1993
2. as datetime type:
df['Date'] = pd.to_datetime(df['Month'].astype(str) + df['Day'].astype(str) + df['Year'].astype(str), format='%m%d%Y')
Month Day Year Date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
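One caveat with plain string concatenation: unpadded parts can be ambiguous (month 1 / day 12 and month 11 / day 2 both produce the string 1121993). A sketch that zero-pads each part first, assuming the same column names:

```python
import pandas as pd

df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})

# Zero-pad month and day so every concatenated string is unambiguously MMDDYYYY
padded = (df['Month'].astype(str).str.zfill(2)
          + df['Day'].astype(str).str.zfill(2)
          + df['Year'].astype(str))
df['Date'] = pd.to_datetime(padded, format='%m%d%Y')
```

With the padding in place, the format string matches every row regardless of how many digits the month or day has.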

Here is the solution:
df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})
# This way dates will be a DataFrame
dates = df.apply(lambda row:
                 pd.Series(pd.Timestamp(row['Year'], row['Month'], row['Day']),
                           index=['Date']),
                 axis=1)
# And this way dates will be a Series:
# dates = df.apply(lambda row:
#                  pd.Timestamp(row['Year'], row['Month'], row['Day']),
#                  axis=1)
The apply method builds a new Series or DataFrame by applying the provided function (a lambda in this case) to each row and joining the results.
You can read about the apply method in the official documentation.
And here is an explanation of lambda expressions.
EDIT:
@JohnClements suggested a better solution, using the pd.to_datetime method:
dates = pd.to_datetime(df).to_frame('Date')
Also, if you want your output to be a string, you can use
dates = df.apply(lambda row: f"{row['Year']}/{row['Month']}/{row['Day']}", axis=1)

You can try:
df = pd.DataFrame({'Month': [8,8,8], 'Day': [12,12,12], 'Year': [1993, 1993, 1993]})
df['date'] = pd.to_datetime(df)
Result:
Month Day Year date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
Month 3 non-null int64
Day 3 non-null int64
Year 3 non-null int64
date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 176.0 bytes
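If you also want the string form the asker showed (8/12/1993) rather than a datetime64 column, you can format the assembled dates back out with dt.strftime. Note that %m/%d zero-pads; the unpadded %-m/%-d codes exist but are platform-dependent. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})

# Assemble a real datetime64[ns] column from the three parts
df['date'] = pd.to_datetime(df)

# Then render it as a Month/Day/Year string, e.g. '08/12/1993'
df['date_str'] = df['date'].dt.strftime('%m/%d/%Y')
```

Keeping both columns is often useful: the datetime64 one for sorting and arithmetic, the string one for display.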

Related

Pandas strptime returns a string instead of DateTime object [duplicate]

This question already has answers here:
How to convert string to datetime format in pandas python?
(3 answers)
Closed 1 year ago.
I have the following pandas DataFrame:
data = pd.DataFrame({"id": [1, 2, 3, 4, 5],
"end_time": ["2016-01-13", "2016-01-01", "2016-11-12", "2016-01-17", "2016-03-13"]})
I want to transform the end_time column to a column of datetime objects. But when I do it like this (like it is suggested everywhere):
data["end"] = data["end_time"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
the output is still a string column:
id end_time end
0 1 2016-01-13 2016-01-13
1 2 2016-01-01 2016-01-01
2 3 2016-11-12 2016-11-12
3 4 2016-01-17 2016-01-17
4 5 2016-03-13 2016-03-13
How to solve this?
strftime is designed to return a string object (see the documentation for details).
If we want to convert end_time to datetime64[ns] and assign to new column named end then we can use:
data['end'] = pd.to_datetime(data.end_time)
strptime will also convert the string to datetime64[ns], but the to_datetime method is preferable.
data["end"] = data["end_time"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
data.info()
Output
id end_time end
0 1 2016-01-13 2016-01-13
1 2 2016-01-01 2016-01-01
2 3 2016-11-12 2016-11-12
3 4 2016-01-17 2016-01-17
4 5 2016-03-13 2016-03-13
Datatypes:
data.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5 non-null int64
1 end_time 5 non-null object
2 end 5 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 248.0+ bytes
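Once the column really is datetime64[ns], the .dt accessor becomes available, which is the practical payoff of the conversion (it does not work on a string column). A small sketch using the same data:

```python
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                     "end_time": ["2016-01-13", "2016-01-01", "2016-11-12",
                                  "2016-01-17", "2016-03-13"]})
data['end'] = pd.to_datetime(data['end_time'])

# Component access via .dt requires a datetime64 column
data['end_month'] = data['end'].dt.month
data['end_dow'] = data['end'].dt.day_name()
```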

Pandas Dataframe: Reduce the value of a 'Days' by 1 if the corresponding 'Year' is a leap year

If 'Days' is greater than e.g. 10 and the corresponding 'Year' is a leap year, reduce 'Days' by 1 in that particular row only. I tried some operations but couldn't do it. I am new to pandas. I appreciate any help.
sample data:
data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days', 'Year'])
I want 'Days' in row 5 to become 69 while everything else remains the same.
In [98]: import calendar
In [99]: data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
    ...: df = pd.DataFrame(data, columns=['Days', 'Year'])
In [100]: df = df.astype(int)
In [102]: df["New_Days"] = df.apply(lambda x: x["Days"] - 1 if (x["Days"] > 10 and calendar.isleap(x["Year"])) else x["Days"], axis=1)
In [103]: df
Out[103]:
Days Year New_Days
0 1 2005 1
1 2 2006 2
2 3 2008 3
3 50 2009 50
4 70 2008 69
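The same rule can also be written without a row-wise apply, using a boolean mask; a vectorized sketch assuming the same integer Days/Year columns:

```python
import calendar

import pandas as pd

data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days', 'Year']).astype(int)

# Boolean mask: Days > 10 AND Year is a leap year
mask = (df['Days'] > 10) & df['Year'].map(calendar.isleap)

# Subtract 1 only where the mask holds (True casts to 1, False to 0)
df['New_Days'] = df['Days'] - mask.astype(int)
```

On large frames this avoids calling a Python lambda per row, which apply does.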

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas, Python or do I have to use SQL logic in my script? Is there an easier way I am missing out in order to get the "balance" as per the 11th day of each month?
You can do groupby with factorize:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe to rows where the day is less than 12, then taking the last row of each group (grouped by month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way using expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0

Error during conversion column pandas data frame python 3

I have a big problem with pandas. I have an important data frame containing
Ref_id PRICE YEAR MONTH BRAND
100000 '5000' '2012' '4' 'FORD'
100001 '10000' '2015' '5' 'MERCEDES'
...
I want to convert my PRICE, YEAR and MONTH columns, but when I use .astype(int) or .apply(lambda x: int(x)) on a column I receive a ValueError. My data frame has 1.8 million rows.
ValueError: invalid literal for int() with base 10: 'PRICE'
So I don't understand why pandas wants to convert the name of the column.
Could you explain why?
Best,
C.
Try this:
In [59]: cols = 'PRICE YEAR MONTH'.split()
In [60]: cols
Out[60]: ['PRICE', 'YEAR', 'MONTH']
In [61]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [62]: df
Out[62]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012 4 FORD
1 100001 10000.0 2015 5 MERCEDES
2 100002 NaN 2016 6 AUDI
Reproducing your error:
In [65]: df
Out[65]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
2 100002 PRICE 2016 6 AUDI # pay attention at `PRICE` value !!!
In [66]: df['PRICE'].astype(int)
...
skipped
...
ValueError: invalid literal for int() with base 10: 'PRICE'
As @jezrael noted in a comment, most probably you have "bad" (unexpected) values in your data set.
You can use one of the following techniques in order to clean it up:
In [155]: df
Out[155]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
2 Ref_id PRICE YEAR MONTH BRAND
3 100002 15000 2016 5 AUDI
In [156]: df.dtypes
Out[156]:
Ref_id object
PRICE object
YEAR object
MONTH object
BRAND object
dtype: object
In [157]: df = df.drop(df.loc[df.PRICE == 'PRICE'].index)
In [158]: df
Out[158]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
3 100002 15000 2016 5 AUDI
In [159]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [160]: df
Out[160]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
3 100002 15000 2016 5 AUDI
In [161]: df.dtypes
Out[161]:
Ref_id object
PRICE int64
YEAR int64
MONTH int64
BRAND object
dtype: object
or simply:
In [159]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [165]: df
Out[165]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012.0 4.0 FORD
1 100001 10000.0 2015.0 5.0 MERCEDES
2 Ref_id NaN NaN NaN BRAND
3 100002 15000.0 2016.0 5.0 AUDI
and then .dropna(how='any') if you know that there were no NaN's in your original data set:
In [166]: df = df.dropna(how='any')
In [167]: df
Out[167]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012.0 4.0 FORD
1 100001 10000.0 2015.0 5.0 MERCEDES
3 100002 15000.0 2016.0 5.0 AUDI
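A side note: to_numeric(errors='coerce') forces any column containing NaN to float64, which is why the values above show as 5000.0. If you need to keep integers alongside missing values, the nullable Int64 extension dtype (available in modern pandas) can help; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'PRICE': ['5000', '10000', 'PRICE']})

# Coerce bad values to missing, then keep an integer dtype that tolerates them
df['PRICE'] = pd.to_numeric(df['PRICE'], errors='coerce').astype('Int64')
```

The bad row becomes pd.NA instead of NaN, and the valid rows stay integers rather than floats.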

Mixed date formats dd/mm/yyyy and d/m/y in pandas

I have a dataframe with mixed date formats in a column. Some of it is in the format dd/mm/yyyy and some in d/m/y. How can I convert the column to datetime, applying the appropriate format depending on the value of each cell?
I am reading from a csv file:
DayofWeek,Date
Friday,22/05/2015
Friday,10/2/12
Friday,10/10/14
Friday,21/10/2011
Friday,8/7/11
df = pd.read_csv('dates.txt', parse_dates=['Date'], dayfirst=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
DayofWeek 5 non-null object
Date 5 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 100.0+ bytes
print(df)
DayofWeek Date
0 Friday 2015-05-22
1 Friday 2012-02-10
2 Friday 2014-10-10
3 Friday 2011-10-21
4 Friday 2011-07-08
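For completeness, an approach that does not rely on read_csv's format inference: converting element-wise lets each value be parsed independently, so mixed widths like 22/05/2015 and 10/2/12 both resolve correctly with dayfirst=True. (In pandas 2.0+, the single call pd.to_datetime(col, format='mixed', dayfirst=True) is reported to do the same in one step.) A sketch:

```python
import pandas as pd

dates = pd.Series(['22/05/2015', '10/2/12', '8/7/11'])

# Parse each value on its own so the format is inferred per element;
# dayfirst=True resolves the d/m vs m/d ambiguity
parsed = dates.apply(lambda s: pd.to_datetime(s, dayfirst=True))
```

This is slower than a single vectorized call, but it is robust when the column genuinely mixes formats.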
