How to split rows into columns in a dataframe - python-3.x

I have this dataframe (840 rows x 1 column):
0 151284 Apr 19 11:37 0-01-20200419063614
1 48054 Apr 21 12:50 0-01-20200421074934
2 187588 Apr 21 13:55 0-01-20200421085439
3 51584 Apr 21 14:37 0-01-20200421143636
4 63522 Apr 22 08:40 0-01-20200422083937
I want to convert this dataframe into a format like this:
id datetime size
151284 2020-04-19 11:37:00 0-01-20200419063614
. . .
The datetime should be in the format yyyy-mm-dd hh:mm:ss. So basically I want to split a single column into three columns and also combine the date and time into a single datetime column in a standard format.
Any help is appreciated.
EDIT: output of df.columns: Index(['col'], dtype='object')

Like this:
In [70]: df = pd.DataFrame({'col':['151284 Apr 19 11:37 0-01-20200419063614', '48054 Apr 21 12:50 0-01-20200421074934', '187588 Apr 21 13:55 0-01-20200421085439', '51584 Apr 21 14:37 0-01-20200421143636',
...: '63522 Apr 22 08:40 0-01-20200422083937']})
In [54]: df['id'] = df.col.str.split(' ').str[0]
In [55]: df['Datetime'] = df.col.str.split(' ').str[1] + ' ' + df.col.str.split(' ').str[2] + ' ' + df.col.str.split(' ').str[3]
In [57]: df['Size'] = df.col.str.split(' ').str[-1]
In [63]: from dateutil import parser
In [65]: def format_datetime(x):
    ...:     return parser.parse(x)
    ...:
In [67]: df['Datetime'] = df.Datetime.apply(format_datetime)
In [79]: df
Out[79]:
id Datetime Size
0 151284 2020-04-19 11:37:00 0-01-20200419063614
1 48054 2020-04-21 12:50:00 0-01-20200421074934
2 187588 2020-04-21 13:55:00 0-01-20200421085439
3 51584 2020-04-21 14:37:00 0-01-20200421143636
4 63522 2020-04-22 08:40:00 0-01-20200422083937
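Note that parser.parse fills in the current year when the string has none, so the output above shows 2020 only because it was run in 2020. Below is a minimal sketch of a variant that sets the year explicitly and splits the column only once. It assumes the last field always ends in a yyyymmddHHMMSS stamp whose year matches the listing date; that is an inference from the sample rows, not something the question states. The names parts, year, stamp and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col': ['151284 Apr 19 11:37 0-01-20200419063614',
                           '48054 Apr 21 12:50 0-01-20200421074934']})

# Split once into five whitespace-separated fields.
parts = df['col'].str.split(' ', expand=True)
parts.columns = ['id', 'month', 'day', 'time', 'size']

# ASSUMPTION: the trailing stamp of the last field starts with the year.
year = parts['size'].str.split('-').str[-1].str[:4]

# Build "2020 Apr 19 11:37" and parse it with an explicit format.
stamp = year + ' ' + parts['month'] + ' ' + parts['day'] + ' ' + parts['time']
out = pd.DataFrame({
    'id': parts['id'],
    'Datetime': pd.to_datetime(stamp, format='%Y %b %d %H:%M'),
    'Size': parts['size'],
})
print(out)
```

If the year cannot be recovered from the data, it has to be supplied some other way; the listing itself carries no year.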

Related

How to update column value based on another value in pandas

I have the below data frame
A B
Jan 10
Feb 20
Mar 30
Apr 20
Required output: I want to look up the value of B where A is Mar, add that value to the remaining B values, and drop the Mar row, updating the dataframe using pandas.
A B
Jan 40
Feb 50
Apr 50
You can do it in one line, pandas-style, using set_index():
df = df.set_index('A').pipe(lambda x: x.assign(B=x['B'] + x.loc['Mar', 'B'])).drop('Mar').reset_index()
Output:
>>> df
A B
0 Jan 40
1 Feb 50
2 Apr 50
Or in multiple lines (not so pandas-style):
df['B'] += df.loc[df['A'] == 'Mar', 'B'].iloc[0]
df = df[df['A'] != 'Mar']
Or a third and slightly shorter way:
tmp = df.set_index('A').T
df = (tmp.pop('Mar').iloc[0] + tmp.T['B']).reset_index()
You can find the value corresponding to 'Mar', add that value to the rest of the df, then drop the row containing 'Mar':
df.loc[df['A'] != 'Mar','B'] += df.loc[df['A'] == 'Mar', 'B'].values
df = df[df['A'] != 'Mar']
Result:
>>> df
A B
0 Jan 40
1 Feb 50
3 Apr 50
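The approaches above all assume that exactly one row has A equal to 'Mar'. Here is a minimal, self-contained sketch that makes that assumption explicit. The helper name fold_row and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Jan', 'Feb', 'Mar', 'Apr'], 'B': [10, 20, 30, 20]})

def fold_row(frame, key, label_col='A', value_col='B'):
    """Add the value of the row labelled `key` to every other row, then drop that row."""
    mask = frame[label_col] == key
    if mask.sum() != 1:
        # Hypothetical guard: refuse to guess when the key is missing or repeated.
        raise ValueError(f"expected exactly one row with {label_col} == {key!r}")
    addend = frame.loc[mask, value_col].iloc[0]
    out = frame.loc[~mask].copy()
    out[value_col] = out[value_col] + addend
    return out.reset_index(drop=True)

result = fold_row(df, 'Mar')
print(result)
```

The function returns a new frame with a fresh 0..n-1 index and leaves the input untouched.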

Pandas: How to create DateTime index

I have a pandas DataFrame like this:
year month count
0 2014 Jan 12
1 2014 Feb 10
2 2015 Jan 12
3 2015 Feb 10
How can I create a DateTime index from 'year' and 'month', so that the result would be:
count
2014.01.31 12
2014.02.28 10
2015.01.31 12
2015.02.28 10
Use to_datetime with DataFrame.pop to consume and remove the columns, then add offsets.MonthEnd:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.offsets.MonthEnd()
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
Or:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.to_timedelta(dates.dt.daysinmonth - 1, unit='d')
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
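For reference, here is a self-contained sketch of the same idea that leaves the original frame intact. It uses MonthEnd(0) rather than MonthEnd(): with the default n=1 a date that already sits on a month end is moved to the next month end, while n=0 rolls forward only when needed. For the first-of-month dates here both give the same result. The names dates and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [2014, 2014, 2015, 2015],
                   'month': ['Jan', 'Feb', 'Jan', 'Feb'],
                   'count': [12, 10, 12, 10]})

# "2014" + "Jan" -> "2014Jan", parsed as the first day of that month.
dates = pd.to_datetime(df['year'].astype(str) + df['month'], format='%Y%b')

# Keep only the data column and index it by the month-end dates.
result = df[['count']].copy()
result.index = pd.DatetimeIndex(dates + pd.offsets.MonthEnd(0))
print(result)
```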

How to run a script on x axis of plots in matplotlib [duplicate]

I want to transform an integer between 1 and 12 into an abbreviated month name.
I have a df which looks like:
client Month
1 sss 02
2 yyy 12
3 www 06
I want the df to look like this:
client Month
1 sss Feb
2 yyy Dec
3 www Jun
Most of the info I found did not cover Python, pandas and DataFrames specifically, hence the question.
You can do this efficiently by combining calendar.month_abbr and df[col].apply():
import calendar
df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[int(x)])  # int() also handles zero-padded strings such as '02'
Since the abbreviated month names are the first three letters of their full names, we can first convert the Month column to datetime, then use dt.month_name() to get the full month name, and finally use the str.slice() method to take the first three letters, all in pandas and in a single line of code:
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name().str.slice(stop=3)
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
The calendar module is useful, but calendar.month_abbr is array-like: it cannot be used directly in a vectorised fashion. For an efficient mapping, you can construct a dictionary and then use pd.Series.map:
import calendar
d = dict(enumerate(calendar.month_abbr))
df['Month'] = df['Month'].map(d)
Performance benchmarking shows a ~130x performance differential:
import calendar
import numpy as np
import pandas as pd
d = dict(enumerate(calendar.month_abbr))
mapper = calendar.month_abbr.__getitem__
np.random.seed(0)
n = 10**5
df = pd.DataFrame({'A': np.random.randint(1, 13, n)})
%timeit df['A'].map(d) # 7.29 ms per loop
%timeit df['A'].map(mapper) # 946 ms per loop
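Applied to the frame from the question, the dictionary mapping might look like the sketch below. It assumes Month holds zero-padded strings such as '02', as the question's display suggests, so the column is cast to int before mapping. The name abbr is illustrative.

```python
import calendar
import pandas as pd

df = pd.DataFrame({'client': ['sss', 'yyy', 'www'],
                   'Month': ['02', '12', '06']})

# {1: 'Jan', ..., 12: 'Dec'}; index 0 of calendar.month_abbr is an empty string, so skip it.
abbr = {i: calendar.month_abbr[i] for i in range(1, 13)}

# ASSUMPTION: Month is stored as text; astype(int) turns '02' into 2 before the lookup.
df['Month'] = df['Month'].astype(int).map(abbr)
print(df)
```

Values outside 1 to 12 would come out as NaN rather than raising, which is the usual behaviour of Series.map with a dict.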
Solution 1: One liner
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.strftime('%b')
Solution 2: Using apply() (this assumes the Month column already holds datetime values)
def mapper(month):
    return month.strftime('%b')
df['Month'] = df['Month'].apply(mapper)
Reference:
http://strftime.org/
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
Using datetime object methods
I'm surprised none of the answers here uses strftime.
Note that you need a valid datetime column before using the strftime method; use pd.to_datetime(df['date_column']) to cast your target column to datetime.
import pandas as pd
dates = pd.date_range('01-Jan 2020','01-Jan 2021',freq='M')
df = pd.DataFrame({'dates' : dates})
df['month_name'] = df['dates'].dt.strftime('%b')
dates month_name
0 2020-01-31 Jan
1 2020-02-29 Feb
2 2020-03-31 Mar
3 2020-04-30 Apr
4 2020-05-31 May
5 2020-06-30 Jun
6 2020-07-31 Jul
7 2020-08-31 Aug
8 2020-09-30 Sep
9 2020-10-31 Oct
10 2020-11-30 Nov
11 2020-12-31 Dec
Another method would be to slice the name returned by dt.month_name():
df['month_name_str_slice'] = df['dates'].dt.month_name().str[:3]
dates month_name month_name_str_slice
0 2020-01-31 Jan Jan
1 2020-02-29 Feb Feb
2 2020-03-31 Mar Mar
3 2020-04-30 Apr Apr
4 2020-05-31 May May
5 2020-06-30 Jun Jun
6 2020-07-31 Jul Jul
7 2020-08-31 Aug Aug
8 2020-09-30 Sep Sep
9 2020-10-31 Oct Oct
10 2020-11-30 Nov Nov
11 2020-12-31 Dec Dec
You can do this easily with a column apply.
import pandas as pd
df = pd.DataFrame({'client':['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
look_up = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May',
'06': 'Jun', '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Month'] = df['Month'].apply(lambda x: look_up[x])
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
One way of doing that is with the apply method of the column but, to do that, you need a mapping to convert the months. You could build it either with a function or dictionary, or with Python's own datetime.
With datetime it would be something like:
import datetime
def mapper(month):
    date = datetime.datetime(2000, int(month), 1)  # any date object with the proper month will do
    return date.strftime('%b')  # %b returns the month's abbreviation; see the strftime documentation for other codes
df['Month'] = df['Month'].apply(mapper)
In a similar way, you could build your own map for custom names. It would look like this:
months_map = {1: 'Jan', 2: 'Feb'}  # ...and so on through 12; keys written as 01 are a syntax error in Python 3
def mapper(month):
    return months_map[int(month)]
Obviously, you don't need to define these functions explicitly and could use a lambda directly in the apply method.
For the reverse direction (abbreviated month name to month number), use strptime with a lambda function:
from time import strptime
df['Month'] = df['Month'].apply(lambda x: strptime(x,'%b').tm_mon)
Suppose we have a DataFrame like this, whose date index is already in datetime format:
df.head(3)
value
date
2016-05-19 19736
2016-05-26 18060
2016-05-27 19997
Then we can easily extract the month number and month name like this:
df['month_num'] = df.index.month
df['month'] = df.index.month_name()
value year month_num month
date
2017-01-06 37353 2017 1 January
2019-01-06 94108 2019 1 January
2019-01-05 77897 2019 1 January
2019-01-04 94514 2019 1 January
Having tested all of these on a large dataset, I have found the following to be fastest:
import calendar
def month_mapping():
    # I'm lazy so I have a stash of functions already written so
    # I don't have to write them out every time. This returns the
    # {1:'Jan'....12:'Dec'} dict in the laziest way...
    abbrevs = {}
    for month in range(1, 13):
        abbrevs[month] = calendar.month_abbr[month]
    return abbrevs
abbrevs = month_mapping()
df['Month Abbrev'] = df['Date Col'].dt.month.map(abbrevs)
You can use the pandas month_name() function. Example:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')
For more detail, see the pandas documentation for month_name().
The best way would be to use month_name(), as commented by Nurul Akter Towhid (this requires Month to be a datetime column):
df['Month'] = df.Month.dt.month_name()
First, strip the leading "0" from the month string and convert it to an integer. (Writing a literal such as 02 in Python 3 source raises "leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers".)
Step 1:
def func(i):
    # '02' -> 2; int() ignores the leading zero
    return int(i)
df["Month"] = df["Month"].apply(func)
Step 2:
import calendar
# month_name is indexable, not callable
df["Month"] = df["Month"].apply(lambda x: calendar.month_name[x])

When converting Date entry in pandas, how to ignore empty values?

I have an Excel sheet with some date entries, as below.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
1 Closed 2/1/2017 7:15 AM 2/1/2017 10:44 AM 2/21/2017 11:50 AM
2 Assigned 2/2/2017 2:09 PM
3 Resolved 2/8/2017 10:32 AM 9/11/2017 8:49 PM
4 Closed 8/27/2018 6:00 AM 10/15/2018 9:10 AM 10/15/2018 9:10 AM
5 Resolved 12/26/2018 3:25 PM 2/11/2019 9:08 AM
Initially I'm converting them from the above pattern to $year-$mm-$dd.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
1 Closed 2017-02-01 2017-02-01 2017-02-21
2 Assigned 2017-02-02 NaN NaN
3 Resolved 2017-02-08 2017-09-11 NaN
4 Closed 2018-08-27 2018-10-15 2018-10-15
5 Resolved 2018-12-26 2019-02-11 NaN
With these converted dates, I'm trying to extract the month and year in the format $mon $year.
I'm using the following code to extract the month and year.
df['Month Opened'] = pd.to_datetime(df["Date/Time Opened"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
When I apply this formula to 'Date/Time Opened', I can see that it works, as below.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed Month Opened
1 Closed 2017-02-01 2017-02-01 2017-02-21 Feb 2017
2 Assigned 2017-02-02 NaN NaN Feb 2017
3 Resolved 2017-02-08 2017-09-11 NaN Feb 2017
4 Closed 2018-08-27 2018-10-15 2018-10-15 Aug 2018
5 Resolved 2018-12-26 2019-02-11 NaN Dec 2018
Here's my complete code - http://tpcg.io/X5S8Pe
import pandas as pd
import calendar
CaseDetails = {
'Case Number': [1, 2, 3, 4, 5],
'Status': ['Closed', 'Assigned', 'Resolved', 'Closed', 'Resolved'],
'Date/Time Opened': ['2/1/2017 7:15 AM', '2/2/2017 2:09 PM', '2/8/2017 10:32 AM', '8/27/2018 6:00 AM', '12/26/2018 3:25 PM'],
'Date/Time Resolved': ['2/1/2017 10:44 AM', '', '9/11/2017 8:49 PM', '10/15/2018 9:10 AM', '2/11/2019 9:08 AM'],
'Date/Time Closed': ['2/21/2017 11:50 AM', '', '', '10/15/2018 9:10 AM', '']
}
df = pd.DataFrame(CaseDetails,columns= ['Case Number', 'Status', 'Date/Time Opened', 'Date/Time Resolved', 'Date/Time Closed'])
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).dt.date
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).dt.date
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).dt.date
print (df)
df['Month Opened'] = pd.to_datetime(df["Date/Time Opened"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
print (df)
As expected, my code converted the entries under 'Date/Time Opened' to the desired format. When I tried converting the other two date columns, I got the following error.
Traceback (most recent call last):
File "main.py", line 21, in <module>
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
File "/usr/lib64/python2.7/site-packages/pandas/core/series.py", line 2158, in map
new_values = map_f(values, arg)
File "pandas/_libs/src/inference.pyx", line 1569, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)
File "main.py", line 21, in <lambda>
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
File "/usr/lib64/python2.7/calendar.py", line 56, in __getitem__
funcs = self._months[i]
TypeError: list indices must be integers, not float
Is there a way to convert the columns with empty values?
It is possible to use Series.dt.strftime here; it works nicely with missing values:
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).dt.strftime('%b %Y')
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).dt.strftime('%b %Y')
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).dt.strftime('%b %Y')
An alternative is to use apply with a list of columns:
cols = ['Date/Time Opened','Date/Time Resolved','Date/Time Closed']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x).dt.strftime('%b %Y'))
print (df)
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
0 1 Closed Feb 2017 Feb 2017 Feb 2017
1 2 Assigned Feb 2017 NaT NaT
2 3 Resolved Feb 2017 Sep 2017 NaT
3 4 Closed Aug 2018 Oct 2018 Oct 2018
4 5 Resolved Dec 2018 Feb 2019 NaT
Your own solution can also be made to work with a neat trick: a missing value (NaN or NaT) is not equal to itself, so np.nan != np.nan. Add an if-else to your function:
import numpy as np
f = lambda x: calendar.month_abbr[x.month] + " " + str(x.year) if x == x else np.nan
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).map(f)
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).map(f)
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).map(f)
print (df)
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
0 1 Closed Feb 2017 Feb 2017 Feb 2017
1 2 Assigned Feb 2017 NaN NaN
2 3 Resolved Feb 2017 Sep 2017 NaN
3 4 Closed Aug 2018 Oct 2018 Oct 2018
4 5 Resolved Dec 2018 Feb 2019 NaN
Or, alternatively:
f = lambda x: calendar.month_abbr[x.month] + " " + str(x.year) if x == x else np.nan
cols = ['Date/Time Opened','Date/Time Resolved','Date/Time Closed']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x).map(f))
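To see the missing-value handling in isolation, here is a small sketch using only the 'Date/Time Closed' values from the question. Passing an explicit format is an addition of this sketch, not something the answers above require; the variable name closed is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Case Number': [1, 2, 3],
    'Date/Time Closed': ['2/21/2017 11:50 AM', '', ''],
})

# Empty strings are parsed as NaT instead of raising.
closed = pd.to_datetime(df['Date/Time Closed'], format='%m/%d/%Y %I:%M %p')

# strftime passes missing values through as NaN, so no custom lambda is needed.
df['Month Closed'] = closed.dt.strftime('%b %Y')
print(df)
```

Writing the result to a new column keeps the original text around, which can help when checking why a particular entry failed to parse.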

How to combine multiple columns in a Data Frame to Pandas datetime format

I have a pandas data frame with values as below
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
Is there any way I can combine the Date, Month, Year and Time columns into a pandas datetime column?
Use to_datetime, which can assemble a datetime from Day, Month and Year columns automatically, and add the times converted with to_timedelta:
df['Datetime'] = pd.to_datetime(df.rename(columns={'Date':'Day'})[['Day','Month','Year']]) + \
pd.to_timedelta(df['Time'])
Other solutions join all the columns, converted to strings, first:
df['Datetime'] = pd.to_datetime(df[['Date','Month','Year', 'Time']]
.astype(str).apply(' '.join, 1), format='%d %m %Y %H:%M:%S')
df['Datetime'] = (pd.to_datetime(df['Year'].astype(str) + '-' +
df['Month'].astype(str) + '-' +
df['Date'].astype(str) + ' ' +
df['Time']))
print (df)
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Finally, if you need to remove these columns:
df = df.drop(['Date','Month','Year', 'Time'], axis=1)
print (df)
ProcessID1 UserID Datetime
0 248 Tony 2017-04-29 23:30:56
1 436 Jeff 2017-04-28 20:02:19
2 500 Greg 2017-05-04 11:48:29
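A self-contained version of the first approach, for anyone who wants to run it directly. The sample frame is rebuilt from the question; the intermediate name ymd is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ProcessID1': [248, 436, 500],
    'UserID': ['Tony', 'Jeff', 'Greg'],
    'Date': [29, 28, 4],
    'Month': [4, 4, 5],
    'Year': [2017, 2017, 2017],
    'Time': ['23:30:56', '20:02:19', '11:48:29'],
})

# to_datetime can assemble dates from columns named year, month and day,
# so the question's 'Date' column is renamed to 'day' first.
ymd = df.rename(columns={'Date': 'day', 'Month': 'month', 'Year': 'year'})[['year', 'month', 'day']]

# The time of day is added as a timedelta on top of the midnight timestamps.
df['Datetime'] = pd.to_datetime(ymd) + pd.to_timedelta(df['Time'])
print(df[['ProcessID1', 'UserID', 'Datetime']])
```

This avoids string concatenation entirely, so there is no format string to keep in sync with the data.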
Concatenate the columns together to a string format and use pd.to_datetime to convert to datetime.
import pandas as pd
import io
txt = """
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
"""
df = pd.read_csv(io.StringIO(txt), sep=r"[\t ,]+", engine="python")  # regex separators need the python engine
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) \
+ '-' + df['Month'].astype(str) \
+ '-' + df['Year'].astype(str) \
+ ' ' + df['Time'],
format='%d-%m-%Y %H:%M:%S')
df
You can also do this by using the apply() method:
import pandas as pd
df['Datetime']=df[['Year','Month','Date']].astype(str).apply('-'.join,1)+' '+df['Time']
Finally, convert 'Datetime' to datetime dtype by using the pandas to_datetime() method:
df['Datetime']=pd.to_datetime(df['Datetime'])
Output of df:
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Now, if you want to remove the 'Date', 'Month', 'Year' and 'Time' columns, use:
df=df.drop(columns=['Date','Month','Year', 'Time'])
