Sorting datetime column when month is formatted as name - python-3.x

I have a column Date Time in my dataframe which has the date and time as a string.
Product Date Time
ABC Wed Jan 2 11:14:24 2019
ABC123 Wed Jan 2 11:14:24 2019
ABCXZY Wed Jan 2 11:14:24 2019
BVF123 Mon Jan 14 10:24:20 2019
ABC456 Mon Jan 14 10:24:20 2019
ABC000 Mon Feb 4 10:44:08 2019
ABCXYZ Mon Feb 4 10:44:08 2019
ABC678 Mon Feb 4 10:44:08 2019
ABCQYZ Wed Feb 20 09:14:40 2019
ABC090 Wed Feb 20 09:14:40 2019
I have converted this column to a datetime format using -
df['Date'] = pd.to_datetime(df['Date Time']).dt.strftime('%d-%b-%Y')
I now want to sort this dataframe by the Date column so that I can plot the quantities for each date in ascending date order, but when I use -
df.sort_values(by='Date', inplace=True, ascending=True)
it only sorts by the day of the month and ignores the month name, i.e. as
02-Jan-2019
04-Feb-2019
08-Mar-2019
13-Feb-2019
14-Jan-2019
20-Feb-2019
21-Mar-2019
instead of
02-Jan-2019
14-Jan-2019
04-Feb-2019
13-Feb-2019
20-Feb-2019
08-Mar-2019
21-Mar-2019
How can I get the desired sorting using pandas datetime or any other module?

pd.to_datetime(df['Date Time']).dt.strftime('%d-%b-%Y')
returns a series of strings ("object" dtype, to be precise), not a series of datetimes. That's why your sorting is wrong.
Here is code to do it:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
### Dataframe
data = {'Product' : ['ABC', 'ABC123', 'ABCXZY', 'BVF123', 'ABC456', 'ABC000', 'ABCXYZ', 'ABC678', 'ABCQYZ', 'ABC090'], 'Date Time' : ['Wed Jan 2 11:14:24 2019', 'Wed Jan 2 11:14:24 2019', 'Wed Jan 2 11:14:24 2019', 'Mon Jan 14 10:24:20 2019', 'Mon Jan 14 10:24:20 2019', 'Mon Feb 4 10:44:08 2019', 'Mon Feb 4 10:44:08 2019', 'Mon Feb 4 10:44:08 2019', 'Wed Feb 20 09:14:40 2019', 'Wed Feb 20 09:14:40 2019']}
df = pd.DataFrame(data)
### Conversion to datetime
df['Date'] = pd.to_datetime(df.loc[:, 'Date Time'])
### Sorting
df.sort_values(by = 'Date', inplace = True)
### Plot
ax = df.groupby('Date').count().Product.plot()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d-%b-%Y')) # Formatting x labels
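If the '%d-%b-%Y' string from the question is still wanted for display, a small follow-up sketch (reusing the df built above) keeps the real datetime column for sorting and derives the label separately:
# Keep the true datetime for sorting; format only for display (sketch)
df['Date_label'] = df['Date'].dt.strftime('%d-%b-%Y')
print(df[['Product', 'Date_label']])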

We can use argsort (note: in this example the 'Date Time' column already holds '%d-%b-%Y' strings):
df=df.iloc[pd.to_datetime(df['Date Time'],format='%d-%b-%Y').argsort()]
Out[20]:
Date Time
3 14-Jan-2019
0 04-Feb-2019
2 13-Feb-2019
4 20-Feb-2019
1 08-Mar-2019
5 21-Mar-2019
Update
s=df.groupby(['Date Time']).size()
s.index=pd.to_datetime(s.index,format='%d-%b-%Y')
s.sort_index(inplace=True)
s.index=s.index.strftime('%d-%b-%Y')
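The same argsort idea also works on the question's original 'Wed Jan 2 11:14:24 2019' strings if you let pd.to_datetime infer the format; a minimal sketch using a few rows reconstructed from the question:
import pandas as pd
# Sample rows reconstructed from the question for illustration
df = pd.DataFrame({'Product': ['BVF123', 'ABC', 'ABC000'],
                   'Date Time': ['Mon Jan 14 10:24:20 2019',
                                 'Wed Jan 2 11:14:24 2019',
                                 'Mon Feb 4 10:44:08 2019']})
df = df.iloc[pd.to_datetime(df['Date Time']).argsort()]  # chronological row order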

Related

Pandas: How to create DateTime index

There is Pandas Dataframe as:
year month count
0 2014 Jan 12
1 2014 Feb 10
2 2015 Jan 12
3 2015 Feb 10
How to create DateTime index from 'year' and 'month',so result would be :
count
2014.01.31 12
2014.02.28 10
2015.01.31 12
2015.02.28 10
Use to_datetime with DataFrame.pop to consume and remove the columns, then add offsets.MonthEnd:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.offsets.MonthEnd()
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
Or:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.to_timedelta(dates.dt.daysinmonth - 1, unit='d')
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
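For completeness, a quick end-to-end sketch that reconstructs the sample frame from the question and applies the first approach:
import pandas as pd
# Sample frame reconstructed from the question for illustration
df = pd.DataFrame({'year': [2014, 2014, 2015, 2015],
                   'month': ['Jan', 'Feb', 'Jan', 'Feb'],
                   'count': [12, 10, 12, 10]})
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.offsets.MonthEnd()  # roll each first-of-month to month end
print(df)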

How to split rows into columns in a dataframe

I have this dataframe (with dimensions 840 rows x 1 column):
0 151284 Apr 19 11:37 0-01-20200419063614
1 48054 Apr 21 12:50 0-01-20200421074934
2 187588 Apr 21 13:55 0-01-20200421085439
3 51584 Apr 21 14:37 0-01-20200421143636
4 63522 Apr 22 08:40 0-01-20200422083937
I want to convert this dataframe into a format like this:
id datetime size
151284 2020-04-19 11:37:00 0-01-20200419063614
. . .
datetime being in the format: (yyyy-mm-dd)(hr-min-sec). So basically splitting a single column into three columns and also combining date and time into a single datetime column in a standard format.
Any help is appreciated.
EDIT: output of df.columns: Index(['col'], dtype='object')
Like this:
In [70]: df = pd.DataFrame({'col':['151284 Apr 19 11:37 0-01-20200419063614', '48054 Apr 21 12:50 0-01-20200421074934', '187588 Apr 21 13:55 0-01-20200421085439', '51584 Apr 21 14:37 0-01-20200421143636',
...: '63522 Apr 22 08:40 0-01-20200422083937']})
In [54]: df['id'] = df.col.str.split(' ').str[0]
In [55]: df['Datetime'] = df.col.str.split(' ').str[1] + ' ' + df.col.str.split(' ').str[2] + ' ' + df.col.str.split(' ').str[3]
In [57]: df['Size'] = df.col.str.split(' ').str[-1]
In [63]: from dateutil import parser
In [65]: def format_datetime(x):
...: return parser.parse(x)
...:
In [67]: df['Datetime'] = df.Datetime.apply(format_datetime)
In [79]: df
Out[79]:
id Datetime Size
0 151284 2020-04-19 11:37:00 0-01-20200419063614
1 48054 2020-04-21 12:50:00 0-01-20200421074934
2 187588 2020-04-21 13:55:00 0-01-20200421085439
3 51584 2020-04-21 14:37:00 0-01-20200421143636
4 63522 2020-04-22 08:40:00 0-01-20200422083937
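A slightly more compact sketch of the same idea splits once with expand=True instead of repeated str.split calls; since the raw text contains no year, one is assumed here (2020) so the timestamps can be parsed with an explicit format:
import pandas as pd
df = pd.DataFrame({'col': ['151284 Apr 19 11:37 0-01-20200419063614',
                           '48054 Apr 21 12:50 0-01-20200421074934']})
parts = df['col'].str.split(' ', expand=True)  # one column per token
out = pd.DataFrame({
    'id': parts[0],
    'datetime': pd.to_datetime(parts[1] + ' ' + parts[2] + ' ' + parts[3] + ' 2020',
                               format='%b %d %H:%M %Y'),  # assumed year
    'size': parts[4],
})
print(out)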

Convert column values into rows in the order in which columns are present

Below is a sample dataframe I have. I need to convert each row into multiple rows based on month.
df = pd.DataFrame({'Jan': [100,200,300],
'Feb': [400,500,600],
'March':[700,800,900],
})
Desired output :
Jan 100
Feb 400
March 700
Jan 200
Feb 500
March 800
Jan 300
Feb 600
March 900
I tried using the pandas melt function, but it groups all the Jan values together, then Feb, then March: 3 rows for Jan, then 3 for Feb, then 3 for March. I want to achieve the interleaved output above instead. Could someone please help?
Use DataFrame.stack with some cleaning via Series.reset_index and Series.rename_axis:
df1 = (df.stack()
.reset_index(level=0, drop=True)
.rename_axis('months')
.reset_index(name='val'))
Or use numpy - flatten the values and repeat the column names with numpy.tile:
import numpy as np
df1 = pd.DataFrame({'months': np.tile(df.columns, len(df)),
'val': df.values.reshape(1,-1).ravel()})
print (df1)
months val
0 Jan 100
1 Feb 400
2 March 700
3 Jan 200
4 Feb 500
5 March 800
6 Jan 300
7 Feb 600
8 March 900
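Since the question specifically mentions melt: a sketch showing that melt can still produce the interleaved order if you keep the original row index and do a stable sort on it afterwards:
import pandas as pd
df = pd.DataFrame({'Jan': [100, 200, 300],
                   'Feb': [400, 500, 600],
                   'March': [700, 800, 900]})
df1 = (df.reset_index()
         .melt(id_vars='index', var_name='months', value_name='val')
         .sort_values('index', kind='mergesort')  # stable sort keeps column order within each row
         .drop(columns='index')
         .reset_index(drop=True))
print(df1)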

How to run a script on x axis of plots in matplotlib [duplicate]

I want to transform an integer between 1 and 12 into an abbreviated month name.
I have a df which looks like:
client Month
1 sss 02
2 yyy 12
3 www 06
I want the df to look like this:
client Month
1 sss Feb
2 yyy Dec
3 www Jun
Most of the info I found was not in python>pandas>dataframe hence the question.
You can do this efficiently by combining calendar.month_abbr and df[col].apply():
import calendar
df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[x])
Since the abbreviated month names are the first three letters of the full names, we can first convert the Month column to datetime, then use dt.month_name() to get the full month name, and finally use str.slice() to keep the first three letters, all using pandas and in one line of code:
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name().str.slice(stop=3)
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
The calendar module is useful, but calendar.month_abbr is array-like: it cannot be used directly in a vectorised fashion. For an efficient mapping, you can construct a dictionary and then use pd.Series.map:
import calendar
d = dict(enumerate(calendar.month_abbr))
df['Month'] = df['Month'].map(d)
Performance benchmarking shows a ~130x performance differential:
import calendar
d = dict(enumerate(calendar.month_abbr))
mapper = calendar.month_abbr.__getitem__
np.random.seed(0)
n = 10**5
df = pd.DataFrame({'A': np.random.randint(1, 13, n)})
%timeit df['A'].map(d) # 7.29 ms per loop
%timeit df['A'].map(mapper) # 946 ms per loop
Solution 1: One liner
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.strftime('%b')
Solution 2: Using apply()
def mapper(month):
    return month.strftime('%b')
df['Month'] = df['Month'].apply(mapper)
Reference:
http://strftime.org/
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
using datetime object methods
I'm surprised this question doesn't already have an answer using strftime.
Note: you'll need a valid datetime object before using the strftime method; use pd.to_datetime(df['date_column']) to cast your target column to datetime.
import pandas as pd
dates = pd.date_range('01-Jan 2020','01-Jan 2021',freq='M')
df = pd.DataFrame({'dates' : dates})
df['month_name'] = df['dates'].dt.strftime('%b')
dates month_name
0 2020-01-31 Jan
1 2020-02-29 Feb
2 2020-03-31 Mar
3 2020-04-30 Apr
4 2020-05-31 May
5 2020-06-30 Jun
6 2020-07-31 Jul
7 2020-08-31 Aug
8 2020-09-30 Sep
9 2020-10-31 Oct
10 2020-11-30 Nov
11 2020-12-31 Dec
Another method would be to slice the name using dt.month_name():
df['month_name_str_slice'] = df['dates'].dt.month_name().str[:3]
dates month_name month_name_str_slice
0 2020-01-31 Jan Jan
1 2020-02-29 Feb Feb
2 2020-03-31 Mar Mar
3 2020-04-30 Apr Apr
4 2020-05-31 May May
5 2020-06-30 Jun Jun
6 2020-07-31 Jul Jul
7 2020-08-31 Aug Aug
8 2020-09-30 Sep Sep
9 2020-10-31 Oct Oct
10 2020-11-30 Nov Nov
11 2020-12-31 Dec Dec
You can do this easily with a column apply.
import pandas as pd
df = pd.DataFrame({'client':['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
look_up = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May',
'06': 'Jun', '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Month'] = df['Month'].apply(lambda x: look_up[x])
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
One way of doing that is with the apply method on the dataframe, but to do that you need a map to convert the months. You could do that with a function / dictionary, or with Python's own datetime module.
With the datetime it would be something like:
import datetime
def mapper(month):
    date = datetime.datetime(2000, month, 1)  # you need a date object with the proper month
    return date.strftime('%b')  # %b gives the month's abbreviation; see strftime.org for other options
df['Month'].apply(mapper)
In a similar way, you could build your own map for custom names. It would look like this:
months_map = {1: 'Jan', 2: 'Feb'}
def mapper(month):
    return months_map[month]
Obviously, you don't need to define these functions explicitly and could use a lambda directly in the apply method.
Use strptime and a lambda function for this:
from time import strptime
df['Month'] = df['Month'].apply(lambda x: strptime(x,'%b').tm_mon)
Suppose we have a DF like this, and Date is already in DateTime Format:
df.head(3)
value
date
2016-05-19 19736
2016-05-26 18060
2016-05-27 19997
Then we can extract month number and month name easily like this :
df['month_num'] = df.index.month
df['month'] = df.index.month_name()
value year month_num month
date
2017-01-06 37353 2017 1 January
2019-01-06 94108 2019 1 January
2019-01-05 77897 2019 1 January
2019-01-04 94514 2019 1 January
Having tested all of these on a large dataset, I have found the following to be fastest:
import calendar
def month_mapping():
    # I'm lazy so I have a stash of functions already written so
    # I don't have to write them out every time. This returns the
    # {1: 'Jan', ..., 12: 'Dec'} dict in the laziest way...
    abbrevs = {}
    for month in range(1, 13):
        abbrevs[month] = calendar.month_abbr[month]
    return abbrevs
abbrevs = month_mapping()
df['Month Abbrev'] = df['Date Col'].dt.month.map(abbrevs)
You can use Pandas month_name() function. Example:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')
For more detail, see the pandas documentation for month_name().
The best way would be to use month_name(), as commented by Nurul Akter Towhid:
df['Month'] = df.Month.dt.month_name()
First you need to strip the leading "0" (otherwise you might get the exception "leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers").
Step 1:
def func(i):
    if i[0] == '0':
        i = i[1]
    return i
df["Month"] = df["Month"].apply(lambda x: func(x))
Step 2:
df["Month"] = df["Month"].apply(lambda x: calendar.month_name[int(x)])

Aggregate time series with group by and create chart with multiple series

I have time series data and I want to create a chart of the monthly (x-axis) counts of the number of records (a line chart), grouped by sentiment (multiple lines).
Data looks like this
created_at id polarity sentiment
0 Fri Nov 02 11:22:47 +0000 2018 1058318498663870464 0.000000 neutral
1 Fri Nov 02 11:20:54 +0000 2018 1058318026758598656 0.011905 neutral
2 Fri Nov 02 09:41:37 +0000 2018 1058293038739607552 0.800000 positive
3 Fri Nov 02 09:40:48 +0000 2018 1058292834699231233 0.800000 positive
4 Thu Nov 01 18:23:17 +0000 2018 1058061933243518976 0.233333 neutral
5 Thu Nov 01 17:50:39 +0000 2018 1058053723157618690 0.400000 positive
6 Wed Oct 31 18:57:53 +0000 2018 1057708251758903296 0.566667 positive
7 Sun Oct 28 17:21:24 +0000 2018 1056596810570100736 0.000000 neutral
8 Sun Oct 21 13:00:53 +0000 2018 1053994531845296128 0.136364 neutral
9 Sun Oct 21 12:55:12 +0000 2018 1053993101205868544 0.083333 neutral
So far I have managed to aggregate to the monthly totals, with the following code:
import pandas as pd
tweets = process_twitter_json(file_name)
#print(tweets[:10])
df = pd.DataFrame.from_records(tweets)
print(df.head(10))
#make the string date into a date field
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
#print('Monthly counts')
monthly_sentiment = df.groupby('sentiment')['tweet_datetime'].resample('M').count()
I'm struggling with how to chart the data:
Do I pivot to turn each of the discrete values within the sentiment field into separate columns?
I've tried .unstack(), which turns the sentiment values into rows and is almost there, but the dates become string column headers, which is no good for charting.
OK, I changed the monthly aggregation method and used Grouper instead of resample. This meant that when I did the unstack() the resulting dataframe was vertical (deep and narrow), with dates as rows rather than horizontal with dates as column headers, so I no longer had issues with dates being stored as strings when I came to chart it.
Full code:
import pandas as pd
tweets = process_twitter_json(file_name)
df = pd.DataFrame.from_records(tweets)
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
grouper = df.groupby(['sentiment', pd.Grouper(key='tweet_datetime', freq='M')]).id.count()
result = grouper.unstack('sentiment').fillna(0)
##=================================================
##PLOTLY - charts in Jupyter
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
print (__version__)# requires version >= 1.9.0
import plotly.graph_objs as go
init_notebook_mode(connected=True)
trace0 = go.Scatter(
x = result.index,
y = result['positive'],
name = 'Positive',
line = dict(
color = ('rgb(205, 12, 24)'),
width = 4)
)
trace1 = go.Scatter(
x = result.index,
y = result['negative'],
name = 'Negative',
line = dict(
color = ('rgb(22, 96, 167)'),
width = 4)
)
trace2 = go.Scatter(
x = result.index,
y = result['neutral'],
name = 'Neutral',
line = dict(
color = ('rgb(12, 205, 24)'),
width = 4)
)
data = [trace0, trace1, trace2]
iplot(data)
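If the interactive plotly output is not required, a simpler sketch (assuming the same result frame built above) uses pandas' built-in plotting, which draws one line per sentiment column with the datetime index on the x-axis:
import matplotlib.pyplot as plt
ax = result.plot(figsize=(10, 5))  # one line per sentiment column
ax.set_xlabel('Month')
ax.set_ylabel('Tweet count')
plt.show()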
