How to run a script on x axis of plots in matplotlib [duplicate] - python-3.x

I want to transform an integer between 1 and 12 into an abbreviated month name.
I have a df which looks like:
  client Month
1    sss    02
2    yyy    12
3    www    06
I want the df to look like this:
  client Month
1    sss   Feb
2    yyy   Dec
3    www   Jun
Most of the info I found was not in python>pandas>dataframe hence the question.

You can do this efficiently by combining calendar.month_abbr with df[col].apply():
import calendar
# month_abbr is 1-indexed (index 0 is ''), so cast to int if Month holds strings
df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[int(x)])

Since the abbreviated month names are the first three letters of their full names, we can first convert the Month column to datetime, then use dt.month_name() to get the full month name, and finally use the str.slice() method to keep the first three letters, all in pandas and in one line of code:
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name().str.slice(stop=3)
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www

The calendar module is useful, but calendar.month_abbr is array-like: it cannot be used directly in a vectorised fashion. For an efficient mapping, you can construct a dictionary and then use pd.Series.map:
import calendar
d = dict(enumerate(calendar.month_abbr))
df['Month'] = df['Month'].map(d)
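As a sanity check, a runnable sketch of the dict-based mapping (note that index 0 of month_abbr is the empty string, so the dict maps 1 through 12 to names):

```python
import calendar

import pandas as pd

# enumerate gives {0: '', 1: 'Jan', 2: 'Feb', ..., 12: 'Dec'}
d = dict(enumerate(calendar.month_abbr))
s = pd.Series([2, 12, 6]).map(d)
print(s.tolist())  # ['Feb', 'Dec', 'Jun']
```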
Performance benchmarking shows a ~130x performance differential:
import calendar
import numpy as np
import pandas as pd
d = dict(enumerate(calendar.month_abbr))
mapper = calendar.month_abbr.__getitem__
np.random.seed(0)
n = 10**5
df = pd.DataFrame({'A': np.random.randint(1, 13, n)})
%timeit df['A'].map(d) # 7.29 ms per loop
%timeit df['A'].map(mapper) # 946 ms per loop

Solution 1: One liner
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.strftime('%b')
Solution 2: Using apply() (the column must be converted to datetime first)
def mapper(month):
    return month.strftime('%b')
df['Month'] = pd.to_datetime(df['Month'], format='%m').apply(mapper)
Reference:
http://strftime.org/
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

using datetime object methods
I'm surprised these answers don't have a solution using strftime.
Note: you'll need a valid datetime column before using the strftime method; use pd.to_datetime(df['date_column']) to cast your target column to datetime.
import pandas as pd
dates = pd.date_range('01-Jan 2020','01-Jan 2021',freq='M')
df = pd.DataFrame({'dates' : dates})
df['month_name'] = df['dates'].dt.strftime('%b')
dates month_name
0 2020-01-31 Jan
1 2020-02-29 Feb
2 2020-03-31 Mar
3 2020-04-30 Apr
4 2020-05-31 May
5 2020-06-30 Jun
6 2020-07-31 Jul
7 2020-08-31 Aug
8 2020-09-30 Sep
9 2020-10-31 Oct
10 2020-11-30 Nov
11 2020-12-31 Dec
another method would be to slice the name using dt.month_name()
df['month_name_str_slice'] = df['dates'].dt.month_name().str[:3]
dates month_name month_name_str_slice
0 2020-01-31 Jan Jan
1 2020-02-29 Feb Feb
2 2020-03-31 Mar Mar
3 2020-04-30 Apr Apr
4 2020-05-31 May May
5 2020-06-30 Jun Jun
6 2020-07-31 Jul Jul
7 2020-08-31 Aug Aug
8 2020-09-30 Sep Sep
9 2020-10-31 Oct Oct
10 2020-11-30 Nov Nov
11 2020-12-31 Dec Dec

You can do this easily with a column apply.
import pandas as pd
df = pd.DataFrame({'client':['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
look_up = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May',
'06': 'Jun', '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Month'] = df['Month'].apply(lambda x: look_up[x])
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www

One way of doing this is with the DataFrame's apply method, but to do that you need a map to convert the months. You could do that with a function/dictionary or with Python's own datetime.
With the datetime it would be something like:
import datetime

def mapper(month):
    date = datetime.datetime(2000, month, 1)  # you need a date object with the proper month
    return date.strftime('%b')  # %b returns the month's abbreviation

df['Month'].apply(mapper)
In a similar way, you could build your own map for custom names. It would look like this:
months_map = {1: 'Jan', 2: 'Feb'}  # ... and so on up to 12: 'Dec'
def mapper(month):
    return months_map[month]
Obviously, you don't need to define these functions explicitly and could use a lambda directly in the apply method.

For the reverse conversion (abbreviated name back to month number), use strptime with a lambda function:
from time import strptime
df['Month'] = df['Month'].apply(lambda x: strptime(x,'%b').tm_mon)
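As a quick sketch, this enables a round trip (name to number via strptime, number back to name via calendar), assuming English abbreviations in the default locale:

```python
import calendar
from time import strptime

num = strptime('Feb', '%b').tm_mon   # name -> number: 2
abbr = calendar.month_abbr[num]      # number -> name: 'Feb'
```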

Suppose we have a DF like this, and Date is already in DateTime Format:
df.head(3)
value
date
2016-05-19 19736
2016-05-26 18060
2016-05-27 19997
Then we can extract month number and month name easily like this :
df['month_num'] = df.index.month
df['month'] = df.index.month_name()
value year month_num month
date
2017-01-06 37353 2017 1 January
2019-01-06 94108 2019 1 January
2019-01-05 77897 2019 1 January
2019-01-04 94514 2019 1 January

Having tested all of these on a large dataset, I have found the following to be fastest:
import calendar
def month_mapping():
    # I'm lazy so I have a stash of functions already written; this
    # returns the {1: 'Jan', ..., 12: 'Dec'} dict in the laziest way...
    abbrevs = {}
    for month in range(1, 13):
        abbrevs[month] = calendar.month_abbr[month]
    return abbrevs

abbrevs = month_mapping()
df['Month Abbrev'] = df['Date Col'].dt.month.map(abbrevs)

You can use Pandas month_name() function. Example:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')

The best way would be to use month_name(), as commented by Nurul Akter Towhid (the Month column must be datetime first):
df['Month'] = df.Month.dt.month_name()
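A minimal sketch, assuming Month starts as zero-padded strings and therefore has to be cast to datetime first (month_name() lives on the .dt accessor):

```python
import pandas as pd

df = pd.DataFrame({'Month': ['02', '12', '06']})
# month_name() is only available on datetime columns via .dt
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name()
print(df['Month'].tolist())  # ['February', 'December', 'June']
```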

First you need to strip the leading "0" (otherwise you may get the exception: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers).
Step 1:
def func(i):
    if i[0] == '0':
        i = i[1]
    return i
df["Month"] = df["Month"].apply(lambda x: func(x))
Step 2:
df["Month"] = df["Month"].apply(lambda x: calendar.month_name[int(x)])

Related

How to update column value based on another value in pandas

I have the below data frame
A    B
Jan  10
Feb  20
Mar  30
Apr  20
Required output: I want to look up 'Mar' in column A, get its corresponding value from B, add that value to the remaining B values, and drop the Mar row, using pandas:
A    B
Jan  40
Feb  50
Apr  50
You can do it in one line, pandas-style, using set_index():
df = df.set_index('A').pipe(lambda x: x.assign(B=x['B'] + x.loc['Mar', 'B'])).drop('Mar').reset_index()
Output:
>>> df
A B
0 Jan 40
1 Feb 50
2 Apr 50
Or in multiple lines (not so pandas-style):
df['B'] += df.loc[df['A'] == 'Mar', 'B'].iloc[0]
df = df[df['A'] != 'Mar']
Or a third and slightly shorter way:
tmp = df.set_index('A').T
df = (tmp.pop('Mar').iloc[0] + tmp.T['B']).reset_index()
You can find the value corresponding to 'Mar', add that value to the rest of the df, then drop the row containing 'Mar'
df.loc[df['A'] != 'Mar','B'] += df.loc[df['A'] == 'Mar', 'B'].values
df = df[df['A'] != 'Mar']
Result:
>>> df
A B
0 Jan 40
1 Feb 50
3 Apr 50

How to split rows into columns in a dataframe

I have this dataframe (with dimensions 840 rows x 1 column):
0 151284 Apr 19 11:37 0-01-20200419063614
1 48054 Apr 21 12:50 0-01-20200421074934
2 187588 Apr 21 13:55 0-01-20200421085439
3 51584 Apr 21 14:37 0-01-20200421143636
4 63522 Apr 22 08:40 0-01-20200422083937
I want to convert this dataframe into a format like this:
id datetime size
151284 2020-04-19 11:37:00 0-01-20200419063614
. . .
datetime being in the format: (yyyy-mm-dd)(hr-min-sec). So basically splitting a single column into three columns and also combining date and time into a single datetime column in a standard format.
Any help is appreciated.
EDIT: output of df.columns: Index(['col'], dtype='object')
Like this:
In [70]: df = pd.DataFrame({'col':['151284 Apr 19 11:37 0-01-20200419063614', '48054 Apr 21 12:50 0-01-20200421074934', '187588 Apr 21 13:55 0-01-20200421085439', '51584 Apr 21 14:37 0-01-20200421143636',
...: '63522 Apr 22 08:40 0-01-20200422083937']})
In [54]: df['id'] = df.col.str.split(' ').str[0]
In [55]: df['Datetime'] = df.col.str.split(' ').str[1] + ' ' + df.col.str.split(' ').str[2] + ' ' + df.col.str.split(' ').str[3]
In [57]: df['Size'] = df.col.str.split(' ').str[-1]
In [63]: from dateutil import parser
In [65]: def format_datetime(x):
    ...:     return parser.parse(x)
    ...:
In [67]: df['Datetime'] = df.Datetime.apply(format_datetime)
In [79]: df
Out[79]:
id Datetime Size
0 151284 2020-04-19 11:37:00 0-01-20200419063614
1 48054 2020-04-21 12:50:00 0-01-20200421074934
2 187588 2020-04-21 13:55:00 0-01-20200421085439
3 51584 2020-04-21 14:37:00 0-01-20200421143636
4 63522 2020-04-22 08:40:00 0-01-20200422083937

Aggregate time series with group by and create chart with multiple series

I have time series data and I want to create a chart of the monthly (x-axis) counts of the number of records (lines chart), grouped by sentiment (multiple lines)
Data looks like this
created_at id polarity sentiment
0 Fri Nov 02 11:22:47 +0000 2018 1058318498663870464 0.000000 neutral
1 Fri Nov 02 11:20:54 +0000 2018 1058318026758598656 0.011905 neutral
2 Fri Nov 02 09:41:37 +0000 2018 1058293038739607552 0.800000 positive
3 Fri Nov 02 09:40:48 +0000 2018 1058292834699231233 0.800000 positive
4 Thu Nov 01 18:23:17 +0000 2018 1058061933243518976 0.233333 neutral
5 Thu Nov 01 17:50:39 +0000 2018 1058053723157618690 0.400000 positive
6 Wed Oct 31 18:57:53 +0000 2018 1057708251758903296 0.566667 positive
7 Sun Oct 28 17:21:24 +0000 2018 1056596810570100736 0.000000 neutral
8 Sun Oct 21 13:00:53 +0000 2018 1053994531845296128 0.136364 neutral
9 Sun Oct 21 12:55:12 +0000 2018 1053993101205868544 0.083333 neutral
So far I have managed to aggregate to the monthly totals, with the following code:
import pandas as pd
tweets = process_twitter_json(file_name)
#print(tweets[:10])
df = pd.DataFrame.from_records(tweets)
print(df.head(10))
#make the string date into a date field
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
#print('Monthly counts')
monthly_sentiment = df.groupby('sentiment')['tweet_datetime'].resample('M').count()
I'm struggling with how to chart the data.
Do I pivot to turn each of the discrete values within the sentiment field into separate columns?
I've tried .unstack(), which turns the sentiment values into rows; that is almost there, but the problem is the dates become string column headers, which is no good for charting.
OK, I changed the monthly aggregation method and used Grouper instead of resample. This meant that when I did the unstack() the resulting dataframe was vertical (deep and narrow) with dates as rows, rather than horizontal with dates as column headers, so I no longer had issues with dates being stored as strings when I came to chart it.
Full code:
import pandas as pd
tweets = process_twitter_json(file_name)
df = pd.DataFrame.from_records(tweets)
df['tweet_datetime'] = pd.to_datetime(df['created_at'])
df.index = df['tweet_datetime']
grouper = df.groupby(['sentiment', pd.Grouper(key='tweet_datetime', freq='M')]).id.count()
result = grouper.unstack('sentiment').fillna(0)
##=================================================
##PLOTLY - charts in Jupyter
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
print (__version__)# requires version >= 1.9.0
import plotly.graph_objs as go
init_notebook_mode(connected=True)
trace0 = go.Scatter(
x = result.index,
y = result['positive'],
name = 'Positive',
line = dict(
color = ('rgb(205, 12, 24)'),
width = 4)
)
trace1 = go.Scatter(
x = result.index,
y = result['negative'],
name = 'Negative',
line = dict(
color = ('rgb(22, 96, 167)'),
width = 4)
)
trace2 = go.Scatter(
x = result.index,
y = result['neutral'],
name = 'Neutral',
line = dict(
color = ('rgb(12, 205, 24)'),
width = 4)
)
data = [trace0, trace1, trace2]
iplot(data)
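If Plotly isn't available, the same unstacked frame can be charted in one call, since pandas draws one line per column. A sketch with toy data standing in for the real result:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside a notebook
import pandas as pd

# toy stand-in for the unstacked groupby result: dates as rows, sentiments as columns
idx = pd.to_datetime(['2018-09-30', '2018-10-31', '2018-11-30'])
result = pd.DataFrame({'positive': [3, 2, 4],
                       'neutral': [5, 6, 2],
                       'negative': [1, 0, 1]}, index=idx)
ax = result.plot(linewidth=2)  # one line per sentiment column
ax.set_ylabel('tweet count')
```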

Plotting line graph on the same figure using matplotlib [duplicate]

I have a temperature file with many years temperature records, in a format as below:
2012-04-12,16:13:09,20.6
2012-04-12,17:13:09,20.9
2012-04-12,18:13:09,20.6
2007-05-12,19:13:09,5.4
2007-05-12,20:13:09,20.6
2007-05-12,20:13:09,20.6
2005-08-11,11:13:09,20.6
2005-08-11,11:13:09,17.5
2005-08-13,07:13:09,20.6
2006-04-13,01:13:09,20.6
Every year has different numbers, time of the records, so the pandas datetimeindices are all different.
I want to plot the different year's data in the same figure for comparing . The X-axis is Jan to Dec, the Y-axis is temperature. How should I go about doing this?
Try:
ax = df1.plot()
df2.plot(ax=ax)
If you are running a Jupyter/IPython notebook and having problems using
ax = df1.plot()
df2.plot(ax=ax)
run both commands inside the same cell! For some reason it won't work when they are separated into sequential cells. For me at least.
Chang's answer shows how to plot a different DataFrame on the same axes.
In this case, all of the data is in the same dataframe, so it's better to use groupby and unstack.
Alternatively, pandas.DataFrame.pivot_table can be used.
dfp = df.pivot_table(index='Month', columns='Year', values='value', aggfunc='mean')
When using pandas.read_csv, names= creates column headers when there are none in the file. The 'date' column must be parsed into datetime64[ns] Dtype so the .dt extractor can be used to extract the month and year.
import pandas as pd
# given the data in a file as shown in the op
df = pd.read_csv('temp.csv', names=['date', 'time', 'value'], parse_dates=['date'])
# create additional month and year columns for convenience
df['Year'] = df.date.dt.year
df['Month'] = df.date.dt.month
# group by month and year and aggregate the mean of the value column
dfg = df.groupby(['Month', 'Year'])['value'].mean().unstack()
# display(dfg)
Year 2005 2006 2007 2012
Month
4 NaN 20.6 NaN 20.7
5 NaN NaN 15.533333 NaN
8 19.566667 NaN NaN NaN
Now it's easy to plot each year as a separate line. The OP only has one observation for each year, so only a marker is displayed.
ax = dfg.plot(figsize=(9, 7), marker='.', xticks=dfg.index)
To do this for multiple dataframes, you can do a for loop over them:
import matplotlib.pyplot as plt

fig = plt.figure(num=None, figsize=(10, 8))
ax = dict_of_dfs['FOO'].column.plot()
for BAR in dict_of_dfs.keys():
if BAR == 'FOO':
pass
else:
dict_of_dfs[BAR].column.plot(ax=ax)
This can also be implemented without the if condition:
fig, ax = plt.subplots()
for BAR in dict_of_dfs.keys():
dict_of_dfs[BAR].plot(ax=ax)
You can make use of the hue parameter in seaborn. For example:
import seaborn as sns
df = sns.load_dataset('flights')
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
.. ... ... ...
139 1960 Aug 606
140 1960 Sep 508
141 1960 Oct 461
142 1960 Nov 390
143 1960 Dec 432
sns.lineplot(x='month', y='passengers', hue='year', data=df)

How to combine multiple columns in a Data Frame to Pandas datetime format

I have a pandas data frame with values as below
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
I would like to know: is there any way I can combine the Date, Month, Year and Time columns into a pandas datetime format?
Use to_datetime, which automatically converts the Day, Month and Year columns, then add the times converted with to_timedelta:
df['Datetime'] = pd.to_datetime(df.rename(columns={'Date':'Day'})[['Day','Month','Year']]) + \
pd.to_timedelta(df['Time'])
Another solution is to join all the columns converted to strings first:
df['Datetime'] = pd.to_datetime(df[['Date','Month','Year', 'Time']]
.astype(str).apply(' '.join, 1), format='%d %m %Y %H:%M:%S')
df['Datetime'] = (pd.to_datetime(df['Year'].astype(str) + '-' +
df['Month'].astype(str) + '-' +
df['Date'].astype(str) + ' ' +
df['Time']))
print (df)
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Last if need remove these columns:
df = df.drop(['Date','Month','Year', 'Time'], axis=1)
print (df)
ProcessID1 UserID Datetime
0 248 Tony 2017-04-29 23:30:56
1 436 Jeff 2017-04-28 20:02:19
2 500 Greg 2017-05-04 11:48:29
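A self-contained sketch of the first approach, with the sample rows inlined so it runs as-is:

```python
import pandas as pd

df = pd.DataFrame({'Date': [29, 28, 4], 'Month': [4, 4, 5], 'Year': [2017, 2017, 2017],
                   'Time': ['23:30:56', '20:02:19', '11:48:29']})
# to_datetime assembles Day/Month/Year columns; to_timedelta adds the clock time
df['Datetime'] = (pd.to_datetime(df.rename(columns={'Date': 'Day'})[['Day', 'Month', 'Year']])
                  + pd.to_timedelta(df['Time']))
print(df['Datetime'].iloc[0])  # 2017-04-29 23:30:56
```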
Concatenate the columns together to a string format and use pd.to_datetime to convert to datetime.
import pandas as pd
import io
txt = """
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
"""
df = pd.read_csv(io.StringIO(txt), sep=r"[\t ,]+", engine='python')
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) \
+ '-' + df['Month'].astype(str) \
+ '-' + df['Year'].astype(str) \
+ ' ' + df['Time'],
format='%d-%m-%Y %H:%M:%S')
df
You can also do this by using the apply() method:
import pandas as pd
df['Datetime']=df[['Year','Month','Date']].astype(str).apply('-'.join,1)+' '+df['Time']
Finally, convert 'Datetime' to datetime dtype by using the pandas to_datetime() method:
df['Datetime']=pd.to_datetime(df['Datetime'])
Output of df:
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Now if you want to remove 'Date','Month','Year' and 'Time' column then use:-
df=df.drop(columns=['Date','Month','Year', 'Time'])
