How to stop months being ordered alphabetically in pandas pivot table - python-3.x

[plot: months shown in alphabetical order]
How can I stop pandas converting my chronologically-ordered data in a csv into alphabetical order (like in my current plot)? This is the code I am using:
import pandas as pd
import seaborn as sns
df = pd.read_csv("C:/Users/Paul/Desktop/calendar.csv")
df2 = df.pivot(index="Month", columns="Year", values="hPM2.5")
ax = sns.heatmap(df2, annot=True, fmt="d")

I think you can use an ordered categorical:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'Month': ['January', 'February', 'September'],
                   'Year': [2015, 2015, 2016],
                   'hPM2.5': [7, 8, 9]})
print (df)
       Month  Year  hPM2.5
0    January  2015       7
1   February  2015       8
2  September  2016       9
cats = ['January', 'February', 'March', 'April', 'May', 'June',
        'July', 'August', 'September', 'October', 'November', 'December']
# the old astype('category', categories=..., ordered=...) form was removed
# from pandas; pd.Categorical does the same job
df['Month'] = pd.Categorical(df['Month'], categories=cats, ordered=True)
df2 = df.pivot(index="Month", columns="Year", values="hPM2.5")
sns.heatmap(df2, annot=True)

Related

how to set datetime type index for weekly column in pandas dataframe

I have a data as given below:
date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50
This data is saved in test.txt file.
Date column is given as a weekly column as a concatenation of year and weekid. I am trying to set the date column as an index, with given code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
But it gives an error. How can I set the date column as an index with datetime type?
Use the index_col parameter to set the index:
data=pd.read_csv("test.txt", sep="\t", index_col=[0])
EDIT: Using column name as index:
data=pd.read_csv("test.txt", sep="\t", index_col=['date'])
For converting the index from int to datetime, do this (note that '%Y%m' treats the week id as a month, so it only works for week ids 01-12):
data.index = pd.to_datetime(data.index, format='%Y%m')
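Putting both steps together for a genuine week id: this sketch appends a weekday digit so strptime's '%Y%W%w' directives can resolve a real date (the inline string stands in for test.txt):

```python
import pandas as pd
from io import StringIO

# stand-in for test.txt; 201915 is week 15, beyond what '%Y%m' could parse
txt = "date\tproduct\tprice\tamount\n201901\tA\t10\t20\n201915\tA\t10\t20\n"
data = pd.read_csv(StringIO(txt), sep="\t", index_col=['date'])
# append '1' (Monday) so %Y%W%w can turn year+week into a concrete date
data.index = pd.to_datetime(data.index.astype(str) + '1', format='%Y%W%w')
```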
There might be simpler solutions than this, but using apply I first converted your Year-WeekId into Year-month-day format, and then simply used set_index to make date the index column.
import pandas as pd
data ={
'date' : [201901,201902,201903,201904,201905],
'product' : ['A','A','A','C','C'],
'price' : [10,10,10,20,20],
'amount' : [20,20,30,50,60]
}
df = pd.DataFrame(data)
# str(x)+'1' appends a weekday digit (1 = Monday), giving Year-WeekId-Weekday,
# so '2019021' means 2019, week 2, Monday.
# If you want you can try other formats too
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1', format='%Y%W%w'))
df.set_index(['date'],inplace=True)
df
Edit:
To see the datetime in Year-WeekID format you can style the dataframe as follows. However, if you set date as the index column, the following code won't work. Also remember that this only applies display styling, so it is useful for display purposes only; internally the column remains a datetime object.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
style_format = {'date':'{:%Y%W}'}
df.style.format(style_format)
You can also use the date_parser parameter:
import pandas as pd
from io import StringIO
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y%m')
inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")
df = pd.read_csv(inputtxt, sep=r'\s+', parse_dates=['date'], date_parser=dateparse)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
3 amount 4 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 256.0+ bytes
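A note for newer pandas: date_parser is deprecated in pandas 2.0 in favor of the date_format argument, which covers this case directly (same inline data):

```python
import pandas as pd
from io import StringIO

inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20""")
# date_format replaces the deprecated date_parser in pandas 2.0+
df = pd.read_csv(inputtxt, sep=r'\s+', parse_dates=['date'], date_format='%Y%m')
```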

webscraping with beautiful soup

I'm trying to scrape the table that contains the current interest rates from a website. I used Python with Beautiful Soup but I can't locate the HTML parts. Please send help! Thank you.
I only need to scrape the current interest rates table (not everything else) and convert it into a csv file. Here is the link to my website: https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx
[screenshot of the current interest rates table]
I tried something like this:
import bs4
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = 'https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx'
response = requests.get(URL)
soup=bs4.BeautifulSoup(response.content, 'html.parser')
print(soup.title)
print(soup.title.string)
print(len(response.text))
table = soup.find('table', attrs = {'class':'tableheader'}).tbody
print(table)
columns = ['Current interest rates']
df = pd.DataFrame(columns = columns)
trs = table.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    row = [td.text.replace('\n', '') for td in tds]
    df = df.append(pd.Series(row, index = columns), ignore_index = True)
df.to_csv('libor.csv', index = False)
but this gave me an attribute error: "'NoneType' object has no attribute 'tbody'".
Oh, I also want to automatically scrape the Mondays' interest rates if that's possible.
Thank you for your help.
Here is my attempt with just pandas
import pandas as pd
# Get all tables on page
dfs = pd.read_html('https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx')
# Find the Current interest rates table
df = [df for df in dfs if df.iloc[0][0] == 'Current interest rates'][0]
# Remove first row that contains column names
df = df.iloc[1:].copy()
# Set column names
df.columns = ['DATE','INTEREST_RATE']
# Convert date from november 02 2020 to 2020-11-02
df['DATE'] = pd.to_datetime(df['DATE'])
# Remove percentage sign from interest rate
df['INTEREST_RATE'] = df['INTEREST_RATE'].str.replace('%','').str.strip()
# Convert percentage to float type
df['INTEREST_RATE'] = df['INTEREST_RATE'].astype(float)
# Add day of the week column
df['DAY'] = df['DATE'].dt.day_name()
# Output all to CSV
df.to_csv('all_data.csv', index=False)
# Only Mondays
df_monday = df[df['DAY'] == 'Monday']
# Output only Mondays
df_monday.to_csv('monday_data.csv', index=False)
# Add day number of week (Monday = 0)
df['DAY_OF_WEEK_NUMBER'] = df['DATE'].dt.dayofweek
# Add week number of year (Series.dt.weekofyear is deprecated; use isocalendar)
df['WEEK_OF_YEAR_NUMBER'] = df['DATE'].dt.isocalendar().week
# 1. Sort by week of year then day of week
# 2. Group by week of year
# 3. Select first record in group, which will be the earliest day available of that week
df_first_day_of_week = df.sort_values(['WEEK_OF_YEAR_NUMBER','DAY_OF_WEEK_NUMBER']).groupby('WEEK_OF_YEAR_NUMBER').first()
# Output earliest day of the week data
df_first_day_of_week.to_csv('first_day_of_week.csv', index=False)
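The sort/group/first pattern above can be checked on a tiny hand-made frame (dates chosen so one ISO week has two rows; isocalendar().week is the non-deprecated spelling):

```python
import pandas as pd

df = pd.DataFrame({'DATE': pd.to_datetime(['2020-10-27', '2020-10-26', '2020-11-03'])})
df['DAY_OF_WEEK_NUMBER'] = df['DATE'].dt.dayofweek
df['WEEK_OF_YEAR_NUMBER'] = df['DATE'].dt.isocalendar().week
# sorting by (week, weekday) then taking the first row per week yields
# the earliest available day of each week
first = (df.sort_values(['WEEK_OF_YEAR_NUMBER', 'DAY_OF_WEEK_NUMBER'])
           .groupby('WEEK_OF_YEAR_NUMBER').first())
```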
You can use this example to scrape the "Current interest rates":
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('table:has(td:contains("Current interest rates"))[style="width:208px;border:1px solid #CCCCCC;"] tr:not(:has([colspan]))'):
    tds = [td.get_text(strip=True) for td in row.select('td')]
    all_data.append(tds)
df = pd.DataFrame(all_data, columns=['Date', 'Rate'])
print(df)
df.to_csv('data.csv', index=False)
Prints:
Date Rate
0 november 02 2020 0.33238 %
1 october 30 2020 0.33013 %
2 october 29 2020 0.33100 %
3 october 28 2020 0.32763 %
4 october 27 2020 0.33175 %
5 october 26 2020 0.33200 %
6 october 23 2020 0.33663 %
7 october 22 2020 0.33513 %
8 october 21 2020 0.33488 %
9 october 20 2020 0.33713 %
10 october 19 2020 0.33975 %
11 october 16 2020 0.33500 %
And saves data.csv.
EDIT: To get only Mondays, you can do this with the dataframe:
df['Date'] = pd.to_datetime(df['Date'])
print(df[df['Date'].dt.weekday==0])
Prints:
Date Rate
0 2020-11-02 0.33238 %
5 2020-10-26 0.33200 %
10 2020-10-19 0.33975 %

Sort pandas dataframe by a column

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# intialise data of lists.
data = {'A': [1, 1, 1, 1, 2, 2, 2, 2],
        'B': [2, 3, 1, 5, 7, 7, 1, 6]}
# Create DataFrame
df = pd.DataFrame(data)
df
I want to sort 'B' by each group of 'A'
Expected Output:
A B
0 1 1
1 1 2
2 1 3
3 1 5
4 2 1
5 2 6
6 2 7
7 2 7
You can sort a dataframe with the sort_values method. This call sorts with priority on A and then B, as requested; ignore_index=True renumbers the rows to match the expected output:
df.sort_values(by=['A', 'B'], ignore_index=True)
Docs
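A runnable check on the question's data; passing ignore_index=True (an optional extra, available since pandas 1.0) renumbers the rows 0..7 as in the expected output:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                   'B': [2, 3, 1, 5, 7, 7, 1, 6]})
# sort within each group of A by B; ignore_index resets the row labels
out = df.sort_values(by=['A', 'B'], ignore_index=True)
```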

Change the bar item name in Pandas

I have a test excel file like:
df = pd.DataFrame({'name': list('abcdefg'),
                   'age': [10, 20, 5, 23, 58, 4, 6]})
print (df)
name age
0 a 10
1 b 20
2 c 5
3 d 23
4 e 58
5 f 4
6 g 6
I use Pandas and matplotlib to read and plot it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
df.plot(kind="bar")
plt.show()
The resulting plot uses the index number as the item name. How can I change it to the names stored in the name column?
You can specify columns for x and y values in plot.bar:
df.plot(x='name', y='age', kind="bar")
Or create Series first by DataFrame.set_index and select age column:
df.set_index('name')['age'].plot(kind="bar")
#if multiple columns
#df.set_index('name').plot(kind="bar")
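A headless sketch verifying that x='name' puts the names on the axis (the Agg backend is used here so no display window is needed):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; safe for scripts without a display
import pandas as pd

df = pd.DataFrame({'name': list('abcdefg'),
                   'age': [10, 20, 5, 23, 58, 4, 6]})
ax = df.plot(x='name', y='age', kind='bar')
# bar plots set fixed tick labels at plot time, taken from the x column
labels = [t.get_text() for t in ax.get_xticklabels()]
```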

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calory intake. It has 4 columns: cal, day, month, year. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I hence would like to parse month, year, and day to a single column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=[['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)
My questions are: (1) How do I parse the data and (2) define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here) - but I can't make it work so far.
You can use the parse_dates parameter of read_csv, passing the column names as a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
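An equivalent post-read route for data that is already loaded: pd.to_datetime accepts a frame whose columns are named year, month, and day (the sample rows below stand in for the csv):

```python
import pandas as pd

df = pd.DataFrame({'cal': [3668.4333, 3652.2498],
                   'month': [1, 1],
                   'year': [2002, 2002],
                   'day': [10, 11]})
# to_datetime assembles a datetime column from year/month/day components
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df = df.drop(columns=['year', 'month', 'day'])
```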
