webscraping with beautiful soup


I'm trying to scrape the table of current interest rates from a website. I used Python with Beautiful Soup, but I can't locate the right HTML elements. Please help, thank you!
I only need the current interest rates table, not everything else, and I want to convert it into a CSV file. Here is the link to the website: https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx
I tried something like this:
import bs4
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = 'https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx'
response = requests.get(URL)
soup=bs4.BeautifulSoup(response.content, 'html.parser')
print(soup.title)
print(soup.title.string)
print(len(response.text))
table = soup.find('table', attrs = {'class':'tableheader'}).tbody
print(table)
columns = ['Current interest rates']
df = pd.DataFrame(columns = columns)
trs = table.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    row = [td.text.replace('\n', '') for td in tds]
    df = df.append(pd.Series(row, index = columns), ignore_index = True)
df.to_csv('libor.csv', index = False)
but this gave me an attribute error: 'NoneType' object has no attribute 'tbody'
I would also like to automatically scrape only the Mondays' interest rates, if that's possible.
Thank you for your help.

Here is my attempt with just pandas
import pandas as pd
# Get all tables on page
dfs = pd.read_html('https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx')
# Find the Current interest rates table
df = [df for df in dfs if df.iloc[0][0] == 'Current interest rates'][0]
# Remove first row that contains column names
df = df.iloc[1:].copy()
# Set column names
df.columns = ['DATE','INTEREST_RATE']
# Convert date from november 02 2020 to 2020-11-02
df['DATE'] = pd.to_datetime(df['DATE'])
# Remove percentage sign from interest rate
df['INTEREST_RATE'] = df['INTEREST_RATE'].str.replace('%','').str.strip()
# Convert percentage to float type
df['INTEREST_RATE'] = df['INTEREST_RATE'].astype(float)
# Add day of the week column
df['DAY'] = df['DATE'].dt.day_name()
# Output all to CSV
df.to_csv('all_data.csv', index=False)
# Only Mondays
df_monday = df[df['DAY'] == 'Monday']
# Output only Mondays
df_monday.to_csv('monday_data.csv', index=False)
# Add day number of week (Monday = 0)
df['DAY_OF_WEEK_NUMBER'] = df['DATE'].dt.dayofweek
# Add week number of year (dt.weekofyear was removed in pandas 2.0; isocalendar() replaces it)
df['WEEK_OF_YEAR_NUMBER'] = df['DATE'].dt.isocalendar().week
# 1. Sort by week of year then day of week
# 2. Group by week of year
# 3. Select first record in group, which will be the earliest day available of that week
df_first_day_of_week = df.sort_values(['WEEK_OF_YEAR_NUMBER','DAY_OF_WEEK_NUMBER']).groupby('WEEK_OF_YEAR_NUMBER').first()
# Output earliest day of the week data
df_first_day_of_week.to_csv('first_day_of_week.csv', index=False)
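The week-based selection above can be tried offline on a few hand-typed rows. This is only a minimal sketch, not the answer's exact code: the sample values are taken from the table printed on this page (with dates rewritten in ISO form), the helper name `first_day_of_each_week` is hypothetical, and grouping by ISO year as well as ISO week is an added assumption so that weeks from different years do not collide.

```python
import pandas as pd

def first_day_of_each_week(df):
    """Return the earliest available row of each ISO week (hypothetical helper)."""
    iso = df['DATE'].dt.isocalendar()  # DataFrame with columns: year, week, day
    keyed = df.assign(ISO_YEAR=iso['year'], ISO_WEEK=iso['week'])
    # Sort chronologically, then keep the first row of every (year, week) group
    return (keyed.sort_values('DATE')
                 .groupby(['ISO_YEAR', 'ISO_WEEK'], as_index=False)
                 .first())

# Sample rows typed in by hand (assumption: same shape as the scraped table)
raw = pd.DataFrame({
    'DATE': ['2020-11-02', '2020-10-30', '2020-10-27', '2020-10-26'],
    'INTEREST_RATE': ['0.33238 %', '0.33013 %', '0.33175 %', '0.33200 %'],
})
raw['DATE'] = pd.to_datetime(raw['DATE'])
raw['INTEREST_RATE'] = (raw['INTEREST_RATE']
                        .str.replace('%', '', regex=False)
                        .str.strip()
                        .astype(float))

weekly = first_day_of_each_week(raw)
mondays = raw[raw['DATE'].dt.dayofweek == 0]  # Monday = 0
```

If a Monday is missing (for example on a bank holiday), `weekly` still contains the earliest available day of that week, whereas `mondays` simply has no row for it.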

You can use this example to scrape the "Current interest rates":
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.global-rates.com/en/interest-rates/libor/american-dollar/usd-libor-interest-rate-12-months.aspx'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
# Note: newer soupsieve versions deprecate :contains() in favour of :-soup-contains()
for row in soup.select('table:has(td:contains("Current interest rates"))[style="width:208px;border:1px solid #CCCCCC;"] tr:not(:has([colspan]))'):
    tds = [td.get_text(strip=True) for td in row.select('td')]
    all_data.append(tds)
df = pd.DataFrame(all_data, columns=['Date', 'Rate'])
print(df)
df.to_csv('data.csv', index=False)
Prints:
Date Rate
0 november 02 2020 0.33238 %
1 october 30 2020 0.33013 %
2 october 29 2020 0.33100 %
3 october 28 2020 0.32763 %
4 october 27 2020 0.33175 %
5 october 26 2020 0.33200 %
6 october 23 2020 0.33663 %
7 october 22 2020 0.33513 %
8 october 21 2020 0.33488 %
9 october 20 2020 0.33713 %
10 october 19 2020 0.33975 %
11 october 16 2020 0.33500 %
And saves the table to data.csv.
EDIT: To get only Mondays, you can do this with the dataframe:
df['Date'] = pd.to_datetime(df['Date'])
print(df[df['Date'].dt.weekday==0])
Prints:
Date Rate
0 2020-11-02 0.33238 %
5 2020-10-26 0.33200 %
10 2020-10-19 0.33975 %
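The AttributeError in the question arises because soup.find returns None when no element matches, and None has no .tbody attribute. Below is a minimal offline sketch of a guarded lookup. The HTML snippet, the `id="rates"` attribute and the function name `extract_rates` are made up for illustration and do not reproduce the real page's markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; the real page is structured differently
SAMPLE_HTML = """
<table id="rates">
  <tr><td colspan="2">Current interest rates</td></tr>
  <tr><td>november 02 2020</td><td>0.33238 %</td></tr>
  <tr><td>october 30 2020</td><td>0.33013 %</td></tr>
</table>
"""

def extract_rates(html, table_id):
    """Return [date, rate] rows from the table with the given id, or [] if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', attrs={'id': table_id})
    if table is None:  # find() returns None when nothing matches
        return []
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == 2:  # skip the single-cell header row
            rows.append(cells)
    return rows

rates = extract_rates(SAMPLE_HTML, 'rates')
missing = extract_rates(SAMPLE_HTML, 'tableheader')  # no such table: returns []
```

Checking for None before chaining further attribute access turns a confusing traceback into an explicit "table not found" case that you can handle or log.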

Related

how to set datetime type index for weekly column in pandas dataframe

I have a data as given below:
date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50
This data is saved in test.txt file.
Date column is given as a weekly column as a concatenation of year and weekid. I am trying to set the date column as an index, with given code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
But it gives an error. How can I set the date column as an index with datetime type?

Use index_col parameter for setting index:
data=pd.read_csv("test.txt", sep="\t", index_col=[0])
EDIT: Using column name as index:
data=pd.read_csv("test.txt", sep="\t", index_col=['date'])
To convert the index from int to datetime, do this (note that '%Y%m' reads the last two digits as a month number, not a week number):
data.index = pd.to_datetime(data.index, format='%Y%m')
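Because the question describes the column as a year followed by a week id, it may help to see how the two format strings interpret the same value. The sketch below uses only the standard library; the variable names are illustrative.

```python
from datetime import datetime

value = '201902'

# '%Y%m' treats the last two digits as a month number
as_month = datetime.strptime(value, '%Y%m')

# '%Y%W%w' treats them as a week number (weeks start on Monday);
# the appended '1' is the weekday and selects Monday
as_week = datetime.strptime(value + '1', '%Y%W%w')
```

Here as_month is 1 February 2019, while as_week is Monday 14 January 2019, so the choice of format changes the resulting dates considerably.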

There may be simpler solutions than this. Using apply, I first converted your Year-WeekId values into year-month-day dates, and then used set_index to make date the index column.
import pandas as pd
data ={
'date' : [201901,201902,201903,201904,201905],
'product' : ['A','A','A','C','C'],
'price' : [10,10,10,20,20],
'amount' : [20,20,30,50,60]
}
df = pd.DataFrame(data)
# str(x)+'1' builds a Year-WeekId-Weekday string, where the trailing 1 means Monday,
# so 2019021 means 2019, week 2, Monday.
# You can try other formats too.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
df.set_index(['date'],inplace=True)
df
Edit:
To display the dates in Year-WeekId format, you can style the dataframe as follows. Note that this code will not work once date has been set as the index column. Also remember that it only applies styling for display purposes; internally the values remain datetime objects.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
style_format = {'date':'{:%Y%W}'}
df.style.format(style_format)

You also can use the date_parser parameter:
import pandas as pd
from io import StringIO
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y%m')
inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")
# Note: date_parser is deprecated since pandas 2.0 in favour of date_format
df = pd.read_csv(inputtxt, sep=r'\s+', parse_dates=['date'], date_parser=dateparse)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
3 amount 4 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 256.0+ bytes
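Since date_parser is deprecated in pandas 2.0 and later, one version-independent alternative is to read the column as text and convert it afterwards. This is a minimal sketch under that assumption, reusing the sample data from the answer above and the same '%Y%m' interpretation of the values.

```python
from io import StringIO
import pandas as pd

inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")

# Read the date column as text so that pandas does not guess a type for it
df = pd.read_csv(inputtxt, sep=r'\s+', dtype={'date': str})

# Convert explicitly, then use the result as a DatetimeIndex
df['date'] = pd.to_datetime(df['date'], format='%Y%m')
df = df.set_index('date')
```

Keeping the conversion as a separate step also makes it easy to swap in a different format string if the values turn out to be week ids rather than months.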

How to run a script on x axis of plots in matplotlib [duplicate]

I want to transform an integer between 1 and 12 into an abbreviated month name.
I have a df which looks like:
client Month
1 sss 02
2 yyy 12
3 www 06
I want the df to look like this:
client Month
1 sss Feb
2 yyy Dec
3 www Jun
Most of the info I found was not in python>pandas>dataframe hence the question.

You can do this efficiently by combining calendar.month_abbr with df[col].apply(). The int() call handles months stored as strings such as '02':
import calendar
df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[int(x)])

Since the abbreviated month names are the first three letters of the full names, we can first convert the Month column to datetime, then use dt.month_name() to get the full month name, and finally use str.slice() to keep the first three letters, all in pandas and in one line of code:
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name().str.slice(stop=3)
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www

The calendar module is useful, but calendar.month_abbr is array-like: it cannot be used directly in a vectorised fashion. For an efficient mapping, you can construct a dictionary and then use pd.Series.map:
import calendar
d = dict(enumerate(calendar.month_abbr))
df['Month'] = df['Month'].map(d)
Performance benchmarking shows a ~130x performance differential:
import calendar
import numpy as np
import pandas as pd
d = dict(enumerate(calendar.month_abbr))
mapper = calendar.month_abbr.__getitem__
np.random.seed(0)
n = 10**5
df = pd.DataFrame({'A': np.random.randint(1, 13, n)})
%timeit df['A'].map(d) # 7.29 ms per loop
%timeit df['A'].map(mapper) # 946 ms per loop
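Applied to the frame from the question, the dictionary approach might look like the sketch below. It assumes the Month column holds zero-padded strings, as the question's display suggests; the name `month_lookup` is illustrative.

```python
import calendar
import pandas as pd

df = pd.DataFrame({'client': ['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})

# {0: '', 1: 'Jan', ..., 12: 'Dec'}; the names follow the current locale
month_lookup = dict(enumerate(calendar.month_abbr))

# Convert the zero-padded strings to integers so they match the dict keys
df['Month'] = df['Month'].astype(int).map(month_lookup)
```

If the column already contains integers, the astype(int) step is harmless and can be dropped.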

Solution 1: One liner
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.strftime('%b')
Solution 2: Using apply() (assumes the Month column already holds datetime values)
def mapper(month):
    return month.strftime('%b')
df['Month'] = df['Month'].apply(mapper)
Reference:
http://strftime.org/
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

Using datetime object methods
I'm surprised this question doesn't have an answer using strftime.
Note: you need a valid datetime column before using the strftime method; use pd.to_datetime(df['date_column']) to cast your target column to datetime.
import pandas as pd
dates = pd.date_range('01-Jan 2020','01-Jan 2021',freq='M')
df = pd.DataFrame({'dates' : dates})
df['month_name'] = df['dates'].dt.strftime('%b')
dates month_name
0 2020-01-31 Jan
1 2020-02-29 Feb
2 2020-03-31 Mar
3 2020-04-30 Apr
4 2020-05-31 May
5 2020-06-30 Jun
6 2020-07-31 Jul
7 2020-08-31 Aug
8 2020-09-30 Sep
9 2020-10-31 Oct
10 2020-11-30 Nov
11 2020-12-31 Dec
Another method would be to slice the name returned by dt.month_name():
df['month_name_str_slice'] = df['dates'].dt.month_name().str[:3]
dates month_name month_name_str_slice
0 2020-01-31 Jan Jan
1 2020-02-29 Feb Feb
2 2020-03-31 Mar Mar
3 2020-04-30 Apr Apr
4 2020-05-31 May May
5 2020-06-30 Jun Jun
6 2020-07-31 Jul Jul
7 2020-08-31 Aug Aug
8 2020-09-30 Sep Sep
9 2020-10-31 Oct Oct
10 2020-11-30 Nov Nov
11 2020-12-31 Dec Dec

You can do this easily with a column apply.
import pandas as pd
df = pd.DataFrame({'client':['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
look_up = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May',
           '06': 'Jun', '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Month'] = df['Month'].apply(lambda x: look_up[x])
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www

One way of doing that is with the apply method on the dataframe column but, to do that, you need a mapping to convert the months. You could do that either with a function or dictionary, or with Python's own datetime.
With datetime it would be something like:
import datetime
def mapper(month):
    date = datetime.datetime(2000, int(month), 1)  # You need a date object with the proper month
    return date.strftime('%b')  # %b returns the month's abbreviation
df['Month'].apply(mapper)
In a similar way, you could build your own map for custom names. It would look like this:
months_map = {1: 'Jan', 2: 'Feb'}  # integer keys; literals like 01 are a syntax error in Python 3
def mapper(month):
    return months_map[int(month)]
Obviously, you don't need to define these functions explicitly and could use a lambda directly in the apply method.
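The datetime-based mapping can also be tried without pandas. The following is a small standard-library sketch; the function name `month_abbreviation` is hypothetical, and accepting both integers and numeric strings is an added convenience rather than part of the answer.

```python
import datetime

def month_abbreviation(month):
    """Return the abbreviated name for a month given as an int or numeric string."""
    # Year 2000 and day 1 are arbitrary; only the month matters for '%b'
    return datetime.date(2000, int(month), 1).strftime('%b')

abbreviations = [month_abbreviation(m) for m in ('02', 12, '06')]
```

The same function can be passed straight to df['Month'].apply(month_abbreviation).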

For the reverse conversion (month abbreviation to month number), use strptime with a lambda function:
from time import strptime
df['Month'] = df['Month'].apply(lambda x: strptime(x,'%b').tm_mon)
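To make the direction of this conversion explicit, here is a short standard-library sketch that goes from abbreviation to number and back again; the helper name `month_number` is illustrative.

```python
import calendar
from time import strptime

def month_number(abbr):
    """Parse an abbreviated month name such as 'Feb' into its number (1-12)."""
    return strptime(abbr, '%b').tm_mon

numbers = [month_number(a) for a in ('Feb', 'Dec', 'Jun')]

# calendar.month_abbr maps the numbers back to abbreviations
round_trip = [calendar.month_abbr[n] for n in numbers]
```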

Suppose we have a DF like this, and Date is already in DateTime Format:
df.head(3)
value
date
2016-05-19 19736
2016-05-26 18060
2016-05-27 19997
Then we can extract month number and month name easily like this :
df['month_num'] = df.index.month
df['month'] = df.index.month_name()
value year month_num month
date
2017-01-06 37353 2017 1 January
2019-01-06 94108 2019 1 January
2019-01-05 77897 2019 1 January
2019-01-04 94514 2019 1 January

Having tested all of these on a large dataset, I have found the following to be fastest:
import calendar
def month_mapping():
    # I'm lazy so I have a stash of functions already written so
    # I don't have to write them out every time. This returns the
    # {1:'Jan'....12:'Dec'} dict in the laziest way...
    abbrevs = {}
    for month in range(1, 13):
        abbrevs[month] = calendar.month_abbr[month]
    return abbrevs
abbrevs = month_mapping()
df['Month Abbrev'] = df['Date Col'].dt.month.map(abbrevs)

You can use Pandas month_name() function. Example:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')

The best way would be to use month_name(), as suggested by Nurul Akter Towhid (this requires the Month column to be datetime):
df['Month'] = df.Month.dt.month_name()

First you need to strip the leading "0" (otherwise you might get the exception: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers).
Step 1:
def func(i):
    if i[0] == '0':
        i = i[1]
    return int(i)
df["Month"] = df["Month"].apply(func)
Step 2:
import calendar
df["Month"] = df["Month"].apply(lambda x: calendar.month_name[x])

Format datetime values in pandas by stripping

I have a df['timestamp'] column which has values in format: yyyy-mm-ddThh:mm:ssZ. The dtype is object.
Now, I want to split the value into 3 new columns: one for the day, one for the hour, and one for the day index (Mon, Tue, Wed, ...), like this:
Current: column = timestamp
yyyy-mm-ddThh:mm:ssZ
Desired:
New Col1|New Col2|New Col3
dd|hh|day_index
What function should I use?

Since you said the timestamp column is of type object, I assume it holds strings. Since the format is fixed, use str.slice to get the corresponding characters. To get the weekday names, use dt.day_name() on the datetime64 column converted from timestamp.
import pandas as pd
data = {'timestamp': ['2019-07-01T05:23:33Z', '2019-07-03T02:12:33Z', '2019-07-23T11:05:23Z', '2019-07-12T08:15:51Z'], 'Val': [1.24,1.259, 1.27,1.298] }
df = pd.DataFrame(data)
# utc=True handles the trailing Z; a date-only format string would not match these values
ds = pd.to_datetime(df['timestamp'], utc=True, errors='coerce')
df['datetime'] = ds
df['dd'] = df['timestamp'].str.slice(start=8, stop=10)
df['hh'] = df['timestamp'].str.slice(start=11, stop=13)
df['weekday'] = df['datetime'].dt.day_name()
print(df)
The output:
timestamp Val datetime dd hh weekday
0 2019-07-01T05:23:33Z 1.240 2019-07-01 05:23:33+00:00 01 05 Monday
1 2019-07-03T02:12:33Z 1.259 2019-07-03 02:12:33+00:00 03 02 Wednesday
2 2019-07-23T11:05:23Z 1.270 2019-07-23 11:05:23+00:00 23 11 Tuesday
3 2019-07-12T08:15:51Z 1.298 2019-07-12 08:15:51+00:00 12 08 Friday

First convert the df['timestamp'] column to datetime. Then extract the year, month and day from it. Code below.
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True, errors='coerce')
df['Year'] = df['timestamp'].dt.year
df['Month'] = df['timestamp'].dt.month
df['Day'] = df['timestamp'].dt.day
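For the three columns the question actually asks for (day, hour and day index), the .dt accessor can supply each one directly. A minimal sketch, assuming the sample timestamps used in the first answer and the hypothetical column names day, hour and day_index:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['2019-07-01T05:23:33Z', '2019-07-03T02:12:33Z',
                                 '2019-07-23T11:05:23Z', '2019-07-12T08:15:51Z']})

# Parse the ISO 8601 strings; utc=True keeps the Z (UTC) information
ts = pd.to_datetime(df['timestamp'], utc=True)

df['day'] = ts.dt.day
df['hour'] = ts.dt.hour
df['day_index'] = ts.dt.dayofweek  # Monday = 0 ... Sunday = 6
```

Working from the parsed datetimes rather than string slices means the columns come out as integers and stay correct if the string layout ever changes.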

How to stop months being ordered alphabetically in pandas pivot table

How can I stop pandas converting my chronologically-ordered data in a csv into alphabetical order (like in my current plot)? This is the code I am using:
import seaborn as sns
df = pd.read_csv("C:/Users/Paul/Desktop/calendar.csv")
df2 = df.pivot("Month", "Year", "hPM2.5")
ax = sns.heatmap(df2, annot=True, fmt="d")

I think you can use ordered categorical:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'Month':['January','February','September'],
'Year':[2015,2015,2016],
'hPM2.5':[7,8,9]})
print (df)
Month Year hPM2.5
0 January 2015 7
1 February 2015 8
2 September 2016 9
cats = ['January','February','March','April','May','June',
'July','August','September','October','November','December']
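# astype('category', ordered=..., categories=...) was removed from pandas; pd.Categorical does the same
df['Month'] = pd.Categorical(df['Month'], categories=cats, ordered=True)
# Recent pandas versions require keyword arguments for pivot
df2 = df.pivot(index="Month", columns="Year", values="hPM2.5")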
sns.heatmap(df2, annot=True)

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calorie intake. It has 4 columns: cal, month, year, day. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I hence would like to parse month, year, and day to a single column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)
My questions are: (1) How do I parse the data and (2) define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here) - but I can't make it work so far.

You can use the parse_dates parameter of read_csv, passing the column names as a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or, better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
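Combining several columns through parse_dates is deprecated in pandas 2.2 and later, so on recent versions it may be safer to read the columns as plain integers and assemble the date afterwards. A minimal sketch under that assumption, using a shortened copy of the sample data and the hypothetical names date and series:

```python
import io
import pandas as pd

temp = """cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3661.9414,2,2003,14"""

df = pd.read_csv(io.StringIO(temp))

# to_datetime accepts a frame whose columns are named year, month and day
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# A Series indexed by date is a convenient starting point for time series work
series = df.set_index('date')['cal']
```

The new column has dtype datetime64[ns], which answers the second part of the question about the data type.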
