I am trying to crawl a small table of data from here; my process is shown in the code below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://oilprice.com/rig-count'
# html = urllib.request.urlopen(url)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
contents = soup.find_all('div', {'class': 'info_table'})
print(contents[0].children)
rows = []
for child in contents[0].children:
    row = []
    for td in child:
        print(td)  # does not work after this line
        try:
            row.append(td.text.replace('\n', ''))
        except:
            continue
    if len(row) > 0:
        rows.append(row)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
Since contents holds quite a large chunk of HTML, I don't know how to correctly extract the data and save it as a DataFrame. Could someone share an answer or give me some tips? Thanks.
You can use:
table = soup.find('div', {'class': 'info_table'})
data = [[cell.text.strip() for cell in row.find_all('div')]
        for row in table.find_all('div', recursive=False)]
df = pd.DataFrame(data[1:], columns=data[0])
Output:
>>> df
Date Oil Rigs Gas Rigs Total Rigs Frac Spread Production Million Bpd
0 4th Mar 2022 519 130 650 280
1 25th Feb 2022 522 127 650 290
2 18th Feb 2022 520 124 645 283 11.60
3 11th Feb 2022 516 118 635 275 11.60
4 4th Feb 2022 497 116 613 264 11.60
.. ... ... ... ... ... ...
358 26th Dec 2014 1499 340 1840 367 9.12
359 19th Dec 2014 1536 338 1875 415 9.13
360 12th Dec 2014 1546 346 1893 411 9.14
361 5th Dec 2014 1575 344 1920 428 9.12
362 21st Nov 2014 1574 355 1929 452 9.08
[363 rows x 6 columns]
Update
A lazy way to let Pandas guess the datatypes is to convert your data to CSV:
import io
table = soup.find('div', {'class': 'info_table'})
data = ['\t'.join(cell.text.strip() for cell in row.find_all('div'))
        for row in table.find_all('div', recursive=False)]
buf = io.StringIO()
buf.write('\n'.join(data))
buf.seek(0)
df = pd.read_csv(buf, sep='\t', parse_dates=['Date'])
Output:
>>> df
Date Oil Rigs Gas Rigs Total Rigs Frac Spread Production Million Bpd
0 2022-03-04 519 130 650 280 NaN
1 2022-02-25 522 127 650 290 NaN
2 2022-02-18 520 124 645 283 11.60
3 2022-02-11 516 118 635 275 11.60
4 2022-02-04 497 116 613 264 11.60
.. ... ... ... ... ... ...
358 2014-12-26 1499 340 1840 367 9.12
359 2014-12-19 1536 338 1875 415 9.13
360 2014-12-12 1546 346 1893 411 9.14
361 2014-12-05 1575 344 1920 428 9.12
362 2014-11-21 1574 355 1929 452 9.08
[363 rows x 6 columns]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 363 non-null datetime64[ns]
1 Oil Rigs 363 non-null int64
2 Gas Rigs 363 non-null int64
3 Total Rigs 363 non-null int64
4 Frac Spread 363 non-null int64
5 Production Million Bpd 360 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4)
memory usage: 17.1 KB
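Alternatively, the dtypes can be converted without the CSV round-trip; a minimal sketch, assuming the data list of lists from the first snippet (the ordinal suffixes on the dates have to be stripped before parsing):
df = pd.DataFrame(data[1:], columns=data[0])
# Strip the ordinal suffix ('4th' -> '4') so the dates parse cleanly.
df['Date'] = pd.to_datetime(
    df['Date'].str.replace(r'(\d+)(st|nd|rd|th)', r'\1', regex=True),
    format='%d %b %Y')
# Convert the remaining columns to numbers; empty cells become NaN.
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')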
The best answer should correspond to the smallest change; you just need to use re for the matching:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import bs4
import re
url = 'https://oilprice.com/rig-count'
# html = urllib.request.urlopen(url)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
contents = soup.find_all('div', {'class': 'info_table'})
rows = []
for child in contents[0].children:
    row = []
    for td in child:
        if type(td) == bs4.element.Tag:
            data = re.sub(r'\s', '', re.findall(r'(<[/]?[a-zA-Z].*?>)([\s\S]*?)?(<[/]?[a-zA-Z].*?>)', str(td))[0][1])
            row.append(data)
    if row != []:
        rows.append(row)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
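As an aside, the same inner text can be taken straight from the tag without matching the HTML; a one-line sketch that is equivalent for simple cells:
data = re.sub(r'\s', '', td.get_text())  # same whitespace stripping as the regex above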
I applied a list comprehension technique.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://oilprice.com/rig-count'
req = requests.get(url).text
lst = []
soup = BeautifulSoup(req, 'lxml')
data = [x.get_text().replace('\t', '').replace('\n\n', ' ').replace('\n', '')
        for x in soup.select('div.info_table_holder div div.info_table_row')]
lst.extend(data)
df = pd.DataFrame(lst, columns=['Data'])
print(df)
Output:
0 4th Mar 2022 519 130 650 280
1 25th Feb 2022 522 127 650 290
2 18th Feb 2022 520 124 645 283 11.60
3 11th Feb 2022 516 118 635 275 11.60
4 4th Feb 2022 497 116 613 264 11.60
... ...
2007 4th Feb 2000 157 387 0 0 0 0
2008 28th Jan 2000 171 381 0 0 0 0
2009 21st Jan 2000 186 338 0 0 0 0
2010 14th Jan 2000 169 342 0 0 0 0
2011 7th Jan 2000 134 266 0 0 0 0
[2012 rows x 1 columns]
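If you need separate columns rather than one string per row, a rough sketch; the date pattern is an assumption about the page's text layout:
import re
# Split each row's text into a date ('4th Mar 2022') and the trailing numbers;
# the header row does not match the pattern and is filtered out.
pattern = re.compile(r'^(\d+\w{2} \w{3} \d{4})\s+(.*)$')
split_rows = [[m.group(1)] + m.group(2).split()
              for m in map(pattern.match, lst) if m]
df = pd.DataFrame(split_rows)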
Related
I have to convert text files into CSVs after processing the contents of the text file as a pandas DataFrame.
Below is the code I am using; out_txt is my input text file and out_csv is my output CSV file.
df = pd.read_csv(out_txt, sep=r'\s', header=None, on_bad_lines='warn', encoding="ANSI")
df = df.replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
df.to_csv(out_csv, header=None)
If "on_bad_lines = 'warn'" is not decalred the csv files are not created. But if i use this condition those bad lines are getting skipped (obviously) with the warning
Skipping line 6: Expected 8 fields in line 7, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I would like to retain these bad lines in the CSV. I have highlighted the bad lines detected in the below image (my input text file).
Below are the contents of the text file which is getting saved. In this content I would like to remove characters like #, &, (, ).
75062 220 8 6 110 220 250 <1
75063 260 5 2 584 878 950 <1
75064 810 <2 <2 456 598 3700 <1
75065 115 5 2 96 74 5000 <1
75066 976 <5 2 5 68 4200 <1
75067 22 210 4 348 140 4050 <1
75068 674 5 4 - 54 1130 3850 <1
75069 414 5 y) 446 6.6% 2350 <1
75070 458 <5 <2 548 82 3100 <1
75071 4050 <5 2 780 6430 3150 <1
75072 115 <7 <1 64 5.8% 4050 °#&4«x<i1
75073 456 <7 4 46 44 3900 <1
75074 376 <7 <2 348 3.8% 2150 <1
75075 378 <6 y) 30 40 2000 <1
I would split on \s later with str.split rather than in read_csv:
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
    .replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
    .squeeze().str.split(expand=True)
)
Another variant (skipping everything that comes in between the numbers):
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')[0]
    .str.findall(r"\b(\d+)\b")
    .str.join(' ')
    .str.split(expand=True)
)
Output:
print(df)
0 1 2 3 4 5 6 7
0 375020 1060 115 38 440 350 7800 1
1 375021 920 80 26 310 290 5000 1
2 375022 1240 110 28 460 430 5900 1
3 375023 830 150 80 650 860 6200 1
4 375024 185 175 96 800 1020 2400 1
5 375025 680 370 88 1700 1220 172 1
6 375026 550 290 72 2250 1460 835 2
7 375027 390 120 60 1620 1240 158 1
8 375028 630 180 76 820 1360 180 1
9 375029 460 280 66 380 790 3600 1
10 375030 660 260 62 11180 1040 300 1
11 375031 530 200 84 1360 1060 555 1
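To finish the asker's pipeline, the cleaned frame can then be written back out as in the question:
df.to_csv(out_csv, header=None)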
import requests
from bs4 import BeautifulSoup
URL = 'https://www.mohfw.gov.in/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
table_body = table.find_all('tr')
print(table_body)
This is my code, and I'm unable to extract the table data even after extracting the HTML content. What am I doing wrong?
The data in the table is stored inside HTML comment (<!-- ... -->). To parse it, you can use this example:
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.mohfw.gov.in/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
soup2 = BeautifulSoup(soup.table.find(text=lambda t: isinstance(t, Comment)), 'html.parser')
for row in soup2.select('tr'):
    tds = [td.get_text(strip=True) for td in row.select('td')]
    print('{:<5}{:<60}{:<10}{:<10}{:<10}'.format(*tds))
Prints:
1 Andaman and Nicobar Islands 47 133 0
2 Andhra Pradesh 18159 19393 492
3 Arunachal Pradesh 387 153 3
4 Assam 6818 12888 48
5 Bihar 7549 14018 197
6 Chandigarh 164 476 11
7 Chhattisgarh 1260 3451 21
8 Dadra and Nagar Haveli and Daman and Diu 179 371 2
9 Delhi 17407 97693 3545
10 Goa 1272 1817 19
11 Gujarat 11289 32103 2089
12 Haryana 5495 18185 322
13 Himachal Pradesh 382 984 11
14 Jammu and Kashmir 5488 6446 222
15 Jharkhand 2069 2513 42
16 Karnataka 30661 19729 1032
17 Kerala 5376 4862 37
18 Ladakh 176 970 1
19 Madhya Pradesh 5562 14127 689
20 Maharashtra 114947 158140 11194
21 Manipur 635 1129 0
22 Meghalaya 309 66 2
23 Mizoram 112 160 0
24 Nagaland 525 391 0
25 Odisha 4436 10877 79
26 Puducherry 774 947 22
27 Punjab 2587 6277 230
28 Rajasthan 6666 19970 538
29 Sikkim 155 88 0
30 Tamil Nadu 46717 107416 2236
31 Telangana 13327 27295 396
32 Tripura 676 1604 3
33 Uttarakhand 937 2995 50
34 Uttar Pradesh 15720 26675 1046
35 West Bengal 13679 21415 1023
Cases being reassigned to states 531
Total# 342473 635757 25602
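If you want the rows in a DataFrame rather than printed, a minimal sketch (the column names here are assumptions, not taken from the page):
import pandas as pd
rows = [[td.get_text(strip=True) for td in tr.select('td')]
        for tr in soup2.select('tr')]
df = pd.DataFrame([r for r in rows if len(r) == 5],
                  columns=['No', 'State', 'Active', 'Cured', 'Deaths'])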
I have two dataframes, as shown below.
df1:
Date t_factor plan plan_score
0 2020-02-01 5 NaN 0
1 2020-02-02 23 NaN 0
2 2020-02-03 14 start 0
3 2020-02-04 23 start 0
4 2020-02-05 23 start 0
5 2020-02-06 23 NaN 0
6 2020-02-07 30 foundation 0
7 2020-02-08 29 foundation 0
8 2020-02-09 100 NaN 0
9 2020-02-10 38 learn 0
10 2020-02-11 38 learn 0
11 2020-02-12 38 learn 0
12 2020-02-13 70 NaN 0
13 2020-02-14 70 practice 0
14 2020-02-15 38 NaN 0
15 2020-02-16 38 NaN 0
16 2020-02-17 70 exam 0
17 2020-02-18 70 exam 0
18 2020-02-19 38 exam 0
19 2020-02-20 38 NaN 0
20 2020-02-21 70 NaN 0
21 2020-02-22 70 test 0
22 2020-02-23 38 test 0
23 2020-02-24 38 NaN 0
24 2020-02-25 70 NaN 0
25 2020-02-26 70 NaN 0
26 2020-02-27 70 NaN 0
df2:
From to plan score
2020-02-03 2020-02-05 start 20
2020-02-07 2020-02-08 foundation 25
2020-02-10 2020-02-12 learn 10
2020-02-14 2020-02-16 practice 20
2020-02-15 2020-02-21 exam 30
2020-02-20 2020-02-23 test 10
Explanation:
I have loaded both dataframes and I would like to export them as one Excel file with Sheet1 = df1 and Sheet2 = df2.
I tried the below.
import os
import pandas as pd
from pandas import ExcelWriter

def save_xls(list_dfs, xls_path):
    with ExcelWriter(xls_path) as writer:
        for n, df in enumerate(list_dfs):
            df.to_excel(writer, 'sheet%s' % n)
        writer.save()
save_xls([df1, df2], os.getcwd())
And it is giving me the following error.
---------------------------------------------------------------------------
OptionError Traceback (most recent call last)
~/admvenv/lib/python3.7/site-packages/pandas/io/excel/_base.py in __new__(cls, path, engine, **kwargs)
630 try:
--> 631 engine = config.get_option(f"io.excel.{ext}.writer")
632 if engine == "auto":
~/admvenv/lib/python3.7/site-packages/pandas/_config/config.py in __call__(self, *args, **kwds)
230 def __call__(self, *args, **kwds):
--> 231 return self.__func__(*args, **kwds)
232
~/admvenv/lib/python3.7/site-packages/pandas/_config/config.py in _get_option(pat, silent)
101 def _get_option(pat, silent=False):
--> 102 key = _get_single_key(pat, silent)
103
~/admvenv/lib/python3.7/site-packages/pandas/_config/config.py in _get_single_key(pat, silent)
87 _warn_if_deprecated(pat)
---> 88 raise OptionError(f"No such keys(s): {repr(pat)}")
89 if len(keys) > 1:
OptionError: "No such keys(s): 'io.excel..writer'"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-16-80bc8a5d0d2f> in <module>
----> 1 save_xls([df1, df2], os.getcwd())
<ipython-input-15-0d1448e7aea8> in save_xls(list_dfs, xls_path)
1 def save_xls(list_dfs, xls_path):
----> 2 with ExcelWriter(xls_path) as writer:
3 for n, df in enumerate(list_dfs):
4 df.to_excel(writer,'sheet%s' % n)
5 writer.save()
~/admvenv/lib/python3.7/site-packages/pandas/io/excel/_base.py in __new__(cls, path, engine, **kwargs)
633 engine = _get_default_writer(ext)
634 except KeyError:
--> 635 raise ValueError(f"No engine for filetype: '{ext}'")
636 cls = get_writer(engine)
637
ValueError: No engine for filetype: ''
Your code is fine; you are just missing the Excel file name, and therefore the extension. That is what your error is saying.
Try
save_xls([df1, df2], os.getcwd() + '/name.xlsx')
or include a default excel file name in your function.
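For the second option, a minimal sketch; the default name output.xlsx is an assumption:
import os
import pandas as pd
from pandas import ExcelWriter

def save_xls(list_dfs, xls_path):
    # Fall back to a default file name when only a directory is given.
    if os.path.isdir(xls_path):
        xls_path = os.path.join(xls_path, 'output.xlsx')
    with ExcelWriter(xls_path) as writer:  # the context manager saves on exit
        for n, df in enumerate(list_dfs):
            df.to_excel(writer, sheet_name='sheet%s' % n)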
I am using sns.lineplot to show the confidence intervals in a plot.
sns.lineplot(x=threshold, y=mrl_array, err_style='band', ci=95)
plt.show()
I'm getting a plot which doesn't show the confidence interval.
What's the problem?
There is probably only a single observation per x value.
If there is only one observation per x value, then there is no confidence interval to plot.
Bootstrapping is performed per x value, but there needs to be more than one observation for this to take effect.
ci: Size of the confidence interval to draw when aggregating with an estimator. 'sd' means to draw the standard deviation of the data. Setting to None will skip bootstrapping.
Note the following examples from seaborn.lineplot.
This is also the case for sns.relplot with kind='line'.
The question specifies sns.lineplot, but this answer applies to any seaborn plot that displays a confidence interval, such as seaborn.barplot.
Data
import seaborn as sns
# load data
flights = sns.load_dataset("flights")
flights.head()
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
# only May flights
may_flights = flights.query("month == 'May'")
may_flights
year month passengers
4 1949 May 121
16 1950 May 125
28 1951 May 172
40 1952 May 183
52 1953 May 229
64 1954 May 234
76 1955 May 270
88 1956 May 318
100 1957 May 355
112 1958 May 363
124 1959 May 420
136 1960 May 472
# standard deviation for each year of May data
may_flights.set_index('year')[['passengers']].std(axis=1)
year
1949 NaN
1950 NaN
1951 NaN
1952 NaN
1953 NaN
1954 NaN
1955 NaN
1956 NaN
1957 NaN
1958 NaN
1959 NaN
1960 NaN
dtype: float64
# flights in wide format
flights_wide = flights.pivot(index="year", columns="month", values="passengers")
flights_wide
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
# standard deviation for each year
flights_wide.std(axis=1)
year
1949 13.720147
1950 19.070841
1951 18.438267
1952 22.966379
1953 28.466887
1954 34.924486
1955 42.140458
1956 47.861780
1957 57.890898
1958 64.530472
1959 69.830097
1960 77.737125
dtype: float64
Plots
may_flights has one observation per year, so no CI is shown.
sns.lineplot(data=may_flights, x="year", y="passengers")
sns.barplot(data=may_flights, x='year', y='passengers')
flights_wide shows there are twelve observations for each year, so a CI is shown when all of flights is plotted.
sns.lineplot(data=flights, x="year", y="passengers")
sns.barplot(data=flights, x='year', y='passengers')
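The 'sd' option from the docs quoted above works the same way; a sketch that draws the spread as the standard deviation instead of a bootstrapped CI (in seaborn >= 0.12 the spelling is errorbar='sd'):
sns.lineplot(data=flights, x="year", y="passengers", ci="sd")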
I have executed the following line in python:
teams = train.groupby('localTeam')['local_won'].sum()
print (teams)
and got this as output:
localTeam
AD Almeria 37
AD Ceuta 11
Alaves 263
Albacete 210
Alcorcon 79
Alcoyano 6
Algeciras 31
Alicante 2
Almeria 152
Alzira 7
Aragon 8
Athletic Club 502
Atletico Madrileno 93
Atletico Marbella 33
Atletico de Madrid 544
Aviles 13
Badajoz 78
Barakaldo 70
Barcelona 652
Barcelona Atletic 195
Betis 467
Bilbao Athletic 111
Burgos 126
Burgos CF 4
CD Malaga 201
Cadiz 289
Calvo Sotelo 50
Cartagena 49
Castellon 292
Castilla 222
...
Pontevedra 45
Racing de Ferrol 66
Racing de Santander 386
Rayo Vallecano 407
Real Burgos 58
Real Madrid 663
Real Oviedo 332
Real Sociedad 483
Real Union 8
Real Zaragoza 451
Recreativo de Huelva 310
Reus 7
Sabadell 231
Salamanca 283
Sant Andreu 86
Sestao 80
Sevilla 510
Sevilla Atletico 20
Sporting de Gijon 435
Tenerife 375
Terrassa 60
Toledo 61
UCAM Murcia 7
Universidad de Las Palmas 5
Valencia 518
Valladolid 449
Vecindario 7
Villarreal 273
Villarreal B 26
Xerez 168
Name: local_won, dtype: int64
Now I want to plot a horizontal bar chart with the values sorted from highest to lowest.
You may sort the Series with .sort_values(ascending=False) and plot it via .plot(kind="bar"):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"team":np.random.choice(list("ABCDE"), size=100),
"won":np.random.randint(0,2, size=100)})
teams = df.groupby('team')['won'].sum().sort_values(ascending=False)
teams.plot(kind="bar")
plt.show()
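Since the question asks for a horizontal bar chart, the same idea works with kind="barh"; sort ascending so the largest bar ends up on top:
teams.sort_values().plot(kind="barh")
plt.show()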