import requests
from bs4 import BeautifulSoup
URL = 'https://www.mohfw.gov.in/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
table_body = table.find_all('tr')
print(table_body)
This is my code, and I'm unable to extract the table data even after fetching the HTML content. What am I doing wrong?
The data in the table is stored inside an HTML comment (<!-- ... -->). To parse it, you can use this example:
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.mohfw.gov.in/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
soup2 = BeautifulSoup(soup.table.find(text=lambda t: isinstance(t, Comment)), 'html.parser')
for row in soup2.select('tr'):
    tds = [td.get_text(strip=True) for td in row.select('td')]
    print('{:<5}{:<60}{:<10}{:<10}{:<10}'.format(*tds))
Prints:
1 Andaman and Nicobar Islands 47 133 0
2 Andhra Pradesh 18159 19393 492
3 Arunachal Pradesh 387 153 3
4 Assam 6818 12888 48
5 Bihar 7549 14018 197
6 Chandigarh 164 476 11
7 Chhattisgarh 1260 3451 21
8 Dadra and Nagar Haveli and Daman and Diu 179 371 2
9 Delhi 17407 97693 3545
10 Goa 1272 1817 19
11 Gujarat 11289 32103 2089
12 Haryana 5495 18185 322
13 Himachal Pradesh 382 984 11
14 Jammu and Kashmir 5488 6446 222
15 Jharkhand 2069 2513 42
16 Karnataka 30661 19729 1032
17 Kerala 5376 4862 37
18 Ladakh 176 970 1
19 Madhya Pradesh 5562 14127 689
20 Maharashtra 114947 158140 11194
21 Manipur 635 1129 0
22 Meghalaya 309 66 2
23 Mizoram 112 160 0
24 Nagaland 525 391 0
25 Odisha 4436 10877 79
26 Puducherry 774 947 22
27 Punjab 2587 6277 230
28 Rajasthan 6666 19970 538
29 Sikkim 155 88 0
30 Tamil Nadu 46717 107416 2236
31 Telangana 13327 27295 396
32 Tripura 676 1604 3
33 Uttarakhand 937 2995 50
34 Uttar Pradesh 15720 26675 1046
35 West Bengal 13679 21415 1023
Cases being reassigned to states 531
Total# 342473 635757 25602
I'm relatively new to DataFrames in Python and I'm running into an issue I can't track down.
I have a DataFrame with the following column layout.
print(list(df.columns.values)) returns:
['iccid', 'system', 'last_updated', '01.01', '02.01', '03.01', '04.01', '05.01', '12.01', '18.01', '19.01', '20.01', '21.01', '22.01', '23.01', '24.01', '25.01', '26.01', '27.01', '28.01', '29.01', '30.01', '31.01']
Normally I should have a column for each day of a specific month; in the example above it's December 2022. Sometimes days are missing, which isn't an issue.
I first tried to collect the relevant columns by filtering them:
# Filter out the columns that are not related to the data
data_columns = [col for col in df.columns if '.' in col]
Now comes the issue:
Sometimes the "system" column can also be empty, so I need to put the iccid into the system value:
df.loc[df['system'] == 'Nicht benannt!', 'system'] = df.loc[df['system'] == 'Nicht benannt!', 'iccid'].iloc[0]
df.loc[df['system'] == '', 'system'] = df.loc[df['system'] == '', 'iccid'].iloc[0]
grouped = df.groupby('system').sum(numeric_only=False)
Then I tried to create the needed 'data_usage' column:
grouped['data_usage'] = grouped[data_columns[-1]]
grouped.reset_index(inplace=True)
With that line I should normally only get the value of the last date column in the DataFrame (which was a workaround that also didn't work as expected).
What I'm actually trying to get is the sum of all columns that contain a date in their name, added as a new column named data_usage.
The issue I'm having is that systems which don't have an initial system value end up with a data_usage of around 120000 (the values represent megabytes used), while the sqlite file shows that the system only used about 9000 MB in that particular month.
For example:
I have this row in the sqlite file:
iccid                system          last_updated  06.02  08.02
8931080320014183316  Nicht benannt!  2023-02-06    1196   1391
and in the DataFrame I get the following result:
8931080320014183316    48129.0
I can't find the issue and would be very happy if someone could point me in the right direction.
Here is some example data, as requested:
iccid                system          last_updated  01.12  02.12  03.12  04.12  05.12  06.12  07.12  08.12  09.12  10.12  11.12  12.12  13.12  14.12  15.12  16.12  17.12  18.12  19.12  20.12  21.12  22.12  23.12  28.12  29.12  30.12  31.12
8945020184547971966  U-O-51          2022-12-01    2  32  179  208  320  509  567  642  675  863  1033  1055  1174  2226  2277  2320  2466  2647  2679  2713  2759  2790  2819  2997  3023  3058  3088
8945020855461807911  L-O-382         2022-12-01    1  26  54  250  385  416  456  481  506  529  679  772  802  832  858  915  940  1019  1117  1141  1169  1193  1217  1419  1439  1461  1483
8945020855461809750  C-O-27          2022-12-01    1  123  158  189  225  251  456  489  768  800  800  800  800  800  800  2362  2386  2847  2925  2960  2997  3089  3116  3448  3469  3543  3586
8931080019070958450  L-O-123         2022-12-02    0  21  76  313  479  594  700  810  874  1181  1955  2447  2527  2640  2897  3008  3215  3412  3554  3639  3698  3782  3850  4741  4825  4925  5087
8931080453114183282  Nicht benannt!  2022-12-02    0  6  45  81  95  98  101  102  102  102  102  102  102  103  121  121  121  121  149  164  193  194  194  194  194  194  194
8931080894314183290  C-O-16 N        2022-12-02    0  43  145  252  386  452  532  862  938  1201  1552  1713  1802  1855  2822  3113  3185  3472  3527  3745  3805  3880  3938  4221  4265  4310  4373
8931080465814183308  L-O-83          2022-12-02    0  61  169  275  333  399  468  858  1094  1239  1605  1700  1928  2029  3031  4186  4333  4365  4628  4782  4842  4975  5265  5954  5954  5954  5954
8931082343214183316  Nicht benannt!  2022-12-02    0  52  182  506  602  719  948  1129  1314  1646  1912  1912  1912  1912  2791  3797  3944  4339  4510  4772  4832  5613  5688  6151  6482  6620  6848
8931087891314183324  L-O-119         2022-12-02    0  19  114  239  453  573  685  800  1247  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1423  2722  3563  4132  4385
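To make the goal concrete, here is a minimal sketch (toy data, not my real code; only two of the date columns are shown) of the data_usage column I am trying to build, as a plain row-wise sum over the date columns:
import pandas as pd

# toy frame with the same layout as the example data above
df = pd.DataFrame({
    'iccid': ['8945020184547971966', '8931080453114183282'],
    'system': ['U-O-51', 'Nicht benannt!'],
    'last_updated': ['2022-12-01', '2022-12-02'],
    '01.12': [2, 0],
    '31.12': [3088, 194],
})

# date columns are the ones containing a dot, as in my filter above
data_columns = [col for col in df.columns if '.' in col]

# what I expect: data_usage built from the date columns of each row
df['data_usage'] = df[data_columns].sum(axis=1)
print(df[['iccid', 'system', 'data_usage']])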
I'm trying to crawl a small table of data from https://oilprice.com/rig-count. My attempt is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://oilprice.com/rig-count'
# html = urllib.request.urlopen(url)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
contents = soup.find_all('div', {'class': 'info_table'})
print(contents[0].children)
rows = []
for child in contents[0].children:
    row = []
    for td in child:
        print(td)  # does not work after this line
        try:
            row.append(td.text.replace('\n', ''))
        except:
            continue
    if len(row) > 0:
        rows.append(row)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
Since contents is quite a large chunk of HTML, I don't know how to correctly extract the data and save it as a DataFrame. Could someone share an answer or give me some tips? Thanks.
You can use:
table = soup.find('div', {'class': 'info_table'})
data = [[cell.text.strip() for cell in row.find_all('div')]
for row in table.find_all('div', recursive=False)]
df = pd.DataFrame(data[1:], columns=data[0])
Output:
>>> df
Date Oil Rigs Gas Rigs Total Rigs Frac Spread Production Million Bpd
0 4th Mar 2022 519 130 650 280
1 25th Feb 2022 522 127 650 290
2 18th Feb 2022 520 124 645 283 11.60
3 11th Feb 2022 516 118 635 275 11.60
4 4th Feb 2022 497 116 613 264 11.60
.. ... ... ... ... ... ...
358 26th Dec 2014 1499 340 1840 367 9.12
359 19th Dec 2014 1536 338 1875 415 9.13
360 12th Dec 2014 1546 346 1893 411 9.14
361 5th Dec 2014 1575 344 1920 428 9.12
362 21st Nov 2014 1574 355 1929 452 9.08
[363 rows x 6 columns]
Update
A lazy solution to let Pandas guess the datatype is to convert your data to csv:
import io
table = soup.find('div', {'class': 'info_table'})
data = ['\t'.join(cell.text.strip() for cell in row.find_all('div'))
for row in table.find_all('div', recursive=False)]
buf = io.StringIO()
buf.writelines('\n'.join(data))
buf.seek(0)
df = pd.read_csv(buf, sep='\t', parse_dates=['Date'])
Output:
>>> df
Date Oil Rigs Gas Rigs Total Rigs Frac Spread Production Million Bpd
0 2022-03-04 519 130 650 280 NaN
1 2022-02-25 522 127 650 290 NaN
2 2022-02-18 520 124 645 283 11.60
3 2022-02-11 516 118 635 275 11.60
4 2022-02-04 497 116 613 264 11.60
.. ... ... ... ... ... ...
358 2014-12-26 1499 340 1840 367 9.12
359 2014-12-19 1536 338 1875 415 9.13
360 2014-12-12 1546 346 1893 411 9.14
361 2014-12-05 1575 344 1920 428 9.12
362 2014-11-21 1574 355 1929 452 9.08
[363 rows x 6 columns]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 363 non-null datetime64[ns]
1 Oil Rigs 363 non-null int64
2 Gas Rigs 363 non-null int64
3 Total Rigs 363 non-null int64
4 Frac Spread 363 non-null int64
5 Production Million Bpd 360 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4)
memory usage: 17.1 KB
The best answer should correspond to the smallest change; you only need re for a reasonable match:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import bs4
import re
url = 'https://oilprice.com/rig-count'
# html = urllib.request.urlopen(url)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
contents = soup.find_all('div', {'class': 'info_table'})
rows = []
for child in contents[0].children:
    row = []
    for td in child:
        if type(td) == bs4.element.Tag:
            data = re.sub('\s', '', re.findall('(<[/]?[a-zA-Z].*?>)([\s\S]*?)?(<[/]?[a-zA-Z].*?>)', str(td))[0][1])
            row.append(data)
    if row != []:
        rows.append(row)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
I apply a list comprehension technique.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://oilprice.com/rig-count'
req = requests.get(url).text
lst = []
soup = BeautifulSoup(req, 'lxml')
data = [x.get_text().replace('\t', '').replace('\n\n',' ').replace('\n','') for x in soup.select('div.info_table_holder div div.info_table_row')]
lst.extend(data)
df = pd.DataFrame(lst, columns=['Data'])
print(df)
Output:
0 4th Mar 2022 519 130 650 280
1 25th Feb 2022 522 127 650 290
2 18th Feb 2022 520 124 645 283 11.60
3 11th Feb 2022 516 118 635 275 11.60
4 4th Feb 2022 497 116 613 264 11.60
... ...
2007 4th Feb 2000 157 387 0 0 0 0
2008 28th Jan 2000 171 381 0 0 0 0
2009 21st Jan 2000 186 338 0 0 0 0
2010 14th Jan 2000 169 342 0 0 0 0
2011 7th Jan 2000 134 266 0 0 0 0
[2012 rows x 1 columns]
I have a df as shown below.
The data is like this:
Date y
0 2020-06-14 127
1 2020-06-15 216
2 2020-06-16 4
3 2020-06-17 90
4 2020-06-18 82
5 2020-06-19 70
6 2020-06-20 59
7 2020-06-21 48
8 2020-06-22 23
9 2020-06-23 25
10 2020-06-24 24
11 2020-06-25 22
12 2020-06-26 19
13 2020-06-27 10
14 2020-06-28 18
15 2020-06-29 157
16 2020-06-30 16
17 2020-07-01 14
18 2020-07-02 343
The code to create the data frame.
# Create a dummy dataframe
import pandas as pd
import numpy as np
y0 = [127,216,4,90, 82,70,59,48,23,25,24,22,19,10,18,157,16,14,343]
def initial_forecast(data):
    data['y'] = y0
    return data
# Initial date dataframe
df_dummy = pd.DataFrame({'Date': pd.date_range('2020-06-14', periods=19, freq='1D')})
# Dates
start_date = df_dummy.Date.iloc[1]
print(start_date)
end_date = df_dummy.Date.iloc[17]
print(end_date)
# Adding y0 in the dataframe
df_dummy = initial_forecast(df_dummy)
df_dummy
From the above I would like to interpolate the data for a particular date range.
I would like to interpolate (linearly) between 2020-06-17 and 2020-06-27.
That is, from 2020-06-17 to 2020-06-27 the 'y' value changes from 90 to 10 in 10 steps, so on average each step reduces it by 8:
(90 - 10) / 10 (number of steps) = 8 per step.
The expected output:
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
Note: In the remaining date range the y_new value should be the same as the y value.
I tried the code below, but it is not giving the desired output:
# Function
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'y_new'] = np.nan
    df['y_new'] = df['y'].interpolate().round()
    return df
df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
With some tweaks to your function it works: np.where to create the new column, removing the = from your conditionals, and casting to int as per your expected output.
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df['y_new'] = np.where((df['Date'] > start_date) & (df['Date'] < end_date), np.nan, df['y'])
    df['y_new'] = df['y_new'].interpolate().round().astype(int)
    return df
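Calling it the same way as in your question then prints the table below:
df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
print(df1)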
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
Actually, I have a dataframe that contains some states, and I have a list of a few of their cities. I want to add those cities to that dataset and group each city with its state name.
E.g.
# I have entered some random city names for example purposes
city = ['Akola','Aurangabad','Dhule','Jalgaon','Mumbai','Mumbai Suburban','Nagpur']
State Cases Active Recovered Death
0 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123
1 Andhra Pradesh 4223 143 1613 67 2539 73 71 3
2 Karnataka 4320 257 2653 157 1610 96 57 4
3 Goa 166 87 109 87 57 0
4 Tamil Nadu 27256 1384 12134 786 14902 586 220 12
and I want to add those cities to the data frame in a new column, like:
State Cases Active Recovered Death |CITY
0 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |AKOLA
1 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |DHULE
2 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |MUMBAI
3 Andhra Pradesh 4223 143 1613 67 2539 73 71 3 |JALGAON
4 Andhra Pradesh 4223 143 1613 67 2539 73 71 3 |NAGPUR
5 Karnataka 4320 257 2653 157 1610 96 57 4
6 Goa 166 87 109 87 57 0
7 Tamil Nadu 27256 1384 12134 786 14902 586 220 12 |AURANGABAD
8 Tamil Nadu 27256 1384 12134 786 14902 586 220 12 |MUMBAI SUBURBAN
# data is wrong, so please focus on the format
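A possible approach (a minimal sketch, assuming a city-to-state mapping is available; the city_state dict and the tiny stand-in frame below are hypothetical): turn the mapping into its own frame and left-merge it onto the existing dataframe on State, so each city gets a row with its state's numbers and states without a listed city keep a single row.
import pandas as pd

# hypothetical city -> state mapping; you have to provide this yourself
city_state = {
    'Akola': 'Maharashtra',
    'Aurangabad': 'Maharashtra',
    'Dhule': 'Maharashtra',
    'Mumbai': 'Maharashtra',
    'Nagpur': 'Maharashtra',
}

# minimal stand-in for the existing dataframe (only two columns for brevity)
df = pd.DataFrame({'State': ['Maharashtra', 'Goa'],
                   'Cases': [77793, 166]})

# one row per city, with the state it belongs to
cities = pd.DataFrame({'State': list(city_state.values()),
                       'CITY': list(city_state.keys())})

# left merge: every city gets its state's row of numbers;
# states without a listed city keep one row with CITY = NaN
result = df.merge(cities, on='State', how='left')
print(result)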
I need to find the max of two columns (p_1_logreg, p_2_logreg), where the comparison should be limited to blocks of 14 rows at a time.
My csv file
I tried to slice my index into:
int1_str1_str2_int2_str3_int4
The max should be found between rows where int1, str1, str2, int2 and str3 are fixed, and only int4 changes (from index 0 to index 13, and so on).
I tried to fix one element at a time and use groupby, but I couldn't iterate over the int4 value only.
Here is my code to find the max for the p_1_label column, but the result is not what I am looking for:
max_1_row=raw_prob.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[1])['p_1_'+label].idxmax()]
max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[3])['p_1_'+label].idxmax()]
max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[5])['p_1_'+label].idxmax()]
Any ideas?
I think you need DataFrameGroupBy.idxmax, grouping by id with its last _<number> suffix replaced by an empty string, and then selecting rows by loc:
df = pd.read_csv('myProb.csv', index_col=[0])
idx = df.drop('id', 1).groupby(df['id'].str.replace('_\d+$', '')).idxmax()
print (idx.head(15))
p_0_logreg p_1_logreg p_2_logreg
id
6_PanaCleanerJune_sub_12_ICA 2 9 6
6_PanaCleanerJune_sub_13_ICA 17 19 23
6_PanaCleanerJune_sub_14_ICA 34 37 33
6_PanaCleanerJune_sub_15_ICA 52 51 43
6_PanaCleanerJune_sub_17_ICA 66 67 69
6_PanaCleanerJune_sub_18_ICA 82 79 76
6_PanaCleanerJune_sub_19_ICA 89 87 90
6_PanaCleanerJune_sub_20_ICA 98 103 104
6_PanaCleanerJune_sub_21_ICA 114 117 112
6_PanaCleanerJune_sub_22_ICA 129 133 127
6_PanaCleanerJune_sub_23_ICA 145 146 143
6_PanaCleanerJune_sub_24_ICA 155 166 161
6_PanaCleanerJune_sub_25_ICA 176 173 174
6_PanaCleanerJune_sub_26_ICA 186 191 189
6_PanaCleanerJune_sub_27_ICA 202 203 209
df1 = df.loc[idx['p_1_logreg']]
print (df1.head(15))
id p_0_logreg p_1_logreg p_2_logreg
9 6_PanaCleanerJune_sub_12_ICA_10 0.013452 0.985195 0.001353
19 6_PanaCleanerJune_sub_13_ICA_6 0.051184 0.948816 0.000000
37 6_PanaCleanerJune_sub_14_ICA_10 0.013758 0.979351 0.006890
51 6_PanaCleanerJune_sub_15_ICA_10 0.076056 0.923944 0.000000
67 6_PanaCleanerJune_sub_17_ICA_12 0.051060 0.947660 0.001280
79 6_PanaCleanerJune_sub_18_ICA_10 0.051184 0.948816 0.000000
87 6_PanaCleanerJune_sub_19_ICA_4 0.078162 0.917751 0.004087
103 6_PanaCleanerJune_sub_20_ICA_6 0.076400 0.921263 0.002337
117 6_PanaCleanerJune_sub_21_ICA_6 0.155002 0.791753 0.053245
133 6_PanaCleanerJune_sub_22_ICA_8 0.000000 0.998623 0.001377
146 6_PanaCleanerJune_sub_23_ICA_7 0.017549 0.973995 0.008457
166 6_PanaCleanerJune_sub_24_ICA_13 0.025215 0.974785 0.000000
173 6_PanaCleanerJune_sub_25_ICA_6 0.025656 0.960220 0.014124
191 6_PanaCleanerJune_sub_26_ICA_10 0.098872 0.895526 0.005602
203 6_PanaCleanerJune_sub_27_ICA_8 0.066493 0.932470 0.001037
df2 = df.loc[idx['p_2_logreg']]
print (df2.head(15))
id p_0_logreg p_1_logreg p_2_logreg
6 6_PanaCleanerJune_sub_12_ICA_7 0.000000 0.000351 0.999649
23 6_PanaCleanerJune_sub_13_ICA_10 0.000000 0.000351 0.999649
33 6_PanaCleanerJune_sub_14_ICA_6 0.080748 0.000352 0.918900
43 6_PanaCleanerJune_sub_15_ICA_2 0.017643 0.000360 0.981996
69 6_PanaCleanerJune_sub_17_ICA_14 0.882449 0.000290 0.117261
76 6_PanaCleanerJune_sub_18_ICA_7 0.010929 0.000360 0.988711
90 6_PanaCleanerJune_sub_19_ICA_7 0.010929 0.000351 0.988720
104 6_PanaCleanerJune_sub_20_ICA_7 0.006714 0.000360 0.992925
112 6_PanaCleanerJune_sub_21_ICA_1 0.869393 0.000339 0.130269
127 6_PanaCleanerJune_sub_22_ICA_2 0.000000 0.000351 0.999649
143 6_PanaCleanerJune_sub_23_ICA_4 0.017218 0.000360 0.982421
161 6_PanaCleanerJune_sub_24_ICA_8 0.369685 0.000712 0.629603
174 6_PanaCleanerJune_sub_25_ICA_7 0.307056 0.000496 0.692448
189 6_PanaCleanerJune_sub_26_ICA_8 0.850195 0.000368 0.149437
209 6_PanaCleanerJune_sub_27_ICA_14 0.000000 0.000351 0.999649
Detail:
print (df['id'].str.replace('_\d+$', '').head(15))
0 6_PanaCleanerJune_sub_12_ICA
1 6_PanaCleanerJune_sub_12_ICA
2 6_PanaCleanerJune_sub_12_ICA
3 6_PanaCleanerJune_sub_12_ICA
4 6_PanaCleanerJune_sub_12_ICA
5 6_PanaCleanerJune_sub_12_ICA
6 6_PanaCleanerJune_sub_12_ICA
7 6_PanaCleanerJune_sub_12_ICA
8 6_PanaCleanerJune_sub_12_ICA
9 6_PanaCleanerJune_sub_12_ICA
10 6_PanaCleanerJune_sub_12_ICA
11 6_PanaCleanerJune_sub_12_ICA
12 6_PanaCleanerJune_sub_12_ICA
13 6_PanaCleanerJune_sub_12_ICA
14 6_PanaCleanerJune_sub_13_ICA
Name: id, dtype: object