Scraping issue: BeautifulSoup; IndexError: list index out of range


I'm trying to scrape http://www.wtatennis.com/stats, but I run into an error once the complete code is in place. Perhaps I've been staring at this too long, but I can't see the error and therefore can't resolve it.
import requests, re
from bs4 import BeautifulSoup
r=requests.get("http://www.wtatennis.com/stats")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all("div",{"class":"view-content"})
#find the results, names, scores
for classes in all:
    position = classes.find_all('td',{"class":"views-field views-field-counter views-align-center"})[0].text
    wta_name = classes.find_all('td',{"class":"views-field views-field-field-lastname views-align-left"})[0].text
    current_ranking = classes.find_all('td',{"class":"views-field views-field-field-current-rank views-align-center"})[0].text
    match_count = classes.find_all('td',{"class":"views-field views-field-field-matchcount views-align-center"})[0].text
    aces_count = classes.find_all('td',{"class":"views-field views-field-field-aces active views-align-center"})[0].text
    df_count = classes.find_all('td',{"class":"views-field views-field-field-double-faults views-align-center"})[0].text
    firstserver_perc = classes.find_all('td',{"class":"views-field views-field-field-first-serve-percent views-align-center"})[0].text
    firstservewon_perc = classes.find_all('td',{"class":"views-field views-field-field-first-serve-won-percent views-align-center"})[0].text
    secondservewon_perc = classes.find_all('td',{"class":"views-field views-field-field-second-serve-won-percent views-align-center"})[0].text
    print (position)
    print (wta_name)
    print (current_ranking)
    print (match_count)
    print (aces_count)
    print (df_count)
    print (firstserver_perc)
    print (firstservewon_perc)
    print (secondservewon_perc)
Result
1
Goerges, Julia (GER)
12
7
61
25
59.8 %
76.0 %
52.4 %
IndexError Traceback (most recent call last)
<ipython-input-6-fabdb2904a0b> in <module>()
18 current_ranking = classes.find_all('td',{"class":"views-field views-field-field-current-rank views-align-center"})[0].text
19 match_count = classes.find_all('td',{"class":"views-field views-field-field-matchcount views-align-center"})[0].text
---> 20 aces_count = classes.find_all('td',{"class":"views-field views-field-field-aces active views-align-center"})[0].text
21 df_count = classes.find_all('td',{"class":"views-field views-field-field-double-faults views-align-center"})[0].text
22 firstserver_perc = classes.find_all('td',{"class":"views-field views-field-field-first-serve-percent views-align-center"})[0].text
IndexError: list index out of range

Here are the issues I found with your code:
The line all=soup.find_all("div",{"class":"view-content"}) uses find_all(), but the page contains multiple div tags with the class view-content. The loop therefore also visits divs that do not contain every td you look up; there find_all(...) returns an empty list and [0] raises the IndexError. I changed this line to use find() instead of find_all(), which returns only the first matching div.
After fixing the issue described in the previous bullet point, you will still have a problem in the printing section: you get only the first record of the table you're trying to parse, not all of the data.
Also note that I removed the re import from your code, as it wasn't needed.
Here is my attempt at your problem:
import requests
from bs4 import BeautifulSoup
c = requests.get("http://www.wtatennis.com/stats").text
soup = BeautifulSoup(c, "html.parser")
c = soup.find("div", {"class": "view-content"})
position = c.find_all('td', {"class": "views-field views-field-counter views-align-center"})
wta_name = c.find_all('td', {"class": "views-field views-field-field-lastname views-align-left"})
current_ranking = c.find_all('td', {"class": "views-field views-field-field-current-rank views-align-center"})
match_count = c.find_all('td', {"class": "views-field views-field-field-matchcount views-align-center"})
aces_count = c.find_all('td', {"class": "views-field views-field-field-aces active views-align-center"})
df_count = c.find_all('td', {"class": "views-field views-field-field-double-faults views-align-center"})
firstserver_perc = c.find_all('td', {"class": "views-field views-field-field-first-serve-percent views-align-center"})
firstservewon_perc = c.find_all('td', {"class": "views-field views-field-field-first-serve-won-percent views-align-center"})
secondservewon_perc = c.find_all('td', {"class": "views-field views-field-field-second-serve-won-percent views-align-center"})
for i in range(0, len(position)):
    print(position[i].text)
    print(wta_name[i].text)
    print(current_ranking[i].text)
    print(match_count[i].text)
    print(aces_count[i].text)
    print(df_count[i].text)
    print(firstserver_perc[i].text)
    print(firstservewon_perc[i].text)
    print(secondservewon_perc[i].text)
    print("***************")
Output:
1
Goerges, Julia (GER)
12
7
61
25
59.8 %
76.0 %
52.4 %
***************
2
Svitolina, Elina (UKR)
3
10
60
13
60.1 %
72.2 %
47.5 %
***************
3
Wozniacki, Caroline (DEN)
1
12
58
37
64.3 %
71.9 %
50.3 %
***************
4
Pliskova, Karolina (CZE)
5
8
53
19
63.9 %
71.6 %
47.7 %
***************
5
Barty, Ashleigh (AUS)
16
9
50
27
61.0 %
67.7 %
53.6 %
***************
6
Mertens, Elise (BEL)
20
10
43
35
65.8 %
69.1 %
46.9 %
***************
7
Siniakova, Katerina (CZE)
52
8
39
31
61.2 %
65.5 %
46.5 %
***************
8
Osaka, Naomi (JPN)
53
5
38
11
62.5 %
69.4 %
44.8 %
***************
9
Pliskova, Kristyna (CZE)
78
5
38
17
59.3 %
70.4 %
41.3 %
***************
10
Keys, Madison (USA)
14
6
37
17
61.1 %
73.9 %
46.8 %
***************
11
Bertens, Kiki (NED)
28
6
35
26
61.2 %
70.1 %
39.6 %
***************
12
Sevastova, Anastasija (LAT)
15
7
34
11
60.2 %
71.4 %
47.7 %
***************
13
Konta, Johanna (GBR)
11
6
31
22
65.6 %
66.1 %
50.0 %
***************
14
Halep, Simona (ROU)
2
12
30
27
66.1 %
68.2 %
50.3 %
***************
15
Kontaveit, Anett (EST)
27
6
29
32
63.9 %
67.3 %
48.3 %
***************
16
Strycova, Barbora (CZE)
24
10
29
25
65.6 %
64.4 %
46.7 %
***************
17
Giorgi, Camila (ITA)
63
7
26
27
59.3 %
65.8 %
48.2 %
***************
18
Sharapova, Maria (RUS)
41
7
26
36
60.0 %
70.0 %
48.0 %
***************
19
Kanepi, Kaia (EST)
66
6
25
24
56.8 %
64.3 %
50.3 %
***************
20
Watson, Heather (GBR)
75
6
25
17
62.2 %
65.0 %
50.7 %
***************
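
As a rough illustration of why the original loop failed, and of one way to guard against it, here is a minimal sketch that walks the table row by row. The HTML is a made-up, heavily simplified stand-in for the real page (the real markup may differ), and the short class names passed to class_ are an assumption based on the longer class strings in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two "view-content" divs,
# only the first of which contains the stats table.
html = """
<div class="view-content">
  <table>
    <tr><td class="views-field views-field-counter">1</td>
        <td class="views-field views-field-field-lastname">Goerges, Julia (GER)</td></tr>
    <tr><td class="views-field views-field-counter">2</td>
        <td class="views-field views-field-field-lastname">Svitolina, Elina (UKR)</td></tr>
  </table>
</div>
<div class="view-content"><p>No table here</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for div in soup.find_all("div", class_="view-content"):
    for tr in div.find_all("tr"):
        position = tr.find("td", class_="views-field-counter")
        name = tr.find("td", class_="views-field-field-lastname")
        # find() returns None when a cell is missing, so we can skip
        # the row instead of indexing into an empty list.
        if position is None or name is None:
            continue
        rows.append((position.text.strip(), name.text.strip()))

print(rows)
```

Iterating over the rows keeps the cells of one player together, so a div without the expected cells is skipped rather than raising IndexError.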

Related

Using python extract from list of string

I have a list of strings from which I want to extract the channel number with the lowest SIG value.
Below is my list, named "outlines":
Flags: A - active, P - privacy, R - routeros-network, N - nstreme, T - tdma,
W - wds, B - bridge
ADDRESS SSID CHANNEL SIG NF SNR RADIO-... ROUTER...
AP 20:B5:C6:F1:B6:B0 INAPHDBD... 5815/2... -78 -60 -18
52:4F:54:43:4F:44 P2MPRWXX... 5835/20/a -80 -102 22
A 52:41:44:57:49:4E 5835/20/a -86 -102 16
APR B 4C:5E:0C:BF:EE:6E iBw 5865/2... -75 -102 27 4C5E0C... 6.38.5
A 00:19:70:2C:FD:82 TR6SL5 5835/20/a -86 -102 16
20:B5:C6:F0:E6:F7 5855/20/a -58 -103 45
Below is the code I tried, but I don't know how to iterate when a line starts with a blank or some other arbitrary value.
RSSI = ''
Myindex = [l for l, elem in enumerate (outlines) if 'AP' in elem]
#Myindex = [elem for elem in outlines if 'AP' in elem]
#RSSI =int('nan')
for ind in Myindex:
    newchannel = " ".join(outlines[ind].split()).split(' ')[3]
    newRSSI = " ".join(outlines[ind].split()).split(' ')[4]
    if RSSI < newRSSI or RSSI == '':
        RSSI =newRSSI
        channel = newchannel.split('/')[0]
print(channel)

You can get the column positions you need to parse in each line from the header line that contains 'SIG' and 'CHANNEL'.
You can loop over all lines, slice out those columns, convert SIG to a number, and store it in a dictionary that maps each SIG to a list of channels (in case two entries share the same, lowest SIG).
You can then continue working with the dictionary data:
t = """Flags: A - active, P - privacy, R - routeros-network, N - nstreme, T - tdma,
W - wds, B - bridge
ADDRESS SSID CHANNEL SIG NF SNR RADIO-... ROUTER...
AP 20:B5:C6:F1:B6:B0 INAPHDBD... 5815/2... -78 -60 -18
52:4F:54:43:4F:44 P2MPRWXX... 5835/20/a -80 -102 22
A 52:41:44:57:49:4E 5835/20/a -86 -102 16
APR B 4C:5E:0C:BF:EE:6E iBw 5865/2... -75 -102 27 4C5E0C... 6.38.5
A 00:19:70:2C:FD:82 TR6SL5 5835/20/a -86 -102 16
20:B5:C6:F0:E6:F7 5855/20/a -58 -103 45
"""
sigstart = 0
channelstart = 0
data = {}
for line in t.split("\n"):
    # find position of SIG and CHANNEL, ignore everything until we have that
    if sigstart == 0:
        try:
            sigstart = line.index("SIG")
            channelstart = line.index("CHANNEL")
        except ValueError:
            pass
        continue
    # parse line if long enough and has AP in it
    if len(line)>= sigstart+3 and "AP" in line:
        sig = int(line[sigstart:sigstart+3].strip())
        channel = line[channelstart:sigstart].strip()
        # add to dictionary - could use defaultdict if performance is an issue
        # but for the example this works just fine
        data.setdefault(sig,[]).append(channel)
# output all parsed data, sorted:
for sig,channel in sorted(data.items(), key = lambda x:x[0]):
    print(sig, ', '.join(c.split("/")[0] for c in channel))
Output:
-78 5815 # lowest, only 1 item
-75 5865
The very lowest would be sorted(data.items(), key = lambda x:x[0])[0].
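
As a small follow-up, min() over the dictionary keys gives the lowest SIG directly, without sorting everything. This is only a sketch: it assumes data has the same shape as the dictionary built by the loop, and the two entries are copied by hand from the output shown here.

```python
# Assumed shape: SIG (int) -> list of raw CHANNEL strings
data = {-78: ["5815/2..."], -75: ["5865/2..."]}

lowest_sig = min(data)  # smallest key, i.e. the weakest signal
channels = [c.split("/")[0] for c in data[lowest_sig]]

print(lowest_sig, ", ".join(channels))
```

Because each value is a list, several channels that share the same lowest SIG are all reported.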

I figured out how to extract the channel with the lowest SIG value using a regex. Regexes are very powerful, especially for messy data such as a list of strings.
I used regex101.com, which makes life easier. Below is the code I tried.
import re
outlines = """
Flags: A - active, P - privacy, R - routeros-network, N - nstreme, T - tdma,
W - wds, B - bridge
ADDRESS SSID CHANNEL SIG NF SNR RADIO-... ROUTER...
AP 20:b5:C6:F1:B6:B0 INAPHDBD... 5815/2... -78 -60 -18
52:4F:54:43:4F:44 P2MPRWXX... 5835/20/a -80 -102 22
A 52:41:44:57:49:4E 5835/20/a -86 -102 16
APR B 4C:5E:0C:BF:EE:6E iBw 5865/2... -75 -102 27 4C5E0C... 6.38.5
A 00:19:70:2C:FD:82 TR6SL5 5835/20/a -86 -102 16
20:b5:C6:F0:E6:F7 5855/20/a -58 -103 45
"""
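regex = r".* ([0-9]+)\/[^ ]* ([-+0-9]*)[ ,]*"
oldsig = None
if __name__ == '__main__':
    # outlines is one string here, so iterate over its lines, not its characters
    for line in outlines.splitlines():
        s = re.search(regex, line)
        if s:
            channel = s.group(1)
            # compare SIG values as numbers, not as strings
            sig = int(s.group(2))
            # keep the channel with the lowest SIG seen so far
            if oldsig is None or sig < oldsig:
                oldsig = sig
                nchannel1 = channel
    print(nchannel1)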
Which gives output
5835

cv2.imshow() starts without popping up a window when called in threading

cv2.imshow() does not pop up a window when called from a thread. A python3 icon appears in the Dock, but no window and no image. I also need half-duplex communication, so I cannot simply remove the threading. How can I correctly show the image from inside a thread? Any response is appreciated.
Background: macOS Catalina, python==3.5, opencv==3.4.2
import cv2
import threading
def run():
    # start the camera
    video_reader = cv2.VideoCapture(0)
    while True:
        # collect image
        ret_val, image = video_reader.read()
        cv2.imshow('image', image)
        if cv2.waitKey(1) == 27:
            break # esc to quit
t = threading.Thread(target=run, name="run")
t.start()
Some warning messages that may help:
python3[86286:1964642] WARNING: NSWindow drag regions should only be invalidated on the Main Thread! This will throw an exception in the future. Called from (
0 AppKit 0x00007fff2fc177f0 -[NSWindow(NSWindow_Theme) _postWindowNeedsToResetDragMarginsUnlessPostingDisabled] + 371
1 AppKit 0x00007fff2fc14ce1 -[NSWindow _initContent:styleMask:backing:defer:contentView:] + 1416
2 AppKit 0x00007fff2fc14753 -[NSWindow initWithContentRect:styleMask:backing:defer:] + 42
3 AppKit 0x00007fff2fea8368 -[NSWindow initWithContentRect:styleMask:backing:defer:screen:] + 52
4 libopencv_highgui.3.4.2.dylib 0x000000010327bee7 cvNamedWindow + 727
5 libopencv_highgui.3.4.2.dylib 0x000000010327b7a3 cvShowImage + 195
6 libopencv_highgui.3.4.2.dylib 0x000000010327860d _ZN2cv6imshowERKNS_6StringERKNS_11_InputArrayE + 477
7 cv2.cpython-35m-darwin.so 0x00000001014f8393 _ZL18pyopencv_cv_imshowP7_objectS0_S0_ + 387
8 python3 0x0000000100ca759f PyCFunction_Call + 127
9 python3 0x0000000100d6eec7 PyEval_EvalFrameEx + 18583
10 python3 0x0000000100d68faf _PyEval_EvalCodeWithName + 335
11 python3 0x0000000100c746aa function_call + 106
12 python3 0x0000000100c30b35 PyObject_Call + 69
13 python3 0x0000000100d6fc9b PyEval_EvalFrameEx + 22123
14 python3 0x0000000100d6efb8 PyEval_EvalFrameEx + 18824
15 python3 0x0000000100d6efb8 PyEval_EvalFrameEx + 18824
16 python3 0x0000000100d68faf _PyEval_EvalCodeWithName + 335
17 python3 0x0000000100c746aa function_call + 106
18 python3 0x0000000100c30b35 PyObject_Call + 69
19 python3 0x0000000100c53694 method_call + 148
20 python3 0x0000000100c30b35 PyObject_Call + 69
21 python3 0x0000000100d77bf4 PyEval_CallObjectWithKeywords + 68
22 python3 0x0000000100de472a t_bootstrap + 122
23 libsystem_pthread.dylib 0x00007fff69e8cd76 _pthread_start + 125
24 libsystem_pthread.dylib 0x00007fff69e895d7 thread_start + 15
)
2019-10-14 15:20:27.485 python3[86286:1964642] WARNING: nextEventMatchingMask should only be called from the Main Thread! This will throw an exception in the future.

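Apple only allows native UI functionality to be used on the main thread in macOS and iOS, so cv2.imshow() and cv2.waitKey() need to be called from the main thread. Move the other work (such as the communication) into the background thread instead.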
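
One possible way to arrange this is to let a worker thread produce frames and hand them to the main thread through a queue, so that only the main thread touches the GUI. The sketch below is an assumption about how the code could be restructured, not a tested OpenCV program: capture_frames, run, read_frame and show_frame are hypothetical names, and a fake frame source stands in for the camera so the pattern runs anywhere.

```python
import queue
import threading


def capture_frames(read_frame, frames, stop_event):
    """Worker thread: read frames and pass them to the main thread."""
    while not stop_event.is_set():
        ok, frame = read_frame()
        if not ok:
            break
        frames.put(frame)
    frames.put(None)  # sentinel: no more frames


def run(read_frame, show_frame):
    """Call this from the main thread; show_frame is the only GUI call."""
    frames = queue.Queue()
    stop_event = threading.Event()
    worker = threading.Thread(
        target=capture_frames,
        args=(read_frame, frames, stop_event),
        name="capture",
    )
    worker.start()
    shown = 0
    while True:
        frame = frames.get()
        if frame is None:
            break
        shown += 1
        if not show_frame(frame):  # False means the user asked to quit
            break
    stop_event.set()
    worker.join()
    return shown


# Stand-in frame source so the sketch runs without a camera.
# With OpenCV, read_frame could be video_reader.read, and show_frame a
# function that calls cv2.imshow('image', frame) and returns
# cv2.waitKey(1) != 27.
fake_frames = iter([(True, "frame-1"), (True, "frame-2"), (False, None)])
displayed = []
count = run(lambda: next(fake_frames), lambda f: displayed.append(f) or True)
print(count, displayed)
```

The queue here is unbounded for simplicity; with a real camera you may prefer to drop stale frames so the display does not lag behind the capture.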

To Print each stage of Bubble sort in Python3.6

Sort the given set of numbers using Bubble Sort. The first line of the input contains the number of elements; the second line of the input contains the numbers to be sorted. In the output, print the state of the array after the 3rd iteration and the final sorted array in the given format.
alist=[]
def bubble_sort(alist):
    for i in range(len(alist) - 1, 0, -1):
        no_swap = True
        for j in range(0, i):
            if alist[j + 1] < alist[j]:
                alist[j], alist[j + 1] = alist[j + 1], alist[j]
                no_swap = False
        if no_swap:
            return
n=int(input())
for i in range(n):
    alist.append(int(input()))
alist = [int(x) for x in alist]
bubble_sort(alist)
print('Sorted array: ', end='\n')
for i in alist:
    print(i,end=" ")
Test Case 1
7
64
34
25
12
22
11
90
Expected Output:
It should print the following 3 lines
12 22 11 25 34 64 90
Sorted array:
11 12 22 25 34 64 90
Test Case 2
8
14
83
25
47
9
77
1
0
Expected Output:
It should print the 3 following lines
14 9 25 1 0 47 77 83
Sorted array:
0 1 9 14 25 47 77 83

Just add a print inside your outer for loop when you reach the third iteration:
def bubble_sort(alist):
    for i in range(len(alist) - 1, 0, -1):
        no_swap = True
        for j in range(0, i):
            if alist[j + 1] < alist[j]:
                alist[j], alist[j + 1] = alist[j + 1], alist[j]
                no_swap = False
        # i counts down from len(alist) - 1, so this is the end of the third pass
        if i == len(alist) - 3:
            # the unpacking operator prints the items separated by spaces;
            # in Python 2 you can print it this way: " ".join(map(str, alist))
            print(*alist)
        if no_swap:
            return
# Test Case 1, hard-coded here instead of being read with input()
alist = [64, 34, 25, 12, 22, 11, 90]
bubble_sort(alist)
print('Sorted array: ', end='\n')
for i in alist:
    print(i,end=" ")
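
If you want to check other inputs, one option is a generator that yields the list after every pass, so any pass can be picked out afterwards. This is only a sketch of the same bubble sort; bubble_sort_passes is a hypothetical name, and the numbers are Test Case 2 from the question.

```python
def bubble_sort_passes(values):
    """Sort a copy of values, yielding the list after each completed pass."""
    items = list(values)
    for i in range(len(items) - 1, 0, -1):
        swapped = False
        for j in range(i):
            if items[j + 1] < items[j]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        yield list(items)  # snapshot, so later passes do not alter it
        if not swapped:
            return


passes = list(bubble_sort_passes([14, 83, 25, 47, 9, 77, 1, 0]))
print(*passes[2])   # state after the third pass
print('Sorted array: ')
print(*passes[-1])  # final sorted list
```

Note that passes[2] only exists if the sort needs at least three passes; an input that is sorted earlier would need a length check first.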

How can I find the alphabetic position of the letters in a word and then add the numbers up?

I wrote some code that gives me the total of a word's letter positions in the English alphabet, but I am looking for something that prints a line like this:
book: 2 + 15 + 15 + 11 = 43
def convert(string):
    sum = 0
    for c in string:
        code_point = ord(c)
        location = code_point - 65 if code_point >= 65 and code_point <= 90 else code_point - 97
        sum += location + 1
    return sum
print(convert('book'))

def convert(string):
    parts = []
    sum = 0
    for c in string:
        code_point = ord(c)
        location = code_point - 65 if code_point >= 65 and code_point <= 90 else code_point - 97
        sum += location + 1
        parts.append(str(location + 1))
    return "{0}: {1} = {2}".format(string, " + ".join(parts), sum)
print(convert('book'))
Here's the output:
book: 2 + 15 + 15 + 11 = 43
More info on str.format and str.join.
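
For comparison, here is a slightly different sketch of the same idea that lowercases the word first and skips anything that is not a letter a to z. The names letter_positions and describe are made up for this example; skipping non-letters is a design choice, not something the question asked for.

```python
def letter_positions(word):
    """Return the 1-based alphabet position of each letter a-z in word."""
    return [ord(c) - ord("a") + 1 for c in word.lower() if "a" <= c <= "z"]


def describe(word):
    positions = letter_positions(word)
    return "{}: {} = {}".format(
        word, " + ".join(map(str, positions)), sum(positions)
    )


print(describe("book"))
```

Keeping the positions in a list means the total and the printed terms come from the same data, and the built-in sum() is not shadowed by a local variable.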

Why is python giving me a Syntax error?

I have this code from a tutorial:
#File called test
1  def sanitize(time_string):
2      if '-' in time_string:
3          splitter = '-'
4      elif ':' in time_string:
5          splitter = ':'
6      else:
7          return(time_string)
8      (mins, secs) = time_string.split(splitter)
9      return(mins + '.' + secs)
10
11
12
13 def get_coach_data(filename):
14     with open(filename) as f:
15         data = f.readline()
16     temp1 = data.strip().split(',')
17     return(Athlete(temp1.pop(0), temp1.pop(0), temp1)
18
19
20 james = get_coach_data('james2.txt')
21 julie = get_coach_data('julie2.txt')
22 mikey = get_coach_data('mikey2.txt')
23 sarah = get_coach_data('sarah2.txt')
24
25 print(james.name+"'s fastest times are: " + str(james.top3()))
26 print(juliename+"'s fastest times are: " + str(julie.top3()))
27 print(mikey.name+"'s fastest times are: " + str(mikey.top3()))
28 print(sarah.name+"'s fastest times are: " + str(sarah.top3()))
and I put this class separately because I thought it may have been causing the error:
1 class Athlete:
2     def __init__(self, a_name, a_dob=None, a_times=[]):
3         self.name = a_name
4         self.dob = a_dob
5         self.times = a_times
6
7     def top3(self):
8         return(sorted(set([sanitize(t) for t in self.times]))[0:3])
The error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "test.py", line 20
james = get_coach_data('james2.txt')
But the error doesn't make any sense to me. I am new to Python. I appreciate anyone's help. Thanks in advance.

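The errors that I can see are:
On line 17, in get_coach_data,
return(Athlete(temp1.pop(0), temp1.pop(0), temp1)
is missing a closing parenthesis. Python keeps reading past the end of that line looking for it, which is why the error is reported at line 20. It should just be
return Athlete(temp1.pop(0), temp1.pop(0), temp1)
On line 26, juliename should be julie.name.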
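
Putting both fixes together, a minimal runnable sketch could look like the following. It is not the tutorial's code: parse_coach_line is a hypothetical stand-in for get_coach_data that takes the line as a string instead of reading a file, the sample data is made up, and the a_times=None default is an extra change to avoid sharing one list between instances.

```python
def sanitize(time_string):
    """Normalise 'm-ss' or 'm:ss' to 'm.ss'."""
    if '-' in time_string:
        splitter = '-'
    elif ':' in time_string:
        splitter = ':'
    else:
        return time_string
    mins, secs = time_string.split(splitter)
    return mins + '.' + secs


class Athlete:
    def __init__(self, a_name, a_dob=None, a_times=None):
        self.name = a_name
        self.dob = a_dob
        # None instead of [] avoids sharing one default list between instances
        self.times = a_times if a_times is not None else []

    def top3(self):
        return sorted(set(sanitize(t) for t in self.times))[0:3]


def parse_coach_line(line):
    """Hypothetical stand-in for get_coach_data that skips the file read."""
    temp1 = line.strip().split(',')
    # Balanced parentheses: this is the statement that was broken on line 17
    return Athlete(temp1.pop(0), temp1.pop(0), temp1)


# Made-up sample line in the name,dob,times... layout the code expects
sam = parse_coach_line("Sam Doe,2000-1-1,2-34,3:21,2.34,2:01,2:01,2-22\n")
print(sam.name + "'s fastest times are: " + str(sam.top3()))
```

When Python reports a syntax error on a line that looks fine, checking the preceding lines for an unclosed parenthesis, bracket or quote is usually a good first step.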
