Python - Index Error - list index out of range - python-3.x

I am parsing data from a website but getting the error "IndexError: list index out of range". At the time of debugging I can see all the values, and the code previously worked completely fine, so I can't understand why I am suddenly getting this error.
str2 = cols[1].text.strip()
IndexError: list index out of range
Here is my code.
import requests
import DivisionModel
from bs4 import BeautifulSoup
from time import sleep

class DivisionParser:

    def __init__(self, zoneName, zoneUrl):
        self.zoneName = zoneName
        self.zoneUrl = zoneUrl

    def getDivision(self):
        response = requests.get(self.zoneUrl)
        soup = BeautifulSoup(response.content, 'html5lib')
        table = soup.findAll('table', id='mytable')
        rows = table[0].findAll('tr')
        division = []
        for row in rows:
            if row.text.find('T No.') == -1:
                cols = row.findAll('td')
                str1 = cols[0].text.strip()
                str2 = cols[1].text.strip()
                str3 = cols[2].text.strip()
                strurl = cols[2].findAll('a')[0].get('href')
                str4 = cols[3].text.strip()
                str5 = cols[4].text.strip()
                str6 = cols[5].text.strip()
                str7 = cols[6].text.strip()
                divisionModel = DivisionModel.DivisionModel(self.zoneName, str2, str3, strurl,
                                                            str4, str5, str6, str7)
                division.append(divisionModel)
        return division
These are the values at the time of debugging:
str1 = {str} '1'
str2 = {str} 'BHUSAWAL DIVN-ENGINEERING'
str3 = {str} 'DRMWBSL692019t1'
str4 = {str} 'Bhusawal Division - TRR/P- 44.898Tkms & 2.225Tkms on 9 Bridges total 47.123Tkms on ADEN MMR &'
str5 = {str} 'Open'
str6 = {str} '23/12/2019 15:00'
str7 = {str} '5'
strurl = {str} '/works/pdfdocs/122019/51822293/viewNitPdf_3021149.pdf'

It turned out that while I was parsing the data by checking for 'T No.' in each row and reading all the values from the td cells, the website developer had put a single "No Result" cell in some rows. At run time the loop could not find the expected cells in those rows and threw the "list index out of range" error.
Well, thanks to all for the help.
class DivisionParser:

    def __init__(self, zoneName, zoneUrl):
        self.zoneName = zoneName
        self.zoneUrl = zoneUrl

    def getDivision(self):
        global rows
        try:
            response = requests.get(self.zoneUrl)
            soup = BeautifulSoup(response.content, 'html5lib')
            table = soup.findAll('table', id='mytable')
            rows = table[0].findAll('tr')
        except IndexError:
            sleep(2)
        division = []
        for row in rows:
            if row.text.find('T No.') == -1:
                try:
                    cols = row.findAll('td')
                    str1 = cols[0].text.strip()
                    str2 = cols[1].text.strip()
                    str3 = cols[2].text.strip()
                    strurl = cols[2].findAll('a')[0].get('href')
                    str4 = cols[3].text.strip()
                    str5 = cols[4].text.strip()
                    str6 = cols[5].text.strip()
                    str7 = cols[6].text.strip()
                    divisionModel = DivisionModel.DivisionModel(self.zoneName, str2, str3, strurl,
                                                                str4, str5, str6, str7)
                    division.append(divisionModel)
                except IndexError:
                    print("No Result")
        return division
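A minimal alternative sketch (reusing the question's own names) is to check the cell count up front instead of catching the exception, so that "No Result" rows are simply skipped:

for row in rows:
    cols = row.findAll('td')
    # Header rows and "No Result" rows have fewer than 7 td cells,
    # so skip them instead of letting cols[1] raise IndexError.
    if row.text.find('T No.') != -1 or len(cols) < 7:
        continue
    str1 = cols[0].text.strip()
    # ... read the remaining cells and build DivisionModel as before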

As a general rule, whatever comes from the cold and hostile outside world is totally unreliable. Here:
response = requests.get(self.zoneUrl)
soup = BeautifulSoup(response.content, 'html5lib')
you seem to suffer from the terrible delusion that the response will always be what you expect. Hint: it won't. It is guaranteed that sometimes the response will be something different - it could be that the site is down, or that it decided to blacklist your IP because they don't like having you scraping their data, or whatever.
IOW, you really want to check the response's status code AND the response content. Actually, you want to be prepared for just about anything - FWIW, since you don't specify a timeout, your code could just stay frozen forever waiting for a response.
So what you actually want here is something along the lines of:
try:
    response = requests.get(yoururl, timeout=some_appropriate_value)
    # cf requests doc
    response.raise_for_status()
    # cf requests doc
except requests.exceptions.RequestException as e:
    # nothing else you can do here - depending on
    # the context (script ? library code ?),
    # you either want to re-raise the exception,
    # raise your own exception, or just
    # show the error message and exit.
    # Only you can decide what's the appropriate course.
    print("couldn't fetch {}: {}".format(yoururl, e))
    return

if not response.headers['content-type'].startswith("text/html"):
    # idem - not what you expected, and you can't do much
    # except mentioning the fact to the caller one way
    # or another. Here I just print the error and return,
    # but if this is library code you want to raise an exception
    # instead.
    print("{} returned non text/html content {}".format(yoururl, response.headers['content-type']))
    print("response content:\n\n{}\n".format(response.text))
    return

# etc...
requests has some rather exhaustive docs; I suggest you read more than the quickstart to learn to use it properly. And that's only half the job - even if you do get a 200 response with no redirections and the right content type, it doesn't mean the markup is what you expect, so here again you have to double-check what you get from BeautifulSoup - for example here:
table = soup.findAll('table', id='mytable')
rows = table[0].findAll('tr')
There's absolutely no guarantee that the markup contains any table with a matching id (nor any table at all, FWIW), so you have to either check beforehand or handle exceptions:
tables = soup.findAll('table', id='mytable')
if not tables:
    # oops, no matching tables ?
    print("no table 'mytable' found in markup")
    print("markup:\n{}\n".format(response.text))
    return
rows = tables[0].findAll('tr')
# idem, the table might be empty, etc etc
One of the fun things with programming is that handling the nominal case is often rather straightforward - but then you have to handle all the possible corner cases, and this usually requires as much or more code than the nominal case ;-)

Related

problems to handle more than one (return) parameter in main()

I'm rewriting an old keyword scanner from Python 2 to Python 3 and am having problems handling more than one return parameter in my final main() function.
def scanner_pref():
    dork = input('Dork: ')
    number = input('Number of sites: ')
    return dork, number
So, I need to return dork and number to the next function
def scanner(dork, number):
    url = "http://www.google.de/search"
    payload = {'q': dork, 'start': '0', 'num': int(number) * 10}
    [..]
so the scanner can proceed with the given parameters in payload.
But when I try to write the main() function, it can't handle the scanner function, because it suddenly requires the number parameter. See below:
def main():
    pref = scanner_pref()
    scan = scanner(pref)  # <--
    parser(h3tag=scan)
I don't really understand why scan = scanner(pref, ?) requires the number parameter when it receives the information from pref above and doesn't really care about the dork parameter.
If I remove "number" from scanner_pref() and move it back to scanner(..), it works fine and no error or warning message appears.
def scanner_pref():
    dork = input('Dork: ')
    return dork

#

def scanner(dork):
    url = "http://www.google.de/search"
    number = input("Number of sites: ")
    payload = {'q': dork, 'start': '0', 'num': int(number) * 10}

#

def main():
    pref = scanner_pref()
    scan = scanner(pref)
    parser(h3tag=scan)
works fine and without problems
scanner(dork, number) takes two arguments.
When you call pref = scanner_pref(), the values dork and number are stored in pref as a tuple. When you pass pref to scanner you are still only passing one argument: a tuple with two values.
You have two easy options:
pref_dork, pref_number = scanner_pref()
scan = scanner(pref_dork, pref_number)
or
pref = scanner_pref()
scan = scanner(pref[0], pref[1])
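For illustration, a quick sketch of the tuple packing described above (the prompt answers shown are hypothetical, just to make the shapes visible):

pref = scanner_pref()         # e.g. ('inurl:login', '3') - one tuple, two values
print(type(pref), len(pref))  # <class 'tuple'> 2
dork, number = pref           # unpacking recovers the two separate values
scan = scanner(dork, number)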

Twitter API, Searching with dollar signs

This code opens a Twitter listener, and the search terms are in the variable upgrades_str. Some searches work and some don't. I added AMZN to the upgrades list just to be sure there's a frequently used term, since this is using an open Twitter stream and not searching existing tweets.
Below, I think we only need to review numbers 2 and 4.
I'm using Python 3.5.2 :: Anaconda 4.0.0 (64-bit) on Windows 10.
Variable searches
Searching with: upgrades_str: ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] = returns tweets such as 'i'm tired of people'
Searching with: upgrades_str: ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] = returns tweets such as 'Chicago to south Florida. Hiphop lives'. This search is the one I wish worked.
Explicit searches
Searching by replacing the variable 'upgrades_str' with the explicit string: ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] = returns 'After being walked in on twice, I have finally figured out how to lock the door here in Sweden'. This one at least has the search term 'door'.
Searching by replacing the variable 'upgrades_str' with the explicit string: ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] = returns '$AMZN $WFM $KR $REG $KIM: Amazon’s Whole Foods buy puts shopping centers at risk as real'. So the explicit call works, but not the identical variable.
Explicitly searching for ['$AMZN'] = returns a good tweet: 'FANG setting up really good for next week! Added $googl jun23 970c avg at 4.36. $FB $AMZN'.
Explicitly searching for ['cool'] returns 'I can’t believe I got such a cool Pillow!'
import tweepy
import dataset
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError
import json

db = dataset.connect('sqlite:///tweets.db')

class StreamListener(tweepy.StreamListener):

    def on_status(self, status):
        if status.retweeted:
            return
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        blob = TextBlob(text)
        sent = blob.sentiment
        if geo is not None:
            geo = json.dumps(geo)
        if coords is not None:
            coords = json.dumps(coords)
        table = db['tweets']
        try:
            table.insert(dict(
                user_description=description,
                user_location=loc,
                coordinates=coords,
                text=text,
                geo=geo,
                user_name=name,
                user_created=user_created,
                user_followers=followers,
                id_str=id_str,
                created=created,
                retweet_count=retweets,
                user_bg_color=bg_color,
                polarity=sent.polarity,
                subjectivity=sent.subjectivity,
            ))
        except ProgrammingError as err:
            print(err)

    def on_error(self, status_code):
        if status_code == 420:
            return False

access_token = 'token'
access_token_secret = 'tokensecret'
consumer_key = 'consumerkey'
consumer_secret = 'consumersecret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=upgrades_str, languages=['en'])
Here's the answer, in case someone has the problem in the future: "Note that punctuation is not considered to be part of a #hashtag or #mention, so a track term containing punctuation will not match either #hashtags or #mentions." From: https://dev.twitter.com/streaming/overview/request-parameters#track
And for multiple terms, the string, which was converted from a list, needs to be changed to ['term1,term2']. Just strip out the apostrophes and spaces:
import re

upgrades_str = re.sub('[\' \[\]]', '', upgrades_str)
upgrades_str = '[\'' + format(upgrades_str) + '\']'
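For reference, a minimal sketch of producing that same single-phrase form with join() instead of regex surgery; upgrades_list is a hypothetical name for the original Python list of cashtags:

upgrades_list = ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR']
# One comma-separated phrase wrapped in a one-element list, the shape
# described above.
track_terms = [','.join(upgrades_list)]   # ['$AMZN,$SWK,$AIQUY,$SFUN,$DOOR']
stream.filter(track=track_terms, languages=['en'])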

Getting the correct information from differently formulated queries

Howdy people,
I'm trying to put together a limited Q&A program that will allow the user to query Wikidata using SPARQL, with very specific/limited query structures.
I've got the program going, but I'm running into issues when entering queries that are formulated differently.
def sparql_query(line):
    m = re.search('What is the (.*) of (.*)?', line)
    relation = m.group(1)
    entity = m.group(2)
    wdparams['search'] = entity
    json = requests.get(wdapi, wdparams).json()
    for result in json['search']:
        entity_id = result['id']
    wdparams['search'] = relation
    wdparams['type'] = 'property'
    json = requests.get(wdapi, wdparams).json()
    for result in json['search']:
        relation_id = result['id']
    fire_sparql(entity_id, relation_id)
    return fire_sparql
As you can see, this only works with queries following that specific structure, for example "What is the color of night?". Queries along the lines of "What are the ingredients of pizza?" would simply cause the program to crash because they don't follow the 'correct' structure set in the code. As such, I would like it to be able to differentiate between different types of query structures ("What is..." and "What are...", for example) and still collect the needed information (relation/property and entity).
This setup is required, insofar as I can determine, seeing as the property and entity need to be extracted from the query in order to get the proper results from Wikidata. This is unfortunately also what I'm running into problems with; I can't seem to use 'if' or 'while-or' statements without the code returning all sorts of issues.
So the question being: How can I make the code accept differently formulated queries whilst still retrieving the needed information from them?
Many thanks in advance.
The entirety of the code in case required:
#!/usr/bin/python3
import sys
import requests
import re

def main():
    example_queries()
    for line in sys.stdin:
        line = line.rstrip()
        answer = sparql_query(line)
        print(answer)

def example_queries():
    print("Example query?\n\n Ask your question.\n")

wdapi = 'https://www.wikidata.org/w/api.php'
wdparams = {'action': 'wbsearchentities', 'language': 'en', 'format': 'json'}

def sparql_query(line):
    m = re.search('What is the (.*) of (.*)', line)
    relation = m.group(1)
    entity = m.group(2)
    wdparams['search'] = entity
    json = requests.get(wdapi, wdparams).json()
    for result in json['search']:
        entity_id = result['id']
    wdparams['search'] = relation
    wdparams['type'] = 'property'
    json = requests.get(wdapi, wdparams).json()
    for result in json['search']:
        relation_id = result['id']
    fire_sparql(entity_id, relation_id)
    return fire_sparql

url = 'https://query.wikidata.org/sparql'

def fire_sparql(ent, rel):
    query = 'SELECT * WHERE { wd:' + ent + ' wdt:' + rel + ' ?answer.}'
    print(query)
    data = requests.get(url, params={'query': query, 'format': 'json'}).json()
    for item in data['results']['bindings']:
        for key in item:
            if item[key]['type'] == 'literal':
                print('{} {}'.format(key, item[key]['value']))
            else:
                print('{} {}'.format(key, item[key]))

if __name__ == "__main__":
    main()
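As an illustration only - a minimal, untested sketch reusing the question's own re import and wdapi/wdparams globals, not a definitive answer - one way to accept more than one phrasing is to try a small list of patterns in order and take the first that matches:

QUESTION_PATTERNS = [
    'What is the (.*) of (.*)',
    'What are the (.*) of (.*)',
    'Who is the (.*) of (.*)',
]

def parse_question(line):
    # Return (relation, entity) for the first pattern that matches,
    # or None if no pattern fits the question.
    for pattern in QUESTION_PATTERNS:
        m = re.search(pattern, line)
        if m:
            return m.group(1), m.group(2).rstrip('?')
    return None

def sparql_query(line):
    parsed = parse_question(line)
    if parsed is None:
        print("Sorry, I don't understand that question.")
        return None
    relation, entity = parsed
    # ... continue exactly as before, looking up entity and relation
    # through wdapi/wdparams and calling fire_sparql(entity_id, relation_id)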

Python3 Function return value

I believe I have something incorrect with my function, but I'm still pretty new to Python and functions in general, and I am missing what my issue is. In the code below I am sending no values into the function, but once the function reads the database I want it to pull the values from the database back into the main program. I placed a print statement inside the function to make sure it's pulling the values from the database, and everything works properly. I copied and pasted that same print line right after the function call in the main area, but I get NameError: name 'wins' is not defined. This leads me to believe I am not returning the values correctly?
#Python3
import pymysql

def read_db():
    db = pymysql.connect(host='**********', user='**********', password='**********', db='**********')
    cursor = db.cursor()
    sql = "SELECT * FROM loot WHERE id = 1"
    cursor.execute(sql)
    results = cursor.fetchall()
    for row in results:
        wins = row[2]
        skips = row[3]
        lumber1 = row[4]
        ore1 = row[5]
        mercury1 = row[6]
    db.commit()
    db.close()
    print("Wins = ", wins, " Skips = ", skips, " Lumber = ", lumber1, " Ore = ", ore1, " Mercury = ", mercury1)
    return (wins, skips, lumber1, ore1, mercury1)

read_db()
print("Wins = ", wins, " Skips = ", skips, " Lumber = ", lumber1, " Ore = ", ore1, " Mercury = ", mercury1)
Doing just read_db() causes the return values to be thrown out. You need to assign the return values to variables:
wins, skips, lumber1, ore1, mercury1 = read_db()
The fact that you're returning a variable called wins does not mean a wins variable will become available in the scope you call read_db() from. You need to assign return values explicitly.
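For completeness, a short sketch of the caller from the question with that assignment applied (same variable names the question uses):

# Assign the five returned values, then the print after the call works
# instead of raising NameError.
wins, skips, lumber1, ore1, mercury1 = read_db()
print("Wins = ", wins, " Skips = ", skips, " Lumber = ", lumber1,
      " Ore = ", ore1, " Mercury = ", mercury1)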

Why do I get a different number of output elements depending on the position of elements in the iterated array?

So, the purpose of this code is to produce a list of the URLs on the page, but I found out that the number of URLs output depends on the position of elements in the array that is used while iterating, i.e. params = ["src", "href"].
The code is a working program using the imported Requests library (requests.get(), response.text) and such structures as lists and loops.
Questions:
1. Why do I get 8 URLs when I use "src" in position 0 of the params array, and 136 URLs when I use "href" in position 0 of the params array?
2. How is it possible to obtain all elements (src and href) in the array all_urls?
import requests

domain = "https://www.python.org/"
response = requests.get(domain)
page = response.text
all_urls = set()
params = ["src", "href"]

def getURL(page, param):
    start_link = page.find(param)
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

for param in params:
    while True:
        url, n = getURL(page, param)
        page = page[n:]
        #count += 1
        if url:
            if url.startswith('/') or url.startswith('#!'):
                all_urls.add(domain + url)
            elif url.startswith('http'):
                all_urls.add(url)
            else:
                continue
        else:
            break

print("all urls length:", len(all_urls))
To answer your questions:
1- This happens because you are consuming your page variable inside the loop:
url, n = getURL(page, param)
page = page[n:]  # this one here
This just slices the page string after each iteration and reassigns it to the same variable, so you lose a chunk on each iteration. By the time you get to the last src or href you are probably already at the end of the document.
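A tiny illustration of that consumption effect, using a made-up page string rather than the real python.org markup:

page = 'href="a" href="b" src="c"'
url, n = getURL(page, "href")   # url == 'a', n is the index of its closing quote
page = page[n:]                 # everything before 'a' is gone for good
# By the time the loop switches over to "src", most of a real document
# has already been sliced away, which is why the counts differ.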
2- A very quick fix for your code would be to reset the page for each new param:
for param in params:
    page = response.text
    while True:
        url, n = getURL(page, param)
        page = page[n:]
        ....
However
There is a far better way to handle HTML. Why don't you just use an HTML parser for this task?
For example you could use BeautifulSoup4 (not optimal code and not tested, just for a quick demonstration):
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.python.org/")
page = BeautifulSoup(response.text, "html.parser")

all_urls = list()
elements = page.find_all(lambda tag: tag.has_attr('src') or tag.has_attr('href'))

for elem in elements:
    if elem.has_attr('src'):
        all_urls.append(elem['src'])
    elif elem.has_attr('href'):
        all_urls.append(elem['href'])

print("all urls with dups length:", len(all_urls))
print("all urls w/o dups length:", len(set(all_urls)))
