So my assignment is to write a program that reads all the data from a web page and prints every hyperlink of the form:

<a href="...">link text</a>
Here is what I have so far, with the help of my instructor:
from urllib.request import urlopen

def findTitle(webpage):
    encoding = "utf-8"
    for webpagestr in webpage:
        webpagestr = str(webpagestr, encoding)
        if "<title>" in webpagestr:
            indexstart = webpagestr.find("<title>")
            indexend = webpagestr.find("</title>")
            title = webpagestr[indexstart+7:indexend]
            return title
        return title

def H1headings(webpage):
    encoding = "utf-8"
    for webpagestr in webpage:
        webpagestr = str(webpagestr, encoding)
        if "<h1>" in webpagestr:
            indexstart = webpagestr.find("<h1>")
            indexend = webpagestr.find("</h1>")
            heading = webpagestr[indexstart+4:indexend]
            print(heading)

def main():
    address = input("Enter URL to find title and more information: ")
    try:
        webpage = urlopen(address)
        title = findTitle(webpage)
        print("Title is", title)
        H1headings(webpage)
        webpage.close()
    except Exception as exceptObj:
        print("Error: ", str(exceptObj))

main()
When I run this program it lets me enter a URL, but then it gives me:
Error: local variable 'title' referenced before assignment
I'm not sure what it means.
Then, in one of my attempts, when I placed:

def findTitle(webpage):
    title = "Not Found"
    encoding = "utf-8"

and ran the program, it gave me:
Enter URL to find title and more information: http://jeremycowart.com
Title is not found
Jeremy Cowart
It's close to what I'm looking for, but I believe I'm supposed to get the title and heading, as well as the link text.
I've gotten close but I just can't figure it out. Any help would be appreciated!
There are at least a couple problems I see.
In your function findTitle, you always return the variable title, even if you haven't found one and therefore haven't assigned title to anything. So when the function fails to find a title on the web page (whether because there isn't one or because your code missed it), it tries to return a variable that was never created. That is exactly what the error message means.
First, you should fix your code so it doesn't do that. You can either change the final return statement to return a default value, e.g. "Not found", or you can set title to a default value at the top of the function, which guarantees it exists no matter when it is returned, e.g.:
def foo(x):
    y = -1
    # <do stuff, maybe re-assigning y>
    return y
That is the pattern I tend to prefer.
There is another problem, though, which is preventing you from properly detecting the title. If you step through your program you should see it: you start looping, read the first line, and check whether it contains a title. If it does, you return it (good); if it doesn't, you skip the if but then return anyway (not so good), so you never get to the second line.
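Putting both fixes together, a corrected findTitle might look like the sketch below. It keeps your line-by-line approach as-is; the only changes are the default value and moving the final return out of the loop so every line gets checked:

def findTitle(webpage):
    encoding = "utf-8"
    title = "Not Found"  # default, so title always exists when returned
    for webpagestr in webpage:
        webpagestr = str(webpagestr, encoding)
        if "<title>" in webpagestr:
            indexstart = webpagestr.find("<title>")
            indexend = webpagestr.find("</title>")
            title = webpagestr[indexstart+7:indexend]
            return title
    # Only reached after the whole page has been scanned.
    return title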
I am using pyinputplus and specifically inputNum
https://pyinputplus.readthedocs.io/en/latest/
This is what my code looks like:

import pyinputplus as pyip

msg = 'Enter value to add/replace or s to skip field or q to quit: '
answer = pyip.inputNum(prompt=msg, allowRegexes=r'^[qQsS]$', blank=False)
My goal is to allow any number, but also to allow one of the following: q, Q, s, S.
However, when I run the code and enter 'sam', the code crashes, because later on I call float(answer).
My expectation was that allowRegexes would reject this and show me the prompt again to re-enter.
Please advise!
It seems pyip.inputNum stops validating the input if you provide the allowRegexes argument; see the "allowRegexes doesn't seem to work properly" GitHub issue.
You can use the inputCustom function with a custom validation function:
import pyinputplus as pyip
import ast

def is_numeric(x):
    # Safely evaluate the string; anything that isn't a literal counts as non-numeric.
    try:
        number = ast.literal_eval(x)
    except (ValueError, SyntaxError):
        return False
    return isinstance(number, (float, int))

def raiseIfInvalid(text):
    # inputCustom re-prompts whenever the validation function raises.
    if not is_numeric(text) and text.casefold() not in ['s', 'q']:
        raise Exception('Input must be numeric or "q"/"s".')

msg = 'Enter value to add/replace or s to skip field or q to quit: '
answer = pyip.inputCustom(raiseIfInvalid, prompt=msg)
So, if the text is not an int or a float, and not equal to s/S/q/Q, the prompt will keep reappearing.
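As an aside, since the value is later converted with float(answer) anyway, a variant I'd sketch (not from the original answer) is to validate with float() itself, so validation and conversion can never disagree:

def raiseIfInvalid(text):
    # Accept the skip/quit letters first.
    if text.casefold() in ('s', 'q'):
        return
    # Otherwise require something float() can parse; raising makes
    # inputCustom show the prompt again.
    try:
        float(text)
    except ValueError:
        raise Exception('Input must be numeric or "q"/"s".')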
I am in the middle of learning Scrapy right now and am building a simple scraper for a real estate site. With this code I am trying to scrape all of the URLs for the real estate listings of a specific city. I have run into the following error: "Cannot mix str and non-str arguments".
I believe I have isolated the problem to the following part of my code:

props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
If I use the extract_first() method instead of extract() in the props assignment, the code kind of works: it grabs the first property link on each page. However, that ultimately is not what I want. I believe the XPath itself is correct, since the code runs when I use extract_first().
Can someone explain what I am doing wrong here? I have listed my full code below.
import scrapy
from scrapy.http import Request

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    allowed_domains = ['www.realtor.com']
    start_urls = ['http://www.realtor.com/realestateandhomes-search/Houston_TX/']

    def parse(self, response):
        props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
        for prop in props:
            absolute_url = response.urljoin(props)
            yield Request(absolute_url, callback=self.parse_props)
        next_page_url = response.xpath('//a[@class = "next"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

    def parse_props(self, response):
        pass
Please let me know if I can clarify anything.
You are passing the whole props list of strings to response.urljoin(), but you meant prop instead:
for prop in props:
    absolute_url = response.urljoin(prop)
Alecxe is right; it was a simple typo in the loop variable. You can also use the following notation:
for prop in response.xpath('//div[@class = "address ellipsis"]/a/@href').extract():
    yield scrapy.Request(response.urljoin(prop), callback=self.parse_props)
It's cleaner, and you're not binding the intermediate absolute_url name on every iteration; at a larger scale that can save a little memory. A consolidated version of the parse method is sketched below.
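Putting the fix into the full spider, the parse method might look like this sketch: your code with the typo corrected, plus a small guard I've added so urljoin isn't called with None on the last results page:

def parse(self, response):
    # Follow every listing link on the current results page.
    for prop in response.xpath('//div[@class = "address ellipsis"]/a/@href').extract():
        yield scrapy.Request(response.urljoin(prop), callback=self.parse_props)

    # Follow pagination only if a "next" link exists.
    next_page_url = response.xpath('//a[@class = "next"]/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(response.urljoin(next_page_url))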
I've been working on this for a while now, but maybe I am just searching for the wrong thing to get the answers I need.
I have a dictionary whose keys are specific words I would like to find in a web page. I would then like to highlight those words and save the resulting HTML to a local file.
EDIT: It occurred to me later that people may like to execute the code themselves. This link includes the word dictionaries and the HTML of the page I am using to test my code, as it should have the most matches of any of the pages I am scanning. Alternatively, you can use the actual website; the link would replace rl[0] in the code.
try:
    # rl[0] refers to a specific url being pulled from a list in another file.
    req = urllib.request.Request(rl[0], None, headers)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPCookieProcessor(cj))
    resp = opener.open(req)
    soup = BeautifulSoup(resp.read(), 'html.parser')
    resp.close()
except urllib.error.URLError:
    print("URL error when opening "+rl[0])
except urllib.error.HTTPError:
    print("HTTP error when opening "+rl[0])
except http.client.HTTPException as err:
    print(err, "HTTP exception error when opening "+rl[0])
except socket.timeout:
    print("connection timed out accessing "+rl[0])
    soup = None
else:
    for l in [wdict1, wdict2, wdict3, wdict4]:
        for i in l:
            foundvocab = soup.find_all(text=re.compile(i))
            for term in foundvocab:
                # c indicates the highlight color determined earlier in the script,
                # based on which dictionary the word came from.
                # numb is a term defined earlier as a reference to another document
                # this script creates.
                fixed = term.replace(i, '<mark background-color="'+c+'">'+i+'<sup>'+numb+'</sup></mark>')
                term.replace_with(fixed)
    print(soup, file=path/local.html)  # shorthand for the local HTML output file
The issue I am having is that, when the soup prints, it prints the entire paragraph for each word it finds, and does not highlight. Alternatively, I can say:

foundvocab = soup.find_all(text=i)

and the resulting HTML file is blank.
OK, the below code finds and replaces what I need. Now I just have display issues with the greater-than and less-than symbols, which is a different problem. Thank you to those who spared the time to take a look.
foundvocab = soup.findAll(text=re.compile('{0}'.format(i)), recursive=True)
for fw in foundvocab:
    fixed_text = fw.replace('{0}'.format(i), '<mark background-color="'+c+'">'+i+'<sup>'+numb+'</sup></mark>')
    fw.replace_with(fixed_text)
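On the remaining greater-than/less-than display issue: replace_with() with a plain string inserts text, so the <mark> markup gets escaped and shows up literally. One common workaround, sketched here against the fixed_text from the snippet above, is to parse the replacement first so real tags are inserted:

from bs4 import BeautifulSoup

# Parse the replacement string so it becomes real markup, not escaped text.
new_markup = BeautifulSoup(fixed_text, 'html.parser')
fw.replace_with(new_markup)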
I've spent the last day and a half (I'm a beginner) trying to work out how to do something simple: fill in an ID code and hit 'search'. I've read the requests docs and scoured Stack Overflow for examples, but none seem to help me.
What I want to do is run through a list of IDs I have and check off which ones are valid and which are invalid. (This is for academic research.)
Here's my code:
import requests

url = 'https://ndber.seai.ie/pass/ber/search.aspx'

# The dictionary key below corresponds to the input's name, found using the
# Firefox inspector in the page source. I have tried using the input's id too,
# with no difference in outcome.
# Enter an arbitrary number of '100000000' - expect to see 'Invalid ID Number' printed.
payload = {'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': '100000000'}

r = requests.post(url, data=payload)
print(r.status_code, r.reason)
print(r.text)

if 'Number.' in r.text:  # 'Number' is an arbitrary word that happens to be in the page if a correct ID code is entered
    print('Valid ID Number')
elif 'No results found.' in r.text:  # as above, 'No results found.' is in the page if an incorrect ID is entered
    print('Invalid ID Number')  # with the above number entered, I would expect this line to be printed
else:
    print('Code Failure')  # this line gets printed
I've tried to comment it as much as possible, so you can see what I'm at.
The output I get is Code Failure
The reason why BeautifulSoup comes into all this is that I asked for help on Reddit's excellent 'learn python' subreddit yesterday. A Redditor generously took the time to explain where I went wrong and what I should do differently. What (s)he suggested (posted as a screenshot, which I can't reproduce here) involves BS4.
So, is the Redditor right, or can this be done in a simple and 'light' way using the requests library?
Cheers!
Make sure to simulate the exact same POST with all of its parameters; you can use Firefox's Firebug to see which parameters are being sent. Below are the parameters I saw:
__EVENTARGUMENT=
__EVENTTARGET=
__EVENTVALIDATION=/wEdAAUFK13s0OHvBcOmXYSKCCYAA5nYY19nD30S2PPC/bgb6yltaY6CwJw6+g9S5X73j5ccPYOXbBJVHSWhVCIjHWYlwa4c5fa9Qdw0vorP2Y9f7xVxOMmbQICW5xMXXERtE7U18lHqXqDZVqJ8R/A9h66g
__SCROLLPOSITIONX=0
__SCROLLPOSITIONY=114
__VIEWSTATE=/wEPDwULLTE2MDEwODU4NjAPFgIeE1ZhbGlkYXRlUmVxdWVzdE1vZGUCARYCZg9kFgICAw9kFgICAw8WAh4FY2xhc3MFC21haW53cmFwcGVyFgQCBQ8PFgIeB1Zpc2libGVnZGQCCQ9kFgICAQ9kFgxmD2QWAgIBD2QWAgIBD2QWAmYPZBYCZg9kFgQCAQ8WAh4JaW5uZXJodG1sZWQCAw9kFgICAg9kFgJmD2QWBAIBD2QWAgIDDw8WCh4EXyFTQgKAAh4MRGVmYXVsdFdpZHRoHB4EVGV4dAUFY3RsMDAeB1Rvb2xUaXAFPlBsZWFzZSBlbnRlciBhIHZhbHVlLCB3aXRoIG5vIHNwZWNpYWwgY2hhcmFjdGVycywgd2l0aCBubyB0ZXh0HgVXaWR0aBxkZAIDD2QWAgIDDw8WCh8EAoACHwUcHwYFCTEwMDAwMDAwMB8HBT5QbGVhc2UgZW50ZXIgYSB2YWx1ZSwgd2l0aCBubyBzcGVjaWFsIGNoYXJhY3RlcnMsIHdpdGggbm8gdGV4dB8IHGRkAgQPFCsAAmQQFgAWABYAFgJmD2QWAgICD2QWAmYPPCsAEQIBEBYAFgAWAAwUKwAAZAIGDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZAIIDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZAIKDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZAIMDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZBgBBTNjdGwwMCREZWZhdWx0Q29udGVudCRCRVJTZWFyY2gkZ3JpZFJhdGluZ3MkZ3JpZHZpZXcPZ2QPHVcTs3MrN3SZqli/KiHgWASd1DRvCLPuS7Rs+GRQAQ==
__VIEWSTATEGENERATOR=1F9CCB97
ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch=Search
ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber=2313131
ctl00$DefaultContent$BERSearch$dfSearch$txtMPRN=11111111111
And the below code just worked:
from __future__ import unicode_literals
import requests

url = 'https://ndber.seai.ie/pass/ber/search.aspx'
payload = {'__EVENTVALIDATION': '/wEdAAUFK13s0OHvBcOmXYSKCCYAA5nYY19nD30S2PPC/bgb6yltaY6CwJw6+g9S5X73j5ccPYOXbBJVHSWhVCIjHWYlwa4c5fa9Qdw0vorP2Y9f7xVxOMmbQICW5xMXXERtE7U18lHqXqDZVqJ8R/A9h66g',
           '__VIEWSTATE': '/wEPDwULLTE2MDEwODU4NjAPFgIeE1ZhbGlkYXRlUmVxdWVzdE1vZGUCARYCZg9kFgICAw9kFgICAw8WAh4FY2xhc3MFC21haW53cmFwcGVyFgQCBQ8PFgIeB1Zpc2libGVnZGQCCQ9kFgICAQ9kFgxmD2QWAgIBD2QWAgIBD2QWAmYPZBYCZg9kFgQCAQ8WAh4JaW5uZXJodG1sZWQCAw9kFgICAg9kFgJmD2QWBAIBD2QWAgIDDw8WCh4EXyFTQgKAAh4MRGVmYXVsdFdpZHRoHB4EVGV4dAUFY3RsMDAeB1Rvb2xUaXAFPlBsZWFzZSBlbnRlciBhIHZhbHVlLCB3aXRoIG5vIHNwZWNpYWwgY2hhcmFjdGVycywgd2l0aCBubyB0ZXh0HgVXaWR0aBxkZAIDD2QWAgIDDw8WCh8EAoACHwUcHwYFCTEwMDAwMDAwMB8HBT5QbGVhc2UgZW50ZXIgYSB2YWx1ZSwgd2l0aCBubyBzcGVjaWFsIGNoYXJhY3RlcnMsIHdpdGggbm8gdGV4dB8IHGRkAgQPFCsAAmQQFgAWABYAFgJmD2QWAgICD2QWAmYPPCsAEQIBEBYAFgAWAAwUKwAAZAIGDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZAIIDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZAIKDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZAIMDxYCHwJoFgQCAQ8WAh8CaGQCAw9kFgJmD2QWAmYPZBYCAgMPZBYCZg9kFgJmD2QWAgIBDxYCHwJoZBgBBTNjdGwwMCREZWZhdWx0Q29udGVudCRCRVJTZWFyY2gkZ3JpZFJhdGluZ3MkZ3JpZHZpZXcPZ2QPHVcTs3MrN3SZqli/KiHgWASd1DRvCLPuS7Rs+GRQAQ==',
           '__VIEWSTATEGENERATOR': '1F9CCB97',
           'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
           'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': '2313131',
           'ctl00$DefaultContent$BERSearch$dfSearch$txtMPRN': '11111111111'}

session = requests.session()
r = requests.post(url, data=payload)

if 'Number.' in r.text:  # 'Number' is an arbitrary word that happens to be in the page if a correct ID code is entered
    print('Valid ID Number')
elif 'No results found.' in r.text:  # as above, 'No results found.' is in the page if an incorrect ID is entered
    print('Invalid ID Number')  # with the above number entered, I would expect this line to be printed
else:
    print('Code Failure')
Just put your own values into the payload dictionary; the rest of the parameters can stay as they are, I think.
BER/DEC Number > ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber
MPRN > ctl00$DefaultContent$BERSearch$dfSearch$txtMPRN
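One caveat worth noting: __VIEWSTATE and __EVENTVALIDATION are generated by ASP.NET on each page load, so hardcoded values may eventually go stale. A sketch of a more robust approach (and where the Redditor's BS4 suggestion fits in) is to fetch the form first, scrape the hidden fields, and post them back along with your search value; the field names come from the question, the rest is an assumption:

import requests
from bs4 import BeautifulSoup

url = 'https://ndber.seai.ie/pass/ber/search.aspx'
session = requests.session()

# Fetch the form once so the hidden ASP.NET fields are fresh for this session.
page = session.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Copy every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...) into the payload.
payload = {tag['name']: tag.get('value', '')
           for tag in soup.find_all('input', type='hidden')}
payload['ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber'] = '100000000'
payload['ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch'] = 'Search'

r = session.post(url, data=payload)
print(r.status_code, r.reason)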
I have written code that gets this returned:
<div id="IncidentDetailContainer"><p>The Fire Service received a call reporting a car on fire at the above location. One fire appliance from Ashburton attended.</p><p>Fire crews confirmed one car well alight and severley damaged by fire. The vehicle was extinguished by fire crews using two breathing apparatus wearers and one hose reel jet. The cause of the fire is still under investigation by the Fire Service and Police.</p><p> </p><p> </p></div>
I want to search through it and find the "Ashburton" part, but so far, no matter what I use, I get None returned, or [].
My question is this: is this a normal string that can be searched (and I'm doing something wrong), or is it because I got it from a web page's source code that I can't search through it the normal way?
It should be simple, I know, but still I get None!
from bs4 import BeautifulSoup
from urllib import request
import sys, traceback

webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")

Links = []
for line in incidents.find_all('a'):
    Links.append("http://www.dsfire.gov.uk/News/Newsdesk/"+line.get('href'))

n = 0
e = len(Links)
if e == n:
    print("No Incidents Found Please Try Later")
    sys.exit(0)

while n < e:
    webpage = request.urlopen(Links[n])
    soup = BeautifulSoup(webpage)
    station = soup.find(id="IncidentDetailContainer")
    # search string
    print(soup.body.findAll(text='Ashburton'))
    n = n+1
Just FYI: if the web page doesn't have any incidents today, it won't search for anything (obviously); that's why I included the returned string above. So if you run it and get nothing, that's why.
Provide a pattern to text=. An exact string like text='Ashburton' matches only text nodes whose entire contents equal 'Ashburton', which is why you get []. From the HTML you have provided, if you want to find the tag that has "Ashburton" in it, you can use something like this (note it needs import re):
soup.find_all('p', text=re.compile(r'Ashburton'))
You can get only the matching text with this:
soup.find_all(text=re.compile(r'Ashburton'))
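To make the difference concrete, here is a minimal sketch against the snippet from the question, showing why the exact-string search comes back empty while the compiled pattern matches:

import re
from bs4 import BeautifulSoup

html = '<p>One fire appliance from Ashburton attended.</p>'
soup = BeautifulSoup(html, 'html.parser')

# Exact match: the whole text node would have to equal 'Ashburton'.
print(soup.find_all(text='Ashburton'))               # []

# Pattern match: 'Ashburton' anywhere inside the text node.
print(soup.find_all(text=re.compile(r'Ashburton')))  # ['One fire appliance from Ashburton attended.']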