altering html and saving html doc

I've been working on this for a while now, but maybe I am just searching for the wrong thing to get the answers I need.
I have a dictionary whose keys are specific words I would like to find in a web page. I would then like to highlight those words and save the resulting HTML into a local file.
EDIT: It occurred to me later that people might like to execute the code themselves. This link includes the word dictionaries and the HTML of the page I am using to test my code, as it should have the most matches of any of the pages I am scanning. Alternatively, you can use the actual website; the link would replace rl[0] in the code.
try:
    # rl[0] refers to a specific URL pulled from a list in another file.
    req = urllib.request.Request(rl[0], None, headers)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPCookieProcessor(cj))
    resp = opener.open(req)
    soup = BeautifulSoup(resp.read(), 'html.parser')
    resp.close()
except urllib.error.HTTPError:
    # HTTPError is a subclass of URLError, so it has to be caught first.
    print("HTTP error when opening " + rl[0])
except urllib.error.URLError:
    print("URL error when opening " + rl[0])
except http.client.HTTPException as err:
    print(err, "HTTP exception error when opening " + rl[0])
except socket.timeout:
    print("connection timed out accessing " + rl[0])
    soup = None
else:
    for l in [wdict1, wdict2, wdict3, wdict4]:
        for i in l:
            foundvocab = soup.find_all(text=re.compile(i))
            for term in foundvocab:
                # c is the highlight color, determined earlier in the script from the dictionary the word came from.
                # numb is defined earlier and refers to another document this script creates.
                fixed = term.replace(i, '<mark background-color="'+c+'">'+i+'<sup>'+numb+'</sup></mark>')
                term.replace_with(fixed)
    # print() needs an open file object, not a bare path.
    with open('path/local.html', 'w', encoding='utf-8') as out:
        print(soup, file=out)
The issue I am having is that when the soup prints, it prints the entire paragraph for each word it finds and does not highlight anything. Alternatively, I can use:
foundvocab = soup.find_all(text=i)
and the resulting HTML file is blank.

OK, the code below finds and replaces what I need. Now I just have display issues with the greater-than and less-than symbols, which is a different problem. Thank you to those who spared the time to take a look.
foundvocab = soup.findAll(text=re.compile('{0}'.format(i)), recursive=True)
for fw in foundvocab:
    fixed_text = fw.replace('{0}'.format(i), '<mark background-color="'+c+'">'+i+'<sup>'+numb+'</sup></mark>')
    fw.replace_with(fixed_text)
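
The greater-than and less-than display issue most likely comes from passing a plain string to replace_with: BeautifulSoup treats it as text, so the angle brackets are escaped to &lt; and &gt; when the soup is written out. One possible way around it is to build real tags and insert them around the matches. The sketch below assumes that approach; the function name highlight_word and the sample markup are made up for illustration, and it uses a style attribute because background-color is not an HTML attribute on its own.

```python
import re
from bs4 import BeautifulSoup, NavigableString

def highlight_word(soup, word, color, note):
    """Wrap every occurrence of word in a <mark> tag followed by a <sup> note."""
    pattern = re.compile(re.escape(word))
    # Collect the matching text nodes first, because the tree changes in the loop.
    for node in list(soup.find_all(string=pattern)):
        if node.parent.name in ("script", "style"):
            continue  # leave non-visible text alone
        text = str(node)
        pos = 0
        for match in pattern.finditer(text):
            if match.start() > pos:
                # Plain text between matches stays plain text.
                node.insert_before(NavigableString(text[pos:match.start()]))
            mark = soup.new_tag("mark", style="background-color: " + color)
            mark.string = match.group(0)
            sup = soup.new_tag("sup")
            sup.string = note
            mark.append(sup)
            node.insert_before(mark)
            pos = match.end()
        if pos < len(text):
            node.insert_before(NavigableString(text[pos:]))
        node.extract()  # drop the original, un-highlighted text node

# Made-up sample markup, standing in for the downloaded page.
soup = BeautifulSoup("<p>red apple and apple pie</p>", "html.parser")
highlight_word(soup, "apple", "yellow", "1")
print(soup)
```

To save the result, write str(soup) to a file opened with open(..., "w", encoding="utf-8").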

Related

web scraping help needed

I was wondering if someone could help me put together some code for
https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L
I currently use this code to scrape the current price
currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text
This works fine, but I occasionally get an error. I'm not really sure why, as the links are all correct, so I would like to try to get the price again with something like:
try:
    currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text
except Exception:
    currentPriceData = soup.find('span', {'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})[0].text
The problem is that I can't get it to scrape the number using this method. Any help would be greatly appreciated.

The data is embedded in the page as a JavaScript variable, and you can use the json module to parse it.
For example:
import re
import json
import requests
url = 'https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L'
html_data = requests.get(url).text
#the next line extracts from the HTML source javascript variable
#that holds all data that is rendered on page.
#BeautifulSoup cannot run Javascript, so we are going to use
#`json` module to extract the data.
#NOTE: When you view source in Firefox/Chrome, you can search for
# `root.App.main` to see it.
data = json.loads(re.search(r'root\.App\.main = ({.*?});\n', html_data).group(1))
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# We now have the Javascript variable extracted to standard python
# dict, so now we just print contents of some keys:
price = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['regularMarketPrice']['fmt']
currency_symbol = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['currencySymbol']
print('{} {}'.format(price, currency_symbol))
Prints:
227.30 £
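
Since the original problem was an occasional error, it may help to wrap the extraction so that a missing or malformed variable gives None instead of an exception. This is only a sketch under assumptions: the function name extract_app_data is made up, and SAMPLE_HTML is a tiny invented stand-in for the real page source, whose structure is much deeper.

```python
import json
import re

# Invented stand-in for the page source; the real page is far larger.
SAMPLE_HTML = (
    "<script>\n"
    'root.App.main = {"context": {"price": {"fmt": "227.30"}}};\n'
    "</script>\n"
)

def extract_app_data(html_text):
    """Return the embedded JSON object as a dict, or None if it cannot be read."""
    match = re.search(r"root\.App\.main = ({.*?});\n", html_text)
    if match is None:
        return None  # the variable is not in this page
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # the variable was found but is not valid JSON

data = extract_app_data(SAMPLE_HTML)
if data is not None:
    print(data["context"]["price"]["fmt"])
```

The caller can then retry the request or skip the ticker when the function returns None.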

Unable to figure out where to add wait statement in python selenium

I am searching for the elements of my list one by one by typing each into the search bar of a website, then printing the names of the Apple products that appear in the search results. However, I am getting the following exception:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
I know this happens because the elements change very quickly, so I need to add a wait such as
wait(driver, 10).until(EC.visibility_of_element_located((By.ID, "submitbutton")))
or explicitly.
Q1. But I don't understand where I should add it. Here is my code. Please help!
Q2. I want to go to all the next pages using the following, but it's not working:
driver.find_element_by_xpath('//div[@class="no-hover"]/a').click()
Earlier the exception was raised at the submit button, and now it is raised at the if statement.

That's not what implicit wait is for. Since the page changes regularly, you can't be sure whether the object held in the variable is still valid.
My suggestion is to run the above code in a loop using try/except. Something like the following:
for element in mylist:
    ok = False
    while True:
        try:
            do_something_useful_while_the_page_can_change(element)
        except StaleElementReferenceException:
            # retry
            continue
        else:
            # go to next element
            break
Where:
def do_something_useful_while_the_page_can_change(element):
    searchElement = driver.find_element_by_id("searchbar")
    searchElement.send_keys(element)
    driver.find_element_by_id("searchbutton").click()
    items_count = 0
    items = driver.find_elements_by_class_name('searchresult')
    for i, item in enumerate(items):
        if 'apple' in item.text:
            # Print the element's text, not the literal string 'item.text'.
            print(item.text)
    items_count += len(items)
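
One caveat with the while True loop is that it retries forever if the element never settles. A bounded variant could look like the sketch below. StaleElementError and retry_on_stale are made-up names so the example runs without Selenium; in real code you would catch selenium.common.exceptions.StaleElementReferenceException instead.

```python
class StaleElementError(Exception):
    """Stand-in for Selenium's StaleElementReferenceException (illustration only)."""

def retry_on_stale(action, max_attempts=3):
    """Call action() until it succeeds, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except StaleElementError:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller see the error

# Fake action that fails twice before succeeding, to show the retry behaviour.
calls = {"count": 0}

def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise StaleElementError("element went stale")
    return "search finished"

print(retry_on_stale(flaky_search))
```

In the loop above, the action would be a lambda or small function that calls do_something_useful_while_the_page_can_change(element).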

I think what you had was doing too much and can be simplified. You basically need to loop through a list of search terms, mylist. Inside that loop, you send the search term to the search box and click search. Still inside the loop, you grab all the elements on the page that are search results (class='search-result-product-url') and whose text contains 'apple'. The XPath locator I provided does both, so every element in the returned collection is one you want to print; print each one, then move on to the next search term.
for element in mylist:
    driver.find_element_by_id("search-input").send_keys(element)
    driver.find_element_by_id("button-search").click()
    # may need a wait here?
    for item in driver.find_elements_by_xpath("//a[@class='search-result-product-url'][contains(., 'apple')]"):
        print(item.text)

Finding Hyperlinks in Python 3

So my assignment is to write a program that reads all data from a web page and prints all hyperlinks of the form:
link text
I've gotten this far with the help of my instructor.
from urllib.request import urlopen
def findTitle(webpage):
    encoding = "utf-8"
    for webpagestr in webpage:
        webpagestr = str(webpagestr,encoding)
        if "<title>" in webpagestr:
            indexstart = webpagestr.find("<title>")
            indexend = webpagestr.find("</title>")
            title = webpagestr[indexstart+7:indexend]
            return title
        return title
def H1headings(webpage):
    encoding = "utf-8"
    for webpagestr in webpage:
        webpagestr = str(webpagestr,encoding)
        if "<h1>" in webpagestr:
            indexstart = webpagestr.find("<h1>")
            indexend = webpagestr.find("</h1>")
            heading = webpagestr[indexstart+4:indexend]
            print(heading)
def main():
    address = input("Enter URL to find title and more information: ")
    try:
        webpage = urlopen(address)
        title = findTitle(webpage)
        print("Title is", title)
        H1headings(webpage)
        webpage.close()
    except Exception as exceptObj:
        print("Error: ", str(exceptObj))
main()
When I run this program, it lets me input a URL, but then it gives me:
Error: local variable 'title' referenced before assignment
I'm not sure what it means.
Then, in one of my attempts, I placed a default at the top of the function:
def findTitle(webpage):
    title = "Not Found"
    encoding = "utf-8"
When I ran the program, it gave me:
Enter URL to find title and more information: http://jeremycowart.com
Title is Not Found
Jeremy Cowart
It's what I'm looking for, but I believe I'm supposed to have the title and heading as well as the link text.
I've gotten close, but I just can't figure it out. Any help would be appreciated!

There are at least a couple of problems I see.
In your function findTitle, you always return the variable title, even if you haven't found one and therefore haven't assigned anything to it. So when the code fails to find a title on the web page (whether because there isn't one or because your code missed it), it returns a variable that hasn't been created yet.
First, you should fix your code so it doesn't do that. You can either change the final return statement to return a default value, e.g. "Not found", or set title to the default value at the top of your function, which guarantees it will exist no matter when it is returned, e.g.:
def foo(x):
    y = -1
    # do stuff, maybe re-setting y
    return y
That is the pattern I tend to prefer.
There is another problem though, which is preventing you from properly detecting the title. If you step through your program you should see it. Essentially, you start looping, get the first line, check to see if it has a title; if it does you return it (good), if it doesn't you skip the if, but then you return it anyway (not so good) and you never get to the second line.
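
Putting both fixes together might look like the sketch below: the default is set first, and the return sits after the loop so that every line gets checked. The name find_title and the sample lines are made up for illustration, and the sketch assumes the opening and closing title tags are on the same line, as the original code does.

```python
def find_title(lines, encoding="utf-8"):
    """Return the text between <title> and </title>, or "Not Found"."""
    title = "Not Found"  # default, so the name always exists
    for raw_line in lines:
        line = str(raw_line, encoding)
        if "<title>" in line:
            start = line.find("<title>") + len("<title>")
            end = line.find("</title>", start)
            if end == -1:
                end = len(line)  # closing tag missing on this line
            title = line[start:end].strip()
            break  # stop at the first title found
    # Returning here, outside the loop, lets the loop reach every line.
    return title

# Byte lines standing in for what urlopen() yields when iterated.
sample_page = [
    b"<html>\n",
    b"<head><title>Jeremy Cowart</title></head>\n",
    b"<body><h1>Jeremy Cowart</h1></body>\n",
]
print("Title is", find_title(sample_page))
```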

Code worked initially but no longer works, uses "split" command, and basic string handling

I have a piece of code that is supposed to give a preset response when the user says "screen". I got this code to work, but when I came back to it later, without making any changes, it stopped working. Here is the code:
file = open('phone_help_answers.txt',"r")
lines = file.readlines()
key_words_screen_1 = "screen"
user_input=input("What is your issue?")
string=user_input
for i in string.split(' '):
    if i == key_words_screen_1:
        def read():
            with open('phone_help_answers.txt', 'r') as file:
                print (lines[3])
Additionally, certain letters will give me other preset responses.

I have found a solution: I just removed the "def read()" and left only the "print (lines[3])".
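
The fix works because a def statement only defines a function; nothing is printed until the function is called. A slightly more forgiving version of the keyword check might look like the sketch below, which also ignores case and trailing punctuation. The function name find_response and the RESPONSES dictionary are made up for illustration and stand in for the lines read from phone_help_answers.txt.

```python
# Hypothetical keyword-to-answer table, standing in for the answers file.
RESPONSES = {
    "screen": "Try restarting the phone and checking the display settings.",
}

def find_response(user_input, responses):
    """Return the preset answer for the first keyword found, or None."""
    # split() with no argument also copes with repeated spaces.
    for word in user_input.lower().split():
        word = word.strip(".,!?")  # "screen," should still match "screen"
        if word in responses:
            return responses[word]
    return None

print(find_response("My Screen, is cracked", RESPONSES))
```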

Issue with searching for text within a returned string

I have written code that returns this:
<div id="IncidentDetailContainer"><p>The Fire Service received a call reporting a car on fire at the above location. One fire appliance from Ashburton attended.</p><p>Fire crews confirmed one car well alight and severley damaged by fire. The vehicle was extinguished by fire crews using two breathing apparatus wearers and one hose reel jet. The cause of the fire is still under investigation by the Fire Service and Police.</p><p> </p><p> </p></div>
I want to search through it and find the "Ashburton" part, but so far no matter what I use I get none returned or [].
My question is this: is this a normal string that can be searched (and I'm doing something wrong) or is it because I have got it from a webpage source code that I can't search through it the normal way?
It should be simple, I know, but still I get none!
from bs4 import BeautifulSoup
from urllib import request
import sys, traceback
webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")
Links = []
for line in incidents.find_all('a'):
    Links.append("http://www.dsfire.gov.uk/News/Newsdesk/"+line.get('href'))
n = 0
e = len(Links)
if e == n:
    print("No Incidents Found Please Try Later")
    sys.exit(0)
while n < e:
    webpage = request.urlopen(Links[n])
    soup = BeautifulSoup(webpage)
    station = soup.find(id="IncidentDetailContainer")
    #search string
    print(soup.body.findAll(text='Ashburton'))
    n=n+1
Just FYI, if the webpage doesn't have any incidents today, it won't search for anything (obviously). That's why I included the returned string, so if you run it and get nothing, that's why.

Pass a pattern to the text= argument. A plain string such as text='Ashburton' only matches text that is exactly 'Ashburton', which is why you get []. From the HTML you have provided, if you want to find the tag that has "Ashburton" in it, you can use something like this (it needs import re):
soup.find_all('p', text=re.compile(r'Ashburton'))
You can get only the text with this:
soup.find_all(text=re.compile(r'Ashburton'))
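
The difference between the two kinds of match can be seen on a shortened copy of the HTML from the question. This is only a sketch: it uses the string= argument, which newer Beautiful Soup versions use as the name for text=, and the variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the markup quoted in the question.
html = (
    '<div id="IncidentDetailContainer">'
    "<p>One fire appliance from Ashburton attended.</p>"
    "<p>Fire crews confirmed one car well alight.</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so nothing matches here.
exact = soup.find_all(string="Ashburton")

# A compiled pattern is searched for anywhere inside each text node.
partial = soup.find_all(string=re.compile(r"Ashburton"))

# Adding a tag name returns the enclosing <p> tags instead of the bare text.
tags = soup.find_all("p", string=re.compile(r"Ashburton"))

print(exact)
print(partial)
print(tags)
```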
