So here is my problem. I am trying to output my scraping results in a GUI using tkinter in Python. The code I use works in the shell, but when I use it with tkinter it doesn't. Here is my code.
import sys
from tkinter import *
from urllib.request import urlopen
import re
def stockSearch():
    searchTerm = userInput.get()
    url = "http://finance.yahoo.com/q?s="+searchTerm+"&q1=1"
    htmlfile = urlopen(url)
    htmltext = str(htmlfile.read())
    regex = '<span id="yfs_l84_'+searchTerm+'">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    outputStock = str(["The price of ", searchTerm, "is ", price])
    sLabel2 = Label(sGui, text=outputStock).pack()
sGui = Tk()
userInput = StringVar()
sGui.geometry("450x450+200+200")
sGui.title("Stocks")
sLabel = Label(sGui, text="Stocks List", fg="black")
sLabel.pack()
sButton = Button(sGui, text="LookUp", command = stockSearch)
sButton.place(x=200, y=400)
uEntry = Entry(sGui, textvariable=userInput).pack()
sGui.mainloop()
If I input a search for Google (GOOG), for example, I get back this:
"The price of GOOG is []"
However, if I use the same code but print the result in a shell instead of using tkinter, I get the price as expected.
Any ideas, anyone?
It appears your code isn't handling case properly. If you search for "goog", the value shows up. The problem is this line:
regex = '<span id="yfs_l84_'+searchTerm+'">(.+?)</span>'
If you type "GOOG", the regex becomes:
<span id="yfs_l84_GOOG">(.+?)</span>
However, the HTML that is returned doesn't contain that pattern (the id in the page uses the lowercase ticker). Doing a case-insensitive search should solve that problem:
pattern = re.compile(regex, flags=re.IGNORECASE)
Also, there's no need to create a new Label every time -- you can create the label once and then change the text each time you do a lookup.
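For example, a minimal sketch of that approach, reusing the lookup code from the question (resultLabel is just an illustrative name):
resultLabel = Label(sGui, text="")   # created once, alongside the other widgets
resultLabel.pack()

def stockSearch():
    searchTerm = userInput.get()
    url = "http://finance.yahoo.com/q?s=" + searchTerm + "&q1=1"
    htmltext = str(urlopen(url).read())
    pattern = re.compile('<span id="yfs_l84_' + searchTerm + '">(.+?)</span>', flags=re.IGNORECASE)
    price = re.findall(pattern, htmltext)
    # update the existing label instead of creating a new one on every click
    resultLabel.config(text="The price of {} is {}".format(searchTerm, price))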
I dug up my old code that I used for scraping Scopus; it was written while I was still learning programming. Now a window pops up on the Scopus site that I can't detect using window_handles.
Scopus Pop-up window
import openpyxl
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
import time
import pandas as pd
from openpyxl import load_workbook
DOI = []
TITLE = []
NUM_AUTHORS = []
NUM_AFFILIATIONS = []
AUTHORS = []
YEAR = []
JOURNAL = []
DOCUMENT_TYPE = []
COUNTRIES = []
SEARCH = []
DATES = []
chrome_driver_path = "C:\Development\chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver_path)
driver.get("https://www.scopus.com/search/form.uri?display=basic#basic")
# searching details
search = input("Search documents: \n")
SEARCH.append(search)
date = input("Do you want to specify dates?(Yes/No)")
if date.capitalize() == "Yes":
    driver.find_element(By.CLASS_NAME, 'flex-grow-1').send_keys(search)
    driver.find_element(By.XPATH,
                        "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                        "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel[1]/div/form/div[2]/div[1]/button["
                        "2]/span[2]").click()
    time.sleep(1)
    starting_date = input("Put starting year.")
    to_date = input("Put end date.")
    DATES.append(starting_date)
    DATES.append(to_date)
    drop_menu_from = Select(driver.find_element(By.XPATH,
                            "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                            "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel["
                            "1]/div/form/div[2]/div[1]/els-select/div/label/select"))
    drop_menu_from.select_by_visible_text(starting_date)
    drop_menu_to = Select(driver.find_element(By.XPATH,
                          "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                          "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel["
                          "1]/div/form/div[2]/div[2]/els-select/div/label/select"))
    drop_menu_to.select_by_visible_text(to_date)
    driver.find_element(By.XPATH,
                        '/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div['
                        '2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel[1]/div/form/div[4]/div['
                        '2]/button/span[1]').click()
else:
    DATES = ["XXX", "YYY"]
    driver.find_element(By.CLASS_NAME, 'flex-grow-1').send_keys(search)
    driver.find_element(By.XPATH,
                        "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                        "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel[1]/div/form/div[2]/div["
                        "2]/button").click()
time.sleep(2)
doc_num = int(driver.find_element(By.XPATH,
              "/html/body/div[1]/div/div[1]/div/div/div[3]/form/div[1]/div/header/h1/span[1]").text.replace(",", ""))
time.sleep(5)
driver.find_element((By.XPATH, "/html/body/div[11]/div[2]/div[1]/div[4]/div/div[2]/button")).click()
This is what the beginning of the code looks like. The last element,
driver.find_element((By.XPATH, "/html/body/div[11]/div[2]/div[1]/div[4]/div/div[2]/button")).click()
should find and click the dismiss button, but I do not know how to handle it.
I have tried finding the element with driver.find_element and checking whether the pop-up window can be detected and handled via window_handles.
Actually, that is not a popup, because its code is contained in the HTML of the page itself. Popups are either browser prompts (not contained in the HTML) or separate browser windows (with their own HTML).
I suggest targeting the button by the text it contains; in this case we look for a button whose text is exactly "Dismiss":
driver.find_element(By.XPATH, '//button[text()="Dismiss"]').click()
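If the banner takes a moment to appear after the page loads, an explicit wait is usually more reliable than a fixed sleep. A minimal sketch, assuming the same "Dismiss" text (the 10-second timeout is arbitrary):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the Dismiss button is clickable, then click it
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//button[text()="Dismiss"]'))
).click()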
I'm following a guide, and it says to print the first item from an HTML document that contains the dollar sign.
It seems to do that correctly: it outputs a price to the terminal, and that price is actually present on the webpage. However, I don't want just that single listing; I want all of the listings printed to the terminal.
I'm almost positive this could be done with a for loop, but I don't know how to set it up correctly. Here's the code I have so far, with a comment on line 14 and the code I'm asking about on line 15.
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
#Print all prices instead of just the specified number?
parent = prices[0].parent
strong = parent.find("strong")
print(strong.string)
You could try the following:
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
for price in prices:
    parent = price.parent
    strong = parent.find("strong")
    print(strong.string)
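If some of the matched text nodes on the page don't have a strong tag next to them, find returns None and strong.string would raise an AttributeError. A small, purely defensive variation guards against that, in case Newegg's markup varies:
for price in prices:
    strong = price.parent.find("strong")
    if strong is not None:  # skip matches that don't carry a price value
        print(strong.string)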
I'm learning Python and decided to adapt code from an example to scrape Craigslist data to look at prices of cars: https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
I created a Jupyter notebook and modified the code for my use. I can reproduce the same error when running the code in Spyder with Python 3.7.
I'm running into an issue at line 116:
File "C:/Users/UserM/Documents/GitHub/learning/Spyder Python Craigslist Scrape Untitled0.py", line 116
post_prices.append(post_price)
I receive a "SyntaxError: invalid syntax".
Any help appreciated. Thanks.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 2 12:26:06 2019
"""
#import get to call a get request on the site
from requests import get
#get the first page of the Chicago car prices
response = get('https://chicago.craigslist.org/search/cto?bundleDuplicates=1') #eliminate duplicates and show owner only sales
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_= 'result-row')
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 (elements/page
#grab the first post
post_one = posts[0]
#grab the price of the first post
post_one_price = post_one.a.text
post_one_price.strip()
#grab the time of the post in datetime format to save on cleaning efforts
post_one_time = post_one.find('time', class_= 'result-date')
post_one_datetime = post_one_time['datetime']
#title is a and that class, link is grabbing the href attribute of that variable
post_one_title = post_one.find('a', class_='result-title hdrlnk')
post_one_link = post_one_title['href']
#easy to grab the post title by taking the text element of the title variable
post_one_title_text = post_one_title.text
#the neighborhood is grabbed by finding the span class 'result-hood' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-hood').text
#the price is grabbed by finding the span class 'result-price' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-price').text
#build out the loop
from time import sleep
import re
from random import randint #avoid throttling by not sending too many requests one after the other
from warnings import warn
from time import time
from IPython.core.display import clear_output
import numpy as np
#find the total number of posts to find the limit of the pagination
results_num = html_soup.find('div', class_= 'search-legend')
results_total = int(results_num.find('span', class_='totalcount').text) #pulled the total count of posts as the upper bound of the pages array
#each page has 119 posts so each new page is defined as follows: s=120, s=240, s=360, and so on. So we need to step in size 120 in the np.arange function
pages = np.arange(0, results_total+1, 120)
iterations = 0
post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []
for page in pages:
    #get request
    response = get("https://chicago.craigslist.org/search/cto?bundleDuplicates=1"
                   + "s=" #the parameter for defining the page number
                   + str(page) #the page number in the pages array from earlier
                   + "&hasPic=1"
                   + "&availabilityMode=0")
    sleep(randint(1,5))
    #throw warning for status codes that are not 200
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    #define the html text
    page_html = BeautifulSoup(response.text, 'html.parser')
    #define the posts
    posts = html_soup.find_all('li', class_= 'result-row')
    #extract data item-wise
    for post in posts:
        if post.find('span', class_ = 'result-hood') is not None:
            #posting date
            #grab the datetime element 0 for date and 1 for time
            post_datetime = post.find('time', class_= 'result-date')['datetime']
            post_timing.append(post_datetime)
            #neighborhoods
            post_hood = post.find('span', class_= 'result-hood').text
            post_hoods.append(post_hood)
            #title text
            post_title = post.find('a', class_='result-title hdrlnk')
            post_title_text = post_title.text
            post_title_texts.append(post_title_text)
            #post link
            post_link = post_title['href']
            post_links.append(post_link)
            #removes the \n whitespace from each side, removes the currency symbol, and turns it into an int
            #test removed: post_price = int(post.a.text.strip().replace("$", ""))
            post_price = int(float((post.a.text.strip().replace("$", ""))) #does this work??
            post_prices.append(post_price)
    iterations += 1
    print("Page " + str(iterations) + " scraped successfully!")
print("\n")
print("Scrape complete!")
import pandas as pd
eb_apts = pd.DataFrame({'posted': post_timing,
                        'neighborhood': post_hoods,
                        'post title': post_title_texts,
                        'URL': post_links,
                        'price': post_prices})
print(eb_apts.info())
eb_apts.head(10)
Welcome to Stack Overflow. Usually, when you see syntax errors in previously working code, it means you've either messed up the indentation, forgotten to terminate a string somewhere, or missed a closing bracket.
You can tell this is happening when a line of otherwise fine-looking code throws a syntax error: the line before it wasn't ended properly, and the interpreter is giving you a hint about where to look.
In this case, you're short a parenthesis on the line before.
post_price = int(float((post.a.text.strip().replace("$", "")))
should be
post_price = int(float((post.a.text.strip().replace("$", ""))))
or delete the extra parenthesis after float:
post_price = int(float(post.a.text.strip().replace("$", "")))
I'm trying to make an RSS reader with a GUI in Python 3.7. Everything is almost working, but there is a problem: instead of each line pointing to its own link, every line opens the exact same link.
What can I do to solve this?
The Code:
import feedparser
from tkinter import *
import webbrowser
def callback(event):
    webbrowser.open_new(article_link)
feed = feedparser.parse("http://www.widgeti.co.il/feed/")
feed_title = feed['feed']['title']
feed_entries = feed.entries
root = Tk()
text = Text(root)
for entry in feed.entries:
    article_title = entry.title
    article_link = entry.link
    article_published_at = entry.published # Unicode string
    article_published_at_parsed = entry.published_parsed # Time object
    article_author = entry.author
    content = entry.summary
    article_tags = entry.tags
    print("{}[{}]".format(article_title, article_link))
    print("Published at {}".format(article_published_at))
    print("Published by {}".format(article_author))
    print("Content {}".format(content))
    print("catagory{}".format(article_tags))
    link = Label(root, text="{}\n".format(article_title), fg="blue", cursor="hand2")
    link.pack()
    link.bind("<Button-1>", callback)
text.tag_config("here")
root.mainloop()
You are not passing the link as an argument to the callback function, so when callback is triggered it uses the current value of article_link, which is the link of the last entry. To fix this, you can pass the link to callback using a lambda in the for loop:
def callback(event, article_link):
    webbrowser.open_new(article_link)

# ...

for entry in feed.entries:
    # ...
    link.bind("<Button-1>", lambda event, link=article_link: callback(event, link))
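The link=article_link default argument is what captures the current value of article_link at the moment the lambda is created; without it, every callback would look up the loop variable only after the loop has finished. If you prefer, functools.partial achieves the same thing; just a sketch of that alternative inside the same loop:
from functools import partial

for entry in feed.entries:
    # ... create the label as before ...
    link.bind("<Button-1>", partial(callback, article_link=entry.link))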
I am trying to download the worksheets for this workout. All the workouts are split across different days, and all that needs to change is the number at the end of the link. Here is my code.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
theurl = "http://www.muscleandfitness.com/workouts/workout-routines/gain-10-pounds-muscle-4-weeks-1?day="
urls = []
count = 1
while count < 29:
    urls.append(theurl + str(count))
    count += 1
print(urls)
for url in urls:
    thepage = urllib
    thepage = urllib.request.urlopen(urls)
    soup = BeautifulSoup(thepage, "html.parser")
    init_data = open('/Users/paribaker/Desktop/scrapping/workout/4weekdata.txt', 'a')
    workout = []
    for data_all in soup.findAll('div', {'class': "b-workout-program-day-exercises"}):
        try:
            for item in data_all.findAll('div', {'class': "b-workout-part--item"}):
                for desc in item.findAll('div', {'class': "b-workout-part--description"}):
                    workout.append(desc.find('h4', {'class': "b-workout-part--exercise-count"}).text.strip("\n") + ",\t")
                    workout.append(desc.find('strong', {'class': "b-workout-part--promo-title"}).text + ",\t")
                    workout.append(desc.find('span', {'class': "b-workout-part--equipment"}).text + ",\t")
                for instr in item.findAll('div', {'class': "b-workout-part--instructions"}):
                    workout.append(instr.find('div', {'class': "b-workout-part--instructions--item workouts-sets"}).text.strip("\n") + ",\t")
                    workout.append(instr.find('div', {'class': "b-workout-part--instructions--item workouts-reps"}).text.strip("\n") + ",\t")
                    workout.append(instr.find('div', {'class': "b-workout-part--instructions--item workouts-rest"}).text.strip("\n"))
                workout.append("\n*3")
        except AttributeError:
            pass
    init_data.write("".join(map(lambda x: str(x), workout)))
    init_data.close
The problem is that the server times out. I'm assuming it's not iterating through the list properly, or it's adding characters I don't need and crashing the server's parser.
I have also tried writing another script that grabs all the links and puts them in a text file, then reopening the text file in this script and iterating through it, but that gave me the same error. What are your thoughts?
There's a typo here:
thepage = urllib.request.urlopen(urls)
You probably wanted:
thepage = urllib.request.urlopen(url)
Otherwise you are trying to open the whole list of URLs rather than a single one.
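With that fixed, if the site still drops connections when you fire off all 28 requests back to back, adding a short pause between iterations sometimes helps. A minimal sketch of the corrected loop (the one-second delay is arbitrary):
import time

for url in urls:
    thepage = urllib.request.urlopen(url)   # url, not urls
    soup = BeautifulSoup(thepage, "html.parser")
    # ... parse the workout sections as before ...
    time.sleep(1)   # be gentle with the server between requests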