Python 3.7 Issue with append function - python-3.x

I'm learning Python and decided to adapt code from an example to scrape Craigslist data and look at car prices: https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
I created a Jupyter notebook and modified the code for my use. I can reproduce the same error when running the code in Spyder with Python 3.7.
I'm running into an issue at line 116:
File "C:/Users/UserM/Documents/GitHub/learning/Spyder Python Craigslist Scrape Untitled0.py", line 116
post_prices.append(post_price)
I receive a "SyntaxError: invalid syntax".
Any help appreciated. Thanks.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 2 12:26:06 2019
"""
#import get to call a get request on the site
from requests import get
#get the first page of the Chicago car prices
response = get('https://chicago.craigslist.org/search/cto?bundleDuplicates=1') #eliminate duplicates and show owner only sales
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_= 'result-row')
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 elements per page
#grab the first post
post_one = posts[0]
#grab the price of the first post
post_one_price = post_one.a.text
post_one_price.strip()
#grab the time of the post in datetime format to save on cleaning efforts
post_one_time = post_one.find('time', class_= 'result-date')
post_one_datetime = post_one_time['datetime']
#title is a and that class, link is grabbing the href attribute of that variable
post_one_title = post_one.find('a', class_='result-title hdrlnk')
post_one_link = post_one_title['href']
#easy to grab the post title by taking the text element of the title variable
post_one_title_text = post_one_title.text
#the neighborhood is grabbed by finding the span class 'result-hood' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-hood').text
#the price is grabbed by finding the span class 'result-price' and pulling the text element from that
post_one_price = posts[0].find('span', class_='result-price').text
#build out the loop
from time import sleep
import re
from random import randint #avoid throttling by not sending too many requests one after the other
from warnings import warn
from time import time
from IPython.core.display import clear_output
import numpy as np
#find the total number of posts to find the limit of the pagination
results_num = html_soup.find('div', class_= 'search-legend')
results_total = int(results_num.find('span', class_='totalcount').text) #pulled the total count of posts as the upper bound of the pages array
#each page has 120 posts so each new page is defined as follows: s=120, s=240, s=360, and so on. So we need to step in size 120 in the np.arange function
pages = np.arange(0, results_total+1, 120)
iterations = 0
post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []
for page in pages:
    #get request
    response = get("https://chicago.craigslist.org/search/cto?bundleDuplicates=1"
                   + "s=" #the parameter for defining the page number
                   + str(page) #the page number in the pages array from earlier
                   + "&hasPic=1"
                   + "&availabilityMode=0")
    sleep(randint(1,5))
    #throw warning for status codes that are not 200
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    #define the html text
    page_html = BeautifulSoup(response.text, 'html.parser')
    #define the posts
    posts = html_soup.find_all('li', class_= 'result-row')
    #extract data item-wise
    for post in posts:
        if post.find('span', class_ = 'result-hood') is not None:
            #posting date
            #grab the datetime element 0 for date and 1 for time
            post_datetime = post.find('time', class_= 'result-date')['datetime']
            post_timing.append(post_datetime)
            #neighborhoods
            post_hood = post.find('span', class_= 'result-hood').text
            post_hoods.append(post_hood)
            #title text
            post_title = post.find('a', class_='result-title hdrlnk')
            post_title_text = post_title.text
            post_title_texts.append(post_title_text)
            #post link
            post_link = post_title['href']
            post_links.append(post_link)
            #removes the \n whitespace from each side, removes the currency symbol, and turns it into an int
            #test removed: post_price = int(post.a.text.strip().replace("$", ""))
            post_price = int(float((post.a.text.strip().replace("$", ""))) #does this work??
            post_prices.append(post_price)
    iterations += 1
    print("Page " + str(iterations) + " scraped successfully!")
    print("\n")
print("Scrape complete!")
import pandas as pd
eb_apts = pd.DataFrame({'posted': post_timing,
                        'neighborhood': post_hoods,
                        'post title': post_title_texts,
                        'URL': post_links,
                        'price': post_prices})
print(eb_apts.info())
eb_apts.head(10)

Welcome to Stack Overflow. When you see syntax errors in previously working code, it usually means you've messed up the indentation, forgotten to terminate a string somewhere, or missed a closing bracket.
You can tell this is the case when a line of apparently fine code throws a syntax error: the line before it wasn't ended properly, and the interpreter is only giving you a hint about where to look.
In this case, you're short one parenthesis on the line before:
post_price = int(float((post.a.text.strip().replace("$", "")))
should be
post_price = int(float((post.a.text.strip().replace("$", ""))))
or, deleting the redundant parenthesis after float:
post_price = int(float(post.a.text.strip().replace("$", "")))
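To see why Python points at line 116 rather than the line that is actually broken, here is a tiny stand-alone reproduction (the literal value is made up; it is not from the scraper):
post_prices = []
post_price = int(float(("5.00".strip()))   #one ")" short, mirroring the line in the question
post_prices.append(post_price)             #Python 3.7 reports "SyntaxError: invalid syntax" here
Python keeps reading past the unclosed parenthesis, only stumbles when it reaches the next statement, and reports the error there; more recent Python versions point at the unclosed "(" directly.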

Related

How Can I Assign A Variable To All Of The Items In A List?

I'm following a guide, and it says to print the first item from an HTML document that contains the dollar sign.
It seems to do that correctly, outputting a price to the terminal that is actually present on the webpage. However, I don't want just that single listing; I want to print all of the listings to the terminal.
I'm almost positive this could be done with a for loop, but I don't know how to set it up correctly. Here's the code I have so far, with a comment on line 14 and the code I'm asking about on line 15.
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
#Print all prices instead of just the specified number?
parent = prices[0].parent
strong = parent.find("strong")
print(strong.string)
You could try the following:
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
for price in prices:
    parent = price.parent
    strong = parent.find("strong")
    print(strong.string)
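If any of the matched text nodes has a parent without a <strong> child, find() returns None and strong.string raises an AttributeError. A small guard (a sketch, not something the question strictly requires) skips those entries:
for price in prices:
    strong = price.parent.find("strong")
    if strong is not None:  #skip results that have no visible price
        print(strong.string)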

How to ignore text within one class while extracting data from website Python

I am trying to extract comments from the website, and whenever a comment is a reply, the quoted previous post is included in the extracted comment. I am trying to ignore those quoted posts while extracting.
url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
comments_lst= soup.findAll('div',attrs={"class":"ism-true"})
comments =[]
for item in comments_lst:
    result = [item.get_text(strip=True, separator=" ")]
    comments.append(result)
quotes = []
for item in soup.findAll('div',attrs={"class":"panel alt2"}):
    result = [item.get_text(strip=True, separator=" ")]
    quotes.append(result)
For the final result I do not want data from the quotes list to be included in my comments. I tried using an if condition, but it gives an incorrect result.
For example, comments[6] gives the result below:
'Quote: Originally Posted by jeff_the_pilot What the difference between adaptive cruise control on 2018 versus 2017? I believe mine brakes if I encroach another vehicle. It will work in stop and go traffic!'
My expected result:
It will work in stop and go traffic!
You need to add some logic to strip out the text contained in divs with class panel alt2 (this uses an assignment expression, so it needs Python 3.8+):
comments = []
for item in comments_lst:
    result = [item.get_text(strip=True, separator=" ")]
    if div := item.find('div', class_="panel alt2"):
        #keep only the text that comes after the quoted block
        result[0] = ' '.join(result[0].split(div.text.split()[-1])[1:])
    comments.append(result)
>>> comments[6]
[' It will work in stop and go traffic!']
This will get all messages without Quotes:
import requests
from bs4 import BeautifulSoup
url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
msgs = []
for msg in soup.select('[id^="post_message_"]'):
    for div in msg.select('div:has(> div > label:contains("Quote:"))'):
        div.extract()
    msgs.append( msg.get_text(strip=True, separator='\n') )
#print(msgs) # <-- uncomment to see all messages without Quoted messages
print(msgs[6])
Prints:
It will work in stop and go traffic!
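If a newer version of Beautiful Soup / soupsieve complains that :contains() is deprecated, the same selector can be written with the renamed pseudo-class (a small variant of the answer above, not something the original required):
for div in msg.select('div:has(> div > label:-soup-contains("Quote:"))'):
    div.extract()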

Scraping: issues receiving table data when looping with request

UPDATE:
The code only works some of the time. There are 2000+ cryptos, and at the moment I have 492 unique files with their history.
When I run one of the skipped URLs on its own, it works. Therefore I think the problem has been narrowed down to something to do with requesting the content.
Is it possible to make sure the table I'm interested in is fully loaded before continuing the code?
UPDATE:
I got it working properly. I think there is a limit on how many requests you can make per second or minute on the website I'm trying to scrape.
I put in a delay of 3 seconds between every request and NOW IT WORKS!!!!
Thanks to both of you for the help. Even though it didn't provide a direct answer, it put me on the right track to figuring it out.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
def scraping(url):
    global line
    content = requests.get(url).content
    soup = BeautifulSoup(content,'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        print(url)
        return
    data = [[td.text.strip() for td in tr.findChildren('td')] for tr in table.findChildren('tr')]
    df = pd.DataFrame(data)
    df.drop(df.index[0], inplace=True)
    df[0] = pd.to_datetime(df[0])
    for i in range(1,7):
        df[i] = pd.to_numeric(df[i].str.replace(",","").str.replace("-",""))
    df.columns = ['Date','Open','High','Low','Close','Volume','Market Cap']
    df.set_index('Date',inplace=True)
    df.sort_index(inplace=True)
    return df.to_csv(line + '_historical_data.csv')

with open("list_of_urls.txt") as file:
    for line in file:
        time.sleep(3)
        line = line.strip()
        start = "https://coinmarketcap.com/currencies/"
        end = "/historical-data/?start=20000101&end=21000101"
        url = start + line + end
        scraping(url)
It could be that the URL returns 404 (not found), or that the page simply has no table. To debug, print the name of the crypto currently being processed in your loop, and log the pages where no table is found:
table = soup.find('table', {'class': 'table'})
if not table:
    print('no table')
    return
You may perform findChildren() only if the returned table and tr objects are not NoneType, as follows:
data = [[td.text.strip() for td in tr.findChildren('td') if td] for tr in table.findChildren('tr') if tr] if table else []
if len(data) > 0:
    # process your data here
Hope it helps.
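Since the updates above traced the skipped pages to a request limit, another option is to retry a page a few times before giving up. This is only a sketch; the attempt count and delays are assumptions, not part of either answer:
import time
import requests
from bs4 import BeautifulSoup

def fetch_table(url, attempts=3):
    #retry with a growing pause, assuming the missing table is caused by rate limiting
    for attempt in range(attempts):
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        table = soup.find('table', {'class': 'table'})
        if table:
            return table
        time.sleep(3 * (attempt + 1))  #wait a little longer before each retry
    return None
scraping() could then call fetch_table(url) and only continue when it returns a table.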

Trouble getting around list index error

I've written a script to scrape Name and Price from Craigslist. It works smoothly until one of the values is None; as soon as it gets any None value it breaks, displaying "list index out of range". How do I deal with that?
import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
rows = tree.xpath('//li[@class="result-row"]')
for row in rows:
    link = row.xpath('.//a[contains(@class,"hdrlnk")]/text()')[0]
    price = row.xpath('.//span[@class="result-price"]/text()')[0]
    print(link, price)
The most effective technique I've come across for avoiding such errors is to wrap the lookup in a small helper that returns an empty string when the XPath matches nothing.
import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
def if_exist(row, xpath):
    docs = row.xpath(xpath)
    if docs:
        return docs[0]
    return ""
for row in tree.xpath('//li[@class="result-row"]'):
    link = if_exist(row, './/a[contains(@class,"hdrlnk")]/text()')
    price = if_exist(row, './/span[@class="result-price"]/text()')
    print(link, price)
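The same guard can also be written inline, without a helper, by falling back to an empty list when the XPath matches nothing (a small variation on the answer above, not a separate technique):
link = (row.xpath('.//a[contains(@class,"hdrlnk")]/text()') or [""])[0]
price = (row.xpath('.//span[@class="result-price"]/text()') or [""])[0]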

Iterate through list while parsing

I am trying to download worksheets for this workout; the workouts are split across different days. All that needs to be done is add a new number at the end of the link. Here is my code.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
theurl = "http://www.muscleandfitness.com/workouts/workout-routines/gain-10-pounds-muscle-4-weeks-1?day="
urls = []
count = 1
while count < 29:
    urls.append(theurl + str(count))
    count += 1
print(urls)
for url in urls:
    thepage = urllib
    thepage = urllib.request.urlopen(urls)
    soup = BeautifulSoup(thepage, "html.parser")
    init_data = open('/Users/paribaker/Desktop/scrapping/workout/4weekdata.txt', 'a')
    workout = []
    for data_all in soup.findAll('div',{'class':"b-workout-program-day-exercises"}):
        try:
            for item in data_all.findAll('div',{'class':"b-workout-part--item"}):
                for desc in item.findAll('div', {'class':"b-workout-part--description"}):
                    workout.append(desc.find('h4',{'class':"b-workout-part--exercise-count"}).text.strip("\n") +",\t")
                    workout.append(desc.find('strong',{'class':"b-workout-part--promo-title"}).text +",\t")
                    workout.append(desc.find('span',{'class':"b-workout-part--equipment"}).text +",\t")
                for instr in item.findAll('div', {'class':"b-workout-part--instructions"}):
                    workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-sets"}).text.strip("\n") +",\t")
                    workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-reps"}).text.strip("\n") +",\t")
                    workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-rest"}).text.strip("\n"))
                workout.append("\n*3")
        except AttributeError:
            pass
    init_data.write("".join(map(lambda x:str(x), workout)))
    init_data.close
The problem is that the server times out. I'm assuming the script isn't iterating through the list properly, or is adding characters I do not need and crashing the server's parser.
I have also tried writing another script that grabs all the links and puts them in a text document, then reopening that text file in this script and iterating through it, but that gave me the same error. What are your thoughts?
There's a typo here:
thepage = urllib.request.urlopen(urls)
You probably wanted:
thepage = urllib.request.urlopen(url)
Otherwise you are trying to open the whole list of URLs rather than a single one.
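For completeness, here is a minimal sketch of the corrected loop; the sleep() between requests is an assumption, added only because the question mentions the server timing out:
import urllib.request
from time import sleep
from bs4 import BeautifulSoup

theurl = "http://www.muscleandfitness.com/workouts/workout-routines/gain-10-pounds-muscle-4-weeks-1?day="
urls = [theurl + str(count) for count in range(1, 29)]
for url in urls:
    thepage = urllib.request.urlopen(url)   #open one URL at a time, not the whole list
    soup = BeautifulSoup(thepage, "html.parser")
    sleep(2)                                #brief pause so requests are spaced out (assumption)
    #...parse the workout divs here as in the original script...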
