Iterate through list while parsing - python-3.x

I am trying to download worksheets for this workout, all the workouts are split on different days. All that needs to be done is add a new number at the end of the link. Here is my code.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
theurl = "http://www.muscleandfitness.com/workouts/workout-routines/gain-10-pounds-muscle-4-weeks-1?day="
urls = []
count = 1
while count <29:
urls.append(theurl + str(count))
count +=1
print(urls)
for url in urls:
thepage = urllib
thepage = urllib.request.urlopen(urls)
soup = BeautifulSoup(thepage,"html.parser")
init_data = open('/Users/paribaker/Desktop/scrapping/workout/4weekdata.txt', 'a')
workout = []
for data_all in soup.findAll('div',{'class':"b-workout-program-day-exercises"}):
try:
for item in data_all.findAll('div',{'class':"b-workout-part--item"}):
for desc in item.findAll('div', {'class':"b-workout-part--description"}):
workout.append(desc.find('h4',{'class':"b-workout-part--exercise-count"}).text.strip("\n") +",\t")
workout.append(desc.find('strong',{'class':"b-workout-part--promo-title"}).text +",\t")
workout.append(desc.find('span',{'class':"b-workout-part--equipment"}).text +",\t")
for instr in item.findAll('div', {'class':"b-workout-part--instructions"}):
workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-sets"}).text.strip("\n") +",\t")
workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-reps"}).text.strip("\n") +",\t")
workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-rest"}).text.strip("\n"))
workout.append("\n*3")
except AttributeError:
pass
init_data.write("".join(map(lambda x:str(x), workout)))
init_data.close
The problem is that the server times out, I'm assuming its not iterating through the list properly or adding characters I do not need and crashing the server parser.
I have also tried to write another script that grabs all the link and put them in a text document, then reopen the text in this script and iterate through the text, but that also gave me the same error. What are your thoughts?

There's a typo here:
thepage = urllib.request.urlopen(urls)
You probably wanted:
thepage = urllib.request.urlopen(url)
Otherwise you are trying to open an array of urls rather than a single one.

Related

How Can I Assign A Variable To All Of The Items In A List?

I'm following a guide and it's saying to print the first item from an html document that contains the dollar sign.
It seems to do it correctly, outputting a price to the terminal and it actually being present on the webpage. However, I don't want to have just that single listing, I want to have all of the listings and print them to the terminal.
I'm almost positive that you could do this with a for loop, but I don't know how to set that up correctly. Here's the code I have so far with a comment on line 14, and the code I'm asking about on line 15.
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
#Print all prices instead of just the specified number?
parent = prices[0].parent
strong = parent.find("strong")
print(strong.string)
You could try the following:
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
for price in prices:
parent = price.parent
strong = parent.find("strong")
print(strong.string)

Python 3.7 Issue with append function

I'm learning Python and decided to adapte code from an example to scrape Craigslist data to look at prices of cars. https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
I've created a Jupyter notebook and modified the code for my use. I recreated the same error when running the code in Spyder Python 3.7.
I'm running into an issue at line 116.
File "C:/Users/UserM/Documents/GitHub/learning/Spyder Python Craigslist Scrape Untitled0.py", line 116
post_prices.append(post_price). I receive a "SynaxError: invalid syntax".
Any help appreciated. Thanks.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 2 12:26:06 2019
"""
#import get to call a get request on the site
from requests import get
#get the first page of the Chicago car prices
response = get('https://chicago.craigslist.org/search/cto?bundleDuplicates=1') #eliminate duplicates and show owner only sales
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_= 'result-row')
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 (elements/page
#grab the first post
post_one = posts[0]
#grab the price of the first post
post_one_price = post_one.a.text
post_one_price.strip()
#grab the time of the post in datetime format to save on cleaning efforts
post_one_time = post_one.find('time', class_= 'result-date')
post_one_datetime = post_one_time['datetime']
#title is a and that class, link is grabbing the href attribute of that variable
post_one_title = post_one.find('a', class_='result-title hdrlnk')
post_one_link = post_one_title['href']
#easy to grab the post title by taking the text element of the title variable
post_one_title_text = post_one_title.text
#the neighborhood is grabbed by finding the span class 'result-hood' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-hood').text
#the price is grabbed by finding the span class 'result-price' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-price').text
#build out the loop
from time import sleep
import re
from random import randint #avoid throttling by not sending too many requests one after the other
from warnings import warn
from time import time
from IPython.core.display import clear_output
import numpy as np
#find the total number of posts to find the limit of the pagination
results_num = html_soup.find('div', class_= 'search-legend')
results_total = int(results_num.find('span', class_='totalcount').text) #pulled the total count of posts as the upper bound of the pages array
#each page has 119 posts so each new page is defined as follows: s=120, s=240, s=360, and so on. So we need to step in size 120 in the np.arange function
pages = np.arange(0, results_total+1, 120)
iterations = 0
post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []
for page in pages:
#get request
response = get("https://chicago.craigslist.org/search/cto?bundleDuplicates=1"
+ "s=" #the parameter for defining the page number
+ str(page) #the page number in the pages array from earlier
+ "&hasPic=1"
+ "&availabilityMode=0")
sleep(randint(1,5))
#throw warning for status codes that are not 200
if response.status_code != 200:
warn('Request: {}; Status code: {}'.format(requests, response.status_code))
#define the html text
page_html = BeautifulSoup(response.text, 'html.parser')
#define the posts
posts = html_soup.find_all('li', class_= 'result-row')
#extract data item-wise
for post in posts:
if post.find('span', class_ = 'result-hood') is not None:
#posting date
#grab the datetime element 0 for date and 1 for time
post_datetime = post.find('time', class_= 'result-date')['datetime']
post_timing.append(post_datetime)
#neighborhoods
post_hood = post.find('span', class_= 'result-hood').text
post_hoods.append(post_hood)
#title text
post_title = post.find('a', class_='result-title hdrlnk')
post_title_text = post_title.text
post_title_texts.append(post_title_text)
#post link
post_link = post_title['href']
post_links.append(post_link)
#removes the \n whitespace from each side, removes the currency symbol, and turns it into an int
#test removed: post_price = int(post.a.text.strip().replace("$", ""))
post_price = int(float((post.a.text.strip().replace("$", ""))) #does this work??
post_prices.append(post_price)
iterations += 1
print("Page " + str(iterations) + " scraped successfully!")
print("\n")
print("Scrape complete!")
import pandas as pd
eb_apts = pd.DataFrame({'posted': post_timing,
'neighborhood': post_hoods,
'post title': post_title_texts,
'URL': post_links,
'price': post_prices})
print(eb_apts.info())
eb_apts.head(10)
Welcome to StackOverflow. Usually when you see syntax errors in already working code, it means that you've either messed up indentation, forgot to terminate a string somewhere, or missed a closing bracket.
You can tell this when a line of what looks to be ok code is throwing you a syntax error. This is because the line before isn't ended properly and the interpreter is giving you hints around where to look.
In this case, you're short a paranthesis in the line before.
post_price = int(float((post.a.text.strip().replace("$", "")))
should be
post_price = int(float((post.a.text.strip().replace("$", ""))))
or delete the extra paranthesis after float
post_price = int(float(post.a.text.strip().replace("$", "")))

Using Beautifulsoup to parse a big comment?

I'm using BS4 to parse this webpage:
You'll notice there are two separate tables on the page. Here's the relevant snipped of my code, which is successfully returning the data I want from the first table, but does not find anything from the second table:
# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd
#settings
http = urllib3.PoolManager(
cert_reqs='CERT_REQUIRED',
ca_certs=certifi.where())
gamelog_offense = []
#scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0
for row in soup.findAll('tr'):
try:
col=row.findAll('td')
Pass_cmp = col[4].get_text()
Pass_att = col[5].get_text()
gamelog_offense.append([Pass_cmp, Pass_att])
cnt += 1
except:
pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records
I've verified the data from the SECOND table is contained within the soup (I can see it!). After lots of troubleshooting, I've discovered that the entire second table is completely contained within one big comment(why?). I've managed to extract this comment into a single comment object using the code below, but can't figure out what to do with it after that to extract the data I want. Ideally, I'd like to parse the comment in same way I'm successfully parsing the first table. I've tried using the ideas from similar stack overflow questions (selenium, phantomjs)...no luck.
import bs4
defense = soup.find(id="all_defense")
for item in defense.children:
if isinstance(item, bs4.element.Comment):
big_comment = item
print(big_comment)
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
...and so on....
Posting an answer here in case others find helpful. Many thanks to #TomasCarvalho for directing me to find a solution. I was able to pass the big comment as html into a second soup instance using the following code, and then just use the original parsing code on the new soup instance. (note: the try/except is because some of the teams have no gamelog, and you can't call .children on a NoneType.
try:
defense = soup.find(id="all_defense")
for item in defense.children:
if isinstance(item, bs4.element.Comment):
html = item
Dsoup = BeautifulSoup(html, features="html.parser")
except:
html = ''
Dsoup = BeautifulSoup(html, features="html.parser")

find() in Beautifulsoup returns None

I'm very new to programming in general and I'm trying to write my own little torrent leecher. I'm using Beautifulsoup In order to extract the title and the magnet link of a torrent file. However find() element keeps returning none no matter what I do. The page is correct. I've also tested with find_next_sibling and read all the similar questions but to no avail. Since there are no errors I have no idea what my mistake is.
Any help would be much appreciated. Below is my code:
import urllib3
from bs4 import BeautifulSoup
print("Please enter the movie name: \n")
search_string = input("")
search_string.rstrip()
search_string.lstrip()
open_page = ('https://www.yify-torrent.org/search/' + search_string + '/s-1/all/all/') # get link - creates a search string with input value
print(open_page)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
manager = urllib3.PoolManager(10)
page_content = manager.urlopen('GET',open_page)
soup = BeautifulSoup(page_content,'html.parser')
magnet = soup.find('a', attrs={'class': 'movielink'}, href=True)
print(magnet)
Check out the following script which does exactly what you wanna achieve. I used requests library instead of urllib3. The main mistake you made is that you looked for the magnet link in the wrong place. You need to go one layer deep to dig out that link. Try using quote instead of string manipulation to fit your search query within the url.
Give this a shot:
import requests
from urllib.parse import urljoin
from urllib.parse import quote
from bs4 import BeautifulSoup
keyword = 'The Last Of The Mohicans'
url = 'https://www.yify-torrent.org/search/'
base = f"{url}{quote(keyword)}{'/p-1/all/all/'}"
res = requests.get(base)
soup = BeautifulSoup(res.text,'html.parser')
tlink = urljoin(url,soup.select_one(".img-item .movielink").get("href"))
req = requests.get(tlink)
sauce = BeautifulSoup(req.text,"html.parser")
title = sauce.select_one("h1[itemprop='name']").text
magnet = sauce.select_one("a#dm").get("href")
print(f"{title}\n{magnet}")

Web Scraping with BeautifulSoup code review

from bs4 import BeautifulSoup
import requests
import pandas as pd
records=[]
keep_looking = True
url = 'https://www.tapology.com/fightcenter'
while keep_looking:
re = requests.get(url)
soup = BeautifulSoup(re.text,'html.parser')
data = soup.find_all('section',attrs={'class':'fcListing'})
for d in data:
event = d.find('a').text
date = d.find('span',attrs={'class':'datetime'}).text[1:-4]
location = d.find('span',attrs={'class':'venue-location'}).text
mainEvent = first.find('span',attrs={'class':'bout'}).text
url_tag = soup.find('div',attrs={'class':'fightcenterEvents'})
if not url_tag:
keep_looking = False
else:
url = "https://www.tapology.com" + url_tag.find('a')['href']
I am wondering if there are any errors in my code? It is running, but it is taking a very long time to finish and I am afraid it might be stuck in an infinity loop. Please any feedback would be helpful. Please do not rewrite all of this and post, as I would like to keep this format, as I am learning and want to improve.
Although this is not the right site to seek help for review related task, I considered giving a solution as it sounds that you may fall in an infinite loop according to your statement above.
Try this to get information from that site. It will run until there is a next page link to traverse. When there is no more new page link to follow, the script will automatically stop.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = 'https://www.tapology.com/fightcenter'
while True:
re = requests.get(url)
soup = BeautifulSoup(re.text,'html.parser')
for data in soup.find_all('section',attrs={'class':'fcListing'}):
event = data.select_one('.name a').get_text(strip=True)
date = data.find('span',attrs={'class':'datetime'}).get_text(strip=True)[:-1]
location = data.find('span',attrs={'class':'venue-location'}).get_text(strip=True)
try:
mainEvent = data.find('span',attrs={'class':'bout'}).get_text(strip=True)
except AttributeError: mainEvent = ""
print(f'{event} {date} {location} {mainEvent}')
urltag = soup.select_one('.pagination a[rel="next"]')
if not urltag: break #as soon as it finds that there is no next page link, it will break out of the loop
url = urljoin(url,urltag.get("href")) #applied urljoin to save you from using hardcoded prefix
For future reference: feel free to post any question in this site to get your code reviewed.

Resources