Another one for you.
Trying to scrape a list of URLs from a CSV file. This is my code:
from bs4 import BeautifulSoup
import requests
import csv

with open('TeamRankingsURLs.csv', newline='') as f_urls, open('TeamRankingsOutput.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)

    for line in csv_urls:
        page = requests.get(line[0]).text
        soup = BeautifulSoup(page, 'html.parser')
        results = soup.findAll('div', {'class' :'LineScoreCard__lineScoreColumnElement--1byQk'})

        for r in range(len(results)):
            csv_output.writerow([results[r].text])
...Which gives me the following error:
Traceback (most recent call last):
File "TeamRankingsScraper.py", line 11, in <module>
page = requests.get(line[0]).text
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 616, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 707, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15'
My CSV file is just a list in column A of several urls (ie. https://www...)
(The div class I'm trying to scrape doesn't exist on that page, but that's not where the problem is. At least I don't think. I just need to update that when I can get it to read from the CSV file.)
Any suggestions? Because this code works on another project, but for some reason I'm having issues with this new URL list. Thanks a lot!
From the traceback: requests.exceptions.InvalidSchema: No connection adapters were found for 'https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15'
There is a stray character in front of the URL (it may be invisible, for example a byte order mark at the start of the CSV file). The URL should start at https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15
So first parse the CSV and use a regex to remove anything that precedes http/https. That should solve your problem.
To fix this particular URL while reading the CSV, do:
import re
strin = "https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15"
re.sub(r'.*http', 'http', strin)
This gives you the correct URL, which requests can handle.
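One caveat with the pattern above: `.*` is greedy, so it cuts up to the last "http" in the string. A minimal sketch of a stricter variant follows, assuming the stray prefix is a byte order mark (an assumption, the traceback does not reveal the character) and using a made-up example.com URL for illustration:

```python
import re

# Hypothetical input: a BOM glued to a URL that itself contains another URL.
dirty = "\ufeffhttps://example.com/?next=http://other.example"

# Greedy pattern from the answer: removes everything up to the LAST "http".
greedy = re.sub(r'.*http', 'http', dirty)

# Anchored, lazy pattern: removes only what precedes the FIRST http:// or https://.
strict = re.sub(r'^.*?(?=https?://)', '', dirty, count=1)

print(greedy)
print(strict)
```

For the single URL in the question both patterns give the same result; the stricter one only matters when "http" appears again later in the string.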
Since you asked for the full fix applied inside the loop, here's what you could do:
from bs4 import BeautifulSoup
import requests
import csv
import re

with open('TeamRankingsURLs.csv', newline='') as f_urls, open('TeamRankingsOutput.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)

    for line in csv_urls:
        # Drop anything that precedes "http" in the first column
        url = re.sub(r'.*http', 'http', line[0])
        page = requests.get(url).text
        soup = BeautifulSoup(page, 'html.parser')
        results = soup.findAll('div', {'class': 'LineScoreCard__lineScoreColumnElement--1byQk'})

        # Write the text of each matching div on its own row
        for result in results:
            csv_output.writerow([result.text])
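A possible alternative, assuming the stray character is a UTF-8 byte order mark written by a spreadsheet program (this is an assumption; the traceback does not show which character it is): open the CSV with encoding='utf-8-sig' so that Python drops the mark while decoding. The sketch below uses in-memory bytes instead of the real file, and read_urls is a hypothetical helper name:

```python
import csv
import io

URL = "https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15"

# Simulate a CSV saved with a UTF-8 BOM (assumption: this is the stray character).
raw_bytes = ("\ufeff" + URL + "\r\n").encode("utf-8")


def read_urls(raw, encoding):
    """Hypothetical helper: decode CSV bytes and return the first column."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return [row[0] for row in csv.reader(text) if row]


with_bom = read_urls(raw_bytes, "utf-8")         # BOM stays glued to the URL
without_bom = read_urls(raw_bytes, "utf-8-sig")  # BOM is stripped on decode

print(with_bom[0].startswith("https"))
print(without_bom[0] == URL)
```

In the script above this would amount to open('TeamRankingsURLs.csv', newline='', encoding='utf-8-sig'), with no regex needed if the BOM is indeed the only problem.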
Related
I want to grab a URL that the browser shows under Network tab > Headers, but that URL is nowhere to be seen in the response.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
link = 'https://www.tiktok.com/#docfrankhere/video/7128498731908893995?is_from_webapp=1&sender_device=pc&web_id=7111780332651578886'
r = requests.get(link)
response = r.status_code
head = r.headers['Request URL']
print(head)
Output
Traceback (most recent call last):
File "C:\Users\SEBAA\Desktop\copyu.py", line 10, in <module>
head = r.headers['Request URL']
File "C:\Users\SEBAA\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\structures.py", line 54, in __getitem__
return self._store[key.lower()][1]
KeyError: 'request url'
Even when I do
head = r.headers
print(head)
the output doesn't show the URL I want to grab.
I'm new to this, so any help would be appreciated.
I want to open each item on every page of this link, then crawl all the information inside the red circle on each detail page and save it as a DataFrame.
To loop over all pages, I need to iterate over all the Ids used in the second link, but I don't know where I can get them.
https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=956244
I started with the following code, but I get a "No tables found" error even though I can see the table in the page source:
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
url = 'https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=1044647'
website_url = requests.get(url, verify=False).text
soup = BeautifulSoup(website_url, 'html.parser')
table = soup.find('table', {'class': 'ewb-claim table'})
# https://stackoverflow.com/questions/51090632/python-excel-export
df = pd.read_html(str(table))[0]
print(df.head(5))
Output:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
Traceback (most recent call last):
File "<ipython-input-13-d95369a5d235>", line 12, in <module>
df = pd.read_html(str(table))[0]
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py", line 915, in read_html
keep_default_na=keep_default_na)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py", line 749, in _parse
raise_with_traceback(retained)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/compat/__init__.py", line 385, in raise_with_traceback
raise exc.with_traceback(traceback)
ValueError: No tables found
Could someone help? Thanks.
OK, I'm new to Python and I want to learn a bit about web scraping, which is even harder for me since I don't know anything about the web, JS, HTML, etc. My idea was to download some of the images available in the Netflix catalogue, and it works fine for the first 5 or 6 images:
import requests, os
from bs4 import BeautifulSoup

url = "https://www.netflix.com/br/browse/genre/839338"
page = requests.get(url)
page.raise_for_status()
soup = BeautifulSoup(page.text)
img_element_list= soup.select('a img')
print(f'Images avalaible (?) : {len(img_element_list)} ')
quantity = int(input('How many images: '))

for c in range(quantity):
    name = img_element_list[c].get('alt')
    print('Downloading ' + name + ' image...')
    img_response= img_element_list[c].get('src')
    print('SCR: ' + img_response + '\n\n')
    img = requests.get(img_response)
    file = os.path.join('Images', name)
    img_file = open(file+'.jpg', 'wb')
    for chunk in img.iter_content(100000):
        img_file.write(chunk)
    img_file.close()
But after the fourth or fifth image has been downloaded, the src of the subsequent images comes back as '' and then this error is raised:
Traceback (most recent call last):
File "C:\Users\1513 IRON\PycharmProjects\DownloadNetflixImg.py", line 20, in <module>
img = requests.get(img_response)
File "C:\Users\1513 IRON\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\1513 IRON\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\1513 IRON\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\1513 IRON\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\1513 IRON\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for ''
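The traceback shows requests.get() being called with an empty string. One way to guard against that is to check the scheme before requesting. The sketch below assumes that images with an empty or non-HTTP src should simply be skipped; is_fetchable is a hypothetical helper and the sample values are made up:

```python
from urllib.parse import urlparse


def is_fetchable(src):
    """Hypothetical helper: True only for absolute http(s) URLs."""
    if not src:
        return False
    parts = urlparse(src)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Made-up src values standing in for img_element_list[c].get('src')
candidates = ["https://example.com/poster.jpg", "", None, "data:image/gif;base64,R0lGOD"]
usable = [src for src in candidates if is_fetchable(src)]
print(usable)
```

Inside the download loop, a check like `if not is_fetchable(img_response): continue` would skip such entries instead of raising InvalidSchema.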
I have created a basic crawler in Python, and I want it to take its input from a text file.
I tried open/raw_input, but got an error.
When I used the input("") function, it prompted for input and worked fine.
The problem occurs only when reading from a file.
import re
import urllib.request
url = open('input.txt', 'r')
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf8")
print(data1)
file =open('output.txt' , 'w')
file.write(data1)
file.close()
Error output below:
Traceback (most recent call last):
File "scrape.py", line 8, in <module>
data = urllib.request.urlopen(url).read()
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 518, in open
protocol = req.type
AttributeError: '_io.TextIOWrapper' object has no attribute 'type'
The open function returns a file object, not the contents of the file as a string. If you want url to hold the contents as a string, read the file (and strip the trailing newline so that it does not end up in the URL):
url = open('input.txt', 'r').read().strip()
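Reading the whole file at once only works while it holds a single URL. A minimal sketch, assuming input.txt contains one URL per line (load_urls is a hypothetical helper, and a temporary file stands in for input.txt):

```python
import os
import tempfile


def load_urls(path):
    """Hypothetical helper: return the non-empty, stripped lines of a text file."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a temporary file standing in for input.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "input.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("https://example.com/a\n\nhttps://example.com/b\n")
    urls = load_urls(demo_path)

print(urls)
```

Each entry of the returned list is a plain string, so it could be passed to urllib.request.urlopen(url) in a loop.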
I am using Beautiful Soup with the requests package in Python 3 for web scraping. This is my code.
import csv
from datetime import datetime
import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup
quote_page = ['http://10.69.161.179:8080'];
data = []
page = requests.get(quote_page)
soup = BeautifulSoup(page.content,'html.parser')
name_box = soup.find('div', attrs={'class':'caption span10'})
name= name_box.text.strip() #strip() is used to remove starting and ending
print(name);
data.append(name)
with open('sample.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name])
print ("Success");
When I execute the above code I'm getting the following error.
Traceback (most recent call last):
File "first_try.py", line 21, in <module>
page = requests.get(quote_page);
File "C:\Python\lib\site-packages\requests-2.13.0-py3.6.egg\requests\api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "C:\Python\lib\site-packages\requests-2.13.0-py3.6.egg\requests\api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python\lib\site-packages\requests-2.13.0-py3.6.egg\requests\sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python\lib\site-packages\requests-2.13.0-py3.6.egg\requests\sessions.py", line 603, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python\lib\site-packages\requests-2.13.0-py3.6.egg\requests\sessions.py", line 685, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['http://10.69.161.179:8080/#/main/dashboard/metrics']'
Can anyone help me with this? :(
Because requests.get() only accepts a URL as a string, not a list. You need to take the string out of the list [], for example by looping over it:
quote_page = ['http://10.69.161.179:8080']
for url in quote_page:
    page = requests.get(url)
    .....
By the way, although the trailing semicolon is harmless in the following statement, you should avoid it unless you need it for some reason:
quote_page = ['http://10.69.161.179:8080'];
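If a variable may hold either one URL or a list of URLs, a small normalising step keeps the calling code uniform. This is only a sketch; as_url_list is a hypothetical helper and not part of requests:

```python
def as_url_list(value):
    """Hypothetical helper: wrap a single URL string in a list, copy other iterables."""
    if isinstance(value, str):
        return [value]
    return list(value)


single = as_url_list('http://10.69.161.179:8080')
several = as_url_list(['http://10.69.161.179:8080'])

for url in several:
    # requests.get(url) would go here; each url is a plain string
    print(type(url).__name__, url)
```

Either way the loop body always receives a string, which is what requests.get() expects.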