I am a beginner using Python 3.6.4 and RoboBrowser 0.5.3.
I have saved an HTML webpage locally and I am trying to extract information from it.
Most likely incorrectly, I took inspiration from a similar question on BeautifulSoup. The BeautifulSoup solution works for me (BeautifulSoup 4.6.0).
In contrast, the following, based on RoboBrowser, does not seem to work:
from robobrowser import RoboBrowser
br = RoboBrowser(parser='html.parser')
br.open(open("my_file.html"))
It fails with the error:
MissingSchema: Invalid URL "<_io.TextIOWrapper name='my_file.html' mode='r' encoding='UTF-8'>": No schema supplied. Perhaps you meant http://<_io.TextIOWrapper name='my_file.html' mode='r' encoding='UTF-8'>?
I understand that the code expects an "http"-based URL. I tried prepending "file://" to the absolute path of my file, to no avail.
Is there any way to tell the library that this is a local file, or is such functionality simply not part of RoboBrowser?
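For context, MissingSchema is raised by the requests library, which RoboBrowser uses for its HTTP traffic, and requests has no handler for file:// URLs by default. That would explain why prepending "file://" did not help. Below is a minimal sketch, assuming the goal is only to read and parse the saved page: the standard library's urllib does understand file:// URIs. The names read_local_html and TitleParser are hypothetical, and the sample file is created on the fly only so the sketch runs anywhere.

```python
import os
import tempfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def read_local_html(path, encoding="utf-8"):
    """Read a local HTML file through a file:// URI and return its text."""
    uri = Path(path).resolve().as_uri()  # absolute path turned into file://...
    with urlopen(uri) as response:
        return response.read().decode(encoding)


# Create a small stand-in for my_file.html so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "my_file.html")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Saved page</title></head><body></body></html>")

    html_text = read_local_html(sample_path)

parser = TitleParser()
parser.feed(html_text)
print(parser.title)
```

If the BeautifulSoup approach already works for the saved file, passing the file's text to it directly is probably the simpler route; the sketch only illustrates that a file:// URI is usable outside of requests.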
I am using the Google Maps API (static map) and would like to save an image file in JPG format.
Saving a PNG with urllib.request.urlretrieve(url, 'map_46_6.png') works fine. However, urllib.request.urlretrieve(url, 'map_46_6.jpg') does not: opening the file gives the error « Not a JPG file: starts with 0x89 0x50 ». Manually changing the extension to PNG resolves it.
The following is the code:
import urllib.request
url = 'http://maps.googleapis.com/maps/api/staticmap?scale=2&center=46.257632,6.108669&zoom=12&size=400x400&maptype=satellite&key=xxxxx'
urllib.request.urlretrieve(url, 'map_46_6.jpg')
As this code is part of a previously built pipeline, I would need the JPG files for the next steps.
My question is: is there a setting in urllib, Google Maps or anything else that could result in this error? Thank you very much in advance!
I have found a solution. The file extension passed to urlretrieve does not change what the server sends, so to get a JPG one needs to request the format explicitly by adding &format=jpg to the URL, like the following:
import urllib.request
url = 'https://maps.googleapis.com/maps/api/staticmap?scale=2&center=46.257632,6.108669&zoom=16&size=400x400&maptype=satellite&format=jpg&key=xxxx'
urllib.request.urlretrieve(url, 'map_46_6.jpg')
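The bytes 0x89 0x50 quoted in the error are the start of the PNG file signature, which is consistent with the server returning PNG data regardless of the local file name. The sketch below builds the query string with urlencode so that every parameter is joined by "&", and checks the leading bytes of downloaded data. The helper names build_static_map_url and sniff_image_format are hypothetical, the key is a placeholder, and no network call is made.

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/staticmap"


def build_static_map_url(center, zoom, api_key, image_format="jpg"):
    """Assemble a static map URL with an explicit image format."""
    params = {
        "scale": 2,
        "center": center,
        "zoom": zoom,
        "size": "400x400",
        "maptype": "satellite",
        "format": image_format,
        "key": api_key,
    }
    # safe="," keeps the comma in the coordinates readable instead of %2C.
    return BASE_URL + "?" + urlencode(params, safe=",")


def sniff_image_format(data):
    """Guess the image type from the first bytes of the file."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return "unknown"


url = build_static_map_url("46.257632,6.108669", 16, "xxxx")
print(url)
print(sniff_image_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
```

Running sniff_image_format on the first bytes of a saved file is a quick way to confirm whether the pipeline is really receiving JPEG data.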
I was trying to use Selenium to scrape pictures of a certain character on Pixiv, but when I tried to download one, I got a 403 error. I tried using the requests module with the src link to download it directly and got the same error. I tried opening a new tab with the src link and got the same error again. Is there a way to download an image from Pixiv? I was planning something a bit larger than downloading a single image, but I am stuck at this step. I did set the User-Agent, as this thread suggested, but either it didn't work or I did something wrong.
This is the image I tried to download: https://www.pixiv.net/en/artworks/93284987.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://i.pximg.net/img-original/img/2021/10/07/19/41/28/93284987_p0.jpg')
I skipped the code to get it here, but what happens is that I get the src link to the image, but I can't access the page to download it. I don't know if I need to actually go to the page, but I can't do anything with the src either. I tried several methods but nothing works. Can someone help me?
Seems to download just fine without any headers for me.
import requests
import shutil
url = 'https://i.pximg.net/img-master/img/2021/10/07/19/41/28/93284987_p0_master1200.jpg'
response = requests.get(url, stream=True)
local_filename = url.split('/')[-1]
with open(local_filename, 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
print(local_filename)
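Note that the URL above points at the img-master path, while the question uses img-original, so the two requests are not identical. One commonly reported cause of a 403 on image hosts is a hotlink check on the Referer header; whether that applies here is an assumption, not something confirmed in this thread. A minimal sketch of sending that header with the standard library follows. The helper names build_image_request and download are hypothetical, and the download call is left commented out so the sketch runs without network access.

```python
import shutil
from urllib.request import Request, urlopen


def build_image_request(image_url, referer):
    """Create a request that carries a Referer header (assumed to be needed)."""
    return Request(image_url, headers={"Referer": referer})


def download(request, local_filename):
    """Stream the response body to a local file."""
    with urlopen(request) as response, open(local_filename, "wb") as out_file:
        shutil.copyfileobj(response, out_file)


image_url = "https://i.pximg.net/img-original/img/2021/10/07/19/41/28/93284987_p0.jpg"
request = build_image_request(image_url, "https://www.pixiv.net/")
local_filename = image_url.split("/")[-1]

print(request.get_header("Referer"), local_filename)
# download(request, local_filename)  # uncomment to perform the network call
```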
This is my first attempt to use web scraping in Python to extract some links from a webpage.
This is the webpage I am interested in getting some data from:
http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5
I am interested in extracting all instances of the following from the above webpage:
href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352"
I have written the following regex to extract all matches of this type of link:
r"href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\""
Here is the quick code I have written to try to extract all the regex matches:
#!/usr/bin/python3
import re
import requests
url = "http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5"
page = requests.get(url)
l = re.findall(r'href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\"', page.text)
print(l)
When I run the above code I get the following output:
./links2.py
[]
When I inspect the webpage using the developer tools within the browser I can see these links, but when I try to extract the text I am interested in (href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352") using the Python 3 script I get no matches.
Am I downloading the webpage correctly? How do I make sure I am getting all of the webpage from within my script? I have a feeling I am missing parts of the page when using requests to get it.
Any help would be appreciated.
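An empty result is consistent with the links being added by JavaScript after the initial page load, in which case they never appear in page.text; that is an assumption about this site, not something verified here. One way to separate the two possible causes is to test the pattern against a static sample first. The sketch below does that with a slightly tightened pattern, where [^"/]+ replaces .* so that a match cannot run across several href attributes on one line. The second episode link and the empty-shell page are made-up sample data.

```python
import re

# [^"/]+ matches a single path segment and cannot cross a quote or slash.
LINK_PATTERN = re.compile(r'href="(/tv/bhojo-gobindo/14172/[^"/]+/\d{10})"')

# Static sample: the first link is from the question, the others are invented.
sample_html = (
    '<a href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352">One</a>'
    '<a href="/tv/bhojo-gobindo/14172/another-episode/1000196353">Two</a>'
    '<a href="/tv/other-show/999/unrelated/1000000000">Other</a>'
)

links = LINK_PATTERN.findall(sample_html)
print(links)

# A page whose content is filled in by scripts may look like this to requests.
empty_shell = "<html><body><div id='app'></div></body></html>"
shell_links = LINK_PATTERN.findall(empty_shell)
print(shell_links)
```

If the pattern matches the sample but not page.text, searching page.text for a fragment such as "gobinda-is-in-a-fix" shows whether the link is present in the downloaded HTML at all.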
I've been using different versions of a web scraper, built with beautifulsoup, urllib, and requests, to download anime images from a number of websites I like.
When I have the image link I use requests.get(name_of_url).content and write the file to a directory on my computer. This has been working for other sites but not for this new one. With this new site, the program runs fine, but the file is not written correctly, as I am unable to view it with any image viewer. Here is my code without all of the HTML parsing, just the URL-to-image download section:
import requests
import os
img_data = requests.get("https://cs.sankakucomplex.com/data/ba/bc/babc83a0361198bb43a9b367273b3ef7.jpg?e=1510027320&m=euskBFzOAk-YJJjfbP-26A").content
completename = os.path.join('C:\\', 'Users', 'jesse', '.spyder-py3', 'Image_scraper','sankaku', 'testtesttest.jpg')
with open(completename, 'wb') as handler:
    handler.write(img_data)
I'm fairly certain that the issue comes from the different URL structure this site has. Notice that after the ".jpg?" there is more URL information, which the other sites I was looking through did not have. I'm open to using urllib2 or another library; I have only been learning to use Python to interface with HTML over the last 2-3 weeks. Any ideas or suggestions are appreciated~
Thank you
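The part after "?" is the query string. It is sent to the server with the request but is not part of the file name. The sketch below separates the two with urllib.parse and adds a rough check for the case where the server answered with an HTML page instead of image data; that the server did so here is an assumption, not something verified. The helper name looks_like_html is hypothetical.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

url = (
    "https://cs.sankakucomplex.com/data/ba/bc/"
    "babc83a0361198bb43a9b367273b3ef7.jpg?e=1510027320&m=euskBFzOAk-YJJjfbP-26A"
)

parts = urlsplit(url)
filename = posixpath.basename(parts.path)  # path only, query string excluded
query = parse_qs(parts.query)              # parameters sent along with the request

print(filename)
print(query)


def looks_like_html(data):
    """Rough check: markup starts with '<', image formats such as JPEG do not."""
    return data.lstrip()[:1] == b"<"


print(looks_like_html(b"<!DOCTYPE html><html></html>"))
print(looks_like_html(b"\xff\xd8\xff\xe0"))
```

Checking the response's status code and running looks_like_html on img_data before writing the file would show whether the saved bytes are an image at all.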
I have Python 3.3 installed.
I am using the example from the Python website:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
The only thing that happens when I run it is that I get this:
======RESTART=========
I know I am a rookie, but I figured the example from Python's own website should work.
It doesn't. What am I doing wrong? Eventually I want to run the script from the website below, but I think urllib is not going to work the way it is used on that site. Can someone tell me if the code will work with Python 3.3?
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/
I think I see what's probably going on. You're likely using IDLE, and when it starts a new run of a program, it prints the
======RESTART=========
line to tell you that a fresh program is starting. That means that all the variables currently defined are reset and/or deleted, as appropriate.
Since your program didn't print any output, you didn't see anything.
The two lines I suggested adding were just tests to figure out what was going on, they're not needed in general. [Unless the window itself is automatically closing, which it shouldn't.] But as a rule, if you want to see output, you'll have to print what you're interested in.
Your example works for me. However, I suggest using requests instead of urllib.
To simplify the example you linked to, it would look like:
from bs4 import BeautifulSoup
import requests
resp = requests.get("http://www.wunderground.com/history/airport/KBUF/2007/12/16/DailyHistory.html")
soup = BeautifulSoup(resp.text, "html.parser")  # name the parser explicitly