Python web scraper won't save image files - python-3.x

I started working on a small image-scraping terminal program that is supposed to save images to a specified folder within the program hierarchy. It comes from a basic tutorial I found online. However, whenever I enter a search term into the terminal to start scraping bing.com (yeah, I know), the program crashes. The errors I get suggest that either the image file type is not recognized or the file path where the images should be saved cannot be found:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
search = input("Search for:")
params = {"q": search}
r = requests.get("http://www.bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "thumb"})
for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("Getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("./scraped_images/" + title, img.format)
Error thrown: Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: './scraped_images/3849747391_4a7dc3f19e_b.jpg'
I've tried adding a file path variable (using pathlib) and concatenating it with the other necessary variables:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
from pathlib import Path
image_folder = Path("./scraped_images/")
search = input("Search for:")
params = {"q": search}
r = requests.get("http://www.bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "thumb"})
for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("Getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save(image_folder + title, img.format)
Error thrown: Exception has occurred: TypeError
unsupported operand type(s) for +: 'WindowsPath' and 'str'
I've checked the documentation for PIL, BeautifulSoup, etc. to see if any updates may have been tripping me up, I've checked the elements on Bing to see if the classes are correct, and I even tried searching by a different class, but nothing worked. I'm at a loss. Any thoughts or guidance are appreciated. Thanks!
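As an aside on that second traceback: pathlib paths are joined with the / operator (or os.path.join), not with +, which is exactly what the TypeError is complaining about. A minimal sketch of the pathlib pattern, reusing the image_folder and title names from the question:
from pathlib import Path

image_folder = Path("./scraped_images")
title = "example.jpg"  # placeholder file name for illustration

# The / operator joins a Path and a str into a new Path object,
# so no string concatenation is needed.
target = image_folder / title
print(target)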

I have changed your code a bit:
from bs4 import BeautifulSoup
import requests
from pathlib import Path
import os
image_folder = Path("./scraped_images/")
if not os.path.isdir(image_folder):
    print('Making %s' % (image_folder))
    os.mkdir(image_folder)
search = input("Search for:")
params = {"q": search}
r = requests.get("http://www.bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "thumb"})
for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("Getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    if img_obj.ok:
        with open('%s/%s' % (image_folder, title), 'wb') as file:
            file.write(img_obj.content)
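As a side note on the directory check: since image_folder is already a pathlib.Path, the same guard can be written without os (a small sketch, not part of the original answer):
from pathlib import Path

image_folder = Path("./scraped_images/")
# Creates the folder if it is missing and does nothing if it already exists.
image_folder.mkdir(parents=True, exist_ok=True)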
You can use PIL, but in this case you do not need it.
I also improved the PIL version of the code to work better:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
from pathlib import Path
s = requests.Session()
image_folder = Path("./scraped_images/")
search = input("Search for:")
params = {"q": search}
r = s.get("http://www.bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "thumb"})
for item in links:
    try:
        img_obj = s.get(item.attrs["href"], headers={'User-Agent': 'Mozilla/5.0'})
        if img_obj.ok:
            print("Getting", item.attrs["href"])
            title = item.attrs["href"].split("/")[-1]
            if '?' in title:
                title = title.split('?')[0]
            img = Image.open(BytesIO(img_obj.content))
            img.save(str(image_folder) + '/' + title, img.format)
        else:
            continue
    except OSError:
        print('\nError downloading %s; try to visit'
              '\n%s\n'
              'manually and get the image yourself.\n' % (title, item.attrs["href"]))
I use a requests Session and added a try/except for the case where PIL can't open the image. I also only try to save an image if the request got a 200 response from the site.
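One more advantage of the Session (besides connection reuse) is that headers can be set once on the session instead of on every request. A small sketch of that pattern, assuming the same Bing search as above:
import requests

s = requests.Session()
# Every request made through this session now sends the User-Agent header.
s.headers.update({'User-Agent': 'Mozilla/5.0'})

r = s.get("http://www.bing.com/images/search", params={"q": "kittens"})  # "kittens" is just a placeholder query
print(r.status_code)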

Related

Python web scraping - image scraping

I faced a problem, shown in the attached photos: when I used this code, the saved images come out broken and do not appear.
from email.mime import image
import requests
from bs4 import BeautifulSoup
import os
url = 'https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc&count=100&start=%22+%22&ref_=adv_nxt'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
images = soup.find_all('img')
for image in images:
    name = image['alt']
    link = image['src']
    with open(name + '.jpg', 'wb') as f:
        im = requests.get(link)
        f.write(im.content)
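A likely cause is that most thumbnails on that IMDb page are lazy-loaded, so their src attribute holds only a tiny placeholder, and that some alt texts contain characters that are not valid in file names. A hedged sketch of a workaround; the loadlate attribute name is an assumption about IMDb's markup and may need checking in the page source:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc&count=100&start=%22+%22&ref_=adv_nxt'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for image in soup.find_all('img'):
    # Prefer the lazy-load attribute if present, falling back to src.
    # (Assumption: the real thumbnail URL lives in 'loadlate'.)
    link = image.get('loadlate') or image.get('src')
    if not link or link.startswith('data:'):
        continue  # skip inline placeholders
    # Strip characters that are illegal or awkward in file names.
    name = re.sub(r'[^\w\- ]', '', image.get('alt', 'image')).strip() or 'image'
    with open(name + '.jpg', 'wb') as f:
        f.write(requests.get(link).content)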

How to scrape simple image from webpage

I am very new to Python.
When I run the code below:
from PIL import Image
import requests
import bs4
url = 'https://parts.bmwmonterey.com/a/BMW_2004_330i-Sedan/_52014_5798240/Cooling-System-Water-Hoses/17_0215.html'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
image = soup.find('img')
image_url = image['src']
img = Image.open(requests.get(image_url, stream = True).raw)
img.save('image.jpg')
I got this error:
Invalid URL '/images/parts/BMW/fullsize/158518.jpg': No schema supplied. Perhaps you meant http:///images/parts/BMW/fullsize/158518.jpg?
In your code, image_url holds only the path of the image on the hosting server, not a complete URL. You need to prepend the domain name to the image_url variable and then use the requests library to download it.
Use the following code and it will work.
import bs4
import requests
url = "https://parts.bmwmonterey.com/a/BMW_2004_330i-Sedan/_52014_5798240/Cooling-System-Water-Hoses/17_0215.html"
resp = requests.get(url)
soup = bs4.BeautifulSoup(resp.text, "html.parser")
img = soup.find('img')
image = img["src"]
img_url = "https://parts.bmwmonterey.com" + str(image)
r = requests.get(img_url)
with open("image.jpg","wb") as f:
f.write(r.content)
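Hard-coding the domain works for this page; a slightly more general variation is to let urllib resolve the relative src against the page URL:
from urllib.parse import urljoin

page_url = "https://parts.bmwmonterey.com/a/BMW_2004_330i-Sedan/_52014_5798240/Cooling-System-Water-Hoses/17_0215.html"
relative_src = "/images/parts/BMW/fullsize/158518.jpg"  # the src value from the error message

# urljoin handles absolute, root-relative, and relative src values alike.
img_url = urljoin(page_url, relative_src)
print(img_url)  # https://parts.bmwmonterey.com/images/parts/BMW/fullsize/158518.jpg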

How to extract img src from web page via lxml in beautifulsoup using python?

I am new to Python and I am working on a web scraping project for Amazon. I have a problem extracting the product img src from the product page via lxml using BeautifulSoup.
I tried the following code to extract it, but it doesn't show the URL of the image.
Here is my code:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.amazon.com/crocs-Unisex-Classic-Black-Women/dp/B0014C0LSY/ref=sr_1_2?_encoding=UTF8&qid=1560091629&s=fashion-womens-intl-ship&sr=1-2&th=1&psc=1'
r = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
s = BeautifulSoup(r.text, "lxml")
img = s.find(class_="imgTagWrapper").img['src']
# I tried this code.
print(img)
I tried this code, but it prints this instead of the image URL:
data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAoHBwgHBgoICAgLCgoLDhgQDg0NDh0VFhEYIx8lJCIfIiEmKzcvJik0KSEiMEExNDk7Pj4+JS5ESUM8SDc9Pjv/2wBDAQoLCw4NDhwQEBw7KCIoOzs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozv/wAARCAG9AM4DASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0t....//
Any help?
What you are seeing there is the base64 encoding of the image, not a URL. What you do with it depends on what you need the image URLs for.
The image you want to grab from that page is available in the value of the data-a-dynamic-image attribute. There are multiple image URLs with different sizes in there. All you need to do now is add a conditional statement to isolate the one containing 395.
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/crocs-Unisex-Classic-Black-Women/dp/B0014C0LSY/ref=sr_1_2?_encoding=UTF8&qid=1560091629&s=fashion-womens-intl-ship&sr=1-2&th=1&psc=1'
r = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
s = BeautifulSoup(r.text, "lxml")
img = s.find(id="landingImage")['data-a-dynamic-image']
img = json.loads(img)
for k, v in img.items():
    if '395' in k:
        print(k)
Output:
https://images-na.ssl-images-amazon.com/images/I/71oNMAAC7sL._UX395_.jpg
If that size is not there, print them all and pick the one that suits your need:
for k, v in img.items():
    print(k)
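If all you have is the base64 src shown in the question, it can also be decoded and saved directly instead of being fetched over the network. A minimal sketch, assuming the data URI is complete (the one pasted in the question is truncated):
import base64

data_uri = "data:image/jpeg;base64,/9j/4AAQSkZJRg..."  # placeholder; paste the full string here
header, encoded = data_uri.split(",", 1)

# Everything after the comma is plain base64-encoded JPEG bytes.
with open("thumbnail.jpg", "wb") as f:
    f.write(base64.b64decode(encoded))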

Get the name of Instagram profile and the date of post with Python

I'm in the process of learning Python 3 and I'm trying to solve a simple task: I want to get the account name and the date of the post from an Instagram link.
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.instagram.com/p/BuPSnoTlvTR')
soup = BeautifulSoup(html.text, 'lxml')
item = soup.select_one("meta[property='og:description']")
name = item.find_previous_sibling().get("content").split("•")[0]
print(name)
This code sometimes works with profile links like https://www.instagram.com/kingtop,
but I need it to also work with image posts like https://www.instagram.com/p/BuxB00KFI-x/.
That's all I could come up with, but it is not working, and I can't get the date either.
Do you have any ideas? I appreciate any help.
I found a way to get the account name. Now I'm trying to find a way to get the upload date:
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import time
from multiprocessing import Pool
from requests.exceptions import HTTPError
start = time.time()
file = open('users.txt', 'r', encoding="ISO-8859-1")
urls = file.readlines()
for url in urls:
    url = url.strip('\n')
    try:
        req = requests.get(url)
        req.raise_for_status()
    except HTTPError as http_err:
        output = open('output2.txt', 'a')
        output.write('not found\n')
    except Exception as err:
        output = open('output2.txt', 'a')
        output.write('not found\n')
    else:
        output = open('output2.txt', 'a')
        soup = BeautifulSoup(req.text, "lxml")
        the_url = soup.select("[rel='canonical']")[0]['href']
        the_url2 = the_url.replace('https://www.instagram.com/', '')
        head, sep, tail = the_url2.partition('/')
        output.write(head + '\n')
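For the upload date, one approach that worked on Instagram post pages at the time was to read the JSON-LD block embedded in the page, which carried an uploadDate field. A heavily hedged sketch, since Instagram's markup changes often and the script tag may no longer be present:
import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.instagram.com/p/BuPSnoTlvTR', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'lxml')

# Assumption: the post page embeds an application/ld+json block containing an
# "uploadDate" key; if it does not, this falls through to the message below.
ld = soup.find('script', type='application/ld+json')
if ld and ld.string:
    data = json.loads(ld.string)
    print(data.get('uploadDate'))
else:
    print('No JSON-LD block found on this page')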

Beautifulsoup image downloading error

I am trying to download the images from the image URLs that come back from a BeautifulSoup scrape. I was trying to get this code to work after reading some other examples, but I'm getting this error:
f.write(requests.get(img))
TypeError: a bytes-like object is required, not 'Response'
The line f.write(requests.get(img)[What goes here?]) is what's causing me trouble now. I use soup = BeautifulSoup(source, 'html.parser'), whereas the reference uses soup = BeautifulSoup(r.content).
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
import urllib.request
from bs4 import BeautifulSoup
import codecs
import sys
import requests
from os.path import basename
lines = open('list.txt').read().splitlines()
for designers in lines:
    for i in range(1):
        url = 'https://www.example.com/Listings?st=' + author + '{}'.format(i)
        source = urllib.request.urlopen(url)
        soup = BeautifulSoup(source, 'html.parser')
        for products in soup.find_all('li', class_='widget'):
            image = products.find('img', class_='lazy-load')
            print(image.get('data-src'))
            img = (image.get('data-src'))
            with open(basename(img), "wb") as f:
                f.write(requests.get(img)[What goes here?])
Thanks!
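For reference, the bytes that the file expects live on the response's .content attribute, so the placeholder in that last line is normally filled in like this (a minimal sketch, not the original poster's final code):
import requests
from os.path import basename

img = "https://www.example.com/some-image.jpg"  # placeholder URL for illustration

# Response.content holds the raw bytes of the body, which is what a file opened in 'wb' mode expects.
with open(basename(img), "wb") as f:
    f.write(requests.get(img).content)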
