I have been reading through a lot of documentation around PRAW and bs4, and I've had a look at other people's examples of how to do this, but I just can't get anything working the way I would like. I thought it would be a pretty simple script, but every example I find is either written in Python 2 or just doesn't work at all.
I would like a script to download the top 10 images from a given Subreddit and save them to a folder.
If anyone could point me in the right direction that would be great.
Cheers
The high-level flow will look something like this:
Iterate through the top posts of your subreddit.
Extract the url of the submission.
Check if the url is an image.
Save the image to your desired folder.
Stop once you have 10 images.
Here's an example of how this could be implemented:
import urllib.request

import praw

# Build the Reddit instance first (the credentials below are placeholders -
# fill in your own script app's values)
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="top-image-downloader")

subreddit = reddit.subreddit("aww")
count = 0

# Iterate through top submissions
for submission in subreddit.top(limit=None):
    # Get the link of the submission
    url = str(submission.url)
    # Check if the link is an image
    if url.endswith(("jpg", "jpeg", "png")):
        # Retrieve the image and save it in the current folder,
        # keeping the original file extension
        urllib.request.urlretrieve(url, f"image{count}.{url.rsplit('.', 1)[-1]}")
        count += 1
    # Stop once you have 10 images
    if count == 10:
        break
I'm trying to download images with Python 3.9.1
Other than the first 2-3 images, all the images are 1 KB in size. How do I download all the pictures? Please help me.
Sample book: http://web2.anl.az:81/read/page.php?bibid=568450&pno=1
import urllib.request
import os

bibID = input("ID: ")
first = int(input("First page: "))
last = int(input("Last page: "))

if not os.path.exists(bibID):
    os.makedirs(bibID)

for i in range(first, last + 1):
    url = f"http://web2.anl.az:81/read/img.php?bibid={bibID}&pno={i}"
    urllib.request.urlretrieve(url, f"{bibID}/{i}.jpg")
It doesn't look like there is an issue with your script itself. It has to do with the API you are hitting and the sequence of requests it requires.
A GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page> on its own doesn't work right away; instead, it returns "No FILE HERE".
The reason this happens is that retrieval of the images is tied to your cookie. You first need to initiate the read session that is created when you first visit the page and click the TƏSDİQLƏYIRƏM button.
From what I could tell, you need to do the following:
POST http://web2.anl.az:81/read/page.php?bibid=568450 with a Content-Type: multipart/form-data body. It should have a single key/value pair of approve: TƏSDİQLƏYIRƏM. This starts a session and generates a cookie for you, which has to be sent along with all of your API calls from now on.
E.g.
requests.post('http://web2.anl.az:81/read/page.php?bibid=568450', files=dict(approve='TƏSDİQLƏYIRƏM'))
Do the following in your for-loop of pages:
a. GET http://web2.anl.az:81/read/page.php?bibid=568450&pno=<page number> - page won't show up if you don't do this first
b. GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page number> - finally get the image!
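Putting the two steps together, here is a rough sketch using requests.Session, which keeps the cookie from step 1 for you (the endpoints and the approve form field are the ones described above; everything else is just glue):
import os
import requests

bibID = "568450"
first, last = 1, 10

if not os.path.exists(bibID):
    os.makedirs(bibID)

session = requests.Session()

# Step 1: approve the read session so the server sets the required cookie
session.post(f"http://web2.anl.az:81/read/page.php?bibid={bibID}",
             files=dict(approve="TƏSDİQLƏYIRƏM"))

for i in range(first, last + 1):
    # Step 2a: request the page first, otherwise the image is not served
    session.get(f"http://web2.anl.az:81/read/page.php?bibid={bibID}&pno={i}")
    # Step 2b: now fetch the actual image
    img = session.get(f"http://web2.anl.az:81/read/img.php?bibid={bibID}&pno={i}")
    with open(f"{bibID}/{i}.jpg", "wb") as f:
        f.write(img.content)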
I'm using Selenium to extract comments from YouTube.
Everything went well, but when I print comment.text, the output is only the last sentence.
I don't know how to save it for further analysis (cleaning and tokenization).
from selenium import webdriver
import time

path = "/mnt/c/Users/xxx/chromedriver.exe"
This is the path where I downloaded and saved chromedriver.
chrome = webdriver.Chrome(path)
url = "https://www.youtube.com/watch?v=WPni755-Krg"
chrome.get(url)
chrome.maximize_window()

# scroll down
sleep = 5
chrome.execute_script('window.scrollTo(0, 500);')
time.sleep(sleep)
chrome.execute_script('window.scrollTo(0, 1080);')
time.sleep(sleep)

text_comment = chrome.find_element_by_xpath('//*[@id="contents"]')
comments = text_comment.find_elements_by_xpath('//*[@id="content-text"]')
comment_ids = []
Try this approach for getting the text of all comments. (The for-loop part has been edited; there was no indentation in the previous code.)
for comment in comments:
    comment_ids.append(comment.get_attribute('id'))
    print(comment.text)
When I print, I can see all the texts here, but how can I open them for further study? Should I always use a for loop? I want to tokenize the texts, but the output is only the last sentence. Is there a way to save a text file with all the texts inside it and open it again? I googled it a lot but wasn't successful.
So it sounds like you're just trying to store these comments to reference later. Your current solution is to append them to a string and use a token to create substrings? I'm not too familiar with Python's data structures, but this sounds like a great job for an array or a list, depending on how you plan to reference this data.
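For example, here is a minimal sketch (assuming comments is the list of elements found above, and that one comment per line is good enough for your analysis): collect each comment's text into a Python list, write the list to a plain text file, and read it back later for cleaning and tokenization.
all_comments = []
for comment in comments:
    all_comments.append(comment.text)

# Write one comment per line
with open('comments.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(all_comments))

# Later: read the comments back for cleaning and tokenization
with open('comments.txt', 'r', encoding='utf-8') as f:
    loaded_comments = f.read().splitlines()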
I've been trying to scrape data for a project from the Times for the last 7 hours, and yes, it has to be done without the API. It's been a war of attrition, but this code, which looks like it should check out, keeps returning NaNs; am I missing something simple? Towards the bottom of the page is every story contained within the front page: little cards that each have an image, 3 article titles, and their corresponding links. It either doesn't grab a thing, partially grabs it, or grabs something completely wrong. There should be about 35 cards with 3 links apiece, for 105 articles. I've gotten it to recognize 27 cards, with a lot of NaNs instead of strings, and none of the individual articles.
import csv, requests, re, json
from bs4 import BeautifulSoup

handle = 'http://www.'
location = 'ny'
ping = handle + location + 'times.com'
pong = requests.get(ping, headers={'User-agent': 'Gordon'})
soup = BeautifulSoup(pong.content, 'html.parser')

# upper cards attempt
for i in soup.find_all('div', {'class': 'css-ki19g7 e1aa0s8g0'}):
    print(i.a.get('href'))
    print(i.a.text)
    print('')

# lower cards attempt
count = 0
for i in soup.find_all('div', {"class": "css-1ee8y2t assetWrapper"}):
    try:
        print(i.a.get('href'))
        count += 1
    except:
        pass

print('current card pickup: ', count)
print('the goal card pickup:', 35)
Everything clickable uses "css-1ee8y2t assetWrapper", but when I find_all I'm only getting 27 of them. I wanted to start from css-guaa7h and work my way down, but it only returns NaNs. Other promising but fruitless divs are:
div class="css-2imjyh" data-testid="block-Well" data-block-tracking-id="Well"
div class="css-a11566"
div class="css-guaa7h"
div class="css-zygc9n"
div data-testid="lazyimage-container" # for images
Current attempt:
<h3 class="css-1d654v4">Politics
My hope is running out. Why is just getting a first job harder than working hard labor?
I checked their website and it's using AJAX to load the articles as soon as you scroll down. You'll probably have to use Selenium. Here's an answer that might help you do that: https://stackoverflow.com/a/21008335/7933710
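As a rough sketch of that approach (assuming you still want the same assetWrapper cards from your code; that class name comes from your question and may well have changed since), you can scroll the page with Selenium and then hand the rendered HTML to BeautifulSoup:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.nytimes.com/')

# Scroll in steps so the lazily loaded cards get a chance to render
for _ in range(10):
    driver.execute_script('window.scrollBy(0, 2000);')
    time.sleep(1)

soup = BeautifulSoup(driver.page_source, 'html.parser')
cards = soup.find_all('div', {'class': 'css-1ee8y2t assetWrapper'})
print('cards found:', len(cards))
driver.quit()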
I want to automatically download a video from CNN's website using Python 3.
The address is https://edition.cnn.com/cnn10 .
Every weekday, CNN puts a video on that website.
I know how to find the video's real URL manually using the Chrome browser:
Open the address in the Chrome browser, then press F12.
Choose the Network tab, then click the PLAY button to start playing the video.
Then I can see the real URL in the network requests.
The video's real URL is https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/13/caption/ten-0914.cnn_2253601_768x432_1300k.mp4
The following Python 3 code can download that video:
import requests

print("Download started!")
url = 'https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/13/caption/ten-0914.cnn_2253601_768x432_1300k.mp4'
r = requests.get(url, stream=True)
with open('20180914.mp4', "wb") as mp4:
    for chunk in r.iter_content(chunk_size=768*432):
        if chunk:
            mp4.write(chunk)
print("Download over!")
My problem is how to get that URL using Python or some other automatic way, because I want to download these videos automatically every weekday.
What I have already tried:
I looked for a solution on the Internet but failed.
Then I collected some of the videos' URLs and tried to find a regular pattern in them, like this:
https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/11/caption/ten-0912.cnn_2250840_768x432_1300k.mp4
https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/10/caption/ten-0911.cnn_2249562_768x432_1300k.mp4
Obviously, the date part corresponds to each weekday, but there are 7 "random" digits in the URL. I still cannot make sense of these numbers!
Any help will be appreciated!
Thank you!
I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before I did anything else, I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the list is empty. I've looked at the XPath docs and I've tried various alterations to the XPath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself, and wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')
or
tree.xpath('//table//tr//td//div//img[@src]')
or
tree.xpath('//img[@src]') # 68 images
The key to this is building up slowly: find all the images, then find the images wrapped in the tag you are interested in, and so on, until you are confident you can find only the images you are interested in.
Note that the [@src] allows us to access the source of that image. Using this post we can now download any/all images we want:
import shutil
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
cool_images = tree.xpath('//a[@target=\'_blank\']//img[@src]')
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg' # path on disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup; it has helped with my own amateur web-scraping ventures. Have a look at this post for a relevant starting point.
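As a rough illustration of the same idea with Beautiful Soup (this is not from the linked post; it just mirrors the lxml example above by finding every img inside a link that opens in a new tab):
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.bvmjets.com/')
soup = BeautifulSoup(page.content, 'html.parser')

# Find every <a target="_blank"> and print the src of any image inside it
for a in soup.find_all('a', target='_blank'):
    img = a.find('img')
    if img and img.get('src'):
        print(page.url + img['src'])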
This may not be the answer you are looking for, but hopefully it is a starting point / of some use to you. Best of luck!