How to compose the URL for a reverse Google image search / Google Lens? - google-image-search

I found only a partial solution to my problem here: How to compose the URL for a reverse Google image search?, but I have a different problem.
Definition of my problem:
Is it possible to get a "find source image" result in a single request?
I used the query URL: https://lens.google.com/uploadbyurl?url=...
Result:
https://lens.google.com/search?p=AU55jv3E4ntmru0pcNu6hvpmWRmVBGSySvJzSFuemkt4iWkZomxXpAWgJkP4FnXLkQyyHtVLP7gFi74_MefFAcz4A8_IwuUlV3rkR1FBMz2Dp0wwI2wML5lvNYOP8kCJ_3Vua1QtP7m42HfSqtqmvYcl1cSscIgPh5QTBqfroYlo3BM-DD9EvpvMRmtlFDuhmnv2bnSuZlhPIoFj4WDI90j9CK26cjigJXCBARCc8OZbSNhMEFAmVTbgLA%3D%3D#lns=W251bGwsbnVsbCxudWxsLG51bGwsbnVsbCxudWxsLG51bGwsIkVrY0tKREE0WkRnMU5HRmhMVEEzWkRjdE5HUXdaaTFpWVRaakxUazBZVE16TURNek9UUmlZaElmWTNkbE9XMXJiazlTVjFsWldVTjFORzVoV2swMk1sY3lVRU5hWDFGNFp3PT0iXQ==
Need:
https://www.google.com/search?tbs=sbi:AMhZZiu6KbInqo_1cD4Va1l7bT3HeIPDYP9q3C5Wx4dMh4dbRAYgtwdMqfsqLSHS1KO8TxuE8uGJEZtA4_114knq5ZKrmFLM-ukVx8Khkn10-1Pr_1EqP3Nui-9KKGGsCDVbi9YlBwVz4jz0kIFMTV-t7Cc_1ZZzeVMFUw
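For reference, composing the first-step URL itself is simple. Below is a minimal sketch that only reproduces the uploadbyurl request described above (the image URL is a made-up example); it does not produce the tbs=sbi: URL shown under "Need":
from urllib.parse import urlencode

image_url = "https://example.com/some-image.jpg"  # hypothetical image URL
lens_url = "https://lens.google.com/uploadbyurl?" + urlencode({"url": image_url})
print(lens_url)
# prints "https://lens.google.com/uploadbyurl?url=https%3A%2F%2Fexample.com%2Fsome-image.jpg"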

Related

How to scrape dynamic content with Selenium in Python?

I'm trying to scrape the following page: https://www.heo.co.uk/uk/en/product/FRYU40156
I understand that this is a dynamic website, so the classic Requests & Beautiful Soup combo might not work. I'm using Selenium and I do get more information, but I can't seem to get the element shown here: (screenshot of the part I'm trying to get)
I've tried By.TAG_NAME with 'h1', but it does not work.
Any ideas on how I can get the header?
Thanks!
If you go to the Network tab you will see the following API:
https://www.heo.co.uk/api/article
Use that API to get the header value:
import requests

# POST the article number and language to the product API and parse the JSON response
payload = {
    "articleNumber": "FRYU40156",
    "language": "en"
}
r = requests.post("https://www.heo.co.uk/api/article", data=payload).json()

# The product title is in the article's localization block
print(r['article']['localization']['deName'])
output:
Jujutsu Kaisen 0: The Movie Hikkake PVC Statue Satoru Gojo 10 cm
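If you'd still rather go through Selenium as in the question, a minimal sketch with an explicit wait might look like the following. It assumes the heading really is an h1 element rendered by the page's JavaScript; the API route above avoids the browser entirely.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://www.heo.co.uk/uk/en/product/FRYU40156")
    # Wait up to 15 seconds for the dynamically rendered <h1> to appear
    header = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.TAG_NAME, "h1"))
    )
    print(header.text)
finally:
    driver.quit()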

Image download with Python

I'm trying to download images with Python 3.9.1
Other than the first 2-3 images, all images are 1 kb in size. How do I download all pictures? Please, help me.
Sample book: http://web2.anl.az:81/read/page.php?bibid=568450&pno=1
import urllib.request
import os

bibID = input("ID: ")
first = int(input("First page: "))
last = int(input("Last page: "))

# Create a folder named after the book ID to hold the downloaded pages
if not os.path.exists(bibID):
    os.makedirs(bibID)

# Fetch each page image and save it as <bibID>/<page>.jpg
for i in range(first, last + 1):
    url = f"http://web2.anl.az:81/read/img.php?bibid={bibID}&pno={i}"
    urllib.request.urlretrieve(url, f"{bibID}/{i}.jpg")
Doesn't look like there is an issue with your script. It has to do with the APIs you are hitting and the sequence required.
A GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page> just on its own doesn't work right away; instead, it returns "No FILE HERE".
The reason this happens is that the retrieval of the images is linked to your cookie: you first need to initiate the read session that is generated when you first visit the page and click the TƏSDİQLƏYIRƏM button.
From what I could tell you need to do the following (a runnable sketch follows these steps):
POST http://web2.anl.az:81/read/page.php?bibid=568450 with a Content-Type: multipart/form-data body. It should have a single key/value pair of approve: TƏSDİQLƏYIRƏM - this starts a session and generates a cookie for you, which you have to send with all of your API calls from now on.
E.g.
requests.post('http://web2.anl.az:81/read/page.php?bibid=568450', files=dict(approve='TƏSDİQLƏYIRƏM'))
Do the following in your for-loop of pages:
a. GET http://web2.anl.az:81/read/page.php?bibid=568450&pno=<page number> - page won't show up if you don't do this first
b. GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page number> - finally get the image!
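Here is that sketch: a minimal requests.Session version of the steps above. The session object keeps the cookie set by the approve POST for the later requests; the book ID, page range, and file names are only illustrative.
import os
import requests

bibID = "568450"     # example book ID from the question
first, last = 1, 5   # example page range

session = requests.Session()

# Step 1: "click" the approve button so the server sets the session cookie
session.post(
    f"http://web2.anl.az:81/read/page.php?bibid={bibID}",
    files={"approve": "TƏSDİQLƏYIRƏM"},
)

os.makedirs(bibID, exist_ok=True)

for i in range(first, last + 1):
    # Step 2a: open the page first, otherwise img.php returns "No FILE HERE"
    session.get(f"http://web2.anl.az:81/read/page.php?bibid={bibID}&pno={i}")

    # Step 2b: now fetch the image with the same session and save it
    r = session.get(f"http://web2.anl.az:81/read/img.php?bibid={bibID}&pno={i}")
    with open(f"{bibID}/{i}.jpg", "wb") as f:
        f.write(r.content)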

How to open "partial" links using Python?

I'm working on a webscraper that opens a webpage, and prints any links within that webpage if the link contains a keyword (I will later open these links for further scraping).
For example, I am using the requests module to open "cnn.com", and then trying to parse out all href/links within that webpage. Then, if any of the links contain a specific word (such as "china"), Python should print that link.
I could simply open the main page using requests, save all hrefs into a list ('links'), and then use:
links = [...]
keyword = "china"
for link in links:
    if keyword in link:
        print(link)
However, the problem with this method is that the links I originally parsed out aren't full links. For example, all links on CNBC's webpage are structured like this:
href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"
But for CNN's page, they're written like this (not full links... they're missing the part that comes before the "/"):
href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
This is a problem because I'm writing more script to automatically open these links to parse them. But Python can't open
"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
because it isn't a full link.
So, what is a robust solution to this (something that works for other sites too, not just CNN)?
EDIT: I know the links I wrote as an example in this post don't contain the word "China", but these are just examples.
Try using the urljoin function from the urllib.parse package. It takes two parameters: the first is the URL of the page you're currently parsing, which serves as the base for relative links; the second is the link you found. If the link you found starts with http:// or https://, it'll return just that link; otherwise it will resolve the URL relative to what you passed as the first parameter.
So for example:
#!/usr/bin/env python3
from urllib.parse import urljoin

print(
    urljoin(
        "https://www.cnbc.com/",
        "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
    )
)
# prints "https://www.cnbc.com/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

print(
    urljoin(
        "https://www.cnbc.com/",
        "http://some-other.website/"
    )
)
# prints "http://some-other.website/"

Get a video URL on a web page when I click the PLAY button, then use Python 3 to download it

I want to automatically download a video from CNN's website using Python 3.
The address is https://edition.cnn.com/cnn10 .
Every weekday CNN puts a new video on that page.
I know how to find the video's real URL manually with the Chrome browser:
Open the address in Chrome and press F12 to open the developer tools.
Switch to the Network tab, then click the PLAY button to start playing the video.
The video's real URL then shows up in the network requests; in this case it is:
https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/13/caption/ten-0914.cnn_2253601_768x432_1300k.mp4
The following Python 3 code can download that video:
import requests

print("Download is starting!")
url = 'https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/13/caption/ten-0914.cnn_2253601_768x432_1300k.mp4'

# Stream the response and write it to disk in chunks
r = requests.get(url, stream=True)
with open('20180914.mp4', "wb") as mp4:
    for chunk in r.iter_content(chunk_size=768*432):
        if chunk:
            mp4.write(chunk)
print("Download over!")
My problem is how to get that URL using Python or some other automatic way, because I want to download these videos automatically every weekday.
What I have already tried:
I looked for a solution on the Internet but failed.
Then I collected some of the video URLs and tried to find a regular pattern in them, for example:
https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/11/caption/ten-0912.cnn_2250840_768x432_1300k.mp4
https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/10/caption/ten-0911.cnn_2249562_768x432_1300k.mp4
Obviously, the date part corresponds to each weekday, but there are 7 "random" digits in each URL, and I still can't figure out what they mean!
Any help will be appreciated!
Thank you!

Why does this simple Python 3 XPath web scraping script not work?

I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before I did anything else I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the result is an empty list. I've looked at the XPath docs and I've tried various alterations to the xpath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself, and wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')
or
tree.xpath('//table//tr//td//div//img[@src]')
or
tree.xpath('//img[@src]') # 68 images
The key to this is building up slowly: find all the images, then find the images wrapped in the tag you are interested in, and so on, until you are confident you can find only the images you are interested in.
Note that the [@src] lets us access the source of that image. Using this post we can now download any or all of the images we want:
import shutil
from lxml import html
import requests

page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)

# Select images that sit inside links opening in a new tab
cool_images = tree.xpath("//a[@target='_blank']//img[@src]")

# Build an absolute URL from the page URL and the image's relative src
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg'  # path on disk

# Stream the image straight to disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup. For me, this has helped my amateur web scraping ventures. Have a look at this post for a relevant starting point.
This may not be the answer you are looking for, but hopefully it is a starting point or of some use to you - best of luck!
