Hello everyone, I would like to know how to get an image through XPath. I have the following code, which downloads an image using the direct link to the .jpg file:
import requests
url = 'https://www.elesquiu.com/u/portadas/tapas/7349.jpg'
myfile = requests.get(url)
with open('ESQUIU.jpg', 'wb') as f:
    f.write(myfile.content)
The problem is that the file name (7349.jpg) changes unpredictably, so I cannot rely on a fixed link and need to locate the image on the page through XPath instead. Can someone help me with this? Thanks in advance.
The page is https://www.elesquiu.com
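One possible approach, sketched under assumptions: instead of hard-coding the file name, read the homepage HTML and pick the img element whose src contains the stable part of the path (/u/portadas/tapas/, taken from the URL in the question). The sample markup and the names CoverFinder and find_cover_url are hypothetical; the real page structure was not checked. With lxml, a roughly equivalent XPath would be //img[contains(@src, "/u/portadas/tapas/")]/@src.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CoverFinder(HTMLParser):
    """Collect the src of every <img> whose path contains a given fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if self.fragment in src:
            self.matches.append(src)


def find_cover_url(html, base_url, fragment="/u/portadas/tapas/"):
    """Return the absolute URL of the first matching image, or None."""
    finder = CoverFinder(fragment)
    finder.feed(html)
    if not finder.matches:
        return None
    return urljoin(base_url, finder.matches[0])


# Hypothetical markup, for illustration only.
sample_html = '<div><img src="/img/logo.png"><img src="/u/portadas/tapas/7350.jpg"></div>'
cover_url = find_cover_url(sample_html, "https://www.elesquiu.com")
print(cover_url)
```

In practice the html argument would be requests.get('https://www.elesquiu.com').text, and the returned URL can be downloaded the same way as in the question.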
I'm building an Instagram bot using Selenium.
How do I extract the URL of a page using Python?
For example, Selenium loads a web page and I want to extract the URL of that particular page (say, https://instagram.com/as80df67s4).
If that is still unclear, please check the image below, where I have highlighted the page link. How do I extract that link?
From webdriver.py:
    def current_url(self):
        """
        Gets the URL of the current page.

        :Usage:
            driver.current_url
        """
        return self.execute(Command.GET_CURRENT_URL)['value']
This means that to get the current URL you can use:
your_url = driver.current_url
The page has to be opened first (for example with driver.get), otherwise there is no meaningful URL to return.
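As a small sketch of what you might do with that value: the helper below takes anything that exposes current_url and returns just the path part of the URL. The names current_path and FakeDriver are hypothetical; FakeDriver only stands in for a real Selenium driver so the example runs without a browser.

```python
from urllib.parse import urlparse


def current_path(driver):
    """Return the path of the page the driver is on, without surrounding slashes."""
    return urlparse(driver.current_url).path.strip("/")


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""
    current_url = "https://instagram.com/as80df67s4"


print(current_path(FakeDriver()))
```

With a real driver you would call driver.get(...) first and then pass the driver itself to current_path.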
I want to extract hashtags from a specific post (given its URL) using BeautifulSoup4. First I fetch the page using requests, then I call find_all() to get every hashtag, but something is going wrong that I cannot see.
Here is the code:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
soup = bs(r.content,'html.parser')
items = soup.find_all('a',attrs={'class':' xil3i'})
print(items)
The result of this code is just an empty list. Can someone please help me with this problem?
It looks like the page you are trying to scrape requires JavaScript. This means that some elements of the page are not yet there when you send a GET request.
One way to find out whether the page needs JavaScript to populate the information you want is to simply save the HTML to a file:
import requests
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
with open('dump.html', 'w', encoding='utf-8') as file:
    file.write(r.text)
Then open that file in a web browser.
If the file does not contain the information you want to scrape, it is likely filled in by JavaScript after the page loads.
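The same check can be done in code rather than by eye. This is only a sketch: is_in_static_html is a hypothetical helper, and both sample strings are made-up responses, not real Instagram markup.

```python
def is_in_static_html(html, marker):
    """Return True if the marker text appears in the raw (unrendered) HTML."""
    return marker.lower() in html.lower()


# Hypothetical responses, for illustration only.
static_page = '<a class="tag" href="/explore/tags/python/">#python</a>'
js_page = '<div id="react-root"></div><script src="/bundle.js"></script>'

print(is_in_static_html(static_page, "#python"))
print(is_in_static_html(js_page, "#python"))
```

Passing r.text and a piece of text you can see in the browser tells you quickly whether requests alone is enough.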
To get around this, you can render the JavaScript using one of the following:
A web driver (like Selenium), which simulates a user visiting those pages in a web browser
requests-HTML, a relatively new package that can render JavaScript on a page and has many other features useful for web scraping
Selenium has a larger user community, which makes debugging easier than with requests-HTML. However, if you do not want to learn a new module like Selenium, requests-HTML is very similar to requests and should not be difficult to pick up.
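A minimal sketch of the Selenium route, under several assumptions: Chrome and a matching driver are installed, the page finishes rendering without an explicit wait, and no login wall appears. render_page and extract_hashtags are hypothetical helper names. Selenium is imported inside render_page so that the text helper can be tried without it.

```python
import re


def render_page(url):
    """Load a page in a real browser and return the HTML after scripts have run.

    Requires Selenium and a matching browser driver to be installed.
    """
    from selenium import webdriver  # imported lazily; only needed for this function

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def extract_hashtags(text):
    """Return the hashtags found in a piece of text, in order, without the '#'."""
    return re.findall(r"#(\w+)", text)


# Offline example on hypothetical caption text.
print(extract_hashtags("Sunset walk #nature #photo_of_the_day"))
```

The regular expression is meant for visible text (for example the caption extracted with BeautifulSoup from the rendered HTML), not for raw HTML, where character entities and colour codes would also match.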
Sorry for bothering you with my request. I have recently started learning web scraping with the BeautifulSoup library. Because I have to download some data from the OECD's websites, I wanted to try some web-scraping approaches. More specifically, I want to download a .csv file from the following page:
https://goingdigital.oecd.org/en/indicator/50/
As you can see, the data can easily be downloaded by clicking on 'Download data'. However, because I will have to repeat the download in a loop, I tried to do it directly from the Python console. By inspecting the page, I found the download URL shown in the following picture:
Hence, I wrote the following code:
from bs4 import BeautifulSoup
import requests
from requests import get
url = 'https://goingdigital.oecd.org/en/indicator/50/'
response = get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
containers = html_soup.find_all('div', {'class': 'css-cqestz e12cimw51'})
print(type(containers))
print(len(containers))
d = []
for a in containers[0].find_all('a', href=True):
    print(a['href'])
    d.append(a['href'])
The object containers has three elements, since there are three divs with the specified class. The first one (the one I selected in the loop) should be the one containing the URL I am interested in. However, I get no result. In contrast, when I select the third element of containers I get the following output:
https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://twitter.com/intent/tweet?text=OECD%20Going%20Digital%20Toolkit&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://www.linkedin.com/shareArticle?mini=true&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
mailto:?subject=OECD%20Going%20Digital%20Toolkit%3A%20Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet&body=Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet%0A%0Ahttps%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
By the way, I suspect this download could be related to the following thread. Thank you in advance!
When you pull data from a website, you should first check whether the content you are looking for is in the page source. If it is not in the page source, you should try web scraping with Selenium.
When I examined the site you mentioned, I could not find the link in the page source, which indicates that the link you want is created dynamically.
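To make that check concrete, here is a sketch that lists every link present in the static HTML and looks for a .csv among them. LinkCollector and static_links are hypothetical names, and page_source is an invented fragment that mimics the situation described (share links present, download link missing); it is not the actual OECD markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def static_links(html):
    """Return all hrefs found in the raw HTML, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page source: the share link is static, the download link is not there.
page_source = '<div><a href="https://twitter.com/intent/tweet">Share</a><button>Download data</button></div>'
links = static_links(page_source)
print(links)
print(any(link.endswith(".csv") for link in links))
```

If no .csv link shows up when this is run on response.text, the link is most likely added by a script, and a browser-driven tool is needed to see it.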
I am trying to scrape the actual data behind a graph on a site. The data sits inside JavaScript code, where it is stored in a list. How can I scrape this data using Python?
See the linked image of the page's HTML.
The image shows a script tag that contains a column[] list, and the data is stored in that list.
Please suggest a solution to this problem.
This is my Python code:
from bs4 import BeautifulSoup
import urllib.request
urlpage = 'http://www.stockgraph.com/'  # not the original URL; see the linked image of the HTML page above
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page,'html.parser')
script=soup.find('script',attrs={'class':'col-md-9 col-md-push-3'})
print(script)
The code above opens the URL and looks for the script tag, but I cannot get at the JavaScript code inside it.
My data is in a script tag, stored in a list. How can I scrape it?
To get you off in the right direction, I will try to guide you in what you need to do.
First you need something to fetch your web page, such as urllib (urllib.request in Python 3):
import urllib.request
response = urllib.request.urlopen("http://google.com")
page_source = response.read()
You will then need to parse this source with another module, such as BeautifulSoup.
The following tutorial can get you started on scraping your website:
https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
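Once the page source is in hand, a list embedded in a script tag can often be pulled out without running any JavaScript. The sketch below assumes the list is written as valid JSON (double quotes, no trailing commas) and is assigned to a known name; the name columns, the sample markup, and the helper extract_js_list are all hypothetical, since the real page was only shown as an image.

```python
import json
import re

# Hypothetical page fragment; the variable name "columns" is an assumption.
sample_html = """
<script>
var chart = { columns: [["date", "2019-01-01", "2019-01-02"], ["price", 10.5, 11.2]] };
</script>
"""


def extract_js_list(html, name):
    """Find `name: [...]` or `name = [...]` in the page and parse the list as JSON."""
    # The lookahead leaves the match ending exactly at the opening bracket.
    match = re.search(r"\b" + re.escape(name) + r"\s*[:=]\s*(?=\[)", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so nested lists are handled correctly.
    data, _ = json.JSONDecoder().raw_decode(html, match.end())
    return data


columns = extract_js_list(sample_html, "columns")
print(columns)
```

If the list uses single quotes or other JavaScript-only syntax, json will raise an error and a more tolerant parser would be needed.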
I'm trying to scrape each panel shown in the screenshot, but I can't find the right XPath for those parts. Can anyone help me, please?
https://www.seloger.com/annonces/achat/appartement/paris-15eme-75/saint-lambert/142632059.htm?cp=75&idtt=2,5&idtypebien=2,1&LISTING-LISTpg=2&naturebien=1,2,4&tri=initial&
This data comes from an additional request to https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=142632059. That request returns JSON with all the information.
UPD:
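import re
from scrapy import Request

def parse(self, response):  # a callback inside your spider
    # extract the listing ID from the page URL
    url_id = re.search(r'/(\d+)\.htm', response.url).group(1)
    details_url = 'https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}'
    # request the JSON details for that listing
    yield Request(details_url.format(url_id))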
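The URL-building step can also be tried on its own, outside Scrapy. This is a sketch only: details_url_for is a hypothetical helper, and whether the JSON endpoint still responds as described is an assumption.

```python
import re

DETAILS_URL = "https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}"


def details_url_for(page_url):
    """Build the JSON details URL from a listing page URL, or return None."""
    match = re.search(r"/(\d+)\.htm", page_url)
    if match is None:
        return None
    return DETAILS_URL.format(match.group(1))


listing = "https://www.seloger.com/annonces/achat/appartement/paris-15eme-75/saint-lambert/142632059.htm?cp=75"
print(details_url_for(listing))
```

The resulting URL can then be fetched with any HTTP client and decoded as JSON.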