I'm trying to get a site's data with requests, then move to the next page and get the data again.
The simplest way would be to append page numbers to the site URL, but the problem is that the URL doesn't change when I go to the next page; there is no page number in it.
I can click the next-page button with Selenium, but I don't know how to get the data, because as far as I know the Selenium driver object doesn't have .text or similar functions.
What can I do?
This is the part of my code that tries to access the site's data:
from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path='/Users/payasystem1/w/samane_tadarokat/chromedriver')

# URL of website
url = "https://etend.setadiran.ir/etend/index.action"
driver.get(url)  # note: driver.get() returns None, not the page content
time.sleep(5)  # let the page actually load

next_button = driver.find_element_by_id('next_tendersGrid_pager')
next_button.click()
time.sleep(5)
driver.quit()
As you can see, I can access the site and move to the next page, but I don't know how to get the displayed table data into my program.
If you know how, please help!
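For what it's worth, the rendered page is available from the driver itself: driver.page_source holds the HTML exactly as the browser currently shows it, and BeautifulSoup can parse that string just like a requests response. A minimal sketch, using a made-up table in place of the real page source read after the next-page click (the actual markup and table id will differ):

```python
from bs4 import BeautifulSoup

# In the real script this string would come from driver.page_source,
# read *after* the next-page click so it contains the new page's rows.
rendered = """
<table id="tendersGrid">
  <tr><td>Tender A</td><td>2024-01-01</td></tr>
  <tr><td>Tender B</td><td>2024-01-02</td></tr>
</table>
"""

soup = BeautifulSoup(rendered, 'html.parser')
# collect each row's cell texts into a list of lists
rows = [[td.get_text() for td in tr.find_all('td')]
        for tr in soup.find_all('tr')]
print(rows)
```

The same two lines (parse driver.page_source, then find_all) can be repeated after every click to read each page in turn.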
I want to extract hashtags from a specific post (given its URL) using BeautifulSoup 4. First I fetch the page with requests, and I've tried find_all() to get every hashtag, but it seems there is a hidden problem.
Here is the code:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
soup = bs(r.content,'html.parser')
items = soup.find_all('a',attrs={'class':' xil3i'})
print(items)
The result of this code is just an empty list. Can someone please help me with the problem?
It looks like the page you are trying to scrape requires JavaScript. This means that some elements of the webpage are not there when you send a GET request.
One way to figure out whether the webpage you are scraping requires JavaScript to populate the info you need is to simply save the HTML into a file:
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
with open('dump.html', 'w+') as file:
    file.write(r.text)
and then open that file in a web browser.
If the file you open does not have the information you want to scrape, then it is likely being populated automatically with JavaScript.
To get around this, you can render the JavaScript using either:
A web driver (like Selenium) that simulates a user visiting those pages in a web browser
requests-HTML, a fairly new package that lets you render the JavaScript on a page and has many other features that are useful for web scraping
A larger group of people work with Selenium, which makes debugging easier than with requests-HTML; but if you do not want to learn a whole new module like Selenium, requests-HTML is very similar to requests, and picking it up should not be difficult.
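For the second option, a minimal requests-HTML sketch might look like the following, reusing the Instagram URL and the xil3i class from the question above (Instagram changes its markup often, so treat the selector as a placeholder):

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link')

# render() runs the page's JavaScript in a headless Chromium
# (downloaded automatically on first use)
r.html.render()

# the JavaScript-populated elements are now present in the parsed HTML
for link in r.html.find('a.xil3i'):
    print(link.text)
```

Note that render() is slow, so it is best called once per page rather than once per element.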
I am working on scraping the actual data of a graph from a site, but the data is inside JavaScript code, stored in a list. Please tell me how to scrape this data using Python.
Click here to see an image of the HTML page.
The image shows a script tag, and inside this tag there is a column[] list; the data is stored in that list.
Please send a solution to this problem.
This is my Python code:
from bs4 import BeautifulSoup
import urllib.request
urlpage = 'http://www.stockgraph.com/'  # not the original URL; see the linked image of the HTML page above
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page,'html.parser')
script=soup.find('script',attrs={'class':'col-md-9 col-md-push-3'})
print(script)
The code above opens the URL and finds the script tag, but I can't scrape the JavaScript code itself.
Please tell me the solution: my data is in a script tag, stored in a list, so how do I scrape it?
To set you off in the right direction, I will try to guide you through what you need to do.
First, you need something to read your webpage, like urllib:
from urllib.request import urlopen  # urllib2.urlopen in Python 2

response = urlopen("http://google.com")
page_source = response.read()
You will then need to parse this code using another module, like BeautifulSoup.
Follow some documentation to get started on scraping your website:
https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
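From there, the array in the script tag can usually be pulled out with a regular expression and parsed. A sketch against a made-up script snippet standing in for the tag's text (the variable name column follows the question; the numbers are invented):

```python
import json
import re

# Stand-in for the <script> text you would get from
# soup.find('script', ...).string on the real page.
script_text = """
var column = [10.5, 11.2, 9.8, 12.0];
drawGraph(column);
"""

# capture the bracketed array assigned to 'column'
match = re.search(r'column\s*=\s*(\[[^\]]*\])', script_text)
data = json.loads(match.group(1))  # this JS array literal is also valid JSON
print(data)
```

This works whenever the array literal is plain numbers or strings; if the page builds the list dynamically, you are back to rendering the JavaScript with Selenium or requests-HTML.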
I am trying to scrape the results after searching for 'Jaipur Pink Panthers' on Google or directly visiting the prokabaddi website. The target is to scrape the table that pops up when you click on any match, giving the total score spread for the entire match.
I have tried using Beautiful Soup and Selenium, but I end up reading nothing with the div class values. Any help in this regard is highly appreciated.
What I have tried so far is as follows (PS: I am absolutely new to Python):
Attempt1:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
soup = BeautifulSoup(driver.page_source,"lxml")
for item in soup.select('.sipk-lb-playerName'):
    [elem.extract() for elem in soup("span")]
    print(item.text)
driver.quit()
Attempt2:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='sipk-lb-playerName')  # note: no leading dot when using class_
Little Background
Websites such as these are built to make the user's life easy by sending only the content that you need at that point in time.
As you move around the website and click on something, the remaining data is sent back to you. It basically works as a demand-based interaction between you and the server.
What is the issue in your code?
In your first approach, you get an empty div list even though you can see that element in the HTML source. The reason is that the element only appears after you click the Player tab on the web page: the new HTML content is generated at that moment, which is why you see it then.
How to do it?
You need to simulate clicking that button before sending the HTML source to BeautifulSoup. So, first find the button using the find_element_by_id() method, then click it.
element = driver.find_element_by_id('player_Btn')
element.click()
Now you have the updated HTML source in your driver object; just pass it to the BeautifulSoup constructor.
soup = BeautifulSoup(driver.page_source, 'html.parser')
You do not need an lxml parser for this. Now, you can look for the specific class and get all the names (which I have done here).
soup.findAll('div',attrs={'class':'sipk-lb-playerName'})
Voila! You can store the returned list and get only the names formatted as you want.
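To make that extract-and-collect step concrete, here is a sketch run against a made-up fragment standing in for driver.page_source (the class name comes from the question; the rank spans and player names are invented):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after element.click();
# real markup on the site will differ.
rendered = """
<div class="sipk-lb-playerName"><span>1</span>Pardeep Narwal</div>
<div class="sipk-lb-playerName"><span>2</span>Pawan Sehrawat</div>
"""

soup = BeautifulSoup(rendered, 'html.parser')
names = []
for div in soup.find_all('div', attrs={'class': 'sipk-lb-playerName'}):
    for span in div.find_all('span'):  # drop the rank spans, as in Attempt 1
        span.extract()
    names.append(div.get_text(strip=True))
print(names)
```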
import requests
from bs4 import BeautifulSoup
url = "http://leagueoflegends.wikia.com/wiki/List_of_items'_stats"
page = requests.get(url).text
pageSoup = BeautifulSoup(page, 'html5lib')
table = pageSoup.find('table',{'class':'wikitable sortable'})
print(table)
I am trying to access the data from a table on a wiki page. I have already accessed a table on another page; however, the return I am getting from the find function on this page is None. Also, when I print all p tags, there is only one p tag in the whole soup, which seems strange to say the least, so I think there might be an error in the way I am accessing the HTML. Any help would be appreciated.
The issue is that there is a dynamic tab mechanism going on in the background. You should select the tab you want and use the link for that. For example, if you would like to see the Offensive tab, you can right-click and open the corresponding page, which will give you the correct URL; with this, your code runs correctly:
url = 'http://leagueoflegends.wikia.com/wiki/List_of_items%27_stats/Offensive'
I can't find the element I need in order to tell Selenium to click it; I believe this is because the page is generated by JavaScript.
Can someone please help? Maybe show me a way to do it and then explain how you found it?
The website I'm working on is www.howlongtobeat.com.
I want Selenium to do the following:
go to http://www.howlongtobeat.com => click on the search tab => enter "God of War (2018)" => click the link that pops up
This is the code I have so far:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from requests import get
from bs4 import BeautifulSoup
url = "http://www.howlongtobeat.com"
driver = webdriver.Chrome()
driver.get(url)
search_element = driver.find_element_by_name("global_search_box")
search_element.clear()
search_element.send_keys("God of War (2018)")
search_element.send_keys(Keys.RETURN)
# this is where my issue is; I don't know what element it is or how to find it
link = driver.find_element_by_link_text("input")
link.click()
It's just the last step I need help with.
Can someone advise?
@Ankur Singh's solution works fine. You can also use a CSS Selector to do the same clicking (I generally prefer CSS Selectors):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "h3.shadow_text")))
element1 = driver.find_element_by_css_selector('h3.shadow_text > a')
element1.click()
time.sleep(3)
driver.quit()
You can use the code below to click on the link:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "search_list_image"))
)
link = driver.find_element_by_link_text("God of War (2018)")
link.click()