How can I retrieve data from a web page with a loading screen? - python-3.x

I am using the requests library to retrieve data about a user's progress from nitrotype.com/racer/insert_name_here with the following code:
import requests
base_url = 'https://www.nitrotype.com/racer/'
name = 'test'
url = base_url + name
page = requests.get(url)
print(page.text)
However, my problem is that this retrieves the HTML of the loading screen; I want the data that appears after the loading screen finishes.
Is this possible, and if so, how?

This is likely because the page is loaded dynamically with JavaScript, which can easily be worked around by using selenium or pyppeteer.
In my example, I have used pyppeteer to spawn a browser and run the page's JavaScript so that I can obtain the required information.
Example:
import asyncio
import pyppeteer

async def main():
    # launches a Chromium browser; can use Chrome instead of Chromium as well
    browser = await pyppeteer.launch(headless=False)
    # creates a blank page
    page = await browser.newPage()
    # navigates to the requested page and runs the site's dynamic code
    await page.goto('https://www.nitrotype.com/racer/tupac')
    # provides the html content of the page
    cont = await page.content()
    await browser.close()
    return cont

# prints the html code of the user profile: tupac
print(asyncio.get_event_loop().run_until_complete(main()))
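If pyppeteer is not an option, the same idea works with Selenium. This is a minimal sketch, assuming a compatible chromedriver is installed and on PATH:

```python
def rendered_html(url):
    # import inside the function so the sketch can be read without
    # selenium installed at module import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)            # runs the page's JavaScript
        return driver.page_source  # HTML after rendering
    finally:
        driver.quit()
```

Calling rendered_html('https://www.nitrotype.com/racer/tupac') would then return the page's HTML after the loading screen has run.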

Related

Unable to fetch web elements in pytest-bdd behave in python

So, I'm in the middle of creating a test-automation framework using pytest-bdd and behave in Python 3.10.
I have some code, but I'm not able to fetch the web element from the portal, and the error I'm getting in the console doesn't say anything useful about why.
Let me share the code here too for better understanding.
demo.feature
Feature: Simple first feature
  #test1
  Scenario: Very first test scenario
    Given Launch Chrome browser
    When open my website
    Then verify that the logo is present
    And close the browser
test_demo.py
from behave import *
from selenium import webdriver
from selenium.webdriver.common.by import By
# from pytest_bdd import scenario, given, when, then
import time
import variables
import xpaths
from pages.functions import *
import chromedriver_autoinstaller

# @scenario('../features/demo.feature', 'Very first test scenario')
# def test_eventLog():
#     pass

@given('Launch Chrome browser')
def given_launchBrowser(context):
    launchWebDriver(context)
    print("DEBUG >> Browser launches successfully.")

@when('open my website')
def when_openSite(context):
    context.driver.get(variables.link)
    # context.driver.get(variables.nitsanon)
    print("DEBUG >> Portal opened successfully.")

@then('verify that the logo is present')
def then_verifyLogo(context):
    time.sleep(5)
    status = context.driver.find_element(By.XPATH, xpaths.logo_xpath)
    # status = findElement(context, xpaths.logo_xpath)
    print('\n\n\n\n', status, '\n\n\n\n')
    assert status is True, 'No logo present'
    print("DEBUG >> Logo validated successfully.")

@then('close the browser')
def then_closeBrowser(context):
    context.driver.close()
variables.py
link = 'https://nitin-kr.onrender.com/'
xpaths.py
logo_xpath = "//*[@id='logo']/div"
requirements.txt
behave~=1.2.6
selenium~=4.4.3
pytest~=7.1.3
pytest-bdd~=6.0.1
Let me know if you need any more information. I'm very eager to create an automation testing framework without using any OOP concepts.
The problem is just that I'm not able to fetch the web elements: Selenium methods like find_element(By.XPATH, XPATH).send_keys(VALUE) don't work for me.
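One likely culprit, offered here as a hedged aside: Selenium's find_element returns a WebElement (or raises NoSuchElementException); it never returns True, so the `assert status is True` step always fails even when the logo is found. A sketch of the step rewritten to assert on the element's state (the helper name is illustrative):

```python
def verify_logo(driver, xpath):
    # find_element returns a WebElement or raises NoSuchElementException;
    # comparing the result with `is True` always fails, so assert on the
    # element's state instead
    element = driver.find_element("xpath", xpath)  # By.XPATH == "xpath" in Selenium 4
    assert element.is_displayed(), "No logo present"
    return element
```

The same applies to the step function in the question: asserting `status.is_displayed()` instead of `status is True` would make the scenario pass when the logo is actually present.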

How to get download link to the file, in the firefox downloads via selenium and python3

I can't get a link in the webpage; it's generated automatically by JavaScript. But I can get the Firefox download window after clicking on the href (it's a JS script that returns an href).
How can I get the link in this window using Selenium? If I can't do this, is there any other way to get the link (there is no explicit link in the HTML DOM)?
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # 2 means custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp') # location is tmp
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("yourwebsite")
element = browser.find_element_by_id('yourLocator')
href = element.get_attribute("href")
Now you have the website in your href.
Use the code below to navigate to the URL:
browser.get(href)
You can go for the following approach:
Get WebElement's href attribute using WebElement.get_attribute() function
href = your_element.get_attribute("href")
Use WebDriver.execute_script() function to evaluate the JavaScript and return the real URL
url = driver.execute_script("return " + href + ";")
Now you should be able to use the urllib or requests library to download the file. If your website requires authentication, don't forget to obtain the cookies from the browser instance and add the relevant Cookie header to the request that downloads the file.
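A minimal sketch of that last step, assuming the cookies exported by driver.get_cookies() are passed in as the plain dicts Selenium returns:

```python
import requests

def session_from_cookies(cookie_dicts):
    """Build a requests.Session carrying cookies exported from a Selenium
    driver via driver.get_cookies(), so authenticated downloads work."""
    session = requests.Session()
    for cookie in cookie_dicts:
        session.cookies.set(cookie["name"], cookie["value"],
                            domain=cookie.get("domain"))
    return session
```

A session built this way then carries the browser's session cookies, so session.get(url) downloads the file with the same authentication the browser had.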

Python - Web scraping using HTML tags

I am trying to scrape a web page to list the jobs posted at this URL: https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad
The following is observed by inspecting the web page (details were shared as a screenshot):
Each job listed is in an HTML li with class="jobs-list-item". The li contains a parent div with the following attributes and data:
data-ph-at-job-title-text="Software Engineer II",
data-ph-at-job-category-text="Engineering",
data-ph-at-job-post-date-text="2018-03-19T16:33:00".
The 1st child div (class="information") within the parent div has the job URL:
href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II"
The 3rd child div (class="description au-target") within the parent div has the short job description.
My requirement is to extract below information for each job
Job Title
Job Category
Job Post Date
Job Post Time
Job URL
Job Short Description
I have tried the following Python code to scrape the webpage, but I am unable to extract the required information.
import requests
from bs4 import BeautifulSoup

def ms_jobs():
    url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
    resp = requests.get(url)
    if resp.status_code == 200:
        print("Successfully opened the web page")
        soup = BeautifulSoup(resp.text, 'html.parser')
        print(soup)
    else:
        print("Error")

ms_jobs()
If you want to do this via requests, you need to reverse engineer the site. Open the dev tools in Chrome, select the Network tab and fill out the form.
This will show you how the site loads the data. If you dig into the site you'll see that it grabs the data by doing a POST to this endpoint: https://careers.microsoft.com/widgets. It also shows you the payload the site uses. The site uses cookies, so all you have to do is create a session that keeps the cookie, get one, and copy/paste the payload.
This way you'll be able to extract the same json data that the JavaScript fetches to populate the site dynamically.
Below is a working example of what that would look like. All that's left is to parse out the json as you see fit.
import requests
from pprint import pprint

# create a session to grab a cookie from the site
session = requests.Session()
r = session.get("https://careers.microsoft.com/us/en/")

# these params are the ones the dev tools show the site sets when using the form
payload = {
    "lang": "en_us",
    "deviceType": "desktop",
    "country": "us",
    "ddoKey": "refineSearch",
    "sortBy": "",
    "subsearch": "",
    "from": 0,
    "jobs": "true",
    "counts": "true",
    "all_fields": ["country", "state", "city", "category", "employmentType", "requisitionRoleType", "educationLevel"],
    "pageName": "search-results",
    "size": 20,
    "keywords": "",
    "global": "true",
    "selected_fields": {"city": ["Hyderabad"], "country": ["India"]},
    "sort": "null",
    "locationData": {}
}

# this is the endpoint the site uses to fetch json
url = "https://careers.microsoft.com/widgets"
r = session.post(url, json=payload)
data = r.json()
job_list = data['refineSearch']['data']['jobs']
# job_list will hold 20 jobs (you can set the "size" parameter in the payload
# to a higher number if you please - I tested 100, and that returned 100 jobs)
job = job_list[0]
pprint(job)
Cheers.
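To pull the fields the question asks for out of each job dict, something like this sketch could work. The key names here are assumptions about the widget payload and should be checked against one pprint-ed job first:

```python
def summarize_jobs(job_list,
                   fields=("title", "category", "postedDate", "descriptionTeaser")):
    # key names are guesses at the widget payload; pprint one job to confirm
    return [{field: job.get(field) for field in fields} for job in job_list]
```

Keys that don't exist in a job dict simply come back as None, so a wrong guess is easy to spot and correct.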

Python POST opens a new tab when logging in. How do I get the new tab?

Currently using Requests to try to log into https://mypay.dfas.mil/mypay.aspx. My problem is I don't know how to get the page this generates in a new tab upon login.
import requests
url = 'https://mypay.dfas.mil/mypay.aspx'
payload = {'visLogin': 'id', 'visPin': 'pass'}
r = requests.post(url, data=payload)
print(r.text)
This gives me the same page back because the result opens in a new tab. I just need to get past this part; everything else is generated in the same tab.
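A hedged sketch of how this is usually approached: requests has no concept of browser tabs, so whatever the new tab shows is just the response to some HTTP request. Find that request in the browser dev tools (Network tab) and replay it with the same Session, so the cookies set at login are reused. The next_url parameter below is a placeholder for whatever URL the dev tools reveal, not a real mypay address:

```python
import requests

def login_and_follow(login_url, credentials, next_url):
    # one Session keeps the cookies set at login, which is what actually
    # links the "new tab" page to the logged-in state
    session = requests.Session()
    session.post(login_url, data=credentials)
    # the "new tab" is just another GET; its URL is visible in dev tools
    return session.get(next_url)
```

Note that ASP.NET pages often also require hidden form fields (such as __VIEWSTATE) to be scraped from the login page and included in the POST data.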

In memory web page browsing

Is there any way to execute the load events of a web page's JavaScript after scraping the HTML, without any browser? I.e., I need to scrape web content rendered via JavaScript; for example, videos on the BBC News page are rendered via JavaScript after page load, and I am interested in scraping the video link and short description. http://www.bbc.co.uk/news/video_and_audio/
No, as far as I know. If the content is rendered by JavaScript, you need a browser. It is possible to automate a browser: http://seleniumhq.org/
I often do this using webkit:
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()