Web Scraping of Asynchronous Search Page - python-3.x

I am trying to learn web scraping on asynchronous, JavaScript-heavy sites. I chose a real estate website for that. So, I did the search by hand and came up with the URL as the first step. Here is the URL:
CW_url = "https://www.cushmanwakefield.com/en/united-states/properties/invest/invest-property-search#q=Los%20angeles&sort=%40propertylastupdateddate%20descending&f:PropertyType=[Office,Warehouse%2FDistribution]&f:Country=[United%20States]&f:StateProvince=[CA]"
I then tried to write code to read the page with Selenium and Beautiful Soup:
while iterations < 10:
    time.sleep(5)
    html = driver.execute_script("return document.documentElement.outerHTML")
    sel_soup = bs(html, 'html.parser')
    forsales = sel_soup.findAll("for sale")
    iterations += 1
    print(f'iteration {iterations} - forsales: {forsales}')
I also tried using requests-html:
from requests_html import HTMLSession, HTML
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
r = await asession.get(CW_url)
r.html.arender(wait = 5, sleep = 5)
r.text.find('for sale')
But this gives me -1, which means the text could not be found! r.text does give me a wall of HTML, and inside it there seems to be some JavaScript that has not been run yet:
<script type="text/javascript">
    var endpointConfiguration = {
        itemUri: "sitecore://web/{34F7EE0A-4405-44D6-BF43-13BC99AE8AEE}?lang=en&ver=4",
        siteName: "CushmanWakefield",
        restEndpointUri: "/coveo/rest"
    };
    if (typeof (CoveoForSitecore) !== "undefined") {
        CoveoForSitecore.SearchEndpoint.configureSitecoreEndpoint(endpointConfiguration);
        CoveoForSitecore.version = "5.0.788.5";
        var context = document.getElementById("coveo3a949f41");
        if (!!context) {
            CoveoForSitecore.Context.configureContext(context);
        }
    }
</script>
I thought the fact that the URL contains all the search criteria meant the site would make the fetch request, return the data, and generate the HTML. Apparently not! So, what am I doing wrong, and how should I deal with this or similar sites? Ideally, one would replace the search criteria in CW_url and let the code retrieve and store the data.
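For what it's worth, here is roughly what I was hoping to end up with: an explicit wait for the rendered results instead of fixed sleeps (the result-container CSS selector below is a guess on my part, not something I verified against the page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome()
driver.get(CW_url)

# wait until the Coveo search has injected at least one result into the DOM
# (".CoveoResult" is my guess at the result element's class)
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".CoveoResult"))
)

# search the rendered text instead of looking for a tag named "for sale"
sel_soup = bs(driver.page_source, 'html.parser')
forsales = sel_soup.find_all(string=lambda s: s and 'for sale' in s.lower())
print(forsales)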

Related

How to scrape a website with multiple pages with the same URL address using scrapy-playwright

I am trying to scrape a website with multiple pages with the same URL using scrapy-playwright.
The following script returned only the data of the second page and did not continue to the rest of the pages.
Can anyone suggest how I can fix it?
import scrapy
from scrapy_playwright.page import PageMethod
from scrapy.crawler import CrawlerProcess

class AwesomeSpideree(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            url=f"https://www.cia.gov/the-world-factbook/countries/",
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods={
                    "click": PageMethod('click', selector='xpath=//div[@class="pagination-controls col-lg-6"]//span[@class="pagination__arrow-right"]'),
                    "screenshot": PageMethod("screenshot", path=f"step1.png", full_page=True),
                },
            )
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        print("-" * 80)
        CountryLst = response.xpath("//div[@class='col-lg-9']")
        for Country in CountryLst:
            yield {
                "country_link": Country.xpath(".//a/@href").get()
            }
I see you are trying to fetch the URLs of countries from the above-mentioned URL.
If you inspect the Network tab, you can see there is a single request to a JSON data API. You can fetch all the country URLs from that response.
After that, if you still want to scrape more data from the scraped URLs, you can do it without Playwright, because that data is static.
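Roughly, the idea looks like this. Note this is only a sketch: the JSON endpoint and its layout below are assumptions on my part, so read the real URL off the Network tab rather than trusting the snippet.

import requests

# Assumed Gatsby-style page-data endpoint; confirm the exact URL in the Network tab.
api_url = "https://www.cia.gov/the-world-factbook/page-data/countries/page-data.json"
data = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"}).json()

# Walk the whole payload and collect every string that looks like a country page path,
# since the exact nesting of the JSON is not guaranteed here.
def collect_country_paths(node, found):
    if isinstance(node, dict):
        for value in node.values():
            collect_country_paths(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_country_paths(item, found)
    elif isinstance(node, str) and "/countries/" in node:
        found.add(node)

paths = set()
collect_country_paths(data, paths)
for path in sorted(paths):
    print(path)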
Have a good day :)

Can I render a variable in real time in Flask using Python? [duplicate]

I have a view that generates data and streams it in real time. I can't figure out how to send this data to a variable that I can use in my HTML template. My current solution just outputs the data to a blank page as it arrives, which works, but I want to include it in a larger page with formatting. How do I update, format, and display the data as it is streamed to the page?
import flask
import time, math

app = flask.Flask(__name__)

@app.route('/')
def index():
    def inner():
        # simulate a long process to watch
        for i in range(500):
            j = math.sqrt(i)
            time.sleep(1)
            # this value should be inserted into an HTML template
            yield str(i) + '<br/>\n'
    return flask.Response(inner(), mimetype='text/html')

app.run(debug=True)
You can stream data in a response, but you can't dynamically update a template the way you describe. The template is rendered once on the server side, then sent to the client.
One solution is to use JavaScript to read the streamed response and output the data on the client side. Use XMLHttpRequest to make a request to the endpoint that will stream the data. Then periodically read from the stream until it's done.
This introduces complexity, but allows updating the page directly and gives complete control over what the output looks like. The following example demonstrates that by displaying both the current value and the log of all values.
This example assumes a very simple message format: a single line of data, followed by a newline. The format can be as complex as needed, as long as there's a way to identify each message; for example, each loop could return a JSON object which the client decodes (a sketch of that variant follows the client script below).
from math import sqrt
from time import sleep

from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/stream")
def stream():
    def generate():
        for i in range(500):
            yield "{}\n".format(sqrt(i))
            sleep(1)
    return app.response_class(generate(), mimetype="text/plain")
<p>This is the latest output: <span id="latest"></span></p>
<p>This is all the output:</p>
<ul id="output"></ul>
<script>
    var latest = document.getElementById('latest');
    var output = document.getElementById('output');

    var xhr = new XMLHttpRequest();
    xhr.open('GET', '{{ url_for('stream') }}');
    xhr.send();

    var position = 0;

    function handleNewData() {
        // the response text includes the entire response so far
        // split the messages, then take the messages that haven't been handled yet
        // position tracks how many messages have been handled
        // messages end with a newline, so split will always show one extra empty message at the end
        var messages = xhr.responseText.split('\n');
        messages.slice(position, -1).forEach(function(value) {
            latest.textContent = value;  // update the latest value in place
            // build and append a new item to a list to log all output
            var item = document.createElement('li');
            item.textContent = value;
            output.appendChild(item);
        });
        position = messages.length - 1;
    }

    var timer;
    timer = setInterval(function() {
        // check the response for new data
        handleNewData();
        // stop checking once the response has ended
        if (xhr.readyState == XMLHttpRequest.DONE) {
            clearInterval(timer);
            latest.textContent = 'Done';
        }
    }, 1000);
</script>
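As a sketch of that JSON variant mentioned above (the route name and field names here are illustrative, not part of the original example, and it builds on the app, sqrt, and sleep imports from the server code): the server yields one JSON document per line, and the client calls JSON.parse on each complete line inside handleNewData before updating the DOM.

import json

@app.route("/stream-json")
def stream_json():
    def generate():
        for i in range(500):
            # one JSON object per line; the trailing newline delimits messages
            yield json.dumps({"i": i, "sqrt": sqrt(i)}) + "\n"
            sleep(1)
    return app.response_class(generate(), mimetype="text/plain")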
An <iframe> can be used to display streamed HTML output, but it has some downsides. The frame is a separate document, which increases resource usage. Since it's only displaying the streamed data, it might not be easy to style it like the rest of the page. It can only append data, so long output will render below the visible scroll area. It can't modify other parts of the page in response to each event.
index.html renders the page with a frame pointed at the stream endpoint. The frame has fairly small default dimensions, so you may want to style it further. Use render_template_string, which knows to escape variables, to render the HTML for each item (or use render_template with a more complex template file). An initial line can be yielded to load CSS in the frame first.
from flask import render_template_string, stream_with_context

@app.route("/stream")
def stream():
    @stream_with_context
    def generate():
        yield render_template_string('<link rel=stylesheet href="{{ url_for("static", filename="stream.css") }}">')
        for i in range(500):
            yield render_template_string("<p>{{ i }}: {{ s }}</p>\n", i=i, s=sqrt(i))
            sleep(1)
    return app.response_class(generate())
<p>This is all the output:</p>
<iframe src="{{ url_for("stream") }}"></iframe>
5 years late, but this actually can be done the way you were initially trying to do it; JavaScript is totally unnecessary (Edit: the author of the accepted answer added the iframe section after I wrote this). You just have to embed the output in an <iframe>:
from flask import Flask, render_template, Response
import time, math

app = Flask(__name__)

@app.route('/content')
def content():
    """
    Render the content at a url different from index
    """
    def inner():
        # simulate a long process to watch
        for i in range(500):
            j = math.sqrt(i)
            time.sleep(1)
            # this value should be inserted into an HTML template
            yield str(i) + '<br/>\n'
    return Response(inner(), mimetype='text/html')

@app.route('/')
def index():
    """
    Render a template at the index. The content will be embedded in this template
    """
    return render_template('index.html.jinja')

app.run(debug=True)
Then the 'index.html.jinja' file will include an <iframe> with the content url as its src, which would look something like this:
<!doctype html>
<head>
    <title>Title</title>
</head>
<body>
    <div>
        <iframe frameborder="0"
                onresize="noresize"
                style="background: transparent; width: 100%; height: 100%;"
                src="{{ url_for('content') }}">
        </iframe>
    </div>
</body>
When rendering user-provided data, render_template_string() should be used to escape the content and avoid injection attacks. I left that out of the example because it adds complexity, is outside the scope of the question, and isn't relevant to the OP, who isn't streaming user-provided data; streaming user-provided data is a far edge case that few people will ever need to handle.
Originally I had a similar problem to the one posted here, where a model is being trained and the update should stay in place and be formatted in HTML. The following answer is for future reference, or for people trying to solve the same problem who need inspiration.
A good solution to achieve this is to use an EventSource in JavaScript, as described here. The listener is started using a context variable, such as one set from a form or another source, and stopped by sending a close message. A sleep call stands in for real work in this example for visualization purposes. Lastly, HTML formatting can be achieved using JavaScript DOM manipulation.
Flask Application
import flask
import time

app = flask.Flask(__name__)

@app.route('/learn')
def learn():
    def update():
        yield 'data: Prepare for learning\n\n'
        # Prepare model
        time.sleep(1.0)
        for i in range(1, 101):
            # Perform update
            time.sleep(0.1)
            yield f'data: {i}%\n\n'
        yield 'data: close\n\n'
    return flask.Response(update(), mimetype='text/event-stream')

@app.route('/', methods=['GET', 'POST'])
def index():
    train_model = False
    if flask.request.method == 'POST':
        if 'train_model' in list(flask.request.form):
            train_model = True
    return flask.render_template('index.html', train_model=train_model)

app.run(threaded=True)
HTML Template
<form action="/" method="post">
    <input name="train_model" type="submit" value="Train Model" />
</form>
<p id="learn_output"></p>
{% if train_model %}
<script>
    var target_output = document.getElementById("learn_output");
    var learn_update = new EventSource("/learn");
    learn_update.onmessage = function (e) {
        if (e.data == "close") {
            learn_update.close();
        } else {
            target_output.innerHTML = "Status: " + e.data;
        }
    };
</script>
{% endif %}

How is my scraper being detected immediately by a search engine

I am using Scrapy with Selenium to scrape URLs from a particular search engine (ekoru). Here is a screenshot of the response I get back from the search engine with just ONE request:
Since I am using Selenium, I'd assume my user-agent should be fine, so what else could be making the search engine detect the bot immediately?
Here is my code:
class CompanyUrlSpider(scrapy.Spider):
    name = 'company_url'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://ekoru.org',
            wait_time=3,
            screenshot=True,
            callback=self.parseEkoru
        )

    def parseEkoru(self, response):
        driver = response.meta['driver']
        search_input = driver.find_element_by_xpath("//input[@id='fld_q']")
        search_input.send_keys('Hello World')
        search_input.send_keys(Keys.ENTER)
        html = driver.page_source
        response_obj = Selector(text=html)
        links = response_obj.xpath("//div[@class='serp-result-web-title']/a")
        for link in links:
            yield {
                'ekoru_URL': link.xpath(".//@href").get()
            }
Sometimes you need to pass extra options to avoid being detected by the page.
Let me share some code you can use:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This code helps to simulate a "human being" visiting the website
chrome_options = Options()
chrome_options.add_argument('--start-maximized')
driver = webdriver.Chrome(options=chrome_options, executable_path=r"chromedriver")
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"""
})

url = 'https://ekoru.org'
driver.get(url)
Yields (check out the "Chrome is being controlled..." notice below the address bar):
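If you want the same effect inside the original scrapy-selenium setup rather than a standalone driver, one option is to pass the flags through the driver arguments in settings.py. This is only a sketch: the setting names come from scrapy-selenium's documentation and the AutomationControlled switch is a commonly used anti-detection flag, so verify both against your installed versions.

# settings.py (sketch)
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = 'chromedriver'
SELENIUM_DRIVER_ARGUMENTS = [
    '--start-maximized',
    # commonly used to hide the automation flag (navigator.webdriver) in recent Chrome
    '--disable-blink-features=AutomationControlled',
]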

Unable to get data using requests, no data source in network tab

Hi, I am trying to load https://www.ubytovanienaslovensku.eu/ using the Requests module and BS4, but I am unable to get the required data. It seems the data is loaded using JS, but I can't see any data source in the Chrome DevTools Network tab.
import requests
import bs4
import lxml
url ='https://www.ubytovanienaslovensku.eu'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.content, 'lxml')
print(soup.get_text())
I can see that the data on the site loads on the fly, but I don't see any source for that data.
I need the listings on the site, not just the basic HTML, which only has script tags.
It's coming from a websocket so you have to search the WS message panel:
You're not going to be able to get that with requests. You could try Selenium.
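A minimal Selenium sketch along those lines (untested against the site; the .rental-item and span.caption selectors are taken from the pyppeteer answer below):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.ubytovanienaslovensku.eu/")

# wait until the websocket-fed listings have been rendered into the DOM
WebDriverWait(driver, 30).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".rental-item"))
)

for item in driver.find_elements(By.CSS_SELECTOR, ".rental-item"):
    title = item.find_element(By.CSS_SELECTOR, "span.caption").text
    link = item.get_attribute("href")  # each listing container is itself a link
    print(title, link)

driver.quit()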
You can use pyppeteer in combination with asyncio to get the listings asynchronously from that site.
import asyncio
from pyppeteer import launch
url = "https://www.ubytovanienaslovensku.eu/"
async def get_listings(link):
wb = await launch({"headless": False})
[page] = await wb.pages()
await page.goto(link)
await page.waitForSelector('#home-rentals')
containers = await page.querySelectorAll('.rental-item')
for container in containers:
title = await container.querySelectorEval('span.caption','e => e.innerText')
link = await page.evaluate('e => e.href',container)
print(title,link)
asyncio.get_event_loop().run_until_complete(get_listings(url))
The output looks like:
VIP SK Drevenica - Liptovská Štiavnica (max. 75) https://www.ubytovanienaslovensku.eu/chalupky-u-babky
VIP SK Drevenica - Mýto pod Ďumbierom (max. 28) https://www.ubytovanienaslovensku.eu/chata-zinka
VIP SK Drevenica - Liptovský Trnovec (max. 72) https://www.ubytovanienaslovensku.eu/liptovske-chaty
VIP SK Drevenica - Ružomberok (max. 90) https://www.ubytovanienaslovensku.eu/chaty-liptovo

Trying to scrape a website that requires login

So I'm new to this and have been at it for almost a week now, trying to scrape a website I use to collect analytics data (think of it like Google Analytics).
I tried playing around with XPath to figure out what this script is able to pull, but all I get is "[]" as the output after running it.
Please help me find what I'm missing.
import requests
from lxml import html

# credentials
payload = {
    'username': '<my username>',
    'password': '<my password>',
    'csrf-token': '<auth token>'
}

# open a session with login
session_requests = requests.session()
login_url = '<my website>'
result = session_requests.get(login_url)

# passing the auth token
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='form_token']/@value")))[0]
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)

# scrape the analytics dashboard from this event link
url = '<my analytics webpage url>'
result = session_requests.get(
    url,
    headers=dict(referer=url)
)

# print output using xpath to find and load what i need
trees = html.fromstring(result.content)
bucket_names = trees.xpath("//*[@id='statistics_dashboard']/div[1]/text()")
print(bucket_names)
print(result.ok)
print(result.status_code)
..........
This is what I get as a result:
[]
True
200
Process finished with exit code 0
Which is a big step for me, because I've been getting so many errors to get to this point.
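One thing worth flagging in the snippet above (a guess, since the site and form are redacted): the token scraped from the login form is assigned to authenticity_token but never sent back in the POST, and the payload key may need to match the form field's own name. A sketch of that change, assuming the field really is called form_token:

# fetch the login page and pull the fresh token out of the form
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = tree.xpath("//input[@name='form_token']/@value")[0]

payload = {
    'username': '<my username>',
    'password': '<my password>',
    'form_token': authenticity_token,  # field name assumed from the XPath above
}
result = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))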
