Generate Selector from source code, for scrapy - python-3.x

I am trying to create a CSS selector from the source code of a dynamic web page. I have tried with no results with:
response.css('seller-info#region *::text').get()
response.css('seller-info > region *::text').get()
response.css('.seller-info#region ::text').get()
response.css('seller-info#region ::text').get()
response.css('seller-info > region ::text').get()
response.css('seller-info:contains("to extract")::text').get()
response.css('.seller-info:contains("to extract")::text').get()
response.css('.seller-info:contains("to extract") *::text').get()
response.css('seller-info:contains("to extract") *::text').get()
Response of each: "None"
I need the text: "to extract"
*The region name is repeated in other code trees
Source code
<seller-info
username='glorious'
ispro='true'
region="to extract"
phoneurl='/pg/0.gif"'
storeurl=""
seniority=''
category="1220"
phonevisible='true'
>
<div slot="avatar">
<div class="seller-info__header--icon-container">
<i class="icon-yapo icon-briefcase "></i>
</div>
</div>
</seller-info>

The data you are trying to extract is a tag attribute value, not tag text:
region = response.css("seller-info[region]::attr(region)").get()
or:
region = response.css("seller-info::attr(region)").get()
Selectors like tagname::text extract the text between the opening and closing tags, as in <tagname>text to extract</tagname>.
Your <seller-info> tag has no such text; it stores its data in its attributes (much like the src of an img tag).
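The attribute-versus-text distinction can be checked outside Scrapy. The following is a minimal sketch using only the standard library's xml.etree.ElementTree (not Scrapy's selector API), with the attribute list from the question shortened. It assumes the fragment is well-formed enough to parse as XML, which holds for the sample shown but not for arbitrary HTML.

```python
import xml.etree.ElementTree as ET

# Markup based on the question (attribute list shortened for the example).
source = """<seller-info username='glorious' ispro='true' region="to extract" category="1220">
<div slot="avatar">
<div class="seller-info__header--icon-container">
<i class="icon-yapo icon-briefcase "></i>
</div>
</div>
</seller-info>"""

seller = ET.fromstring(source)

# The wanted value is an attribute of the opening tag...
region = seller.get("region")

# ...while the text nodes inside the element hold nothing but whitespace.
inner_text = "".join(seller.itertext()).strip()

print(repr(region))
print(repr(inner_text))
```

This is why a text selector has nothing useful to return here, while an attribute selector such as ::attr(region) does.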

Related

Python + Selenium - Select Drop Down Option using Stored Variable

I have written a python selenium script that selects a state value from a drop down. The HTML for the drop down element is copied below:
<div class="hQSHyh4QFG0Xh0d-6pxTF" tabindex="0" style="height: 238px; display: none;">
<div class="SD_7vnwWhO0KG80czzPb3 option-0 al-option">AL</div>
<div class="SD_7vnwWhO0KG80czzPb3 option-1 ak-option">AK</div>
<div class="SD_7vnwWhO0KG80czzPb3 option-2 as-option">AS</div>
<div class="SD_7vnwWhO0KG80czzPb3 option-3 az-option">AZ</div>
<div class="SD_7vnwWhO0KG80czzPb3 option-4 ar-option">AR</div>
<div class="SD_7vnwWhO0KG80czzPb3 option-5 ca-option">CA</div>
<div class="SD_7vnwWhO0KG80czzPb3 option-59 um-option">UM</div>
</div>
Problem: the automation script locates the same state value ("CA") using a hard-coded xpath statement (See code snippet from script below). Instead, I would like to select the state value using a stored variable called "state".
state_selection = self.driver.find_element_by_xpath("/html/body/div[2]/div/div[2]/div/div/div[2]/div[1]/form/div/div[2]/div[2]/div[3]/div[2]/div/div[3]/div[6]")
state_selection.click()
Additional Notes: I have tried using other methods to locate the state value (see below) but, so far, I have only been successful using the hard-coded xpath above.
I also tried to locate the drop down element using the Selenium Select Method but I got messages telling me that "Select only works on <select> elements, not on 'div' "
driver.findElement(by.xpath("//select[@SD_7vnwWhO0KG80czzPb3='']/option[@value='CA']")).click()
Try selecting the required option by its text content:
state = "CA"
state_selection = self.driver.find_element_by_xpath("//div[.='%s']" % state)
state_selection.click()
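As a rough sketch of how the locator can be built from the variable, the helpers below (option_xpath and option_css are hypothetical names, not part of Selenium) produce an XPath and a CSS selector for a given state. The CSS variant assumes every option carries a class of the form "<state in lower case>-option", as in the markup from the question. The check at the end uses the standard library's ElementTree on a shortened copy of that markup, so it runs without a browser; ElementTree needs a relative path (".//"), whereas Selenium accepts "//div[...]".

```python
import xml.etree.ElementTree as ET


def option_xpath(state):
    # Matches a <div> whose whole text content equals the state code.
    # Assumes `state` contains no single quote.
    return ".//div[.='%s']" % state


def option_css(state):
    # Assumes the "<state>-option" class naming seen in the question's markup.
    return "div.%s-option" % state.lower()


# Shortened copy of the dropdown from the question.
dropdown = ET.fromstring(
    """<div class="hQSHyh4QFG0Xh0d-6pxTF">
<div class="SD_7vnwWhO0KG80czzPb3 option-4 ar-option">AR</div>
<div class="SD_7vnwWhO0KG80czzPb3 option-5 ca-option">CA</div>
</div>"""
)

state = "CA"
match = dropdown.find(option_xpath(state))
print(match.get("class"))
print(option_css(state))
```

In Selenium 4 the same strings would be passed as driver.find_element(By.XPATH, ...) or driver.find_element(By.CSS_SELECTOR, ...), with By imported from selenium.webdriver.common.by.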

How to replace selected string from content editable div?

I'm trying to remove chapter titles from a contenteditable="true" div using Python and selenium-webdriver. First I search for the chapter title, which is usually on the first line; then I replace it with an empty value and save. The code appears to work, but the change is gone after I refresh the browser. Here is my code:
## getting content editable div tag
input_field = driver.find_element_by_css_selector('.trumbowyg-editor')
### getting innerHTML of content editable div
chapter_html = input_field.get_attribute('innerHTML')
chapter_content = input_field.get_attribute('innerHTML')
if re.search(r'<\w*>', chapter_html):
    chapter_content = re.split(r'<\w*>|</\w*>', chapter_html)
    first_chapter = chapter_content[1]
    ### replacing first_chapter with ''
    chapter_replace = chapter_html.replace(first_chapter, '')
    ### writing back innerHTML without first_chapter string
    driver.execute_script("arguments[0].innerHTML = arguments[1];", input_field, chapter_replace)
    time.sleep(1)
    ## click on save button
    driver.find_element_by_css_selector('.btn.save-button').click()
How can I handle this? It works when I do it manually, so it is probably not a site problem or bug. Please help.
Relevant HTML is following:
<div class="trumbowyg-editor" dir="ltr" contenteditable="true">
<p>Chapter 1</p>
<p> There is some text</p>
<p> There is some text</p>
<p> There is some text</p>
</div>
As per the HTML you have shared, to replace the chapter title with an empty value you have to induce WebDriverWait with the expected_conditions clause set to visibility_of_element_located. Note that innerHTML, innerText and textContent are DOM properties rather than attributes, so removeAttribute() has no effect on them; assign an empty string or remove the element instead. You can use the following block of code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
page_number = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='trumbowyg-editor' and @contenteditable='true']/p[contains(.,'Chapter')]")))
# empty the chapter title paragraph
driver.execute_script("arguments[0].textContent = '';", page_number)
#or remove the paragraph element altogether
driver.execute_script("arguments[0].remove();", page_number)
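If you would rather keep the approach from the question (rewriting the innerHTML string in Python), one caveat is that str.replace() removes every occurrence of the title text and leaves an empty <p></p> behind. A minimal sketch that drops only the first paragraph is shown below; drop_first_paragraph is a hypothetical helper, and it assumes the title is always the first <p> in the editor and that paragraphs are not nested.

```python
import re


def drop_first_paragraph(html_text):
    # count=1 limits the substitution to the first <p>...</p>;
    # DOTALL lets the non-greedy match span line breaks inside the paragraph.
    return re.sub(r"<p\b[^>]*>.*?</p>", "", html_text, count=1, flags=re.DOTALL)


# innerHTML as it would look for the markup in the question.
chapter_html = (
    "\n<p>Chapter 1</p>\n"
    "<p> There is some text</p>\n"
    "<p> There is some text</p>\n"
)

chapter_replace = drop_first_paragraph(chapter_html)
print(chapter_replace)
```

The resulting string can then be written back with the same execute_script call used in the question.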

Search entire text for images

I have a problem with a project.
I need to search a string for images.
I want to get the source of the image and modify the html form of the img tag.
For example the image form is:
and I want to change it to:
<div class="col-md-3">
<hr class="visible-sm visible-xs tall" />
<a class="img-thumbnail lightbox pull-left" href="upload/uploader/up_164.jpg" data-plugin-options='{"type":"image"}' title="Image title">
<img class="img-responsive" width="215" src="upload/uploader/up_164.jpg"><span class="zoom"><i class="fa fa-search"></i>
</span></a>
I have done some part of this.
I can find the image, change the form of the html but cannot loop this for all images found in the string.
My code goes like
Using the following function I get the string between two strings
// Get substring between
function GetBetween($var1="",$var2="",$pool){
    $temp1 = strpos($pool,$var1)+strlen($var1);
    $result = substr($pool,$temp1,strlen($pool));
    $dd=strpos($result,$var2);
    if($dd == 0){
        $dd = strlen($result);
    }
    return substr($result,0,$dd);
}
And then I get the image tag from the string
$imageFile = GetBetween("img","/>",$newText);
The next was to filter the source of the image:
$imageSource = GetBetween('src="','\"',$imageFile);
And for the last part I call str_replace to do the job:
$newText = str_replace('oldform', 'newform', $newText);
The problem is that when there is more than one image, I cannot loop this process.
Thank you in advance.
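The simplest and safest way to read an XML or HTML document is to use a parser rather than string functions. A parser visits every img tag for you, so the loop comes for free.
And I think you will save a lot of time.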
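To illustrate the parser-based loop, here is a sketch in Python using the standard library's html.parser (in PHP the same idea would typically be built on DOMDocument). ImageRewriter, rewrite_images and the WRAPPER template are hypothetical names; the template is only loosely modelled on the target markup in the question. The sketch copies the input through unchanged and swaps every img tag for the wrapper, while collecting the sources. Comments and doctype declarations are not copied, which is a simplification.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical replacement template, loosely modelled on the question's markup.
WRAPPER = (
    '<a class="img-thumbnail lightbox pull-left" href="{src}">'
    '<img class="img-responsive" width="215" src="{src}"></a>'
)


class ImageRewriter(HTMLParser):
    """Re-emits the input, replacing every <img> with WRAPPER."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            self.sources.append(src)
            self.out.append(WRAPPER.format(src=escape(src, quote=True)))
        else:
            # Emit the tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... /> or <br/>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != "img":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def rewrite_images(text):
    parser = ImageRewriter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out), parser.sources


sample = '<p>Intro</p><img src="a.jpg" /><p>Middle &amp; more</p><img src="b.png">'
new_text, sources = rewrite_images(sample)
print(sources)
print(new_text)
```

Because the parser calls handle_starttag once per tag, every image is handled without any manual searching for "img" and "/>" in the string.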

Python 3 BeautifulSoup4 search for text in source page

I want to search for every '1' in the source code and print its location, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string. I want to see the tag around that string.
Consider this context for example * :
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True to indicate that you want only element nodes (instead of the child text nodes) and text="1" to indicate that the element must have text content equal to "1" (or any other text you want to search for; newer Beautiful Soup versions call this parameter string):
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: in future questions, try to give a context like the example above. That makes the question more concrete and easier to answer, as people don't have to create a context on their own, which may turn out not to be relevant to the situation you actually have.
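Since the search string is meant to be variable, the lookup can be wrapped in a function. The sketch below is an alternative that does not use Beautiful Soup: it relies on the standard library's xml.etree.ElementTree and therefore assumes the markup is well-formed, as the example context is. elements_with_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def elements_with_text(markup, wanted):
    # Return every element whose own leading text equals `wanted`,
    # ignoring surrounding whitespace.
    root = ET.fromstring(markup)
    return [el for el in root.iter() if (el.text or "").strip() == wanted]


html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""

found = elements_with_text(html, "1")
for el in found:
    # Show the tag name and attributes around the matched string.
    print(el.tag, el.attrib)
```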

getelementsbyID inner dt id values

I am extracting data from HTML using VBScript. This is the HTML code from which I am trying to extract the data.
<dl id="overview">
<dt id="overview-summary-current-title" class="summary-current" style="display:block">
Current
</dt>
<dd class="summary-current" style="display:block">
<ul class="current">
<li>
Software Engineer
<span class="at">at </span>
<a class="company-profile-public" href="/company/ABC Systems?trk=ppro_cprof">
<span class="org summary">ABC Systems</span></a>
</li>
</ul>
</dd>
I asked a similar question previously: "Excel getElementById extract the span class information".
However, in that case I wanted to extract the information corresponding to the dl id, and the markup also had a span id. In this case, I need to extract the information corresponding to the dt id.
In my VBScript, I tried something like this.
Dim openedpage as String
openedpage = iedoc1.getElementById("overview").getElementById("overview-summary-current-title").innerHTML
However, I am getting no output.
I want the output as Software Engineer at ABC systems.
Kindly help me out.
The object returned by getElementById() doesn't have a method .getElementById(), so the following line fails:
.getElementById("overview").getElementById("overview-summary-current-title")
If you don't get any output, not even an error message, you probably have On Error Resume Next somewhere in your script. Please don't use that unless you know exactly what you're doing and have sensible error handling code in place.
Also, the element with the ID "overview-summary-current-title" is this:
<dt id="overview-summary-current-title" class="summary-current" style="display:block">
Current
</dt>
So you couldn't possibly extract the text "Software Engineer at ABC systems" from that element.
Try selecting the first <ul> tag from the element with the ID "overview", and then use the innerText property instead of the innerHtml property:
Set ie = CreateObject("InternetExplorer.Application")
ie.Navigate "..."
While ie.Busy : WScript.Sleep 100 : Wend
Set e1 = ie.document.getElementById("overview")
Set e2 = e1.getElementsByTagName("ul")(0)
WScript.Echo e2.innerText
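The same two-step lookup (find the element with the ID "overview", then take the text of its first ul) can be sketched in Python for comparison. This uses the standard library's xml.etree.ElementTree, not the Internet Explorer DOM, and it assumes the fragment has been made well-formed: the closing </dl> tag and the wrapping <body> element are additions for the example, and the style attributes are omitted.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, closed and wrapped so that it parses as XML.
markup = """<body>
<dl id="overview">
<dt id="overview-summary-current-title" class="summary-current">
Current
</dt>
<dd class="summary-current">
<ul class="current">
<li>
Software Engineer
<span class="at">at </span>
<a class="company-profile-public" href="/company/ABC Systems?trk=ppro_cprof">
<span class="org summary">ABC Systems</span></a>
</li>
</ul>
</dd>
</dl>
</body>"""

root = ET.fromstring(markup)
overview = root.find(".//*[@id='overview']")
first_ul = overview.find(".//ul")

# Join all nested text nodes and collapse runs of whitespace,
# roughly what innerText gives in the browser.
text = " ".join("".join(first_ul.itertext()).split())
print(text)
```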
