I need to scrape data from tags on a page whose articles each contain further nested DOM elements.
The articles are repeated and they have an xpath as:
//*[@id="post_page"]/div/div[2]/main/div/div/div/div[2]/div[2]/div/div[3]/div/article[N]
where 'N' represents the Nth article.
And within each article, the xpath for the element I'm interested in is:
/div/div/div/div/div/div/div[3]/div[1]/button[1]/span
The first thing I did was to use
Elements = driver.find_elements(By.XPATH, <first_path>)
And it fetched me all the articles in the page. PS: I did not add [N] because that would only fetch a specific article, and I'm interested in all.
Then, for each element in the list, I used find_element using the second path as follows:
for elem in Elements:
    Required.append(elem.find_element(By.XPATH, <second_path>))
Where Required is a list in which I'll be storing the data. And this is where I got the element does not exist error.
I also tried adding a . before <second_path> but that didn't solve the issue either.
The complete xpath of the element is:
//*[@id="post_page"]/div/div[2]/main/div/div/div/div[2]/div[2]/div/div[3]/div/article[N]/div/div/div/div/div/div/div[3]/div[1]/button[1]/span
And the CSS Selector for the same is:
#post_page > div > div._UuSG.w77Za._21rSD._3SBW4 > main > div > div > div > div._UuSG._ayWa._3dGg1.Vlb1o._1vyTb > div._UuSG.qzupC._3cqkW > div > div:nth-child(3) > div > article:nth-child(N) > div > div > div > div > div > div > div._UuSG._3VzCT._2FoTG > div._UuSG._3dGg1._2VJFi._2h1-g > button:nth-child(1) > span
I also tried an approach using a loop where I increment a counter variable and use that as N for the whole xpath, but that didn't seem to work either. Got the same error.
Any help would be greatly appreciated.
EDIT[1]
The last span has the following class names:
<span class="_UuSG _3_54N a8-QN _2cSLK L4pn5 RiX17">Stuff I need</span>
Which are unique (collectively) in the page. This information might be relevant somehow.
I think I know your problem. When you do
Elements = driver.find_elements(By.XPATH, <first_path>)
you have already found all the elements you need here. So in your for loop, just use elem, no more "finding" is needed.
for elem in Elements:
    Required.append(elem)
I would use .// to select along the descendant-or-self axis starting from the current node (. means the current node).
You have already tried with ./, which is pretty close.
xpath ".//span", what does the dot mean?
What is meaning of .// in XPath?
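A minimal sketch of the relative-path idea, using Python's standard-library xml.etree.ElementTree as a stand-in for Selenium so that it runs without a browser. The markup, the class name "target", and the variable names are made up for illustration. In Selenium the inner call would be along the lines of elem.find_element(By.XPATH, './/span[...]'); without the leading dot, an XPath that starts with // is evaluated from the document root rather than from elem.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two articles, each with one nested span of interest.
PAGE = """
<main>
  <article><div><button><span class="target">first</span></button></div></article>
  <article><div><button><span class="target">second</span></button></div></article>
</main>
"""

root = ET.fromstring(PAGE)

# Step 1: collect every article (no [N] index, so all of them match).
articles = root.findall(".//article")

# Step 2: search relative to each article; the leading "." anchors the
# path at the current node and "//" then descends to any depth below it.
required = [article.find(".//span[@class='target']").text for article in articles]

print(required)
```

Each lookup in step 2 only sees the subtree of its own article, which is the behaviour the two-step find_elements / find_element approach relies on.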
Related
I am trying to get list values in:
Though I can extract them with,
for r in g.find_elements(By.XPATH,'//ul//li[contains(@data-bib-id, "bib")]'):
    print(r.text)
the attribute (data-bib-id) is not always the same and I am trying to make my scraping task as generic as possible. So, is there a way that I can extract the same info when the exact attribute is not known? That is, li showing up under a ul or ol or div with an attribute value containing a subtext "bib" or "ref"?
In case li elements can be below ul or ol or div maybe we can omit this detail and start the expression with //li?
If so, you can add a logical or operator on the data-bib-id attribute value in your XPath expression, so it will be as follows:
for r in g.find_elements(By.XPATH,'//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")]'):
    print(r.text)
If you need to limit the search so that the li must be a descendant of a ul, ol, or div element, write one path per parent and combine them with the XPath union operator |, so your expression can be as follows:
for r in g.find_elements(By.XPATH, '//ul//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")] | //ol//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")] | //div//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")]'):
    print(r.text)
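If the attribute name itself is unknown, XPath 1.0 can test every attribute with @*, for example //li[@*[contains(., "bib") or contains(., "ref")]]. The sketch below applies the same filter in plain Python to a made-up snippet, using the standard-library ElementTree (which does not support contains(), so the substring check is done in Python). The markup, attribute names, and the helper is_reference_item are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the interesting attribute has a different name each time.
SNIPPET = """
<div>
  <ul>
    <li data-bib-id="bib1">Smith 2001</li>
    <li class="nav">Home</li>
  </ul>
  <ol>
    <li id="ref-2">Jones 2010</li>
  </ol>
</div>
"""

KEYWORDS = ("bib", "ref")


def is_reference_item(li):
    """Return True if any attribute value of this li contains a keyword."""
    return any(key in value for value in li.attrib.values() for key in KEYWORDS)


root = ET.fromstring(SNIPPET)

# root.iter("li") visits every li at any depth, like //li in XPath.
matches = [li.text for li in root.iter("li") if is_reference_item(li)]

print(matches)
```

A plain substring test is broad: an attribute value such as "preface" also contains "ref", so a stricter pattern may be needed on real pages.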
I want to get the text which is inside the span. However, I am not able to achieve it. The text is nested as ul > li > span > a > span. I am using selenium with python.
Below is the code which I tried:
departmentCategoryContent = driver.find_elements_by_class_name('a-list-item')
departmentCategory = departmentCategoryContent.find_elements_by_tag_name('span')
after this, I am just iterating departmentCategory and printing the text using .text i.e
[ print(x.text) for x in departmentCategory ]
However, this is generating an error: AttributeError: 'list' object has no attribute 'find_elements_by_tag_name'.
Can anyone tell me what I am doing wrong and how I can get the text?
Problem:
As far as I understand, departmentCategoryContent is a list, not a single WebElement, so it doesn't have the find_elements_by_tag_name() method.
Solution:
You can choose one of the two ways below:
1. Loop over the departmentCategoryContent list first, then call find_elements_by_tag_name() on each element.
2. Save time with one single statement, using find_elements_by_css_selector():
departmentCategory = driver.find_elements_by_css_selector('.a-spacing-micro.apb-browse-refinements-indent-2 .a-list-item span')
[ print(x.text) for x in departmentCategory ]
Test on devtool:
Explanation:
Your locator .a-list-item span returns every span tag inside an element that has the class a-list-item. That matches 88 items, including unwanted tags.
So you need a more specific locator to exclude the other divs. In this case, I added two more classes: .a-spacing-micro.apb-browse-refinements-indent-2
You're looping over the wrong thing. You want to loop through the 'a-list-item' list and find a single span element that is a child of that webElement. Try this:
departmentCategoryContent = driver.find_elements_by_class_name('a-list-item')
for x in departmentCategoryContent:
    print(x.find_element_by_tag_name('span').text)
Note that the second DOM search is a find_element (not find_elements), which returns a single WebElement, not a list.
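The difference between the list and its items can be sketched without a browser. The example below uses the standard-library ElementTree on made-up markup: findall plays the role of find_elements (it returns a list) and find plays the role of find_element (it returns one element). Recent Selenium releases spell the equivalent calls find_elements(By.CLASS_NAME, ...) and find_element(By.TAG_NAME, ...).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup following the ul > li > span > a > span nesting.
MARKUP = """
<ul>
  <li><span class="a-list-item"><a><span>Books</span></a></span></li>
  <li><span class="a-list-item"><a><span>Music</span></a></span></li>
</ul>
"""

root = ET.fromstring(MARKUP)

# findall returns a plain Python list, which has no search methods of its own.
items = root.findall(".//*[@class='a-list-item']")

# Search inside each element of the list instead of on the list itself.
texts = [item.find(".//span").text for item in items]

print(texts)
```

Calling items.find(...) directly would fail with an AttributeError for the same reason as in the question: the method lives on the elements, not on the list that holds them.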
I am trying to do some web scraping reading some lines inside a html page. I need to look for a text which is repeated through the page inside some <span> elements. In the following example I would like to end with an array of strings with ['Text number 1','Text number 2','Text number 3']
<html>
...
<span>Text number 1</span>
...
<span>Text number 2</span>
...
<span>Text number 3</span>
...
</html>
I have the following code
sElements = ' ... span'; // I declare the selector.
cs = await page.$$(sElements); // I get an array of ElementHandle
The selector is working as in Google Chrome developer tools it captures exactly the 3 elements I am looking for. Also the cs variable is filled with an array of three elements. But then I am trying
for(c in cs)
console.log(c.innerText);
But undefined is logged. I have tried with .text .value .innerText .innerHTML .textContent ... I do not know what I am missing as I think this is really simple
I have also tried this with the same undefined result.
cs = await page.$$eval(sElements, e => e.innerHTML);
Here is an example that would get the innerText of the last span element.
let spanElement;
spanElement = await this.page.$$('span');
spanElement = spanElement.pop();
spanElement = await spanElement.getProperty('innerText');
spanElement = await spanElement.jsonValue();
If you still are unable to get any text then ensure the selector is correct and that the span elements have an innerText defined (not outerText). You can run $(selector) in Chrome console to check.
I'm trying to parse through an awful website and I need some help with using cheerio.
I know that if I for example want to get html of a body of a html I do
$('body','html').html();
How do I descend through multiple elements?
(What if I want to get html > body > font > table > tbody > tr ?)
!! Have to be careful with all these elements being immediate children, I do not want to catch some other nonimmediate children (for example if table > table existed)
You could just do:
$('html > body > font > table > tbody > tr').html()
You can select just like with jQuery selectors
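The same distinction between immediate children and arbitrary descendants can be sketched in Python with the standard-library ElementTree, where each / step matches direct children only (like > in a CSS selector) and // matches at any depth. The markup is made up, with a nested table to show the difference.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a table nested inside another table's cell.
DOC = """
<html><body><font>
  <table><tbody>
    <tr><td>
      outer
      <table><tbody><tr><td>inner</td></tr></tbody></table>
    </td></tr>
  </tbody></table>
</font></body></html>
"""

html = ET.fromstring(DOC)

# Each "/" descends exactly one level: html > body > font > table > tbody > tr
direct_rows = html.findall("./body/font/table/tbody/tr")

# ".//tr" matches rows at any depth, including the nested table's row.
all_rows = html.findall(".//tr")

print(len(direct_rows), len(all_rows))
```

Only the outer row is reached by the child-only path, which is the guarantee the html > body > font > table > tbody > tr selector gives.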
I have a SP Dataview that I have converted to XSLT, so that I could add a header displaying a percentage (Complete). Before I converted the dvwp to xslt, I added two count headers- one on Complete, and another on LastName. They worked wonderfully- showing me the # of records and the # of records with a value in the complete field. However, when I converted the dv to xslt I realized that I lost my headers :(
So, I am adding them back in using xslt. Currently the XPath code for the equation that I have is <xsl:value-of select="count($Rows) div count($Rows)" />.
How do I get the total # of Yes values that are in my Complete field?
UPDATE1:
Found this http://www.endusersharepoint.com/STP/viewtopic.php?f=14&t=534 and tried it, however causes the following error- Failed setting processor stylesheet: 0x80004005: Argument 1 must return a node-set. -->count(/dsQueryResponse/Rows/Row='Y')<--
UPDATE2:
Complete is the name of a field w/i my XSLT dataset. The return type is either Y or blank. For grins I tried <xsl:value-of select="count(/xpath/to/parent/element[@Complete eq 'Y']) div count($Rows)" /> however I received the following error- Failed setting processor stylesheet: 0x80004005: Expected token ']' found 'NAME'.count((/xpath/to/parent/element[@Complete -->eq <--'Y']) div count($Rows) Am starting to think that there may be a problem w/ 'eq'.... Referencing my XML operators...
UPDATE3:
<xsl:value-of select="count(/xpath/to/parent/element[@Complete = 'Y']) div count($Rows)" />
Okay so it still says 0, but I think the reason why it's not showing the correct answer is b/c it is expecting to show an integer, and obviously the value being returned from the equation is going to be a decimal... Have been fiddling with the equation in XPath... here's what I've tried-
count(/xpath/to/parent/element[@Complete = 'Y']) div count($Rows)*100
(count(/xpath/to/parent/element[@Complete = 'Y']) div count($Rows))*100
100(count(/xpath/to/parent/element[@Complete = 'Y']) div count($Rows))
UPDATE4:
So I now know that my previous thought (that the correct number was not showing b/c it was a float) is wrong, as all numbers in XPath and XSLT 1.0 are floats. Reference
UPDATE5:
Upon further investigation, I have found that the problem lies with the count(/xpath/to/parent/element[@Complete = 'Y']) part of my equation, as this is returning 0 instead of a value. [i know i have at least 3 'Y' vals in my Complete col]
UPDATE6:
<records*>
<record*>
<last_name></last_name>
<first_name></first_name>
<mi></mi>
<office_symbol></office_symbol>
<geo_location></geo_location>
<complete></complete>
<date_complete></date_complete>
<date_expires></date_expires>
<email></email>
<supervisor></supervisor>
</record*>
</records*>
*i don't know what these nodes are called as my data is coming from a database and not an xml file, i just made up record/records
UPDATE7
Going back to my original question. I am still trying to find out the XPath equation to display the number of parents (record in the XML i posted above) where the complete node = Y.
UPDATE8
Ok. So I have edited and tested using http://www.w3schools.com/xsl/tryxslt.asp?xmlfile=cdcatalog&xsltfile=tryxsl_value-of. Working XSLT to count the # of Complete = Y is <xsl:value-of select="count(catalog/cd [complete = 'Y'])" /> so then I put EXACTLY what works on W3schools into my SP Dataview and I get nothing... just an empty space. Why doesn't the code work in my SPDV?
If your "Complete" field is an element:
<xsl:value-of select="count(/xpath/to/complete/field/element[string(.) = 'Yes'])"/>
If your complete field is an attribute of an element:
<xsl:value-of select="count(/xpath/to/parent/element[@complete = 'Yes'])"/>
Without knowing the structure of your XML I can't provide the specific XPath required; the predicate "[]" is what selects only the "Yes" values.
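As a sanity check of the predicate logic outside SharePoint, the count and percentage can be sketched in Python with the standard-library ElementTree. The records/record/complete names follow the placeholder structure from UPDATE6 and are not the real Dataview node names; the sample rows are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample data using the placeholder element names from the question.
DATA = """
<records>
  <record><last_name>A</last_name><complete>Y</complete></record>
  <record><last_name>B</last_name><complete></complete></record>
  <record><last_name>C</last_name><complete>Y</complete></record>
  <record><last_name>D</last_name><complete>Y</complete></record>
</records>
"""

records = ET.fromstring(DATA)

# Total number of rows.
total = len(records.findall("./record"))

# The predicate [complete='Y'] keeps only records whose <complete> child has text "Y".
done = len(records.findall("./record[complete='Y']"))

percent = 100 * done / total

print(done, total, percent)
```

In XPath 1.0 the same calculation would be count(record[complete = 'Y']) div count(record) * 100, evaluated with the parent of the record nodes as the context node. If the count is 0 in the Dataview, the path or node names in front of the predicate are the first thing to check, since the predicate itself behaves as expected here.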