DOM traversal with lxml into a graph database, passing on IDs to establish a complete tree - python-3.x

I'm inserting hierarchical data from a DOM tree into a graph database, but I'm not able to establish the complete relationship between the nodes. While looping I ended up truncating the trees.
Below is code that traverses the DOM nodes, inserting the tags and obtaining the last inserted ID. The problem I'm having is how to properly connect the trees by passing on the ID obtained from the previous iteration.
I had the same issue when I used recursion.
How do I loop and pass on the IDs so they are connected across all the node trees?
Consider the following HTML:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Document</title>
</head>
<body>
<ul class="menu">
<div class="itm">home</div>
<div class="itm">About us</div>
<div class="itm">Contact us</div>
</ul>
<div id="idone" class="classone">
<li class="item1">First</li>
<li class="item2">Second</li>
<li class="item3">Third</li>
<div id="innerone"><h1>This Title</h1></div>
<div id="innertwo"><h2>Subheads</h2></div>
</div>
<div id="second" class="below">
<div class="inner">
<h1>welcome</h1>
<h1>another</h1>
<h2>third</h2>
</div>
</div>
</body>
</html>
With the current Python code, I ended up with the truncated tree as illustrated. I omitted the graph database driver in order to focus on the Cypher, since most graph databases follow almost the same Cypher syntax.
from lxml import html

# dom_tree = html.parse("about.html")  # tree parsed earlier; driver setup omitted
for n in dom_tree.iter():
    cursor = Cypher("CREATE (t:node {tag: %s}) RETURN id(t)", params=(n.tag,))
    parent_id = cursor.fetchone()[0]  # get last inserted ID
    ag.commit()
    print(f"Parent: {n.tag}")
    for x in n.iterchildren():
        cursor = Cypher("CREATE (x:node {tag: %s}) RETURN id(x)", params=(x.tag,))
        xid = cursor.fetchone()[0]  # get last inserted ID
        ag.commit()
        print(f"--------{x.tag}")
        cx = Cypher("MATCH (p:node),(k:node) WHERE id(p) = %s AND id(k) = %s CREATE (p)-[:connect {name: p.name + '->' + k.name}]->(k)", params=(parent_id, xid,))

ElementTree provides a method, getiterator() (iter() in current versions), to iterate over every element in the tree.
As you visit each element, you can also get a list of its children with element.getchildren() (or by iterating over the element) and its parent with element.getparent().
from lxml import html

tree = html.parse("about.html")
for element in tree.getiterator():
    if (parent := element.getparent()) is not None:
        print(f"The element {element.tag} with text {element.text} and attributes {element.attrib} is the child of the element {parent.tag}")

In your case you create tag nodes for the parent time and again; you need to pass the identifier of the parent node down to each level.
Either you take a recursive approach, where you pass the id of the parent into the function you call recursively,
or you assign the elements in your DOM tree an id, e.g. based on level and sibling position, to identify them uniquely.
In neo4j you can also use
MATCH (n) WHERE id(n) = $id
MERGE (n)-[:CHILD]->(m:Node {tag: $tag})
to create the child within the context of the parent, but that doesn't work with duplicate tags.
I'd go with the recursive approach.
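A minimal sketch of that recursive approach (all names here are illustrative: the Cypher calls are replaced by an in-memory id counter and edge list, and the stdlib xml.etree parser stands in for lxml, which exposes the same iteration API):

```python
import xml.etree.ElementTree as ET

def insert_subtree(element, parent_id, next_id, edges):
    """Create a node for `element`, link it to its parent, then recurse."""
    # Stands in for: CREATE (t:node {tag: ...}) RETURN id(t)
    node_id = next_id[0]
    next_id[0] += 1
    if parent_id is not None:
        # Stands in for: MATCH ... CREATE (p)-[:connect]->(k)
        edges.append((parent_id, node_id))
    for child in element:
        insert_subtree(child, node_id, next_id, edges)
    return node_id

root = ET.fromstring(
    "<body><ul><li>a</li><li>b</li></ul><div><h1>t</h1></div></body>")
edges = []
insert_subtree(root, None, [0], edges)
print(edges)  # [(0, 1), (1, 2), (1, 3), (0, 4), (4, 5)]
```

Because each recursive call receives the id of the node created one level up, every element is linked to its real parent and no subtree is truncated.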

Related

beautifulsoup get value of attribute using get_attr method

I'd like to print all items in the list except those whose style attribute contains the following value: "text-align: center"
test = soup.find_all("p")
for x in test:
    if not x.has_attr('style'):
        print(x)
Essentially, return all items in the list where style is not equal to "text-align: center". Probably just a small error here, but is it possible to define the value of style in has_attr?
Just check if the specific style is present in the Tag's style. Style is not considered a multi-valued attribute and the entire string inside quotes is the value of style attribute. Using x.get("style",'') instead of x['style'] also handles cases in which there is no style attribute and avoids KeyError.
for x in test:
    if 'text-align: center' not in x.get("style", ''):
        print(x)
You can also use list comprehension to skip a few lines.
test = [x for x in soup.find_all("p") if 'text-align: center' not in x.get("style", '')]
print(test)
If you wanted to consider a different approach you could use the :not selector
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p style="color:green">This is the chosen paragraph.</p>
<p style="text-align: center">This is another paragraph.</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
items = [item.text for item in soup.select('p:not([style="text-align: center"])')]
print(items)

How to filter HTML nodes which have text in it from a html page

I am new to web scraping and have run into an issue.
I am using BeautifulSoup to scrape a webpage, and I want to get the nodes which have text in them.
I tried that using the get_text() method, like this:
soup = BeautifulSoup(open('FAQ3.html'), "html.parser")
body = soup.find('body')
for i in body:
    if type(i) != bs4.element.Comment and type(i) != bs4.element.NavigableString:
        if i.get_text():
            print(i)
but get_text() returns a node even when only its children have text in it.
sample html:
<div>
<div id="header">
<script src="./FAQ3_files/header-home.js"></script>
</div>
<div>
<div>
this node contain text
</div>
</div>
</div>
When checking the topmost div itself, it returns the whole node because the innermost div has text in it.
How do I iterate over all nodes and filter only the nodes which actually have text in them?
I used depth-first search for this; it solved my use case.
def get_text_bs4(soup, leaf):
    # Walk the tree depth-first, collecting tags whose own string is set.
    if soup.name is not None:
        if soup.string is not None and soup.name != 'script':
            if soup.text not in leaf:
                leaf[soup.text] = soup
        for child in soup.children:
            get_text_bs4(child, leaf)
    return leaf
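The same idea (keep only elements whose own text is set, skipping script tags) can be sketched with the stdlib parser; this is an illustrative stand-in for the bs4 version, not its API:

```python
import xml.etree.ElementTree as ET

html = """<div>
<div id="header"><script src="./FAQ3_files/header-home.js"></script></div>
<div><div>this node contain text</div></div>
</div>"""

root = ET.fromstring(html)
# An element "actually has text" when its own .text (not a descendant's)
# is non-empty after stripping whitespace.
with_text = [el for el in root.iter()
             if el.tag != 'script' and el.text and el.text.strip()]
print([el.text.strip() for el in with_text])  # ['this node contain text']
```

The outer divs are skipped because their own .text is only whitespace; the text they appear to contain belongs to the innermost div.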

Using XPath, select node without text sibling

I want to extract some HTML elements with python3 and the HTML parser provided by lxml.
Consider this HTML:
<!DOCTYPE html>
<html>
<body>
<span class="foo">
<span class="bar">bar</span>
foo
</span>
</body>
</html>
Consider this program:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))
In a browser, the query selector "span.bar" selects only the span element. This is what I desire. However, the above program produces:
[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo
It looks like my XPath does not actually behave like a query selector and the sibling text node is picked up next to the span element. How can I adjust the XPath to select only the bar element, but not the text "foo"?
Notice that the XML tree model in lxml (as well as in the standard module xml.etree) has the concept of a tail: a text node located after an element, i.e. its following-sibling text, is stored as the tail of that element. So your XPath correctly returns the span element, but according to the tree model, it has a tail which holds the text 'foo'.
As a workaround, assuming that you don't want to use the tree model further, simply clear the tail before printing:
>>> bars[0].tail = ''
>>> print(html.tostring(bars[0], encoding="unicode"))
<span class="bar">bar</span>
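The tail behaviour can be demonstrated with the stdlib xml.etree module, which shares this tree model (the markup below mirrors the question's example):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<span class="foo"><span class="bar">bar</span>foo</span>')
bar = root.find('.//span[@class="bar"]')
print(repr(bar.text))   # 'bar'  -- text inside the element
print(repr(bar.tail))   # 'foo'  -- sibling text stored on the element

bar.tail = None         # clear the tail before serializing
print(ET.tostring(bar, encoding="unicode"))  # <span class="bar">bar</span>
```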

Python 3 BeautifulSoup4 search for text in source page

I want to search for every '1' in the source code and print the location of each one, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string; I want to see the tag around that string.
Consider this context for example * :
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html)
You can use find_all(), passing True to indicate that you want only element nodes (instead of the child text nodes) and text="1" to indicate that the element you want must have text content equal to "1" (or any other text you want to search for):
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For the OP: for future questions, try to give a context just like the example above. That makes your question more concrete and easier to answer, as people don't have to create a context on their own, which may turn out not to be relevant to the situation you actually have.

Groovy XmlSlurper get value out of NodeChildren

I'm parsing HTML and trying to get the full, unparsed value out of one particular node.
HTML example:
<html>
<body>
<div>Hello <br> World <br> !</div>
<div><object width="420" height="315"></object></div>
</body>
</html>
Code:
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)
println htmlParsed.body.div[0]
However, it returns only the text in the case of the first node, and I get an empty string for the second node. Question: how can I retrieve the value of the first node such that I get:
Hello <br> World <br> !
This is what I used to get the content from the first div tag (omitting xml declaration and namespaces).
Groovy
#Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.*
def html = """<html>
<body>
<div>Hello <br> World <br> !</div>
<div><object width="420" height="315"></object></div>
</body>
</html>"""
def parser = new Parser()
parser.setFeature('http://xml.org/sax/features/namespaces',false)
def root = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(root.body.div[0]).toString()
Gives
<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>
N.B. Unless I'm mistaken, Tagsoup is adding the closing tags. If you literally want Hello <br> World <br> !, you might have to use a different library (maybe regex?).
I know it's including the div element in the output... is this a problem?
