How to select all nodes between certain headers? - python-3.x

Each <header> tag contains the title of a conference.
Each <ul> tag contains the links for that conference.
While crawling the website, I'm trying to associate each <header> tag with the links in its <ul> tags, but I don't know how to select only the <ul> tags that sit between two given <header> tags.
HTML:
<header>... 0 ... </header>
<ul class="publ-list">... 0 ...</ul>
<header>... 1 ... </header>
<ul class="publ-list">... 0 ...</ul>
<header>... 2 ... </header>
<ul class="publ-list">... 0 ...</ul>
<p>...</p>
<ul class="publ-list">... 1 ...</ul>
<header>... 3 ...</header>
<ul class="publ-list">... 0 ...</ul>
<ul class="publ-list">... 1 ...</ul>
<ul class="publ-list">... 2 ....</ul>
<ul class="publ-list">... 3 ....</ul>
<ul class="publ-list">... 4 ....</ul>
<header>... 4 ...</header>
Example:
<ul> tags between header[0] and header[1]:
<ul class="publ-list">... 0 ...</ul>
<ul> tags between header[2] and header[3]:
<ul class="publ-list">... 0 ...</ul>
<ul class="publ-list">... 1 ...</ul>
Some cases:
There can be more than one <ul> tag between two <header> tags.
Sometimes there is a <p> tag between <ul> tags.
All tags are siblings!
Every <ul> has the class "publ-list".
My code:
TITLE_OF_EDITIONS_SELECTIOR = 'header h2'
GROUP_OF_TYPES_OF_EDITION_SELECTOR = ".publ-list"
size_editions = len(response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR))
i = 0
while i < size_editions:
    # Get the title of conference
    title_edition_conference = response.css(TITLE_OF_EDITIONS_SELECTIOR)[i]
    # Get datas and links of <ul> tags "(.publ-list)"
    TYPES_OF_CONFERENCE = response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR)[i]
    TYPE = TYPES_OF_CONFERENCE.css('.entry')
    types_of_edition = {}
    size_type_editions = 0
    for type_of_conference in TYPE:
        title_type = type_of_conference.css('.data .title ::text').extract()
        link_type = type_of_conference.css('.publ ul .drop-down .body ul li a ::attr(href)').extract_first()
        types_of_edition[size_type_editions] = {
            "title": title_type,
            "link": link_type,
        }
        size_type_editions = size_type_editions + 1
    editions[i] = {
        "title_edition_conference": title_edition_conference,
        "types_of_edition": types_of_edition
    }
    i = i + 1
Problem with my code:
Sometimes there are several <ul> tags under one header.
Sometimes there is a <p> tag, which breaks my XPath so that I get only the preceding <ul> tags.
I got it working with jQuery in the Google Chrome console, for example:
"$($('header')[0]).nextUntil($('header')[1])"
But how can I select this using XPath or a CSS selector? Thank you!

The following combination of CSS selectors and a Python for loop can solve this task.
from parsel import Selector
html = """
<ul class="publ-list">p1</ul>
<header>h1</header>
<ul class="publ-list">p2</ul>
<header>h2</header>
<ul class="publ-list">p3</ul>
<header>h3</header>
<ul class="publ-list">p4</ul>
<p>p_tag_1</p>
<ul class="publ-list">p5</ul>
<header>h4</header>
<ul class="publ-list">p6</ul>
<ul class="publ-list">p7</ul>
<header>h5</header>
<ul class="publ-list">p8</ul>
"""
response = Selector(text=html)
tags = response.css("header, ul")
output = {}
key = False
for t in tags:
    if key and "<ul" in t.css("*").extract_first():
        output[key].append(t.css("::text").extract_first())
    elif "<header>" in t.css("*").extract_first():
        key = t.css("::text").extract_first()
        if key not in output.keys():
            output[key]=[]
    else:
        pass
print(output)
Output is:
{'h1': ['p2'], 'h2': ['p3'], 'h3': ['p4', 'p5'], 'h4': ['p6', 'p7'], 'h5': ['p8']}
The CSS selector in tags = response.css("header, ul") returns the <header> and <ul> tags in the same order as they appear in the HTML.
After that, we can iterate over the returned tags with a for loop and collect the required data.
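The loop above decides what each node is by searching its serialized markup for "<ul" or "<header>". A minimal sketch of the same grouping that compares tag names instead is given here. It uses the standard library's xml.etree.ElementTree rather than parsel so that it runs without extra packages, which assumes the fragment is well-formed XML. The <root> wrapper, the sample markup, and the function name group_lists_by_header are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, wrapped in <root> so it parses as well-formed XML.
HTML = """
<root>
<ul class="publ-list">p1</ul>
<header>h1</header>
<ul class="publ-list">p2</ul>
<header>h2</header>
<ul class="publ-list">p3</ul>
<p>p_tag_1</p>
<ul class="publ-list">p4</ul>
<header>h3</header>
</root>
"""


def group_lists_by_header(parent):
    """Map each <header> text to the texts of the <ul> siblings that follow it,
    stopping at the next <header> (similar in spirit to jQuery's nextUntil)."""
    groups = {}
    current = None  # no header seen yet
    for child in parent:  # direct children, in document order
        if child.tag == "header":
            current = child.text
            groups.setdefault(current, [])
        elif child.tag == "ul" and current is not None:
            groups[current].append(child.text)
        # Any other tag (for example <p>) is skipped without ending the group.
    return groups


groups = group_lists_by_header(ET.fromstring(HTML.strip()))
print(groups)
```

A <ul> that appears before the first <header> is ignored, and a <p> between two lists does not split the group, which matches the cases listed in the question.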

Try using the following-sibling axis, like this:
>>> txt = """<header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <p>...</p>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <ul class="publ-list">...</ul>
... <header>..</header>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.xpath('//header/following-sibling::*[not(self::header)]').extract()
[u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<p>...</p>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>']
So with //header/following-sibling::*[not(self::header)] we select every following sibling of a header except the headers themselves.

This may be what you're looking for.
html = """
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<p>...</p>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
"""
Note that I added a <ul> before the first and after the last <header>..</header>.
This expression
//ul[
    preceding-sibling::header
    and
    following-sibling::header
]
should select all the <ul> tags except the two I added before and after, and none of the <p> tags that may be in the way.

Related

CheerioJS get specific <li> where header text 'What I Want'

I'm trying to get the li elements whose header is 'What I Want'.
This is my code:
let wants = []
$$('li').each((wantIdx, wantElement) => {
  const want = $(wantElement).text()
  wants.push(want)
})
And this is the HTML I'm trying to parse:
<div class="side-list-panel">
<h4 class="panel-header">What I Want</h4>
<ul class="panel-items-list">
<li>
1
</li>
<li>
2
</li>
<li>
3
</li>
<li>
4
</li>
<li>
5
</li>
</ul>
</div>
<div class="side-list-panel">
<h4 class="panel-header">What I don't want</h4>
<ul class="panel-items-list">
<li>
a
</li>
<li>
b
</li>
<li>
c
</li>
<li>
d
</li>
<li>
e
</li>
</ul>
</div>
This code obviously gets every li element on the page. Is there any way to get only the li elements under the 'What I Want' panel-header?
You can get those with:
$('h4:contains("What I Want") + ul li').get().map(li => $(li).text())
You can try jQuery's :contains selector, if Cheerio supports it. Example: $('td:contains("male")')

ValueError: how to extract `<a href >` from a dataframe?

I want to get <a href> from this dataframe, but instead I get:
ValueError: Length of values does not match length of index.
This is what the DataFrame looks like
df.head(7)
0 <ul class="toc"> <li class="first"><a href="#d...
1 <ul class="toc"> <li><a href="#d17e906">1. LEE...
2 <ul class="toc"> <li><a href="#d17e974">2.1 Be...
3 <ul class="toc"> <li><a href="#d17e6333">3.1. ...
4 <ul class="toc"> <li><a href="#d17e23490">4.1 ...
5 <ul class="toc"> <li><a href="#d17e27196">5.1 ...
6 <ul class="toc"> <li><a href="#d17e54643">Bijl...
7 <ul class="toc"> <li><a href="#d17e55852">31. ...
This is the code I am using.
import pandas as pd
import requests
from bs4 import BeautifulSoup
df = pd.read_html(url)[0]
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
ul_toc = soup.find_all('ul', class_='toc')
links = []
for a_tag in ul_toc:
    extract = a_tag.find_all('li')
    for each in extract:
        try:
            link = each.find('a')['href']
            links.append(link)
        except:
            pass
df['Link'] = links
I am not sure what I am missing in the code above.
I found the solution myself:
toc_class = soup.find_all('ul', class_='toc')
df = pd.DataFrame(data=toc_class)
links = []
for a_tag in toc_class:
    extract = a_tag.find('li')
    for each in extract:
        try:
            link = each.get('href')
            links.append(link)
        except:
            pass
df['Link'] = links
df
and this is the output:
0 Link
0 <ul class="toc"> <li class="first"><a href="#d... #d17e58
1 <ul class="toc"> <li><a href="#d17e906">1. LEE... #d17e906
2 <ul class="toc"> <li><a href="#d17e974">2.1 Be... #d17e974
3 <ul class="toc"> <li><a href="#d17e6333">3.1. ... #d17e6333
4 <ul class="toc"> <li><a href="#d17e23490">4.1 ... #d17e23490
5 <ul class="toc"> <li><a href="#d17e27196">5.1 ... #d17e27196
6 <ul class="toc"> <li><a href="#d17e54643">Bijl... #d17e54643
7 <ul class="toc"> <li><a href="#d17e55852">31. ... #d17e55852
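The first attempt presumably fails because it appends a link for every <li>, so links ends up longer than the DataFrame, while the second keeps one link per <ul class="toc">. Below is a minimal, dependency-free sketch of that one-link-per-list idea. It uses the standard library's html.parser instead of BeautifulSoup. The class name FirstLinkPerList and the sample markup are hypothetical, and it assumes no <ul> is nested inside a toc list.

```python
from html.parser import HTMLParser


class FirstLinkPerList(HTMLParser):
    """Collect the first href found inside each <ul class="toc">."""

    def __init__(self):
        super().__init__()
        self.links = []        # exactly one entry per <ul class="toc">
        self._in_toc = False   # currently inside a toc list?
        self._found = False    # already stored a link for this list?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and attrs.get("class") == "toc":
            self._in_toc, self._found = True, False
        elif tag == "a" and self._in_toc and not self._found:
            self.links.append(attrs.get("href"))
            self._found = True

    def handle_endtag(self, tag):
        if tag == "ul" and self._in_toc:
            if not self._found:
                self.links.append(None)  # keep one slot per list
            self._in_toc = False


# Hypothetical sample: three toc lists with two, one, and zero links.
SAMPLE = (
    '<ul class="toc"><li><a href="#a1">1</a></li><li><a href="#a2">2</a></li></ul>'
    '<ul class="toc"><li><a href="#b1">3</a></li></ul>'
    '<ul class="toc"><li>no link</li></ul>'
)

parser = FirstLinkPerList()
parser.feed(SAMPLE)
parser.close()
links = parser.links
print(links)
```

Because every list contributes exactly one entry, the resulting list has the same length as a DataFrame built with one row per <ul>, so assigning it as a column should not raise the length error.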

How to add menu link wrapper in drupal 7?

I have a menu in which a menu item looks like this:
<li>
LINK TITLE
</li>
And I want this:
<li>
<a href="aaa">
<div class="custom">LINK TITLE</div>
</a>
</li>
Or this:
<li>
<div class="custom">
LINK TITLE
</div>
</li>
How can I do this?
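This works (MYTHEME is a placeholder for your theme's machine name; the function goes in the theme's template.php):
function MYTHEME_menu_link(array $variables) {
  // Allow HTML in the link title so the <div> is not escaped.
  $variables['element']['#localized_options']['html'] = true;
  $variables['element']['#title'] = '<div class="custom">' . $variables['element']['#original_link']['link_title'] . '</div>';
  // Delegate the actual rendering to the core implementation.
  return theme_menu_link($variables);
}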

Unable to extract an html fragment using cheerio

I am using cheerio to perform some HTML manipulation on a Node.js server. I have an HTML string like this:
var htmlString =" <ol>
<li>
<p>item1</p>
</li>
<li>
<p>item2</p>
</li>
<li>
<p>item 3</p>
</li>
<li>
<p>item 4</p>
</li>
</ol>
<p>First paragraph</p>
<p>second paragraph</p>
<p>Third paragraph</p>
"
var $ = cheerio.load(htmlString);
var dummy = $("<div></div>")
var item = dummy.append($("*").slice(0,3).clone()).html();
The output returned is
<ol>
<li>
<p>item1</p>
</li>
<li>
<p>item2</p>
</li>
<li>
<p>item 3</p>
</li>
<li>
<p>item 4</p>
</li>
</ol>
<li>item1</li>
<p>item1</p>
The output that I expect is the ordered list followed by paragraph 1 and paragraph 2.
Am I doing something wrong, or is this a bug in cheerio?
After fiddling with the code for the entire day, I finally found the solution. Apparently I was loading the HTML fragment incorrectly: $("*") matches every element in document order, nested ones included, so slice(0,3) returned the <ol>, its first <li>, and that item's <p>. This worked for me:
var $ = cheerio.load();
var dummy = $("<div></div>")
var item = dummy.append($(htmlString).slice(0,3)).html();

JQuery mobile - content navigation collapse on a button on portrait

We are developing an application with a layout close to the jQuery Mobile examples (table of contents on the left, content on the right), but we want the same behavior as Sencha mobile: in portrait, the table of contents collapses into a navigation button.
Is it possible to do this with jQuery Mobile?
I have created a sample jQuery Mobile application that works like this: in landscape mode, a split-view layout is shown; in portrait mode, navigation is done via a button in the header. To illustrate this in a desktop browser, I used a width of 500px as the breakpoint: if the width is greater than 500px, the split view is shown; otherwise, the button in the header is shown.
This is the source code:
<!DOCTYPE html>
<html>
<head>
<title>Page</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="http://code.jquery.com/mobile/1.0/jquery.mobile-1.0.min.css" />
<link rel="stylesheet" href="http://jquerymobile.com/test/docs/_assets/css/jqm-docs.css"/>
<script type="text/javascript" src="http://code.jquery.com/jquery-1.6.4.min.js"></script>
<script>
function showNavList() {
$(".navdiv").toggle();
}
$(".page").live("pagebeforeshow", function() {
$(".navdiv").hide();
});
</script>
<script type="text/javascript" src="http://code.jquery.com/mobile/1.0/jquery.mobile-1.0.min.js"></script>
<style>
.content-secondary{
margin: 0px !important;
padding:0px !important;
}
/*refer http://css-tricks.com/snippets/css/media-queries-for-standard-devices/ */
/* Smartphones (landscape) ----------- */
@media all and (min-width: 501px){/*For demo in desktop browsers,gave 501.Should be 321px.Refer above link*/
.headerNav{
display:none !important;
}
.content-secondary{
display: block;
}
.navdiv{
display:none !important;
}
}
/* Smartphones (portrait) ----------- */
@media all and (max-width: 500px){/*320px*/
.headerNav{
display:block !important;
}
.content-secondary{
display: none;
}
}
</style>
</head>
<body>
<div data-role="page" class="page" id="page1">
<div class="navdiv" style="width:150px;top:38px;left:5px;position:absolute;z-index:1000;display:none">
<ul data-role="listview">
<ul data-role="listview" data-theme="c">
<li class="ui-btn-active" data-icon="false">
Page 1
</li>
<li data-icon="false">
Page 2
</li>
<li data-icon="false">
Page 3
</li>
</ul>
</ul>
</div>
<div data-role="header">
<h1>Page 1</h1>
Navigation
</div><!-- /header -->
<div data-role="content">
<div class="content-primary">
Content1
</div>
<div class="content-secondary">
<ul data-role="listview" data-theme="c">
<li class="ui-btn-active" data-icon="false">
Page 1
</li>
<li>
Page 2
</li>
<li>
Page 3
</li>
</ul>
</div>
</div><!-- /content -->
</div><!-- /page -->
<div data-role="page" class="page" id="page2">
<div class="navdiv" style="width:150px;top:38px;left:5px;position:absolute;z-index:1000;display:none">
<ul data-role="listview">
<ul data-role="listview" data-theme="c">
<li data-icon="false">
Page 1
</li>
<li data-icon="false" class="ui-btn-active">
Page 2
</li>
<li data-icon="false">
Page 3
</li>
</ul>
</ul>
</div>
<div data-role="header">
<h1>Page 2</h1>
Navigation
</div><!-- /header -->
<div data-role="content">
<div class="content-primary">
Content2
</div>
<div class="content-secondary">
<ul data-role="listview" data-theme="c">
<li data-icon="false">
Page 1
</li>
<li class="ui-btn-active" data-icon="false" >
Page 2
</li>
<li data-icon="false">
Page 3
</li>
</ul>
</div>
</div><!-- /content -->
</div><!-- /page -->
<div data-role="page" class="page" id="page3">
<div class="navdiv" style="width:150px;top:38px;left:5px;position:absolute;z-index:1000;display:none">
<ul data-role="listview">
<ul data-role="listview" data-theme="c">
<li data-icon="false">
Page 1
</li>
<li data-icon="false">
Page 2
</li>
<li data-icon="false" class="ui-btn-active">
Page 3
</li>
</ul>
</ul>
</div>
<div data-role="header">
<h1>Page 3</h1>
Navigation
</div><!-- /header -->
<div data-role="content">
<div class="content-primary">
Content3
</div>
<div class="content-secondary">
<ul data-role="listview" data-theme="c">
<li>
Page 1
</li>
<li>
Page 2
</li>
<li class="ui-btn-active">
Page 3
</li>
</ul>
</div>
</div><!-- /content -->
</div><!-- /page -->
</body>
</html>
This is not a foolproof application, just a rough draft to illustrate how this feature can be done. There is still a lot to be done to make it work perfectly.
To make it work I used media queries. With them you can selectively hide or show parts of the layout depending on the browser width (and thus the orientation of the device).
P.S. I used jqm-docs.css for this example. That stylesheet contains other media queries targeting other widths, so there might be some weird layout issues when you test this code. Please modify that CSS to remove the unwanted media queries.
Let me know if it helps.
