I want to extract the <a href> value from each row of this DataFrame, but instead I get:
ValueError: Length of values does not match length of index.
This is what the DataFrame looks like:
df.head(7)
0 <ul class="toc"> <li class="first"><a href="#d...
1 <ul class="toc"> <li><a href="#d17e906">1. LEE...
2 <ul class="toc"> <li><a href="#d17e974">2.1 Be...
3 <ul class="toc"> <li><a href="#d17e6333">3.1. ...
4 <ul class="toc"> <li><a href="#d17e23490">4.1 ...
5 <ul class="toc"> <li><a href="#d17e27196">5.1 ...
6 <ul class="toc"> <li><a href="#d17e54643">Bijl...
7 <ul class="toc"> <li><a href="#d17e55852">31. ...
This is the code I am using:
import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_html(url)[0]
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
ul_toc = soup.find_all('ul', class_='toc')
links = []
for a_tag in ul_toc:
    extract = a_tag.find_all('li')
    for each in extract:
        try:
            link = each.find('a')['href']
            links.append(link)
        except:
            pass
df['Link'] = links
I am not sure what I am missing in the code above.
I found the solution myself:
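toc_class = soup.find_all('ul', class_='toc')
df = pd.DataFrame(data=toc_class)
links = []
for a_tag in toc_class:
    # Only the first <li> of each <ul>, so there is one link per DataFrame row
    extract = a_tag.find('li')
    for each in extract:
        try:
            link = each.get('href')
            links.append(link)
        except AttributeError:
            # Text nodes inside the <li> have no .get() method
            pass
df['Link'] = links
df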
and this is the output:
0 Link
0 <ul class="toc"> <li class="first"><a href="#d... #d17e58
1 <ul class="toc"> <li><a href="#d17e906">1. LEE... #d17e906
2 <ul class="toc"> <li><a href="#d17e974">2.1 Be... #d17e974
3 <ul class="toc"> <li><a href="#d17e6333">3.1. ... #d17e6333
4 <ul class="toc"> <li><a href="#d17e23490">4.1 ... #d17e23490
5 <ul class="toc"> <li><a href="#d17e27196">5.1 ... #d17e27196
6 <ul class="toc"> <li><a href="#d17e54643">Bijl... #d17e54643
7 <ul class="toc"> <li><a href="#d17e55852">31. ... #d17e55852
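The ValueError in the question comes from a length mismatch: the nested loop appends one link for every <li>, while the DataFrame has one row per <ul class="toc">. Below is a minimal sketch of collecting exactly one link per list, assuming BeautifulSoup is installed. The HTML snippet and the names all_links and first_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two <ul class="toc"> lists holding three <li> items in total.
html = """
<ul class="toc"><li class="first"><a href="#d17e58">Intro</a></li><li><a href="#d17e60">Sub</a></li></ul>
<ul class="toc"><li><a href="#d17e906">1. Chapter</a></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")
toc_lists = soup.find_all("ul", class_="toc")

# One entry per <li>: longer than the number of lists, hence the length mismatch.
all_links = [a["href"] for ul in toc_lists for a in ul.find_all("a", href=True)]

# One entry per <ul>: take the first link, or None when a list has no link.
first_links = []
for ul in toc_lists:
    a = ul.find("a", href=True)
    first_links.append(a["href"] if a else None)

print(len(toc_lists), len(all_links), len(first_links))
```

Because first_links has one entry per <ul>, it has the right length to be assigned as a column of a DataFrame that has one row per <ul>.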
How do I have to configure Wayfinder in MODX to get Bootstrap 5 output, specifically for the dropdown submenu? This is the markup I want to produce:
<div class="collapse navbar-collapse" id="navbarCollapse">
<ul class="navbar-nav me-auto mb-2 mb-md-0">
<li class="nav-item active">
<a class="nav-link active" aria-current="page" href="#">Home</a>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" role="button" data-bs-toggle="dropdown" aria-expanded="false" href="#">Project</a>
<ul class="dropdown-menu">
<li><a class="dropdown-item" href="#">Action</a></li>
<li><a class="dropdown-item" href="#">Another action</a></li>
</ul>
</li>
<li class="nav-item">
<a class="nav-link" href="#">How-To</a>
</li>
<li class="nav-item">
<a class="nav-link" href="#">About</a>
</li>
<li class="nav-item">
<a class="nav-link" href="#">Contact</a>
</li>
</ul>
</div>
I think it has to do with innerTpl and innerRowTpl. The normal (level 1) menu works, just not the submenu.
How do I have to configure that?
The Wayfinder call:
[[Wayfinder? &startId=`0` &level=`2` &outerClass=`navbar-nav me-auto mb-2 mb-md-0` &rowTpl=`tpl_navigation-menu` &rowClass=`nav-item` &innerTpl=`innerTpl` &innerRowTpl=`innerRowTpl`]]
&rowTpl:
<li[[+wf.id]][[+wf.classes]]><a href="[[+wf.link]]" class="nav-link" title="[[+wf.title]]" [[+wf.attributes]]>[[+wf.linktext]]</a>[[+wf.wrapper]]</li>
&innerTpl and &innerRowTpl are still blank.
Does anyone have an idea?
Wayfinder is often confusing here, unfortunately. Please take a look at the diagram in the snippet documentation; it will help you understand which chunks are used to build the child elements of the menu.
OK,
here is the final solution (for those who struggle with the Wayfinder stuff as I did):
[[Wayfinder?
&startId=`0`
&level=`2`
&outerClass=`navbar-nav me-auto mb-2 mb-md-0`
&innerClass=`dropdown-menu`
&rowTpl=`tpl_row`
&parentRowTpl=`tpl_parentrow`
&innerRowTpl=`tpl_innerrow`
]]
Chunk for tpl_row:
<li class="nav-item [[+wf.classnames]]">
<a class="nav-link" href="[[+wf.link]]">[[+wf.linktext]]</a>
[[+wf.wrapper]]
</li>
Chunk for tpl_parentrow:
<li class="nav-item dropdown [[+wf.classnames]]">
<a class="nav-link dropdown-toggle" role="button" data-bs-toggle="dropdown" aria-expanded="false" href="[[+wf.link]]">[[+wf.linktext]]</a>
[[+wf.wrapper]]
</li>
Chunk for tpl_innerrow:
<li class="[[+wf.classnames]]">
<a class="dropdown-item" href="[[+wf.link]]">[[+wf.linktext]]</a>
[[+wf.wrapper]]
</li>
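The sketch below is not Wayfinder code. It is a small Python model, based on my reading of the solution above, of which chunk ends up rendering which row in a two-level menu. The Page class, pick_chunk and walk are hypothetical names, and the selection rules are a simplification that assumes &level=`2`.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    title: str
    children: list = field(default_factory=list)


MAX_LEVEL = 2  # mirrors &level=`2` in the call above (assumption of this model)


def pick_chunk(page, depth):
    """Return the chunk name this simplified model uses for one menu row."""
    if page.children and depth < MAX_LEVEL:
        return "tpl_parentrow"  # row that opens a dropdown
    if depth > 1:
        return "tpl_innerrow"   # row inside the dropdown
    return "tpl_row"            # plain top-level row


def walk(pages, depth=1):
    """Yield (title, chunk) pairs in menu order."""
    for page in pages:
        yield page.title, pick_chunk(page, depth)
        yield from walk(page.children, depth + 1)


menu = [
    Page("Home"),
    Page("Project", [Page("Action"), Page("Another action")]),
    Page("How-To"),
]
rows = list(walk(menu))
for title, chunk in rows:
    print(f"{title}: {chunk}")
```

Read this way, the missing piece in the original call was a separate chunk for the parent row, which carries the dropdown-toggle markup.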
I want to find the answer with data-correct="1". Here is my source text:
<ul class="list-group">
<li class="list-group-item list-ques"><b>1.</b> What the capital of Bangladesh?
</li>
<li class="answer" data-qid="1" data-ans="a" data-correct="0" name="ans_4665" class="rd_ques_ans">
a. Chittagong
</li>
<li class="answer" data-qid="1" data-ans="b" data-correct="0" name="ans_4665" class="rd_ques_ans">
b.Khulna
</li>
<li class="answer" data-qid="1" data-ans="c" data-correct="0" name="ans_4665" class="rd_ques_ans">
c.Satkhira
</li>
<li class="answer" data-qid="1" data-correct="1" # name="ans_4665" class="rd_ques_ans">
d.Dhaka
</li>
</ul>
My code:
ans_block = soup.find_all('ul', attrs = {'class': 'list-group'})
my_answer = q.find('li', attrs = {'class':'answer'}).find(re.compile('data-correct="1"')).string
It returns None instead of d.Dhaka.
Your answer will be appreciated.
Happy coding :)
There is no need for a regular expression. It is more convenient to search for the li tag that has the CSS class answer and the data-correct attribute set to '1':
my_answer = q.find('li', attrs = {'class':'answer', 'data-correct' : '1'}).text.strip()
I used your data as the HTML. You can find the li tag by passing additional attrs to the find method, and then get its text:
html="""<ul class="list-group">
<li class="list-group-item list-ques"><b>1.</b> What the capital of Bangladesh?
</li>
<li class="answer" data-qid="1" data-ans="a" data-correct="0" name="ans_4665" class="rd_ques_ans">
a. Chittagong
</li>
<li class="answer" data-qid="1" data-ans="b" data-correct="0" name="ans_4665" class="rd_ques_ans">
b.Khulna
</li>
<li class="answer" data-qid="1" data-ans="c" data-correct="0" name="ans_4665" class="rd_ques_ans">
c.Satkhira
</li>
<li class="answer" data-qid="1" data-correct="1" # name="ans_4665" class="rd_ques_ans">
d.Dhaka
</li>
</ul>"""
soup= BeautifulSoup(html, 'html.parser')
main=soup.find("ul",class_="list-group")
main.find("li",attrs={"class":"rd_ques_ans","data-correct":"1"}).get_text(strip=True)
Output:
'd.Dhaka'
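As far as I can tell, the original attempt returned nothing because find() treats a compiled pattern passed as its first argument as a filter on tag names, not on attributes. The same lookup can also be written as a CSS attribute selector; here is a minimal sketch, assuming BeautifulSoup with its bundled CSS selector support. The markup is a cleaned-up, shortened copy of the question's HTML with a single class attribute per tag. The sample in the question repeats the class attribute, and which value survives can depend on the parser, so the sketch filters on data-correct only.

```python
from bs4 import BeautifulSoup

# Cleaned-up sample (assumption: one class attribute per <li>).
html = """
<ul class="list-group">
  <li class="list-group-item list-ques"><b>1.</b> What the capital of Bangladesh?</li>
  <li class="answer" data-qid="1" data-ans="a" data-correct="0">a. Chittagong</li>
  <li class="answer" data-qid="1" data-ans="b" data-correct="0">b.Khulna</li>
  <li class="answer" data-qid="1" data-ans="d" data-correct="1">d.Dhaka</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# [data-correct="1"] matches on the attribute value, independent of the class.
correct = soup.select_one('ul.list-group li[data-correct="1"]')
answer = correct.get_text(strip=True) if correct else None
print(answer)
```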
I'm trying to get the li elements under the header 'What I Want'.
This is my code:
let wants = []
$('li').each((wantIdx, wantElement) => {
  const want = $(wantElement).text()
  wants.push(want)
})
and this is the HTML I'm trying to parse:
<div class="side-list-panel">
<h4 class="panel-header">What I Want</h4>
<ul class="panel-items-list">
<li>
1
</li>
<li>
2
</li>
<li>
3
</li>
<li>
4
</li>
<li>
5
</li>
</ul>
</div>
<div class="side-list-panel">
<h4 class="panel-header">What I don't want</h4>
<ul class="panel-items-list">
<li>
a
</li>
<li>
b
</li>
<li>
c
</li>
<li>
d
</li>
<li>
e
</li>
</ul>
</div>
This code obviously gets me every single li element on the page. Is there any way I can get only the li elements under the 'What I Want' panel-header?
You can get those with:
$('h4:contains("What I Want") + ul li').get().map(li => $(li).text())
You can try jQuery's :contains selector, if Cheerio supports it. Example: $('td:contains("male")')
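For readers working in Python rather than Node, the same "header followed by its list" lookup can be sketched with BeautifulSoup. This is an illustration only, not part of the Cheerio answers above; the shortened HTML and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Shortened copy of the two panels from the question.
html = """
<div class="side-list-panel">
<h4 class="panel-header">What I Want</h4>
<ul class="panel-items-list"><li> 1 </li><li> 2 </li><li> 3 </li></ul>
</div>
<div class="side-list-panel">
<h4 class="panel-header">What I don't want</h4>
<ul class="panel-items-list"><li> a </li><li> b </li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the header by its exact text, then move to the <ul> that follows it.
header = soup.find("h4", class_="panel-header", string="What I Want")
wanted_list = header.find_next_sibling("ul") if header else None
wants = [li.get_text(strip=True) for li in wanted_list.find_all("li")] if wanted_list else []
print(wants)
```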
Each <header> tag contains the title of a conference.
Each <ul> tag contains the links for that conference.
When crawling the website, I'm trying to associate each <header> tag with the links in its <ul> tags, but I don't know how to select only the <ul> tags that are siblings between two given <header> tags.
HTML:
<header>... 0 ... </header>
<ul class="publ-list">... 0 ...</ul>
<header>... 1 ... </header>
<ul class="publ-list">... 0 ...</ul>
<header>... 2 ... </header>
<ul class="publ-list">... 0 ...</ul>
<p>...</p>
<ul class="publ-list">... 1 ...</ul>
<header>... 3 ...</header>
<ul class="publ-list">... 0 ...</ul>
<ul class="publ-list">... 1 ...</ul>
<ul class="publ-list">... 2 ....</ul>
<ul class="publ-list">... 3 ....</ul>
<ul class="publ-list">... 4 ....</ul>
<header>... 4 ...</header>
Example:
The <ul> tags that are siblings between header[0] and header[1]:
<ul class="publ-list">... 0 ...</ul>
The <ul> tags that are siblings between header[2] and header[3]:
<ul class="publ-list">... 0 ...</ul>
<ul class="publ-list">... 1 ...</ul>
Some cases:
There can be more than one <ul> tag between two <header> tags.
Sometimes there is a <p> tag between <ul> tags.
All tags are siblings!
Every <ul> has the class "publ-list".
My code:
TITLE_OF_EDITIONS_SELECTIOR = 'header h2'
GROUP_OF_TYPES_OF_EDITION_SELECTOR = ".publ-list"
size_editions = len(response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR))
i = 0
while i < size_editions:
    # Get the title of conference
    title_edition_conference = response.css(TITLE_OF_EDITIONS_SELECTIOR)[i]
    # Get datas and links of <ul> tags "(.publ-list)"
    TYPES_OF_CONFERENCE = response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR)[i]
    TYPE = TYPES_OF_CONFERENCE.css('.entry')
    types_of_edition = {}
    size_type_editions = 0
    for type_of_conference in TYPE:
        title_type = type_of_conference.css('.data .title ::text').extract()
        link_type = type_of_conference.css('.publ ul .drop-down .body ul li a ::attr(href)').extract_first()
        types_of_edition[size_type_editions] = {
            "title": title_type,
            "link": link_type,
        }
        size_type_editions = size_type_editions + 1
    editions[i] = {
        "title_edition_conference": title_edition_conference,
        "types_of_edition": types_of_edition
    }
    i = i + 1
Problems with my code:
Sometimes there are many <ul> tags.
Sometimes there is a <p> tag that breaks my XPath, so I get only the preceding <ul> tags.
I got it working with jQuery in the Google Chrome console, for example:
"$($('header')[0]).nextUntil($('header')[1])"
But how can I select this using XPath or a CSS selector? Thank you!
The following combination of CSS selectors and a Python for loop can solve this task.
from parsel import Selector
html = """
<ul class="publ-list">p1</ul>
<header>h1</header>
<ul class="publ-list">p2</ul>
<header>h2</header>
<ul class="publ-list">p3</ul>
<header>h3</header>
<ul class="publ-list">p4</ul>
<p>p_tag_1</p>
<ul class="publ-list">p5</ul>
<header>h4</header>
<ul class="publ-list">p6</ul>
<ul class="publ-list">p7</ul>
<header>h5</header>
<ul class="publ-list">p8</ul>
"""
response = Selector(text=html)
tags = response.css("header, ul")
output = {}
key = False
for t in tags:
    if key and "<ul" in t.css("*").extract_first():
        output[key].append(t.css("::text").extract_first())
    elif "<header>" in t.css("*").extract_first():
        key = t.css("::text").extract_first()
        if key not in output.keys():
            output[key]=[]
    else:
        pass
print(output)
Output is:
{'h1': ['p2'], 'h2': ['p3'], 'h3': ['p4', 'p5'], 'h4': ['p6', 'p7'], 'h5': ['p8']}
The CSS selector in tags = response.css("header, ul") returns a list of <header> and <ul> tags in the same order as they appear in the HTML.
After that we can iterate over the returned tags with a for loop and pick out the required data.
Try using the following-sibling axis, like this:
>>> txt = """<header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <p>...</p>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <ul class="publ-list">...</ul>
... <header>..</header>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.xpath('//header/following-sibling::*[not(self::header)]').extract()
[u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<p>...</p>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>']
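So //header/following-sibling::*[not(self::header)] selects every following sibling of a header that is not itself a header. As the output shows, the result is one flat list; it is not grouped by header.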
This may be what you're looking for.
html = """
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<p>...</p>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
"""
Note that I added a <ul> before the first and after the last <header>..</header> sets.
This expression
//ul[
preceding-sibling::header
and
following-sibling::header
]
should select all the <ul> tags, except those I added before and after, and none of the <p> tags which may be in the way.
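To group the <ul> tags per header, which is what jQuery's nextUntil does in the question, one common XPath idiom is to count the headers that precede each <ul>. Below is a minimal sketch with lxml, assuming that every <header> on the page is a sibling of the lists, as the question states. The sample markup and the groups dictionary are made up for illustration; the same relative expression should also work with a Scrapy selector's xpath() method, with the number written into the string.

```python
from lxml import html as lxml_html

# Hypothetical sample with the same shape as the page in the question.
doc = lxml_html.fromstring(
    "<div>"
    "<header>h0</header>"
    '<ul class="publ-list">a</ul>'
    "<header>h1</header>"
    '<ul class="publ-list">b</ul>'
    "<p>note</p>"
    '<ul class="publ-list">c</ul>'
    "<header>h2</header>"
    "</div>"
)

groups = {}
for position, header in enumerate(doc.xpath("//header"), start=1):
    # A following <ul> belongs to this header when exactly `position`
    # headers precede it, i.e. no later header sits in between.
    uls = header.xpath(
        "following-sibling::ul[count(preceding-sibling::header) = $n]",
        n=position,
    )
    groups[header.text_content()] = [ul.text_content() for ul in uls]

print(groups)
```

The <p> tag does not disturb the grouping, because the expression only selects <ul> siblings and only counts <header> siblings.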
What is the best way to make any kind of article or blog content available offline, or readable the way Pocket does it, using Node.js?
How do I download a whole page, including its resources, for offline use?
The page at https://nodejs.org/api/ links to all of its child pages:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title> Node.js v5.3.0 Manual & Documentation</title>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato:400,700,400italic">
<link rel="stylesheet" href="assets/style.css">
<link rel="stylesheet" href="assets/sh.css">
<link rel="canonical" href="https://nodejs.org/api/index.html">
</head>
<body class="alt apidoc" id="api-section-index">
<div id="content" class="clearfix">
<div id="column2" class="interior">
<div id="intro" class="interior">
<a href="/" title="Go back to the home page">
Node.js (1)
</a>
</div>
<ul>
<li><a class="nav-documentation" href="documentation.html">About these Docs</a></li>
<li><a class="nav-synopsis" href="synopsis.html">Synopsis</a> </li>
<li><a class="nav-assert" href="assert.html">Assertion Testing</a></li>
<li><a class="nav-buffer" href="buffer.html">Buffer</a></li>
<li><a class="nav-addons" href="addons.html">C/C++ Addons</a></li>
<li><a class="nav-child_process" href="child_process.html">Child Processes</a></li>
<li><a class="nav-cluster" href="cluster.html">Cluster</a></li>
<li><a class="nav-console" href="console.html">Console</a></li>
<li><a class="nav-crypto" href="crypto.html">Crypto</a></li>
<li><a class="nav-debugger" href="debugger.html">Debugger</a> </li>
<li><a class="nav-dns" href="dns.html">DNS</a></li>
<li><a class="nav-domain" href="domain.html">Domain</a></li>
<li><a class="nav-errors" href="errors.html">Errors</a></li>
<li><a class="nav-events" href="events.html">Events</a></li>
<li><a class="nav-fs" href="fs.html">File System</a></li>
<li><a class="nav-globals" href="globals.html">Globals</a></li>
<li><a class="nav-http" href="http.html">HTTP</a></li>
<li><a class="nav-https" href="https.html">HTTPS</a></li>
<li><a class="nav-modules" href="modules.html">Modules</a></li>
<li><a class="nav-net" href="net.html">Net</a></li>
<li><a class="nav-os" href="os.html">OS</a></li>
<li><a class="nav-path" href="path.html">Path</a></li>
<li><a class="nav-process" href="process.html">Process</a></li>
<li><a class="nav-punycode" href="punycode.html">Punycode</a></li>
<li><a class="nav-querystring" href="querystring.html">Query Strings</a></li>
<li><a class="nav-readline" href="readline.html">Readline</a></li>
<li><a class="nav-repl" href="repl.html">REPL</a></li>
<li><a class="nav-stream" href="stream.html">Stream</a></li>
<li><a class="nav-string_decoder" href="string_decoder.html">String Decoder</a></li>
<li><a class="nav-timers" href="timers.html">Timers</a></li>
<li><a class="nav-tls" href="tls.html">TLS/SSL</a></li>
<li><a class="nav-tty" href="tty.html">TTY</a></li>
<li><a class="nav-dgram" href="dgram.html">UDP/Datagram</a></li>
<li><a class="nav-url" href="url.html">URL</a></li>
<li><a class="nav-util" href="util.html">Utilities</a></li>
<li><a class="nav-v8" href="v8.html">V8</a></li>
<li><a class="nav-vm" href="vm.html">VM</a></li>
<li><a class="nav-zlib" href="zlib.html">ZLIB</a></li>
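Now fetch each of these pages with Node.js and save it with fs:
var Client = require('node-rest-client').Client;
var fs = require('fs');
var client = new Client();
client.get("https://nodejs.org/api/buffer.html", function (data, response) {
  // data holds the page body; write it to a local file for offline use
  fs.writeFile("buffer.html", data.toString(), function (err) {
    if (err) throw err;
  });
});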
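The step that is still missing is turning the index page into a list of child pages to fetch. Here is a rough sketch of that step in Python, using only the standard library; LinkCollector, child_pages and local_name are hypothetical helpers, and the short index_html string stands in for the real index page. Nothing is downloaded here: each returned URL would still have to be fetched and written to disk, and stylesheets or images would need the same treatment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def child_pages(index_html, base_url):
    """Return absolute URLs linked from the index that live under base_url."""
    collector = LinkCollector()
    collector.feed(index_html)
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if absolute.startswith(base_url) and absolute != base_url and absolute not in urls:
            urls.append(absolute)
    return urls


def local_name(url):
    """File name to use when saving the page locally."""
    return urlparse(url).path.rsplit("/", 1)[-1]


# Stand-in for the index page shown above.
index_html = """
<a href="/" title="Go back to the home page">Node.js</a>
<ul>
<li><a class="nav-buffer" href="buffer.html">Buffer</a></li>
<li><a class="nav-fs" href="fs.html">File System</a></li>
</ul>
"""
pages = child_pages(index_html, "https://nodejs.org/api/")
for url in pages:
    print(url, "->", local_name(url))
```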