How to filter text after webscraping

How to filter text after webscraping - python-3.x

So I'm trying to webscrape this website that provides novels for free, for example this page: https://www.wuxiaworld.com/novel/martial-world/mw-chapter-1
I'm trying to only extract the title and the body of the chapter. Finding the title is easy enough since its in h4, however the body of the chapter is not separated by any specific div tags so I cannot just isolate it. I was wondering how I'd do this. The closest Ive gotten to just having the text is this.
Ps. Im new to webscraping, sorry if my question is unclear or stupid.
I tried to identify if the body of text was under any exclusive div tag but it wasn't, so i tried to call it under whatever the closest div tag was, this still returned a lot of useless and unwanted text.
edit : #koro, there's more than one instance of fr-view being used so it doesn't isolate the text. fr-view class also appears before the chapter text.

I'm not versed in webscraping but upon reviewing the page source html I see that <div class="fr-view"> only precedes the body text on the novel pages. If you start the logging after the scraper identifies this line you should be able to stop at the very next <a href="/novel..... tag to only have the novel text included.
Some of the pages I see also include footnotes with some extra information, these include an <a href=#footnote....> tag, so if you would like to keep the footnotes included I would search for <a href=/novel...> and NOT <a href=...>
P.S. I only looked at 4 pages and while they all appear to have the same format that I've pointed out above it's still possible that you may run into issues, but that's definitely something you can a bridge you can cross when you get there!

Related

How to remove <div ...> tags in a Excel cell with much HTML code?

i scraped datas from a website using Python. The results are great.
Just the after-work-party is something annoying.
The HTML source code is (of course) messed up with various kind of div class="abc123"> tags.
Do we have an Excel geek-trick to remove them quickly?
I search a <div ...> tag manually and remove the specified one via search'n replace function.
After I jump to the next div tag and so on....
Isn't it a little bit too much of the "good old school" way to remove it?
Of course, there are some online services (free and paid) to do that, but I'm sure we have a trick in Excel, I just can't get it out right now. And using external services like an online tool for cleaning HTML code is again a extra workload - unnecessary.

After experimenting around with some RegEx I was able to solve my problem:
Pressing Ctrl + H for opening Find and Replace then entering my pattern <div *> and replacing it with - whatever:
Find what: <div *>
Replace with:
That's it. Finally.

Extract multiple image links from a single div tag that has no class name

Hi I have this HTML code from a website: I want to be able to extract multiple images, I have cases where there is 3-4 images. How would I go about doing that?
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
My code:
chakras1 = soup.find_all("div")[42].img['src']
print(chakras1)
Result:
https://naruto-arena.net/images/energy/energy_2.gif
I only get the FIRST image but now the second one.

To extract multiple images from within a tag, I personally stick to using for loops. So what I mean is once you find the div tag that you want to look within and say you call that chakras1, I would write the following:
for img in chakras1.find_all("img"):
print(img)
For me the steps kind of go as follows:
find the specific tags you are looking for (img src)
see what tag those tags are within (
navigate through the HTML to that tag (using beautiful soup's .find or .find_all functions depending on what you want to use)
once you have navigated to the tag search within that tag for the tags you're really looking for.
One quick note, with this method, if you are looking through multiple div tags, you're also going to need to loop through those as well.
I hope this makes sense!

Kentico 9 search result transformation

We've noticed a bug when looking at French search results. in the CMS Desk, i've kept the Page Name in English for the French content. The issue is, these are showing on the French results page.
in the transformation, based off the default one, I present the clickable title like this:
<a href='<%# SearchResultUrl() %>' data-type="title" target="_blank" ><%#SearchHighlight(HTMLHelper.HTMLEncode(CMS.ExtendedControls.ControlsHelper.RemoveDynamicControls(DataHelper.GetNotEmpty(Eval("Title"), ""))), "<span class='highLight'>", "</span>")%></a>
Here's my thinking, if the Menu Caption is filled out, use that rather than title. How do i output DocumentMenuCaption without adjust the search fields on the menu page type?
I think my logic is, check if DocumentMenuCaption is emtpy, if it use, use Title.

You should be able to continue using GetNotEmpty and just pass in the DocumentMenuCaption first, something like this:
<%# GetNotEmpty(GetSearchValue("DocumentMenuCaption");Eval("Title")) %>
You may or may not need the "GetSearchValue" function, but that allows you to grab values from the object that may not be available in the default set of columns for the search results.
Alternatively, you should be able to use the IfEmpty() method:
<%# IfEmpty(GetSearchValue("DocumentMenuCaption"), Eval("Title"), GetSearchValue("DocumentMenuCaption")) %>
Both transformation methods taken from here (double check syntax on "GetNotEmpty" as there are different ways it's implemented: https://docs.kentico.com/k9/developing-websites/loading-and-displaying-data-on-websites/writing-transformations/reference-transformation-methods
You can read more about the search transformations here: https://docs.kentico.com/k9/configuring-kentico/setting-up-search-on-your-website/displaying-search-results-using-transformations

How can I hide certain text from search engines?

In my WordPress blog, I have "Posted ? days ago" on every post. I have 10 posts on my homepage. So according to most keyword analysis tools, "days ago" is a top keyword on my blog, but I don't want it to be. How can I hide those words from search engines?
I don't want to use Javascript. I can easily use PHP and the $_SERVER variable, but I'm afraid I might get penalized for cloaking. Is there a HTML tag or an attribute like rel="nofollow" that I can use?

From Is there any way to have search engines not index a certain section of a page?
Supposedly you can add the class
robots-nocontent to elements on your
page, like this:
<div class="robots-nocontent">
<p>Ignore this stuff.</p>
</div>
Yahoo respects this, though I
don't know if other search engines
respect this. It appears Google is
not supporting this at this time.
I suspect if you load your content via
ajax you would get the same effect of
it not being present on the page.
and
There's no general way to do that and
personally I wouldn't bother with it.
Search engines are pretty good at
recognizing relevant content on a
page, and even though that content
might show up in the keywords that
search engines have found, it doesn't
mean that it would make the page
relevant for those keywords.
If you have a page about "Fish" and a
page about "Dogs" (that has the link
to the page about "Fish" somewhere in
the sidebar), search engines will
generally be able to recognize that
the page about "Fish" is much more
relevant for "Fish" than the page
about "Dogs" that mentions "Fish" in
the sidebar. It's possible that both
pages might be found at some point,
but generally given that mostly one
page from the site is shown in the
search results, that's not something
worth worrying about.
There's no need to be fancy with that,
and search engines are likely to just
get more confused if you try (eg if
you use JavaScript to hide the
content, you never know when search
engines will start to find that
content regardless). Similarly, using
iframes with robots.txt disallows or
AJAX will frequently degrade the
quality of your pages to users (slow
it down or make it less usable on a
variety of devices), so unless there
is a very, very strong & proven reason
that you need to do this, I would
strongly recommend not bothering with
it.

What I have found on wiki:
For Yandex:
<!--noindex-->Don't index this text.<!--/noindex-->
For Yahoo:
<div class="robots-nocontent">Don't index this text.</div>
For Google:
<!--googleoff: index--> Don't index this text.<!--googleon: index-->

Linksku, I'm fairly sure you shouldn't be worried about that particular piece of text. Our algorithms do a relatively good job detecting boilerplate text. As far as I can tell from your question, this text is boilerplate and we likely already know that.
As for detecting Googlebot and don't serving this text for it, you're right, that would be cloaking and you should never do it. In this case if you hide that text from us, we will also have a hard time detecting it's boilerplate and you would end up doing exactly what you're trying to avoid :)

I worked this out and posted it up at: http://www.scivillage.com/thread-2580.html
This should work, however more testing of it and feedback would be appreciated.
.x:before{
content:attr(title);
display:inline;
}
<ul>
<li><span class="x" title="Homepage"></span></li>
<li><span class="x" title="Contact" /></li>
</ul>
(I kept the class name short to reduce mark-up creep)
The search engines should ignore HTML tags with empty values when comes to looking for keywords, this should mean that it ignores what is written in the title attribute. (It assumes that the value is what's important, if it's empty then there is no point checking the attributes)
It was suggested that it's possible to negate having the closing tag in HTML5 due reduced strictness, however there is counter suggestions that end tags are still required.
I'd suggest not using it directly on a (anchor) tags since they can be used for sitemaps (using #), so it's means they would like have the Title spidered.
Although it is possible that it might assume any title content is there to inflate keywords through hidden elements, however I can not confirm this.

To exclude specific text from Google search results you can add data-nosnippet attribute.
https://developers.google.com/search/reference/robots_meta_tag#data-nosnippet-attr
From google documentation
You can also prevent certain parts of the page text content from being shown in a snippet by using data-nosnippet.

HTML:
<div class="hasHiddenText">_</div>
It is important that you leave a non-whitespace character between the element with a hidden text.
External CSS:
.hasHiddenText{
content: "Your hidden text here...";
/*This ovewrites the default content of the div but it isn't supported by all browsers.*/
}
.hasHiddenText::before{
content: " Your hidden text here...";
/*Places a hidden text above the div.*/
}
The "hidden text" pertains to content hidden to all search engines but visible to visitors.
You can also use nextline and all sorts of Unicode characters by escaping them with \uXXXX. To display linebreak characters correctly, be sure to add the
white-space:pre-line;
property.

Web accessibility and h1-h6 headings - must all content be under these tags?

At the top of many pages in our web application we have error messages and notifications, 'Save' and other buttons, and then our h1 tag with the content title. When making a web application accessible, is it ever acceptable to have content above the top-level structure tag like we do here?

As a screen reader user I don't like content above the main heading. Normally I navigate by headings so would miss the error message. A better solution is to output an h1 heading above the error message, then leave the rest of your headings in tact giving you two h1 headings.

Yes (you can put stuff above them). The H simply means Heading. It's a question of what the heading relates to I guess.
My only caveat is, H2 shouldn't really be above H1, and H3 Shouldn't be above H2. But I don't think it's an actual rule.Websites have menus, warning, notifications. It's acceptable to put them above the rest of your content. I don't see how it would affect accessibility as long as your content is ordered logically. Look at the page CSS turned off. Does it look logical? That's the most important part of accessibility.
Although some people do go that extra mile and have the menu as the last item in the markup and use CSS to bring it back to the top. Personally, I find that solution counter productive. The menu is still important, it belongs at the top of the page.

Yes, just consider it is in that order that the user will get the information. So, if you just did an operation it sounds like a good idea to get any message related to it as the first thing. If it is a notification that appears on any page unrelated to what you are doing, I wouldn't put it above, as it might be a little weird.
Also you can use a text browser that doesn't use styles, it should look like a document with appropriate headers.

Heading tags are used to indicate the hierarchy of the content below it. You should only have one h1 tag and it should be the first content to appear on your page (this is usually the name of the site). Also, you shouldn't skip heading tags when drilling down through different tiers of content.
In your case, you can still use CSS to position items above the h1 tag as long as it is in the correct order in the html.

I assume the elements above the heading are used by JavaScript. In that case, it's preferable if they are created by JavaScript, not included in the source of the page.
To return to your original question, it is probably best that they be at the foot of the page. However, if they are hidden using the CSS "display: none;" or "visibility: hidden;" properties then they will not be seen by most (perhaps all?) screenreaders or by many other assistive technologies, and so should not be an issue. I've written a fairly detailed explanation of why accessibility technology ignores such elements.
Of course if somebody disables CSS things are going to look pretty messy. If there is content on the page that can be used even when CSS and/or JavaScript are disabled, then putting those elements at the bottom of the page will at least make things less cluttered.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to filter text after webscraping - python-3.x

Related

How to remove <div ...> tags in a Excel cell with much HTML code?

Extract multiple image links from a single div tag that has no class name

Kentico 9 search result transformation

How can I hide certain text from search engines?

Web accessibility and h1-h6 headings - must all content be under these tags?

Categories

Resources