How to remove <div ...> tags in an Excel cell with a lot of HTML code? - excel

I scraped data from a website using Python. The results are great.
Only the clean-up afterwards is annoying.
The HTML source code is (of course) cluttered with various kinds of <div class="abc123"> tags.
Is there an Excel trick to remove them quickly?
Right now I search for a <div ...> tag manually and remove that specific one via the search-and-replace function.
Then I jump to the next div tag, and so on...
Isn't that a bit too much of the "good old school" way to remove them?
Of course, there are online services (free and paid) that do this, but I'm sure there is a trick in Excel; I just can't think of it right now. Using an external service such as an online HTML cleaner also means extra, unnecessary work.

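After experimenting with some patterns I was able to solve my problem. (Strictly speaking this is not RegEx: Excel's Find and Replace uses wildcards, where * matches any run of characters.)
Press Ctrl + H to open Find and Replace, enter the pattern <div *> and replace it with whatever you like (or leave the field empty to delete the tags):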
Find what: <div *>
Replace with:
That's it. Finally.
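Since the data was scraped with Python in the first place, the same clean-up could also be done before the text ever reaches Excel. The following is only a minimal sketch using the standard re module; the function name strip_div_tags and the sample string are made up for illustration, and unlike the Excel pattern above it also removes the closing </div> tags. It assumes no attribute value contains a literal > character.

```python
import re

# Matches an opening <div ...> tag with any attributes, and also a closing </div>.
# \b keeps tags such as <divider> from being matched by accident.
DIV_TAG = re.compile(r"</?div\b[^>]*>", re.IGNORECASE)


def strip_div_tags(html: str) -> str:
    """Remove <div ...> and </div> tags but keep everything between them."""
    return DIV_TAG.sub("", html)


sample = '<div class="abc123"><p>Hello</p></div>'
cleaned = strip_div_tags(sample)
print(cleaned)  # <p>Hello</p>
```

For anything more complicated than stripping one kind of tag, a real HTML parser is usually a safer choice than a regular expression.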

Related

Extract multiple image links from a single div tag that has no class name

Hi, I have this HTML code from a website. I want to be able to extract multiple images; in some cases there are 3-4 images. How would I go about doing that?
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
My code:
chakras1 = soup.find_all("div")[42].img['src']
print(chakras1)
Result:
https://naruto-arena.net/images/energy/energy_2.gif
I only get the FIRST image but not the second one.
To extract multiple images from within a tag, I personally stick to using for loops. What I mean is: once you find the div tag that you want to look within, and say you call that chakras1 (that is, soup.find_all("div")[42] without the .img['src'] part), I would write the following:
for img in chakras1.find_all("img"):
    print(img["src"])  # print the URL of each image inside the div
For me the steps go roughly as follows:
find the specific tags you are looking for (img, for its src attribute)
see what tag those tags are within (here, the div)
navigate through the HTML to that tag (using Beautiful Soup's .find or .find_all functions, depending on what you want to use)
once you have navigated to the tag, search within that tag for the tags you're really looking for.
One quick note: with this method, if you are looking through multiple div tags, you will also need to loop through those.
I hope this makes sense!
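The steps above can be sketched end to end. To keep the example runnable without installing anything, it uses the standard library's html.parser instead of Beautiful Soup, so it is a stand-in for the approach rather than the answer's exact code; the class name DivImageCollector is made up for illustration.

```python
from html.parser import HTMLParser


class DivImageCollector(HTMLParser):
    """Collect the src of every <img> that sits inside some <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <div> tags currently open
        self.sources = []   # src values found while inside a div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


html = """<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>"""

collector = DivImageCollector()
collector.feed(html)
for src in collector.sources:
    print(src)
```

This prints both image URLs from the question's snippet, not just the first one.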

How to filter text after webscraping

So I'm trying to webscrape this website that provides novels for free, for example this page: https://www.wuxiaworld.com/novel/martial-world/mw-chapter-1
I'm trying to extract only the title and the body of the chapter. Finding the title is easy enough since it's in an h4; however, the body of the chapter is not wrapped in any specific div tag, so I cannot just isolate it. I was wondering how I'd do this. The closest I've gotten to just having the text is this.
P.S. I'm new to web scraping, sorry if my question is unclear or stupid.
I tried to identify whether the body text was under any exclusive div tag, but it wasn't, so I tried to select it via the closest div tag. This still returned a lot of useless and unwanted text.
Edit: @koro, there's more than one instance of fr-view being used, so it doesn't isolate the text. The fr-view class also appears before the chapter text.
I'm not versed in web scraping, but upon reviewing the page source I see that <div class="fr-view"> only precedes the body text on the novel pages. If you start logging after the scraper identifies this line, you should be able to stop at the very next <a href="/novel..... tag so that only the novel text is included.
Some of the pages I looked at also include footnotes with extra information; these use an <a href=#footnote....> tag. So if you would like to keep the footnotes, search specifically for <a href=/novel...> and NOT for any <a href=...>.
P.S. I only looked at 4 pages, and while they all appear to have the format I've described above, it's still possible that you will run into issues, but that's a bridge you can cross when you get there!
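A rough sketch of that idea, using only the standard library's html.parser: start collecting text at a div with class fr-view and stop at the first link whose href starts with /novel, while letting #footnote links pass. The sample markup below is invented to mirror the description and is not the real page source, and the class name ChapterTextExtractor is hypothetical. The sketch starts at the first fr-view it sees, so it does not by itself handle the case raised in the question's edit where fr-view occurs more than once.

```python
from html.parser import HTMLParser


class ChapterTextExtractor(HTMLParser):
    """Capture text between <div class="fr-view"> and the next /novel link."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if tag == "div" and "fr-view" in classes:
            self.capturing = True
        elif tag == "a" and self.capturing and href.startswith("/novel"):
            # Navigation link reached: the chapter text is over.
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        text = data.strip()
        if self.capturing and text:
            self.parts.append(text)


# Invented sample that mimics the structure described above.
page = """
<h4>Chapter 1</h4>
<div class="fr-view">
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p><a href="#footnote-1">Note 1</a></p>
<a href="/novel/martial-world/mw-chapter-2">Next Chapter</a>
</div>
<p>Site footer</p>
"""

extractor = ChapterTextExtractor()
extractor.feed(page)
extractor.close()
print("\n".join(extractor.parts))
```

The footnote link text is kept because its href starts with # rather than /novel, which matches the distinction made above.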

HTML Agility pack + Select node by its inner text

I have gotten the hang of using the HTML Agility Pack to find specific nodes using their attributes and XPaths. The problem is, I've been doing this manually for each of my projects (opening the website's HTML and scanning for the nodes that have the text I need). Is there a way to select a single node by its inner text? This would make it easier to write an update script for websites whose content scheme stays the same but whose attribute values change over time. Thanks in advance!
It would have been better if you had provided sample HTML, but since you haven't, let's assume we have HTML containing this markup:
<body>
<div class="foo">bar</div>
</body>
You can select the <div> by its attribute using HtmlAgilityPack's SelectSingleNode() and XPath like so:
myHtmlDocument.DocumentNode.SelectSingleNode("//div[@class='foo']");
or you can select the same node by its inner text like so:
myHtmlDocument.DocumentNode.SelectSingleNode("//div[.='bar']");
Hope this helps.
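The two XPath predicates can be tried out quickly outside of C#. This is only a sketch in Python using xml.etree.ElementTree, which supports a limited XPath subset (the [.='text'] predicate needs Python 3.7 or later) and requires relative paths such as .//div instead of //div. The second div in the sample is an addition for illustration.

```python
import xml.etree.ElementTree as ET

# Same markup as above, plus a second (made-up) div so the predicates matter.
markup = '<body><div class="foo">bar</div><div class="baz">qux</div></body>'
root = ET.fromstring(markup)

by_attribute = root.find(".//div[@class='foo']")  # select by attribute value
by_text = root.find(".//div[.='bar']")            # select by inner text

print(by_attribute is by_text)  # True
```

Both lookups land on the same element, so a script that selects by text keeps working even if the class attribute later changes.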

Selecting parent elements with WebdriverJS

I'm trying to create some scripts which require moving through a lot of on-the-fly HTML with random IDs. They require getting the parents of an element - but I'm not sure how to implement this in WebdriverJS.
If I access the element I want via a console, I can do the following to get it;
document.querySelector('span[email="noreply@example.com"]').parentNode.parentNode
Is there a way to do this in WDJS? I've looked and can't see anything obvious - it's specifically the parent stuff I'm having issue with. I saw that a possible solution may be xPath however I'm unsure of syntax having never used it before.
Thanks in advance!
I don't know the syntax of WebDriverJS, but the XPath is shown below; you will need to fit it into whatever locator call WebDriverJS provides.
This is based on your CSS selector, so please show your HTML if needed.
.//span[@email='noreply@example.com']/../..
For example, if you have HTML like this
<div>
<div>
<span email="noreply@example.com">Contact me</span>
</div>
</div>
You can avoid using .. to go up by selecting the outer div with a predicate instead:
.//div[./div/span[@email='noreply@example.com']]
If you have more levels to go up, another handy method is the ancestor axis from XPath Axes.
Also, as @sircapsalot brought up, the CSS selectors spec doesn't support selecting parents, so XPath is the only way to go, unless you inject JS.
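The /../.. path can be sanity-checked without a browser. The sketch below is not WebdriverJS; it uses Python's xml.etree.ElementTree, whose XPath subset supports .. for the parent element but not nested path predicates such as the second expression above. The id attributes and the wrapping body element are added here purely so the result can be identified.

```python
import xml.etree.ElementTree as ET

# Same nesting as the example above; ids are hypothetical, added for checking.
markup = """<body>
<div id="outer">
<div id="inner">
<span email="noreply@example.com">Contact me</span>
</div>
</div>
</body>"""

root = ET.fromstring(markup)

# Find the span by its attribute, then step up two levels with ../..
grandparent = root.find(".//span[@email='noreply@example.com']/../..")
print(grandparent.get("id"))  # outer
```

A single /.. would return the inner div instead, mirroring one .parentNode call in the browser console.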

How can I hide certain text from search engines?

In my WordPress blog, I have "Posted ? days ago" on every post. I have 10 posts on my homepage. So according to most keyword analysis tools, "days ago" is a top keyword on my blog, but I don't want it to be. How can I hide those words from search engines?
I don't want to use JavaScript. I can easily use PHP and the $_SERVER variable, but I'm afraid I might get penalized for cloaking. Is there an HTML tag or an attribute like rel="nofollow" that I can use?
From "Is there any way to have search engines not index a certain section of a page?":
Supposedly you can add the class robots-nocontent to elements on your page, like this:
<div class="robots-nocontent">
<p>Ignore this stuff.</p>
</div>
Yahoo respects this, though I don't know if other search engines respect this. It appears Google is not supporting this at this time.
I suspect if you load your content via ajax you would get the same effect of it not being present on the page.
and
There's no general way to do that and personally I wouldn't bother with it. Search engines are pretty good at recognizing relevant content on a page, and even though that content might show up in the keywords that search engines have found, it doesn't mean that it would make the page relevant for those keywords.
If you have a page about "Fish" and a page about "Dogs" (that has the link to the page about "Fish" somewhere in the sidebar), search engines will generally be able to recognize that the page about "Fish" is much more relevant for "Fish" than the page about "Dogs" that mentions "Fish" in the sidebar. It's possible that both pages might be found at some point, but generally given that mostly one page from the site is shown in the search results, that's not something worth worrying about.
There's no need to be fancy with that, and search engines are likely to just get more confused if you try (eg if you use JavaScript to hide the content, you never know when search engines will start to find that content regardless). Similarly, using iframes with robots.txt disallows or AJAX will frequently degrade the quality of your pages to users (slow it down or make it less usable on a variety of devices), so unless there is a very, very strong & proven reason that you need to do this, I would strongly recommend not bothering with it.
What I have found on wiki:
For Yandex:
<!--noindex-->Don't index this text.<!--/noindex-->
For Yahoo:
<div class="robots-nocontent">Don't index this text.</div>
For Google:
<!--googleoff: index--> Don't index this text.<!--googleon: index-->
Linksku, I'm fairly sure you shouldn't be worried about that particular piece of text. Our algorithms do a relatively good job detecting boilerplate text. As far as I can tell from your question, this text is boilerplate and we likely already know that.
As for detecting Googlebot and not serving this text to it, you're right, that would be cloaking and you should never do it. In this case, if you hide that text from us, we will also have a hard time detecting that it's boilerplate and you would end up doing exactly what you're trying to avoid :)
I worked this out and posted it up at: http://www.scivillage.com/thread-2580.html
This should work, however more testing of it and feedback would be appreciated.
.x:before{
content:attr(title);
display:inline;
}
<ul>
<li><span class="x" title="Homepage"></span></li>
<li><span class="x" title="Contact" /></li>
</ul>
(I kept the class name short to reduce mark-up creep.)
Search engines should ignore HTML tags with empty values when looking for keywords, which should mean that they ignore what is written in the title attribute. (The assumption is that the element's value is what matters; if it's empty, there is no point in checking the attributes.)
It was suggested that the closing tag can be omitted in HTML5 due to its reduced strictness; however, there are counter-suggestions that end tags are still required.
I'd suggest not using it directly on a (anchor) tags, since they can be used for sitemaps (using #), which means the title would likely be spidered.
It is also possible that a search engine might assume any title content is there to inflate keywords through hidden elements; however, I cannot confirm this.
To exclude specific text from Google search results you can add data-nosnippet attribute.
https://developers.google.com/search/reference/robots_meta_tag#data-nosnippet-attr
From the Google documentation:
You can also prevent certain parts of the page text content from being shown in a snippet by using data-nosnippet.
HTML:
<div class="hasHiddenText">_</div>
It is important that you leave a non-whitespace character inside the element that carries the hidden text.
External CSS:
.hasHiddenText{
content: "Your hidden text here...";
/*This overwrites the default content of the div but it isn't supported by all browsers.*/
}
.hasHiddenText::before{
content: " Your hidden text here...";
/*Places a hidden text above the div.*/
}
Here, "hidden text" means content that is hidden from search engines but visible to visitors.
You can also use line breaks and all sorts of Unicode characters by escaping them in CSS as a backslash followed by the hexadecimal code point (for example \A for a line feed), not as \uXXXX. To display line-break characters correctly, be sure to add the
white-space:pre-line;
property.

Resources