How to parse a document using crawler4j - search

I wanted to parse all the documents containing some text I enter as "query" using crawler4j in Eclipse.
Any ideas?

Not really a "direct" answer, but I also played with crawling these last few days. I looked first at Crawler4J, then stumbled on JSoup. Did not play much with the crawler, but jSoup turns out to be quite an easy tool for parsing. Hence my suggestion. I guess crawler is good if you really need to crawl a part of the web. But JSoup really seems to shine as a good parser. Similar to JQuery in terms of selecting nodes etc... So perhaps use the crawler for first collecting documents, then parse them using JSoup. Here's a quick example:
Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").timeout(5000)
.get();
Elements els = doc.select("li");

Related

Implementing a docs search for multiple docs sites

We have many different documentation sites and I would like to search a keyword across all of these sites. How can I do that?
I already thought about implementing a simple web scraper, but this seems like a very ugly solution.
An alternative may be to use Elasticsearch and somehow point it to the different doc repos.
Are there better suggestions?
Algolia is the absolute best solution that I can think of. There's also Typesense and Meilisearch of course.
Algolia is meant specifically for situations like yours, so it even comes with a crawler.
https://www.algolia.com/products/search-and-discovery/crawler/
https://www.algolia.com/
https://typesense.org/
https://www.meilisearch.com/
Here's a fun page comparing them (probably a little biased in Typesense's favor)
https://typesense.org/typesense-vs-algolia-vs-elasticsearch-vs-meilisearch/
Here are some example sites that use Algolia Search
https://developers.cloudflare.com/
https://getbootstrap.com/docs/5.1/getting-started/introduction/
https://reactjs.org/
https://hn.algolia.com/
If you personally are just trying to search for a keyword, as long as they're indexed by Google, you can always search with the format site:{domain} "keyword"
You can checkout Meilisearch for your use case. Meilisearch is a Rust based and open sourced search engine.
Meilisearch comes with a document scraper tool ( https://github.com/meilisearch/docs-scraper ) that can scrape content and then also index it.
While using it you need to define what exact content you are searching for in the configuration file for the scraper tool. And then you can run the tool using Docker.

How do I check if certain text exists on a page (puppeteer)

Sorry in advance if I seem kinda clueless, I just started using puppeteer yesterday and I’m inexperienced with this kinda stuff.
I’m trying to check if a certain page (opened with puppeteer) has the phrase “hello” for example, keep in mind that I know the XPath of the text (if it exists). I’ve tried .waitForXPath() but I can’t seem to get it to work. Is there an easier function for this?
(await page.content()).match('hello')
That depends on what you typed into .waitforXPath() method.
I can imagine this can work:
await page.waitForXPath("//*[contains(text(), 'hello')]");
But it might be slow because all texts of all elements will be searched. It's better to narrow down the search to e.g. some elements. Unfortunately you don't provide more specifics, so I can't help you there.

Extracting a url from a hotspot

I've got a collection of url hotspots in a Notes document. I'd like to extract the urls associated with the hotspots. Any hints on how to approach this?
thanks
clem
I recommend using the DXL export features and then walking the resulting XML to find the URLs. Try it, and if you have issues, come back and ask questions about the specific piece of code.

How does google return "searches" from other websites?

Let's say I'm performing a google search for search term.
Sometimes, one of the suggestions will be to a URL like this: www.someothersearch.com/search+term/
How does "someothersearch.com" do this?
In general, a page will only be in Google if some other page links to it. Google is not going to go to someothersearch.com and submit "search term" into the form, it is likely a hidden or nonhidden link on someothesearch.com.
Why not? someothersearch.com presumably has its own index pages for terms searched previously; the Google spider is just indexing those index pages as well.
Just a guess. Maybe these sites support OpenSearch?
I misunderstood your question at first; What these sites are doing is rewriting their requests. How they know which terms people will search for is a bit of a mystery to me, but it probably relies on things like watching google.com/trends, scraping their own and other log files for referral from google that include the search term, buying lists of well ranking terms people might use AdSense for and instead trying to generate natural traffic for them... etc. Probably when they add new pages with these terms they're also adding them to their xml sitemap that Google will crawl.
Redacted:
I have added the Open-Search tag to your question; please follow it. You'll find this post on https://stackoverflow.com/questions/20830/firefox-and-ie7-users-here-is-your-stackoverflow-search-pluginlink textthe most informative; however I recommend you use image/png for your icon format.

How to get a description of a URL

I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, http://stackoverflow.com">Google: http://stackoverflow.com shows the description as
A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.
This the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Different search engines have different algorithms to get the description out of the page if/when they are lacking the description meta tag. Some ignore the tag even it it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
You may want to check AboutUs.org (i.e. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how google does this:
Webmasters/Site owners Help
Adding a URL to google
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting. some sources are better than others.
For "audiotuts.com" google has a worse description than AboutUs.com.
Google
Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.

Resources