Solr and web site indexing to create a site search

I'm trying to build a 'site search' for a simple HTTP site.
I have a site, let's call it www.mycompany.com, that is pure HTML.
Is there an easy way to use Solr to index the entire site and build a full-text search with Solr as the engine?
I googled for a bit and could not find anything specific of the type:
Do A
Do B
...
profit!
Also let me know if I'm a bit off about what Solr is for :P
Thanks in advance.

Solr is only for indexing and searching text; it does not have a crawler, since that is outside the project's scope.
However, take a look at Nutch, which is a crawler and not too hard to set up initially.
Nutch and Solr can be integrated if you need Solr-specific features to search the index.
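As a rough sketch of that integration (Nutch 1.x style; the Solr URL is a placeholder, and the exact solrindex arguments differ between Nutch versions):
# Crawl two levels deep starting from the seed URLs in urls/
$ bin/nutch crawl urls -dir crawl -depth 2 -topN 50
# Push the crawled segments into a local Solr index
$ bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*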

$ bin/solr create -c corename
$ bin/post -c corename https://siteurl.com -recursive 2 -delay 1
This does a basic index of the site. It won't be the best index, but if you want simple, there it is. It can be done.
I believe this only works on Solr 5+, where the bin/post tool was introduced.
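Once the post command finishes, you can sanity-check the index with a plain query against the core (a minimal sketch, assuming Solr's default port and the corename core created above):
# Return the top 5 documents matching "hello" in the default search field
$ curl 'http://localhost:8983/solr/corename/select?q=hello&rows=5'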

Two other options you might want to look at are Crawl Anywhere and Heritrix.

Related

Implementing a docs search for multiple docs sites

We have many different documentation sites, and I would like to search for a keyword across all of them. How can I do that?
I have already thought about implementing a simple web scraper, but that seems like a very ugly solution.
An alternative may be to use Elasticsearch and somehow point it at the different doc repos.
Are there better suggestions?
Algolia is the absolute best solution that I can think of. There are also Typesense and Meilisearch, of course.
Algolia is meant specifically for situations like yours, so it even comes with a crawler:
https://www.algolia.com/products/search-and-discovery/crawler/
https://www.algolia.com/
https://typesense.org/
https://www.meilisearch.com/
Here's a fun page comparing them (probably a little biased in Typesense's favor):
https://typesense.org/typesense-vs-algolia-vs-elasticsearch-vs-meilisearch/
Here are some example sites that use Algolia Search:
https://developers.cloudflare.com/
https://getbootstrap.com/docs/5.1/getting-started/introduction/
https://reactjs.org/
https://hn.algolia.com/
If you personally are just trying to search for a keyword, then as long as the sites are indexed by Google, you can always search with the format site:{domain} "keyword".
You can check out Meilisearch for your use case. Meilisearch is an open-source search engine written in Rust.
Meilisearch comes with a document scraper tool ( https://github.com/meilisearch/docs-scraper ) that can scrape content and then index it.
To use it, you define exactly what content you are searching for in the scraper tool's configuration file, and then you run the tool using Docker, as sketched below.
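A minimal sketch of that run, based on the docs-scraper README (the host URL, API key, and config path are placeholders, and the image tag and exact invocation may differ by version; config.json holds your start_urls and the CSS selectors to index):
# Scrape the configured docs sites and push the content into a Meilisearch index
$ docker run -t --rm \
    -e MEILISEARCH_HOST_URL='http://localhost:7700' \
    -e MEILISEARCH_API_KEY='masterKey' \
    -v /absolute/path/to/config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json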

Monitoring search terms in Plone

I'd like to get a better idea of what people are searching for when they're using our website.
Just curious, what's the best way to monitor what's being entered into the search field in Plone 4? I saw this product (http://plone.org/products/ifsearchmonitor), but it's an old one. Has anyone used it with Plone 4, or does anyone know of something similar?
Okay, I don't know why it took me so long to realize this, but it's built into Google Analytics. Here are the instructions: https://support.google.com/analytics/answer/1012264?hl=en
And the search query parameters I used for Plone are: @@search, SearchableText, advanced_search
Using Google Analytics' site search won't track users using the livesearch (i.e. searches made without pressing Enter and submitting to the @@search view).
For AWStats, I use this extra section to track both:
# updated version for plone 4.3
# /livesearch_reply?q=testsuche
# /@@search?SearchableText=testsuche
# /@@updated_search?SearchableText=testsuche
# livesearches shown as q=, normal searches with just the phrase
ExtraSectionName1="Plone search queries"
ExtraSectionCodeFilter1="200 304"
ExtraSectionCondition1="URL,\/@@search||URL,\/search||URL,\/@@updated_search||URL,\/livesearch_reply"
ExtraSectionFirstColumnTitle1="Search:"
#ExtraSectionFirstColumnValues1="QUERY_STRING,SearchableText=([^&]+)||QUERY_STRING,q=([^&]+)"
ExtraSectionFirstColumnValues1="QUERY_STRING,SearchableText=([^&]+)||QUERY_STRING,(q=[^&]+)"
ExtraSectionFirstColumnFormat1="%s"
ExtraSectionStatTypes1=PL
ExtraSectionAddAverageRow1=0
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=100
MinHitExtra1=1
If you want to track the livesearch in Google Analytics, you'll need to use event tracking: https://developers.google.com/analytics/devguides/collection/analyticsjs/events

TYPO3 - Indexed Search and how to index extension

I use indexed_search and RealURL, and I need the whole URL to be shown in the search results.
Right now it only shows the part of the URL that is related to pages, not the part that is related to my extension.
Now it shows: domain.dk/products/
But it should show: domain.dk/products/product/product-title
I don't know whether it is in the RealURL configuration or in Indexed Search that I should make some changes.
There are some pretty good explanations on the web showing how to index database/extension records with the crawler extension. Try this one as a start; it shows everything step by step and with screenshots, so I guess it should be useful.
If that is not enough, there are ready-to-use examples for tt_news and other extensions in the crawler documentation.

How to omit JavaScript and comments using nutch crawl?

I am a newbie at this, trying to use Nutch 1.2 to fetch a site. I'm using only a Linux console to work with Nutch, as I don't need anything else. My command looks like this:
bin/nutch crawl urls -dir crawled -depth 3
where the folder urls is where I keep my links, and I get the results in the folder crawled.
When I want to see the results, I type: bin/nutch readseg -dump crawled/segments/20110401113805 /home/nutch/dumpfiles
This works fine, but I get a lot of broken links.
Now, I do not want Nutch to follow JavaScript links, only regular links. Could anyone give me a hint or some help on how to do that?
I've tried editing conf/crawl-urlfilter.txt with no results. I might have typed the wrong commands!
Any help appreciated!
Beware that there are two different filter files: one for the one-stop crawl command and another for the step-by-step commands.
For the rest, just build a regex that matches the URLs you want to skip, put a minus in front of it, and you should be done; see the sketch below.
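A minimal sketch of such a filter entry, assuming Nutch 1.2 and the one-stop crawl command (so conf/crawl-urlfilter.txt; the default rules in that file vary a bit between versions):
# Skip .js URLs so JavaScript files are neither fetched nor parsed for links
-\.js$
# Or extend the existing suffix-skip rule instead, e.g.:
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|js|zip|gz|exe)$
# If script links are still being extracted from HTML pages, another lever
# (worth checking for your version) is to drop "js" from the
# parse-(text|html|js) entry in plugin.includes in conf/nutch-site.xml,
# which disables the parse-js plugin that pulls URLs out of script content.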

Grails (On App Engine) - Basic Search Functionality

What I need is search scaffolding, but in its absence I was wondering if you could point me in the direction of any really simple examples of adding search to a domain class.
I can't use the Searchable plugin, as it conflicts with the App Engine plugin (unless someone has gotten this to work?). I just need to be able to filter the scaffolded list so it contains only the results that match the query. I don't need a pure text-box solution; I imagine it looking exactly like the 'create' form, except that when you submit you get a list of matching objects.
I hope this makes sense. Thanks in advance!
Gav
Google App Engine - Full Text Search
