From a pool of webpages, finding pages similar to any given webpage - nlp

I am given a set of webpages and I need to build a page recommender. Whichever URL is given to the application, the application should be able to find out pages from the given pool that are similar to the page at the URL.
I tried looking for different approaches to do that. The use of word2vec interested me. I am planning to crawl through all the given set of webpages and generate tags for that page based on the content in that page. From these tags I was hoping to use word2vec to calculate a vector value for the page and store it. When searching, I would caclulate vector for the given page in similar way to look for similar values. Is this the correct way of using word2vec? What training vector should be used? Any other better way to do this task?Or just plain text matching would be a better option?

I'd recommend using existing IR open source to handle your documents i.e. to index your crawled webpages and to query to get the results.
You can try to index document using elastic index all webpages and to query using more like this query, from elastic documentation:
The More Like This Query (MLT Query) finds documents that are "like" a given set of documents

Related

Implementing a docs search for multiple docs sites

We have many different documentation sites and I would like to search a keyword across all of these sites. How can I do that?
I already thought about implementing a simple web scraper, but this seems like a very ugly solution.
An alternative may be to use Elasticsearch and somehow point it to the different doc repos.
Are there better suggestions?
Algolia is the absolute best solution that I can think of. There's also Typesense and Meilisearch of course.
Algolia is meant specifically for situations like yours, so it even comes with a crawler.
https://www.algolia.com/products/search-and-discovery/crawler/
https://www.algolia.com/
https://typesense.org/
https://www.meilisearch.com/
Here's a fun page comparing them (probably a little biased in Typesense's favor)
https://typesense.org/typesense-vs-algolia-vs-elasticsearch-vs-meilisearch/
Here are some example sites that use Algolia Search
https://developers.cloudflare.com/
https://getbootstrap.com/docs/5.1/getting-started/introduction/
https://reactjs.org/
https://hn.algolia.com/
If you personally are just trying to search for a keyword, as long as they're indexed by Google, you can always search with the format site:{domain} "keyword"
You can checkout Meilisearch for your use case. Meilisearch is a Rust based and open sourced search engine.
Meilisearch comes with a document scraper tool ( https://github.com/meilisearch/docs-scraper ) that can scrape content and then also index it.
While using it you need to define what exact content you are searching for in the configuration file for the scraper tool. And then you can run the tool using Docker.

Tell Solr Search Not to Index Parts of My Page

I'm having an issue where inline Javascript is being displayed in Solr search results on my Drupal website. Is there a way to hide parts of my code from being indexed by Solr similar to how google uses googleoff:index and googleon:index to keep code from being indexed?
If you use the solr search module for drupal, you can tell solr to index specific fields in your content :
https://www.drupal.org/project/search_api_solr
So your javascript will not get indexed.

Drupal, Solr & Facet Api - Persistent facet links in blocks

I need to produce facet block from two vocabularies in my site. I am using Views and a patched version of Views infinite Scroll to generate the search page, using my search index, and I have tweaked everything I could in the facet display settings to see if I could produce the requested results, to no avail.
I do not need keyword searches. I need to show all taxonomy terms in each facets at all times and to be able to select a single criteria at a time from each vocabulary. So, never more thane one selection at a time from each facet block.
Why are you using Solr to store data and generate your search page, if you do not need keyword search and are trying to go against the native working of solr Facets, I hear you say? For performance reasons, it is the reason why I am using Solr to store & serve the results, I have even gone as far as pushing renedered node to the index with the help of the somwhat obscure search_api_solr_view_modes module.
I could take two separate routes
Create a custom block, load all the taxonomy terms, alter the output of the term link to point to the view and provide the TID for the View. The active filter data could be obtained from the view arguments. I know how to do that but feel it is the wrong way to go about it, if I am working with Solr, I should be using a facet, not a custom block.
Build a custom Facet block that has this exact behaviour. After reading a lot of documentation, I git kind of dicouraged with the possibility of doing this simply without having to develop a Facet plugin, which is kind of out of my league.
Any advice is appreciated.
Here is a screenshot of the interface I have to produce.
http://imageshack.com/a/img834/9836/kr0i.png
Each taxonomy term has to be persistent, i.e., produce a link event if there are no nodes indexed under this term.
Selecting a term in one of the vocabularies will deselect previously selected terms
Clicking on the x next to a term will remove it form the active search criterias.
Have a look at this. https://drupal.org/project/ajax_facets This might get you to where you need to be. Sans you infinite scroll. There is a youtube video that goes with it. http://www.youtube.com/watch?v=pBj3OkXLyWs
I'd appreciate it if it works as I haven't tried it my self.

Searching Single Pages with Dynamic Content

I have a slight problem I have been trying to address for a client I have been working with. We have 4 sets of single pages that are loading content from a database using PHP based upon a get string that is provided. These pages that are generated are optimized well for SEO and have alt tags for images and Content that we need to be able to search using a search feature.
Now i had assumed (An everyone knows what assuming gets you) that these pages by default would be able to be searched by the concrete 5 built in search feature. But it doesn't work. If I search for a word that I know is definitely on one of these pages even multiple times no results are found.
How can I make Concrete5 search these pages. If its no do able by a default or by a plugin, then can someone please offer some advice on how to fix this. This is an important feature and must be completed.
EDIT: See my comment below. I still need some help or direction here as CSE inst much of an option.
EDIT2: It may be viable for me to install a crawler and a custom search engine to address my problems. I was thinking of spider. Any other suggestions on that or other options are much appreciated!
Unfortunately C5 doesn't provide a way to do this -- the only way to tap into the search index is with blocks. And even if you created a phony block just to pass content from the single_page through to the search index, there's no way to say that some content is from one URL while other content is from another URL (which you'd need to do since your single_page controller is handling many different URL's).
I don't know of a way to achieve what you want to do (and it appears that nobody else does either -- http://www.concrete5.org/community/forums/customizing_c5/make-content-in-single-pages-searchable/ ), other than building your own internal search engine.
EDIT: I just did some digging, and thought that perhaps you could manually insert records into the PageSearchIndex table and specify the searchable content and the desired path there -- but this won't work because it relies on one cID (collection id, a.k.a. page id) per entry -- so you'd only be able to insert one record for the top-level single_page path.
I think the simplest solution here would be to create your own searching infrastructure for your single_pages (like some kind of function in the controller that would return an array of page paths and searchable content for each one), then override the search block and perform an additional search of your single_page -- then combine the results on the search results page there. Or just use google site search for your site, which will actually crawl the pages and hence find your various single_page urls: https://www.google.com/cse/
Best of luck.
I have not tested this, but maybe you can put a function getSearchableContent() in the single pages controller like you do for blocks. This would return the string to be searched. Would look something like this:
function getSearchableContent() {
// ... compose searchstring depending on the queried content.
return $searchstring;
}
But I don't know if this works for dynamic content. If not, I'd look into C5's search index core classes and try to extend them for your project.

Best way to display list content coming from different site collections in a webpart?

I would like to build a webpart that is going to display list content coming from different site collections (but the same content type).
I wanted to use the SPSiteDataGuery to create the datasource of a SPGridView for instance, but it appears to be working only within the same site collection. Do you have any idea ?
The best performing solution would be to use the search object model. Look into FullTextSqlQuery. You can filter based on content type easily enough. The downside in using this technique is obviously there is a delay until the content is crawled. The performance over anything else however will be significantly better. Waldek has some great articles on comparing the alternatives.

Resources