Is it possible to index external files with Solr? - search

Google Custom Search is being shut down, with service ending around April 2018 (don't quote me on that).
In light of this, I've been attempting to move our Drupal site's search to a new search engine, namely Apache Solr.
Our Drupal site hosts tons of files, from PDFs and images to JSON and XML files.
I haven't had any trouble indexing these files since they're stored locally on the same machine that hosts the Drupal site, but we have a bunch of external files that I used to have no problem searching with GCSE.
I want to be able to index external files and be able to search/query them with Solr just like I was able to search them with GCSE.
Is this possible? I'm sort of a noobie and have been following step-by-step guides up until now in order to get Solr search up and running on our site.
If anyone has any idea on how to search and query external files with Apache Solr, I'd be grateful.

Yes, it's possible to index external files in Apache Solr, and there are plenty of tutorials on how to do it.
I recommend looking through the Solr reference guide, mainly the material under Indexing and Basic Data Operations. Pay particular attention to Uploading Data with Index Handlers, which covers indexing XML/XSLT, JSON and CSV data, and to Uploading Data with Solr Cell using Apache Tika, which explains how to index PPT, XLS, PDF and other more complex formats.
On the querying side, start with the guidelines under Searching. If you run into trouble, feel free to ask additional questions here.
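As a rough illustration of the Solr Cell route for a file that lives on another server, here is a minimal Python sketch. It assumes Solr is reachable at http://localhost:8983/solr, that the core is called mycore (a made-up name), and that the /update/extract handler is enabled for that core. The function names are my own, and the file's URL is simply reused as the document id.

```python
import urllib.parse
import urllib.request


def build_extract_request(solr_base, core, doc_id, file_bytes, content_type):
    """Build (but do not send) a POST to Solr Cell's /update/extract handler."""
    params = urllib.parse.urlencode({
        "literal.id": doc_id,  # value stored in the document's unique key field
        "commit": "true",      # make the document searchable right away
    })
    url = f"{solr_base}/{core}/update/extract?{params}"
    return urllib.request.Request(
        url,
        data=file_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_external_file(solr_base, core, file_url):
    """Download a remote file, then hand its bytes to Solr for text extraction."""
    with urllib.request.urlopen(file_url) as resp:
        body = resp.read()
        content_type = resp.headers.get_content_type()
    # Assumption: the remote URL is a good enough unique id for the document.
    req = build_extract_request(solr_base, core, file_url, body, content_type)
    with urllib.request.urlopen(req) as solr_resp:
        return solr_resp.status
```

Calling index_external_file("http://localhost:8983/solr", "mycore", some_file_url) for each external URL would be the simplest loop; a crawler or a scheduled job could supply the list of URLs.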

Related

Using Azure Search to index and search an Orchard CMS site

I am working on an Orchard CMS system that is hosted in Azure. However, using the inbuilt Lucene search, it has proved difficult to implement a search algorithm that filters out documents that are links to files (e.g. PDFs/images), as well as documents that do not belong to certain taxonomies or are not associated with a certain lat/long square or date/time of occurrence. To get an idea of the data that I am dealing with, the website is https://ahdb.org.uk/. Consequently, I am looking into implementing Azure Search to index and provide the search functionality for the site. For reference, the installed version of Orchard is 1.10.1.0.
I have searched the web to the best of my ability and there seems to be nothing out there.
Graham Harris
While there's no direct integration of Orchard with Azure Cognitive Search, it should still be possible with a little work. It looks like you have custom rules about what you need to index. You might need to create a custom database view that normalizes the data and is specific about your use case, and then feed that into the Azure Search pipeline. The Orchard 1.x schema is very relational, and will require some understanding of how parts and content items are related, as well as how versioning is implemented. A good way to do that is to install the miniprofiler module and look at some of the queries being generated by Orchard itself as it's doing similar tasks (such as a projection of data that looks like what you want to feed into search).
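To make the "feed that into the Azure Search pipeline" step a bit more concrete, here is a small Python sketch that turns rows from such a normalized view into the JSON body used when pushing documents to an index (the body posted to the index's docs/index endpoint). Every column and field name below (content_item_id, taxonomy_terms, is_file_link, occurred_at, location, and so on) is an assumption for illustration; the real names depend on your view and your index definition.

```python
import json


def to_search_action(row):
    """Map one row of a (hypothetical) normalized Orchard view to an upload action."""
    return {
        "@search.action": "upload",
        "id": str(row["content_item_id"]),
        "title": row["title"],
        "taxonomy_terms": row["taxonomy_terms"],  # list of term names, for filtering
        "is_file_link": row["is_file_link"],      # lets queries exclude PDF/image links
        "occurred_at": row["occurred_at"],        # ISO 8601 date/time string
        # Geography points are sent as GeoJSON: coordinates are [longitude, latitude].
        "location": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
    }


def build_index_batch(rows):
    """Return the JSON request body for a batch of document uploads."""
    return json.dumps({"value": [to_search_action(r) for r in rows]})
```

With fields like these marked filterable in the index, the taxonomy, file-link, date and lat/long restrictions become query-time filters rather than custom search code.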

Sharepoint to replace a fileshare

Is Sharepoint my best option to replace an aging network of fileshares? There's approx 1TB of data residing among 3 fileshares (1 DFS, 2 NAS boxes). A document management system is in place for new things - the file shares are now just read-only archives/legacy. Our users would simply need to be able to search for and open the documents.
Users are finding it difficult to locate their documents in the file shares and windows search does not often help. Sharepoint was suggested as something which would play nicely with Office documents (99% of the content) and have a good search facility.
Not being a Sharepoint Developer or having had any training on it, I'm getting a little lost. I have set up a test server to try it out using SP2013. I have managed to index each of my file shares and have created a search page. However, results aren't consistent with the indexed items. I assume I need to somehow get the relevant metadata from the files but I have no idea how to go about this.
Could anyone suggest some resources for help on this subject (my searches have mainly turned up paid-for Sharepoint addons or outdated blogs) and any experience of doing something similar? Also happy for any suggestions on ways to achieve this using other software/platforms.
I went with Microsoft Search Server 2010 in the end.
Sharepoint is basically optimized to be a document manager. I don't think you need to buy or download addons.
For your problem, metadata is the key! You need to specify the metadata properly.
Here is the theory behind planning document management in SharePoint 2013:
https://technet.microsoft.com/en-us/library/cc263266.aspx
A nice introduction to metadata:
http://fr.slideshare.net/gzelfond/document-management-in-sharepoint-without-folders-introduction-to-metadata
Be careful about starting with the Microsoft documentation. In my experience it's difficult to begin with, because it covers many things at once. There are also good books/ebooks that are easy to find and probably simpler than the MS documentation.

Intranet search engine frontend?

We are currently using a number of open source and commercial products to store different types of information (in our internal network). All these products come with their own repositories (usually a database) and their own search capabilities.
Currently the list of products is as follows:
Wordpress
Jira
Confluence
Sharepoint
Dynamics AX
Moodle
The problem we are facing is that when one needs to search for information, one needs to login into all these different systems and execute a search on each one.
I Googled for "search engine frontend", "meta search engine", etc. but I was not able to find something obvious that solves our problem. At this point, I have to say that we are not interested in building one "central repository" to be searched. Instead, we need a frontend that will accept the query from the user, "package it" into the format that each of the individual search engines understands, receive the response (JSON or XML) and present it to the user.
Any suggestions on how we could solve it?
Your strategy is right: If you are not interested in building a central index, you will need an application that accepts the query from the user, converts it to the format that each of the individual search engines understand, receives the responses and presents them to the user. This is exactly what a meta search engine does. Even if you use a framework (e.g. Carrot2), much work will probably remain to write those query and result transformers, and you will probably experience slow results because the meta search can never be faster than the underlying search modules of the components you search through.
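For a feel of what such a meta search layer looks like, here is a minimal Python sketch of the fan-out-and-merge step. The two adapter functions are stand-ins I made up; in a real setup each would translate the query into one product's search API call and map the JSON/XML response into a common result shape.

```python
from concurrent.futures import ThreadPoolExecutor


def search_wordpress(query):
    """Hypothetical adapter: would call the WordPress search API and normalize hits."""
    return [{"source": "wordpress", "title": f"Post about {query}", "url": "https://wp.example/1"}]


def search_jira(query):
    """Hypothetical adapter: would call the Jira search API and normalize hits."""
    return [{"source": "jira", "title": f"Ticket about {query}", "url": "https://jira.example/2"}]


def federated_search(query, backends):
    """Send the query to every backend in parallel and merge the normalized results."""
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in backends.items()}
        for name, future in futures.items():
            try:
                results.extend(future.result())
            except Exception:
                # A failing backend is skipped so the others can still be shown.
                continue
    return results


if __name__ == "__main__":
    hits = federated_search("invoice", {"wordpress": search_wordpress, "jira": search_jira})
    for hit in hits:
        print(hit["source"], hit["title"], hit["url"])
```

Running the backends in parallel means the total wait is roughly that of the slowest backend rather than the sum of all of them, but as noted above it can never be faster than that slowest one.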
Instead of querying each backend separately, you can put your data into one backend.
You could export your data to an Apache Solr server and use a frontend like CorePages, http://www.corepages.biz . You could add a backlink to your data so you can jump directly from a search result to the original entry, e.g. a Jira ticket or a wiki article.
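A sketch of what the exported documents could look like, in Python. The field names rely on the *_s and *_t dynamic fields that ship with Solr's default configset, which is an assumption about your schema, and the id scheme (source name plus item id) is just one possible convention.

```python
import json


def to_solr_doc(source, item_id, title, text, url):
    """Build one Solr document that carries a backlink to the original entry."""
    return {
        "id": f"{source}:{item_id}",  # unique across all exported systems
        "source_s": source,           # e.g. "jira" or "confluence", usable as a facet
        "title_t": title,
        "text_t": text,
        "url_s": url,                 # backlink shown with each search result
    }


def build_update_body(docs):
    """JSON array accepted by Solr's /update handler with Content-Type application/json."""
    return json.dumps(docs)
```

The resulting body would be posted to the core's /update handler (with commit=true or a later commit), and the frontend only needs to render url_s as the link for each hit.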

Can someone point me the best way to integrate/connect sharepoint to TYPO3?

I'm currently working on a project requesting to integrate or connect Sharepoint to TYPO3.
Share point will somehow replace the fileadmin of TYPO3.
So what I mean by "integrate" or "connect" is the following points:
To display lists of documents from sharepoint on TYPO3 pages through the TYPO3 BE by using some tag or category. In short accessing sharepoints document in the TYPO3 BE.
To be able through TYPO3 to search documents from Sharepoint. And to filter them by type or category. And of course to display the results.
I found some references on the web.
The obvious one was the sharepoint connector SPTools of TYPOTYCOON but it seems dead as there are no fresh news on the website and no activity on the twitter account.
I also found two extensions on the TER (WSS/MOSS Reader and WSS/MOSS Writer), last uploaded in December 2010. Surely outdated. Has anyone ever used them? Any feedback?
I found also some references about CMIS and the TYMIS extension but couldn't find it on the TER.
That's why I come to you, hoping you have some solution, useful feedback or lead at least...
TYPO3 6.0 introduced the new File Abstraction Layer (FAL) [1]. This gives you the possibility to split the file storage from the files used in TYPO3. As a result, fileadmin might contain any number of virtual mount points of any supported storage. Multiple FAL drivers (local, WebDAV) come preinstalled and there is an Amazon S3 driver at [2]. I am not aware of any FAL driver development for Sharepoint. So this might be up to you to resolve, but these hints should get you started.
Links:
[1] http://docs.typo3.org/typo3cms/FileAbstractionLayerReference/
[2] http://git.typo3.org/TYPO3v4/Extensions/fal_amazons3.git

Are there libraries available that create an image search tool with Apache Solr?

I'm considering using apache solr as the search backbone on my site.
Along with standard text based search of documents, I want to create an image search tool.
Are there any libraries that allow solr to create a search index of images in crawled websites?
Are there other, better, options available?
As @Mikos mentioned in the comments, what I was looking for is a Content Based Image Retrieval (CBIR) system. One exists for use with Apache Solr, and it's called LIRe.
Here is a list of CBIR systems on Wikipedia.
Here is the tool that I've been investigating for use with a Linux based image search system I'm working on -- The GNU Image Finding Tool - GIFT. This is open source and seems to be the right tool for the job when you already have the database of images.
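To show what "content based" means in practice, here is a toy Python sketch that ranks images by color similarity. This is not LIRe's or GIFT's API, just a simplified stand-in for the kind of feature those tools compute; images are assumed to be already decoded into lists of (r, g, b) tuples.

```python
def color_histogram(pixels, bins=4):
    """Quantize RGB pixels into bins**3 buckets and return normalized counts."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels) or 1
    return [count / total for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_similarity(query_pixels, library):
    """Return image names from the library, most similar to the query first."""
    query_hist = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(pixels)), name)
        for name, pixels in library.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]
```

A real system would store the feature vectors in the index at crawl time and compare against them at query time instead of recomputing histograms for every search.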
