Where does Solr store the search index, in a database or in files?

I am new to Solr. Can anybody tell me where it stores the index:
in the existing site database,
in a newly created database, or
in XML files?
Thanks in advance.
EDIT: I am asking this because I need a list of files to copy over to production. Suppose I index a site in the development environment; which files do I need to copy over to the production environment? We don't want to re-index the whole site again once it's live.
EDIT 2: Where can I find the index directory, i.e. which data folder stores it?

The Solr (and underlying Lucene) index is a specially designed data structure, stored on the file system as a set of index files. The index is designed with efficient data structures to maximize performance and minimize resource usage. You can inspect the Lucene index, which usually resides in the data/index folder.
Detailed info at http://lucene.apache.org/java/3_0_0/fileformats.html
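To make this concrete, here is a minimal sketch, assuming a recent Lucene core jar on the classpath; the solr/data/index path and the class name are assumptions (check the dataDir setting in your solrconfig.xml for the real location). It opens the on-disk index and lists the raw files that make it up; these files, copied as a consistent snapshot, are what you would move to production:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.FSDirectory;

    public class InspectIndex {
        public static void main(String[] args) throws Exception {
            // Default Solr core layout; adjust the path to your dataDir.
            try (FSDirectory dir = FSDirectory.open(Paths.get("solr/data/index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                System.out.println("Documents in index: " + reader.numDocs());
                // The raw segment files (segments_N, .fdt, .tim, ...) on disk.
                for (String file : dir.listAll()) {
                    System.out.println(file);
                }
            }
        }
    }

Note that copying is only safe when no writer is committing at the same time; Solr's index replication feature is the usual way to ship an index between environments.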

Related

Multiple indexers on same storage location in Lucene

I want to build a highly scalable application where I intend to use Lucene as my search engine library. While browsing through the docs and FAQs, I realized that it only allows one index writer to be open on a storage location, enforced by a write.lock file in the index directory. We can, however, open multiple IndexReaders on that index.
I am interested in building an architecture where a number of indexers run on different machines/servers and multiple searchers answer various types of queries on the indexes created by these indexers. Both searchers and indexers will be running on different computers.
In such a scenario it would be preferable to have multiple indexers use the same index storage location to index the documents. How can I achieve this? Should I go with something like NFS (Network File System)? Has this issue been taken care of by Solr or some other framework on top of Lucene? One obvious solution which comes to mind is to create one index per indexer and then ask the searchers to query across multiple index dirs. But this will lead to a large number of index dirs being created, as many as there are indexer servers, which I guess isn't very desirable. I want (# of index dirs) << (# of indexers) < (# of searchers).
What alternatives do I have in this case?
First of all: never use NFS with Lucene; it's simply slow and risky.
When it comes to scalability and high availability, I'd suggest you just let elasticsearch do all the hard work for you, so that you can concentrate on your data. You can of course have multiple threads indexing data.
If you want to know more about the distributed nature of elasticsearch, I'd suggest you have a look at this video.
Take a look at ElasticSearch and Solr Cloud.
Comparison of ElasticSearch and Solr.
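To make the single-writer constraint from the question concrete, here is a minimal sketch, assuming a recent Lucene (5.x or later) on the classpath; the index path is an assumption. The second IndexWriter cannot acquire write.lock while the first one is open:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.LockObtainFailedException;

    public class SingleWriterDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/shared-index"));
            IndexWriter first = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
            try {
                // Fails: write.lock is already held by the first writer.
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
            } catch (LockObtainFailedException expected) {
                System.out.println("Second writer refused: " + expected.getMessage());
            } finally {
                first.close();
                dir.close();
            }
        }
    }

This is exactly why sharing one index directory between indexer machines does not work: the lock is per directory, and distributed setups (SolrCloud, elasticsearch) shard the data across nodes instead.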

How should I load the contents of a .txt file to serve on a website?

I am trying to build excerpts for each document returned as a search result on my website. I am using the Sphinx search engine and the Apache web server on Linux CentOS. The function within the Sphinx API that I'd like to use is called BuildExcerpts. This function requires you to pass an array of strings where each string contains the document's contents.
I'm wondering what the best practice is for retrieving the document contents in real time as I serve the results on the web. Currently, these documents are in text files on my system, spread across multiple drives. There are roughly 100 million of them and they take up a few terabytes of space.
It's easy for me to call something like file_get_contents(), but that feels like the wrong way to do this. My databases are already gigantic (100GB+) and I don't particularly want to throw the document contents in there along with the document attributes that already exist. Perhaps this is the best way to do this, however.
Suggestions?
Well, the source needs to be fetched from somewhere. If you don't want to duplicate it in your database, then you will need to fetch it from the filesystem (using file_get_contents or similar).
The BuildExcerpts function does give you one extra option, though: "load_files". With it, Sphinx will read the data from the filename for you.
What problem are you experiencing with reading it from files? Is it too slow? If so, maybe put some caching in front, using memcache perhaps.
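A minimal sketch of that option, using the Java client bundled with the Sphinx distribution (api/java); the host, port, index name and file paths are hypothetical, and the exact BuildExcerpts signature should be checked against the client sources shipped with your Sphinx version:

    import java.util.HashMap;
    import java.util.Map;
    import org.sphx.api.SphinxClient;

    public class ExcerptDemo {
        public static void main(String[] args) throws Exception {
            // SphinxClient comes from the api/java directory of the Sphinx distribution.
            SphinxClient cl = new SphinxClient("localhost", 9312);

            Map<String, Object> opts = new HashMap<>();
            // With load_files set, each "document" is a path that searchd
            // reads itself, so this process never loads the file contents.
            opts.put("load_files", 1);

            String[] docs = { "/data/texts/doc1.txt", "/data/texts/doc2.txt" };
            String[] excerpts = cl.BuildExcerpts(docs, "myindex", "search terms", opts);
            for (String e : excerpts) {
                System.out.println(e);
            }
        }
    }

Note that with load_files the paths must be readable by the searchd process, not by the web server.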

SphinxSearch or a spider - which one to choose?

We own SiteA and SiteB; they share the same server and database, where we have full control.
SiteC, SiteD and SiteE are some of the sites we own as well, but they reside on different web hosts.
The goal is to create unified search functionality for all of the sites mentioned above. That is, if somebody searches for a term on SiteA, the results will automatically include matches from SiteB, SiteC, SiteD and SiteE too. The search results should be shown under the website they were found in.
All these websites' content is stored in their own databases.
If I use SphinxSearch to index the above sites, I would then require those sites we don't have complete control over to set up a web service where I can download a database dump or CSV file for indexing.
I'm not quite sure how a spider would come into play here, so I need your opinion.
Sphinx or a spider?
Thanks!
If you can ask the owners of the other websites to give you their content for free, then there is no need for a spider. Just use Sphinx to index the content. If you can't get the content directly from them, a spider is the only choice for you. There is little else to think about here.
Sphinx is a full-text search engine, while a spider fetches content from the internet. They are not replacements for each other. Even if you use a spider, you still have to use some full-text search engine such as Sphinx or Lucene/Solr.
So you have to make a decision first: do I want to use Sphinx for searching? If the answer is yes, then there is only one thing left: how can I index the contents for searching?
Sphinx supports using a database or XML as the data source. A database is the more popular data source because preparing and updating XML documents in a specific format is very tedious (compared to maintaining a database table). So I guess you will finally have to store all of the data in databases. As you described, all of the data is already in databases, but some of the databases are out of your control. For your own databases, there is no problem. For the databases that are out of your control, I suggest you use distributed Sphinx searching (see the configuration sketch after the quoted excerpt below): http://sphinxsearch.com/docs/2.0.6/distributed.html
The key idea is to horizontally partition (HP) searched data across search nodes and then process it in parallel. Partitioning is done manually. You should:
set up several instances of Sphinx programs (indexer and searchd) on different servers;
make the instances index (and search) different parts of data;
configure a special distributed index on some of the searchd instances;
and query this index.
This index only contains references to other local and remote indexes, so it cannot be directly reindexed; instead, you should reindex the indexes that it references.
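As a minimal sphinx.conf sketch of such a distributed index (all index names, hosts and ports here are hypothetical):

    index unified
    {
        type  = distributed

        # indexes served by this searchd (your own SiteA/SiteB databases)
        local = sitea_index
        local = siteb_index

        # remote searchd instances covering the sites on other hosts
        agent = sitec.example.com:9312:sitec_index
        agent = sited.example.com:9312:sited_index
    }

Querying unified fans the search out to the local indexes and the remote agents and merges the results; each instance still indexes and reindexes only its own part of the data.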

Searching for files in a single folder (knowing the prefix) versus searching for files across multiple folders (knowing the folder name)

I've got a system in which users select a couple of options and receive an image based on those options. I'm trying to combine multiple generated images (corresponding to those options) into the requested picture. I'm trying to optimize this so that if an image already exists for a certain option (i.e. the file exists), there's no need to compute it and we move on to the next step.
Should I store these images in different folders, where each folder name is an option name? Should I store them in the same folder, adding a prefix corresponding to the option to each image? Should I store the filenames in a database and check there? Which way is the fastest to check a file for existence?
I'm using PHP on Linux, but I'm also interested in whether the answer varies if I change the programming language or the OS.
If you're going to be producing a lot of these images, it doesn't seem very scalable to keep them all in one flat directory. I would go with a hierarchy, which will make them a lot easier to manage.
It's always going to be quicker to check in a database than to check if a file exists, though; so if speed is the primary concern, use a hierarchical folder structure and keep all the filenames in a database.
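For illustration, a minimal sketch of the existence check against a hierarchical layout, in Java; the folder layout and names are hypothetical, and in PHP the equivalent check is file_exists() on the same path:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ImageCache {
        // Hypothetical layout: one folder per option, file named by request id.
        private static final Path ROOT = Paths.get("/var/images");

        static Path pathFor(String option, String requestId) {
            return ROOT.resolve(option).resolve(requestId + ".png");
        }

        static boolean alreadyGenerated(String option, String requestId) {
            // A single stat() of one known hierarchical path; no directory scan.
            return Files.exists(pathFor(option, requestId));
        }
    }

Because the full path is known up front, the check costs one stat call regardless of layout, so the folder-per-option hierarchy adds manageability without slowing the lookup.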

Using Lucene like a relational database

I am just wondering if we can achieve some RDBMS capabilities in Lucene.
Example:
1) I have 10,000 project documents (PDF files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search the contents of the PDF files for a given keyword, but while displaying the results I want to display the project metadata as mentioned in point (2).
My idea is to associate a field called projectId with each PDF file while indexing. Once we have that, we can fire a second search to fetch the project metadata.
This way we avoid duplicating data. Also, if we want to update the project metadata, we end up updating it in a SINGLE PLACE only. Otherwise, if we stored this metadata with every PDF document's index entry, we would end up updating all of the documents, which is not what I am looking for.
Please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project metadata? Yes, you can, but I do not know if this is a good idea. It depends on the frequency of your metadata updates and your access pattern. If the metadata is relatively static and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and the project metadata at the same time, for example "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene which refers to a database row is a fine thing to do.
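A minimal sketch of such a document, assuming Lucene 4.x or later; the class and field names are illustrative. The PDF text is indexed for search but not stored, while the project id is stored verbatim so each hit can be joined back to a database row:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class ProjectDocFactory {
        // pdfText would come from a PDF extractor; projectId from your database.
        public static Document build(String pdfText, String projectId) {
            Document doc = new Document();
            // Tokenized and searchable, but not stored in the index.
            doc.add(new TextField("content", pdfText, Field.Store.NO));
            // Not tokenized, stored verbatim: a foreign key to the project table.
            doc.add(new StringField("projectId", projectId, Field.Store.YES));
            return doc;
        }
    }

At query time you read doc.get("projectId") from each hit and fetch the metadata with a single database lookup.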
I am not sure about your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a full-text search engine like Lucene. The metadata could live in the database, maybe together with the original PDF documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware of the fact that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way:
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.
