Full-text indexing an archived file

Full-text indexing an archived file - zip

Greetings,
In short, I have to find out whether I can index zipped .rtf files via an IFilter under SQL Server 2008 Express for full-text search.
Long version:
This question is mostly a theoretical one: I'm neither experienced nor knowledgeable enough to find out on my own whether such a thing is possible.
The problem is as follows. There's a size-limited SQL Server Express 2008 R2 database that's going to store large .rtf files, probably 2-10k of them, and index them for full-text search. They probably won't fit within the 10 GB limit, so I'm wondering whether they could be archived (zipped, for instance) and stored that way. Full-text search should still be doable on them in their zipped state.
My thought was to try to chain IFilters in some way to achieve this (I've no idea if that's doable), or there could be a different solution that I'm not seeing at the moment. I'd appreciate any input, as I'm kind of at a loss.

You may have a much easier time using something like Lucene. You could extract the text from the files and index it.
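To illustrate the "extract the text, then index it" idea, here is a minimal sketch in plain Python rather than Lucene. It reads the members of a zip archive, strips RTF markup with a deliberately crude regex, and builds a word-to-file map. The function names are made up for this example, and the RTF stripping assumes very simple documents (no font tables, images, or escaped characters); a real setup would use a proper RTF parser and a real search library.

```python
import io
import re
import zipfile
from collections import defaultdict


def crude_rtf_to_text(rtf: str) -> str:
    """Very rough RTF stripper (assumption: simple RTF only)."""
    # Drop control words such as \par, \b0 or \fs24 (plus one trailing space).
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", " ", rtf)
    # Drop the group braces that structure an RTF document.
    return re.sub(r"[{}]", " ", text)


def index_zip(zip_bytes: bytes) -> dict:
    """Map each lowercase word to the set of archive members containing it."""
    index = defaultdict(set)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            raw = archive.read(name).decode("latin-1")
            words = re.findall(r"[a-z0-9]+", crude_rtf_to_text(raw).lower())
            for word in words:
                index[word].add(name)
    return index
```

With this approach the files stay compressed in storage, and only the extracted text goes into the index.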

Related

Using Apache Solr with 'metadata' in Excel and files in DropBox

First of all, apologies for what might seem like an 'amateur' scenario & question...
Situation
I have many, many documents (100,000) that I need users to be able to search and browse via a web application we are building
This search functionality is just 1 of several functions
I currently have around a dozen Excel spreadsheets that contain the 'metadata' (title, date, author, source, country etc.) or document information
Each of the 100,000 'records' (or Excel rows) has a unique identifier
The actual files (majority PDF but some Word & Excel) are stored in Dropbox using the corresponding unique identifier as the file name
Questions
Is Apache Solr the best tool to use in order to provide the search functionality?
What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
What is the best method to migrate from Excel/Dropbox to the proposed Apache Solr solution?
I very much appreciate any assistance as I have just been getting many different answers from paid consultants.
Regards
Mark

Your questions:
Q: Is Apache Solr the best tool to use in order to provide the search functionality?
A: In my opinion, Solr is an awesome option for things like this. However, as you've discovered, there's "some assembly required" (and that's putting it mildly).
Q: What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
A: If it were me, I'd use the filesystem. I think it's the easiest to debug.
Also, if it were me, I'd export the Excel sheets to CSV, I think it's a bit easier to work with then. However, Solr does include open source Tika filters, which do support Excel, but they won't treat your multiple-records as multiple-documents; Tika will make each Excel sheet into just one document each.
Q: What is the best method to migrate from Excel/Dropbox to the proposed Apache Solr solution?
A: I'm a fan of the local filesystem. Dropbox does let you mirror your Dropbox files in a local directory. And as I said before, if you can get the Excel sheets exported to CSV in some automated or "macro" way, I think it'd make your life easier too. For example, Python can read and write CSV files and is a great tool for massaging data into its final form.
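As a rough sketch of that massaging step, assuming the sheets have been exported to CSV with a header row: the code below loads the rows keyed by their unique identifier and pairs each one with the file whose base name matches. The column name `id` and both function names are hypothetical; adjust them to whatever the real exports contain.

```python
import csv
import io


def load_metadata(csv_file, id_column="id"):
    """Return {identifier: row dict} from an open CSV file with a header row."""
    records = {}
    for row in csv.DictReader(csv_file):
        records[row[id_column]] = row
    return records


def attach_files(records, filenames):
    """Pair each record with the file whose base name equals its identifier."""
    by_stem = {name.rsplit(".", 1)[0]: name for name in filenames}
    return {
        rid: {**row, "file": by_stem.get(rid)}  # None if no matching file exists
        for rid, row in records.items()
    }


# Example with an in-memory CSV standing in for one exported sheet.
sample = io.StringIO("id,title,country\n001,Annual report,FR\n002,Budget,DE\n")
merged = attach_files(load_metadata(sample), ["001.pdf", "002.docx"])
```

Each merged record then carries both the metadata and the path of the file to hand to the indexer.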
If you don't mind commercial solutions, you might consider Lucidworks Fusion; it includes a bunch of connectors, including a filesystem datasource connector. Disclaimer: I happen to work for Lucid, which is why I'm listing this suggestion last, after the "free" answers. I'd mention it anyway, even if I didn't work there.

How to speed up a search on large collection of text files (1TB)

I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis etc). This data goes back for at least 30 years so as you can imagine I have quite a large sized data set. In total I have around 20,000 text files totalling approx. 1TB.
Periodically I will need to search these files for occurrences of a particular string (not regex). What is the quickest way to search through this data?
I have tried using grep and recursively searching through the directory as follows:
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files
The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.
Is there a quicker way to search through this data? At this moment I am open to different approaches such as databases, elasticsearch etc. If I do go down the database route, I will have approx. 1 billion records.
My only requirements are:
1) The search will be happening on my local computer (Dual-Core CPU and 8GB RAM)
2) I will be searching for strings (not regex).
3) I will need to see all occurrences of the search string and the file each one was in.

There are a lot of answers already, I just wanted to add my two cents:
Having this much data (1 TB) with just 8 GB of memory will not be good enough for any approach if you want faster search, be it Lucene, Elasticsearch (which uses Lucene internally), or some grep command. The reason is simple: all these systems hold data in the fastest memory available in order to serve it quickly, and out of 8 GB (you should reserve 25% for the OS and at least another 25-50% for other applications) you are left with very few GB of RAM.
Upgrading to an SSD and increasing the RAM on your system will help, but it's quite cumbersome, and if you hit performance issues again it will be difficult to scale your system vertically.
Suggestion
I know you mentioned that you want to do this on your own system, but as I said it wouldn't give any real benefit, and you might end up wasting a lot of time, both on infrastructure and on code (there are many approaches in the various answers). Hence I would suggest the top-down approach mentioned in another answer of mine for determining the right capacity. It would help you quickly identify the correct capacity for whatever approach you choose.
As for the implementation, I would suggest Elasticsearch (ES), as it's very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier, and scale quickly later on. Although I am not a big fan of AWS ES, it saves a lot of setup time, and you can get started quickly if you are familiar with ES.
To make search faster, you can split each file into multiple fields (title, body, tags, author etc.) and index only the important fields, which reduces the inverted index size. If you are looking only for exact string matches (no partial or full-text search), you can simply use the keyword field type, which is even faster to index and search.
I could go on about why Elasticsearch is good and how to optimize it, but that's not the crux. The bottom line is that any search needs a significant amount of memory, CPU, and disk, and any one of them becoming a bottleneck would hamper both search and other applications on your local system. Hence I advise you to really consider doing this on an external system. Elasticsearch stands out because it is meant for distributed systems and is the most popular open-source search system today.

You clearly need an index, as almost every answer has suggested. You could totally improve your hardware but since you have said that it is fixed, I won’t elaborate on that.
I have a few relevant pointers for you:
Index only the fields in which you want to find the search term rather than indexing the entire dataset;
Create a multilevel index (i.e. an index over the index) so that your index searches are quicker. This will be especially relevant if your index grows to more than 8 GB;
I wanted to recommend caching of your searches as an alternative, but this will cause a new search to again take half a day. So preprocessing your data to build an index is clearly better than processing the data as the query comes.
Minor Update:
A lot of answers here suggest that you put the data in the cloud. I'd highly recommend, even for anonymized medical data, that you confirm with the source (unless you scraped the data from the web) that it is OK to do so.

To speed up your searches you need an inverted index. To be able to add new documents without the need to re-index all existing files the index should be incremental.
One of the first open source projects that introduced incremental indexing is Apache Lucene. It is still the most widely used indexing and search engine, although other tools that extend its functionality are more popular nowadays. Elasticsearch and Solr are both based on Lucene. But as long as you don't need a web frontend, support for analytical querying, filtering, grouping, support for indexing non-text files, or an infrastructure for a cluster setup over multiple hosts, Lucene is still the best choice.
Apache Lucene is a Java library, but it ships with a fully functional, command-line-based demo application. This basic demo should already provide all the functionality that you need.
With some Java knowledge it would also be easy to adapt the application to your needs. You will be surprised how simple the source code of the demo application is. If Java isn't your language of choice, its Python wrapper, PyLucene, may also be an alternative. The indexing of the demo application is already reduced nearly to the minimum. By default, no advanced functionality such as stemming or optimization for complex queries is used. You will most likely not need these features for your use case, and they would increase the size of the index and the indexing time.
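To make the terms concrete, here is a toy in-memory inverted index in Python. It is not Lucene and not how Lucene stores its data; it is only a sketch of the idea, with made-up class and method names. Each term maps to the (file, line number) pairs where it occurs, and new documents can be added at any time without touching existing entries, which is what "incremental" means here.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> list of (file name, line number)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add_document(self, name, text):
        # Adding a document only appends postings; nothing is re-indexed.
        for line_no, line in enumerate(text.splitlines(), start=1):
            for term in set(re.findall(r"\w+", line.lower())):
                self.postings[term].append((name, line_no))

    def search(self, term):
        # A lookup is a single dictionary access, not a scan of the files.
        return list(self.postings.get(term.lower(), []))


index = InvertedIndex()
index.add_document("a.txt", "Fever and cough\nno fever")
index.add_document("b.txt", "Cough only")
hits = index.search("fever")  # every (file, line) where the term occurs
```

For 1 TB of data the postings would have to live on disk rather than in a Python dict, which is the part a library like Lucene takes care of.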

I see 3 options for you.
1. You should really consider upgrading your hardware; an HDD -> SSD upgrade can make the search several times faster.
2. Increase the speed of your search on the spot.
You can refer to this question for various recommendations. The main idea of this method is to optimize CPU load, but you will be limited by your HDD speed. The maximum speed multiplier is the number of your cores.
3. You can index your dataset.
Because you're working with text, you would need a full-text search database. Elasticsearch and Postgres are good options.
This method requires more disk space (but usually less than 2x, depending on the data structure and the list of fields you want to index).
This method will be far faster (searches take seconds).
If you decide to use this method, select the analyzer configuration carefully to match what is considered a single word for your task (here is an example for Elasticsearch).

Worth covering the topic at two levels: the approach, and the specific software to use.
Approach:
Based on the way you describe the data, it looks like pre-indexing will help significantly. Pre-indexing performs a one-time scan of the data and builds a compact index that makes it possible to run quick searches and identify where specific terms appear in the repository.
Depending on the queries, the index will reduce or completely eliminate the need to search through the actual documents, even for complex queries like 'find all documents where AAA and BBB appear together'.
Specific Tool
The hardware that you describe is relatively basic. Running complex searches will benefit from large-memory, multi-core hardware. There are excellent solutions out there: Elasticsearch, Solr and similar tools can do magic, given strong hardware to support them.
I believe you want to look into two options, depending on your skills and the data (it would help if the OP could share a sample of the data).
* Build your own index, using a lightweight database (SQLite, PostgreSQL), OR
* Use a lightweight search engine.
For the second approach, given the hardware described, I would recommend looking into 'glimpse' (and the supporting agrep utility). Glimpse provides a way to pre-index the data, which makes searches extremely fast. I've used it on a big data repository (a few GB, but never TB).
See: https://github.com/gvelez17/glimpse
Clearly, it is not as modern and feature-rich as Elasticsearch, but it is much easier to set up. It is server-less. The main benefit for the use case described by the OP is the ability to scan existing files without having to load the documents into an extra search engine repository.
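For the first option (building your own index in a lightweight database), a minimal sketch with Python's built-in sqlite3 module could look like the following. The table layout, the tokenizer, and the function names are assumptions made for illustration. The index lives in the database file rather than in RAM, so it is not bound by the 8 GB of memory.

```python
import re
import sqlite3


def build_index(conn, documents):
    """Store one (term, file, line) row per distinct term on each line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings (term TEXT, file TEXT, line INTEGER)"
    )
    # The B-tree index on term is what makes lookups fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_term ON postings (term)")
    rows = (
        (term, name, line_no)
        for name, text in documents
        for line_no, line in enumerate(text.splitlines(), start=1)
        for term in set(re.findall(r"\w+", line.lower()))
    )
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    conn.commit()


def lookup(conn, term):
    """Return every (file, line) pair where the term occurs."""
    cur = conn.execute(
        "SELECT file, line FROM postings WHERE term = ? ORDER BY file, line",
        (term.lower(),),
    )
    return cur.fetchall()


# Use a file path instead of ":memory:" to keep the index on disk.
conn = sqlite3.connect(":memory:")
build_index(conn, [("a.txt", "fever cough\nfever"), ("b.txt", "cough")])
hits = lookup(conn, "fever")
```

This only covers single-word lookups; multi-word strings would need either a phrase check against the matching lines or a full-text extension.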

Could you ingest all this data into Elasticsearch, if the files have a consistent data structure?
If yes, below are the quick steps:
1. Install Filebeat on your local computer.
2. Install Elasticsearch and Kibana as well.
3. Export the data by making Filebeat send all of it to Elasticsearch.
4. Start searching it easily from Kibana.

FSCrawler might help you index the data into Elasticsearch. After that, normal Elasticsearch queries can be your search engine.

I think caching the most recently searched medical data might help performance-wise. Instead of going through the whole 1 TB, you can use Redis/Memcached.

How should I load the contents of a .txt file to serve on a website?

I am trying to build excerpts for each document returned as a search result on my website. I am using the Sphinx search engine and the Apache web server on Linux CentOS. The function within the Sphinx API that I'd like to use is called BuildExcerpts. This function requires you to pass an array of strings, where each string contains a document's contents.
I'm wondering what the best practice is for retrieving the document contents in real time as I serve the results on the web. Currently, these documents are in text files on my system, spread across multiple drives. There are roughly 100MM of them and they take up a few terabytes of space.
It's easy for me to call something like file_get_contents(), but that feels like the wrong way to do this. My databases are already gigantic ( 100GB+ ) and I don't particularly want to throw the document contents in there along with the document attributes that already exist. Perhaps this is the best way to do this, however.
Suggestions?

Well, the source needs to be fetched from somewhere. If you don't want to duplicate it in your database, then you will need to fetch it from the filesystem (using file_get_contents or similar).
That said, the BuildExcerpts function does give you one extra option, "load_files"
... then Sphinx will read the data from the filename for you.
What problem are you experiencing with reading it from files? Is it too slow? If so maybe use some caching in front - using memcache maybe.

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in Lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we have the search results, we will fire a second search to get the project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data, we will end up updating it in a SINGLE PLACE only. Otherwise, if we store this meta data with every pdf document in the index, we will end up updating all of the documents, which is not what I am looking for.
Please advise.

If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
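A rough sketch of the "project id as a primary key to a database table" variant, in Python with sqlite3 standing in for whatever database is actually used. It assumes the full-text search has already returned (document, project id) pairs; the table name, column names, and function name are hypothetical.

```python
import sqlite3


def attach_project_metadata(conn, hits):
    """hits: list of (document_name, project_id) pairs from the search index."""
    ids = sorted({project_id for _, project_id in hits})
    if not ids:
        return []
    # One query for all distinct projects instead of one query per hit.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name, location FROM projects WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    meta = {row[0]: {"name": row[1], "location": row[2]} for row in rows}
    return [(doc, meta.get(project_id)) for doc, project_id in hits]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?)",
    [(1, "Bridge", "Oslo"), (2, "Tunnel", "Bern")],
)
results = attach_project_metadata(conn, [("a.pdf", 1), ("b.pdf", 1), ("c.pdf", 2)])
```

Since the meta data lives in one row per project, an update touches a single place and the indexed documents stay untouched.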

Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time, for example "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene that refers to a database row seems like a fine thing to do.

I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.

This is definitely possible. But always be aware that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.

You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.

What is the best search approach?

I'm using Lucene in my project.
Here is my question:
should I use Lucene to replace the whole search module, which has been implemented in SQL using a large number of LIKE statements plus exact searches by id or something similar,
or should I just use Lucene for fuzzy search (I mean full-text search)?

You should probably use Lucene, unless the SQL search is very performant.
We are right now moving to Solr (based on Lucene) because our search queries are inherently slow and cannot be sped up with our database.... If you have reasonably large tables, your search queries will start to get really slow unless the DB has some kind of highly optimized free-text search mechanism.
Thus, let Lucene do what it does best....

I don't think overusing LIKE statements is a good idea.
And I believe Lucene will perform better than the database.

I'm actually very impressed by Solr. At work we were looking for a replacement for our Google Mini (it's woefully inadequate for any serious site search) and were expecting something that would take a while to implement. Within 30 minutes of installing Solr we had done what we had expected to take at least a few days, and it provided us with a far more powerful search interface than we had before.
You could probably use Solr to do quite a lot of clever things beyond a simple site search.
