Derby: Full Text Search - search

I have a gigantic data of more than 2500000000 records distributed among 10 tables in derby. There are two columns "floraNfauna" and "locations" common in each table. Now I have to find a particular "floraNfauna" found at particular "locations", so I use "select" query with "like" e.g. "select * from tables where floraNfauna like('%fish%') and locations like('%shallow water bodies%')"; and it takes days to finally fetch the results which count below 1000 sometimes. After searching I found that "full text search" would be the best and faster approach to this. Can you help me with an example?

Derby integrates nicely with Lucene, which is a full-text search engine.
Read more about that here: http://wiki.apache.org/db-derby/LuceneIntegration

Firstly you must consider indexing your table. Here is an SO link which definitely would help to know more about Why to index a DB table.
More about Adding Indexes to a table.
Secondly, if you are using a centralized database, then definitely consider upgrading your server hardware configuration.
Thanks, hope it helps.

Related

Query on all columns cassandra

I have close to six tables, each of them have from 20 to 60 columns in Cassandra. I am designing the schema for this database.
The requirement from the query is that all the columns must be queriable individually.
I know if the data has High-Cardinality using secondary indexes is not encouraged.
Materialized views will solve my purpose to an extent where I will be able to query on other columns as well.
My question is :
In this scenario, if each table has 30 to 50+ materialized views, is this an okay pattern to follow or is it going on a totally wrong track. Is it taking this functionality to its extreme. Maybe writes will start to become expensive on the system (I know they are written eventually and not with the immediate write to the actual table).
You definitely do not want 30 to 50 materialized views.
It sounds like the use case you're trying to satisfy is search, more so than a specific query.
If the queries that are going to be done on each column can be pre defined, then you can also go the denormalization route, trading flexibility of search for better performance and less operational overhead.
If you're interested in the search route, here's what I suggest you take a look at:
SASI Indexes (depending on Cassandra version you're using)
Elastic Search
Solr
DataStax Enterprise Search (disclaimer I work for DataStax)
Elassandra
Stratio
Those are just the ones I know off the top of my head. There may be others (Sorry if I missed you). I provided links to each so you can make your own informed decision as to which makes more sense for your use case.

full text search in databases

I have two fairly general question about full text search in a database. I was looking into elastic search and solr and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into whoosh search, which does index table columns and the result of whoosh are actual table rows.
When using solr or elastic search, should I put the row id into the document which gets searched and after I have my result use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have is if I have a id like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with a FTS? It seems to me there is not much to be gained by indexing? Or am I wrong?
thanks
Elasticsearch can store the indexed document, and you can retrieve it as a part of query result. Usually ppl still store the original data in an usual DB, it gives you more reliability and flexibility on reindexing. Mind that ES indexes non-relational data. You can have you data stored in relational manner and compose denormalized documents for indexing.
As for "abc/123.64664" you can index it as tokenized string or you can tune the index for prefix search etc. It's up to you
(TL;DR) Don't think about what your data is structured in your RDBS. Think about what you are searching.
Content storage for good full text search is quite different from relational database standard storage. So, your data going into Search Engine can end up looking quite differently from the way you stored it.
This is all driven by your expected search results. You may increase granularity of the data or - opposite - denormalize it so the parent/related record content shows up in the records you actually want returned as part of search. Text processing (copyField, tokenization, pre-processing, etc) is also where a lot of content modifications happen to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to only use search engine to get the right - relevant - IDs out and then merge them in the client code with the details from the original database records.

Fastest way to search a SQL Server table (or indexed view) column with "like '%search%'"?

Suppose there's a table with columns (UserID, FieldID, Value), with half a million records. I want to see if some search term T(N) occurs anywhere in each Value (i.e. Value.Contains( T(N) ) ).
I think I'm just hitting a wall volume wise, just too many values to sift through. I don't think a Full Text index will help, because it's only useful for StartsWith queries that look at individual words, not occurrences anywhere within the string at all.
Is there a good approach to indexing this kind of data for such a search in SQL Server?
A half-million records is not terribly large, although I don't know the size of the field contents. A couple of ideas - this was too long for a comment or else I may have posted as such.
You could implement a full-text search engine like Elastic, Solr, etc and use it as a sidecar. If when you are doing text searches, you are not otherwise making much use of the other data, this might be easy enough. Note that you could put other data for searching into Elastic or Solr, but I'm not sure if you'd want to duplicate all your data, and those tools aren't really great for a transactional data store.
Another option for volumes this small, assuming you only need basic "contains" searching: create two more tables: keywords and keyword_index (or whatever). When saving, tokenize your text content and write out any new keywords to keywords table and then add the data to the join table. Index everything, and then do your search off the keywords table, joining back to the master via the intermediate keyword_index table.
This is fairly hackish, and getting your keyword handling really dialed in (for stemming, etc) may be a pain. It is a reasonable quick & dirty solution for smaller-scale needs though.

Cassandra full text search like

Let's say I have a column family named Questions like below:
Questions = {
Who are you: {
username: "user1"
}, What is the answer: {
username: "user1"
}...
}
How do I search for all the questions that contain certain words?
Get all questions that contain 'what' word.
How do I do it using python or at least Java?
Solandra (https://github.com/tjake/Solandra) is the new name for Lucandra.
Solandra is a combination of Cassandra and Solr (which is based on the Lucene full-text search engine).
Cassandra alone doesn't tackle text-search, although you could implement some basic text indexing by creating secondary index column families (Google: cassandra secondary index).
I'm new to Cassandra, but querying in it is relatively limited, compared to, for instance, a relational database. (This is by design.) I'm pretty sure there's no support for full text search at this time (this may not even be on the roadmap).
You might be best to go with Lucene or something comparable to index the text of the questions, either within the Cassandra datastore or in a separate datastore.
http://lucene.apache.org/java/docs/index.html
There appears to be at least one project that is attempting to integrate Lucene with Cassandra, and there may be others:
http://github.com/tjake/Lucandra
Another way to go in your case might be to break up the questions into words and maintain your own index of words to questions; your mileage may vary here, and something like Lucene will no doubt give you greater flexibility in querying.
Sounds like you could add "DSE Search", from the folks that support Cassandra, and you would have what you need. Lucene/Solr like capabilities but all the data stored in Cassandra.
http://www.datastax.com/dev/blog/cassandra-with-solr-integration-details
You have a good solution given by the last gent but this solution may serve your purpose better from a usability point of view.
Disclaimer: I work for a NoSQL vendor but not on Cassandra.

Are there any technologies that help develop website search?

PROBLEM:
I need to write an advanced search functionality for a website. All the data is stored in MySQL and I'm using Zend Framework on top. I know that I can write a script that takes the search page and builds an SQL query out of it, but this becomes extremely slow if there's a lot of hits. Then I would have to get down to the gritty details of optimizing the database tables/fields/etc. which I'm trying to avoid if possible.
Lucene: I gave Lucene a try, but since it's a full-text search engine, it does not allow any mathematical operators!! So if I wanted to get all the records where field_x > 5, there is no way to do it (correct?)
General Practice? I would like to know how large sites deal with this dilemma. Is there a standard way of doing this that I don't know about, or does everyone have to deal with the nasty details of optimizing the database at some point? I was hoping that some fast indexing/searching technology existed (e.g. Lucene) that would address this problem.
ANY OTHER COMMENTS OR SUGGESTION ARE MOST WELCOME!!
Thanks a lot guys!
Ali
You can use Zend Lucene for textual search, and combine it with MySQL for joins.
Please see Mark Krellenstein's Search Engine vs DBMS paper about the choice; Basically, search engines are better for ranked text search; Databases are better for more complex data manipulations, such as joins, using different record structures.
For a simple x>5 type query, you can use a range query inside Lucene.
Use Lucene for your text-based searches, and use SQL for field_x > 5 searches. I say this because text-based search is hard to get right, and you're probably better off leaving that to an expert.
If you need your users to have the capability of building mathematical expression searches, consider writing an expression builder dialog like this example to collect the search phrase. Then use a parameterized SQL query to execute the search.
SqlWhereBuilder ASP.NET Server Control
http://www.codeproject.com/KB/custom-controls/SqlWhereBuilder.aspx
You can use filters in Lucene to carry out a text search of a reduced set of records. So if you query the database first to get all records where field_x > 5, build a filter (a list of lucene document IDs) and pass this into the lucene search method along with the text query. I'm just learning about this, here's a link to a question I asked (it uses Lucene.Net and C# but it may help) - ignore my question, just check out the accepted answer:
How do you implement a custom filter with Lucene.net?

Resources