Cassandra full text search like - search

Let's say I have a column family named Questions like below:
Questions = {
    Who are you: {
        username: "user1"
    },
    What is the answer: {
        username: "user1"
    }...
}
How do I search for all the questions that contain certain words?
For example, get all questions that contain the word 'what'.
How do I do it using Python, or at least Java?

Solandra (https://github.com/tjake/Solandra) is the new name for Lucandra.
Solandra is a combination of Cassandra and Solr (which is based on the Lucene full-text search engine).
Cassandra alone doesn't tackle text-search, although you could implement some basic text indexing by creating secondary index column families (Google: cassandra secondary index).

I'm new to Cassandra, but querying in it is relatively limited, compared to, for instance, a relational database. (This is by design.) I'm pretty sure there's no support for full text search at this time (this may not even be on the roadmap).
Your best option might be to use Lucene or something comparable to index the text of the questions, either within the Cassandra datastore or in a separate datastore.
http://lucene.apache.org/java/docs/index.html
There appears to be at least one project that is attempting to integrate Lucene with Cassandra, and there may be others:
http://github.com/tjake/Lucandra
Another way to go in your case might be to break up the questions into words and maintain your own index of words to questions; your mileage may vary here, and something like Lucene will no doubt give you greater flexibility in querying.
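The "own index of words to questions" idea can be sketched in plain Python. This is a minimal in-memory illustration, not a Cassandra client: the helper names (tokenize, build_index, search) are made up for this example, and in Cassandra the index would presumably live in a separate column family keyed by word.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(questions):
    """Map each word to the set of question row keys that contain it."""
    index = defaultdict(set)
    for question in questions:
        for word in tokenize(question):
            index[word].add(question)
    return index

def search(index, word):
    """Return the questions containing the word, sorted for stable output."""
    return sorted(index.get(word.lower(), set()))

# Row keys taken from the example column family in the question.
questions = {
    "Who are you": {"username": "user1"},
    "What is the answer": {"username": "user1"},
}

index = build_index(questions)
matches = search(index, "what")
print(matches)
```

The index has to be updated on every write to Questions, which is the maintenance cost this approach trades for not running a separate search engine.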

Sounds like you could add "DSE Search", from the folks that support Cassandra, and you would have what you need. It offers Lucene/Solr-like capabilities, with all the data stored in Cassandra.
http://www.datastax.com/dev/blog/cassandra-with-solr-integration-details
You have a good solution given by the last gent but this solution may serve your purpose better from a usability point of view.
Disclaimer: I work for a NoSQL vendor but not on Cassandra.

Related

Query on all columns cassandra

I have close to six tables in Cassandra, each with 20 to 60 columns. I am designing the schema for this database.
The requirement is that every column must be queryable individually.
I know that secondary indexes are discouraged when the data has high cardinality.
Materialized views would serve my purpose to an extent, since they would let me query on other columns as well.
My question is:
In this scenario, if each table has 30 to 50+ materialized views, is this an acceptable pattern, or is it on totally the wrong track? Is it taking this functionality to an extreme? Writes may start to become expensive on the system (I know the views are updated eventually, not together with the immediate write to the base table).
You definitely do not want 30 to 50 materialized views.
It sounds like the use case you're trying to satisfy is search, more so than a specific query.
If the queries that are going to be run on each column can be predefined, then you can also go the denormalization route, trading flexibility of search for better performance and less operational overhead.
If you're interested in the search route, here's what I suggest you take a look at:
SASI Indexes (depending on Cassandra version you're using)
Elastic Search
Solr
DataStax Enterprise Search (disclaimer I work for DataStax)
Elassandra
Stratio
Those are just the ones I know off the top of my head. There may be others (Sorry if I missed you). I provided links to each so you can make your own informed decision as to which makes more sense for your use case.

Yii2: How should site-wide search work?

What is the best-practice methodology for implementing site-wide search in Yii2?
This question is not about how to implement search specifically, but rather about what kind of approach to use. Should we use Sphinx? Elasticsearch? Or do we use UNION selects to get the data into a DataProvider?
Assume the application is using a relational database to store data. We want to search and display multiple different models. For example, our database contains tables of Books, Authors and Stores. When we search for a keyword we want to display results from all 3 tables (matching Books by title or content, Authors by full name and Stores by name etc).
There are tutorials which show how to use Elasticsearch but assume that our data is stored in the Elasticsearch database, which does not make sense. Our data is already stored in MySQL or PostgreSQL. Does this mean we need to maintain a duplicate of our data in the Elasticsearch database?
What is the best-practice methodology for implementing site-wide search in Yii2?
That depends on many factors, so I can't give you a specific recommendation for your case. Some of the factors to think about are:
What would you like to achieve with this search? Is every little bit in your database a significant search term?
Do you need only full-text-search or a wide range of analytics?
Have you any limits in time or costs?
Can your (tech-)infrastructure handle your ideas?
Is it worth bringing another extensive technology into the project?
Can you handle additional maintenance tasks to run such a search engine?
And many more ...
In my internal Yii2 project with a PostgreSQL RDBMS, I decided to use a PostgreSQL text search type called tsvector. That's good enough for my needs. Why?
You can use Stemming.
Supports Fuzzy search.
Supports basic ranking.
Supports multiple languages.
I highly recommend this blog post: "Postgres full-text search is Good Enough".

Cassandra store and query dynamic (user defined) data

We've been looking into using Cassandra to store some of the larger data in a multi-tenant system we are building. The decision to use Cassandra is mostly to do with scaling capabilities and performance when working with large data sets, but I am not sure whether what we're looking for is possible in Cassandra, so I'm hoping someone has some clues as to whether (and how) this could be done:
We are looking for a way to let our users first define their own entity types, then define fields in those entities (and field types). Once they've defined these, their data (matching the definitions they just created) could be imported, stored and, most importantly, queried by pretty much any field they defined.
So for instance, we may have one user who defines an Airplane, which has the manufacturer name, model, tail number, year of production, etc...
Their data will, then, contain those fields, be searchable and sortable by those fields, etc..
Another user may decide to define a Boat, which can then have different fields, which should be also sortable and searchable by content.
Because of the possible number of entries - the typical relational approach is unlikely to yield adequate performance, so we're looking at a noSQL approach.
Is this something that could be done in C*? Or are there any other suggestions in terms of a storage engine that would offer best flexibility?
I can see two important points in your requirements
Dynamic typing/schemaless data: Cassandra defines how data is structured, much as a relational database does. Yet you can use columns of complex type: map...
Query by any field: Cassandra requires each query to provide the partition key. The Cassandra data model is driven by querying: if you don't know your queries in advance, you won't be able to design the appropriate model, and you won't be able to query it.
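To make the "model driven by querying" point concrete, here is a small plain-Python simulation, not Cassandra code. Each dictionary stands in for a table partitioned on one field, so every query that is known in advance gets its own copy of the data. The rows, the tail numbers and the build_table and tail_numbers helpers are all made up for illustration.

```python
from collections import defaultdict

# Illustrative rows only; field names follow the Airplane example in the question.
airplanes = [
    {"tail_number": "N101", "manufacturer": "Boeing", "model": "737", "year": 1998},
    {"tail_number": "N202", "manufacturer": "Airbus", "model": "A320", "year": 2005},
    {"tail_number": "N303", "manufacturer": "Boeing", "model": "777", "year": 2005},
]

def build_table(rows, partition_key):
    """Group rows by one field, mimicking a table partitioned on that field."""
    table = defaultdict(list)
    for row in rows:
        table[row[partition_key]].append(row)
    return table

# One "table" per query that is known in advance.
by_manufacturer = build_table(airplanes, "manufacturer")
by_year = build_table(airplanes, "year")

def tail_numbers(table, key):
    """Look up a single partition and return its tail numbers in insertion order."""
    return [row["tail_number"] for row in table.get(key, [])]

boeing = tail_numbers(by_manufacturer, "Boeing")
from_2005 = tail_numbers(by_year, 2005)
print(boeing)
print(from_2005)
```

A query on a field with no such table (say, model) has nothing to look up and would need a scan over everything, which is why user-defined fields that must all be searchable fit this model poorly.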
I advise you to have a look at Elasticsearch.
Then, if you have to use Cassandra for some other reason, I advise you to look at the DataStax Enterprise edition of Cassandra, which integrates with Solr and Spark: both will give you extra querying capabilities.

Searching for data in Cassandra

I understand that with Cassandra, it is possible to search using secondary indexes, but the problem is that I am trying to search on information from a super column. So I want to search on a value within a super column, but return everything within that row (not just that one super column). Is this possible to do?
My understanding is that Facebook and Twitter use Cassandra, and so it would seem quite pointless if they have search facilities but it is not possible to search using something built into Cassandra.
Please correct me if I have not understood the proper use of super columns within Cassandra.
Thanks.
You cannot search on a super column value, as secondary indexes are not supported for SCs. You should avoid using super columns for a variety of reasons, but mostly because they are effectively deprecated. Most super column use cases are supported through the use of composites--which will ultimately replace SCs. In the meantime, if you must search for a value in a SC, you will have to do so manually (i.e. in code) or using an external tool such as Hadoop or Solr.
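Searching "manually (i.e. in code)" amounts to scanning every row and checking its super columns. The sketch below models a super column family as nested Python dictionaries (row key, then super column name, then column name and value). The data and the rows_with_value helper are hypothetical, and a real implementation would fetch rows through a Cassandra client rather than hold them in memory.

```python
def rows_with_value(column_family, wanted):
    """Return every full row that holds the wanted value in any super column."""
    matches = {}
    for row_key, super_columns in column_family.items():
        for columns in super_columns.values():
            if wanted in columns.values():
                # Keep the whole row, not just the super column that matched.
                matches[row_key] = super_columns
                break
    return matches

# Hypothetical data: row key -> super column -> column -> value.
users = {
    "row1": {
        "address": {"city": "Leeds", "country": "UK"},
        "work": {"city": "York", "country": "UK"},
    },
    "row2": {
        "address": {"city": "Paris", "country": "France"},
    },
}

hits = rows_with_value(users, "York")
print(hits)
```

This is a full scan, so its cost grows with the size of the column family, which is one reason to hand the job to an external tool for large data sets.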

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that, we will fire search again for getting project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data, we will end up updating it in a SINGLE PLACE only. Otherwise, if we store this meta data with all the pdf document indexes, we will end up updating all of the documents, which is not what I am looking for.
Please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time, for example "documentText:foo OR projectName:bar". If you have no such requirement, then it seems like storing an ID in Lucene that refers to a database row is a fine thing to do.
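The two-step lookup discussed in these answers can be sketched in plain Python, with a small inverted index standing in for Lucene and a dictionary standing in for the database table. Nothing here is the Lucene API: the project records, file names and the search function are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Stand-in for the database table, keyed by project id (hypothetical data).
projects = {
    "P1": {"name": "Harbour Bridge", "location": "Sydney"},
    "P2": {"name": "Metro Line", "location": "Lyon"},
}

# Stand-in for the indexed pdf files; each carries only a projectId reference.
documents = [
    {"file": "survey.pdf", "projectId": "P1", "text": "Soil survey for the bridge foundations"},
    {"file": "budget.pdf", "projectId": "P2", "text": "Tunnel budget and survey costs"},
    {"file": "design.pdf", "projectId": "P1", "text": "Bridge deck design notes"},
]

# Step 0: index document content only (word -> positions in the documents list).
content_index = defaultdict(set)
for doc_no, doc in enumerate(documents):
    for word in re.findall(r"[a-z0-9]+", doc["text"].lower()):
        content_index[word].add(doc_no)

def search(keyword):
    """Find matching documents, then fetch project meta data by projectId."""
    results = []
    for doc_no in sorted(content_index.get(keyword.lower(), ())):
        doc = documents[doc_no]               # step 1: full-text hit
        project = projects[doc["projectId"]]  # step 2: meta data lookup by id
        results.append((doc["file"], project["name"]))
    return results

bridge_hits = search("bridge")
print(bridge_hits)
```

Because the meta data lives in one place, renaming a project changes a single record. The cost is the limitation noted above: a query cannot combine document text and project fields in one pass.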
I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware of the fact that you're using Lucene for something that it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.
