How to create full-text / advanced text search in ScyllaDB and Cassandra?

I have installed the latest versions of ScyllaDB and Cassandra on CentOS. I have tried ALLOW FILTERING in SELECT queries, but that is not what I need; I want advanced search or full-text search. I have googled it but couldn't find a solution. When I create indexes and try to run the SELECT query, it gives the error "server error: not implemented: indexes".
Can anyone help me, please?

Scylla is actively working on enabling secondary indexes and expects to have a working solution in the 2.2 release:
http://www.scylladb.com/product/technology/scylla-roadmap/
To support full-text search with Scylla today, an auxiliary solution such as Solr or Elasticsearch is needed. The following link explains how to combine Scylla and Elasticsearch:
http://www.scylladb.com/2017/08/03/data-analytics-elastic-scylla/

If you are using Cassandra version 3.4 or above, you can use an SSTable Attached Secondary Index (SASI).
Using CQL, SSTable attached secondary indexes (SASI) can be created on a non-collection column defined in a table. Secondary indexes are used to query a table by a column that is not normally queryable, such as a non-primary-key column. SASI implements three types of indexes: PREFIX, CONTAINS, and SPARSE.
Learn more at: https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndex.html
Alternatively, you could use Apache Solr or Elasticsearch. In that case, whenever any searchable data is created, updated, or deleted, you have to index it in, or delete it from, Solr or Elasticsearch as well.
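As a rough illustration of that last point, here is a minimal Python sketch of keeping a search index in step with the primary store. Plain dictionaries stand in for both the Cassandra/Scylla table and the Solr/Elasticsearch index, and `SearchableStore`, `save`, `delete`, and `search` are hypothetical names, not part of any driver or client API.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class SearchableStore:
    """Toy dual-write wrapper; dicts stand in for the database and the search engine."""

    def __init__(self):
        self.rows = {}                 # stand-in for the Cassandra/Scylla table
        self.index = defaultdict(set)  # stand-in for the search index: word -> row keys

    def _unindex(self, key):
        """Remove the index entries that point at the currently stored text, if any."""
        old = self.rows.get(key)
        if old is not None:
            for word in tokenize(old):
                self.index[word].discard(key)

    def save(self, key, text):
        """Create or update: drop stale index entries, write the row, index the new text."""
        self._unindex(key)
        self.rows[key] = text
        for word in tokenize(text):
            self.index[word].add(key)

    def delete(self, key):
        """Remove the row from the index first, then from the table."""
        self._unindex(key)
        self.rows.pop(key, None)

    def search(self, word):
        """Ask the index for matching keys, then read those rows by key."""
        keys = sorted(self.index.get(word.lower(), set()))
        return [(k, self.rows[k]) for k in keys]
```

For example, after `save(1, "Full text search")` and a later `save(1, "Prefix queries")`, a search for "search" no longer returns row 1, because the update removed the stale index entries. A real deployment would also need to handle the case where one of the two writes fails, which this sketch ignores.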

Related

"LIKE" and "ILIKE" queries in AWS Keyspaces

As there is no support for custom indexes in AWS Keyspaces, what would be the best solution / pattern for running LIKE or ILIKE queries on specific columns of a Cassandra table?
In vanilla Cassandra, you can use an SSTable attached secondary index (SASI) to run LIKE queries, but we can't in AWS...
Is there any query for Cassandra as same as SQL:LIKE Condition?
Feeding an OpenSearch service, or even a good old Postgres, at the same time as updating Keyspaces seems like overkill to me.
Fetching all columns in-memory somewhere to do the query seems slow as well.
What would be the lightest infra / architecture to implement to provide a LIKE query support based on AWS Keyspaces as source of truth?

You can use a lexicographical range in a SELECT statement to narrow your query down the same way you would with a LIKE statement. If you need to narrow it down further, you can do that on the client side. I would love to learn more about your use case so I can better assist you.
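To make the lexicographical idea concrete, here is a small Python sketch that turns a prefix into the two bounds of a range. It assumes the searched column is a clustering column (so that range predicates are allowed) and that text sorts by Unicode code point; the table and column names in `EXAMPLE_QUERY` are hypothetical.

```python
def prefix_range(prefix):
    """Return (lower, upper) such that lower <= s < upper exactly when s starts with prefix.

    Assumes strings compare by Unicode code point and that the last character
    of the prefix is not the highest possible code point.
    """
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Incrementing the last character gives the smallest string that sorts
    # after every string beginning with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper


# Hypothetical table layout: `bucket` is the partition key, `code` a clustering column.
EXAMPLE_QUERY = "SELECT * FROM codes_by_bucket WHERE bucket = ? AND code >= ? AND code < ?"

lower, upper = prefix_range("a")  # lower is "a", upper is "b"
```

Binding `lower` and `upper` to the two range placeholders behaves like `LIKE 'a%'`. Patterns that are not anchored at the start (such as `'%a%'`) cannot be expressed this way and would still need client-side filtering.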

Search / Filter on primary key

I need to filter on a column, with something like "SELECT * FROM codes WHERE code='a';", to get all codes that start with "a". That is: "aa", "ab", "ac".
CREATE TABLE codes (
code text,
PRIMARY KEY (CODE)
);
Do you know how?

LIKE search (%% in SQL) is not possible in Cassandra.
The only way to do this efficiently is to use a full-text search engine such as https://github.com/tjake/Solandra (Solr on Cassandra).

DataStax Enterprise has an integrated Solr feature for this kind of query, but it still comes with a read performance hit:
Step 1) Solr searches and returns the list of matching keys.
Step 2) Those keys then have to be looked up across the entire cluster to fetch the data, which again depends on the CONSISTENCY LEVEL.
My recommendation is to avoid such queries; Cassandra is not designed for that.

Spark Cassandra connector - Range query on partition key

I'm evaluating spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation, it seems possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use it.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
I can also confirm that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is co-located with the Spark cluster on 3 servers, with a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...

I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, so you have to load the whole table into a CassandraRDD and then apply a Spark filter operation to that CassandraRDD.
So your code (in Scala) should look something like this:
import com.datastax.spark.connector._
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (from, until) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
// Load the whole table, then keep rows with from <= timestamp < until on the Spark side
val cassRDD = sc.cassandraTable("keyspace name", "table name").filter { row =>
  val ts = row.getDate("timestamp")
  !ts.before(from) && ts.before(until)
}
If you are interested in running this type of query, you might want to take a look at other Cassandra connectors, like the one developed by Stratio.

You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take full advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small to medium result set; if you are going to retrieve a large part of your data set, you should use the last option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so that you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the Cassandra driver directly, which is not integrated with Spark, or you can have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
However, as I said before, I don't know whether this fits your needs, since you might not be able to truncate your data and group it by date/time.
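As a sketch of how the value list for such an IN query could be generated, the following Python snippet enumerates one bucket per day and builds the statement text. It assumes the partition key stores the timestamp truncated to the day as a 'YYYY-MM-DD' value; `day_buckets` and `in_query` are hypothetical helper names, and the table and column names are taken from the question.

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """List every day from start to end (both inclusive) as 'YYYY-MM-DD' strings."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def in_query(buckets):
    """Build the IN query text over the given day buckets (literals inlined for readability)."""
    values = ", ".join(f"'{b}'" for b in buckets)
    return f"select * from datastore.data where timestamp IN ({values})"


# One bucket per day of 2013, matching the range in the question.
buckets_2013 = day_buckets(date(2013, 1, 1), date(2013, 12, 31))
query_2013 = in_query(buckets_2013)
```

A full year expands to 365 values, each addressing a different partition, so in practice it may be gentler on the cluster to issue the lookups in smaller batches of days.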
The last option, and the least efficient one, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!

Using Pig with Cassandra CQL3

When trying to run Pig against a Cassandra schema created with CQL3,
-- This script simply gets a row count of the given column family
rows = LOAD 'cassandra://Keyspace1/ColumnFamily/' USING CassandraStorage();
counted = foreach (group rows all) generate COUNT($1);
dump counted;
I get the following error:
Error: Column family 'ColumnFamily' not found in keyspace 'KeySpace1'
I understand that this is by design, but I have been having trouble finding the correct method to load CQL3 tables into Pig.
Can someone point me in the right direction? Is there a missing bit of documentation?

This is now supported in Cassandra 1.2.8

As you mention, this is by design, because updating Thrift to allow this would compromise backwards compatibility. Instead of creating keyspaces and column families using CQL (I'm guessing you used cqlsh), try using the C* CLI.
Take a look at these issues as well:
https://issues.apache.org/jira/browse/CASSANDRA-4924
https://issues.apache.org/jira/browse/CASSANDRA-4377

Per this https://github.com/alexliu68/cassandra/pull/3, it appears that this fix is planned for the 1.2.6 release of Cassandra. It sounds like they're trying to get that out in the reasonably near future, but of course there's no certain ETA.

As e90jimmy said, it's supported in Cassandra 1.2.8, but there is an issue when using the counter column type. This was fixed by Alex Liu, but due to a regression problem in 1.2.7 the patch did not go ahead:
https://issues.apache.org/jira/browse/CASSANDRA-5234
To work around this, wait until 2.0 becomes production ready, or download the source, apply the patch from the above link yourself, and rebuild the Cassandra .jar. It has worked for me so far...

The best way to access CQL3 tables in Pig is to use the CqlStorage handler.
The syntax is similar to what you have above:
row = LOAD 'cql://Keyspace/ColumnFamily/' USING CqlStorage();
More info in the dev blog post.

Cassandra Database Problem

I am using Cassandra for a large-scale application, and I am new to it. I have a database schema for a particular keyspace, for which I created the columns using the Cassandra Command Line Interface (CLI). When I copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column. I get a message saying that zero rows are present, but the files are there. They all have names ending in XXXX-Data.db, XXXX-Filter.db, or XXXX-Index.db. Can anyone tell me how to access the columns of an existing dataset?

(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) If you didn't also copy the schema definition, it will ignore data files for unknown column families.

For what you are trying to achieve, it would probably be better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations
