Full-text search in Cassandra
I am fairly new to Cassandra and want to understand it better. I am attempting to perform a full-text search in Cassandra, but after some research I have found that there may not be a "simple" approach for this, and I say "may" because the first page of Google hasn't said much of anything.
So instead I am trying to work out the best approach here. That led me to make up my own solutions based on what I've learned so far about Cassandra, resting on these two principles: a) design your tables around your queries rather than your data, and b) more data is a good thing, as long as it is being used properly.
With that being said, I've come up with a couple of solutions I'd like to share, and I'd also ask that if anyone has a better idea, please fill me in on it before I commit to anything unreasonable or naive.
First solution: create a column family (CF) with a two-part compound primary key and an index, like so:
CREATE TABLE "FullTextSearch" (
    "PartialText" text,
    "TargetIdentifier" uuid,
    "CompleteText" text,
    "Type" int,
    PRIMARY KEY ("PartialText", "TargetIdentifier")
);
CREATE INDEX IX_FullTextSearch_Type ON "keyspace"."FullTextSearch" ("Type");
With the above table, I would need to insert rows for the text "Hello World" as follows:
BEGIN BATCH
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('H',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('He',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hel',00000000-0000-0000-0000-000000000000,'Hello World',1);
.....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hello Wor',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hello Worl',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Hello World',00000000-0000-0000-0000-000000000000,'Hello World',1);
.....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Wor',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('Worl',00000000-0000-0000-0000-000000000000,'Hello World',1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ('World',00000000-0000-0000-0000-000000000000,'Hello World',1);
APPLY BATCH;
Basically, the above will satisfy wildcards/partial text such as "%o W%", "Hello%", and "Worl%". However it will not satisfy partial words such as "%ell%" for "Hello", which I can feel alright about for now... (OCD sorta kicks in here)
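A lookup for what the user has typed so far would then just be an equality match on the partition key, something like:
SELECT "TargetIdentifier", "CompleteText"
FROM "FullTextSearch"
WHERE "PartialText" = 'Hello Wor';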
This approach sort of sucks for me because I would now have to delete and re-insert all of the partial rows any time the text for a "TargetIdentifier" changes.
The second solution would be very similar, only this time making use of wide rows, where the table might look like:
CREATE TABLE "FullTextSearch" (
    "TargetIdentifier" uuid,
    "Type" int,
    "CompleteText" text,
    PRIMARY KEY ("TargetIdentifier")
);
and now during a search something like:
SELECT * FROM "FullTextSearch" WHERE "He" = 1;
so that if the column exists, the respective rows are returned;
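As far as I can tell, though, a query like that isn't valid CQL3, since column names can't be matched dynamically; the closest CQL3 rendering of this wide-row idea I can think of (just a sketch, untested) would move the partial text into a clustering column:
CREATE TABLE "FullTextSearch" (
    "TargetIdentifier" uuid,
    "PartialText" text,
    "Type" int,
    "CompleteText" text,
    PRIMARY KEY ("TargetIdentifier", "PartialText")
);
-- All partials for one identifier now live in a single wide row,
-- but finding rows by partial across identifiers would still need a
-- secondary index on "PartialText" or ALLOW FILTERING.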
Third solution: similar to the one above, only this time instead of wide columns we use a collection column such as a map for the partial texts, and perform a query like:
SELECT * FROM "FullTextSearch" WHERE "PartialTexts"['He'] = 1;
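For this to work the map would presumably need to be indexed; if I understand the docs correctly, Cassandra 2.1+ can index map entries, so something like this sketch (the index name is made up by me) might do it:
CREATE TABLE "FullTextSearch" (
    "TargetIdentifier" uuid PRIMARY KEY,
    "CompleteText" text,
    "PartialTexts" map<text, int>
);
-- Index the map entries so that WHERE "PartialTexts"['He'] = 1 is allowed (2.1+):
CREATE INDEX "IX_FullTextSearch_Partials" ON "FullTextSearch" (ENTRIES("PartialTexts"));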
Anyway, I am all out of ideas; it is late, and I can only hope for a great response! Please let me know what I should be doing here... am I even on the right path?
AFAIK Datastax Enterprise Search is the (commercial) successor of Solandra.
Cassandra 2.0 supports so-called "custom secondary indexes".
Custom secondary indexes are Java code. Your own implementation has to implement the abstract class org.apache.cassandra.db.index.SecondaryIndex.
(See http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_index_r.html)
I'm not sure whether implementations exist for Elasticsearch or Solr.
I would not recommend coding all the weird full-text search logic yourself, like stemming, multiple/exotic language support, or even geospatial stuff.
But SecondaryIndex would be a good starting point for integrating your favorite search engine.
If your dataset is relatively small you can simply use an in-memory instance of Lucene, update the index at a set interval, and you are ready to go.
Use Elassandra, which embeds Elasticsearch as a plugin in Cassandra.
An example can be found here.
Check out Solandra (formerly Lucandra).
But I think Solandra is not being actively developed any more; the author moved to DataStax and continued his work there.
So you can also take a look at DataStax Enterprise Search.
There are some limitations as well; look at DistributedSearch.
The very basic thing about Cassandra is that if you want to use a WHERE clause to filter records, the column must either be part of the primary key or have an index assigned to it. From what I can see, you have given the primary key to the "TargetIdentifier" field and an index to "Type", but you are using "CompleteText" in the WHERE clause, so this may not work.
Assign a secondary index to "CompleteText" and check whether you get the desired output.
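Something like this (untested, using the first table from the question):
CREATE INDEX "IX_FullTextSearch_CompleteText" ON "FullTextSearch" ("CompleteText");
-- Note: built-in secondary indexes only support equality matches,
-- so this finds exact strings, not partial text:
SELECT * FROM "FullTextSearch" WHERE "CompleteText" = 'Hello World';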
A couple other options you have:
The Stratio Lucene plugin, which uses Lucene to implement a native secondary index.
You also have SSTable Attached Secondary Index (SASI) available to use for free text searching.
Be forewarned that both of these strategies use locally distributed indexes such that queries will not be very performant since searches will end up being broadcast across the entire cluster. For SASI, you can avoid this if you can use a partition key as part of your query.
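For example, a SASI index in CONTAINS mode allows LIKE queries over text; a rough sketch (assuming Cassandra 3.4+ and the question's first table; the index name is made up):
CREATE CUSTOM INDEX "IX_CompleteText_SASI" ON "FullTextSearch" ("CompleteText")
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = { 'mode': 'CONTAINS' };
-- A leading and trailing wildcard matches substrings anywhere in the text:
SELECT * FROM "FullTextSearch" WHERE "CompleteText" LIKE '%ell%';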
Use Solr for full-text search; Cassandra is not good for full text.
1. Cassandra as the DB for archival storage
2. Solr for full-text search
Related
Looking at the following lines of code:
query = "DROP TABLE IF EXISTS my_table"
cur.execute(query)
conn.commit()
# print(table_name)
I'm running queries like this against multiple tables with various statements, and I want to return the name of the table and the action executed each time. Is there a way to get some kind of metadata from cur.execute or conn.commit about the action being run?
In the example above I'd like to print the table name (my_table) and the action (DROP TABLE); however, I want this to be dynamic. If I'm creating a table, I want the name of the newly created table and the action (CREATE TABLE).
Thanks.
Quick and Dirty
tables = ['table_1', 'table_2', 'table_3']
action = 'DROP TABLE'

for table in tables:
    # Build each statement from parts we already know,
    # then report those same parts back out.
    cur.execute(f'{action} IF EXISTS {table}')
    print(f'ACTION: {action}')
    print(f'TABLE: {table}')

conn.commit()
HOWEVER, please do not ever do something like this in anything other than a tiny app that will never leave your computer, and especially not with anything that will accept input from a user.
Bad things will happen.
Dynamically interfacing with databases using OOP is a solved problem, and it's not worth reinventing the wheel. Have you considered using an ORM like SQLAlchemy?
I would like to do more complex queries on jsonb/documents that contain arrays of objects. Is there any library anyone would recommend for Node? I am using pg, but I want to do more advanced queries, like selecting documents where a document has an array containing an object with a certain key/value. If there aren't any libraries that do this, does anyone know how I could do it with JSON functions etc. in psql? Or point me to a book/resource where I could learn this advanced querying?
If you need to do really complicated things you're going to be writing SQL no matter what. But for basic queries that involve working with JSONB fields Massive (full disclosure, it's my project) has you covered, and executing handwritten prepared statements is as easy as anything else since scripts are loaded into the API.
Searching an embedded array falls into the 'really complicated' category, unfortunately, but if you know your element positions you could do this quite simply with Massive:
await db.mytable.find({
'somejson.arrayfield[0].key': 'value'
});
This would return all records from mytable where the somejson column has an arrayfield array, the first element in which contains a "key": "value" pair.
For searching, check out the Postgres docs. The specific question you have requires a lateral join on the jsonb_array_elements function like so:
SELECT somejson
FROM mytable
JOIN LATERAL jsonb_array_elements(mytable.somejson->'arrayfield') AS elements
ON TRUE
WHERE elements->>'key' = $1;
With Massive, you'd put this query in a script in your application's /db directory and run it as db.myScriptName('value'). You can use folders to group similar scripts too.
When I tried to query a table using the CONTAINS keyword, it prompted "Cannot use CONTAINS relation on non collection column col1", but when I tried to create a table using
CREATE TABLE test (
    id int,
    address map<text, int>,
    mail list<text>,
    phone set<int>,
    PRIMARY KEY (id, address, mail, phone)
);
it prompted "Invalid collection type for PRIMARY KEY component phone".
One of the basics in Cassandra is that you can't modify primary keys. Always keep that in mind.
You can't use a collection as a primary key component unless it is frozen, meaning you can't modify it.
This will work:
CREATE TABLE test (
    id int,
    address frozen<map<text, int>>,
    mail frozen<list<text>>,
    phone frozen<set<int>>,
    PRIMARY KEY (id, address, mail, phone)
);
However, I think you should take a look at this document: http://www.datastax.com/dev/blog/cql-in-2-1
You can put secondary indexes on collections as of Cassandra 2.1. You may want to use that functionality.
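For example (a sketch assuming Cassandra 2.1+; the collection must be a regular, non-frozen column rather than part of the primary key, and test2 is just an illustrative name):
CREATE TABLE test2 (
    id int PRIMARY KEY,
    mail list<text>
);
CREATE INDEX ON test2 (mail);
-- CONTAINS is now allowed because mail is an indexed collection:
SELECT * FROM test2 WHERE mail CONTAINS 'user@example.com';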
I'm trying to build auto-suggest functionality using Solr. The index contains different locations within a city and looks something like:
id: unique id
name: the complete name
type: can be one of 'location_zone', 'location_subzone', 'location_city', 'outlet', 'landmark' ...
city: city id
Now when the user types something, I want it to return suggestions only from the current city and of type location_*, something similar to WHERE city_id = 1 AND type = 'location_%' in SQL.
I guess one way to do it is by faceting, but is that the right way? Will it still search all documents and then filter the results, or will it apply the condition first, as MySQL would?
PS: I'm new to Solr and would appreciate it if you could point out any mistakes in my approach.
Solr does provide filtering, using the fq parameter. What you're looking for should be something along the lines of:
&fq=city_id:1&fq=type:location_*&q=...
Filter queries restrict the candidate document set before the main query is scored (and are cached independently of it), so Solr is not searching all documents and filtering afterwards.
This page illustrates very well how and when to use filter queries in Solr.
I am using SubSonic SimpleRepo with migrations in dev, and it makes things pretty easy, but I keep running into issues with my nvarchar columns that have an index. My Users table has an index defined on the Username column for obvious reasons, but each time I start the project SubSonic does this:
ALTER TABLE [Users] ALTER COLUMN Username nvarchar(50);
which causes this:
The index 'IX_Username' is dependent on column 'Username'. ALTER TABLE ALTER COLUMN Username failed because one or more objects access this column.
Is there any way around this issue?
Which DBMS are you using? SQL Server?
I've never had a problem like this with MySQL, but it seems that your DBMS does not allow altering a column that has an index on it. This is not a SubSonic-related issue.
Maybe you should do:
Execute("DROP index ...");
AlterColumn("...");
Execute("CREATE index ...");
It is one or the other. If you choose to manage your DB with SubSonic (i.e. use migrations), you can't use indexes and stuff. Sorry mate.