I'm trying to get a proper understanding of using MongoDB to optimise queries, in this case for fields that hold an integer. Say I have a collection with two fields, value and cid, where value stores data of type string and cid stores data of type number.
I intend to write queries that search for records by matching the fields value and cid. The expectation is that this collection will grow very large, so queries could benefit from MongoDB indexes. It makes sense to me to index the value field, which holds a string. But I wonder whether the cid field requires indexing too, or whether it's okay as is, given that it holds integers.
I'm asking because I was going through a code base with this exact scenario and I can't figure out why the number field was not indexed. Hoping my question makes sense.
Regardless of data types, generally speaking all queries should use an index; whether a field needs one depends on whether your queries filter or sort on it, not on whether it holds strings or integers, so a cid field used in your filters benefits from an index just as value does. If you use a sort predicate, you can assist the database by having a compound index on both the equality portion of the query (the filter predicate) and the sorting portion (the sort predicate). MongoDB recommends following the index strategy referred to as the E.S.R. (Equality, Sort, Range) rule - see Performance Best Practices for the E.S.R. rule.
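For example, a minimal sketch in the mongo shell (the collection name items is just a placeholder):

// Compound index: equality fields first, per the E.S.R. rule.
db.items.createIndex({ value: 1, cid: 1 })

// A query filtering on both fields can then be served by this index:
db.items.find({ value: "foo", cid: 42 })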
Cassandra supports CONTAINS on collections:
CREATE TABLE contacts (
    id int PRIMARY KEY,
    firstName text,
    lastName text,
    phones map<text, text>,
    emails set<text>
);
CREATE INDEX ON contacts (firstName);
CREATE INDEX ON contacts (keys(phones)); // Using the keys function to index the map keys
CREATE INDEX ON contacts (emails);
And it is possible to query the emails set and check for a specific email. Simply:
SELECT * FROM contacts WHERE emails CONTAINS 'Benjamin@oops.com';
What would be the solution if one wants to check for the lack of an element - something like DOES NOT CONTAIN? I couldn't find such functionality in the CQL docs; is there any solution for that?
No, Cassandra does not support such a feature. I guess you have gone through the article below:
A deep look at the CQL WHERE clause
You have to fetch the whole collection and filter it in your application.
Cassandra also doesn't support the "is not null" operator or the "not equals" operator.
These restrictions exist because of how C* stores data and how it finds rows and scans columns fast. C* stores data as key-value pairs (a map of maps). It can get data fast by indexing the keys, jumping straight to the location where the data resides. If you index collections:
Sets and lists can index all values found by indexing the collection column. Maps can index a map key, map value, or map entry.
To support a 'not' query, C* would have to read all the rows (or all the items in a collection) and filter out the non-matching results, which is not efficient, so C* does not support this. If you need it, you can handle it in your application, knowing all the facts and considerations.
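As a rough sketch of that application-side filtering with the DataStax Java driver (the contact point and keyspace name are placeholders):

import java.util.Set;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class NotContainsFilter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // hypothetical keyspace

        // CQL has no NOT CONTAINS, so fetch the rows and do the check client-side.
        ResultSet rows = session.execute("SELECT id, emails FROM contacts");
        for (Row row : rows) {
            Set<String> emails = row.getSet("emails", String.class);
            if (!emails.contains("Benjamin@oops.com")) {
                System.out.println("id without that email: " + row.getInt("id"));
            }
        }
        cluster.close();
    }
}

Note that this pulls every row back to the client, which is exactly the cost the database is refusing to hide from you.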
Note: using a C* index has its own performance impact. Make sure you know all the considerations and use cases.
All the cautions about using secondary indexes apply to indexing collections.
Indexing a collection
When to use an index
Cassandra CQL mapping -> How C* stores collections.
I just switched from Oracle to Cassandra 2.0 with the DataStax driver, and I'm having difficulty structuring my model for this big-data approach. I have a Persons table with a UUID and a serialized Person. These Persons have lists of addresses, names, identifications, and DOBs. For each of these lists I have an additional table with a compound key on each value in the respective list plus a person_UUID column. This model feels too relational to me, but I don't know how else to structure it so that I can index (i.e., search by) address, name, identification, and DOB. If Cassandra supported indexes on lists I would have just the one Persons table containing indexed lists for each of these.
In my application we receive transactions, each of which can contain 0 or more of each of address, name, identification, and DOB. The persons are scored based on which person matched which criteria, and the single person with the highest score is matched to the transaction. Any additional address, name, identification, and DOB data from the matched transaction is then added to that person.
The problem I'm having is that this matching is taking too long and the processing is falling far behind. This is caused by having to loop through result sets performing additional queries, since I can't make complex queries in Cassandra, and I don't have sufficient memory to just do a huge select-all and filter in Java. For instance, I would like to select all Persons having at least two names in common with the transaction (names can have their order scrambled, so there is no first, middle, or last; that would just be three names), but this would require a 'group by', which Cassandra does not support. And if I just selected all Persons having any of the names in common, in order to filter in Java, the result set is too large and I run out of memory.
I'm currently searching by only identifications and addresses, which yields a smaller result set (though it can still be hundreds of rows), and for each person in that result set I query to see if it also matches on names and/or DOB. Besides still being slow, this does not meet the project's requirements, as a match on name and DOB alone would be sufficient to link a transaction to a person if no higher score is found.
I know in Cassandra you should model your tables by the queries you do, not by the relationships of the entities, but I don't know how to apply this while maintaining the ability to query individually by address, name, identification, and DOB.
Any help or advice would be greatly appreciated. I'm very impressed by Cassandra but I haven't quite figured out how to make it work for me.
Tables:
Persons
[UUID | serialized_Person]
addresses
[address | person_UUID]
names
[name | person_UUID]
identifications
[identification | person_UUID]
DOBs
[DOB | person_UUID]
I did a lot more reading, and I'm now thinking I should change these tables around to the following:
Persons
[UUID | serialized_Person]
addresses
[address | Set of person_UUID]
names
[name | Set of person_UUID]
identifications
[identification | Set of person_UUID]
DOBs
[DOB | Set of person_UUID]
But I'm afraid of going beyond the max size of a set (65,536 elements) for some names and DOBs. Instead I think I'll have to do a dynamic column family with the column names as the person_UUIDs - or is a row with over 65k columns very problematic as well? Thoughts?
It looks like you can't have these dynamic column families in the new version of Cassandra; you have to alter the table to insert each new column with a specific name. I don't know how to store more than 64k values for a row then. With a perfect distribution I will run out of space for DOBs at 23 million persons, and I'm expecting to have over 200 million. Maybe I just have to have multiple set columns?
DOBs
[DOB | Set of person_UUID_A | Set of person_UUID_B | Set of person_UUID_C]
and I just check the size and alter the table if the size hits 64k? Anything better I can do?
I guess it's just CQL3 that enforces this, and if I really wanted I could still do dynamic columns with Cassandra 2.0?
Ugh, this page from the DataStax docs seems to say I had it right the first way...:
When to use a collection
This answer is not very specific, but I'll come back and add to it when I get a chance.
First thing - don't serialize your Persons into a single column. That complicates searching for and updating any person info. OTOH, there are people who know what they're talking about who disagree with this view. ;)
Next, don't normalize your data. Disk space is cheap, so don't be afraid to write the same data to two places. Your code will need to make sure that the right thing is done.
Those items feed into this: If you want queries to be fast, consider what you need to make that query fast. That is, create a table just for that query. That may mean writing data to multiple tables for multiple queries. Pick a query, and build a table that holds exactly what you need for that query, indexed on whatever you have available for the lookup, such as an id.
So, if you need to query by address, build a table (really, a column family) indexed on address. If you need to support another query based on identification, index on that. Each table may contain duplicate data, which means that when you add a new user you may be writing the same data to more than one table. This seems unnatural if relational databases are the only kind you've ever used, but you get benefits in return - namely, horizontal scalability (the trade-off the CAP theorem describes).
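Applied to the question's model, a minimal sketch of one such per-query table (names are hypothetical) might be:

// One lookup table per query; the same person data is written to each.
CREATE TABLE persons_by_address (
    address text,
    person_uuid uuid,
    PRIMARY KEY (address, person_uuid)
);

// Fast lookup by address, no secondary index needed:
SELECT person_uuid FROM persons_by_address WHERE address = '42 Main St';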
Edit:
The two column families in that last example could instead just hold identifiers into another table. So, voilà: you have made an index. OTOH, that means each query takes two reads. But it will still be a performance improvement in many cases.
Edit:
Attempting to explain the previous edit:
Say you have a users table/column family:
CREATE TABLE users (
    id uuid PRIMARY KEY,
    display_name text,
    avatar text
);
And you want to find a user's avatar given a display name (a contrived example). Searching users by display_name will be slow, since it means scanning the table. So, you could create a table/CF that serves as an index; let's call it users_by_name:
CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    user_id uuid
);
The search on display_name is now done against users_by_name, and that gives you the user_id, which you use to issue a second query against users. In this case, user_id in users_by_name has the value of the primary key id in users. Both queries are fast.
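As a sketch, the lookup becomes two fast queries ('alice' is just a placeholder):

SELECT user_id FROM users_by_name WHERE display_name = 'alice';
// ...then, binding the user_id returned above:
SELECT avatar FROM users WHERE id = ?;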
Or, you could put avatar in users_by_name, and accomplish the same thing with one query by using more disk space.
CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    avatar text
);
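The same lookup, as a sketch, is now a single query:

SELECT avatar FROM users_by_name WHERE display_name = 'alice';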
Suppose I search for a query in field A, and I want to retrieve the corresponding fields B and C from my index - how should I go about it? I am using Lucene 3.6.0.
The results of your query will be returned as a set of documents, not fields. Once you've got a document, you can load whichever field contents you're interested in.
One thing that's probably worth watching out for is to ensure that the fields you want back were "stored" when the documents were indexed.
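A minimal sketch against the Lucene 3.6 API (the field names A, B, and C and the index path are placeholders; B and C must have been added with Field.Store.YES at index time):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class RetrieveFields {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));

        // Parse the user's query against field A.
        QueryParser parser = new QueryParser(Version.LUCENE_36, "A",
                new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse("some search terms");

        // Hits come back as documents; load the stored fields from each.
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            System.out.println(doc.get("B") + " / " + doc.get("C"));
        }
        searcher.close();
    }
}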
Good luck,
I want to retrieve the row ID values that correspond to a given super column name.
For that purpose I have used this code:
SuperColumnQuery<String, String, String, String> superColumnQuery =
        HFactory.createSuperColumnQuery(keyspaceOperator, se, se, se, se);
// rowID contains a list of row keys separated by ',' - note that setKey()
// expects a single row key, so a comma-separated list will not match anything.
superColumnQuery.setColumnFamily(COLUMN_FAMILY).setKey(rowID).setSuperName(superColumnName);
QueryResult<HSuperColumn<String, String, String>> result = superColumnQuery.execute();
But it's not working.
Given that you're trying to select row keys based on column names, I'd venture to guess that your data model is backwards. You should generally be moving from the outside in -- select on row key, then on supercolumn name, then on column name. Otherwise you're going to be stuck iterating over rows in your code trying to match a column name, instead of using the Cassandra engine to select what you need. This approach is never going to scale.
So I'd suggest redoing your data model -- or if you need to have it this way, consider adding another ColumnFamily that serves as an index for the first. Contrary to old-school SQL databases, the credo in NoSQL dbs like Cassandra is "If you're denormalizing -- you're doing it right".
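As a rough sketch of that index idea with Hector (the CF name SuperColumnIndex is hypothetical; keyspaceOperator, se, rowID, and superColumnName are the variables from your snippet): every time you write a row, also record its key under the super column name in the index CF.

import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Index row key = super column name; column name = the original row key.
Mutator<String> mutator = HFactory.createMutator(keyspaceOperator, se);
mutator.insert(superColumnName, "SuperColumnIndex",
        HFactory.createStringColumn(rowID, ""));

Reading all row keys for a given super column name is then a single slice over one index row, which the Cassandra engine can do efficiently.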