I've been playing with leveldb and it's really good at what it's designed to do--storing and getting key/value pairs based on keys.
But now I want to do something more advanced and find myself immediately stuck. Is there no way to find a record by value? The only way I can think of is to iterate through the entire database until I find an entry with the value I'm looking for. This becomes worse if I'm looking for multiple entries with the value (basically a "where" query) since I have to iterate through the entire database every time I try to do this type of query.
Am I trying to do what Leveldb isn't designed to do and should I be using another database instead? Or is there a nice way to do this?
You are right. Basically, what you need to know about is key composition.
Also, keep in mind that in SQL you don't query by the value itself in a WHERE clause; you query with a boolean predicate such as age = 42.
To answer your particular question, imagine you have a first key/value namespace in leveldb where you store your objects, with the value serialized as JSON, for instance:
key               | value
--------------------------------------------
namespace | uid   | value
============================================
users     | 1     | {name: "amz", age: 32}
--------------------------------------------
users     | 2     | {name: "abki", age: 42}
In another namespace, you index user uids by age:
key                      | value
---------------------------------
namespace    | age | uid | value
=================================
users-by-uid | 32  | 1   | empty
---------------------------------
users-by-uid | 42  | 2   | empty
Here the value is empty because the key must be unique. What we would normally consider the value of a given row, the uid, is composed into the key so that each row's key stays unique.
In that second namespace, every key that starts with (users-by-uid, 32) matches a record that answers the query age = 32.
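To make this concrete, here is a minimal sketch using the plyvel Python binding; the ! separator and the put_user helper are assumptions for illustration, not part of leveldb itself. The point is that the index keys sort by (namespace, age, uid), so a single prefix scan answers age = 42:

import json
import plyvel  # pip install plyvel

db = plyvel.DB('/tmp/users.ldb', create_if_missing=True)

def put_user(uid, name, age):
    # First namespace: users!<uid> -> serialized object
    db.put(b'users!%d' % uid, json.dumps({'name': name, 'age': age}).encode())
    # Index namespace: users-by-uid!<age>!<uid> -> empty value
    db.put(b'users-by-uid!%d!%d' % (age, uid), b'')

put_user(1, 'amz', 32)
put_user(2, 'abki', 42)

# "WHERE age = 42": iterate every key sharing the (users-by-uid, 42) prefix.
for key, _ in db.iterator(prefix=b'users-by-uid!42!'):
    uid = int(key.rsplit(b'!', 1)[-1])
    print(json.loads(db.get(b'users!%d' % uid)))

For range queries over age (rather than exact matches) you would also want a fixed-width encoding of the number, so that byte order matches numeric order.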
How do I query in Cassandra for != null columns?
Select * from tableA where id != null;
Select * from tableA where name != null;
Then I want to store these values and insert them into a different table.
I don't think this is possible with Cassandra. First of all, Cassandra CQL doesn't support the use of NOT or not equal to operators in the WHERE clause. Secondly, your WHERE clause can only contain primary key columns, and primary key columns will not allow null values to be inserted. I wasn't sure about secondary indexes though, so I ran this quick test:
create table nullTest (id text PRIMARY KEY, name text);
INSERT INTO nullTest (id,name) VALUES ('1','bob');
INSERT INTO nullTest (id,name) VALUES ('2',null);
I now have a table and two rows (one with null data):
SELECT * FROM nullTest;
id | name
----+------
2 | null
1 | bob
(2 rows)
I then try to create a secondary index on name, which I know contains null values.
CREATE INDEX nullTestIdx ON nullTest(name);
It lets me do it. Now, I'll run a query on that index.
SELECT * FROM nullTest WHERE name=null;
Bad Request: Unsupported null value for indexed column name
And again, this supports the premise: you can't query for "not null" when you can't even query for column values that actually are null.
So, I'm thinking this can't be done. Also, if null values are a possibility in your primary key, then you may want to re-evaluate your data model. Again, I know the OP's question is about querying where data is not null. But as I mentioned before, Cassandra CQL doesn't have a NOT or != operator, so that's going to be a problem right there.
Another option is to insert an empty string instead of a null. You would then be able to query on an empty string. But that still doesn't get you past the fundamental design flaw of having a null in a primary key field. Perhaps if you had a composite primary key, and only part of it (the clustering columns) had the possibility of being empty (certainly not part of the partitioning key). But you'd still be stuck with the problem of not being able to query for rows that are "not empty" (instead of not null).
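If you go the empty-string route, a quick sketch with the DataStax Python driver (assuming the nullTest table and nullTestIdx index from above, plus a hypothetical keyspace name and a local cluster):

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')  # hypothetical keyspace

# Write an empty string instead of null so the row stays queryable.
session.execute("INSERT INTO nullTest (id, name) VALUES ('3', '')")

# Unlike name = null, an equality query for '' against the secondary index is allowed.
for row in session.execute("SELECT * FROM nullTest WHERE name = ''"):
    print(row.id, row.name)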
NOTE: Inserting null values was done here for demonstration purposes only. It is something you should do your best to avoid, as inserting a null column value WILL create a tombstone. Likewise, inserting lots of null values will create lots of tombstones.
1) select * from test;
name | id | address
------------------+----+------------------
bangalore | 3 | ramyam_lab
bangalore | 4 | bangalore_ramyam
bangalore | 5 | jasgdjgkj
prasad | 11 | null
prasad | 12 | null
india | 6 | karnata
india | 7 | karnata
ramyam-bangalore | 3 | jasgdjgkj
ramyam-bangalore | 5 | jasgdjgkj
2) Cassandra doesn't support selecting on null values. It displays null only for our understanding.
3) To handle null values, use placeholder strings like "not-available" or "null"; then you can select that data.
I'm looking for examples of the code using python3, not links to the documentation. I haven't found examples in the documentation.
I'm looking to query 2 elements with the category "red" starting at the ID 1.
This is my table:
| ID | category | description |
| 0 | red | .... |
| 1 | red | .... |
| 2 | blue | .... |
| 3 | red | .... |
| 4 | red | .... |
The query should return the elements with IDs 1 and 3.
Looking forward to reading your examples. Thanks in advance.
In DynamoDB you query over your partition key, LSIs, or GSIs.
In your case I would create a GSI with its partition key (gsiID) as your category and its sort key (gsiSK) as your ID.
In that case you can do a query like this: query all elements with gsiID = red and gsiSK = *.
This will give you all the reds sorted by their ID in ascending order (you can also specify descending order).
DynamoDB queries also have an option to limit your result. Since you only need two, you can set limit = 2.
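For reference, this is roughly what defining such a GSI looks like with boto3; the table name, index name, and throughput values are placeholders, and the key attributes follow the table shown in the question:

import boto3

client = boto3.client('dynamodb')

client.update_table(
    TableName='your_table_name',
    AttributeDefinitions=[
        {'AttributeName': 'category', 'AttributeType': 'S'},
        {'AttributeName': 'ID', 'AttributeType': 'N'},
    ],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'category-ID-index',  # placeholder index name
            'KeySchema': [
                {'AttributeName': 'category', 'KeyType': 'HASH'},  # partition key (gsiID)
                {'AttributeName': 'ID', 'KeyType': 'RANGE'},       # sort key (gsiSK)
            ],
            'Projection': {'ProjectionType': 'ALL'},
            'ProvisionedThroughput': {'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1},
        }
    }],
)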
I hope this will help you!
You need to define a Global Secondary Index in which the partition key is category and the sort key is ID.
Once you have that index defined, you can query it as follows (I am using the JS notation, sorry):
{
  TableName: 'your_table_name',
  IndexName: 'your_index_name',
  KeyConditionExpression: 'category = :x and ID >= :y',
  ExpressionAttributeValues: {
    ':x': 'red',
    ':y': 1
  }
}
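Since the question asks for python3, here is the same query with boto3, using the placeholder table and index names from the snippet above:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('your_table_name')

response = table.query(
    IndexName='your_index_name',
    KeyConditionExpression='category = :x AND ID >= :y',
    ExpressionAttributeValues={':x': 'red', ':y': 1},
)
print(response['Items'])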
Note that this is a query. In DynamoDB, queries work on "chunks" of items (aka "pages"). Specifically, when executing a query, DDB takes a chunk, finds all matching items in that chunk, and returns them. If there are other matching items in other chunks, they will not be returned. However, the response will provide you with details of the next chunk so that you can issue a subsequent query on it. These "details" are encapsulated in the LastEvaluatedKey field of the response, and they should be copied into the ExclusiveStartKey of the subsequent request.
You can check this guide to see an example of using LastEvaluatedKey. Look for the following line:
while 'LastEvaluatedKey' in response:
Important!
Although you want to get just two items, you do not want to set the Limit field to 2. Setting it to 2 means that DynamoDB will use very small chunks when looking for items that match your query (in fact, it will use chunks of just two items): this means you may need to do numerous repeated queries (using LastEvaluatedKey/ExclusiveStartKey as explained above) until you actually find two matching items. This will considerably slow down the entire process. For most practical scenarios, the best thing to do is not to set the Limit field at all, and just use its default value.
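A rough boto3 sketch of that pattern: leave Limit unset and keep paging with LastEvaluatedKey / ExclusiveStartKey until you have the two items you need (table and index names are placeholders again):

import boto3

table = boto3.resource('dynamodb').Table('your_table_name')

kwargs = {
    'IndexName': 'your_index_name',
    'KeyConditionExpression': 'category = :x AND ID >= :y',
    'ExpressionAttributeValues': {':x': 'red', ':y': 1},
}

items = []
while True:
    response = table.query(**kwargs)
    items.extend(response['Items'])
    if len(items) >= 2 or 'LastEvaluatedKey' not in response:
        break
    # Resume the next page where the previous one stopped.
    kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

print(items[:2])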
I am using Application Insights to record custom measurements about our application. I have a customEvent that has data stored in the customMeasurements object. The object contains 4 key-value pairs. I have many of these customEvents and I am trying to average the key-value pairs from all the events and display the results in a 2-column table.
I want to have one table with 2 columns. The first column is the key name, and the second column is that key's value averaged across all the events.
For example, event1 has key1's value set to 2. event2 has key1's value set to 6. If those are the only two events I received in the last 7 days, I want my table to show the number 4 in the row containing data for key1.
I can only average 1 key per query since I cannot put multiple summarizes inside of 1 query... Here is what I have for averaging the first key in the customMeasurements object:
customEvents
| where name == "PerformanceMeasurements"
| where timestamp > ago(7d)
| summarize key1average=avg(toint(customMeasurements.key1))
| project key1average
But I need to average all the keys inside of this object and build 1 table as described above.
For reference, I have attached a screenshot of the layout of a customEvent customMeasurements object:
If the number of keys is limited and known beforehand, then I'd recommend using multiple aggregations within the | summarize operator, separating them with commas:
| summarize key1average=avg(toint(customMeasurements.key1)), key2average=avg(toint(customMeasurements.key2)), key3average=avg(toint(customMeasurements.key3))
If the keys may vary, then you'd need to flatten out the custom dimensions first with the | mvexpand operator:
customEvents
| where timestamp > ago(1h)
| where name == "EventName"
| project customDimensions
| mvexpand bagexpansion=array customDimensions
| extend Key = customDimensions[0], Value = customDimensions[1]
| summarize avg(toint(Value)) by tostring(Key)
In this case, each Key-Value pair from customDimensions will become its own row and you will be able to operate on those with the standard query language constructs.
David, could I ask for some clarification on what you say about joins in this answer?
When you say "You cannot, using the join of the relational stores, join one entry to multiple ones", does that mean in any direction?
E.g. Store 1:
| Key1 | Measure1 |
Store 2:
| Key 1 | SomeId1 | Measure2 | Measure3 |
| Key 1 | SomeId2 | Measure4 | Measure4 |
So is it not possible to join these two stores by putting the join from Store 2 to Store 1?
And if not, are you saying then that the only way to manage this is to duplicate the entries in Store 1? E.g.:
Store 1
| Key 1 | SomeId1 | Measure1 | Measure2 | Measure3 |
| Key 1 | SomeId2 | Measure1 | Measure4 | Measure4 |
The direction matters for the one-to-many: it depends on which store is the "parent" one.
The relational stores include the concept of an "ActivePivot Store", which is your main store (the one your schema is based on). This store can then be joined to one or more stores, given a set of key fields; we'll call those "child" stores for simplicity. Each of these child stores can in turn be joined with other stores, and so on (you can represent it as a directed graph).
The main rule to respect is that you should never have a "parent" store entry resolving to multiple "child" store entries (nor should you have any cyclic relationship, I believe).
The simplified idea behind the relational stores (as of RS 1.5.x / AP 4.4.x) is that when one entry is submitted into the "ActivePivot Store", then, starting from the ActivePivot Store, it recursively resolves the joins in order to retrieve at most one entry in each of the joined stores. Depending on your schema definition, these entries are then used to populate the fact before inserting it in the cube.
If resolving a join results in more than one entry, then AP will not be able to choose which one to use to populate the fact and will throw an exception.
Coming back to your example, you can do the join between Store 1 and Store 2 only in the case where Store 2 is your ActivePivot Store or a "parent" of Store 1 (APStore->...->Store2->Store1), which seems to be your case.
If not (Store1->Store2), you will then have to duplicate the entries of Store 1 in order to ensure that at most one entry is found when resolving the join. Store 1 will then look like:
| Key 1 | SomeId1 | Measure1
| Key 1 | SomeId2 | Measure1
Your join with Store 2 will then be done on the fields "Key, SomeId" instead of just "Key", and that ensures you will find at most one entry when resolving Store1->Store2.
I have to create and query a column family with a composite key of [timestamp, long]. Also,
while querying I want to fire a range query on the timestamp (like timestamp between xxx and yyy). Is this possible?
Currently I am doing something really funny (which I know is not correct). I create keys from the timestamp string for the given range and concatenate each with the long,
like:
1254345345435-1234
3423432423432-1234
1231231231231-9999
and pass the set of keys to the Hector API. (So if I have a date range of 1 month and I want per-minute data, I create 30 * 24 * 60 * [number of secondary keys - long] keys.)
I can solve the concatenation issue with a composite key, but the query part is what I am trying to understand.
As far as I understand, since we are using RandomPartitioner we cannot really query based on a range, as keys are MD5 checksums. What's the ideal design for this kind of use case?
My schema and requirements are as follows (actual csh):
CREATE TABLE report(
ts timestamp,
user_id long,
svc1 long,
svc2 long,
svc3 long,
PRIMARY KEY(ts, user_id));
select from report where ts between (123445345435 and 32423423424) and user_id is in (123,567,987)
You cannot do range queries on the first component of a composite key. Instead, you should write a sentinel value such as a daystamp (the unix epoch at midnight on the current day) as the key, then write a composite column as timestamp:long. This way you can provide the keys that comprise your range, and slice on the timestamp component of the composite column.
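As a rough illustration only (using the pycassa client, with a made-up keyspace, column family, and daystamp values), the read side of that layout could look something like this:

from pycassa.pool import ConnectionPool        # pip install pycassa
from pycassa.columnfamily import ColumnFamily

# Assumes a column family whose row key is the daystamp sentinel and whose
# comparator is CompositeType(LongType, LongType) for (timestamp, user_id).
pool = ConnectionPool('my_keyspace', ['localhost:9160'])
reports = ColumnFamily(pool, 'reports_by_day')

day_key = '1354665600'  # sentinel: unix epoch at midnight of the queried day

# Slice on the first (timestamp) component of the composite column names;
# every (timestamp, user_id) column in that range comes back in one call.
columns = reports.get(day_key,
                      column_start=(1354665600000,),
                      column_finish=(1354669200000,),
                      column_count=10000)
for (ts, user_id), value in columns.items():
    print(ts, user_id, value)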
Denormalize! You must model your schema in a manner that will enable the types of queries you wish to perform. We create a reverse (aka inverted, inverse) index for such scenarios.
CREATE TABLE report(
KEY uuid PRIMARY KEY,
svc1 bigint,
svc2 bigint,
svc3 bigint
);
CREATE TABLE ReportsByTime(
KEY ascii PRIMARY KEY
) with default_validation=uuid AND comparator=uuid;
CREATE TABLE ReportsByUser(
KEY bigint PRIMARY KEY
)with default_validation=uuid AND comparator=uuid;
See here for a nice explanation. What you are doing now is generating your own ascii key in the times table, to enable yourself to perform the range slice query you want. It doesn't have to be ascii, though; just something you can use to programmatically generate your own slice keys.
You can use this approach to facilitate all of your queries. This likely isn't going to suit your application directly, but the idea is the same. You can squeeze more out of this by adding meaningful values to the column keys of each table above.
cqlsh:tester> select * from report;
KEY | svc1 | svc2 | svc3
--------------------------------------+------+------+------
1381b530-1dd2-11b2-0000-242d50cf1fb5 | 332 | 333 | 334
13818e20-1dd2-11b2-0000-242d50cf1fb5 | 222 | 223 | 224
13816710-1dd2-11b2-0000-242d50cf1fb5 | 112 | 113 | 114
cqlsh:tester> select * from times;
KEY,1212051037 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5 | 1381b530-1dd2-11b2-0000-242d50cf1fb5,1381b530-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051035 | 13816710-1dd2-11b2-0000-242d50cf1fb5,13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051036 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
cqlsh:tester> select * from users;
KEY | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
-------------+--------------------------------------+--------------------------------------
23123123231 | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
Why don't you use wide rows, where the key is the timestamp and the column name is the long value? Then you can pass multiple keys (timestamps) to getKeySlice and select multiple columns with withColumnSlice by their name (which is the id).
As I don't know what your column names and values are, I feel this can help you. Can you provide more details of your column family definition?