I'm looking for code examples using Python 3, not links to the documentation; I haven't found examples in the documentation.
I want to query 2 elements with the category "red", starting at ID 1.
This is my table:
| ID | category | description |
|----|----------|-------------|
| 0  | red      | ....        |
| 1  | red      | ....        |
| 2  | blue     | ....        |
| 3  | red      | ....        |
| 4  | red      | ....        |
The query should return the elements with IDs 1 and 3.
Looking forward to reading your examples. Thanks in advance.
In DynamoDB you query on your partition key, LSIs, or GSIs.
In your case I would create a GSI with its partition key (gsiID) as your category and its sort key (gsiSK) as your ID.
With that in place you can do a query like this: query all elements with gsiID = red and gsiSK = *.
This will give you all the reds sorted by their ID in ascending order (you can also specify descending order).
DynamoDB queries also have an option to limit your result. Since you need two items, you can set Limit = 2.
I hope this will help you!
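Since you asked for Python, here is a minimal Python 3 sketch of that approach with boto3's resource API; the table name, the GSI name, and the attribute names are assumptions you would adapt to your own schema:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my_table")  # hypothetical table name

response = table.query(
    IndexName="category-id-index",  # hypothetical GSI name (gsiID = category, gsiSK = ID)
    KeyConditionExpression=Key("category").eq("red") & Key("ID").gte(1),
    Limit=2,  # return at most two items per request
)
items = response["Items"]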
You need to define a Global Secondary Index in which the partition key is category and the sort key is ID.
Once you have that index defined, you can query it as follows (I am using the JS notation, sorry):
{
  TableName: 'your_table_name',
  IndexName: 'your_index_name',
  KeyConditionExpression: 'category = :x and ID >= :y',
  ExpressionAttributeValues: {
    ':x': 'red',
    ':y': 1
  }
}
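In Python 3, a rough equivalent with boto3's low-level client could look like this (the table and index names are placeholders, and the attribute-name placeholders simply guard against reserved words):
import boto3

client = boto3.client("dynamodb")
response = client.query(
    TableName="your_table_name",
    IndexName="your_index_name",
    KeyConditionExpression="#cat = :x AND #id >= :y",
    ExpressionAttributeNames={"#cat": "category", "#id": "ID"},
    ExpressionAttributeValues={
        ":x": {"S": "red"},
        ":y": {"N": "1"},  # the low-level client expects numbers as strings
    },
)
items = response["Items"]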
Note that this is a query. In DynamoDB, queries work on "chunks" of items (aka "pages"). Specifically, when executing a query, DDB takes a chunk, finds all matching items in that chunk and returns them. If there are other matching items in other chunks they will not be returned. However, the response will provide you with details of the next chunk so that you can issue a subsequent query on the next chunk. These "details" are encapsulated in the LastEvaluatedKey field of the response and they should be copied into the ExclusiveStartKey field of the subsequent request.
You can check this guide to see an example of using LastEvaluatedKey. Look for the following line:
while 'LastEvaluatedKey' in response:
Important!
Although you want to get just two items, you do not want to set the Limit field to 2. Setting it to 2 means that DynamoDB will use very small chunks when looking for items that match your query (in fact, it will use chunks of just two items): this means you will need to do numerous repeated queries (by using LastEvaluatedKey/ExclusiveStartKey as explained above) until you actually find two matching items. This will considerably slow down the entire process. For most practical scenarios, the best thing to do is not to set the Limit field at all, and just use its default value.
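Putting it together, here is a hedged Python 3 / boto3 sketch of that paging pattern, reusing the key condition from the parameters above with the same placeholder table, index, and attribute names; it keeps following LastEvaluatedKey until two matching items have been collected or the index is exhausted, without setting Limit:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("your_table_name")  # placeholder table name

query_kwargs = {
    "IndexName": "your_index_name",  # placeholder GSI name
    "KeyConditionExpression": Key("category").eq("red") & Key("ID").gte(1),
}

items = []
while True:
    response = table.query(**query_kwargs)
    items.extend(response["Items"])
    if len(items) >= 2 or "LastEvaluatedKey" not in response:
        break
    # Continue the query from where the previous page stopped.
    query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

items = items[:2]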
In T-SQL, when grouping results, you can also get a running total row when specifying "WITH ROLLUP".
How can I achieve this in Kusto? So, consider the following query:
customEvents | summarize counter = count() by name
The query above gives me a list of event names and how often they occurred. This is what I need, but I also want a row with the running total (the count of all events).
It feels like there should be an easy way to achieve this, but I haven't found anything in the docs...
You can write two queries: the first counts the number of each event, and the second counts the total number of all events. Then use the union operator to combine them.
The query looks like this:
customEvents
| count
| extend name = "total",counter=Count
| project name,counter
| union
(customEvents
| summarize counter = count() by name)
Test result is as below:
I'm trying to create a query in Application Insights that can show me the absolute and average number of messages in conversations over a particular time period. I'm using the LUIS trace example to get the context+LUIS information, which is where I'm pulling the conversationID from. I can get a table showing the number of messages per conversation, but I would also like to have a average number of messages for the data set. Either static average or rolling average (by pulling in timestamp) would be fine. I can get this value by doing a second summarize statement, but then I lose the granularity from the first. Here is my query.
requests
| where url endswith "messages"
| where timestamp > ago(30d)
| project timestamp, url, id
| parse kind = regex url with *"(?i)http://"botName".azurewebsites.net/api/messages"
| join kind= inner (
traces | extend id = operation_ParentId
) on id
| where message == "LUIS"
| extend convID = tostring(customDimensions.LUIS_botContext_conversation_id)
| order by timestamp desc nulls last
| project timestamp, botName, convID
| summarize messages=count() by conversation=convID
This gives me a table of conversation IDs with the message count for each conversation. I would also like to see the average number of messages per conversation. For example, if I have 4 conversations with 100 messages total, I want to see that the average is 25. I can get this result by doing a second summarize statement | summarize messages=sum(messages), avgMessages=avg(messages), but then of course I can no longer see the individual conversations. Is there any way to see both in the same table?
You can write two queries, one that "gives me a table of conversation IDs with the message count for each conversation", and another for "the average number of messages per conversation". Consider using a let statement for your queries.
The trick here is that, in both queries, after the summarize statement you add a line at the end like | extend myidentifier="aaa".
Then you can join the two queries on myidentifier.
I couldn't figure out how to do this without losing granularity from the first list (i.e. I couldn't figure out how to calculate the average per period, e.g. per day), but the following query does at least get me the average across whatever timestamp filter I set, which ultimately gets me the data I was looking for.
requests
| where url endswith "messages"
| where timestamp > ago(30d)
| project timestamp, url, id
| parse kind = regex url with *"(?i)http://"botName".azurewebsites.net/api/messages"
| join kind= inner (
traces | extend id = operation_ParentId
) on id
| where message == "LUIS"
| extend convID = tostring(customDimensions.LUIS_botContext_conversation_id)
| order by timestamp desc nulls last
| project timestamp, botName, convID
| summarize messages=count() by conversation=convID
| summarize conversations=count(), messageAverage=avg(messages)
I am using Application Insights to record custom measurements about our application. I have a customEvent that has data stored in the customMeasurements object. The object contains 4 key-value pairs. I have many of these customEvents and I am trying to average the key-value pairs from all the events and display the results in a 2-column table.
I want to have one table with 2 columns. The first column is the key name, and the second column is that key's value averaged across all the events.
For example, event1 has key1's value set to 2. event2 has key1's value set to 6. If those are the only two events I received in the last 7 days, I want my table to show the number 4 in the row containing data for key1.
I can only average 1 key per query since I cannot put multiple summarizes inside of 1 query... Here is what I have for averaging the first key in the customMeasurements object:
customEvents
| where name == "PerformanceMeasurements"
| where timestamp > ago(7d)
| summarize key1average=avg(toint(customMeasurements.key1))
| project key1average
But I need to average all the keys inside of this object and build 1 table as described above.
For reference, I have attached a screenshot of the layout of a customEvent customMeasurements object:
If the number of keys is limited and known beforehand, then I'd recommend using multiple aggregations within the | summarize operator, separated by commas:
| summarize key1average=avg(toint(customMeasurements.key1)), key2average=avg(toint(customMeasurements.key2)), key3average=avg(toint(customMeasurements.key3))
If the keys may vary, then you'd need to flatten out the custom dimensions first with the | mvexpand operator:
customEvents
| where timestamp > ago(1h)
| where name == "EventName"
| project customDimensions
| mvexpand bagexpansion=array customDimensions
| extend Key = customDimensions[0], Value = customDimensions[1]
| summarize avg(toint(Value)) by tostring(Key)
In this case, each Key-Value pair from customDimensions will become its own row and you will be able to operate on those with the standard query language constructs.
I've been playing with leveldb and it's really good at what it's designed to do--storing and getting key/value pairs based on keys.
But now I want to do something more advanced and find myself immediately stuck. Is there no way to find a record by value? The only way I can think of is to iterate through the entire database until I find an entry with the value I'm looking for. This becomes worse if I'm looking for multiple entries with the value (basically a "where" query) since I have to iterate through the entire database every time I try to do this type of query.
Am I trying to do what Leveldb isn't designed to do and should I be using another database instead? Or is there a nice way to do this?
You are right; basically, what you need to know about is key composition.
Also note that even in SQL you don't query by the value itself: a WHERE clause uses a boolean expression such as age = 42.
To answer your particular question, imagine you have a first key-value namespace in LevelDB where you store your objects, with the value serialized as JSON, for instance:
key | value
-------------------------------------------------
namespace | uid | value
================================================
users | 1 | {name:"amz", age=32}
------------------------------------------------
users | 2 | {name:"abki", age=42}
In another namespace, you index user uids by age:
key | value
----------------------------------
namespace | age | uid | value
==================================
users-by-age | 32 | 1 | empty
----------------------------------
users-by-age | 42 | 2 | empty
Here the value is empty because the key must be unique. What we would normally think of as the value of these rows, the uid column, is composed into the key to make each row's key unique.
In that second namespace, every key that starts with the prefix (users-by-age, 32) matches records that answer the query age = 32.
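To make this concrete, here is a small Python 3 sketch of the idea using the plyvel bindings for LevelDB; the separator character, the namespace names, and the zero-padding of ages are illustrative choices, not anything LevelDB itself prescribes:
import json
import plyvel

db = plyvel.DB("/tmp/example-db", create_if_missing=True)

def put_user(uid, name, age):
    # Primary namespace: key = (users, uid), value = serialized object.
    db.put(f"users|{uid}".encode(), json.dumps({"name": name, "age": age}).encode())
    # Index namespace: key = (users-by-age, age, uid), value left empty.
    db.put(f"users-by-age|{age:03d}|{uid}".encode(), b"")

put_user(1, "amz", 32)
put_user(2, "abki", 42)

# "WHERE age = 32": scan the index namespace by key prefix, then fetch the objects.
for key, _ in db.iterator(prefix=b"users-by-age|032|"):
    uid = key.decode().rsplit("|", 1)[1]
    print(json.loads(db.get(f"users|{uid}".encode())))
The zero-padding matters because keys are compared as byte strings, so fixed-width numbers keep the index sorted numerically.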
I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get all tasks that are applicable within that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here because to add a range query to Ends_On, I have to have an equality check for Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
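As a rough illustration of that client-side filtering, here is a Python 3 sketch with the DataStax driver; it assumes a schema with a bucket-style partition key p (since Task_ID itself is too unique to range query across, as noted further down) and illustrative keyspace, table, and column names:
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

t1 = datetime(2015, 8, 24)  # start of the period of interest
t2 = datetime(2015, 8, 25)  # end of the period of interest

# Server side: range-restrict on the clustering column starts_on only.
rows = session.execute(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on <= %s",
    (1, t2),
)

# Client side: discard rows whose ends_on falls before the period started.
overlapping = [row for row in rows if row.ends_on >= t1]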
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
    p int,
    task_id timeuuid,
    start int,
    end int,
    end_range int static,
    PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
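From Python 3, a quick sketch of running that query with the DataStax driver and dropping the null rows on the client (the host is illustrative, and the example output above suggests the keyspace is test):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("test")

rows = session.execute(
    "SELECT start, end, my_end_range(task_id, end, end_range) AS task_id "
    "FROM tasks WHERE p = 1 AND start >= 2"
)
matching_ids = [row.task_id for row in rows if row.task_id is not None]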
Anyway not a perfect method obviously, but it might be something you can work with. :)
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
    userid UUID,
    eventTime TIMEUUID,
    eventType TEXT,
    eventDesc TEXT,
    PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On<? where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.
I then iterate through the results until the row Ends_On is earlier than the start of the period of interest and throw away the rest of the results rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small, the driver ends up retrieving multiple pages, which could hurt performance. If it is too large, you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than is necessary. In practice you can probably find a good compromise.
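If it helps, here is a hedged Python 3 sketch of tuning the page size with the DataStax driver, under the same kind of schema assumptions as above (some partition key p, Starts_On clustered in descending order, illustrative names throughout):
from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

t1 = datetime(2015, 8, 24)  # start of the period of interest
t2 = datetime(2015, 8, 25)  # end of the period of interest

# Fetch pages of ~50 rows; pick a value just above the number of rows you
# expect to actually keep.
statement = SimpleStatement(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on < %s",
    fetch_size=50,
)

overlapping = []
for row in session.execute(statement, (1, t2)):
    if row.ends_on < t1:
        break  # rows come back newest first, so (assuming no overlaps) the rest end even earlier
    overlapping.append(row)
Breaking out of the loop early stops the driver from fetching further pages, which is what keeps the amount of discarded data small.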