CouchDB sorting - Collation Specification

Collation Specification
Using a CouchDB view, it seems my keys aren't sorted according to the collation specification.
rows:
[0] key: ["bylatest", -1294536544000] value: 1
[1] key: ["bylatest", -1298817134000] value: 1
[2] key: ["bylatest", -1294505612000] value: 1
I would have expected the second entry to come after the third.
Why is this happening?

I get this result for a view emitting the values you indicate.
{"total_rows":3,"offset":0,"rows":[
{"id":"29e86c6bf38b9068c56ab1cd0100101f","key":["bylatest",-1298817134000],"value":1},
{"id":"29e86c6bf38b9068c56ab1cd0100101f","key":["bylatest",-1294536544000],"value":1},
{"id":"29e86c6bf38b9068c56ab1cd0100101f","key":["bylatest",-1294505612000],"value":1}
]}
The rows are different from both of your examples. They fit the collation specification by beginning with the smallest (greatest magnitude negative) value and ending with the greatest value (least magnitude negative in this case).
Would you perhaps include the documents and map/reduce functions you are using?
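For reference, a map function along these lines reproduces keys like the ones shown above; this is only a sketch, and the doc.latest timestamp field is an assumption about your documents:
function (doc) {
  if (doc.latest) {
    // Negate the millisecond timestamp so newer documents sort first
    emit(["bylatest", -doc.latest], 1);
  }
}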

Related

How to keep track of updated fields of an RDD in Spark?

There is an issue I am dealing with regarding keeping track of updated fields in a Spark RDD.
Assume that we have an RDD like this:
(1,2)
(2,10)
(5,9)
(3,8)
(8,15)
Based on some conditions, the value of some keys may change. For example, the value of key=2 changes from 10 to 11. Then the value of any key whose value equals the key of the updated row should change accordingly: key=1 has the value 2, and 2 is a key in another row, so because the value of key=2 changes to 11, the value of key=1 should change to 11 too. After some execution the RDD looks like this:
(1,11)
(2,11)
(5,9)
(3,7)
(8,7)
Is there any efficient way to implement this?
Assuming you are talking about a DStream (of RDDs), you can use the updateStateByKey method.
To use updateStateByKey, you need to provide a function update(events, oldState) that takes in the events that arrived for a key and its previous state, and returns a newState to store for it.
events: a list of the events that arrived for the key in the current batch (may be empty).
oldState: an optional state object, stored within an Option; it might be missing if there was no previous state for the key.
newState: the value returned by the function, also an Option.
The result of updateStateByKey() will be a new DStream that contains an RDD of (key, state) pairs.
Basic Example:
def myUpdate(values: Seq[Long], state: Option[Long]): Option[Long] = {
  // Keep a running total: add this batch's values to the previous state (0 if absent)
  Some(values.sum + state.getOrElse(0L))
}
myDStream.updateStateByKey(myUpdate _)
Background taken from the book "Learning Spark" (O'Reilly).

How to sort data by rating with Aws Lambda using nodeJS

I have a database on DynamoDB and I write user scores to it from a Lambda function written in Node.js. I want to get the top 10 users with the most points. How could I scan for these users?
Thanks a lot.
Max() in NoSQL is much trickier than in SQL, and it doesn't really scale. If you need very high scalability for this, let me know, but let's get back to the question.
Assuming your table looks like:
User
----------
userId - hashKey
score
...
Add a dummy category attribute to your table, which will be constant (for example the value "A"). Then create a global secondary index (GSI):
category - hash key
score - sort key
Query this index by hash key "A" in reverse order to get results much faster than a scan. But this only scales to about 10GB (the maximum partition size, since all the data lives in the same partition). Also make sure you project only the needed attributes into this index, in order to save space.
You can go up to 30GB, for example, by using 3 categories ("A", "B", "C"), executing 3 queries, and merging the results programmatically (a rough sketch of that merge follows the query example below). This will affect performance a bit, but it is still better than a full scan.
EDIT
var params = {
  TableName: 'MyTableName',
  // Name of the GSI described above (category hash key, score sort key); adjust to your own index name
  IndexName: 'category-score-index',
  Limit: 10,
  // Set ScanIndexForward to false to return the highest scores first
  ScanIndexForward: false,
  KeyConditionExpression: 'category = :category',
  ExpressionAttributeValues: {
    ':category': {
      S: 'A',
    },
  },
};
dynamo.query(params, function(err, data) {
  // handle data (data.Items contains the top 10 items by score)
});
source: https://www.debassociates.com/blog/query-dynamodb-table-from-a-lambda-function-with-nodejs-and-apex-up/
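If you do go the multi-category route, the merge step could look roughly like the sketch below. This is only a sketch: it assumes the aws-sdk DocumentClient, a GSI named category-score-index, and a numeric score attribute, so adjust the names to your own table.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

async function topTenUsers() {
  // Query each category bucket for its own top 10 scores
  const categories = ['A', 'B', 'C'];
  const results = await Promise.all(categories.map(category =>
    dynamo.query({
      TableName: 'MyTableName',
      IndexName: 'category-score-index', // assumed GSI name
      Limit: 10,
      ScanIndexForward: false,           // highest scores first
      KeyConditionExpression: 'category = :category',
      ExpressionAttributeValues: { ':category': category },
    }).promise()
  ));

  // Merge the per-bucket results and keep the overall top 10
  return results
    .reduce((all, page) => all.concat(page.Items), [])
    .sort((a, b) => b.score - a.score)
    .slice(0, 10);
}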

Unable to execute a timeseries query using a timeuuid as the primary key

My goal is to sum the messages_sent and emails_sent for each distinct provider_id value within a given time range (fromDate < stats_date_id < toDate), but without specifying a provider_id. In other words, I need to find any and all providers within the specified time range and sum their messages_sent and emails_sent.
I have a Cassandra table using an express-cassandra schema (in Node.js) as follows:
module.exports = {
fields: {
stats_provider_id: {
type: 'uuid',
default: {
'$db_function': 'uuid()'
}
},
stats_date_id: {
type: 'timeuuid',
default: {
'$db_function': 'now()'
}
},
provider_id: 'uuid',
provider_name: 'text',
messages_sent: 'int',
emails_sent: 'int'
},
key: [
[
'stats_date_id'
],
'created_at'
],
table_name: 'stats_provider',
options: {
timestamps: {
createdAt: 'created_at', // defaults to createdAt
updatedAt: 'updated_at' // defaults to updatedAt
}
}
}
To get it working, I was hoping it'd be as simple as doing the following:
let query = {
stats_date_id: {
'$gt': db.models.minTimeuuid(fromDate),
'$lt': db.models.maxTimeuuid(toDate)
}
};
let selectQueries = [
'provider_name',
'provider_id',
'count(direct_sent) as direct_sent',
'count(messages_sent) as messages_sent',
'count(emails_sent) as emails_sent',
];
// Query stats_provider table
let providerData = await db.models.instance.StatsProvider.findAsync(query, {select: selectQueries});
This, however, complains about needing to filter the results:
Error during find query on DB -> ResponseError: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.
I'm guessing you can't have a primary key and do date range searches on it? If so, what is the correct approach to this sort of query?
So while I have not used Express-Cassandra, I can tell you that running a range query on your partition key is a hard "no." The reason for this is that Cassandra can't determine a single node for that query, so it has to poll every node. As that's essentially a full scan of your table across multiple nodes, it throws that error to prevent you from running a bad query.
However, you can run a range query on a clustering key, provided that you are filtering on all of the keys prior to it. In your case, if I'm reading this right, your PRIMARY KEY looks like:
PRIMARY KEY (stats_date_id, created_at)
That primary key definition is going to be problematic for two reasons:
stats_date_id is a TimeUUID. This is great for data distribution. But it sucks for query flexibility. In fact, you will need to provide that exact TimeUUID value to return data for a specific partition. As a TimeUUID has millisecond precision, you'll need to know the exact time to query down to the millisecond. Maybe you have the ability to do that, but usually that doesn't make for a good partition key.
Any rows underneath that partition (created_at) will have to share that exact time, which usually leads to a lot of 1:1 cardinality ratios for partition:clustering keys.
My advice on fixing this, is to partition on a date column that has a slightly lower level of cardinality. Think about how many provider messages are usually saved within a certain timeframe. Also pick something that won't store too many provider messages together, as you don't want unbound partition growth (Cassandra has a hard limit of 2 billion cells per partition).
Maybe something like: PRIMARY KEY (week,created_at)
So then your CQL queries could look something like:
SELECT * FROM stats_provider
WHERE week='201909w1'
AND created_at > '20190901'
AND created_at < '20190905';
TL;DR;
Partition on a time bucket not quite as precise as something down to the ms, yet large enough to satisfy your usual query.
Apply the range filter on the first clustering key, within a partition.
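In express-cassandra terms (which the question uses), that layout might look roughly like the sketch below; the week field and its bucketing format are assumptions, not part of the original schema:
module.exports = {
  fields: {
    week: 'text',            // e.g. '201909w1', computed by the application before saving
    provider_id: 'uuid',
    provider_name: 'text',
    messages_sent: 'int',
    emails_sent: 'int'
  },
  key: [
    [
      'week'                 // partition key: one partition per week bucket
    ],
    'created_at'             // clustering key: orders rows by time within the bucket
  ],
  table_name: 'stats_provider',
  options: {
    timestamps: {
      createdAt: 'created_at',
      updatedAt: 'updated_at'
    }
  }
}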

IndexedDB getAll() ordering

I'm using getAll() method to get all items from db.
db.transaction('history', 'readonly').objectStore('history').getAll().onsuccess = ...
My ObjectStore is defined as:
db.createObjectStore('history', { keyPath: 'id', autoIncrement: true });
Can I count on the ordering of the items I get? Will they always be sorted by primary key id?
(or is there a way to specify sort explicitly?)
I could not find any info about ordering in the official docs.
If the docs don't help, consult the specs:
getAll refers to "steps for retrieving multiple referenced values"
the retrieval steps refer to "first count records in index"
the specification of index contains the following paragraph:
The records in an index are always sorted according to the record's key. However unlike object stores, a given index can contain multiple records with the same key. Such records are additionally sorted according to the index's record's value (meaning the key of the record in the referenced object store).
Reading backwards: An index is sorted. getAll retrieves the first N of an index, i.e. it is order-preserving. Therefore the result itself should retain the sort order.
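As for specifying the sort explicitly: getAll itself takes no direction argument, but a cursor does, so you can, for example, read the store in descending key order. A minimal sketch, reusing the history store from the question:
var tx = db.transaction('history', 'readonly');
var items = [];
// 'prev' walks the store in descending primary-key order; omit it (or use 'next') for ascending
tx.objectStore('history').openCursor(null, 'prev').onsuccess = function (event) {
  var cursor = event.target.result;
  if (cursor) {
    items.push(cursor.value); // records arrive sorted by id, newest first
    cursor.continue();
  } else {
    // items now holds every record in descending id order
  }
};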

Storing time ranges in cassandra

I'm looking for a good way to store data associated with a time range, in order to be able to efficiently retrieve it later.
Each entry of data can be simplified as (start time, end time, value). I will need to later retrieve all the entries which fall inside a (x, y) range. In SQL, the query would be something like
SELECT value FROM data WHERE starttime <= x AND endtime >= y
Can you suggest a structure for the data in Cassandra which would allow me to perform such queries efficiently?
This is an oddly difficult thing to model efficiently.
I think using Cassandra's secondary indexes (along with a dummy indexed value which is unfortunately still needed at the moment) is your best option. You'll need to use one row per event with at least three columns: 'start', 'end', and 'dummy'. Create a secondary index on each of these. The first two can be LongType and the last can be BytesType. See this post on using secondary indexes for more details. Since you have to use an EQ expression on at least one column for a secondary index query (the unfortunate requirement I mentioned), the EQ will be on 'dummy', which can always be set to 0. (This means that the EQ index expression will match every row and essentially be a no-op.) You can store the rest of the event data in the row alongside start, end, and dummy.
In pycassa, a Python Cassandra client, your query would look like this:
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.index import *

# Assumed setup for completeness: 'MyKeyspace' is a placeholder keyspace name,
# and 'entries' is the column family holding one row per event
pool = ConnectionPool('MyKeyspace')
entries = ColumnFamily(pool, 'entries')
start_time = 12312312000
end_time = 12312312300
start_exp = create_index_expression('start', start_time, GT)
end_exp = create_index_expression('end', end_time, LT)
dummy_exp = create_index_expression('dummy', 0, EQ)
clause = create_index_clause([start_exp, end_exp, dummy_exp], count=1000)
for result in entries.get_indexed_slices(clause):
    # do stuff with result
    pass
There should be something similar in other clients.
The alternative that I considered first involved OrderPreservingPartitioner, which is almost always a Bad Thing. For the index, you would use the start time as the row key and the finish time as the column name. You could then perform a range slice with start_key=start_time and column_finish=finish_time. This would scan every row after the start time and only return those with columns before the finish_time. Not very efficient, and you have to do a big multiget, etc. The built-in secondary index approach is better because nodes will only index local data and most of the boilerplate indexing code is handled for you.
