DynamoDB schema design for timeseries data without clear partition key - rust

I'm playing around with dynamodb and CRUD operations for a small app that I'm tinkering with.
tl;dr I have a dataset that is
pub struct TransferOnly {
    pub ts: String,
    pub block: String,
    pub from: String,
    pub to: String,
    pub value: String,
}
where
ts = Unix timestamp
block = hex string
from = hex string
to = hex string
value = hex string.
We definitely want ts as our sort key, but we are unsure what to use as our partition key. block is not unique (neither is ts, nor the combination of both), from and to are not unique per block, and neither is value.
We want to optimize for searching and aggregating froms and tos (i.e. everyone that sent money to x, total money sent to x from y) between certain dates.
We're using DynamoDB because Timestream doesn't have a Rust SDK. What's an optimal database schema design here?

Please refer to the following link: DynamoDBTimeSeries.
Copying a few excerpts from that page:
The following design pattern often handles this kind of scenario effectively:
Create one table per period, provisioned with the required read and write capacity and the required indexes.
Before the end of each period, prebuild the table for the next period. Just as the current period ends, direct event traffic to the new table. You can assign names to these tables that specify the periods they have recorded.
As soon as a table is no longer being written to, reduce its provisioned write capacity to a lower value (for example, 1 WCU), and provision whatever read capacity is appropriate. Reduce the provisioned read capacity of earlier tables as they age. You might choose to archive or delete the tables whose contents are rarely or never needed.
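Applying that to the question, here is a minimal Rust sketch of how one period's table might be written to and queried with the aws-sdk-dynamodb crate (1.x module paths). The table name, the choice of to as the partition key with ts as the sort key, and the GSI mentioned in the comments are assumptions for illustration, not something the question or the AWS guide prescribes:

use std::collections::HashMap;

use aws_sdk_dynamodb::{types::AttributeValue, Client};

// Hypothetical layout for one period table (e.g. "transfers_2024_06"):
//   partition key = `to` address, sort key = `ts`.
// Since `ts` alone is not unique per `to`, a real table would append a
// tiebreaker (e.g. the block hash) to the sort key, and a GSI keyed on
// `from` would serve the reverse lookup.
async fn put_transfer(
    client: &Client,
    table: &str,
    t: &TransferOnly,
) -> Result<(), aws_sdk_dynamodb::Error> {
    client
        .put_item()
        .table_name(table)
        .item("to", AttributeValue::S(t.to.clone()))
        .item("ts", AttributeValue::S(t.ts.clone()))
        .item("block", AttributeValue::S(t.block.clone()))
        .item("from", AttributeValue::S(t.from.clone()))
        .item("value", AttributeValue::S(t.value.clone()))
        .send()
        .await?;
    Ok(())
}

// "Everyone that sent money to x between two dates", within one period table.
// Note: `ts` is compared as a string, so it should be fixed-width (zero-padded)
// for BETWEEN to behave like a numeric range.
async fn transfers_to(
    client: &Client,
    table: &str,
    to: &str,
    start_ts: &str,
    end_ts: &str,
) -> Result<Vec<HashMap<String, AttributeValue>>, aws_sdk_dynamodb::Error> {
    let resp = client
        .query()
        .table_name(table)
        .key_condition_expression("#to = :to AND #ts BETWEEN :start AND :end")
        .expression_attribute_names("#to", "to")
        .expression_attribute_names("#ts", "ts")
        .expression_attribute_values(":to", AttributeValue::S(to.to_owned()))
        .expression_attribute_values(":start", AttributeValue::S(start_ts.to_owned()))
        .expression_attribute_values(":end", AttributeValue::S(end_ts.to_owned()))
        .send()
        .await?;
    Ok(resp.items.unwrap_or_default())
}

Totals such as "money sent to x from y between dates" would then be computed client-side, by summing value over the returned items (or over a query against a GSI keyed on from), since DynamoDB has no server-side aggregation.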

Related

Storing arrays in Cassandra

I have lots of fast incoming data that is organised as follows:
Lots of 1D arrays, one per logical object, where the position of each element in the array is important and each element is calculated and produced individually in parallel, and so not necessarily in order.
The data arrays themselves are not necessarily written in order.
The length of the arrays may vary.
The data is read as an entire array at a time, so it makes sense to store the entire thing together.
The way I see it, the issue is primarily caused by the way the data is made available for writing. If it was all available together I'd just store the entire lot together at the same time and be done with it.
For smaller data loads I can get away with the postgres array datatype. One row per logical object with a key and an array column. This allows me to scale by having one writer per array, writing the elements in any order without blocking any other writer. This is limited by the rate of a single postgres node.
In Cassandra/Scylla it looks like I have the options of either:
Storing each element as its own row, which would be very fast for writing; reads would be more cumbersome but doable and involve potentially lots of very wide scans,
or converting the array to json/string, reading the cell, tweaking the value, then re-writing it, which would be horribly slow and lead to lots of compaction overhead,
or having the writer buffer until it receives all the array values and then writing the array in one go, except the writer won't know how long the array should be and will need a timeout to write down whatever it has by that time, which ultimately means I'll need to update it at some point in the future if the late data turns up.
What other options do I have?
Thanks
Option 1 seems to be a good match.
I assume each logical object has a unique id (or better, a uuid).
In such a case, you can create something like
CREATE TABLE tbl (id uuid, ord int, v text, PRIMARY KEY (id, ord));
where id is the partition key and ord is the clustering (ordering) key, storing each "array" as a partition and each value as a row.
This allows (see the sketch below):
fast retrieval of the entire "array", even a big one, using paging
fast retrieval of a single element by its index in the array
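As a rough illustration of those two read paths from Rust, here is a hedged sketch assuming the scylla crate's pre-1.0 Session API (method names such as query and rows_typed differ between driver versions, so treat it as a sketch rather than a drop-in implementation); the table and columns are the ones from the CREATE TABLE above:

use scylla::Session;
use uuid::Uuid;

// Fetch the whole "array": one partition, rows ordered by the clustering key `ord`.
// For very large partitions, Session::query_iter would page through the rows instead.
async fn read_array(
    session: &Session,
    id: Uuid,
) -> Result<Vec<(i32, String)>, Box<dyn std::error::Error>> {
    let rows = session
        .query("SELECT ord, v FROM tbl WHERE id = ?", (id,))
        .await?
        .rows_typed::<(i32, String)>()?
        .collect::<Result<Vec<_>, _>>()?;
    Ok(rows)
}

// Fetch a single element by its position in the "array".
async fn read_element(
    session: &Session,
    id: Uuid,
    ord: i32,
) -> Result<Option<String>, Box<dyn std::error::Error>> {
    let row = session
        .query("SELECT v FROM tbl WHERE id = ? AND ord = ?", (id, ord))
        .await?
        .maybe_first_row_typed::<(String,)>()?;
    Ok(row.map(|(v,)| v))
}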

Hazelcast Projection fetch single field

I have a question regarding the Hazelcast Projection API.
Let's say I want to fetch just a single field from an entry in the map using Portable serialization.
In this example a name from an employee.
I'm guessing I will be getting better performance in relation to network traffic and deserialization by using a projection like this:
public String getName(Long key) {
    return map.project(
            (Projection<Entry<Long, Employee>, String>) entry -> entry.getValue().getName(),
            (Predicate<Long, Employee>) mapEntry -> mapEntry.getKey().equals(key))
        .stream()
        .findFirst()
        .orElse(null);
}
Instead of something like:
public String getName(Long key) {
    return map.get(key).getName();
}
A Projection comes in handy when you want to return only a part of the value object across a result set of many value objects. For a single key-based lookup, it is overkill; a map.get is a lighter-weight operation than running a predicate.
Network traffic = not sure how much saving you will have, as that depends on your network bandwidth + size of the object + number of concurrent objects traveling.
Deserialization = not much saving unless the actual object stored as the value is monstrous and the field you are extracting is a tiny part of it.
If you are conscious of network bandwidth and ser/des cost, then keep in-memory-format set to OBJECT and use an EntryProcessor to update. If you do not have anything to update, then use an ExecutorService.

Strategy for storing application logs in Azure Table Storage

I am trying to determine a good strategy for storing logging information in Azure Table Storage. I have the following:
PartitionKey: The name of the log.
RowKey: Inverted DateTime ticks.
The only issue here is that partitions could get very large (millions of entities) and the size will increase with time.
But that being said, the type of queries being performed will always include the PartitionKey (no scanning) AND a RowKey filter (a minor scan).
For example (in a natural language):
where `PartitionKey` = "MyApiLogs" and
where `RowKey` is between "01-01-15 12:00" and "01-01-15 13:00"
Provided that the query is done on both PartitionKey and RowKey, I understand that the size of the partition doesn't matter.
Take a look at our new Table Design Patterns Guide - specifically the log-data anti-pattern as it talks about this scenario and alternatives. Often when people write log files they use a date for the PK which results in a partition being hot as all writes go to a single partition. Quite often Blobs end up being a better destination for log data - as people typically end up processing the logs in batches anyway - the guide talks about this as an option.
Adding my own answer so people can have something inline without needing external links.
You want the partition key to be the timestamp plus the hash code of the message. This is good enough in most cases. You can add to the hash code of the message the hash code(s) of any additional key/value pairs as well if you want, but I've found it's not really necessary.
Example:
string partitionKey = DateTime.UtcNow.ToString("o").Trim('Z', '0') + "_" + ((uint)message.GetHashCode()).ToString("X");
string rowKey = logLevel.ToString();
DynamicTableEntity entity = new DynamicTableEntity { PartitionKey = partitionKey, RowKey = rowKey };
// add any additional key/value pairs from the log call to the entity, i.e. entity["key"] = value;
// use InsertOrMerge to add the entity
When querying logs, you can use a query whose partition key marks the start of the window you want to retrieve, usually something like 1 minute or 1 hour back from the current date/time. You can then page backwards another minute or hour with a different date/time stamp. This avoids the weird date/time hack of subtracting the timestamp from DateTime.MaxValue.
If you get extra fancy and put a search service on top of the Azure table storage, then you can lookup key/value pairs quickly.
This will be much cheaper than Application Insights if you are using Azure Functions, in which case I would suggest disabling Application Insights. If you need multiple log names, just add another table.

When to use Blobs in a Cassandra (and CQL) table and what are blobs exactly?

I was trying to better understand the design decisions made when creating table entries in Cassandra, and when the blob type is a good choice.
I realized I didn't really know when to choose a blob as a data type because I was not sure what a blob really was (or what the acronym stood for). Thus I decided to read the following documentation for the blob data type:
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/blob_r.html
Blob
Cassandra 1.2.3 still supports blobs as string constants for input (to allow smoother transition to blob constant). Blobs as strings are
now deprecated and will not be supported in the near future. If you
were using strings as blobs, update your client code to switch to blob
constants. A blob constant is a hexadecimal number defined by
0x(hex)+ where hex is a hexadecimal character, such as
[0-9a-fA-F]. For example, 0xcafe.
Blob conversion functions
A number of functions convert the native types into binary data (blob). For every <native-type> nonblob type supported by CQL3, the
typeAsBlob function takes an argument of type type and returns it as a
blob. Conversely, the blobAsType function takes a blob argument and
converts it back to a value of that type. For example, bigintAsBlob(3) is
0x0000000000000003 and blobAsBigint(0x0000000000000003) is 3.
What I got out of it is that it's just a long hexadecimal/binary value. However, I don't really appreciate when I would use it as a column type for a potential table, or how it's better or worse than other types. Also, going through some of its properties might be a good way to figure out what situations blobs are good for.
Blobs (Binary Large OBjects) are the solution for when your data doesn't fit into the standard types provided by C*. For example, say you wanted to make a forum where users were allowed to upload files of any type. To store these in C* you would use a blob column (or possibly several blob columns, since you don't want individual cells to become too large).
Another example might be a table where users are allowed to have a current photo, this photo could be added as a blob and be stored along with the rest of the user information.
According to the 3.x documentation, the blob type is suitable for storing a small image or a short string.
In my case I used it to store a hashed value, as the hash function returns binary data and storing it as binary is the best option from the point of view of table data size.
(Converting it to a string and storing it as string (text) would also be OK, if size is not a consideration.)
The results below show my test on a local machine (inserting 1 million records); the sizes are 52,626,907 bytes (binary) and 72,879,839 bytes (base64-converted data stored as string).
CREATE TABLE IF NOT EXISTS testks.bin_data (
    bin_data blob,
    PRIMARY KEY(bin_data)
);
CREATE TABLE IF NOT EXISTS testks.base64_data (
    base64_data text,
    PRIMARY KEY(base64_data)
);
cqlsh> select * from testks.base64_data limit 10;
base64_data
------------------------------
W0umEPMzL5O81v+tTZZPKZEWpkI=
bGUzPm4zRvcqK1ogwTvPNPNImvk=
Nsr0GKx6LjXaiZSwATU38Ffo7fA=
A6lBV69DbFz/UFWbxolb+dlLcLc=
R919DvcyqBUup+NrpRyRvzJD+/E=
63LEynDKE5RoEDd1M0VAnPPUtIg=
FPkOW9+iPytFfhjdvoqAzbBfcXo=
uHvtEpVIkKivS130djPO2f34WSM=
fzEVf6a5zk/2UEIU8r8bZDHDuEg=
fiV4iKgjuIjcAUmwGmNiy9Y8xzA=
(10 rows)
cqlsh> select * from testks.bin_data limit 10;
bin_data
--------------------------------------------
0xb2af015062e9aba22be2ab0719ddd235a5c0710f
0xb1348fa7353e44a49a3363822457403761a02ba8
0x4b3ecfe764cbb0ba1e86965576d584e6e616b03e
0x4825ef7efb86bbfd8318fa0b0ac80eaa2ece9ced
0x37bdad7db721d040dcc0b399f6f81c7fd2b5cea6
0x3de4ca634e3a053a1b0ede56641396141a75c965
0x596ec12d9d9afeb5b1b0bb42e42ad01b84302811
0xbf51709a8d1a449e1eea09ef8a45bdd2f732e8ec
0x67dcb3b6e58d8a13fcdc6cf0b5c1e7f71b416df6
0x7e6537033037cc5c028bc7c03781882504bdbd65

Redis key design for real-time stock application

I am trying to build a real-time stock application.
Every second I can get some data from a web service like below:
[{"amount":"20","date":1386832664,"price":"183.8","tid":5354831,"type":"sell"},{"amount":"22","date":1386832664,"price":"183.61","tid":5354833,"type":"buy"}]
tid is the ticket ID for stock buying and selling;
date is seconds since 1970-01-01;
price/amount is at what price and how many stocks were traded.
Requirement
My requirement is to show the user the highest/lowest price for every minute/5 minutes/hour/day in real time, and to show the user the sum of the amount for every minute/5 minutes/hour/day in real time.
Question
My question is how to store the data in Redis so that I can easily and quickly get the highest/lowest trade from the DB for different periods.
My design is something like below:
[date]:[tid]:amount
[date]:[tid]:price
[date]:[tid]:type
I am new to Redis. If the design is like this, does that mean I need to use a sorted set, and will there be any performance issue? Or is there some other way to get the highest/lowest price for different periods?
Looking forward to your suggestions and design.
My suggestion is to store min/max/total for all intervals you are interested in and update them for the current intervals with every arriving data point. To avoid network latency when reading previous data for comparison, you can do it entirely inside the Redis server using Lua scripting.
One key per data point (or, even worse, per data point field) is going to consume too much memory. For the best results, you should group it into small lists/hashes (see http://redis.io/topics/memory-optimization). Redis only allows one level of nesting in its data structures: if your data has multiple fields and you want to store more than one item per key, you need to somehow encode it yourself. Fortunately, the standard Redis Lua environment includes msgpack support, which is a very efficient binary JSON-like format. The JSON entries in your example encoded with msgpack "as is" will be 52-53 bytes long. I suggest grouping by time so that you have 100-1000 entries per key. Suppose a one-minute interval fits this requirement. Then the keying scheme would be like this:
YYmmddHHMMSS — a hash from tid to msgpack-encoded data points for the given minute.
5m:YYmmddHHMM, 1h:YYmmddHH, 1d:YYmmdd — window data hashes which contain min, max, sum fields.
Let's look at a sample Lua script that will accept one data point and update all keys as necessary. Due to the way Redis scripting works, we need to explicitly pass the names of all keys that will be accessed by the script, i.e. the live data key and all three window keys. Redis Lua also has a JSON parsing library available, so for the sake of simplicity let's assume we just pass it a JSON dictionary. That means we have to parse the data twice, on the application side and on the Redis side, but the performance effect of this is not clear.
local function update_window(winkey, price, amount)
    -- HGETALL returns a flat array in Lua, so fetch the fields we need directly;
    -- missing fields come back as false.
    local wmax, wmin, wsum = unpack(redis.call('HMGET', winkey, 'max', 'min', 'sum'))
    if price > tonumber(wmax or 0) then
        redis.call('HSET', winkey, 'max', price)
    end
    if price < tonumber(wmin or 1e12) then
        redis.call('HSET', winkey, 'min', price)
    end
    redis.call('HSET', winkey, 'sum', (tonumber(wsum) or 0) + amount)
end
local currkey, fiveminkey, hourkey, daykey = unpack(KEYS)
local data = cjson.decode(ARGV[1])
local packed = cmsgpack.pack(data)
local tid = data.tid
redis.call('HSET', currkey, tid, packed)
local price = tonumber(data.price)
local amount = tonumber(data.amount)
update_window(fiveminkey, price, amount)
update_window(hourkey, price, amount)
update_window(daykey, price, amount)
This setup can do thousands of updates per second, is not very hungry on memory, and window data can be retrieved instantly.
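For completeness, a hedged sketch of calling the script from Rust with the redis crate. The literal keys below are placeholders in the keying scheme above (in practice you would derive them from the data point's date field), and lua_src holds the script text:

use redis::{Client, Script};

// Hypothetical wrapper: pushes one raw JSON data point through the Lua script above.
// The four keys follow the keying scheme from the answer.
fn push_data_point(lua_src: &str, json_point: &str) -> redis::RedisResult<()> {
    let client = Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    Script::new(lua_src)
        .key("201312120757")     // per-minute hash of raw points
        .key("5m:201312120755")  // 5-minute window hash (min/max/sum)
        .key("1h:2013121207")    // 1-hour window hash
        .key("1d:20131212")      // 1-day window hash
        .arg(json_point)         // raw JSON entry from the feed
        .invoke(&mut con)
}

Reading a window back is then a plain HGETALL (or HMGET of min/max/sum) on the corresponding 5m:/1h:/1d: key.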
UPDATE: On the memory side, 50-60 bytes per point is still a lot if you want to store more than a few million of them. With this kind of data I think you can get as low as 2-3 bytes per point using a custom binary format, delta encoding, and subsequent compression of chunks using something like snappy. Whether it's worth doing this depends on your requirements.
