N different versions of a single key that points to a json value with size M or N different keys with values of size M / N? - couchdb

I want to ask which approach is better - to have N different versions of a single key that points to a json value with size M or to have N different keys with values of size M / N?
I'm using CouchDB as the state database.
Example:
Single key with many versions (each value is appended by a different chaincode invocation):
"singleKey:1" -> {"values":[v1]}
"singleKey:2" -> {"values":[v1, v2]}
"singleKey:3" -> {"values":[v1, v2, v3]}
...
"singleKey:m" -> {"values":[v1, v2, v3, ..., vm]}
Multiple keys with one version:
"key1:1" -> {"value":"v1"}
"key2:1" -> {"value":"v2"}
"key3:1" -> {"value":"v3"}
...
"keym:1" -> {"value":"vm"}
Are there any optimizations for persisting arrays in the ledger? For example, keeping only the changes instead of copying everything.

I'm not sure I understood your question correctly, but you generally have two approaches for doing this. Before going into the details: storing a single key whose array value gets a new version appended each time is a strict no-no.
This is because, when you modify the same key concurrently or in different transactions within the same block, you will end up with an MVCC_READ_CONFLICT error, since Fabric uses optimistic locking for committing read/write sets.
Coming back to the approaches (both are state-DB agnostic; you can use CouchDB or goleveldb):
Approach 1:
If you need to address a specific version while fetching the value, store each key/version pair as a composite key:
key1-ver1 -> val1
key1-ver2 -> val2
.. and so on
https://github.com/hyperledger/fabric/blob/release-1.2/core/chaincode/shim/interfaces.go#L128
https://github.com/hyperledger/fabric/blob/release-1.2/core/chaincode/shim/interfaces.go#L121
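If it helps, here is a minimal sketch of Approach 1 assuming the Java chaincode shim (the links above are the equivalent Go shim APIs); the "key~version" object type and class name are only illustrative:
import org.hyperledger.fabric.shim.ChaincodeStub;
import org.hyperledger.fabric.shim.ledger.CompositeKey;

public class VersionedState {
    // Store one value per (key, version) pair under a composite key.
    public static void putVersion(ChaincodeStub stub, String key, String version, byte[] value) {
        CompositeKey ck = stub.createCompositeKey("key~version", key, version);
        stub.putState(ck.toString(), value);
    }

    // Fetch the value stored for one specific version of a key.
    public static byte[] getVersion(ChaincodeStub stub, String key, String version) {
        CompositeKey ck = stub.createCompositeKey("key~version", key, version);
        return stub.getState(ck.toString());
    }
}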
Approach 2:
If you do not need to address a specific version while fetching, and only need to retrieve the previous versions, then note that Fabric internally stores the modification history of every key using its own mechanism. You can query this history through the chaincode APIs.
https://godoc.org/github.com/hyperledger/fabric/core/chaincode/shim#ChaincodeStub.GetHistoryForKey
https://github.com/hyperledger/fabric/blob/release-1.2/core/chaincode/shim/interfaces.go#L161
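And a minimal sketch of Approach 2, again assuming the Java chaincode shim (getHistoryForKey here corresponds to the Go GetHistoryForKey linked above); the class name is made up:
import org.hyperledger.fabric.shim.ChaincodeStub;
import org.hyperledger.fabric.shim.ledger.KeyModification;
import org.hyperledger.fabric.shim.ledger.QueryResultsIterator;
import java.util.ArrayList;
import java.util.List;

public class KeyHistory {
    // Collect every committed value a key has held, skipping deletions.
    public static List<byte[]> previousValues(ChaincodeStub stub, String key) throws Exception {
        List<byte[]> values = new ArrayList<>();
        try (QueryResultsIterator<KeyModification> history = stub.getHistoryForKey(key)) {
            for (KeyModification mod : history) {
                if (!mod.isDeleted()) {
                    values.add(mod.getValue());
                }
            }
        }
        return values;
    }
}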
You can have a look at the marbles example for an idea of both approaches:
https://github.com/hyperledger/fabric-samples/blob/release-1.2/chaincode/marbles02/go/marbles_chaincode.go

Related

Java In-memory Distributed Linked List

I have a requirement to keep data in memory and distributed across nodes. I can see that Hazelcast and Apache Ignite support JCache and key-value pairs, but they distribute the data by their own algorithm (e.g. hashing).
My requirement is that the data (elements) should be sorted by timestamp (one of the fields in the Java data object) and partitioned on the heap as a list (like a distributed linked list).
Example: let's say we have 4 nodes.
List 1 on Node 1 -> element(1), element(2), element(3). 
List 2 on Node 2 -> element(4), element(5), element(6).
List 3 on Node 3 -> element(7), element(8), element(9).
List 4 on Node 4 -> element(10), element(11), element(12).
element (n) transaction time < element (n+1) transaction time 
The goal is to run the algorithm in memory on each node against its local data, without network calls.
For Hazelcast, you probably want near-cache.
This lets the system distribute the data the way it should, but each node can keep a local copy of the data it is using.
You can override the distribution algorithm if you want certain pieces of data to be kept together. However, trying to control where the data lives prevents a distributed system from rebalancing it to even out the load.
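For reference, a minimal sketch of enabling a near cache on a map (assuming Hazelcast 4.x; the "elements" map name is made up):
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class NearCacheExample {
    public static void main(String[] args) {
        // Keep a local, read-through copy of hot entries alongside the distributed map.
        MapConfig mapConfig = new MapConfig("elements");
        mapConfig.setNearCacheConfig(new NearCacheConfig());
        Config config = new Config().addMapConfig(mapConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<Long, String> elements = hz.getMap("elements");
        elements.put(1L, "element(1)");
        elements.get(1L); // first read populates the near cache; later reads are served locally
    }
}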
In addition to Neil's near-cache advice, you should also look into the Distributed Computing section within the Finding the Right Tool chapter of the Hazelcast documentation. There are 3 ways to proceed:
Hazelcast Jet stream & batch engine - your pipelines (jobs) can also process data locally;
ExecutorService - allows you to execute your code on cluster members;
EntryProcessor - allows your code to process IMap entries locally on members.
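To illustrate the last option, here is a rough EntryProcessor sketch (assuming Hazelcast 4.x; the map and values are placeholders). The processor runs on the member that owns each entry, so the data is processed locally without crossing the network.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class LocalProcessing {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Long, String> elements = hz.getMap("elements");
        elements.put(1L, "element(1)");
        // Runs on the owning member for every entry; no entries are pulled to the caller.
        elements.executeOnEntries(entry -> {
            entry.setValue(entry.getValue().toUpperCase());
            return null;
        });
    }
}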

Storing arrays in Cassandra

I have lots of fast incoming data that is organised as follows:
Lots of 1D arrays, one per logical object, where the position of each element in the array is important and each element is calculated and produced individually in parallel, and so not necessarily in order.
The data arrays themselves are not necessarily written in order.
The length of the arrays may vary.
The data is read as an entire array at a time, so it makes sense to store the whole thing together.
The way I see it, the issue is primarily caused by the way the data is made available for writing. If it were all available together, I'd just store the entire lot at the same time and be done with it.
For smaller data loads I can get away with the Postgres array datatype: one row per logical object, with a key and an array column. This lets me scale by having one writer per array, writing the elements in any order without blocking any other writer. But it is limited by the rate of a single Postgres node.
In Cassandra/Scylla it looks like I have the options of either:
Storing each element as its own row, which would be very fast for writing; reads would be more cumbersome but doable, potentially involving lots of very wide scans,
or converting the array to JSON/a string, reading the cell, tweaking the value, then re-writing it, which would be horribly slow and lead to lots of compaction overhead,
or having the writer buffer until it receives all the array values and then writing the array in one go; except the writer won't know how long the array should be and will need a timeout before writing down whatever it has, which ultimately means I'll need to update the row later if the late data turns up.
What other options do I have?
Thanks
Option 1 seems to be a good match.
I assume each logical object has a unique id (or better, a uuid).
In that case, you can create something like:
CREATE TABLE tbl (id uuid, ord int, v text, PRIMARY KEY (id, ord));
where id is the partition key and ord is the clustering (ordering) key, storing each "array" as a partition and each value as a row.
This allows:
fast retrieval of the entire "array", even a big one, using paging
fast retrieval of a single index in the array
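A rough sketch of reading and writing that model with the DataStax Java driver (4.x assumed; the keyspace name, default contact point and ids are placeholders):
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.UUID;

public class ArrayStore {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("ks").build()) {
            UUID objectId = UUID.randomUUID();

            // Writers can insert elements in any order, in parallel, without blocking each other.
            PreparedStatement ins = session.prepare("INSERT INTO tbl (id, ord, v) VALUES (?, ?, ?)");
            session.execute(ins.bind(objectId, 2, "v2"));
            session.execute(ins.bind(objectId, 0, "v0"));
            session.execute(ins.bind(objectId, 1, "v1"));

            // Read the whole "array" back; rows come out ordered by the clustering key ord.
            PreparedStatement sel = session.prepare("SELECT ord, v FROM tbl WHERE id = ?");
            ResultSet rs = session.execute(sel.bind(objectId));
            for (Row row : rs) {
                System.out.println(row.getInt("ord") + " -> " + row.getString("v"));
            }
        }
    }
}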

Running a streaming operation ONLY on the node which contains the relevant KEY

Let's say I have a large IStreamMap on a large cluster and I only want to do an operation on a few keys. I could just write a filter expression as shown below, but my understanding is that this will run on all nodes, and 99% of the nodes will be forced to stream the map even though ultimately nothing comes out of it. Is there a way to get the Hazelcast Jet cluster to ONLY run the operation on the nodes that hold those keys? The code that ought to work is below, but I don't think it's efficient. (In my case, I might be running this operation many times on large distributed maps, so I would not want each node to execute it when I can tell ahead of time that 99% of the nodes are not relevant to the selected keys.)
final IStreamMap<String, Integer> streamMap = instance1.getMap("source");
// stream of entries, you can grab keys from it
streamMap.stream()
    .filter(e -> e.getKey().equals("1") || e.getKey().equals("9999999"))
    .forEach(e -> { /* <do something interesting> */ });
IStreamMap was removed from Hazelcast Jet three years ago, I think. You should use Jet through its Pipeline API.
You can try using a map source with a predicate:
Pipeline p = Pipeline.create();
BatchStage<Entry<K, V>> stage = p.readFrom(Sources.map("name",
        (Map.Entry<K, V> mapEntry) -> myCondition(mapEntry),
        e -> e));
This will still scan the entire map, though. If you simply have a set of keys you're interested in, then perhaps a better match for your use case is IMap.executeOnKeys().
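As a hedged sketch of the executeOnKeys route (Hazelcast 4.x assumed; the key values and the processing logic are placeholders): only the members that own the given keys execute the processor, and the rest of the cluster is not involved at all.
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import java.util.Map;
import java.util.Set;

public class ExecuteOnKeysExample {
    public static Map<String, Integer> run(HazelcastInstance instance1) {
        IMap<String, Integer> map = instance1.getMap("source");
        Set<String> keys = Set.of("1", "9999999"); // the few keys of interest
        return map.executeOnKeys(keys, entry -> {
            // <do something interesting> with entry.getValue() on the owning member
            return entry.getValue();
        });
    }
}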

Hyperledger fabric - Concurrent transactions

I'm wondering how it is possible to execute concurrent transactions in Hyperledger Fabric using Hyperledger Composer.
When I try to submit two transactions at the same time against the same resource, I get this error:
Error trying invoke business network. Error: Peer has rejected transaction 'transaction-number' with code MVCC_READ_CONFLICT
Does anyone know if there is a workaround or a design pattern to avoid this?
Though I may not be providing the best solution, I hope to share some ideas and possible workarounds to this question.
First, let's briefly explain why you are getting this error. The underlying DB of Hyperledger Fabric employs an MVCC-like (Multi-Version Concurrency Control) model. An example would be two clients trying to update an asset at version 0 to some value. One would succeed (updating the value and incrementing the version number in the stateDB to 1), while the other would fail with this error (MVCC_READ_CONFLICT) due to the version mismatch.
One possible solution discussed here (https://medium.com/wearetheledger/hyperledger-fabric-concurrency-really-eccd901e4040) would be to implement a FIFO queue on your own between the business logic and Fabric SDK. Retry logic could also be added in this case.
Another way would be to use the delta concept. Suppose there is an asset A with value 10 (maybe it represents an account balance). This asset is updated frequently (say through the sequence 12 -> 19 -> 16) by multiple concurrent transactions, and the above-mentioned error would easily be triggered. Instead, we store the values as deltas (+2 -> +7 -> -3), and the final aggregated value in the ledger is the same. But keep in mind this trick MAY NOT suit every case; in this example, you may also need to closely monitor the running total to avoid paying out money when the account is empty. So it depends heavily on the data type and the use case.
For more information, you can take a look at this: https://github.com/hyperledger/fabric-samples/tree/release-1.1/high-throughput
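A very rough sketch of the delta idea, assuming the Java chaincode shim (the class, key and method names are only illustrative; the linked high-throughput sample is the reference implementation):
import org.hyperledger.fabric.shim.ChaincodeStub;
import org.hyperledger.fabric.shim.ledger.CompositeKey;
import org.hyperledger.fabric.shim.ledger.KeyValue;
import org.hyperledger.fabric.shim.ledger.QueryResultsIterator;
import java.nio.charset.StandardCharsets;

public class DeltaAccount {
    // Each transaction writes its own delta row keyed by txid, so concurrent
    // transactions never touch the same key and never hit MVCC_READ_CONFLICT.
    public static void addDelta(ChaincodeStub stub, String account, double delta) {
        CompositeKey ck = stub.createCompositeKey("delta", account, stub.getTxId());
        stub.putState(ck.toString(), String.valueOf(delta).getBytes(StandardCharsets.UTF_8));
    }

    // The balance is the sum of all deltas, aggregated at read time.
    public static double getBalance(ChaincodeStub stub, String account) throws Exception {
        double total = 0;
        CompositeKey prefix = stub.createCompositeKey("delta", account);
        try (QueryResultsIterator<KeyValue> rows = stub.getStateByPartialCompositeKey(prefix)) {
            for (KeyValue kv : rows) {
                total += Double.parseDouble(new String(kv.getValue(), StandardCharsets.UTF_8));
            }
        }
        return total;
    }
}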
I recently ran into this problem and solved it by creating an array of functions that each return a promise for an async call, then resolving them one at a time.
My transactions add items from arrays of asset2Ids and asset3Ids to an array field on asset1. My transactions all act on the same asset, so I was getting an MVCC_READ_CONFLICT error because the read/write set changes before each transaction is committed. By forcing the transactions to resolve sequentially, this conflict is avoided:
// Build an array of functions, each returning a promise for one transaction
let funcArray = [];
for (const i of asset2Ids) {
    // Queue the transaction that adds this asset2 id to asset1
    funcArray.push(() => transactionFunctionThatAddsAsset2IdToAsset1(i).toPromise());
}
for (const j of asset3Ids) {
    // Queue the transaction that adds this asset3 id to asset1
    funcArray.push(() => transactionFunctionThatAddsAsset3IdToAsset1(j).toPromise());
}
// Chain the calls so the transactions resolve one after another against the asset
funcArray.reduce((p, fn) => p.then(fn), Promise.resolve());

Usage of Redis for very large memory cache

I am planning to consider Redis for storing a large amount of data in a cache. Currently I store it in my own cache written in Java. My use case is below.
I get 15-minute data from a source and I need to aggregate the data hourly. So for a given object A, every hour I will get 4 values and I need to aggregate them into one value; the formula I will use is max / min / sum.
For the key I plan to use something like the following:
a) object id - long
b) time - long
c) property id - int (each object may have many properties, and I need to aggregate each property separately)
So the final key would look like:
objectid_time_propertyid
Every 15 minutes I may get around 50 to 60 million keys. Each time I need to fetch these keys, convert the property value to a double, apply the formula (max/min/sum etc.), then convert it back to a String and store it back.
So for every key I have one read and one write, plus a conversion in each case.
My questions are the following.
Is it advisable to use Redis for such a use case? Going forward I may aggregate hourly data to daily, daily to weekly and so on.
What would be the performance of reads and writes in the cache? (I did a sample test on Windows: reading and writing 100K keys took 30-40 seconds, which is not great, but I tested on Windows and will ultimately run on Linux.)
I want to use the persistence feature of Redis; what are its pros and cons?
If anyone has real experience using Redis as an in-memory cache with frequent updates, please share your suggestions.
Is it advisable to use Redis for such a use case? Going forward I may aggregate hourly data to daily, daily to weekly and so on.
Advisable depends on who you ask, but I certainly feel Redis will be up to the job. If a single server isn't enough, your description suggests that the dataset can be easily sharded so a cluster will let you scale.
I would advise, however, that you store your data a little differently. First, every key in Redis has an overhead so the more of these, the more RAM you'll need. Therefore, instead of keeping a key per object-time-property, I recommend Hashes as a means for aggregating some values together. For example, you could use an object_id:timestamp key and store the property_id:value pairs under it.
Furthermore, instead of keeping the 4 discrete measurements for each object-property by timestamp and recomputing your aggregates, I suggest you keep just the aggregates and update these with new measurements. So, you'd basically have an object_id Hash, with the following structure:
object_id:hourtimestamp -> property_id1:max = x
                           property_id1:min = y
                           property_id1:sum = z
When getting new data - d - for an object's property, just recompute the aggregates:
property_id1:max = max(x, d)
property_id1:min = min(y, d)
property_id1:sum = z + d
Repeat the same for every resolution needed, e.g. use object_id:daytimestamp to keep day-level aggregates.
Finally, don't forget to expire your keys once they are no longer required (e.g. set a 24-hour TTL on the hourly counters and so forth).
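A rough sketch of that update path with the Jedis client (key and field layout as above; note the max/min read-then-write is not atomic, so concurrent writers to the same property would need a Lua script or WATCH/MULTI):
import redis.clients.jedis.Jedis;

public class HourlyAggregates {
    // Fold a new measurement d into the hourly aggregates of one object/property.
    public static void update(Jedis jedis, long objectId, long hourTs, int propertyId, double d) {
        String key = objectId + ":" + hourTs;   // one Hash per object per hour
        String max = propertyId + ":max";
        String min = propertyId + ":min";
        String sum = propertyId + ":sum";

        String cur = jedis.hget(key, max);
        if (cur == null || d > Double.parseDouble(cur)) {
            jedis.hset(key, max, String.valueOf(d));
        }
        cur = jedis.hget(key, min);
        if (cur == null || d < Double.parseDouble(cur)) {
            jedis.hset(key, min, String.valueOf(d));
        }
        jedis.hincrByFloat(key, sum, d);        // running sum
        jedis.expire(key, 24 * 60 * 60);        // drop the hourly hash after 24 hours
    }
}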
There are other possible approaches, mainly using Sorted Sets, that can be applicable to solve your querying needs (remember that storing the data is the easy part - getting it back is usually harder ;)).
What would be the performance of reads and writes in the cache? (I did a sample test on Windows: reading and writing 100K keys took 30-40 seconds, which is not great, but I tested on Windows and will ultimately run on Linux.)
Redis, when running on my laptop on Linux in a VM, does in excess of 500K reads and writes per second. Performance is very dependent on how you use Redis' data types and API. Given your throughput of 60 million values over 15 minutes, or ~70K writes per second of smallish data, Redis is more than equipped to handle that.
I want to use the persistence feature of Redis; what are its pros and cons?
This is an extremely well-documented subject - please refer to http://redis.io/topics/persistence and http://oldblog.antirez.com/post/redis-persistence-demystified.html for starters.
