Hyperledger Fabric 1.4 Private data collection - hyperledger-fabric

Hyperledger Fabric provides built-in support for storing off-chain data with the help of private data collections. For this we need to specify a collection config that contains the various collection names along with the participants that have access to the data in those collections.
There is a setting called "blockToLive" with which we can specify for how many blocks the peers should store the private data they have access to. The peers will automatically purge the private data once the ledger block height reaches the specified threshold.
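For reference, a collection definition is just a JSON file supplied when the chaincode is instantiated (via the peer CLI's --collections-config flag). A minimal sketch; the collection name, MSP IDs, and peer counts here are illustrative, not from the question:

```json
[
  {
    "name": "collectionPrivateDetails",
    "policy": "OR('Org1MSP.member', 'Org2MSP.member')",
    "requiredPeerCount": 1,
    "maxPeerCount": 2,
    "blockToLive": 300,
    "memberOnlyRead": true
  }
]
```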
We have a requirement in which we need to use private data collections, but the data should be removed (automatically or manually) after exactly 30 days. Is there any way to achieve this?
timeToLive: Is there any implementation of a timeToLive or similar configuration, with which the peers would automatically purge the data after the specified duration?
If there is no automatic way currently, how can the data present in a private collection be removed manually? Is there any way the data in private collections can be removed directly using external scripts/code? We don't want to create chaincode methods that are invoked as transactions to delete private data, since even the deletion of private data would need to be endorsed, sent to the orderer, and added to the ledger. How can the private data be removed directly?

First, everything that you put on the blockchain is permanent and supposed to be decentralized. So having unilateral control over when to delete the private data goes against the virtue of decentralization, and you should avoid it (answer to point 2). Endorsers endorse every change or transaction (including the blockToLive setting), so it does not make sense to deviate from the agreed period.
Second, time in distributed systems is subjective, and it is impossible to have a global clock ⏰ (say, 30 days for one node can be 29.99 days for another, or 29.80 days for yet another). Hence, time is measured in blocks, which is objective for all nodes. So it is recommended that you use blockToLive. It can be difficult at first, but you can calculate backwards.
Say you have a block size of 10 (number of transactions per block) and expect around 100 transactions per day: that is 10 blocks per day, so 30 days works out to blockToLive = 300. (Of course, this is a ballpark number.)
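That back-calculation can be sketched in a few lines of Go (the throughput numbers are the same illustrative ones as above, not Fabric parameters):

```go
package main

import "fmt"

func main() {
	const (
		txPerBlock    = 10  // approximate transactions per block
		txPerDay      = 100 // expected daily transaction volume
		retentionDays = 30  // desired lifetime of the private data
	)
	blocksPerDay := txPerDay / txPerBlock // 10 blocks/day
	fmt.Println(blocksPerDay * retentionDays) // blockToLive ≈ 300
}
```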
Finally, if you still want to delete private data at will, I would recommend manual off-chain storage mechanisms.

Related

How to deal with external DB in a Hyperledger fabric chaincode

I have an operation with X meters, and the number of meters may vary.
For each meter, I have to set a percentage of allocation.
So let's say, in my Operation 1, I have 3 meters, m1, m2, m3, I will assign 10% for m1, 50% for m2, and 40% for m3.
So, in this case, when I receive data from m1, I will want to check that operation 1 and meter 1 exists, that meter 1 belongs to operation 1, and get the repartition for my meter.
All those settings are present in an external DB (Postgres). I can get them easily in Golang. The thing is, I heard that chaincode must be deterministic, and it is good practice not to have any external dependency. I understand that if the result of your chaincode depends on an external DB, you will not be able to audit it, so the whole blockchain loses some of its interest.
Should I hardcode it in an array or in a config file? Then each time I have a config change, I must publish my chaincode again? I am not happy about having two configs to sync (DB + config file); it could quickly lead to mistakes.
What is the recommended way of managing external DB connections in a chaincode?
You could place the "meter information" into the blockchain data store and query it from there?
For instance, the application may be used to:
maintain the state of all meters, with whatever information they require. This data is written to the Fabric state store, where it may be queried.
perform an additional transaction that contains the logic required to query meter information and act accordingly.
In the above case the chaincode will be able to update meter information and act on information stored via queries and subsequent actions.
Everything is then on-chain, therefore it is accessible, updatable, and auditable.
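A minimal Go chaincode sketch of that idea, assuming the fabric-contract-api-go library (Fabric 2.x-style contract API; the 1.4 shim is analogous) and a hypothetical MeterConfig shape; struct fields and function names are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/hyperledger/fabric-contract-api-go/contractapi"
)

// MeterConfig is a hypothetical on-chain record replacing the Postgres row.
type MeterConfig struct {
	OperationID string  `json:"operationId"`
	MeterID     string  `json:"meterId"`
	Allocation  float64 `json:"allocation"` // percentage, e.g. 10, 50, 40
}

type MeterContract struct {
	contractapi.Contract
}

// SetMeter writes (or updates) a meter's allocation in the state store.
func (c *MeterContract) SetMeter(ctx contractapi.TransactionContextInterface, operationID, meterID string, allocation float64) error {
	key, err := ctx.GetStub().CreateCompositeKey("meter", []string{operationID, meterID})
	if err != nil {
		return err
	}
	bytes, err := json.Marshal(MeterConfig{OperationID: operationID, MeterID: meterID, Allocation: allocation})
	if err != nil {
		return err
	}
	return ctx.GetStub().PutState(key, bytes)
}

// GetMeter checks that the operation/meter pair exists and returns its allocation.
func (c *MeterContract) GetMeter(ctx contractapi.TransactionContextInterface, operationID, meterID string) (*MeterConfig, error) {
	key, err := ctx.GetStub().CreateCompositeKey("meter", []string{operationID, meterID})
	if err != nil {
		return nil, err
	}
	bytes, err := ctx.GetStub().GetState(key)
	if err != nil {
		return nil, err
	}
	if bytes == nil {
		return nil, fmt.Errorf("meter %s does not exist in operation %s", meterID, operationID)
	}
	var cfg MeterConfig
	if err := json.Unmarshal(bytes, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```

Config changes then become ordinary endorsed transactions instead of chaincode re-deployments, which avoids the two-configs-to-sync problem.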

What exactly is the point of private-data collection?

If we know that some organizations might want to keep certain information private from others, why not just create a separate channel? Is private data purely for manageability and to reduce channel overhead?
I've read the documentation on when to use a collection within a channel vs. a separate channel:
Use channels when entire transactions (and ledgers) must be kept confidential within a set of organizations that are members of the channel.
Use collections when transactions (and ledgers) must be shared among a set of organizations, but when only a subset of those organizations should have access to some (or all) of the data within a transaction. Additionally, since private data is disseminated peer-to-peer rather than via blocks, use private data collections when transaction data must be kept confidential from ordering service nodes.
Take a practical example. There is an auction house and 3-4 vendors who regularly bid. The bidding type is a closed auction. The auction house is one node and will announce the item to be bid upon. This item must be visible to all the vendors. Each vendor will then submit their bid for the item over the blockchain. As each bid is private, the vendors may view only their own bid, while the auction house has full visibility.
Without private data
1) Channel PUBLIC -> Auction house creates a bid, all vendors can view it
2) Channel VENDOR_1, VENDOR_2, VENDOR_3 - Only one vendor and auction house are on this channel. The vendor submits his bid over here
What happens is the auction house now has to check bids across multiple channels, choose the winner, and then update all channels appropriately. At larger scale and in more complex systems, the associated overhead is massive. You may require separate modules/API calls just to ensure the state of certain objects (bids) is the same across channels.
Instead, private data allows a single channel to be used. A vendor may submit a bid that is viewable by everyone, BUT mark the price of the bid as private, so only the auction house and the vendor can view it.
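A minimal Go sketch of that flow, assuming the contract API; the collection name, state key, and transient field name are hypothetical:

```go
package main

import (
	"fmt"

	"github.com/hyperledger/fabric-contract-api-go/contractapi"
)

type AuctionContract struct {
	contractapi.Contract
}

// SubmitBid records publicly that a bid exists for an item, while the price
// travels in the transient field and is written only to a private collection.
func (c *AuctionContract) SubmitBid(ctx contractapi.TransactionContextInterface, itemID string) error {
	transient, err := ctx.GetStub().GetTransient()
	if err != nil {
		return err
	}
	price, ok := transient["bid_price"] // transient data never enters the block
	if !ok {
		return fmt.Errorf("bid_price must be passed in the transient map")
	}
	// public part: every channel member can see that this vendor bid on itemID
	if err := ctx.GetStub().PutState("bid~"+itemID, []byte("submitted")); err != nil {
		return err
	}
	// private part: only collection members (auction house + this vendor) store
	// the price; everyone else sees only its hash on the channel ledger
	return ctx.GetStub().PutPrivateData("collectionAuctionVendor1", "bid~"+itemID, price)
}
```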
Yes, private data is mostly used to reduce channel overhead.
Adding a new private data collection dynamically is more convenient and easy, and has pretty much no overhead on the network. Whereas having too many channels in the network can lead to a maintenance nightmare and can drastically affect the network's performance.
When to use multiple channels:
when it's okay to have isolated transactions
when the number of channels is manageable
When to use private data collections:
when it's only required to hide the txn data (confidential data) and not to isolate other users from viewing the interaction between the parties involved (others can only see the hash of the data anyway, but they would know there was a txn between the involved parties)
Would like to highlight one important distinction (it's also in your documentation quote): private collections hide the transaction data from the orderers too, i.e. the private data itself is never submitted for ordering; only its hash travels in the transaction. When using the multiple-channels approach, your transaction data is shared with the orderer(s).

Custom controlled partitioning

I recently posted a question and received a full answer. But I am encountering another problem.
Case scenario is the same as in my recent question.
How can I configure a member to own a partition key?
e.g. the DataCenterOnRussia partition key must always be owned by member1, and the DataCenterOnGermany partition key must always be owned by member2.
So member2 could request data from DataCenterOnRussia using PartitionAwareKey.
The intent of the PartitionAwareKey is to allow for data affinity ... orders for a customer should be stored in the same partition as the customer record, for example, since they are frequently accessed together.
The PartitionAwareKey allows grouping items together, but it does not provide a way to specify the placement of those items on a specific cluster member. (I guess if there were such a thing, it would likely be called MemberAwareKey.)
A cluster in Hazelcast isn't a fixed-size entity; it is dynamically scalable, so members might be added or removed, and it is fault-tolerant, so a member could be lost without loss of the data that happened to be on that member. In order to support those features, the cluster must have the freedom to move partitions around to different machines as the cluster topology changes.
Hazelcast recommends that all members of a cluster be similarly configured (equivalent memory configuration, most particularly) because of the idea that cluster members are interchangeable, at least as far as data storage goes. (The MemberSelector facility does provide a provision for handling systems that have different processing capabilities, e.g., number of processor cores; but nothing similar exists to allow placement of specific data entries or partitions on a designated member.)
If your use case requires specific placement on machines, it's an indication that those machines probably should not be part of the same cluster.

Can old block data be deleted from a blockchain?

Just a general question, if I'm building a blockchain for a business I want to store 3 years of transactions but anything older than that I won't need and don't want actively in the working database. Is there a way to backup and purge a blockchain or delete items older than some time frame? I'm more interested in the event logic than the forever memory aspect.
I'm not aware of any blockchain technology capable of this yet, but Hyperledger Fabric in particular is planning to support data archiving (checkpointing). Simply put, participants need to agree on a block height so that older blocks can be discarded. This new block then becomes the source of trust, similar to the original genesis block. Also, a snapshot needs to be taken and agreed upon, which captures the current state.
From a serviceability point of view, it's slightly more complicated, i.e. you may have nodes that are down while snapshotting, etc.
If you just want to purge the data after a while, Fabric's private data has an option which could satisfy your requirement.
blockToLive: Represents how long the data should live on the private database, in terms of blocks. The data will live for this specified number of blocks on the private database, and after that it will get purged, making this data obsolete from the network so that it cannot be queried from chaincode and cannot be made available to requesting peers.
You can read more here.
Personally, I don't think there is a way to remove a block from the chain. It would undermine the immutability property of the blockchain.
There are 2 concepts which help you achieve your goals.
The first one was already mentioned: Private Data. Private data gives you the possibility to 'label' data with a time to live. Only the private data hashes are stored on the chain (to be able to verify the transaction), while the data itself is stored in so-called SideDBs and eventually gets fully pruned (except the hashes on the chain, of course). This is kind of the basis for using Fabric without workarounds and achieving GDPR compliance.
The other thing, which was not mentioned yet and is very helpful for this question:
Is there a way to backup and purge a blockchain or delete items older than some time frame?
Every peer only stores the 'current state' of the ledger in its StateDB. The current state could be described as the data which is labeled 'active' and probably soon to be used again. You can think of the StateDB as being like a cache. Data comes into this cache by creating or updating a key (invoking). To remove a key from the cache you can use 'DelState'. It is then labeled 'deleted' and is no longer in the cache, BUT it is still on the ledger, and you can retrieve the history and data for that key.
Conclusion: for 'real' deletion of data you have to use the concept of Private Data, and for managing data in your StateDB (think of the 'cache' analogy) you can simply use the built-in functions.
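A small Go sketch of that 'cache' management with the built-in functions (contract-API style; the contract and method names are illustrative):

```go
package main

import (
	"github.com/hyperledger/fabric-contract-api-go/contractapi"
)

type CacheContract struct {
	contractapi.Contract
}

// Retire removes a key from the state DB (the 'cache'). The deletion is
// itself an endorsed transaction, so the key's past values stay on the ledger.
func (c *CacheContract) Retire(ctx contractapi.TransactionContextInterface, key string) error {
	return ctx.GetStub().DelState(key)
}

// History walks the ledger history of a key, which still works after DelState.
func (c *CacheContract) History(ctx contractapi.TransactionContextInterface, key string) ([]string, error) {
	iter, err := ctx.GetStub().GetHistoryForKey(key)
	if err != nil {
		return nil, err
	}
	defer iter.Close()
	var txIDs []string
	for iter.HasNext() {
		mod, err := iter.Next()
		if err != nil {
			return nil, err
		}
		txIDs = append(txIDs, mod.TxId) // each modification, including the delete
	}
	return txIDs, nil
}
```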

Azure Table Storage Partition Key

Two somewhat related questions.
1) Is there any way to get an ID of the server a table entity lives on?
2) Will using a GUID give me the best partition key distribution possible? If not, what will?
We have been struggling for weeks with table storage performance. In short, it's really bad, but early on we realized that using a random-ish partition key will distribute the entities across many servers, which is exactly what we want to do, as we are trying to achieve 8,000 reads per second. Apparently our partition key wasn't random enough, so for testing purposes I have decided to just use a GUID. First impression is it is waaaaaay faster.
Really bad get performance is < 1,000 per second. The partition key is Guid.NewGuid() and the row key is the constant "UserInfo". The get is executed using a TableOperation with pk and rk, nothing else, as follows: TableOperation retrieveOperation = TableOperation.Retrieve(pk, rk); return cloudTable.ExecuteAsync(retrieveOperation). We always use indexed reads and never table scans. Also, the VM size is medium or large, never anything smaller. Parallel no, async yes.
As other users have pointed out, Azure Tables are strictly controlled by the runtime, and thus you cannot control/check which specific storage nodes are handling your requests. Furthermore, any given partition is served by a single server; that is, entities belonging to the same partition cannot be split between several storage nodes (see HERE).
In Windows Azure table, the PartitionKey property is used as the partition key. All entities with same PartitionKey value are clustered together and they are served from a single server node. This allows the user to control entity locality by setting the PartitionKey values, and perform Entity Group Transactions over entities in that same partition.
You mention that you are targeting 8,000 requests per second? If that is the case, you might be hitting a threshold that requires very good table/partition key design. Please see the article "Windows Azure Storage Abstractions and their Scalability Targets".
The following extract is applicable to your situation:
This will provide the following scalability targets for a single storage account created after June 7th 2012.
Capacity – Up to 200 TBs
Transactions – Up to 20,000 entities/messages/blobs per second
As other users pointed out, if your PartitionKey numbering follows an incremental pattern, the Azure runtime will recognize this and group some of your partitions within the same storage node.
Furthermore, if I understood your question correctly, you are currently assigning partition keys via GUIDs? If that is the case, every PartitionKey in your table will be unique, so every partition will contain no more than one entity. As per the articles above, the way Azure Table scales out is by grouping entities by their partition keys across independent storage nodes. If your partition keys are unique and thus each contains no more than one entity, Azure Table can scale out only one entity at a time! Now, we know Azure is not that dumb: it groups partition keys when it detects a pattern in the way they are created. So if you are hitting this trigger and Azure is grouping your partition keys, your scalability is limited to the smartness of this grouping algorithm.
As per the scalability targets above for 2012, each partition key should be able to give you 2,000 transactions per second. Theoretically, you should need no more than four partition keys in this case (8,000 / 2,000 = 4, assuming that the workload is distributed equally between the four).
I would suggest you design your partition keys to group entities in such a way that no more than 2,000 entities per second per partition are reached, and drop using GUIDs as partition keys. This will allow you to better support features such as Entity Group Transactions, reduce the complexity of your table design, and get the performance you are looking for.
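One hypothetical way to get a bounded, evenly spread set of partition keys is to hash a natural key into a fixed number of buckets (the bucket count and naming scheme here are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionKey maps a user ID onto one of `buckets` stable partition keys,
// spreading load over a bounded set of partitions instead of one GUID
// partition per entity.
func partitionKey(userID string, buckets uint32) string {
	h := fnv.New32a()
	h.Write([]byte(userID)) // fnv's Write never returns an error
	return fmt.Sprintf("bucket-%03d", h.Sum32()%buckets)
}

func main() {
	// at ~2,000 tps per partition and an 8,000 tps target, 4+ buckets suffice
	fmt.Println(partitionKey("user-42", 8)) // e.g. "bucket-003"
}
```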
Answering #1: There is no concept of a server that a particular table entity lives on. There are no particular servers to choose from, as Table Storage is a massive-scale multi-tenant storage system. So... there's no way to retrieve a server ID for a given table entity.
Answering #2: Choose a partition key that makes sense to your application. Just remember that it's partition + row to access a given entity. If you do that, you'll have a fast, direct read. If you attempt to do a table or partition scan, your performance will certainly take a hit.
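The question's snippet uses the older .NET client, but a point read looks the same in any SDK; here is a sketch with the current Go package (github.com/Azure/azure-sdk-for-go/sdk/data/aztables), where the table name and environment variable are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/data/aztables"
)

func main() {
	svc, err := aztables.NewServiceClientFromConnectionString(
		os.Getenv("AZURE_STORAGE_CONNECTION_STRING"), nil)
	if err != nil {
		log.Fatal(err)
	}
	client := svc.NewClient("users")
	// partition key + row key: a direct, indexed point read, no scan
	resp, err := client.GetEntity(context.TODO(), "some-partition", "UserInfo", nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Value)) // entity returned as JSON
}
```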
See http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx for more info on key selection (note the numbers are 3 years old, but the guidance is still good).
Also, this talk can be of some use as far as best practices go: http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/WAD-B406#fbid=lCN9J5QiTDF.
In general a given partition can support up to 2,000 tps, so spreading data across partitions will help achieve greater numbers. Something to consider is that atomic batch transactions only apply to entities that share the same partition key. Additionally, for smaller requests you may consider disabling Nagle's algorithm, as small requests may be getting held up at the client layer.
From the client end, I would recommend using the latest client lib (2.1) and async methods, as you have literally thousands of requests per second (the talk has a few slides on client best practices).
Lastly, the next release of storage will support JSON and JSON-no-metadata, which will dramatically reduce the size of the response body for the same objects, and consequently the CPU cycles needed to parse them. If you use the latest client libs, your application will be able to leverage these behaviors with little to no code change.
