How to upload PDF, doc files, etc. to Hyperledger blockchain - hyperledger-fabric

I am currently working on a project in Hyperledger where I want to upload files such as PDFs and docs to the Hyperledger blockchain, distributing each document file across the nodes of the network and retrieving it later. Please help me with how I can do this and how I should approach it. If it cannot be achieved in Hyperledger, please let me know which blockchain can do it. Thanks in advance.

As a general rule, blockchain technologies are not suitable for storing large documents. A blockchain replicates both processing and storage across its nodes, which makes large payloads expensive. There are further pitfalls, such as block size limits and confidentiality.
One common approach is to store the files in a distributed P2P storage system such as IPFS (https://ipfs.io/) and store the file's hash in the blockchain's state (hashes are used as file references in IPFS, so integrity is assured).
You may also need IPFS-Cluster (https://cluster.ipfs.io/) to ensure persistence and replication.
IPFS does not encrypt file contents, so, if needed, encryption should be applied end to end outside IPFS: encrypt before storing and decrypt after retrieving.
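A minimal sketch of the hash-on-chain pattern described above, in Python. Everything named here is an assumption for illustration: `OffChainStore` is a hypothetical in-memory stand-in for IPFS (a real IPFS reference is a CID, not a bare SHA-256 hex string), and `make_ledger_record` only builds the JSON a client might submit to chaincode; it does not talk to Fabric.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


class OffChainStore:
    """Hypothetical stand-in for IPFS or any content-addressed store."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Content addressing: the key is derived from the bytes themselves.
        digest = sha256_hex(data)
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]


def make_ledger_record(doc_id: str, filename: str, data: bytes) -> str:
    """Build the small JSON record that would go on-chain (hash, not file)."""
    record = {
        "docId": doc_id,          # hypothetical field names
        "filename": filename,
        "sha256": sha256_hex(data),
        "size": len(data),
    }
    return json.dumps(record, sort_keys=True)


def fetch_verified(store: OffChainStore, record_json: str) -> bytes:
    """Fetch the file off-chain and check it against the on-chain hash."""
    record = json.loads(record_json)
    data = store.get(record["sha256"])
    if sha256_hex(data) != record["sha256"]:
        raise ValueError("off-chain file does not match the on-chain hash")
    return data
```

The ledger only ever sees a record of a few hundred bytes, regardless of file size, and any change to the stored file is detected when it is fetched.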

From your comment it sounds like you are thinking of using Hyperledger Fabric. With Fabric you should be able to write chaincode and a client that stores files. It still might not be the best approach, and you should think carefully about separating storage of large files from the consensus-based chaincode, especially when storing a hash of the file on-chain would suffice. Each node will need to store the file and come to consensus that all nodes stored the same file (have the same state). This is expensive in terms of compute and network I/O.
With Fabric, you should be able to do just about anything in code, as long as it is deterministic and not a long-running process. In my experience, minimizing code and state on any blockchain is best practice.
A complete example of a Hyperledger Fabric implementation is here: https://fabric-chaintool.readthedocs.io/en/latest/getting-started/

I agree with Kekomal: don't put large files in a blockchain, as it bloats it. Rather, store binary/large files off-chain. One option in Hyperledger Fabric (HLF) is private data collections, where you can store binary files in CouchDB as base64 text and store just the checksum/hash on-chain. That would increase storage requirements, though: according to base64.guru, a 20KB PDF becomes about 26KB as base64 text, which is probably significant with very large binary files, and it is also an improper use of a document DB. Private data collections are nice because HLF lets you share files with various options.
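The base64 overhead mentioned above can be checked with a few lines of Python. This is just the arithmetic, assuming standard padded base64 (4 output characters for every 3 input bytes) and taking "20KB" to mean 20 * 1024 bytes; the function name is made up for the example.

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Length in characters of padded base64 output for n_bytes of input."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


raw = bytes(20 * 1024)              # a 20 KiB dummy payload of zero bytes
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw)  # ratio of encoded size to raw size

print(len(raw), len(encoded), round(overhead, 3))
```

So the encoded text is roughly one third larger than the binary file, which is in line with the base64.guru figure quoted above, and the ratio stays the same however large the file gets.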

Related

Hyperledger-fabric use cases

I'm currently looking to securely replicate hundreds of GB of data across a few hundred hosts. I was looking at a hyperledger-fabric private blockchain because of its use of TLS and a peer-to-peer gossip protocol for data transmission, plus of course the security of the blockchain itself.
Is it reasonable for me to be considering using blockchain as a way to securely do data replication? I have not seen this in any blockchain use case, but from what I've read it seems reasonable, even though everything I've read seems to indicate storing data in the blockchain is a bad idea. Usually the arguments are that it costs too much and the data has to be replicated across all the peers in the system. Cost isn't a concern in this case because it's a private blockchain, and for my use case the data replication (if it can be done efficiently) is what I'm looking for.
I could use ipfs, swift, S3, etc. to store the data, but that would add operational burden, especially if hyperledger-fabric can do the job on its own.
Also, if I use hyperledger private data collections, how much control over purging do I have? For my use cases, I can't just purge the oldest data as in some cases older data needs to be preserved for a long time and in some cases newer data can be purged fairly quickly.
On the subject of data replication:
TL;DR; Not a blockchain solution
Here's my thinking behind that.
Storing large amounts of data isn't a good idea, as you've mentioned. Yes, there's the replication of the data across peers (but that's a side effect you actually need in this case). But there's also the signing, validation, etc. that needs to take place across all that data. So the processing costs would make it inefficient.
Definition of 'securely': you don't say what quality of service would constitute 'secure'. For example:
Access Control for users to access the data?
Assurance that the data has been replicated and is on disk at remote locations without corruption?
Encryption of data to protect it in transit and at rest.
Blockchain, and I'm thinking Hyperledger Fabric here, would offer you the assurance. But there's no encryption in transit; you'd need to add that. As for access control, the primitives are there but you are required to implement and use them.
I would tend to think that the use of blockchain in this scenario would be to provide the audit trail of how the data was replicated between hosts, with some other protocol doing the replication.
On the subject of private data collection purging:
Currently this is implemented by purging data when the peer reaches a certain block height, e.g. purge after 42 blocks. But we're working on a feature to allow 'purge-on-demand' based on a call from the chaincode.
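One way to get different retention for different data with the block-based purging described above is to define more than one collection, each with its own `blockToLive` (the property in a Fabric collection definition that sets how many blocks private data lives before being purged; 0 means it is never purged). The sketch below just generates such a collection definition file from Python. The collection names, MSP IDs and peer counts are hypothetical placeholders, not recommendations.

```python
import json


def collection(name: str, policy: str, block_to_live: int) -> dict:
    """Build one entry of a Fabric private data collection definition."""
    return {
        "name": name,
        "policy": policy,
        "requiredPeerCount": 0,        # placeholder value
        "maxPeerCount": 3,             # placeholder value
        "blockToLive": block_to_live,  # 0 = keep indefinitely
        "memberOnlyRead": True,
    }


# Hypothetical organizations; replace with the MSP IDs of your network.
members = "OR('Org1MSP.member', 'Org2MSP.member')"

collections = [
    collection("longLivedDocs", members, 0),    # never purged
    collection("shortLivedDocs", members, 42),  # purged after 42 blocks
]

print(json.dumps(collections, indent=2))
```

The chaincode then decides per record which collection to write to, so older data can outlive newer data as long as each kind goes into the right collection.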

Store Images on/off blockchain? which is better?

I'm using Hyperledger Fabric and need to store images, but I don't know which is the best way to store them: on or off the blockchain? Both have their pros and cons. Should I sacrifice the security of the images or the performance of the network?
Blockchains are not suitable for storing large data. There are drawbacks such as the replication rate and limitations such as the block size.
So I recommend using an off-chain storage system. A common approach is using IPFS, a P2P distributed storage system. IPFS provides availability, higher performance and integrity (as files are referenced by their hash). The IPFS hash can be saved suitably on the blockchain. IPFS is available as a public P2P network, but you can also deploy your own private IPFS network.
If using IPFS, I also recommend using IPFS-Cluster over IPFS, to manage persistence and replication.
https://ipfs.io/
https://cluster.ipfs.io/
If you need encryption (maybe you don't), you should implement it outside IPFS (in your clients). How you implement it is up to you and your use case. As @GraphicalDot suggests in their comment, you can encrypt your file via AES and store the key in the blockchain (encrypted in turn via ECIES or ECDH) if you need user-level encryption (although it has its drawbacks if your user enrolls new keys). Anyway, Fabric itself only provides privacy at organization level (not at user level).
I think the answer also comes down to the business case of why you want to store images on chain. If you are trying to establish provenance, you can do that with metadata stored on chain and the images off chain, etc. If you want to make sure the images are stored for immutability, then maybe, but if you want to be able to delete them or modify them down the road, then maybe not.

Hyperledger Fabric private data collection to distribute large files

We are currently researching Hyperledger Fabric, and from the documentation we know that a private data collection can be set up among some subset of organizations. There would be a private state DB (aka side DB) at each of these organizations and, per my understanding, the side DB is just like a normal state DB, which normally uses CouchDB.
One of our main requirements is that we have to distribute files (e.g. PDFs) among some subset of the peers. Each file has to be disseminated and stored at the related peers, so centralized storage like AWS S3 or other cloud / server storage is not acceptable. As the files may be large, the physical copies must be stored and disseminated off-chain. The transaction block may only store the hash of these documents.
My idea is that we may make use of the private data collection and the side DB. The physical files can be stored in the side DB (maybe in the form of a base64 string?) and can be distributed via the Gossip Protocol (a P2P protocol), which is a feature of Hyperledger Fabric. The hash of the document along with other transaction details can be stored in a block as usual. As they are all native features of Hyperledger Fabric, I expect the transfer of the files via the Gossip Protocol and the creation of the corresponding block will be in sync.
My question is:
Is this way feasible to achieve the requirement? (Distribution of the files to different peers while creating a new block) I kinda feel like it is hacky.
Is this a good way / practice to achieve what we want? I have been doing research but I cannot find any implementation similar to this.
Most of the tutorials I found online assume that the files can be stored in a single centralized storage like a cloud or some sort of server, while our requirement demands distribution of the files as well. Is my idea described above acceptable and feasible? We are very new to blockchain and any advice is appreciated!
Is this way feasible to achieve the requirement? (Distribution of the files to different peers while creating a new block) I kinda feel like it is hacky.
The workflow of private data distribution is that the orderer bundles the private data transaction, which contains only a hash used to verify the data, into a new block. So you don't have to build a workaround for this, since private data provides it by default. The data itself gets distributed between authorized peers via the gossip data dissemination protocol.
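To make that split concrete, here is a toy Python simulation of the idea, not of Fabric's actual code: every peer records only hashes in its copy of the ledger, while the payload itself reaches authorized peers separately and is checked against the hash before being kept. `Peer`, `accept_private` and `commit_private` are hypothetical names invented for this sketch.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Peer:
    name: str
    authorized: bool                             # member of the collection?
    ledger: list = field(default_factory=list)   # hashes only, on every peer
    side_db: dict = field(default_factory=dict)  # payloads, members only


def accept_private(peer: Peer, key: str, received: bytes, expected: str) -> None:
    """Keep a payload received out of band only if it matches the block hash."""
    if sha256_hex(received) != expected:
        raise ValueError("private data does not match the hash in the block")
    peer.side_db[key] = received


def commit_private(peers: list, key: str, value: bytes) -> str:
    """Simulate one private data transaction across all peers."""
    value_hash = sha256_hex(value)
    entry = {"key_hash": sha256_hex(key.encode()), "value_hash": value_hash}
    for peer in peers:
        peer.ledger.append(dict(entry))  # the block carries hashes only
        if peer.authorized:
            accept_private(peer, key, value, value_hash)
    return value_hash


peers = [Peer("org1", True), Peer("org2", True), Peer("org3", False)]
commit_private(peers, "doc-1", b"pdf bytes")
print({p.name: list(p.side_db) for p in peers})
```

In this toy run the unauthorized peer ends up with the hash entry but an empty side DB, which is the property the question is after.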
Is this a good way / practice to achieve what we want? I have been doing research but I cannot find any implementation similar to this.
Yes and no, sorry to say. It depends on your file sizes and the number of files. Fabric is capable of providing really high throughput. I would test things out and see whether it meets your requirements.
The other approach would be a workaround using IPFS (a P2P file system). You can read more about that approach here.
And here is an article discussing storing 'larger files' on chain. Maybe it gives some constructive insights as well. But keep in mind this is an older article.
Check out IBM Blockchain Document Store; it is an implementation of storing any document (PDF or otherwise) both on and off chain. It has been done.
And while the implementation isn't publicly available, there is extensive documentation on its usage, and you can probably glean some information from it.

File Server on Hyperledger Composer?

I am just a beginner in Hyperledger Composer and Fabric.
I am following the IBM tutorial at 'https://www.coursera.org/learn/ibm-blockchain-essentials-for-developers'.
I have one quick question:
How can I create a file server using Hyperledger Composer?
Is it possible now or not?
Any feedback will be very helpful.
Using Composer and Fabric as a File Server would be unusual, and due to the nature of Fabric as a Distributed Ledger, it may well perform badly as a File Server.
There are other Questions and Answers in Stack Overflow about storing images etc using base64 encoding, such as this one.
Depending on your use case, it may be more appropriate to store some hash of the file on the Fabric enabling you to prove the validity of the file, whilst storing the file itself on a dedicated File Server.
Given the nature of the technology, saving files in Hyperledger Fabric and using it as a file server would not be a good strategy.
Hash the original file and use Hyperledger Fabric to store the hash value, along with any other data you need for the system, so that you can validate the file later. But you should use a different server to store the files.
Something similar to this would work, I guess (image: A Sample System Design).
For any change, update or new version, you store the new file on the server, create the new hash and add the data as a new entry in Hyperledger Fabric.
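The versioning step above could look roughly like this in Python. It is only a sketch under stated assumptions: `VersionLedger` is a hypothetical in-memory, append-only stand-in for the entries you would write to Fabric, and the files themselves are assumed to live on the separate file server.

```python
import hashlib
from typing import Dict, List, Optional


class VersionLedger:
    """Hypothetical append-only record of file hashes, keyed by document ID."""

    def __init__(self) -> None:
        self._history: Dict[str, List[dict]] = {}

    def record_version(self, doc_id: str, data: bytes) -> dict:
        """Add a new entry for a new or updated file; old entries are kept."""
        versions = self._history.setdefault(doc_id, [])
        entry = {
            "version": len(versions) + 1,
            "sha256": hashlib.sha256(data).hexdigest(),
            "size": len(data),
        }
        versions.append(entry)
        return entry

    def find_version(self, doc_id: str, data: bytes) -> Optional[int]:
        """Return which recorded version these bytes are, or None if unknown."""
        digest = hashlib.sha256(data).hexdigest()
        for entry in self._history.get(doc_id, []):
            if entry["sha256"] == digest:
                return entry["version"]
        return None
```

For example, after `record_version("img-1", v1_bytes)` and `record_version("img-1", v2_bytes)`, a file fetched from the server can be passed to `find_version` to tell whether it is version 1, version 2, or something that was never registered.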

Limitations of Hyperledger Fabric

I'm just wondering what the limitations of Hyperledger Fabric are in terms of how much data can be stored on each of the peers.
Following on from this question, I'm wondering what your options are for managing large amounts of data on a Hyperledger network, i.e. with decentralised networks etc.
I'm struggling to find good resources on this so it would be great if somebody could fill me in or point me to some good resources on the topic!
Your question is too broad. However, I will narrow it down to a few points and try to answer as concisely as possible.
The amount of data that can be stored is a direct function of the peer's capacity. However, storing large amounts of data on the ledger is not recommended for any ledger, because scaling is an issue. You can opt for off-chain storage like IPFS to hold bulky data and keep a proof of it as a hash on the ledger. Next, you can try segregating data into channels and have peers join a channel on a need basis.
Moreover, have your data properly indexed (in CouchDB) and consider the replication and scalability aspects of CouchDB itself.
I hope I was able to touch the tip of the iceberg. For more details, you can reach me offline for further discussions.
