Securing data at rest in Kafka

We are preparing for our first deployment of Kafka to production, and I'm wondering about the best way to implement data-at-rest security.
I've seen a few articles about end-to-end security/encryption (e.g., https://lenses.io/blog/2017/06/kafka-data-security/, https://blog.codecentric.de/en/2016/10/transparent-end-end-security-apache-kafka-part-1/, and https://www.symantec.com/connect/blogs/end-end-encryption-though-kafka-our-proof-concept), but we are not looking for end-to-end security, since Kafka runs within our internal network; we just want to secure the data at rest.
Which would be the best approach: disk encryption, filesystem-level encryption, or a combination of both? What would be the impact on performance?
What about ZooKeeper? Do we need to encrypt ZooKeeper's data as well?
Thanks in advance.

Related

Encrypt Data in Kafka?

My team is using Kafka, and we need to add encryption for security compliance reasons, so that data is encrypted before it is published to Kafka and decrypted when an authorized consumer consumes it from Kafka. I see Kafka offers TLS security options, but that doesn't seem to address our needs: TLS secures communication, but internally the data is still stored unencrypted. With some searching I found KIP-317: Add end-to-end data encryption functionality to Apache Kafka (https://cwiki.apache.org/confluence/display/KAFKA/KIP-317%3A+Add+end-to-end+data+encryption+functionality+to+Apache+Kafka), which seems to address our use case, but that KIP appears to have stalled and was never finished.
One option is to add a custom encryption layer on top of the Kafka API: programs publishing events to Kafka use an encryption library and encrypt the data before publishing, and programs consuming events use the same library to decrypt the messages they consume from Kafka. This would work and is simple.
Is there a better solution or a more standard solution?
As you suggested, the easiest and most straightforward way to solve this is to encrypt the message at the application level, before sending it and after receiving it: instead of sending an object, you send a blob.
For a more elegant and optimized approach, though, I would go with a custom serde. One of the advantages of Kafka is that the data is processed and manipulated in binary form, so you are already using a serde to convert to and from binary.
By writing a custom serde you should incur no overhead other than the obvious cost of encrypting/decrypting the bytes. Going this way also makes the encryption completely transparent to the application: you could easily have an unencrypted dev environment while using the encrypting serde in production just by changing two lines in application.properties (or equivalent), with no recompile required. It also lets a single person work on the serde while the rest of the team works on the software; when the serde is done, you just drop it in and you have encryption.
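A minimal sketch of what such an encrypting serde could look like is below, assuming AES-GCM via javax.crypto and a recent Kafka client; the class name, the wrapped inner serializer/deserializer, and the way the SecretKey is obtained are illustrative, not a standard Kafka API. Key management (storage, rotation, per-topic keys) is deliberately left out.

```java
// Minimal sketch of an encrypting serde that wraps an existing serializer/deserializer.
// AES-GCM with a random IV prepended to each record; key management is omitted.
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

public class EncryptingSerde<T> implements Serializer<T>, Deserializer<T> {
    private static final int IV_LEN = 12;    // recommended IV length for GCM
    private static final int TAG_BITS = 128; // GCM authentication tag length

    private final Serializer<T> innerSer;    // your existing serializer (JSON, Avro, ...)
    private final Deserializer<T> innerDe;
    private final SecretKey key;             // assumed to come from your key management
    private final SecureRandom random = new SecureRandom();

    public EncryptingSerde(Serializer<T> innerSer, Deserializer<T> innerDe, SecretKey key) {
        this.innerSer = innerSer;
        this.innerDe = innerDe;
        this.key = key;
    }

    @Override
    public byte[] serialize(String topic, T data) {
        if (data == null) return null;
        try {
            byte[] plain = innerSer.serialize(topic, data);
            byte[] iv = new byte[IV_LEN];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] encrypted = cipher.doFinal(plain);
            // prepend the IV so the deserializer can recover it
            return ByteBuffer.allocate(IV_LEN + encrypted.length).put(iv).put(encrypted).array();
        } catch (Exception e) {
            throw new RuntimeException("encryption failed", e);
        }
    }

    @Override
    public T deserialize(String topic, byte[] data) {
        if (data == null) return null;
        try {
            ByteBuffer buf = ByteBuffer.wrap(data);
            byte[] iv = new byte[IV_LEN];
            buf.get(iv);
            byte[] encrypted = new byte[buf.remaining()];
            buf.get(encrypted);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            return innerDe.deserialize(topic, cipher.doFinal(encrypted));
        } catch (Exception e) {
            throw new RuntimeException("decryption failed", e);
        }
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { /* no-op in this sketch */ }

    @Override
    public void close() { /* nothing to release */ }
}
```

In a real implementation you would likely read the key material in configure() instead of the constructor, so the class can be instantiated by Kafka directly from the key.serializer/value.serializer producer and consumer properties.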
You could also check out repositories like this and this. You might be able to use them as they are, fork them, or at least get some inspiration.
Disclaimer: I have never tested any of the three links referenced in this answer, but the principle behind them is sound.

Social network app architecture with React+Nodejs and Kafka

I have an idea for a social network website, and as I'm currently learning web development, I thought it would be a great project to practice on. I have already worked out the business logic and the front-end design with React. Now it's time for the backend, and I'm struggling.
I want to create a React + Node.js event-driven app, and it seems logical to use Kafka right away. I looked through various Kafka architecture examples and have several questions:
Is it possible to create an app that passes data through Kafka via API calls from Node.js to React and vice versa, and uses a relational database only for long-term storage?
Or is it better to use Kafka to handle all events but keep the data in a NoSQL database like Cassandra or HBase? In that case, it seems Node.js would have to make API calls to them and send the data to React.
Am I completely missing the point and talking nonsense?
Any answer to this question is going to be quite opinion-based, so I'm just going to try to stick to the facts.
It is totally possible to create such an application. Moreover, Kafka is basically a distributed log, so you can use it as an event store and build your state from that.
That mainly depends on your architecture, and there are too many gaps here to answer this with any certainty - what kind of data are you saving? What kind of questions will you need answered? What does your domain model look like? You could use Kafka as a store, or as a persistent messaging service.
I don't think you're off the mark, but perhaps you're reaching for the big guns when in reality you don't need them. Kafka is great when you have a very large volume of events flowing through. If you're building something new, you won't have that volume yet. Perhaps start with something simpler that doesn't require so much operational complexity.
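To make the "event store" idea from the first point concrete, here is a minimal Java sketch (in a React + Node.js stack you would more likely use a Node client such as kafkajs, but the pattern is identical): replay a topic from the beginning and fold the events into the current state. The topic name and the key/value layout are made up for illustration.

```java
// Minimal sketch: rebuild application state by replaying a Kafka topic from offset 0.
// The topic name "user-events" and the key/value layout are illustrative only.
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StateRebuilder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, String> latestEventByUser = new HashMap<>(); // state derived from the log

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("user-events")) {
                partitions.add(new TopicPartition(p.topic(), p.partition()));
            }
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions); // replay the whole log

            ConsumerRecords<String, String> records;
            do {
                records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // fold each event into the in-memory state
                    latestEventByUser.put(r.key(), r.value());
                }
            } while (!records.isEmpty());
        }
        System.out.println("rebuilt state for " + latestEventByUser.size() + " users");
    }
}
```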

Hyperledger Fabric private data collection to distribute large files

We are currently researching Hyperledger Fabric, and from the documentation we know that a private data collection can be set up among a subset of organizations. There would be a private state DB (a.k.a. side DB) on each of these organizations' peers, and per my understanding, the side DB is just like a normal state DB, which typically uses CouchDB.
One of our main requirements is that we have to distribute files (e.g. PDFs) among a subset of the peers. Each file has to be disseminated to and stored at the related peers, so centralized storage like AWS S3 or other cloud/server storage is not acceptable. As the files may be large, the physical copies must be stored and disseminated off-chain; the transaction block may only store the hashes of these documents.
My idea is that we could make use of the private data collection and the side DB. The physical files can be stored in the side DB (maybe as a base64 string?) and distributed via the gossip protocol (a P2P protocol), which is a built-in feature of Hyperledger Fabric. The hash of the document, along with other transaction details, can be stored in a block as usual. As these are all native features of Hyperledger Fabric, I expect the transfer of the files via the gossip protocol and the creation of the corresponding block to stay in sync.
My questions are:
Is this approach feasible for the requirement (distributing the files to different peers while creating a new block)? It feels a bit hacky to me.
Is this a good way/practice to achieve what we want? I have been doing research but cannot find any implementation similar to this.
Most of the tutorials I found online assume that the files can be stored in a single centralized store like the cloud or some kind of server, while our requirement demands distribution of the files as well. Is the idea described above acceptable and feasible? We are very new to blockchain, and any advice is appreciated!
Is this approach feasible for the requirement (distributing the files to different peers while creating a new block)? It feels a bit hacky to me.
The workflow of private data distribution is that the orderer bundles the private data transaction, which contains only a hash used to verify the data, into a new block. So you don't need a workaround for this, since private data collections provide it by default. The data itself is distributed among authorized peers via the gossip data dissemination protocol.
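As a hedged illustration of that flow, here is a minimal Java chaincode sketch using the fabric-chaincode-java contract API: the file bytes are passed as transient data and written to a private data collection (so they travel only to authorized peers via gossip), while just the SHA-256 hash is written to the channel ledger. The collection name "fileCollection" and the transient key "file" are assumptions for this example; the collection itself still has to be defined in your collections config with the right member organizations.

```java
// Minimal sketch: store a file in a private data collection, its hash on the channel ledger.
// Collection name, transient key and key scheme are illustrative, not a standard API.
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Map;
import org.hyperledger.fabric.contract.Context;
import org.hyperledger.fabric.contract.ContractInterface;
import org.hyperledger.fabric.contract.annotation.Contract;
import org.hyperledger.fabric.contract.annotation.Default;
import org.hyperledger.fabric.contract.annotation.Transaction;
import org.hyperledger.fabric.shim.ChaincodeStub;

@Contract(name = "FileContract")
@Default
public class FileContract implements ContractInterface {

    private static final String COLLECTION = "fileCollection";

    @Transaction
    public String storeFile(Context ctx, String fileId) {
        ChaincodeStub stub = ctx.getStub();
        // The file bytes arrive as transient data, so they never appear in the block itself.
        Map<String, byte[]> transientMap = stub.getTransient();
        byte[] fileBytes = transientMap.get("file");

        // Actual content goes into the private collection (side DB, gossip-distributed).
        stub.putPrivateData(COLLECTION, fileId, fileBytes);

        // Only the hash goes onto the public channel ledger.
        String hashHex = sha256Hex(fileBytes);
        stub.putState(fileId, hashHex.getBytes());
        return hashHex;
    }

    @Transaction
    public String readFile(Context ctx, String fileId) {
        // Only peers that are members of the collection can resolve this read.
        byte[] fileBytes = ctx.getStub().getPrivateData(COLLECTION, fileId);
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    private static String sha256Hex(byte[] bytes) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(bytes)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException("hashing failed", e);
        }
    }
}
```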
Is this a good way/practice to achieve what we want? I have been doing research but cannot find any implementation similar to this.
Yes and no, sorry to say. It depends on your file sizes and volume. Fabric is capable of providing really high throughput; I would test things out and see whether it meets your requirements.
The other approach would be to work around this and use IPFS (a P2P file system). You can read more about that approach here.
And here is an article discussing storing 'larger files' on chain. Maybe this gives some constructive insights as well, but keep in mind it is an older article.
Check out IBM Blockchain Document Store; it is an implementation for storing any document (PDF or otherwise) both on and off chain. It has been done.
While the implementation isn't publicly available, there is extensive documentation on its usage, and you can probably glean some useful information from it.

Performance Difference between Apache Ignite compute grid and Spark

I need to do some computation on a large set of data. In this computation, the data will be manipulated and stored (persisted) to a database.
Can anyone suggest which technology is preferable with regard to performance and resource utilization?
The best way here is to implement your logic and check how it works with both frameworks. Also, you can run different benchmarks in your environment.
You can read about benchmarks for Apache Ignite here:
https://apacheignite.readme.io/docs/perfomance-benchmarking
If you have any technical questions, you can ask the community.

Synchronizing on-premises DB with Cloud (Azure )

I have a mainframe DB2 database within my corporate network. I have been asked to come up with an approach to create a miniature of this DB in Azure. What would be the best way to implement this? What is the best practice for establishing reliable and secure synchronization between these two DBs?
There are commercial products that do this sort of thing; Google "DB2 SQL Server synchronization". If you want a simple way to start, the ETL (Extract, Transform, Load) vendors (Informatica, Tibco, SyncSort, etc.) all have variations on this capability.
It is much more challenging than it sounds because the two databases have such different feature sets. While you might get a simple set of tables to work, as soon as you introduce triggers, stored procedures, EBCDIC vs. ASCII issues and so forth, you'll want all the help you can get.
