Encrypt Data in Kafka?

My team is using Kafka, and we need to add encryption for security compliance: data must be encrypted before it is published to Kafka and decrypted when an authorized consumer reads it from Kafka. I see Kafka offers TLS security options, but that doesn't seem to address our needs. TLS secures communication, but internally the data is still stored unencrypted. With some searching I found KIP-317: Add end-to-end data encryption functionality to Apache Kafka (https://cwiki.apache.org/confluence/display/KAFKA/KIP-317%3A+Add+end-to-end+data+encryption+functionality+to+Apache+Kafka), which seems to address our use case, but that KIP appears to have stalled and was never finished.
One straightforward option is to add a custom encryption layer on top of the Kafka API: programs publishing events to Kafka use an encryption library to encrypt the data before publishing, and programs consuming events use the same library to decrypt messages consumed from Kafka. This would work and is simple.
Is there a better solution or a more standard solution?

As you suggested, the easiest and most straightforward way to solve this is to encrypt the message at the application level before sending it and decrypt it after receiving it; instead of sending an object, you are sending an opaque blob.
For a more elegant and optimized approach, I would go with a custom serde. One of the advantages of Kafka is that data is processed and manipulated in binary form, so you are already using a serde to convert to and from bytes.
By writing a custom serde you should get no overhead other than the obvious cost of encrypting/decrypting the bytes. Furthermore, going this way makes the encryption completely transparent to the application: you could easily have an unencrypted dev environment while using the encrypting serde in production just by changing two lines in application.properties (or the equivalent), with no recompile required. You can also have a single person working on the serde while the rest of the team works on the software; when the serde is done, you just drop it in and you have encryption. A sketch of such a serializer is shown below.
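As an illustration only, here is a minimal sketch of the encrypting half of such a serde, wrapping the stock StringSerializer and AES-GCM-encrypting its output. The class name, the secret.key configuration property, and the key handling are assumptions made for the sketch; a real implementation would fetch keys from a KMS and ship a matching deserializer.

```java
// Minimal sketch of an encrypting Kafka serializer (illustrative, not a
// standard Kafka class). It wraps StringSerializer and AES-GCM-encrypts
// the resulting bytes; the "secret.key" config property is a placeholder
// for real key management.
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;

import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class EncryptingStringSerializer implements Serializer<String> {
    private final StringSerializer inner = new StringSerializer();
    private final SecureRandom random = new SecureRandom();
    private SecretKeySpec key;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        inner.configure(configs, isKey);
        // Illustrative: read a base64 key from config; use a KMS in practice.
        byte[] raw = Base64.getDecoder().decode((String) configs.get("secret.key"));
        key = new SecretKeySpec(raw, "AES");
    }

    @Override
    public byte[] serialize(String topic, String data) {
        if (data == null) {
            return null;
        }
        try {
            byte[] iv = new byte[12];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(inner.serialize(topic, data));
            // Prepend the IV so the matching deserializer can rebuild the cipher.
            byte[] out = new byte[iv.length + ciphertext.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
            return out;
        } catch (Exception e) {
            throw new RuntimeException("Encryption failed", e);
        }
    }

    @Override
    public void close() {
        inner.close();
    }
}
```

The matching deserializer would split off the IV, decrypt, and delegate to StringDeserializer. Switching between this class and the plain StringSerializer in the producer/consumer configuration is exactly the "two lines in application.properties" change described above.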
You could also check repositories like this and this; you might be able to use them as they are, fork them, or at least get some inspiration.
Disclaimer: I have never tested any of the three links I referenced in this answer, but the principle behind them is sound.

Related

Apache Cassandra - Listeners

I wonder if it is possible to add a listener to Cassandra that reports the table and the primary key of changed entries? It would be great to have such a mechanism.
Checking the Cassandra documentation, I only found adding StateListener(s) to the Cluster instance.
Does anyone know how to do this without hacking Cassandra's data store or wrapping the driver and doing something on my own?
Check out this future jira --
https://issues.apache.org/jira/browse/CASSANDRA-8844
If you like it vote for it : )
CDC
"In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. Also, Change data capture (CDC) is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources."
-Wikipedia
As Cassandra is increasingly being used as the Source of Record (SoR) for mission critical data in large enterprises, it is increasingly being called upon to act as the central hub of traffic and data flow to other systems. In order to try to address the general need, we propose implementing a simple data logging mechanism to enable per-table CDC patterns.
If clients need to know about changes, the world has mostly gone to the message broker model: a middleman which connects producers and consumers of arbitrary data. You can read about Kafka, RabbitMQ, and NATS here. There is an older DZone article here. In your case, the client writing to the database would also send out a change message. What's nice about this model is that you can then pull whatever you need from the database.
Kafka is interesting because it can also store data. In some cases, you might be able to dispose of the database altogether.
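To make the "client also sends a change message" idea concrete, here is a minimal sketch using the Kafka Java producer. The topic name, the key/value format, and the Cassandra write it follows are assumptions of the sketch, not a fixed convention.

```java
// Minimal sketch of the dual-write idea: after writing to the database,
// publish a change message to Kafka so downstream consumers can react.
// Broker address, topic name, and message format are illustrative.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChangePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... the Cassandra write for the changed row would happen here ...

            // Then publish a change message keyed by table and primary key.
            String key = "employees:42";
            String value = "{\"table\":\"employees\",\"pk\":42,\"op\":\"INSERT\"}";
            producer.send(new ProducerRecord<>("customer-changes", key, value));
        }
    }
}
```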
Are you looking for something like triggers?
https://github.com/apache/cassandra/tree/trunk/examples/triggers
A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. The trigger is mostly used for maintaining the integrity of the information on the database. For example, when a new record (representing a new worker) is added to the employees table, new records should also be created in the tables of the taxes, vacations and salaries.
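For a rough idea of what a trigger looks like, here is a minimal sketch against the ITrigger interface used by the linked examples (Collection<Mutation> augment(Partition) in Cassandra 3.x/4.x). The class name and the logging body are illustrative, and the exact interface should be checked against your Cassandra version; triggers run inside the Cassandra JVM, so keep them lightweight.

```java
// Minimal sketch of a Cassandra trigger that reports changed partitions.
// It returns no extra mutations; a real trigger might write an audit row
// or publish a message instead.
import java.util.Collection;
import java.util.Collections;

import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.db.partitions.Partition;
import org.apache.cassandra.triggers.ITrigger;

public class ChangeLogTrigger implements ITrigger {
    @Override
    public Collection<Mutation> augment(Partition update) {
        // Log the partition key of the modified partition.
        System.out.println("Change detected, partition key: " + update.partitionKey());
        return Collections.emptyList();
    }
}
```

The compiled jar is dropped into Cassandra's triggers directory and attached to a table with CREATE TRIGGER ... USING '<trigger class name>'.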

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, coupling you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs such as Streams and Connect which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I have had to add a bit on top to get exactly-once delivery (Kafka 0.11.0 should solve this; see the sketch at the end of this answer).
Overall, think of Kafka as a lower-level solution with logical message domains and queues, whereas from what I skimmed, Apex is a more heavily packaged library with a lot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing via its consumer API.
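As a footnote to the exactly-once point above: Kafka 0.11 introduced an idempotent producer mode that removes duplicates caused by producer retries. A minimal sketch of enabling it with the Java client (broker address is illustrative):

```java
// Minimal sketch of enabling the idempotent producer added in Kafka 0.11.
// This deduplicates broker-side retries; full exactly-once processing
// additionally needs transactions or careful consumer-side handling.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class IdempotentProducerFactory {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return new KafkaProducer<>(props);
    }
}
```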
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
what's the latency required? That is, what's the maximum time it should take for an update to propagate through the system? The answer to this question influences question 1.
how much data are we talking about? Are you in the gigabyte, terabyte, or petabyte range? Different tools have different 'maximum altitudes'.
and what format? Do you have text files, or are you pulling from relational DBs?
cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3 (data size), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.).
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data into S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on which language/technology you're most comfortable with. Also, there are similar products from Google (BigQuery, etc.) and Microsoft (Azure). If you do end up on Spark, a sketch of the dedup step is shown below.
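If Spark on EMR is the route you take, the dedup step itself can be quite short. A minimal sketch, assuming the customer data already sits in S3 as JSON with a customer_id column (bucket, paths, and column name are illustrative):

```java
// Minimal sketch of a Spark dedup job: read raw customer records from S3,
// keep one row per customer_id, and write the cleansed data back out.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DedupJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("customer-dedup")
                .getOrCreate();

        Dataset<Row> customers = spark.read().json("s3://my-bucket/raw/customers/");

        // Keep one row per customer_id; more elaborate cleansing rules
        // (fuzzy matching, survivorship) would replace this line.
        Dataset<Row> deduped = customers.dropDuplicates(new String[] {"customer_id"});

        deduped.write().parquet("s3://my-bucket/cleansed/customers/");
        spark.stop();
    }
}
```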
Yes, you can use Apache Apex for your use case. Apache Apex is supported with Apache Malhar which can help you build application quickly to load data using JDBC input operator and then either store it to your cloud storage ( may be S3 ) or you can do de-duplication before storing it to any sink. It also supports Dedup operator for such kind of operations. But as mentioned in previous reply, Apex do need Hadoop underneath to function.

What is the best method to fetch social media data?

Hey, I am new to big data. I am making a system which will fetch data from social media and process the results; for this I am using Apache Spark.
The flow of my model is as follows:
the user will save the desired keywords using a webpage made in PHP.
with those keywords I would be fetching data from social media, processing the data (e.g. sentiments and views) and then providing it to the end user.
Now my confusion is how I should fetch data from social media: using
Apache Kafka,
Apache Flume,
or by directly calling an API such as twitter4j (just an example).
Though I have to learn to implement all three data fetching techniques, if I happen to use the direct API then I can skip the whole Hadoop part. It would be great if you could suggest which one is better.
All of the above I am doing on a local machine. I have completed the UI part, and now I am in the phase where I have to fetch data.
Thanks.
I guess I will make this a suggestion.
You may not want to fetch data from any source using a distributed system, unless you plan to DDoS their production server. If your cluster is set up behind one router, your whole cluster may be blacklisted, because all nodes together consistently hit the access rate limit as the requests add up at your router, depending on how powerful the server is. Twitter's servers don't care about 100 threads, to be honest (provided you know what you are doing), but any startup will probably get to you right away.
If you have a workstation with 4 cores, having it catch streaming data should suffice for the initial stage of academic research. Or, if you really want tons of data, you can perhaps do Hadoop Streaming with your fetcher script as the mapper and no reducer; quick and easy. If you are a superstar in Java or Scala, run a fetching thread on each vcore of a Spark executor.
Now, Twitter has a REST API, which means you can pretty much fetch data in any programming language. Of course, it is sometimes easier to use existing client libraries; assuming they are well maintained, they are almost always more robust. But I get lazy all the time; for example, I sometimes just want a sample data point, so I pipe curl into jq to check what I want to check.
Yes, learn about jq too; it will save you tons of time. And be a gentleman who doesn't DDoS people.
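Since twitter4j came up in the question as the direct-API option, here is a minimal sketch of a keyword search with it. It assumes OAuth credentials are supplied via a twitter4j.properties file on the classpath, and the keyword is illustrative; availability and rate limits depend on Twitter's current API terms.

```java
// Minimal sketch of fetching tweets for a keyword with twitter4j.
// Credentials are expected in twitter4j.properties; mind the rate limits.
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class KeywordFetcher {
    public static void main(String[] args) throws Exception {
        Twitter twitter = TwitterFactory.getSingleton();

        Query query = new Query("big data");
        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            System.out.println(status.getUser().getScreenName()
                    + ": " + status.getText());
        }
    }
}
```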

How to do Data Encryption In Hadoop?

How can I apply AES encryption to Hadoop? Is it possible? If not, please suggest how I can encrypt data files in Hadoop.
The latest versions of Hadoop support encryption. We can create encryption zones, and the data transferred into these zones is encrypted automatically, while the data retrieved from them is decrypted automatically. This is also known as data-at-rest encryption (HDFS transparent encryption). The detailed steps are given on the Apache website. This doesn't require any change in the code that accesses the data.
This can also be equated to server-side encryption.
If you want custom encryption applied to the files in HDFS, it will be a little more complex, because you have to apply the encryption/decryption logic in all the programs that use the data. If the data is encrypted using custom encryption logic, the RecordReader and RecordWriter classes need to be modified to work with the data.
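For the transparent option, encryption zones are normally set up with the hadoop key and hdfs crypto command-line tools, but they can also be created from Java. Below is a minimal sketch using the HdfsAdmin client API (available since Hadoop 2.6); the NameNode address, path, and key name are illustrative, the key must already exist in a configured KMS, and the exact method signature should be checked against your Hadoop version.

```java
// Minimal sketch of creating an HDFS encryption zone programmatically.
// Assumes a Hadoop KMS is configured and a key named "mykey" already exists.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class CreateEncryptionZone {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), conf);

        // Files written under this path are encrypted at rest and decrypted
        // transparently for authorized readers.
        admin.createEncryptionZone(new Path("/secure/zone"), "mykey");
    }
}
```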

Streaming Big Data - where to store intermediate results?

I am working on a Spark Streaming job that needs to store intermediate results in order to reuse them in the next window of the stream. The amount of data is extremely large, so there is probably no way to store it in the Spark cache. On top of that, I need some way to read the data by a 'key'.
I was thinking about Cassandra as intermediate storage, but it also has some drawbacks.
Alternatively, maybe Kafka would do the job, but it would require additional work in order to select a given portion of the data by key.
Could you advise me on what I should do?
How are such problems resolved in Storm: is there any internal mechanism, or is it preferred to use some external tools?
Solr as index + Cassandra as NoSQL storage is working fine for my use case, where I have to process terabytes of data; but in my case I am using Cassandra for persistent storage of years of data.
Kafka is working fine as a replacement for JBoss/A-MQ due to its simple architecture. Currently I am using Apache Storm + Kafka for real-time stream processing in one of my projects.
Since you are storing intermediate data, I think Kafka is the best choice, provided you set the right retention period (a sketch of doing that at topic-creation time is shown below).
Have a look at one more SE question and another article.
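For the retention point above, here is a minimal sketch of creating the intermediate topic with an explicit retention period using the Kafka Java AdminClient; the topic name, partition count, replication factor, and one-hour retention are illustrative.

```java
// Minimal sketch: create a Kafka topic for intermediate results with a
// retention period of one hour (retention.ms), after which old data is
// eligible for deletion.
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateIntermediateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("intermediate-results", 6, (short) 3)
                    .configs(Collections.singletonMap("retention.ms", "3600000"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```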
As you mention, Kafka has some problems getting items by key; it really only provides APIs for a FIFO paradigm. I would advise using dedicated storage software (Cassandra, MongoDB, I have even seen Solr used to store text). It would be easier to use something designed for key retrieval than to try to modify Kafka yourself and most likely introduce bugs/issues that could take forever to solve.
As SQL.injection said, you'll have to manage the storage and the logic by yourself. Storm doesn't offer such a mechanism.
