I am a newbie to Druid. I am able to ingest data into Druid from S3 and Kafka. Now I want to load data from Cassandra hosted in an AWS private subnet.
Is it even possible? If yes, please share some resources.
No. There doesn't seem to be any support for direct ingestion from Cassandra. But you could set up Cassandra CDC to Kafka and use Kafka ingestion.
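For the Kafka half of that route, here is a minimal sketch of a Druid Kafka supervisor spec, built as a Python dict. The datasource name, topic name, broker address, and column names are all placeholders invented for illustration, not values from the question. The field layout follows the Kafka supervisor spec format as I understand it, so check it against the docs for your Druid version.

```python
import json


def build_kafka_supervisor_spec(datasource, topic, bootstrap_servers,
                                timestamp_column, dimensions):
    """Build a Druid Kafka supervisor spec as a plain dict."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Column holding the event time of each change record.
                "timestampSpec": {"column": timestamp_column, "format": "auto"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Placeholder values: replace with your own topic, broker, and columns.
spec = build_kafka_supervisor_spec(
    datasource="cassandra_changes",
    topic="cassandra.cdc.events",
    bootstrap_servers="kafka-broker:9092",
    timestamp_column="ts",
    dimensions=["id", "status"],
)
spec_json = json.dumps(spec, indent=2)
```

The resulting JSON would be POSTed to the Overlord's `/druid/indexer/v1/supervisor` endpoint, the same way as for any other Kafka ingestion you already have working.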
Related
Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source, e.g. a SQL database or any RDBMS?
I have looked at a few similar questions on SO, e.g.:
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know if it's officially supported in Apache Spark.
If there is any sample code, that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but that is database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka, or something similar, from which it can be consumed by Spark.
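To illustrate what consuming such a change list involves, here is a minimal sketch in plain Python that applies Debezium-style change events to an in-memory table. It assumes the usual Debezium envelope with `op`, `before`, and `after` fields. The primary key name `id`, the sample rows, and the dict standing in for the target table are hypothetical. In a real pipeline the same logic would sit in the Spark job that reads the Kafka topic.

```python
import json


def apply_change_event(table, event_json, key="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    event = json.loads(event_json)
    # With schemas enabled, the envelope is nested under "payload".
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, or update: upsert the new row image
        row = payload["after"]
        table[row[key]] = row
    elif op == "d":
        # delete: remove the row identified by the old row image
        row = payload["before"]
        table.pop(row[key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return table


# Hypothetical events, as they might arrive from a Kafka topic.
events = [
    json.dumps({"op": "c", "before": None, "after": {"id": 1, "name": "a"}}),
    json.dumps({"op": "u", "before": {"id": 1, "name": "a"},
                "after": {"id": 1, "name": "b"}}),
    json.dumps({"op": "c", "before": None, "after": {"id": 2, "name": "x"}}),
    json.dumps({"op": "d", "before": {"id": 2, "name": "x"}, "after": None}),
]

table = {}
for e in events:
    apply_change_event(table, e)
```

After the loop, only the updated row with `id` 1 remains, which is the same upsert-or-delete behavior a MERGE into the target store would give.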
I am on a project now architecting this using SharePlex CDC from Oracle, writing to Kafka, and then using Spark Structured Streaming with the Kafka integration and MERGE on Delta format on HDFS.
I.e. that is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed CDC.
So direct JDBC streaming is not possible.
I don't see a Cassandra source connector listed at https://flink.apache.org/ecosystem.html, so I'm wondering how I can use data that is stored in Cassandra for state.
Thanks
One method would be to send the data from Cassandra to Kafka and then use Kafka as the data source.
I have a system that generates 100,000 rows/s, each row is 1 KB, and I want to use Cassandra as the database.
I get the data from Apache Kafka and then need to insert it into the database.
What is the best way to load this volume of data into Cassandra?
Kafka Connect is designed for this. On this page you will find a list of connectors, including Cassandra sink connectors: https://www.confluent.io/product/connectors/
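To get a feel for the volume before sizing connector tasks and the cluster, here is a rough back-of-the-envelope calculation from the numbers in the question. It takes 1 KB as 1024 bytes. The replication factor of 3 is an assumption for illustration, and compression, compaction, and per-row overhead are ignored.

```python
ROWS_PER_SECOND = 100_000
ROW_SIZE_BYTES = 1024        # 1 KB per row, taken as 1024 bytes
SECONDS_PER_DAY = 86_400


def ingest_rate_bytes_per_second(rows_per_second, row_size_bytes):
    """Raw payload rate arriving from Kafka."""
    return rows_per_second * row_size_bytes


def daily_volume_bytes(rate_bytes_per_second, replication_factor=1):
    """Bytes written per day across the cluster for a given replication factor."""
    return rate_bytes_per_second * SECONDS_PER_DAY * replication_factor


rate = ingest_rate_bytes_per_second(ROWS_PER_SECOND, ROW_SIZE_BYTES)
raw_per_day = daily_volume_bytes(rate)
# Assumed replication factor of 3; use your keyspace's actual setting.
replicated_per_day = daily_volume_bytes(rate, replication_factor=3)

print(f"ingest rate: {rate / 1e6:.1f} MB/s")
print(f"raw per day: {raw_per_day / 1e12:.2f} TB")
print(f"with RF=3:   {replicated_per_day / 1e12:.2f} TB")
```

That is roughly 100 MB/s of sustained writes and close to 9 TB of raw data per day, so the connector's parallelism (number of tasks and topic partitions) matters as much as the choice of connector.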
We are working on storing user click data that arrives as a stream.
I have been investigating possible ways to do it on AWS.
One way is to use DynamoDB to store the data, along with the native AWS tools.
The other is to install Spark Streaming with Cassandra.
DataStax provides an integrated package to install them on AWS.
From the references I found online,
it seems that using native DynamoDB could be more expensive,
but it would save time on maintaining the system.
Does anyone have experience with this and could share some insights and suggestions on the pros and cons of both?
Furthermore, we want a system in which the database can handle both batch and streaming data, as in the lambda architecture, so that both end up in the same database. As far as I know, Cassandra is good for this case. Does DynamoDB support it as well?
Thank you so much!
I have Spark Streaming on a virtual machine, and I would like to connect it to another VM that runs Kafka. I want Spark to get the data from the Kafka machine.
Is it possible to do that?
Thanks
Yes, it is definitely possible. In fact, this is the reason why we have distributed systems in place :)
When writing your Spark Streaming program, if you are using Kafka, you will have to create a Kafka config data structure (the syntax will vary depending on your programming language and client). In that config structure, you will have to specify the Kafka broker's IP address. This would be the IP of your Kafka VM.
You then just need to run the Spark Streaming application on your Spark VM.
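As a rough sketch of that config structure in PySpark: the example below uses the Structured Streaming Kafka source rather than the older DStream API, which is a substitution on my part. The IP `10.0.0.5`, the port, and the topic name `events` are placeholders. It also assumes the `spark-sql-kafka` package is on the Spark classpath and that the broker advertises an address reachable from the Spark VM.

```python
def kafka_source_options(broker_host, topic, port=9092, starting_offsets="latest"):
    """Build the option map for Spark's Kafka source."""
    return {
        # Address of the Kafka VM as seen from the Spark VM.
        "kafka.bootstrap.servers": f"{broker_host}:{port}",
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def build_session(app_name="kafka-reader"):
    """Create a SparkSession; pyspark is imported here so the rest runs without it."""
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame backed by the remote Kafka topic."""
    return spark.readStream.format("kafka").options(**options).load()


# Placeholder broker IP and topic: use the values for your Kafka VM.
options = kafka_source_options("10.0.0.5", "events")
```

On the Spark VM you would then call `read_kafka_stream(build_session(), options)` and attach a sink with `writeStream`. The only Kafka-specific part that ties the two machines together is the `kafka.bootstrap.servers` entry.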
It's possible, and it makes perfect sense to have them on separate VMs. That way there is a clear separation of roles.