I have a system that generate 100,000 rows/s and size of each row is 1KB and want to use Cassandra for database.
I get data from Apache Kafka and then should insert it into database.
What is the best way for load this volume of data into Cassandra ?
Kafka Connect is designed for this. On this page you will find a list of connectors including a Cassandra sink connectors https://www.confluent.io/product/connectors/
Related
I have a requirement to implement the solution for below usecase.
Currently Applications are storing data into Postgres database but Postgres database is facing storage issue. So the plan is to move the data from postgres to Hadoop with near realtime data in hadoop. So we thought of below solution .
Write Kafka producer application to listen to postgres tables and capture changing data and write to Kafka Topic .
Write a Kafka sink application to read from kafka topic and write to hive tables(parquet -- external tables -- partitioned and non partitioned) . So for non partitioned tables if we want to apply updates/deletes then we need to touch the whole table in spark code right? which will lead to performance degrade for every record getting from kafka topic . We have already developed sqoop incremental job which runs for every 5 minutes to do the same. But client needs real time data in hadoop so kafka+spark processing came into discussion .
Could you provide pro's and con's for step2 comparing to sqoop incremental.
please share code snippets/links if any which helps my thought process.
Getting data into Kafka is easy - use Debezium.
For getting it out...
I wouldn't use Hive at all for this. Real time data (depending on on the volume of the data, obviously) results in tiny files in HDFS. Subsequently, Hive queries become slower and slower over time.
Hive is not a replacement for Postgres. In fact, the Hive metastore requires a relational database still, such as Postgres.
I also wouldn't use Spark. You have to write code when ingesting Kafka topics into queryable formats is already a solved problem with other tools.
Popular options include Apache Pinot, Druid, or Apache Iceberg storage with Presto (some of which may overlap with HDFS storage, but will be much, much faster than Hive to query). Only the third option requires writing Kafka consumer code; the other two have native Kafka ingestion.
And even still, if you're stuck with HDFS, Kafka Connect framework comes with Kafka. There's an HDFS Sink plugin, written by Confluent, which supports Hive integration.
What I have?
I have Spark Streaming Application (on Kafka Streams) on Hadoop Cluster that aggregates each 5 minutes users' clicks and some actions done on a web site and
converts them into metrics.
Also I have a table in GreenPlum (on its own cluster) with users data that may get updated. This table is filled using Logical Log Streaming Replication via Kafka. Table size is 100 mln users.
What I want?
I want to join Spark Streams with static data from GreenPlum every 1 or 5 minutes and then aggregate data already using e.g. user age from static table.
Notes
Definitely, I don't need to read all records from users table. There are rather stable core segment + number of new users registering each minute.
Currently I use PySpark 2.1.0
My solutions
Copy data from GreenPlum cluster to Hadoop cluster and save it as
orc/parquet files. Each 5 minute add new files for new users. Once a
day reload all files.
Create new DB on Hadoop and Setup Log replication via Kafka as it is
done for GreenPlum. Read data from DB and use built in Spark
Streaming joins.
Read data from GreenPlum on Spark in cache. Join stream data with
cache.
For each 5 minute save/append new user data in a file, ignore old
user data. Store extra column e.g. last_action to truncate this
file if a user wasn't active on web site during last 2 weeks. Thus,
join this file with stream.
Questions
What of these solutions are more suitable for MVP? for Production?
Are there any better solutions/best practices for such sorts of
problem. Some literature)
Spark streaming reading data from a cache like Apache geode make this better. used this approach in real-time fraud use case. In a nut shell I have features generated on Greenplum Database using historical data. The feature data and some decision making lookup data is pushed in to geode. Features are periodically refreshed (10 min interval) and then refreshed in geode. Spark scoring streaming job constantly scoring the transactions as the come in w/o reading from Greenplum. Also spark streaming job puts the score in geode, which is synced to Greenplum using different thread. I had spark streaming running on cloud foundry using k8. This is a very high level but should give you an idea.
You might want to check out the GPDB Spark Connector --
http://greenplum-spark-connector.readthedocs.io/en/latest/
https://greenplum-spark.docs.pivotal.io/130/index.html
You can load data directly from the segments into Spark.
Currently, if you want to write back to GPDB, you need to use a standard JDBC to the master.
I'm looking for an optimal data architecture.
I'm dealing with TS data that is flushed from a Redis database to OpenTSDB database each week.
OpenTSDB stores its data on HBase which is launched on a Hadoop cluster.
Then, the time series data available on OpenTSDB has to be batch processed (at 1-6 months interval).
Knowing that OpenTSDB data is stored in Binary large object format on HBase, I can't currently tackle HBase HTTP API.
Since Spark cannot directly access OpenTSDB API (While Kafka seems to be Okay with HTTP Api)... I'm facing architecture issues which can be expressed as follows, would it be more convenient to:
Use Apache Kafka to extract batch data (TByte) and use it as a pipeline to ingest and analyze data into Spark Dataframes ?
Flush redis data directly in HBase and hence use Spark directly from it ?
That said, I want to be sure than Spark can handle terabytes batch analytics and kafka can handle that amount before loading it as Spark RDD.
Any suggestion on help will be welcome. Thanks
I don't see Cassandra connector for source in https://flink.apache.org/ecosystem.html . So wondering how can I use data that is stored in Cassandra for state.
Thanks
one method would be to send data from Cassandra to Kafka and then use Kafka as the data source.
I have a Kafka - Spark Streaming application to ingest and process 60K events per min. I need a database to store my transformed dataframes to be accessed by visualization layer. Can Redshift be used for this with Spark Streaming or should Cassandra be used? I will be processing and storing the dataframes in every spark window of 30 seconds. Also I need to read from the datastore in every window. I guess Redhsift is primarily a data warehousing database not for OLTP sort of the processing.. any ideas?
You should check out SnappyData. SnappyData deeply integrates an in-memory database with Spark that allows hybrid OLTP/OLAP applications. You can write Spark Streaming applications on top of Snappy that can update/delete data from the database. Further, because it does not go over a connector, it performs better than the myriad datastores that have Spark connectors and even the native Spark cache. There may be other datastores that offer hybrid OLTP/OLAP applications on Spark in the aforementioned link.
Disclaimer: I am a SnappyData employee.