I'm trying to create something like a Kafka consumer group, but for Cassandra. The goal is to have a query paginated, with each page processed by a different instance of an app.
Is there any notion like Kafka's consumer group in Cassandra?
The TL;DR is that no, the consumer-group notion doesn't exist in Cassandra or its client drivers. The burden of deciding which client processes what is entirely on the app developer.
You can use Cassandra's tokens to do selective paging.
Assuming 2 clients (an easy example):
Client 1 pages from -2^63 to 0
Client 2 pages from 1 to 2^63 - 1
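The even split of the token range generalizes to any number of clients. Here's a minimal sketch in Python (the bounds are those of the default Murmur3Partitioner; the exact boundary placement is an illustration, not a fixed convention):

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_ranges(num_clients):
    """Split the full Murmur3 token range into contiguous slices,
    one slice per client."""
    total = MAX_TOKEN - MIN_TOKEN + 1          # 2**64 tokens in total
    step = total // num_clients
    ranges = []
    start = MIN_TOKEN
    for i in range(num_clients):
        # Last client absorbs any remainder from integer division.
        end = MAX_TOKEN if i == num_clients - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each client would then page through its slice with something like `SELECT ... WHERE token(pk) >= :start AND token(pk) <= :end`.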
The above idea assumes you want to page through all the data in something similar to a batch process which wouldn't be a good fit for Cassandra.
If you're after the latest N results, where the first half is sent to client 1 and the second half to client 2, you can use a logical bucket in your partition key.
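A sketch of what such a logical bucket could look like (the column layout and the per-row sequence number are made up for illustration):

```python
def bucketed_partition_key(time_bucket, row_seq, num_clients=2):
    """Hypothetical schema: the partition key becomes
    (time_bucket, bucket), where bucket is assigned at write time.
    Client b then reads only rows WHERE bucket = b."""
    bucket = row_seq % num_clients
    return (time_bucket, bucket)
```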
If you're looking to scale the processing of a large number of Cassandra rows, you might consider a scalable execution platform like Flink or Storm. You'd be able to parallelize both the reading of the rows and the processing of the rows, although a parallelizable source (or spout in Storm) is not something you can get out of the box.
Related
I have a CQL table with 2 columns:
{
long minuteTimeStamp -> only the minute part of the epoch time; seconds are ignored
String data -> some data
}
I have a 5-node Cassandra cluster and I want to distribute each minute's data uniformly across all 5 nodes. So if one minute's data is ~10k records, each node should get ~2k records.
I also want to consume each minute's data in parallel, meaning 5 different readers, 1 on each node.
I came up with one solution: keep one more column in the table, like
{
long minuteTimeStamp
int shardIdx
String data
partition key : (minuteTimeStamp,shardIdx)
}
While writing the data, I will do a circular round-robin on shardIdx. But since Cassandra uses vnodes, it is possible that (min0,0) goes to node0 and (min0,1) also goes to node0, as that token might also belong to node0. This way I can create hotspots, and it will also hamper reads: of the 5 parallel readers that wanted to read 1 per node, more than one might land on the same node.
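The round-robin write described above might look like this (a sketch; `next_shard` would be called once per inserted row):

```python
from itertools import count

class RoundRobinSharder:
    """Cycles through shard indexes 0..num_shards-1, one per write."""
    def __init__(self, num_shards):
        self.num_shards = num_shards
        self._counter = count()

    def next_shard(self):
        return next(self._counter) % self.num_shards
```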
How can we design our partition key so that data is uniformly distributed without writing a custom partitioner?
There's no need to make the data distribution more complex by sharding.
The default Murmur3Partitioner will distribute your data evenly across nodes as you approach hundreds of thousands of partitions.
If your use case really is going to hotspot on "data 1", then that's more an inherent problem with your use case/access pattern, but it's rare in practice unless you have a super-node issue, for example in a social-graph use case where Taylor Swift or Barack Obama has millions more followers than everyone else. Cheers!
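You can convince yourself of the even spread with a quick simulation. This uses MD5 as a stand-in for Murmur3 (Cassandra's real partitioner and token-to-node mapping work differently, but any good hash gives a similar spread over many distinct partition keys):

```python
import hashlib
from collections import Counter

def node_for(partition_key, num_nodes=5):
    # Stand-in for the token -> node mapping: hash the key and take
    # it modulo the node count.
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# One partition per minute over roughly 70 days of data.
counts = Counter(node_for(minute) for minute in range(100_000))
```

With 100,000 minute-partitions, each of the 5 simulated nodes ends up with close to 20,000 partitions.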
I am currently managing a Percona XtraDB cluster composed of 5 nodes that handles millions of inserts every day. Write performance is very good, but reads are not so fast, especially when I request a big dataset.
The records inserted are sensor time series.
I would like to try Apache Cassandra to replace the Percona cluster, but I don't understand how data reading works. I am looking for something able to split a query across all the nodes and read in parallel from more than one node.
I know that Cassandra's sharding can have shard replicas.
If I have 5 nodes and I set a replication factor of 5, will reads be 5x faster?
Cassandra read path
The read request initiated by a client is sent to a coordinator node, which asks the partitioner which replicas are responsible for the data and checks whether the consistency level can be met.
The coordinator checks whether it is itself responsible for the data. If yes, it satisfies the request; if not, it sends the request to the fastest-answering replica (determined using the dynamic snitch). A digest request is also sent to the other replicas.
The coordinator compares the returned data digests, and if they all match and the consistency level has been met, the data from the fastest-answering replica is returned. If the digests differ, the coordinator issues read-repair operations.
On each node, a few steps are performed: check the row cache, check the memtables, check the SSTables. More information: "How is data read?" and ReadPathForUsers.
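The digest comparison above can be modeled with a toy function (this is an illustration only, not the real implementation; Cassandra computes digests over the serialized response):

```python
import hashlib

def compare_digests(replica_responses):
    """Return (data, needs_read_repair). If all replica digests match,
    no read repair is needed; otherwise the coordinator reconciles."""
    digests = {hashlib.sha256(repr(rows).encode()).hexdigest()
               for rows in replica_responses}
    return replica_responses[0], len(digests) > 1
```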
Load balancing queries
Since your replication factor equals the number of nodes, each node holds all of your data. So when a coordinator node receives a read query, it can satisfy it from itself (in particular, if you use a LOCAL_ONE consistency level, the request will be pretty fast).
The client drivers implement the load balancing policies, which means that on your client you can configure how queries are spread around the cluster. Some more reading: ClientRequestsRead
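As a sketch of what a driver-side load balancing policy does, here is a toy round-robin coordinator picker (real drivers, such as the DataStax drivers, ship configurable policies like DC-aware round-robin and token-aware routing; this is not their API):

```python
from itertools import cycle

class RoundRobinLoadBalancer:
    """Spreads successive queries across candidate coordinator nodes."""
    def __init__(self, hosts):
        self._hosts = cycle(hosts)

    def pick_coordinator(self):
        return next(self._hosts)
```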
If I have 5 nodes and I set a replication factor of 5, will reads be 5x faster?
No. It means you will have up to 5 copies of the data to ensure that your query can be satisfied when nodes are down. Cassandra does not divide up the work for the read. Instead it tries to force you to design your data in a way that makes the reads efficient and fast.
The best way to read from Cassandra is to make sure that each query you generate hits a single Cassandra partition. That means the first part of a simple primary key (x,y,z), or the first bracket of a compound primary key ((x,y),z), must be provided as query parameters.
This goes back to the Cassandra data-modeling principle of designing your tables around your query needs.
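A tiny helper makes the rule concrete: given the partition-key columns of a table, a query can be routed to a single partition only if it binds all of them (illustrative only):

```python
def hits_single_partition(partition_key_columns, bound_columns):
    """True if the query restricts every partition-key column,
    i.e. it can be routed to a single partition."""
    return set(partition_key_columns).issubset(bound_columns)
```

For the compound key ((x,y),z), binding x and y hits one partition; binding only x forces a scan.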
Replication is about copies of data and Partitioning is about distributing data.
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archPartitionerAbout.html
Some references about Cassandra modelling:
https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key
https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
It is recommended to keep partitions around 100 MB or less, but it's not compulsory.
You can use the cassandra-stress utility to get a report of how your reads and writes perform.
We have an application which the clients use to track their procurement cycle. We need to build a solution which will help the users to pull any column from any table in a particular subject area and they should be able to see all the rows of the result of this join of the tables from which the columns have been pulled. It needs to be similar to a Salesforce kind of reporting solution. We are looking at HDFS and Spark in Azure HDInsight to support these kind of querying capabilities. We would like to know if this is a valid use case for Spark. The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user.
Please let me know if this is something that can be done using Spark.
As per my understanding, Spark is mostly used for batch processing. If your use case is directly user-facing, then I am doubtful about using Spark, because there may be better solutions (or alternative architectures). Joining 500 million rows in real time sounds crazy!
The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user.
This is another thing that puzzled me. Pulling all 500 million rows into the RAM of a single Java process doesn't sound right, for obvious reasons.
Updated
Just using Spark to process huge data will not be effective for real-time solutions (like your use case). But Spark will be very effective if you pre-process your data, cache the results in some other system, and prepare views from those results that can be served to your users. This is more or less the Lambda Architecture.
That is: Spark on a YARN cluster to periodically process the data and generate/update the different views, a distributed storage system (preferably a columnar one) to cache the views, and a REST API to serve the views to users.
Late reply to the question, but in case someone else reads this in the future: AWS Redshift does exactly this.
I have some flows that get data from an Azure Event Hub; I'm using the GetAzureEventHub processor. The data I'm getting is multiplied by the number of nodes I have in the cluster (I have 4 nodes). If I tell the processor to run on the primary node only, the data is not replicated 4 times.
I found that the Event Hub accepts up to 5 readers per consumer group, as I read in this article; each reader has its own separate offset and they consume the same data. So, in conclusion, I'm reading the same data 4 times.
I have 2 questions:
How can I coordinate these 4 nodes so they go through the same reader?
In case this is not possible, how can I tell NiFi that only one of the nodes should read?
Thanks; if you need any clarification, just ask.
GetAzureEventHub currently does not perform any coordination across nodes so you would have to run it on primary node only to avoid duplication.
The processor would require refactoring to perform coordination across the nodes of the cluster and assign unique partitions to each node, and handle failures (i.e. if a node consuming partition 1 goes down, another node has to take over partition 1).
If the Azure client provided this coordination somehow (similar to the Kafka client) then it would require less work on the NiFi side, but I'm not familiar enough with Azure to know if it provides anything like this.
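The coordination being described boils down to assigning each Event Hub partition to exactly one node and reassigning on failure. A toy sketch (not NiFi or Azure client code):

```python
def assign_partitions(partitions, live_nodes):
    """Spread partitions across live nodes round-robin; call again
    with the updated node list when a node joins or fails, so its
    partitions are taken over by the survivors."""
    assignment = {node: [] for node in live_nodes}
    for i, partition in enumerate(partitions):
        assignment[live_nodes[i % len(live_nodes)]].append(partition)
    return assignment
```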
I'm trying to design an architecture of my streaming application and choose the right tools for the job.
This is how it works currently:
Messages from "application-producer" part have a form of (address_of_sensor, timestamp, content) tuples.
I've already implemented all the functionality up to Kafka, and now I've encountered a major flaw in the design. In the "Spark Streaming" part, the consolidated stream of messages is translated into a stream of events. The problem is that events are for the most part composite: they consist of multiple messages that occurred at the same time at different sensors.
I can't rely on "time of arrival to Kafka" as a means to detect simultaneity. So I have to somehow sort the messages in Kafka before extracting them with Spark, or, more precisely, make queries over Kafka messages.
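Whatever store ends up holding the messages, the grouping itself is straightforward once messages can be queried by timestamp; a sketch (the tuple layout follows the (address_of_sensor, timestamp, content) form above):

```python
from collections import defaultdict

def group_into_events(messages):
    """Group messages sharing a timestamp into composite events.
    messages: iterable of (address, timestamp, content) tuples.
    Returns a list of (timestamp, {address: content}) events,
    sorted by timestamp."""
    by_ts = defaultdict(dict)
    for address, ts, content in messages:
        by_ts[ts][address] = content
    return [(ts, by_ts[ts]) for ts in sorted(by_ts)]
```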
Maybe Cassandra is the right replacement for Kafka here? I have really simple data model, and only two possible types of queries to perform: query by address, and range query by timestamp. Maybe this is the right choice?
Does somebody have any numbers on Cassandra's throughput?
If you want to run queries on your time series, Cassandra may be the best fit: it is very write-optimized, and you can build 'wide' rows for your series. It is possible to make slices over your wide rows, so you can select a time range with only one query.
Kafka, on the other hand, can be considered a raw data flow: you don't have queries, only recently produced data. To collect data with the same key in the same partition, you have to select that key carefully. All data within the same partition is time-sorted.
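An in-memory analogue of such a wide-row slice (in Cassandra this would be a clustering-column range predicate, e.g. `WHERE address = ? AND ts >= ? AND ts <= ?`):

```python
from bisect import bisect_left, bisect_right

def slice_by_time(rows, start_ts, end_ts):
    """rows: (timestamp, value) pairs sorted by timestamp, as they
    would be stored within one partition. Returns the inclusive
    [start_ts, end_ts] range in one pass over the sorted data."""
    keys = [ts for ts, _ in rows]
    return rows[bisect_left(keys, start_ts):bisect_right(keys, end_ts)]
```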
A range query on timestamp is the classic use case for Cassandra. If you need address-based queries as well, you would have to make the address a clustering column. As far as Cassandra throughput is concerned, if you can invest in proper performance analysis of your Cassandra cluster, you can achieve very high write throughput. But I have used Spark SQL, the Cassandra driver, and the Spark Cassandra Connector, and they don't really give high query throughput until you have a big cluster with a high-CPU configuration; they do not work well with small datasets.
Kafka should not be used as a data source for queries; it's more of a commit log.