Cassandra replication for big data [closed] - cassandra

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
For now I need to use Cassandra in a master-slave replication topology with about 1,100 slaves, and I have some questions:
Are there any projects that use many slaves (about 1,100) with Cassandra, PostgreSQL, or Oracle?
Each slave needs to contain only a piece of the master's data (based on one property). In PostgreSQL, Oracle, etc., I could use "replication filters" for this. Is there an alternative in Cassandra?

Cassandra replaces the master-slave architecture with a peer-to-peer one. It distributes data across each node based on the partitioner used.
Are there any projects that use many slaves (about 1,100) with Cassandra, PostgreSQL, or Oracle?
Not slaves but peers; still, yes, there are projects with massive clusters. One of the more impressive C* clusters is described in the last comment of this jira.
Each slave needs to contain only a piece of the master's data (based on one property). In PostgreSQL, Oracle, etc., I could use "replication filters" for this. Is there an alternative in Cassandra?
Again, master-slave is replaced by peer-to-peer, so no. But consider this: if you write everything to a master and then replicate it out to slaves, isn't the master a single point of failure?
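To make the peer-to-peer placement concrete, here is a toy Python sketch of the idea (this is an illustration only, not Cassandra's actual Murmur3 partitioner or SimpleStrategy code; the key names are made up):

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Toy stand-in for Cassandra's Murmur3 partitioner: map a partition
    # key onto a signed 64-bit token (md5 is used here only for illustration).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**64 - 2**63

def replicas(key: str, ring: list, rf: int) -> list:
    """SimpleStrategy-style placement: the node owning the key's token
    range takes the first replica, and the next rf-1 nodes clockwise
    around the ring take the rest. No node is a master or a slave."""
    i = bisect.bisect_left(ring, token(key)) % len(ring)
    return [(i + n) % len(ring) for n in range(rf)]

# Five evenly spaced tokens, one per node.
ring = sorted((2**64 // 5) * i - 2**63 for i in range(5))
print(replicas("sensor-17", ring, rf=3))  # three consecutive node indexes
```

So instead of filtering what each slave receives, every node simply owns the token ranges the partitioner assigns to it, plus replicas of its neighbours' ranges.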

Related

I have more than 3k rows of data to retrieve from Cassandra using an API. I have indexing on it, but it's causing a connection reset issue [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 14 days ago.
I have more than 3,000 rows of data to retrieve from Cassandra through an API. I have an index on the table, but I still get connection reset errors.
Should I look at another database for this?
Is there a workaround in Cassandra?
Will providing a limit, or filtering between dates in the query, help?
(That would put a restriction on the API; is that standard practice?)
So there's a lot missing here that is needed to help diagnose what is going on. Specifically, it'd be great to see the underlying table definition and the actual CQL query that the API is trying to run.
Without that, I can say that to me, it sounds like the API is trying to aggregate the 3,000 rows from multiple partitions within a specific date range (and is probably using the ALLOW FILTERING directive to accomplish this). Most multi-partition queries will time out, simply because of all the extra network time introduced by polling each node in the cluster.
As with all queries in Cassandra, a table needs to be built to support a specific query. If it's not, this is generally what happens.
Will providing a limit, or filtering between dates in the query, help?
Yes, breaking this query up into smaller pieces will help. If you can look at the underlying table definition, that might give you a clue as to the right way to properly query the table. But in this case, making 10 queries for 300 rows probably has a higher chance for success than 1 query for 3000 rows.
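The chunking idea can be sketched in a few lines of Python. This assumes the table is partitioned or clustered by day; the CQL in the comment is hypothetical and should be replaced with the actual query:

```python
from datetime import date, timedelta

def date_buckets(start: date, end: date, days: int):
    """Split [start, end) into smaller ranges so each query touches
    far fewer rows than one big scan across the whole period."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield cur, nxt
        cur = nxt

# Instead of one query for a whole month, issue one per week:
for lo, hi in date_buckets(date(2023, 1, 1), date(2023, 2, 1), days=7):
    # Hypothetical driver call; substitute your real table and columns, e.g.:
    # session.execute("SELECT ... WHERE day >= %s AND day < %s", (lo, hi))
    print(lo, hi)
```

Each smaller range is far less likely to hit the coordinator timeout, and failed chunks can be retried individually.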

Cassandra cluster vs cassandra ring [closed]

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 1 year ago.
If I have one Cassandra cluster set up across 5 data centers (3 private DCs and 2 public Azure DCs), can I say I have 5 rings, or is this 1 cluster and 1 ring?
Can someone help me understand the term "ring" in this context?
Long answer:
Yes, cluster and ring can be used interchangeably. "Cluster" is certainly used more today. "Ring" comes from the early, pre-vNodes days of Cassandra, where each node was assigned a single, contiguous token range.
We used to have to manually configure the token range for each node. In fact, I would use this code to do it (assuming a 5 node cluster):
python3 -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(5)])'
['-9223372036854775808', '-5534023222112865485', '-1844674407370955162', '1844674407370955161', '5534023222112865484']
When computing token ranges of partition keys, each node was responsible for the next, sequential range. Once it got to the last node in the cluster, the range calculation looped back around to the first node, sort of making a ring-like diagram.
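That wrap-around can be shown in a few lines of Python (a toy model of the old single-token-per-node scheme, not Cassandra's actual code), reusing the five tokens computed above:

```python
import bisect

# The five evenly spaced tokens from the one-liner above.
ring = sorted((2**64 // 5) * i - 2**63 for i in range(5))

def node_for(token: int) -> int:
    """The first node whose token is >= the key's token owns it; a token
    past the last node wraps around to node 0 -- hence the 'ring'."""
    return bisect.bisect_left(ring, token) % len(ring)

print(node_for(-2**63))      # 0: the very first token
print(node_for(0))           # 3: falls in the fourth node's range
print(node_for(2**63 - 1))   # 0: wraps past the last token back to node 0
```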
In summary, now that each node holds multiple non-contiguous token ranges, Cassandra doesn't really form a "ring" anymore. To see what I mean, run nodetool ring against a Cassandra 3 node configured with multiple vNodes.
Short answer:
For all intents and purposes, "cluster" == "ring." You can say either, and people will know what you mean.

Compact key value store in Rust [closed]

Closed 2 years ago.
I'm working on a Rust project that collects daily statistics for a web-site (number of requests, number of unique users, average latency etc.). I'd like to store this data in a compact key-value store where the key is a date (or a date string) and the value is an object that contain the statistics. I also need this data to be persisted to a file.
I don't have any special performance or storage requirements. That's why I don't want to use major DBs like Redis, MongoDB or Cassandra that require a separate installation and significant resources to run. I'd like something much simpler and lightweight.
The ideal solution for me would be a library that can read and write key-value data and persist it into a file. The data size I'm aiming for is around 1000-2000 records.
Can you recommend a library I can use?
I can recommend PickleDB-rs; I think it answers most of your requirements. PickleDB-rs is a Rust version of Python's PickleDB. It's intended for small DBs (1,000-2,000 records should be fine), and the performance isn't guaranteed to match that of large-scale DBs, but for dumping daily web-site stats into a file it should be sufficient.
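If it helps to see the shape of the pattern itself, here is a minimal sketch of a file-backed, date-keyed store using only Python's standard library (this illustrates the general idea, not PickleDB's actual API; all names are made up):

```python
import json
from pathlib import Path

class StatsStore:
    """Minimal file-backed key-value store: date string -> stats dict.
    Rewriting the whole file on each save is fine for ~1000-2000 records."""
    def __init__(self, path: str):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def put(self, day: str, stats: dict) -> None:
        self.data[day] = stats
        self.path.write_text(json.dumps(self.data))  # persist immediately

    def get(self, day: str):
        return self.data.get(day)

store = StatsStore("stats.json")
store.put("2023-01-15", {"requests": 1042, "unique_users": 311, "avg_latency_ms": 87.5})
print(store.get("2023-01-15")["requests"])  # 1042
```

At this data size, the choice is mostly about API convenience rather than performance, which is why a small library (or even a hand-rolled store like this) is enough.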

Apache Storm vs Apache Samza vs Apache Spark [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have worked with Storm and Spark, but Samza is quite new to me.
I do not understand why Samza was introduced when Storm already exists for real-time processing. Spark provides in-memory, near-real-time processing and has other very useful components such as GraphX and MLlib.
What improvements does Samza bring, and what further improvements are possible?
This is a good summary of the differences and pros and cons.
I would just add that Samza, which actually isn't that new, brings a certain simplicity since it is opinionated on the use of Kafka as its backend, while others try to be more generic at the cost of simplicity. Samza is pioneered by the same people who created Kafka, who are also the same people behind the Kappa Architecture--primarily Jay Kreps formerly of LinkedIn. That's pretty cool.
Also, the programming models are totally different between realtime streams with Samza, microbatches in Spark Streaming (which isn't exactly the same as Spark), and spouts and bolts with tuples in Storm.
None of these are "better." It all depends on your use cases, the strengths of your team, how the APIs match up with your mental models, quality of support, etc.
You also forgot Apache Flink and Twitter's Heron, which they made because Storm started to fail them. Then again, very few need to operate at the scale of Twitter.

How to operate the transaction of cassandra? [closed]

Closed 9 years ago.
In my project I use Spring, but Cassandra does not support transactions. How can I get transaction-like behaviour with Cassandra in the service layer?
You can log every operation you carry out, store the log in a file of some sort, and when you want to undo an operation, issue a query that does the opposite of what you just did.
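That compensation-log idea can be sketched like this (a toy pattern with made-up names; the dict stands in for Cassandra writes, and note this gives rollback-like behaviour, not real ACID isolation, since other readers can observe partial state):

```python
class CompensationLog:
    """Record an undo action alongside every write; on failure, replay
    the undo actions in reverse order to compensate."""
    def __init__(self):
        self.undo_stack = []

    def do(self, action, undo):
        action()                      # perform the write
        self.undo_stack.append(undo)  # remember how to reverse it

    def rollback(self):
        while self.undo_stack:
            self.undo_stack.pop()()   # run undos newest-first

db = {}  # stand-in for the real Cassandra table
log = CompensationLog()
try:
    log.do(lambda: db.update(a=1), lambda: db.pop("a", None))
    log.do(lambda: db.update(b=2), lambda: db.pop("b", None))
    raise RuntimeError("third write failed")
except RuntimeError:
    log.rollback()
print(db)  # {} -- both earlier writes were undone
```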
You need to think differently in NoSQL. Read Building on Quicksand http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf. If using Cassandra, you may want to check out PlayOrm as well, along with the NoSQL patterns page.
