If I have one Cassandra cluster set up across 5 data centers (3 private DCs and 2 public Azure DCs), can I say I have 5 rings, or is this 1 cluster and 1 ring?
Can someone help me understand the term "ring" in this context?
Long answer:
Yes, cluster and ring can be used interchangeably. "Cluster" is certainly used more today. "Ring" comes from the early, pre-vNodes days of Cassandra, where each node was assigned a single, contiguous token range.
We used to have to manually configure the token range for each node. In fact, I would use this one-liner to do it (assuming a 5-node cluster):
    python3 -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(5)])'
    ['-9223372036854775808', '-5534023222112865485', '-1844674407370955162', '1844674407370955161', '5534023222112865484']
When partition keys were hashed into tokens, each node was responsible for the next sequential token range. Once the calculation reached the last node in the cluster, it wrapped back around to the first node, which is what made the ring-like diagram.
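To make that wrap-around concrete, here is a minimal sketch in plain Python (not any driver API) of how a key's token mapped to its owning node under the old single-token scheme:

    import bisect

    # One token per node, as computed by the one-liner above;
    # Murmur3 tokens span -2**63 .. 2**63 - 1.
    tokens = [((2**64 // 5) * i) - 2**63 for i in range(5)]

    def owning_node(key_token):
        # In the single-token scheme, a node owns the range from the
        # previous node's token (exclusive) up to its own token
        # (inclusive). A key token beyond the highest node token wraps
        # around to the first node -- that wrap is the "ring".
        i = bisect.bisect_left(tokens, key_token)
        return i % len(tokens)

    print(owning_node(0))            # lands on node 3
    print(owning_node(2**63 - 1))    # past the last token, wraps to node 0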
In summary, now that each node holds multiple non-contiguous token ranges, Cassandra doesn't really form a "ring" anymore. To see this for yourself, run nodetool ring on a Cassandra 3 node configured with multiple vNodes.
Short answer:
For all intents and purposes, "cluster" == "ring." You can say either, and people will know what you mean.
I looked into the Azure SQL SLA and couldn't find this listed explicitly:
https://azure.microsoft.com/en-us/support/legal/sla/sql-database/v1_1/
In an Azure SQL failover group, data is asynchronously replicated from the primary to the secondary.
What is the expected lag between the two, as seen in practice?
What is the worst-case lag?
You should monitor the lag against the recovery point objective (RPO), which is 5 seconds for auto-failover groups. The RPO is the window of recent updates that you can afford to lose.
Sometimes replication_lag_sec on the primary database has a NULL value, which means the primary does not currently know how far behind the secondary is. This typically happens after process restarts and should be a transient condition. Consider alerting if replication_lag_sec returns NULL for an extended period of time; that would indicate the secondary database cannot communicate with the primary due to a permanent connectivity failure.

There are also conditions that can cause the difference between the last_commit time on the secondary and on the primary database to become large. For example, if a commit is made on the primary after a long period of no changes, the difference will jump to a large value before quickly returning to 0. Consider it an error condition when the difference between these two values remains large for a long time.
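As a minimal monitoring sketch (Python with pyodbc; the connection string and alerting are placeholders, and the query runs against the primary database):

    import pyodbc

    # Hypothetical connection string to the PRIMARY database.
    CONN_STR = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:your-server.database.windows.net,1433;"
        "Database=your-db;Uid=your-user;Pwd=your-password;Encrypt=yes;"
    )

    RPO_SECONDS = 5  # the documented RPO for auto-failover groups

    def check_replication_lag():
        with pyodbc.connect(CONN_STR) as conn:
            row = conn.cursor().execute(
                "SELECT replication_lag_sec, last_replication "
                "FROM sys.dm_geo_replication_link_status"
            ).fetchone()
        if row is None:
            print("no geo-replication link found")
        elif row.replication_lag_sec is None:
            # NULL lag: the primary can't currently see the secondary.
            # Transient after restarts; alert if it persists.
            print("lag unknown")
        elif row.replication_lag_sec > RPO_SECONDS:
            print(f"lag {row.replication_lag_sec}s exceeds the {RPO_SECONDS}s RPO")
        else:
            print(f"lag OK: {row.replication_lag_sec}s")

    check_replication_lag()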
I'm working on a Rust project that collects daily statistics for a website (number of requests, number of unique users, average latency, etc.). I'd like to store this data in a compact key-value store where the key is a date (or a date string) and the value is an object that contains the statistics. I also need this data to be persisted to a file.
I don't have any special performance or storage requirements, which is why I don't want to use major DBs like Redis, MongoDB, or Cassandra that require a separate installation and significant resources to run. I'd like something much simpler and more lightweight.
The ideal solution for me would be a library that can read and write key-value data and persist it to a file. The data size I'm aiming for is around 1000-2000 records.
Can you recommend a library I can use?
I can recommend PickleDB-rs; I think it answers most of your requirements. PickleDB-rs is a Rust version of Python's PickleDB. It's intended for small DBs (1000-2000 records should be fine), and the performance isn't guaranteed to match large-scale DBs, but for dumping daily website stats into a file it should be sufficient.
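A minimal usage sketch, assuming the pickledb crate plus serde as dependencies; the struct, field names, and file name are illustrative:

    use pickledb::{PickleDb, PickleDbDumpPolicy, SerializationMethod};
    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize, Debug)]
    struct DailyStats {
        requests: u64,
        unique_users: u64,
        avg_latency_ms: f64,
    }

    fn main() {
        // AutoDump persists every write straight to the backing file.
        let mut db = PickleDb::new(
            "stats.db",
            PickleDbDumpPolicy::AutoDump,
            SerializationMethod::Json,
        );

        let stats = DailyStats {
            requests: 1234,
            unique_users: 98,
            avg_latency_ms: 42.5,
        };
        db.set("2019-06-01", &stats).unwrap();

        // Values are deserialized back by key on demand.
        if let Some(s) = db.get::<DailyStats>("2019-06-01") {
            println!("{:?}", s);
        }
    }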
I am using Spark SQL to do some analysis.
I'm wondering whether there are any front-end projects that can be used to view the results? I mean the analysis results, not the job success/failure status.
For example, Grafana, Kibana, etc.
If you mean visualization of your results (like the tools you've mentioned), you might be interested in Apache Zeppelin. It's similar to an IPython Notebook: you can write your code there and visualize the results.
Otherwise, you'd have to tell us your storage format and where you are storing your results; maybe there are visualization tools for that.
If you store the results of your Spark jobs in Elasticsearch, you can use Kibana with it.
Otherwise, I don't think there is anything. The difference between the tools you are referring to (OpenTSDB and Elasticsearch) and Spark is that the latter is not a datastore.
Assume we have a 100 GB file and my system has 60 GB of memory. How will Apache Spark handle this data?
We all know Spark performs partitioning on its own based on the cluster, but when there is less memory than data, I want to know how Spark handles it.
In short: Spark does not require the full dataset to fit in memory at once. However, some operations may demand that an entire partition of the dataset fit in memory. Note that Spark allows you to control the number of partitions (and, consequently, their size).
See this topic for the details.
It is also worth noting that Java objects usually take more space than the raw data, so you may want to look at this.
I would also recommend looking at Apache Spark: Memory Management and Graceful Degradation.
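As a minimal illustration of that partition control (PySpark; the file path and partition count are made up for this sketch):

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-demo")

    # Ask for at least 800 partitions so a 100 GB file yields roughly
    # 128 MB partitions; executors process a few partitions at a time
    # and can spill to disk, so the whole file never needs to fit in
    # memory at once.
    rdd = sc.textFile("hdfs:///data/huge-100gb-file", minPartitions=800)
    print(rdd.getNumPartitions())

    # count() streams each partition through memory one task at a time
    # per core, which works even when the data exceeds total RAM.
    print(rdd.count())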
For now I need to use a Cassandra replication (master-slave) topology, where the slave count is about 1100. And I have some questions:
Are there any projects that use many slaves (about 1100) with Cassandra, PostgreSQL, or Oracle?
Each slave needs to contain only a piece of all the data from the master (based on one property). In PostgreSQL, Oracle, etc., I could use "replication filters" for this. Is there an alternative in Cassandra?
Cassandra replaces the master-slave architecture with a peer-to-peer one; it distributes data across the nodes based on the partitioner in use.
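As a minimal sketch of what that looks like in practice (Python with the DataStax driver; the contact point, keyspace name, and data-center names are illustrative), replication is declared per keyspace and any node accepts writes:

    from cassandra.cluster import Cluster

    # Any node can act as a coordinator; there is no master to target.
    session = Cluster(["127.0.0.1"]).connect()

    # Replication is configured per keyspace: here, 3 replicas in each
    # of two hypothetical data centers. The partitioner then decides
    # which peers own each partition key.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS stats
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'dc1': 3,
            'dc2': 3
        }
    """)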
Are there any projects that use many slaves (about 1100) with Cassandra, PostgreSQL, or Oracle?
Not slaves but peers; still, yes, there are some projects with massive clusters. One of the more impressive C* clusters is described in the last comment of this JIRA.
Each slave needs to contain only a piece of all the data from the master (based on one property). In PostgreSQL, Oracle, etc., I could use "replication filters" for this. Is there an alternative in Cassandra?
Again, master-slave is replaced by peer-to-peer, so no. But if you want to write to a master and then replicate across to slaves, isn't that master a single point of failure?