I am using Cassandra for a social app I am working on. I really like all Cassandra has to offer. I want to know if it is okay to store the username and password in Cassandra itself, or should I use a second database (MongoDB) for storing the username and password? Will eventual consistency cause problems when a user resets their password or changes their email? I am using the email as the primary key to look up user data. I am writing the backend in JavaScript on node.js.
Let me explain what I am trying to do. I want to add a password reset and a lock-out feature after, let's say, 5 wrong password attempts. The problem is: what happens if the user tries to log in before the third server has been updated, and the password is compared against the outdated data on that server? Wouldn't the user get locked out? Is the best course of action to store the username and password in a separate database such as MongoDB? Or is there another way to solve this issue?
You could store username/pwd in Cassandra - their basic data modeling page has a relevant example that may help your thinking: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling.
Eventual consistency could be an issue if you don't pay attention to how you read and write data (and your reads/writes happen near each other). It is possible to tune the cluster to behave like a 'normal' store (e.g. once I write the data, subsequent reads will get the last written value); however, the configuration below reduces some of the great things Cassandra brings to the table.
Data Consistency
You can configure a data consistency of ALL for your writes - meaning that you ensure that all nodes are up to date before your write completes. No eventual consistency issues on the writes.
A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition key.
However, this write takes longer and requires that enough nodes are up and available for you to get the guarantees you are asking for.
You can also specify read consistency of ALL which means:
Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
In this case, if you are writing data to three nodes and one of them is unable to respond your read will fail because Cassandra cannot provide the requested guarantee of consistency.
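For illustration, here is a minimal sketch using the Node.js cassandra-driver (matching the node.js backend mentioned in the question). The keyspace, table, and column names are assumptions, not taken from the question; the point is that consistency is set per statement:

const cassandra = require('cassandra-driver');
const { consistencies } = cassandra.types;

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'auth'            // assumed keyspace
});

async function resetPassword(email, newHash) {
  // The write is only acknowledged once every replica has it.
  await client.execute(
    'UPDATE users SET password_hash = ?, failed_attempts = 0 WHERE email = ?',
    [newHash, email],
    { prepare: true, consistency: consistencies.all }
  );
}

async function loadCredentials(email) {
  // The read fails unless every replica responds.
  const result = await client.execute(
    'SELECT password_hash, failed_attempts FROM users WHERE email = ?',
    [email],
    { prepare: true, consistency: consistencies.all }
  );
  return result.first();
}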
I should point out that these are the most strict configurations you can use and there are others that may work for your situation. The point I am trying to make is: as you ask Cassandra to behave like a more traditional store, you trade off some of the aspects of the cluster that are really appealing.
You need to consider the anticipated characteristics of your system (read volume, write volume, how important it is that those two stay in sync, etc.) when making this decision. IMO, while you probably could use Cassandra for this type of data, unless you are dealing with a huge volume of users it wouldn't be my first choice.
I have read the documents available on Microsoft's websites and elsewhere, but most of them talk about large data sets, whereas my requirement is quite small.
I am trying to save customer onboarding data. Before a customer onboards, we assign them a company Id, a user Id, an admin role, and a default environment. The company can create multiple dummy environments for testing, e.g. Dev1, Stage, Test123, etc., and onboarding is done at the environment level.
Onboarding JSON
{
"companyId": "Company123",
"environment": "stg1",
"userId": "User123",
"startDate": 1212121212,
"modifiedDate": 1212121212,
"uniqueId": "<companyId_UserId>"
}
Onboarding is done at the environment level. Per the data, a company can have at most 10 to 15 environments. In the above document, userId is just metadata recording which user started onboarding on environment stg1.
Initially I thought of using the company Id as the partition key, but in that case each logical partition will hold at most 15 records.
My Cosmos queries will filter on company Id and environment Id.
Is this a good approach? Or should I generate a synthetic partition key using a hash function and limit the number of logical partitions to 10 or 20?
Which one is faster?
- A large number of logical partitions, each containing only 10 to 15 documents
- A smaller number of logical partitions, each containing more documents
My complete data size is under 1 GB, so please don't assume that we will hit the 10 GB logical partition limit here.
My other question is:
With the Azure SDK, inserting a new document costs 7.67 RU, but an upsert costs 10.9 RU. Is there any way to reduce this?
If your collection is never going to go over 20GB then what you use as a partition key is not as critical because all of your data (and your queries) will reside on a single physical partition. Partition keys (and partitioning) are all about scale (which is why we always talk about them in the context of large amounts of data or high volume of operations).
In a read-heavy workload, choosing a partition key that is used in all of your query where clauses is a safe strategy; in your case a synthetic key of environmentId-companyId is a good choice. If this is a write-heavy workload, then you also want the partition key values to distribute writes across partitions. But again, if this is a small collection then this matters little here.
Your id property is fine, as it allows the same companyId_userId value to exist under different partition key values, which is what I assume you want. You can also do a point read with environmentId, companyId, and userId if you have all three, which you should prefer over queries as much as possible when looking up a single item. Even though this collection will not grow, based upon what you say, the partition strategy here should allow it to scale should you ever want it to.
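As a rough sketch with the @azure/cosmos Node.js SDK (the database/container names and the pk property are assumptions for illustration, not your actual schema), the write with a synthetic partition key and the point read would look something like:

const { CosmosClient } = require('@azure/cosmos');

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT,
  key: process.env.COSMOS_KEY
});
const container = client.database('onboarding-db').container('onboarding');

async function saveOnboarding() {
  // Synthetic partition key: environmentId-companyId; id: companyId_userId.
  await container.items.create({
    id: 'Company123_User123',
    pk: 'stg1-Company123',
    companyId: 'Company123',
    environment: 'stg1',
    userId: 'User123',
    startDate: 1212121212
  });
}

async function readOnboarding() {
  // Point read: id plus partition key value, cheaper than running a query.
  const { resource } = await container.item('Company123_User123', 'stg1-Company123').read();
  return resource;
}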
Upserts are always going to be more expensive than an insert because it's two operations rather than one. The only way to reduce the cost of writes is to create a custom index policy and exclude paths you never query on. But based upon the example document in your post, a custom index policy will not get you any improvement.
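If you did want to experiment with a leaner index policy anyway, a sketch (reusing the client from the snippet above; the included paths are just the properties your queries filter on) might look like this, though as noted it is unlikely to help with a document this small:

async function createLeanContainer() {
  const { database } = await client.databases.createIfNotExists({ id: 'onboarding-db' });
  await database.containers.createIfNotExists({
    id: 'onboarding',
    partitionKey: '/pk',
    indexingPolicy: {
      indexingMode: 'consistent',
      // Index only the properties used in query filters; exclude everything else.
      includedPaths: [
        { path: '/companyId/?' },
        { path: '/environment/?' }
      ],
      excludedPaths: [
        { path: '/*' }
      ]
    }
  });
}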
Hope this is helpful.
The logical partition limit is not 20 GB, as far as I'm aware. As far as I know from talks with the product group developing Cosmos DB, there is no harm in creating as many partitions as you need; just keep in mind that you should avoid cross-partition queries at all costs (so design the data in such a way that you never have to do cross-partition queries).
So a logical partition per customer makes sense, unless you want to run queries across all customers. But given the data set size it should not have a tremendous impact; either way, both approaches will work. I'd say creating a synthetic key is only needed when you cannot find a reasonable key without generating one.
I have a problem with Cassandra, as described below:
- The system has 4 nodes (DL80, 64 GB RAM, 4 SSDs).
- One table contains about 200k records and is updated in real time at about 200 records per second.
- The web app sometimes queries the full table to build a cache and hits timeout exceptions or tombstone warnings.
Can anyone guide me to solve this problem?
Many thanks.
So I read this:
One table contains about 200k records. This table is realtime update about: 200 record updated per second.
...and then this:
this table has one partition key value to hold all records in one node.
The main problem I see, is that you are storing too many rows in a single partition. Cassandra has a max of 2 billion cells per partition. I don't know how many columns you have, but even if you haven't hit that limit, I expect that queries to that partition would eventually get slower and slower. Especially since you're updating rows in-place.
This is also another red flag:
Web app sometime do query full table
Querying all rows in a table is something that Cassandra was just not designed to be good at. Supporting this query is probably why you put everything in a single partition, but there are problems with that approach, as you are finding out.
I don't know what your table looks like, but that is where you need to make some adjustments.
If you really do need to query all rows in a table, there are several other databases out there which do this better than Cassandra does.
Try not to update data in-place. As Cassandra has a log-based, append-only storage engine, you're not actually "updating" anything. Updates and inserts are synonymous, and simply write a new value for the key. The old data is obsoleted, and is still there until compaction runs.
The single partition key approach simply does not scale. If you're doing that, you might as well just use an RDBMS. If your data is time-based, then building a partition key with a "time bucket" would distribute better.
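As an illustration of the time-bucket idea (keyspace, table, and column names are assumptions, since the actual table definition isn't shown), using the Node.js driver:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1'
});

async function createBucketedTable() {
  // One partition per day instead of one partition for the whole table.
  await client.execute(`
    CREATE TABLE IF NOT EXISTS app.events_by_day (
      day        date,
      event_time timestamp,
      event_id   timeuuid,
      payload    text,
      PRIMARY KEY ((day), event_time, event_id)
    ) WITH CLUSTERING ORDER BY (event_time DESC, event_id DESC)
  `);
}

async function readDay(day) {
  // Reads now target a single day's partition instead of scanning everything.
  const result = await client.execute(
    'SELECT * FROM app.events_by_day WHERE day = ?',
    [cassandra.types.LocalDate.fromString(day)],
    { prepare: true }
  );
  return result.rows;
}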
Most problems with Cassandra come from bad data models (table definitions). It's not like Oracle where someone can "tune the database" by changing some config settings to make everything run better. There is no amount of config that can help a bad data model.
We are looking for an open-source in-memory database that supports indexes.
The use case is that we have a lot of items, and the data is going to grow in a big way.
Each item has a few fields on which we need to query.
Currently we store the data in the application's memory. However, with increasing data we have to think about distributing/sharding the DB.
We have looked at a few options:
Redis Cluster could be used, but it does not have the concept of indexes or SQL-like queries.
Apache Ignite is both in-memory and distributed, and provides SQL queries. However, the problem is that Ignite fires every query at all master nodes, so the final result will be no faster than the slowest of those queries. That seems like a problem, because a non-performing/slow node out of a number of nodes can really slow down the application a lot. Further, in Ignite reads are done from the masters and slaves are not used, so it is difficult to scale the queries. Increasing the number of nodes will have a negative impact, as the number of queries will increase and it will be even slower.
Cassandra - the in-memory option in Cassandra can be used, but it seems that the max size of a table per node is 1 GB. If our table is more than 1 GB, we will have to resort to partitioning, which will in turn lead Cassandra to make multiple queries (one per node), which is a problem (same as Ignite). We are also not sure whether reads on a Cassandra in-memory table can be scaled by increasing the number of slaves.
We are open to other solutions, but we are wondering whether the multi-query issue will be a problem everywhere (e.g. Hazelcast).
The ideal solution for our use case would be an in-memory database with indexes that could be read-scaled by increasing the number of slaves. Making it distributed/sharded will lead to multiple queries, and we are reluctant because one erring node could slow the whole system down.
Hazelcast supports indexes (sorted & unsorted) and, importantly, there is no multi-query problem with Hazelcast.
Hazelcast supports a PartitionPredicate that restricts the execution of a query to a node that is the primary replica of the key passed to the PartitionPredicate constructor. So if you know where the data resides, you can query just that node. There is nothing to fix or implement to support this; you can use it right away.
It's probably not reasonable to use it all the time. Depends on your use-case.
For complex queries that scan a lot of data but return small results it's better to use OBJECT inMemoryFormat. You should get excellent execution times and low latencies.
Disclaimer: I am GridGain employee and Apache Ignite committer.
Several comments on your concerns:
1) Slow nodes will lead to problems in virtually any clustered environment, so I would not consider this a disadvantage. This is a reality you should embrace and accept. What is necessary is to understand why a node is slow and fix/upgrade it.
2) Ignite is able to perform reads from slaves both for regular cache operations [1] and for SQL queries executed over REPLICATED caches. In fact, using a REPLICATED cache for reference data is one of the most important features allowing Ignite to scale smoothly.
3) As you correctly mentioned, currently a query is broadcast to all data nodes. We are going to improve this. First, we will let users specify the partitions to execute the query against [2]. Second, we are going to improve our optimizer so that it tries to calculate the target data nodes in advance to avoid the broadcast [3], [4]. Both improvements will be released very soon.
4) Last, but not least - the persistence layer will be released in several months [5], meaning that Ignite will become a distributed database with both in-memory and persistence capabilities.
[1] https://ignite.apache.org/releases/mobile/org/apache/ignite/configuration/CacheConfiguration.html#isReadFromBackup()
[2] https://issues.apache.org/jira/browse/IGNITE-4523
[3] https://issues.apache.org/jira/browse/IGNITE-4509
[4] https://issues.apache.org/jira/browse/IGNITE-4510
[5] http://apache-ignite-developers.2346864.n4.nabble.com/GridGain-Donates-Persistent-Distributed-Store-To-ASF-Apache-Ignite-tc16788.html
I can give opinions on Cassandra. The max size of your table per node is configurable and tunable, so it depends on the amount of memory that you are willing to pay for. Partitioning is built into Cassandra, so Cassandra basically manages it for you, and it's relatively simple to do. The first part of the primary key syntax is the partition key, and it determines on which node in the cluster the data lives.
But I guess you are already aware of this, since you mention multiple queries per node. I don't think there is a nice way around it.
Just one slight remark: there are no masters and slaves in Cassandra; every node is equal. The client asks any node in the cluster, that node becomes the coordinator, and since it gets the partition key it knows which node to ask for the data, which it then hands back to the client.
Other than that, I think you have read up on Cassandra enough (from what I can see in your question).
Basically it comes down to the access pattern: if you know how you are going to access your data, then Cassandra is the way to go. But other databases are also pretty decent.
Indexing with Cassandra usually hides some potential performance problems. People usually avoid it because in Cassandra the index has to be built for every record there is on the whole cluster, and it's maintained per node. This doesn't really scale. Basically you always have to do the query first, no matter how you put it, with Cassandra.
Plus, the in-memory option seems to be part of DSE Cassandra, not the open-source or community one. You have to take this into account as well.
I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them, because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but instead starts at a configured time, or better yet is triggered before the read operations start.
I do all my processing from node.js, so one option I see is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I didn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
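A minimal sketch of that pipeline from node.js (the collection names 'logs' and 'logs_report' and the createdAt field are assumptions), using a modern official mongodb driver:

const { MongoClient } = require('mongodb');

async function snapshotLastMonth() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('metrics');   // assumed database name

  const since = new Date();
  since.setMonth(since.getMonth() - 1);

  // $match selects last month's documents; $out writes them to a fresh
  // collection (replacing its previous contents) instead of returning them.
  await db.collection('logs').aggregate([
    { $match: { createdAt: { $gte: since } } },
    { $out: 'logs_report' }
  ]).toArray();

  // Reporting queries now run against db.collection('logs_report').
  await client.close();
}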
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
I'm using MongoDB as a read-only document source, used for computing statistics. Each document has no subdocuments, but the database has approximately ~900k documents and will grow by ~1k documents each day, added at a time when the database is idle.
So, I'd like to understand the following things:
I've read that MongoDB works best when the entire collection is stored in RAM. Assuming my database is ~400MB and our server can easily cram the whole thing into RAM, is there a way I can tell MongoDB to pre-load my entire collection into RAM?
I've also read that there are cases where creating replica sets will help with the read performance of the database. Is my scenario one of the cases where this will help?
I'm threading my statistical calculations, but I notice that the time to complete the queries I run against MongoDB triples when I thread them as opposed to running them synchronously. Is there anything I can do to improve the performance of the DB when I'm making requests against the same collection simultaneously?
No, MongoDB DOES NOT WORK BEST when the collection is in RAM. I have no idea who told you that, but it is a common misconception about how MongoDB works.
MongoDB works best when it can not only fit your working set into RAM (What does it mean to fit "working set" into RAM for MongoDB?) but also load it into RAM at significant speed. One thing that can help the speed of paging in your working set is the size of your documents.
This is one reason why MongoDB documents are limited to 16 MB: it has been found that larger sizes start to have a seriously detrimental performance impact. Basically, you spend too much time loading your data from disk; this is one reason for de-normalisation (logically splitting tables) in SQL techs: to make them faster to load.
This means you may have to optimise both the size of the value and the size of the field name to match performance needs for your reads. You will of course also have to match hardware.
Replica sets are not actually designed to help with read performance; they are designed to give your data high availability via automated failover. The topic you read suggests getting stale reads from secondaries. This, as has been proven (edit: since proven is a strong word and this is scenario-based I'm going to say "found") recently, can actually be less performant than using the primaryPreferred read preference.
As for improving performance, we would need stats from you on page faults, IO bottlenecks, and general mongostat and top output.
About Point 1:
You can use the touch command to persuade the database to load a collection into memory. But keep in mind that this isn't permanent. When you don't access the cached documents soon, they will get uncached in favor of more frequently-used documents.
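For example, in the mongo shell (the collection name is a placeholder; also note that touch only applies to the MMAPv1 storage engine and has been removed in recent MongoDB versions):

// Pull both the documents and the indexes of the collection into memory.
db.runCommand({ touch: "statsdata", data: true, index: true })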
About Points 2 and 3:
Replica-sets are a good way to improve the performance of parallel read operations. Each server of a replica-set mirrors the whole data and can respond to any query on its own without having to contact the other servers. That means when you double the number of servers in your replica-set, you also double the performance of simultaneous queries.
Keep in mind that the read preferences you set on your connection might prevent it from using more than one server.
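For instance, with a current Node.js driver you can ask for reads to go to secondaries when one is available (host names and the replica set name are placeholders):

const { MongoClient } = require('mongodb');

// secondaryPreferred sends reads to a secondary when one is available,
// falling back to the primary otherwise.
const client = new MongoClient(
  'mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0&readPreference=secondaryPreferred'
);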
Alternatively you can build a sharded cluster, but this is technically a lot more complex than a replica-set and won't improve read-performance much when your queries don't match the shard-key of the collection or when you selected your shard-key in a way that the requests aren't evenly distributed between the shards.