Cassandra reducing performance when enabling authorization

I have a 6-node Cassandra cluster and I want to enable authorization/authentication on it, but I have read a few comments from people who administer Cassandra saying that enabling authorization reduces performance. Is that really so?
Who has experienced this, and how can it be avoided?

Just my experience here, and it is not meant to discount the experience of others. Since 2012, I have personally built over 200 Apache Cassandra clusters on infra ranging from bare metal, to K8s, to the public clouds; spanning environments from Dev, Stage, Test, and (of course) Production.
Every single one of those clusters (even Dev) had Authorization and Authentication enabled. Some of them also had SSL enabled.
My team was also occasionally asked to assume management of clusters run directly by an application team. Some of those did not have auth enabled. Thus verifying/enabling auth was one of the first tasks that we performed. Latency incurred by activating authentication was often a voiced concern.
That being said, at no point was enabling Cassandra's native auth deemed to be disruptive. In fact, one of the prod clusters with both auth & SSL enabled would routinely post a P95 read latency of less than 5ms, while supporting throughput of up to 250k ops/sec.
In fact, the only time it was ever an issue was when we integrated a few clusters with a 3rd party plugin for LDAP. But Cassandra's own Authentication and Authorization never posed a noticeable issue.
If you find that enabling auth does cause latency, the main tunable in the cassandra.yaml is credentials_validity_in_ms. It defaults to 2000ms (2 seconds) and controls how long a long-running connection's cached credentials remain valid before they are refreshed. I've heard of some folks setting that as high as 3 hours (which I think is too high). But if it becomes problematic, increasing that setting should help.
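If it helps, here is a minimal cassandra.yaml sketch of the settings involved (the 10000 ms values are only an illustrative bump from the 2000 ms defaults):

```yaml
# Switch from the default AllowAll* (no auth) to Cassandra's native auth.
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

# How long cached credentials/permissions/roles stay valid before a refresh.
# Raising these reduces auth-related lookups at the cost of slower revocation.
credentials_validity_in_ms: 10000
permissions_validity_in_ms: 10000
roles_validity_in_ms: 10000
```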

Extending Cassandra cluster with datacenter in China (CGF)

I need to extend my cluster with a new datacenter to be present in mainland China, behind the Great Firewall. Currently I have datacenters in the US and Europe, so the cluster already matches the requirements of the geographical-location scenario.
At this point I have the Chinese infrastructure ready for Cassandra, but the network statistics from the past few days are a bit troublesome and I am worried about if and how this could affect my current cluster, and whether the new datacenter will be functional at all.
My actual questions regarding this are:
How does Cassandra handle heavy packet loss during replication? (occasionally up to 40%)
How does it affect the cluster when the network connection between two datacenters is really bad (only a few kilobits/sec, with latency as above) for hours?
Will the Chinese DC be considered dead, or will Cassandra still try to use the limited bandwidth?
Can this cause any problems in the non-Chinese datacenters? E.g. they slow down, which results in client request timeouts.
Is it possible to somehow enforce that only one of my non-Chinese datacenters communicates with the Chinese one, or should I trust that Cassandra will handle this? (I'm trying to avoid possibly harming all my datacenters.)
Is there any way to speed up the initial data replication (nodetool rebuild)? At the current speed it would take weeks to replicate our existing data.
Any suggestion or remark is welcome, thanks!
How does Cassandra handle heavy packet loss during replication? (occasionally up to 40%)
Usually packet loss will cause a large number of read repairs. In some cases it can cause requests to fail, depending on replication factor and consistency level. Also, be prepared for very costly repairs which will create a lot of tiny SSTables and a substantial amount of IO.
I would suggest running a test in a development environment to see the actual behavior in your system. There are plenty of tools to simulate a bad network; see the example below.
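For instance (the interface name and the numbers are placeholders), Linux tc/netem can roughly emulate the link you describe on a test node:

```sh
# Add ~300 ms of delay, 40% packet loss, and a 512 kbit/s rate cap to eth0.
tc qdisc add dev eth0 root netem delay 300ms loss 40% rate 512kbit

# Remove the emulation when the test is done.
tc qdisc del dev eth0 root netem
```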
How does it affect the cluster when the network connection between two datacenters is really bad (only a few kilobits/sec, with latency as above) for hours? Will the Chinese DC be considered dead, or will Cassandra still try to use the limited bandwidth? Can this cause any problems in the non-Chinese datacenters?
It largely depends on how bad the connection is and what consistency level/replication factor you are running with. In some cases it will just cause rather high latency between datacenters. However, if the connection is bad enough that nodes start marking each other as down, then you are looking at issues in all datacenters. Your existing datacenters will struggle with performance caused by requests timing out. This will in turn cause requests to be held longer in memory, which can lead to GC pressure. (It can cause a number of other issues in your other datacenters as well.)
The threshold for how sensitive the failure detector is can be adjusted and fine-tuned to suit your use case. phi_convict_threshold is a setting that can decrease the likelihood of a node being marked as down; you can find more about it in the Cassandra documentation. If you find that sweet spot where your nodes are not marked down due to being unresponsive, you can have Cassandra leverage what little it has to work with.
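A minimal cassandra.yaml sketch; the value shown is illustrative, the default is 8, and going much above 12 is generally discouraged:

```yaml
# Raise the failure-detector threshold so brief cross-DC network trouble is
# less likely to get a remote node marked as down (default: 8).
phi_convict_threshold: 12
```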
Is it possible to somehow enforce that only one of my non-Chinese datacenters communicates with the Chinese one, or should I trust that Cassandra will handle this? (I'm trying to avoid possibly harming all my datacenters.)
There is not really a way to tell Cassandra to limit which datacenters it speaks to. You are kind of stuck with communicating between the datacenters you include in your replication factor.
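For illustration (the keyspace and datacenter names are placeholders), replication is declared per keyspace, and every datacenter listed there receives writes and therefore cross-DC traffic for that keyspace:

```sql
ALTER KEYSPACE my_app
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'US_DC': 3,
    'EU_DC': 3,
    'CN_DC': 3
  };
```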
Is there any way to speed up the initial data replication (nodetool rebuild)? At the current speed it would take weeks to replicate our existing data.
I would recommend against using sstableloader, as it functions very similarly to rebuild and requires a snapshot to operate. If the network is what is causing the slow speed, then changing the way of streaming is not going to make much difference.
In my opinion, the first thing to do would be to measure where the bottleneck is for your system. If the slow network really is the bottleneck, one could add more nodes to stream from more sources at the same time, but ultimately you will still be hampered by the slow network connection.
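If you do proceed with nodetool rebuild, two knobs worth knowing are the streaming source datacenter and the per-node streaming throughput cap (the DC name and throughput value below are placeholders):

```sh
# Run on each new node in the Chinese DC; streams only from the named existing DC.
nodetool rebuild -- US_DC

# Per-node streaming throughput cap in Mb/s (0 disables the cap).
nodetool setstreamthroughput 400
```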

ThingWorx Horizontal Scalability

What architecture and application development best practices must be followed in order to scale a TWX application?
The majority of applications start with a few devices, but over time they quickly build up to thousands of devices. Once the amount of traffic is too much for one TWX instance, what strategy should be followed?
The same question applies when the front end is overwhelmed by the number of users.
Anytime I have had ThingWorx architecture concerns, I have been redirected to the PTC ThingWorx guide linked below. I do not believe you need a PTC account to view it, but if you do, an account is free.
ThingWorx 8 High Availability Administrators Guide
http://support.ptc.com/WCMS/files/173281/en/ThingWorx_8_High_Availability_Administrators_Guide.pdf
In your case, where you have big load concerns, the guide recommends using two ThingWorx instances to handle the load.
At least two ThingWorx instances are required for HA configuration. A
single instance is started, which becomes leader and fully connects to
the database. Standby servers boot up and can become the leader if
needed, but they do not fully connect to the database or load
information like the leader does. All ThingWorx servers have a service
that is called by the load balancer, which indicates their
availability. Different codes identify the leader, which receives
traffic, and standby nodes, which do not receive traffic but may
become leader.
A high-level architecture example can be found in the referenced guide.
The Load Balancer determines which ThingWorx instance is to be used by the user. Usually it is used to determine which is available in a redundant architecture (which is what makes it Highly Available). However, it can also be used to determine which to use based on performance. In PTC's HA Admin Guide, they use HAProxy (see page 47) as the Load Balancer. See Section 3.2 of the HAProxy Config Doc for how to configure based on performance.
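As a rough illustration only (the server names, addresses, and health-check path are placeholders, not taken from the PTC guide), an HAProxy backend that prefers the least-loaded instance and keeps a standby out of rotation might look like this:

```
frontend thingworx_front
    bind *:8080
    default_backend thingworx_platform

backend thingworx_platform
    balance leastconn                 # send new connections to the least-loaded instance
    option httpchk GET /health        # placeholder; point at the ThingWorx availability service
    server twx_leader  10.0.0.11:8080 check
    server twx_standby 10.0.0.12:8080 check backup   # only receives traffic if the leader is down
```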
Hope this helps! It is a pretty open-ended topic
With the ThingWorx 9.0 release, the ThingWorx Foundation platform supports true horizontal scalability with an active-active clustering setup, providing no single point of failure. The installation documentation provides the details about the install and setup.
There is also a ThingWorx 9.0 deployment architecture guide for an overview of all the architectural details.
(See the ThingWorx High Availability clustering setup diagram in the deployment architecture guide.)

In-memory caching in Azure function

There is a need to cache objects to improve the performance of my Azure Function. I tried .NET ObjectCache (System.Runtime.Caching) and it worked well in my testing (tested with up to a 10-minute cache retention period).
In order to take this solution forward, I have a few quick questions:
What is the recycling policy of an Azure Function? Is there a default? Can it be configured?
What are the cost implications?
Is my approach right, or are there better solutions?
If you know the answer to any of these questions, please help.
Thank you.
Javed,
An out-of-process solution such as Redis (or even using Table storage, depending on the workload) would be recommended.
As a rule of thumb, functions should be stateless, particularly if you're running in the dynamic runtime, where scaling operations (up and down) could happen at any time and your host is not guaranteed to stay up.
If you opt to use the classic hosting, you do have a little more flexibility, as you can enable the "always on" feature, but I'd still recommend the out-of-process approach. Running in the classic mode does have a cost implication as well, since you're no longer taking advantage of the consumption based billing model offered by the dynamic hosting.
I hope this helps!
If you just need a smallish key-value cache, you could use the file system. D:\HOME (also found in the environment variable %HOME%) is shared across all instances. I'm not sure if the capacities are any different for Azure Functions, but for Sites and WebJobs, Free and Shared sites get 1GB of space, Basic sites get 10GB, and Standard sites get 50GB.
Alternatively, you could try running .NET ObjectCache in production. It may survive multiple calls to the same instance (file system or static in-memory property). Note that this will not be shared across instances, though, so only use it as a best-effort cache.
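A minimal sketch of that per-instance, best-effort pattern with System.Runtime.Caching (the class name, key handling, and TTL are illustrative, not a prescribed API):

```csharp
using System;
using System.Runtime.Caching;

public static class BestEffortCache
{
    // MemoryCache.Default is static, so it survives repeated invocations that land on
    // the same warm instance; it is NOT shared across instances and disappears on recycle.
    private static readonly ObjectCache Cache = MemoryCache.Default;

    public static T GetOrAdd<T>(string key, Func<T> load, TimeSpan ttl) where T : class
    {
        if (Cache.Get(key) is T cached)
        {
            return cached;
        }

        T value = load();                                  // fall back to the real data source
        Cache.Set(key, value, DateTimeOffset.UtcNow.Add(ttl));
        return value;
    }
}
```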
Note that both of these approaches pose problems for multi-tenant products, as they could be an avenue for unintended cross-tenant data sharing or even more malicious activities like cache poisoning. You'd want to implement authorization controls for these things just as if they came from a database.
As others have suggested, Functions ideally should be stateless and an out of process solution is probably best. I use DocumentDB because it has time-to-live functionality which is ideal for a cache. Redis is likely to be more performant especially if you don't need persistence across stop/restart.

New Azure SQL Database Services, how scalable and what are DTUs

The new Azure SQL Database services look good. However, I am trying to work out how scalable they really are.
So, for example, assume a 200 concurrent user system.
For Standard
Workgroup and cloud applications with "multiple" concurrent transactions
For Premium
Mission-critical, high transactional volume with "many" concurrent users
What does "Multiple" and "Many" mean?
Also Standard/S1 offers 15 DTUs while Standard/S2 offers 50 DTUs. What does this mean?
Going back to my 200 user example, what option should I be going for?
Azure SQL Database Link
Thanks
EDIT
Useful page on definitions
However what is "max sessions"? Is this the number of concurrent connections?
There are some great MSDN articles on Azure SQL Database; this one in particular is a great starting point for DTUs: http://msdn.microsoft.com/en-us/library/azure/dn741336.aspx and http://channel9.msdn.com/Series/Windows-Azure-Storage-SQL-Database-Tutorials/Scott-Klein-Video-02
In short, it's a way to understand the resources powering each performance level. One of the things we know from talking with Azure SQL Database customers is that they are a varied group. Some are most comfortable with the absolute details - cores, memory, IOPS - and others are after a much more summarized level of information. There is no one-size-fits-all. DTU is meant for the latter group.
Regardless, one of the benefits of the cloud is that it's easy to start with one service tier and performance level and iterate. In Azure SQL Database specifically, you can change the performance level while your application is up. During the change there is typically less than a second of elapsed time when DB connections are dropped. The internal workflow in our service for moving a DB from one service tier/performance level to another follows the same pattern as the workflow for failing over nodes in our data centers. And nodes failing over happens all the time, independent of service tier changes. In other words, you shouldn't notice any difference in this regard relative to your past experience.
If DTU's aren't your thing, we also have a more detailed benchmark workload that may appeal. http://msdn.microsoft.com/en-us/library/azure/dn741327.aspx
Thanks Guy
It is really hard to tell without doing a test. By 200 users I assume you mean 200 people sitting at their computer at the same time doing stuff, not 200 users who log on twice a day. S2 allows 49 transactions per second which sounds about right, but you need to test. Also doing a lot of caching can't hurt.
Check out the new Elastic DB offering (Preview) announced at Build today. The pricing page has been updated with Elastic DB price information.
DTUs are based on a blended measure of CPU, memory, reads, and writes. As DTUs increase, the power offered by the performance level increases. Azure has different limits on the concurrent connections, memory, IO and CPU usage. Which tier one has to pick really depends upon
#concurrent users
Log rate
IO rate
CPU usage
Database size
For example, if you are designing a system where many users are reading and there are only a few writers, and if your application middle tier can cache the data as much as possible so that only selective queries or an application restart hit the database, then you may not need to worry too much about IO and CPU usage.
If many users are hitting the database at the same time, you may hit the concurrent connection limit and requests will be throttled. If you can control user requests coming to the database in your application then this shouldn't be a problem.
Log rate: depends upon the volume of data changes (including additional data being pumped into the system). I have seen applications steadily pumping data vs. data being pumped all at once. Selecting the right DTU again depends upon whether you can throttle at the application end and achieve a steady rate.
Database size: Basic, Standard, and Premium have different maximum allowed sizes, and this is another deciding factor. Using features such as table compression helps reduce the total size, and hence the total IO.
Memory: tuning the expensive queries (joins, sorts, etc.) and enabling lock escalation / nolock scans help control memory usage.
A very common mistake people make with database systems is scaling up their database instead of tuning the queries and application logic. So testing and monitoring the resources/queries under different DTU limits is the best way of dealing with this.
If you choose the wrong DTU, don't worry: you can always scale up/down in SQL DB, and it is a completely online operation.
Also, unless there is a strong reason not to, migrate to V12 to get even better performance and features.
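For illustration, changing the performance level is a single online statement (the database name and target tier below are placeholders):

```sql
-- Move the hypothetical database to the Standard S2 performance level.
ALTER DATABASE [MyAppDb]
MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S2');
```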

Sudden Scaling of Simple Node.js App

My website is written in Node.js, has no database or external dependencies, but does have a lot of large media files (images and some video) totalling some 2 GB. The structure of the website is drawn from a couple of simple JSON files.
My problem is drastic and sudden scaling. Traffic to my site is usually easily handled by any small VPS instance, but occasionally it can reach hundreds of times its normal level for short periods. My problem is how to scale quickly, without downtime, and automatically. I know there are issues with autoscaling, but perhaps lacking a database will negate some of that.
What sort of scaling issues and options should I be looking at?
(For context, I am currently using a Digital Ocean VPS, but I can't find a clean way to scale it with no downtime. I am not wedded to my provider.)
Scalability is important, but scaling when you need to is also important. Not all of us have the scaling needs of Facebook or Twitter :) This might just be a case of resource management.
Test the problem
Without a database, and using NodeJS, one of Node's strengths is the number of concurrent connections it can handle. For simple IO load, it would seem you have picked a good framework. And since your problem is a particular resource being bombarded, run some load testing on your server. Popular and free tools include the following (a sample command follows the list):
Apache Bench
httperf
OpenLoad
And there are paid services like NeoLoad, LoadImpact (which is free at small levels), forecastweb, E-Load, etc.
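For instance, a quick Apache Bench run that hammers one URL with 200 concurrent clients might look like this (the URL and counts are placeholders):

```sh
# 10,000 requests total, 200 at a time, against one of the heavy assets.
ab -n 10000 -c 200 http://your-site.example/media/large-image.jpg
```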
With those results, Determine the Cause
Is it the size of the file being served? Is it the number of concurrent requests? What resources are being used, or maxed out, during a slowdown (ram, ports, file system, some other IO, CPU, bandwidth, etc...)?
Have a look at this question, which defines a few concepts for server load. To implement a solution, you will need to determine the cause of the slowdown. Is it: 1) some queues filling up? 2) a problem with TCP connections and ports? 3) too-slow allocation of resources? That will help shape your solution.
Plan for scaling.
The type of scaling your project needs may be only a portion of what another project would need. If you know the root cause in this case, it will increase your options.
Is the problem bandwidth? Perhaps using your web server as a router to multiple cloud instances of file serving would effectively increase the bandwidth your users see. Even just storing your files on a larger cloud that can guarantee the bandwidth you may need.
Is the problem CPU, RAM, etc.? You may need multiple instances of the same web app (or an increased allotment for your VPS). This is the "Elastic" portion of Amazon's Elastic Compute Cloud (EC2), and other models like it. Create a "golden image" and duplicate it when you see traffic start spiking, using built-in monitoring tools, turning it off when the rush is done. This can be programmatic or simply manual.
Is the problem concurrent requests? The bottleneck should not be NodeJS, up to thousands of concurrent requests anyway. Perhaps just check your implementation to ensure there is not a slowdown of the single Node thread. Maybe Node clustering or some worker threads would alleviate the bottleneck enough for your purposes.
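A minimal sketch of the built-in cluster module, which forks one worker per CPU core so requests are not bottlenecked on a single Node thread (./server is a placeholder for your existing HTTP server module):

```js
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU core and replace any worker that dies.
  os.cpus().forEach(() => cluster.fork());
  cluster.on('exit', () => cluster.fork());
} else {
  require('./server'); // each worker runs the same HTTP server on a shared port
}
```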
Last Note: For serving static files, I've heard nginx or even Apache Tomcat are a little better suited than NodeJS. Depending on your web app's complexity, you might be able to switch or benchmark fairly easily.
In case anyone is reading this rather specific question years later, I have gained some perspective on it. As Clay says, the ultimate answer is to spin up more servers, either manually or programmatically based on load.
However, in my case that would be massive overkill - I'm not running Twitter. The problem was a relatively simple mistake in architecture. My app was reading the JSON data files from disk with every page request, and the disk I/O was getting saturated. I changed to loading the data files into memory on startup, and reloading them when they change using fs.watch().
My modest VPS can now easily handle the sorts of traffic that would previously crash it. I've never seen traffic that would make me want to up-size it.
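For anyone curious, the fix was roughly this pattern (the file path is a placeholder for your own JSON data files):

```js
const fs = require('fs');

const DATA_FILE = './data/site.json'; // placeholder path

// Load once at startup so request handlers never touch the disk.
let siteData = JSON.parse(fs.readFileSync(DATA_FILE, 'utf8'));

// Reload the in-memory copy whenever the file changes on disk.
fs.watch(DATA_FILE, () => {
  fs.readFile(DATA_FILE, 'utf8', (err, raw) => {
    if (err) return console.error('reload failed:', err);
    try {
      siteData = JSON.parse(raw);
    } catch (e) {
      console.error('ignoring malformed JSON:', e);
    }
  });
});

// Request handlers read siteData directly, e.g.:
// app.get('/', (req, res) => res.render('index', siteData));
```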
