Impact of increasing roles_validity_in_ms & permissions_validity_in_ms - cassandra

We are seeing a lot of operation timeout exceptions in our 3-node Cassandra cluster. Below is a portion of the error stack:
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:72) ~[apache-cassandra-3.0.9.jar:3.0.9]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:489) ~[apache-cassandra-3.0.9.jar:3.0.9]
at org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:293) ~[apache-cassandra-3.0.9.jar:3.0.9]
at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52) ~[apache-cassandra-3.0.9.jar:3.0.9]
at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.0.9.jar:3.0.9]
at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76) ~[apache-cassandra-3.0.9.jar:3.0.9]
Every time, the exception relates to either PermissionsCache or CassandraRoleManager. After a little research I found a suggested solution: increase roles_validity_in_ms & permissions_validity_in_ms. Thanks to "Enable one time Cassandra Authentication and Authorization check and cache it forever".
The question here is: what is the impact of increasing these values? The DataStax documentation says the cache is effective at small durations:
How long permissions in cache remain valid to manage performance impact of permissions queries. Fetching permissions can be resource intensive. Set the cache validity period to your security tolerances. The cache is used for the standard authentication and the row-level access control (RLAC) cache. The cache is quite effective at small durations.

These parameters control how long the permissions & list of roles stay valid during the current session. It depends heavily on your business requirements: if your application needs roles & permissions to be changeable "online" while it is running, then you need lower values; if it's OK to keep the same roles & permissions until the next reconnect/restart of the app, then you can go with higher values.
But you can also have a combination of both if you set roles_update_interval_in_ms, credentials_update_interval_in_ms & permissions_update_interval_in_ms to lower values than roles_validity_in_ms, credentials_validity_in_ms, and permissions_validity_in_ms (see the docs). If these values are specified, then roles, permissions & credentials will be re-checked in the background at the given intervals; if the check succeeds, the cache is updated, and if it fails, the cached value is still used. For example, you can set roles_validity_in_ms to 1 day and roles_update_interval_in_ms to 10 minutes, so you'll be able to react relatively quickly to changes in the roles for a given user.
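As a rough illustration of that last combination, the relevant cassandra.yaml settings could look something like the sketch below. The values are examples only, and you should check which of these settings exist in your Cassandra version (for instance, the credentials cache settings are not present in every 3.0.x release):

roles_validity_in_ms: 86400000             # roles cached for 1 day
roles_update_interval_in_ms: 600000        # background refresh every 10 minutes
permissions_validity_in_ms: 86400000       # permissions cached for 1 day
permissions_update_interval_in_ms: 600000  # background refresh every 10 minutes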

Related

Very high max response time and errors when submitting a looping form submission

My requirement is to run 90 concurrent users executing multiple scenarios (15 scenarios) simultaneously for 30 minutes on a virtual machine, so for some of the threads I use a Concurrency Thread Group and for others a normal Thread Group.
Now my issues are:
1) After I execute all 15 scenarios, the max response time displayed for each scenario is very high (>40 sec). Is there any suggestion to reduce this high max response time?
2) One of the scenarios submits a web form. There is no issue if I submit only one, but during the 90-concurrent-user execution some of the web form submissions get a 500 error code. Is the error because I use looping to achieve the 30-minute duration?
In order to reduce the response time you need to find the reason for it; the causes could include:
lack of resources like CPU, RAM, etc. - make sure to monitor resource consumption using e.g. the JMeter PerfMon Plugin
incorrect configuration of the middleware (application server, database, etc.) - all these components need to be properly tuned for high loads; for example, if you set the maximum number of connections on the application server to 10 and you have 90 threads, the other 80 threads will be queuing up waiting for the next available executor, and the same applies to the database connection pool
use a profiler tool to inspect what's going on under the hood and why the slowest functions are that slow - it might be the case that your application's algorithms are not efficient enough
If your test succeeds with a single thread and fails under load, it definitely indicates a bottleneck. Try increasing the load gradually and see how many users the application can support without performance degradation and/or errors. HTTP 5xx status codes indicate server-side errors, so it is also worth inspecting your application logs for more insights.

Is delivery of Azure Application Insights custom events guaranteed once TelemetryClient.TrackEvent() is called?

Microsoft states that the SLA for Application Insights is:
We guarantee that the data latency of the Application Insights Service will not exceed two hours 99.9% of the time.
https://azure.microsoft.com/en-us/support/legal/sla/application-insights/v1_0/
For the 0.1% of the time outside the SLA, when TelemetryClient.TrackEvent() executes in my code, is Microsoft guaranteeing that the event will definitely be published at some point (just not within 2 hours)? Or could the event be lost during that 0.1% of the time?
No, just calling TrackEvent doesn't guarantee it is published, for lots of reasons:
sampling at any level of the process. see https://learn.microsoft.com/en-us/azure/application-insights/app-insights-sampling?toc=/azure/azure-monitor/toc.json but in general if sampling is on, some % of your events might be merged together. there are various ways to find those events, but in general it is possible that if you call trackMessage 1000 times in a tight loop with the same content, an SDK might sample that and send a single event with itemCount set to 1000.
the content of the event could be invalid (too large a payload, exceeding thresholds for sizes of fields, too many custom properties, too many custom metrics, etc)
the time of the event could be invalid. events too far in the past (>48h old?) or too far into the future (not sure the exact time there, but some future time is allowed to account for clock skew/drift)
caps - you could exceed the amount you're allowed to send per month - see https://learn.microsoft.com/en-us/azure/application-insights/app-insights-pricing, which at the time of this answer states:
The maximum cap is 1,000 GB/day unless you request a higher maximum for a high-traffic application.
throttling - you could exceed the allowed number of events per second/etc - see https://learn.microsoft.com/en-us/azure/application-insights/app-insights-pricing, which at the time of this answer states:
Throttling limits the data rate to 32,000 events per second, averaged over 1 minute per instrumentation key.
network issues, etc. calling track on the various sdks doesn't guarantee the data is accepted or retried. some of the sdks attempt to retry, some do not.
your application could shut down / crash between the call to track and the point when the connection to Application Insights is actually created/completed (see the sketch after this list).
other random issues, service issues, downtime of other dependent services, etc that account for that 0.1% of missing data. I'm not sure there's any APM/telemetry service that guarantees it will accept and process 100% of the events you send.
(100% - 99.9% is not 0.01%, it is 0.1%. there's a 10x difference there.)
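One partial mitigation for the shutdown/crash case above is to flush the in-memory buffer before the process exits. The sketch below uses the standard .NET SDK types and is not a delivery guarantee; whether Flush() blocks until the data is sent depends on the telemetry channel in use, which is why a short, purely illustrative sleep is often added afterwards:

using System;
using System.Threading;
using Microsoft.ApplicationInsights;

class Program
{
    static void Main()
    {
        var client = new TelemetryClient();   // instrumentation key taken from configuration
        client.TrackEvent("OrderPlaced");     // buffered in memory, not sent immediately

        // On shutdown, push whatever is still buffered and give the channel a
        // moment to transmit it. This reduces, but does not eliminate, loss.
        client.Flush();
        Thread.Sleep(TimeSpan.FromSeconds(5));
    }
}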
I have escalated this issue to the Application Insights team; if I get any feedback, I will update you.
As per my understanding, for the other 0.1% of the time outside the SLA, if there is some downtime, the data would get lost. In any other condition, it would be published later than 2 hours.
Hope it helps.

Is there a limit on the number of sessions for Azure Web SQL Database?

We are using the Azure SQL Database (Web Edition) for a MVC3 ASP.NET/EF5 application.
Is there a limit to the number of sessions that this SQL Database setup supports? I am just wondering whether any delays that we are getting are due to some form of queuing or pooling. Currently we have about 5 concurrent users.
Thanks.
The SQL Azure Web edition database should support a high number of concurrent users - we've had applications running that issue thousands of queries per minute against Web databases.
Throttling
SQL Azure does implement database throttling to maintain performance for all users of the platform. If throttling has been applied to the current operation you'll receive error 40501. The link I've provided also shows you how to determine why throttling is being applied. If you receive this error you can treat it as a transient error and wait before retrying.
It doesn't sound like your connections are being throttled, because you mention only 5 concurrent users and talk about delays, whereas the throttling error would occur pretty quickly.
Transient error handling
If you're getting connection timeouts etc you need to handle them as transient errors. Transient errors are timeouts or dropped connections, as well as error codes 10054, 10053, 40501 (throttling as described above) and 40197 (usually because an upgrade or failover operation is in progress).
You should ensure you implement retry logic to handle transient errors.
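As a minimal sketch of what such retry logic could look like (the error numbers come from the list above; in a real application you would more likely use an existing retry library such as the Transient Fault Handling Application Block):

using System;
using System.Data.SqlClient;
using System.Threading;

static class SqlRetry
{
    // Error numbers treated as transient, per the list above (-2 is the SqlClient timeout code).
    static readonly int[] TransientErrors = { 10053, 10054, 40197, 40501, -2 };

    public static void Execute(string connectionString, string sql, int maxRetries)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand(sql, conn))
                {
                    conn.Open();
                    cmd.ExecuteNonQuery();
                    return;
                }
            }
            catch (SqlException ex)
            {
                bool transient = Array.IndexOf(TransientErrors, ex.Number) >= 0;
                if (!transient || attempt > maxRetries)
                    throw;
                // Back off before retrying; the wait grows with each attempt.
                Thread.Sleep(TimeSpan.FromSeconds(5 * attempt));
            }
        }
    }
}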
Query performance
If you're executing long running queries you can check which ones are slow by logging into the database management URL:
https://<database-id>.database.windows.net/#$database=<database-name>
Log in and click "Query Performance" - take a look at the longest running queries at the top.

Azure DataCache MaxConnectionToServer

I am using the AppFabricCacheSessionStoreProvider and occasionally get the error
ErrorCode:SubStatus:There is a temporary failure. Please retry later. (The request failed, because you exceeded quota limits for this hour. If you experience this often, upgrade your subscription to a higher one). Additional Information: Throttling due to resource : Connections.
I am using a basic 128 MB cache with a web role which has two instances. What is the default MaxConnectionToServer value if it is not set? I think when I fire up a staging instance as well it can cause this error (4 simultaneous instances). Will setting MaxConnectionToServer to a higher value make it better or worse? I believe the 128 MB cache has a limit of 5 connections, so should I set it to 1, which would mean only 4 connections could be used? The cache is not used elsewhere in the app.
The default for MaxConnectionToServer is 1, so you shouldn't have to change this setting, but setting it to 1 explicitly will also avoid confusing anyone else who looks at your config. If you set it to a higher value then you will see this problem more often.
The cache session provider seems to be a little slow at disposing of its connections to the cache when it doesn't need them any more. This means that if you're running a number of instances which is close to the limit for your cache size, you do seem to see this error. You're correct: a 128 MB cache does only allow 5 concurrent connections. If you want to avoid this problem, at the moment the only solution I'm aware of is to buy the next cache size up.
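For reference, if you do want to pin the value explicitly for the session provider, it is normally set on the cache client configuration in web.config. The fragment below is only a sketch; the exact element and attribute names can vary between AppFabric/Azure Caching SDK versions, so treat it as illustrative and compare it against your own dataCacheClients section:

<dataCacheClients>
  <dataCacheClient name="default" maxConnectionsToServer="1">
    <!-- endpoint and security settings omitted for brevity -->
  </dataCacheClient>
</dataCacheClients>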

Throttling login attempts

(This is in principle a language-agnostic question, though in my case I am using ASP.NET 3.5)
I am using the standard ASP.NET login control and would like to implement the following failed login attempt throttling logic.
Handle the OnLoginError event and maintain, in Session, a count of failed login attempts
When this count gets to [some configurable value] block further login attempts from the originating IP address or for that user / those users for 1 hour
Does this sound like a sensible approach? Am I missing an obvious means by which such checks could be bypassed?
Note: ASP.NET Session is associated with the user's browser using a cookie
Edit
This is for an administration site that is only going to be used from the UK and India
Jeff Atwood mentioned another approach: Rather than locking an account after a number of attempts, increase the time until another login attempt is allowed:
1st failed login no delay
2nd failed login 2 sec delay
3rd failed login 4 sec delay
4th failed login 8 sec delay
5th failed login 16 sec delay
That would reduce the risk that this protection measure can be abused for denial of service attacks.
See http://www.codinghorror.com/blog/archives/001206.html
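A minimal sketch of computing that delay schedule (how you obtain the attempt number - Session, a cache, or a database - is up to you):

using System;

static class LoginDelay
{
    // Delay before the n-th login attempt: none for the first,
    // then 2, 4, 8, 16... seconds, matching the table above.
    public static TimeSpan For(int attemptNumber)
    {
        return attemptNumber <= 1
            ? TimeSpan.Zero
            : TimeSpan.FromSeconds(Math.Pow(2, attemptNumber - 1));
    }
}

For example, LoginDelay.For(3) gives 4 seconds. How you apply the delay matters - see the later answer about not tying up the request thread with Thread.Sleep.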
The last thing you want to do is store all unsuccessful login attempts in a database; that'll work well enough, but it also makes it extremely trivial for a DDoS attack to bring your database server down.
You are probably using some type of server-side cache on your webserver, memcached or similar. Those are perfect systems to use for keeping track of failed attempts by IP address and/or username.  If a certain threshold for failed login attempts is exceeded you can then decide to deactivate the account in the database, but you'll be saving a bunch of reads and writes to your persisted storage for the failed login counters that you don't need to persist.
If you're trying to stop people from brute-forcing authentication, a throttling system like Gumbo suggested probably works best.  It will make brute-force attacks uninteresting to the attacker while minimizing impact for legitimate users under normal circumstances or even while an attack is going on.  I'd suggest just counting unsuccessful attempts by IP in memcached or similar, and if you ever become the target of an extremely distributed brute-force attack, you can always elect to also start keeping track of attempts per username, assuming that the attackers are actually trying the same username often.  As long as the attack is not extremely distributed, as in still coming from a countable number of IP addresses, the initial by-IP code should keep attackers out pretty adequately (a rough sketch of this counting approach follows this answer).
The key to preventing issues with visitors from countries with a limited number of IP addresses is to not make your thresholds too strict; if you don't receive multiple attempts in a couple of seconds, you probably don't have much to worry about re. scripted brute-forcing.  If you're more concerned with people trying to unravel other users' passwords manually, you can set wider boundaries for subsequent failed login attempts by username.
One other suggestion, that doesn't answer your question but is somewhat related, is to enforce a certain level of password security on your end-users.  I wouldn't go overboard with requiring a mixed-case, at least x characters, non-dictionary, etc. etc. password, because you don't want to bug people too much when they haven't even signed up yet, but simply stopping people from using their username as their password should go a very long way to protect your service and users against the most unsophisticated – guess why they call them brute-force ;) – of attacks.
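As a rough sketch of the counting idea, here is a per-IP failure counter using ASP.NET's in-process HttpRuntime.Cache as a stand-in for memcached. The key prefix, the one-hour window, and the threshold of 10 are all illustrative, and the read-then-insert below is not atomic the way memcached's incr would be:

using System;
using System.Web;
using System.Web.Caching;

public static class LoginThrottle
{
    private const int MaxFailures = 10;                        // illustrative threshold
    private static readonly TimeSpan Window = TimeSpan.FromHours(1);

    public static bool IsBlocked(string ip)
    {
        object count = HttpRuntime.Cache["login-failures:" + ip];
        return count != null && (int)count >= MaxFailures;
    }

    public static void RecordFailure(string ip)
    {
        string key = "login-failures:" + ip;
        object current = HttpRuntime.Cache[key];
        int count = current == null ? 1 : (int)current + 1;

        // Re-insert the counter with a one-hour absolute expiration.
        HttpRuntime.Cache.Insert(key, count, null,
            DateTime.UtcNow.Add(Window), Cache.NoSlidingExpiration);
    }
}

IsBlocked would be checked before validating credentials, and RecordFailure called from the OnLoginError event mentioned in the question.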
The accepted answer, which inserts increasing delays into successive login attempts, may perform very poorly in ASP.NET depending on how it is implemented. ASP.NET uses a thread pool to service requests. Once this thread pool is exhausted, incoming requests will be queued until a thread becomes available.
If you insert the delay using Thread.Sleep(n), you will tie up an ASP.NET thread pool thread for the duration of the delay. This thread will no longer be available to execute other requests. In this scenario a simple DOS style attack would be to keep submitting your login form. Eventually every thread available to execute requests will be sleeping (and for increasing periods of time).
The only way I can think of to properly implement this delay mechanism is to use an asynchronous HTTP handler. See Walkthrough: Creating an Asynchronous HTTP Handler. The implementation would likely need to:
Attempt authentication during BeginProcessRequest and determine the delay upon failure
Return an IAsyncResult exposing a WaitHandle that will be triggered after the delay
Make sure the WaitHandle has been triggered (or block until it has been) in EndProcessRequest
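To make those steps concrete, here is a rough, untested sketch of the shape such a handler could take. The LoginDelayResult class and the GetDelayForFailedAttempts helper are hypothetical names, and real code would also need to perform the actual authentication and write a proper response:

using System;
using System.Threading;
using System.Web;

public class ThrottledLoginHandler : IHttpAsyncHandler
{
    public bool IsReusable { get { return true; } }

    public IAsyncResult BeginProcessRequest(HttpContext context, AsyncCallback cb, object state)
    {
        // Attempt authentication here and work out the penalty delay on failure,
        // e.g. 2^(n-1) seconds for the n-th attempt (hypothetical helper).
        TimeSpan delay = GetDelayForFailedAttempts(context.Request.UserHostAddress);

        var result = new LoginDelayResult(cb, state);
        // Create the timer disabled, attach it, then start it, so it cannot fire
        // before the handler has finished wiring things up. No pool thread sleeps.
        var timer = new Timer(_ => result.Complete(), null, Timeout.Infinite, Timeout.Infinite);
        result.Timer = timer;
        timer.Change(delay, TimeSpan.FromMilliseconds(-1));
        return result;
    }

    public void EndProcessRequest(IAsyncResult asyncResult)
    {
        var result = (LoginDelayResult)asyncResult;
        result.AsyncWaitHandle.WaitOne(); // already signalled when ASP.NET calls back
        result.Timer.Dispose();
        // Write the (failed) login response here.
    }

    public void ProcessRequest(HttpContext context) { throw new NotSupportedException(); }

    private static TimeSpan GetDelayForFailedAttempts(string ip)
    {
        return TimeSpan.Zero; // look up the failure count for this IP/user and map it to a delay
    }
}

// Bare-bones IAsyncResult backed by a ManualResetEvent.
internal class LoginDelayResult : IAsyncResult
{
    private readonly ManualResetEvent _done = new ManualResetEvent(false);
    private readonly AsyncCallback _callback;
    private readonly object _state;
    private volatile bool _completed;

    public LoginDelayResult(AsyncCallback callback, object state)
    {
        _callback = callback;
        _state = state;
    }

    public Timer Timer { get; set; }

    public void Complete()
    {
        _completed = true;
        _done.Set();
        if (_callback != null) _callback(this);
    }

    public object AsyncState { get { return _state; } }
    public WaitHandle AsyncWaitHandle { get { return _done; } }
    public bool CompletedSynchronously { get { return false; } }
    public bool IsCompleted { get { return _completed; } }
}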
This could possibly affect your genuine users too. For example, in countries like Singapore there are a limited number of ISPs and a smaller set of IPs available to home users.
Alternatively, you could insert a CAPTCHA after x failed attempts to thwart script kiddies.
I think you'll need to keep the count outside the session - otherwise the trivial attack is to clear cookies before each login attempt.
Otherwise a count and lock-out is reasonable - although an easier solution might be to have a doubling timeout between each login failure, e.g. 2 seconds after the first failure, 4 seconds after the next, then 8, etc.
You implement the timeout by refusing logins in the timeout period - even if the user gives the correct password - just reply with human readable text saying that the account is locked-out.
Also monitor for same ip/different user and same user/different ip.
