Is bcrypt viable for large web sites? - security

I've been on the bcrypt bandwagon for a while now, but I'm having trouble answering a simple nagging question.
Imagine I have a reasonably successful web site in the U.S... about 100,000 active users that each have activity patterns requiring 2 to 3 authentication attempts on average over the course of a typical American business day (12 hours when you include timezones). That's 250,000 authentication requests per day, or about 5.8 authentications per second.
One of the neat things about bcrypt is that you can tune it, so that over time it scales as hardware does, to stay ahead of the crackers. A common tuning is to get it to take just over 1/10 of a second per hash... let's say I get it to 0.173 seconds per hash. I chose that number because it just so happens that 0.173 seconds per hash works out to about 5.8 hashes per second. In other words, my hypothetical web server is literally spending all its time doing nothing but authenticating users. Never mind actually doing any useful work.
To address this issue, I would have to either tune bcrypt way down (not a good idea) or get a dedicated server just to do authentications, and nothing else. Now imagine that the site grows and adds another 100,000 users. Suddenly I need two servers: again, doing nothing but authentication. Don't even start thinking about load spikes, as you have light and busy periods throughout a day.
As I see it right now, this is one of those problems that would be nice to have, and bcrypt would still be worth the trouble. But I'd like to know: am I missing something obvious here? Something subtle? Or can anyone out there actually point to a well-known web site running a whole server farm just for the authentication portion of their site?
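For reference, here is the arithmetic above as a quick back-of-the-envelope check (just restating the numbers from the question):

```python
users = 100_000
auths_per_user_per_day = 2.5          # "2 to 3 authentication attempts" on average
business_day_seconds = 12 * 3600      # 12-hour U.S. business day

auths_per_second = users * auths_per_user_per_day / business_day_seconds
print(f"{auths_per_second:.1f} authentications/second")      # ~5.8

seconds_per_hash = 0.173
core_utilization = auths_per_second * seconds_per_hash
print(f"{core_utilization:.0%} of a single core spent hashing")   # ~100%
```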

Even if you tune bcrypt to take only, say, 1/1000 of a second, that's still quite a bit slower than simple hashing — a quick and dirty Perl benchmark says my not-so-new computer can calculate about 300,000 SHA-256 hashes per second.
Yes, the difference between 1000 and 300,000 is only about 8 bits, but that's still 8 bits of security margin you wouldn't have otherwise, and that difference is only going to increase as CPUs get faster.
Also, if you use scrypt instead of bcrypt, it will retain its memory-hardness property even if the iteration count is lowered, which will still make brute forcing it harder to parallelize.
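If you want to reproduce numbers like these on your own hardware, you can time a few cost factors directly. A minimal sketch using Python's bcrypt and hashlib packages; the range of cost factors and the iteration counts are arbitrary assumptions, tune to taste:

```python
import time
import hashlib
import bcrypt  # pip install bcrypt

password = b"correct horse battery staple"

# Time one bcrypt hash at each cost factor to find the setting
# that lands near your latency budget (e.g. ~0.1 s per hash).
for rounds in range(10, 15):
    start = time.perf_counter()
    bcrypt.hashpw(password, bcrypt.gensalt(rounds=rounds))
    print(f"cost={rounds}: {time.perf_counter() - start:.3f} s per hash")

# For comparison, measure raw SHA-256 throughput on the same box.
n = 100_000
start = time.perf_counter()
for _ in range(n):
    hashlib.sha256(password).digest()
elapsed = time.perf_counter() - start
print(f"SHA-256: {n / elapsed:,.0f} hashes per second")
```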

Related

How can I detect a tampered big file in a short time?

I'm very frustrated in the battle with the cheaters of my game. I found that a lot of hackers tampered with my game data to get around the anti-cheat system. I have tried some methods to verify whether the game data has been tampered with, such as encrypting my asset package or checking the hash of the package header.
However, I'm stuck on the issue that my asset package is huge, almost 1-3 GB. I know a digital signature does very well at verifying data, but I need this to be done in almost real time.
It seems I have to make a trade-off between verifying the whole file and performance. Is there any way to verify a huge file in a short time?
AES-NI-based hashing such as Meow Hash can easily reach 16 bytes per cycle on a single thread; that is, for data already in cache, it processes tens of gigabytes of input per second. Obviously, in reality, memory and disk I/O speed become the limiting factor, but those apply to any method, so you can think of them as the upper limit. Since it's not designed for security, it's also possible for cheaters to quickly find a viable collision.
But even if you find a sweet spot between speed and security, you're still relying on cheaters not redirecting your file/memory I/O. Additionally, it's still possible for cheaters to simply NOP out any asset-verification call. Since you care about cheaters, I'd assume this is an online game. The more common practice is to rearchitect the game to prevent cheating even with a broken asset: Valorant moves line-of-sight calculations to the server side, and LoL added a kernel driver.
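If you do decide to hash the package, incremental hashing keeps memory use flat and lets disk I/O dominate. A rough sketch using Python's hashlib with keyed BLAKE2b acting as a MAC; the chunk size, file path, and key handling are assumptions, and as noted above a determined cheater can still patch out the check:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB reads keep memory use flat

def package_digest(path: str, key: bytes) -> str:
    """Stream the asset package through keyed BLAKE2b (acts as a MAC)."""
    h = hashlib.blake2b(key=key, digest_size=32)
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            h.update(chunk)
    return h.hexdigest()

# Usage (hypothetical path; the key must be kept away from the client,
# which is exactly the hard part the answer above points out):
# print(package_digest("assets.pak", key=b"server-side-secret"))
```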

Why does it take so long for most applications to verify that a password is incorrect?

SSH and *NIX logins in general come to mind in particular, though I've seen this in many different apps, including Windows logins. It's seemingly much less common in web apps for some reason. Isn't the process just "hash the input password and compare it to an existing hash"? Wouldn't this cost the same regardless of whether the password is correct or not? In *NIX you can kill the login attempt no problem, so it's not really much of a deterrent to an attacker making repeated attempts.
Usually, there is an artificial random wait time if the login fails. This is to prevent timing attacks: if the password were stored unhashed (or a weak hashing method were used), this stops an attacker from measuring how long it took the system to compare the two password strings.
Let's assume the correct password is "abc". An attacker would try the passwords 'a' to 'z' and see that for the password 'a' the time until the login is denied is longer than for 'b' to 'z'. That is because the system has to compare two bytes ('a' and a null byte) against the real password before realizing that 'a' is not correct. Then the attacker would try 'aa' to 'az' and see that 'ab' takes the system longer to reject. Using this method it is possible to massively reduce the search space and therefore the time it takes to brute-force the password.
If you compensate for this by adding pseudo-random wait times after the passwords have been compared, measuring the time it took the system to compare the two passwords becomes more difficult. Even more so if the login process measures how long the comparison took and subtracts that from the random wait time, to prevent an attacker from extracting information through statistical anomaly detection. Obviously, if the passwords are hashed using strong hashing methods for which no rainbow tables exist, then this becomes pointless, as one cannot purposefully generate passwords with a certain hash prefix.
The reason that most web apps do not employ this technique is either that HTTP and common web servers add so much noise to the processing time that such timing attacks become infeasible, or that most web devs just don't care ;)
Additionally, this also decreases the amount of possible password attempts for an attacker, as the login has to be aborted prior to trying the next password on that connection (in the case of ssh).
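On the implementation side, the usual building blocks are a constant-time comparison plus the padded delay described above. A small illustrative sketch in Python; the delay bounds are arbitrary assumptions:

```python
import hmac
import secrets
import time

def verify_token(supplied: bytes, expected: bytes) -> bool:
    """Compare secrets without leaking how much of them matched via timing."""
    start = time.perf_counter()
    ok = hmac.compare_digest(supplied, expected)  # constant-time compare
    if not ok:
        # Pad the failure path with a random delay, minus the time already
        # spent, so the response time carries as little signal as possible.
        target = secrets.SystemRandom().uniform(0.2, 0.5)
        remaining = target - (time.perf_counter() - start)
        if remaining > 0:
            time.sleep(remaining)
    return ok
```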
One big offender on *nix systems is UseDNS in sshd_config when the target host cannot reach a DNS server: it will wait for name resolution to time out before going any further. Try UseDNS no in sshd_config and see if it improves :-)

Threads in application server and connections in HttpServer

I'm building a system that will use two servers and one load balancer.
The company has more than 60,000 users and expects 10,000 concurrent users; all transactions must complete within 5 seconds.
I'm not sure how to determine the following for each server:
Number of connections in the HTTP server
Number of threads in the application server
I understand that I will find out these numbers once the system is in production, but I need to start with something.
Any suggestions or advice?
This is about capacity planning. I can give some suggestions below; however, everything depends on the technical feasibility and business requirements of your system.
Try to find out the maximum capacity you need to support; a proper stress test will help you figure this out.
Make sure the system can improve performance horizontally, with clustering etc.
Decide on a predicted capacity requirement (CR); the CR may cover hardware, bandwidth, etc.
Predicted CR = Current CR + 30% * Current CR (a small worked example follows this list)
Finally, this is about continuous improvement; keep an eye on the changes.
Check how reliable the system is, and decide on changes to hardware, software, architecture, etc.
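As a trivial illustration of the 30% headroom rule above (the current figure is an assumed example, not from the question):

```python
current_cr = 2_000                    # assumed: requests/sec handled today
predicted_cr = current_cr + 0.30 * current_cr
print(predicted_cr)                   # 2600.0 -> plan capacity ~30% above current load
```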
Hope this adds some value for you.
Set up a test server and extrapolate the numbers from there (take some time to do the research and come up with an educated guess).
For example, the number of threads in the application server depends on what kind of HTTP server you use. Some servers can process hundreds of connections with a single thread, but badly programmed or configured servers might end up using one thread per connection.
The performance requirement "all transactions must complete within 5 seconds" needs some further detailing. Showing web pages with data (from a database) to a user should not take more than 3 seconds (if it takes longer, users get irritated), and ideally should take less than 1 second (what an average user would expect). On the other hand, it might be OK to take 10 seconds to store data from a complex form (as long as the form is not used too often).
I would be skeptical about the load requirement of 10,000 concurrent users. That would mean 1 in 6 company employees is actively using the web application at the same time. I think this would be 'peak usage' rather than 'average usage'. This is important to know with respect to the performance requirements and costs: if the system must meet the performance requirements during peak usage, you need more money for better and/or more hardware.
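To turn those figures into a first guess, Little's Law (concurrency = throughput x response time) is a reasonable starting point. A back-of-the-envelope sketch, where the 10,000 concurrent users, 5-second limit, and two servers come from the question and everything else is an assumption:

```python
# Little's Law: concurrency = throughput * response_time
# Rearranged:   throughput needed = concurrent_users / response_time

concurrent_users = 10_000      # peak figure from the question
response_time_s = 5            # "all transactions within 5 seconds"
servers = 2

required_throughput = concurrent_users / response_time_s   # ~2,000 req/s total
per_server = required_throughput / servers                 # ~1,000 req/s each

# If the app server uses a thread-per-request model, and every concurrent
# user really does have a request in flight, each server must cover:
threads_per_server = concurrent_users / servers             # ~5,000 threads

print(f"Cluster throughput needed: {required_throughput:,.0f} req/s")
print(f"Per server:                {per_server:,.0f} req/s")
print(f"Thread-per-request pool:   {threads_per_server:,.0f} threads/server")
# 5,000 blocked threads per box is an argument for async I/O or more servers.
```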

Should I go for faster queries or less CPU consumption / faster processing?

I have to choose between performing a query for data of size X and not processing it, just sending it to the client,
OR
I can choose to perform a query for data of half the size of X and do a little processing, then send it to the client.
Now, in my life as a programmer I've run into the storage-vs-speed problem quite a bit, but in this case I have to choose between "fast query + processing" and "slow query + no processing".
If it matters, I am using nodejs for the server and mongodb for the database.
If you care, I am holding non-intersecting map areas and testing whether a given area intersects any of them or none. All are boxes. If I keep them as origin points, it's only one pair of coordinates each, and I have to process each point into an area (all map areas have the same size). If I store them as areas directly, I don't have to process them anymore, but it's 4 pairs of coordinates now: 4 times the size and, I think, a slower query.
There is no right answer to this question; it all depends on your infrastructure. If you're using Amazon Web Services, for example, it depends on the transaction price. If you've got your own infrastructure, it depends on the load of the DB and web servers. If they are on the same server, it's a matter of the underlying hardware whether DB I/O starts to limit before the CPU/memory becomes the bottleneck.
The only way to determine the right answer for your specific situation is to set it up and do a stress test, for example using Load Impact or one of the tons of other good tools for this. While it is getting hammered, monitor your system load using top and watch the wa column specifically: if it consistently goes above 50%, you're I/O-limited, and you should shift work off the DB and onto the CPU (i.e., the smaller query plus processing).
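For what it's worth, the "little processing" described in the question, expanding a stored origin point into a fixed-size box and checking overlap, is very cheap compared to moving extra data over the wire. A minimal sketch; the box size and coordinates are hypothetical, assuming axis-aligned boxes of uniform size:

```python
from typing import Tuple

BOX_W, BOX_H = 10.0, 10.0  # hypothetical uniform map-area size

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

def box_from_origin(x: float, y: float) -> Box:
    """Expand a stored origin point into its full axis-aligned box."""
    return (x, y, x + BOX_W, y + BOX_H)

def intersects(a: Box, b: Box) -> bool:
    """Standard AABB overlap test: a few comparisons per candidate."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]

# Usage: query only origin points from the DB (smaller payload),
# then do the expansion and overlap test in the application layer.
candidate = box_from_origin(3.0, 4.0)
query_area = (0.0, 0.0, 5.0, 5.0)
print(intersects(candidate, query_area))  # True
```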

Is there any modern review of solutions to the 10000 client/sec problem

(Commonly called the C10K problem)
Is there a more contemporary review of solutions to the c10k problem (Last updated: 2 Sept 2006), specifically focused on Linux (epoll, signalfd, eventfd, timerfd..) and libraries like libev or libevent?
Something that discusses all the solved and still unsolved issues on a modern Linux server?
The C10K problem generally assumes you're trying to optimize a single server, but as your referenced article points out, "hardware is no longer the bottleneck". Therefore, the first step is to make sure it isn't easier and cheaper to just throw more hardware into the mix.
If we've got a $500 box serving X clients per second, it's a lot more efficient to just buy another $500 box to double our throughput than to let an employee gobble up who knows how many hours and dollars trying to figure out how to squeeze more out of the original box. Of course, that's assuming our app is multi-server friendly, that we know how to load balance, etc., etc...
Coincidentally, just a few days ago, Programming Reddit or maybe Hacker News mentioned this piece:
Thousands of Threads and Blocking IO
In the early days of Java, my C programming friends laughed at me for doing socket IO with blocking threads; at the time, there was no alternative. These days, with plentiful memory and processors it appears to be a viable strategy.
The article is dated 2008, so it moves your horizon forward by a couple of years.
To answer OP's question, you could say that today the equivalent document is not about optimizing a single server for load, but about optimizing your entire online service for load. From that perspective, the number of combinations is so large that what you are looking for is not a document; it is a live website that collects such architectures and frameworks. Such a website exists, and it's called www.highscalability.com
Side Note 1:
I'd argue against the belief that throwing more hardware at it is a long term solution:
Perhaps the cost of an engineer who "gets" performance is high compared to the cost of a single server. What happens when you scale out? Let's say you have 100 servers: a 10 percent improvement in server capacity can save you the cost of 10 servers a month.
Even if you have just two machines, you still need to handle performance spikes. The difference between a service that degrades gracefully under load and one that breaks down is that someone spent time optimizing for the load scenario.
Side note 2:
The subject of this post is slightly misleading. The C10K document does not try to solve the problem of 10k clients per second. (The number of clients per second is irrelevant unless you also define a workload along with sustained throughput under bounded latency. I think Dan Kegel was aware of this when he wrote that doc.) Look at it instead as a compendium of approaches for building concurrent servers, and of micro-benchmarks for the same. Perhaps what has changed between then and now is that at one point you could assume the service was a website serving static pages; today the service might be a NoSQL datastore, a cache, a proxy, or one of hundreds of pieces of network infrastructure software.
You can also take a look at this series of articles:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3
He shows a fair amount of performance data and the OS configuration work he had to do in order to support 10K and then 1M connections.
It seems like a system with 30GB of RAM could handle 1 million connected clients on a sort of social network type of simulation, using a libevent frontend to an Erlang based app server.
The libev project publishes some benchmarks of itself against libevent...
I'd recommend reading Zed Shaw's poll, epoll, science, and superpoll [1]: why epoll isn't always the answer, why it's sometimes even better to go with poll, and how to get the best of both worlds.
[1] http://sheddingbikes.com/posts/1280829388.html
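For a concrete feel of the readiness-based loop that poll/epoll (and wrappers like libev/libevent) expose, here is a minimal non-blocking echo server using Python's selectors module, which picks epoll on Linux and falls back to kqueue, poll, or select elsewhere; the port and backlog are arbitrary:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue/poll/select elsewhere

def accept(listener: socket.socket) -> None:
    conn, _addr = listener.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn: socket.socket) -> None:
    data = conn.recv(4096)
    if data:
        conn.sendall(data)          # echo back; fine for a toy example
    else:                           # empty read means the peer closed
        sel.unregister(conn)
        conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 9000))    # arbitrary port
listener.listen(1024)
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, accept)

# One thread multiplexes every connection: the C10K pattern in miniature.
while True:
    for key, _mask in sel.select():
        key.data(key.fileobj)       # dispatch to accept() or echo()
```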
Have a look at the RAMCloud project at Stanford: https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud
Their goal is 1,000,000 RPC operations/sec/server. They have numerous benchmarks and commentary on the bottlenecks that are present in a system which would prevent them from reaching their throughput goals.
