I'm currently deciding whether to manage my own Varnish servers or use a hosted service like Fastly. One of the most important decision factors here is efficient tag-based cache invalidation, since I plan to put Varnish in front of our API and we'll need to frequently issue purge requests that invalidate a number of related pages.
Fastly offers Surrogate Keys, and Varnish appears to offer a separate module that goes by a number of names, including Hashtwo, Hashninja, and XKey. These features appear to be the same. Are they in fact the same, or is there some key technical or efficiency difference between the two features that is not clear from the blog posts about them?
Surrogate Keys is Fastly's implementation of this functionality. I wrote our current implementation, but have not used HashTwo/Hashninja/xkey, so I am not an authority on differences between the implementations. Xkey is publicly available as a vmod at https://github.com/varnish/libvmod-xkey.
Surrogate Keys are a standard part of Fastly's service, but since we are a CDN, the feature is only available as part of our hosted platform. It is not open source for no particularly good reason; there has been some discussion about open-sourcing it, but it's not a high priority (partly because our Varnish is a fork of 2.1.4).
Individual keys are not allowed to exceed 1 KB (why would you need a longer key?) and the entire key list is not allowed to exceed 16 KB. We raised the limits to these values (previously it was 1 KB total) at customer request a year or so ago. There is no limit on the number of keys as long as they fit within that space (though I realize this does effectively bound the keyspace). The rationale for bounding the length is that a key purge results in some number of linear-time operations, which we prefer to keep bounded. I'd be surprised if there were any practical problem with our current limits.
I would note that xkey is also limited in length and number of keys in the sense that the key is also specified through a header, and header length is effectively bounded by the available workspace for the thread servicing the connection. This length is tunable if you run your own Varnish, and this may not be a practical limitation for you, but it does exist.
Another minor difference I noted scanning through the code is that the xkey vmod supports multiple xkey headers, while Fastly Surrogate Keys are taken from the first matching header. There are some differences in terms of data structures used to achieve the functionality (partially due to the fact that we run a multi-tenant Varnish), but the functionality otherwise appears to be similar.
Finally, we (at this point in time) have a cluster of several hundred Varnish installations globally. Part of our infrastructure has to do with reliably distributing purges through our network and ensuring that they are applied globally. If you run a cluster of Varnish nodes, you may have to do some additional work to invalidate cache across multiple nodes (though this is unlikely to be a significant problem for a small cluster).
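To make the tag workflow concrete from the application side, here is a rough sketch of what a purge-by-key might look like against a self-hosted Varnish with the xkey vmod. The origin would tag cacheable responses with a header such as xkey: product-42 category-7 (Fastly reads Surrogate-Key instead), and the cache hostname, the PURGE method, and any access control in this sketch are assumptions: they are whatever your own VCL (or, on Fastly, the hosted purge API) wires up.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TagPurgeSketch {
    // Hypothetical cache endpoint; with the xkey vmod the purge handling
    // (method, header name, ACLs) is whatever your own VCL defines.
    private static final URI CACHE = URI.create("http://cache.example.com/");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Invalidate every cached object that was tagged with "product-42"
        // via an xkey (or, on Fastly, Surrogate-Key) response header.
        HttpRequest purge = HttpRequest.newBuilder(CACHE)
                .method("PURGE", HttpRequest.BodyPublishers.noBody())
                .header("xkey", "product-42")
                .build();

        HttpResponse<String> resp =
                client.send(purge, HttpResponse.BodyHandlers.ofString());
        System.out.println("Purge response: " + resp.statusCode());
    }
}
```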
xkey and hashtwo (hashninja in some marketing material) are the same.
I think the main difference from the Fastly offering is that xkey doesn't add any restrictions on the length or number of keys per object/URL. As far as I know, both work pretty well. (Full disclosure: I work at Varnish Software.)
We serve a lot of our org's content through one big CloudFront distribution. It has cache behaviors that route traffic to various components in our system. We manage the distribution with CloudFormation. As we manage more and more components (some with Lambda@Edge), this one distribution is getting a bit complex.
I'd like to start breaking up this distribution into smaller pieces to limit that complexity and to limit blast radius. To this end, I'm thinking about putting some behavior into a new distribution that serves as a custom origin of our main distribution.
I see very little about this approach on the web, which worries me :-) Anyone know of any documentation or blogs about this? Are there any gotchas here that others have experienced? E.g. is there significant extra latency? I can imagine that this might incur extra HTTPS request charges (as each request would in effect be hitting two distributions instead of one).
I am trying to calculate impact metrics (confidentiality, availability and integrity) in the CVSS standard.
Can anyone suggest how this is done?
Since you are following a standard precisely so that findings are communicated in a standardized, commonly agreed way, you should use the definitions provided by that standard.
Refer to this document: https://www.first.org/cvss/specification-document
First of all, there is a distinction between CVSS v2 and v3, so you would need to specify which one you use (v2 has 'Complete' and v3 has 'High' as the maximum possible impact value for C/I/A).
In any case, in order to assess the proper impact, you need to understand the vulnerability and the standard. The official standard has many different examples.
A couple of examples: assume you have a vulnerability that allows you to crash the Apache web server by sending a malformed request remotely. If we have to assign a CVSS score for the web server itself, Confidentiality would not be impacted, as nothing is leaked, and Integrity would not be impacted, as no data is being manipulated. Availability, however, would be completely impacted, as the server would not respond anymore.
However, let's assume that we have to assign a CVSS score for the same issue, only this time for the operating system (i.e. the server on which the HTTP server is running). While confidentiality and integrity are still not impacted, the availability of the server would be impacted only PARTIALLY: while the web server might not respond anymore, other services (e.g. SSH, a mail server, etc.) still might.
This is to make clear that CVSS cannot be calculated automatically; it has to be assessed for each vulnerability in relation to the affected software component. There are also a number of other pitfalls covered by the standard, so I strongly recommend you go read it!
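To make the arithmetic concrete, here is a rough sketch of the CVSS v3.1 base-score formula applied to the web-server case above (vector AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H, i.e. only availability impacted, scope unchanged). The weights are the ones published in the v3.1 specification; treat this as an illustration of how the pieces combine, not as a replacement for the official calculator.

```java
public class CvssV31Sketch {
    // Metric weights from the CVSS v3.1 specification.
    static final double AV_NETWORK = 0.85, AC_LOW = 0.77,
                        PR_NONE = 0.85, UI_NONE = 0.85;
    static final double NONE = 0.0, LOW = 0.22, HIGH = 0.56;

    // Round up to one decimal place (simplified version of the spec's Roundup).
    static double roundUp(double x) {
        return Math.ceil(x * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H -- the "crash the web server" case.
        double c = NONE, i = NONE, a = HIGH;

        double iss = 1 - (1 - c) * (1 - i) * (1 - a);
        double impact = 6.42 * iss;                                   // scope unchanged
        double exploitability = 8.22 * AV_NETWORK * AC_LOW * PR_NONE * UI_NONE;

        double base = impact <= 0 ? 0
                : roundUp(Math.min(impact + exploitability, 10.0));
        System.out.println("Base score: " + base);                    // prints 7.5
    }
}
```

The "Partial" availability case from the second example would be scored under v2, which has its own formula and weights, which is exactly why you need to state which version you are scoring against.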
I would like to get my head around how best to set up a client-server architecture where security is of utmost importance.
So far I have the following, which I hope someone can tell me is good enough, or whether there are other things I need to think about, or whether I have the wrong end of the stick and need to rethink things.
Use an SSL certificate on the server to ensure the traffic is secure.
Have a firewall set up between the server and client.
Have a separate sql db server.
Have a separate db for my security model data.
Store my passwords in the database using a secure hashing function such as PBKDF2.
Passwords generated using a salt which is stored in a different db to the passwords.
Use cloud based infrastructure such as AWS to ensure that the system is easily scalable.
I would really like to know whether there are any other steps or layers I need to make this secure. Is storing everything in the cloud wise, or should I have some physical servers as well?
I have tried searching for some diagrams which could help me understand but I cannot find any which seem to be appropriate.
Thanks in advance
Hardening your architecture can be a challenging task, and sharding your services across multiple servers and over-engineering your architecture for the semblance of security could prove to be your largest security weakness.
However, a number of questions arise when you come to design your IT infrastructure which can't be answered in a single SO answer (I will try to find some good white papers and append them).
There are a few things I would advise, which are somewhat opinionated and backed up by my own thinking around them.
Your Questions
I would really like to know whether there are any other steps or layers I need to make this secure. Is storing everything in the cloud wise, or should I have some physical servers as well?
Settle for the cloud. You do not need to store things on physical servers anymore unless you have current business processes running core business functions that are already working on local physical machines.
Running physical servers increases your system administration burden with things such as HDD encryption and physical security requirements, which can be misconfigured or completely ignored.
Use an SSL certificate on the server to ensure the traffic is secure.
This is normally a no-brainer and I would go with a straight "Yes"; however, you must take the context into consideration. If you are running something such as a blog or documentation-related website that does not transfer any sensitive information at any point in time over HTTP, then why use HTTPS? HTTPS has its own overhead; it's minimal, but it's still there. That said, if in doubt, enable HTTPS.
Have a firewall set up between the server and client.
That is suggested; you may also want to opt for a service such as the Cloudflare WAF, though I haven't personally used it.
Have a separate sql db server.
Yes, however not necessarily for security purposes. Database servers and web application servers have different hardware requirements, and optimizing one box for both simultaneously is not very feasible. Additionally, having them on separate boxes increases your scalability quite a bit, which will be beneficial in the long run.
From a security perspective, it's mostly another illusion of "if I have two boxes and the attacker compromises one [the web application server], he won't have access to the database server".
At first sight this might seem to be the case, but it is rarely so. Compromising the web application server is still almost a guaranteed game over. I will not go into much detail on this (unless you specifically ask me to); however, it's still a good idea to keep both services separate from each other in their own boxes.
Have a separate db for my security model data.
I'm not sure I understood this. What security model are you referring to, exactly? Care to share a diagram or two (maybe an ERD) so we can get a better understanding?
Store my passwords in the database using a secure hashing function such as PBKDF2.
An obvious yes. What I am about to say, however, is controversial and may be flagged by some people (it's a bit of a hot debate): I recommend using BCrypt instead of PBKDF2, due to BCrypt being slower to compute (and therefore slower to crack).
See - https://security.stackexchange.com/questions/4781/do-any-security-experts-recommend-bcrypt-for-password-storage
Passwords generated using a salt which is stored in a different db to the passwords.
If you use BCrypt, I do not see why this is required (I may be wrong). I go into more detail regarding username and password hashing in the following Stack Overflow answer, which I recommend you read: Back end password encryption vs hashing
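For what it's worth, here is a minimal sketch of salted password hashing with PBKDF2 using only the JDK (the iteration count is an arbitrary example you would tune to your own hardware; BCrypt needs a third-party library but follows the same pattern). Note that the salt is stored right next to the hash: it only has to be unique per password, not secret, which is why splitting salts into a separate database adds complexity without buying much.

```java
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordHashingSketch {
    private static final int ITERATIONS = 210_000; // example value; tune to your hardware
    private static final int KEY_BITS = 256;
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns "iterations:salt:hash", which is everything needed to verify later. */
    static String hash(char[] password) throws Exception {
        byte[] salt = new byte[16];
        RANDOM.nextBytes(salt);                     // per-password random salt

        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
        byte[] hash = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                                      .generateSecret(spec).getEncoded();

        Base64.Encoder b64 = Base64.getEncoder();
        return ITERATIONS + ":" + b64.encodeToString(salt)
                          + ":" + b64.encodeToString(hash);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hash("correct horse battery staple".toCharArray()));
    }
}
```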
Use cloud based infrastructure such as AWS to ensure that the system is easily scalable.
This purely depends on your goals, budget and requirements. I would personally go for AWS, however you should read some more on alternative platforms such as Google Cloud Platform before making your decision.
Last Remarks
All of the things you mentioned are important, and it's good that you are even considering them (most people just ignore such questions or go with the most popular answer); however, there are a few additional things I want to point out:
Internal Services - Make sure that no unneeded services and processes are running on the server, especially in production. These services will often be running old versions of their software (since you won't be administering them) that could be used as an entry point for your server to be compromised.
Code Securely - This may seem like another no-brainer yet it is still overlooked or not done properly. Investigate what frameworks you are using, how they handle security and whether they are actually secure. As a developer (and not a pen-tester) you should at least use an automated web application scanner (such as Acunetix) to run security tests after each build that is pushed to make sure you haven't introduced any obvious, critical vulnerabilities.
Limit Exposure - Goes somewhat hand-in-hand with my first point. Make sure that services are only exposed to other services that depend on them and nothing else. As a rule of thumb, keep everything entirely closed and open up gradually when strictly required.
My last few points may come off as broad. The intention is to keep a certain philosophy when developing your software and infrastructure, rather than a fixed rule to tick off on a checklist.
There are probably a few things I have missed out. I will update the answer accordingly over time if need be. :-)
Can UUIDs at the database level be used as a security measure instead of true rights control?
Consider a web application where all servlets implement "normal" access control by having a session ID connected to the user calling them (through the web client). All users are therefore authenticated.
The next level of security needed is checking whether an authenticated user actually "owns" the data being changed. In a web application this could, for example, be editing some text in a form. The client (JavaScript) makes sure a user doesn't do something wrong by accident. The issue, of course, is that any number of network tools could easily repeat the call made by the browser and, by only changing the ID, edit a different row in the database table behind the servlet, one the user does not "own".
My question is whether it would be sufficient to use UUIDs as keys in the database table, thereby making it practically impossible to guess a valid ID (https://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates). As far as I know, a similar approach is used in Google Photos (http://www.theverge.com/2015/6/23/8830977/google-photos-security-public-url-privacy-protected), but I'm not sure it is 100% comparable.
Another option is of course to have every servlet verify that the user is only performing an action on their own data, but in a big application with 200+ servlets and 50-100 tables this could be a very cumbersome task where mistakes could easily happen. In my mind this weakens the security far more, but I'm not sure if that is true.
I'm leaning towards the UUID solution, but I'm also curious if there are other obvious approaches to this problem that I ought to consider.
Update:
I should probably have clarified that my plan would be to use UUIDv4, which is supposed to be random. I know that entropy comes into play here with regard to how random the UUIDs actually are, but as far as I have read, Java (which is the selected platform/language) uses SecureRandom, which is supposed to be "cryptographically strong" (link).
And in that case the Wikipedia article states (link):
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.
Using UUIDs in this manner has two major issues:
If there are no additional authentication methods, any attacker could simply guess UUIDs until they find one belonging to someone else. Google Photos doesn't need to worry about this as much, because they only use UUIDs to obfuscate publicly-shared photo views; you still need to authenticate to modify the photos. This is especially dangerous because:
UUIDs are intended to be unique, not random. There are likely to be predictable patterns in your UUIDs that an attacker would be able to observe and take advantage of. In addition, even without a clear pattern, the number of UUIDs an attacker needs to test to find a valid one swiftly decreases as your userbase grows.
I will always recommend using secure, continuously-checked authentication. However, if you have a fairly small userbase, and you are only using this to obfuscate public data access, then using UUIDs in this manner might be alright. Even then, you should be using actual random strings, and not UUIDs.
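If you do go that route, a token built directly from SecureRandom gives you a known amount of entropy and no UUID structure to reason about. A rough sketch (the 32-byte length is just an example):

```java
import java.security.SecureRandom;
import java.util.Base64;

public class TokenSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** 256 bits of entropy, URL-safe, no UUID formatting involved. */
    static String newToken() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(newToken());
    }
}
```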
Another option is of course to have every servlet verify that the user is only performing an action on their own data, but in a big application with 200+ servlets and 50-100 tables this could be a very cumbersome task where mistakes could easily happen. In my mind this weakens the security far more, but I'm not sure if that is true.
With a large legacy application, adding in security later is always a complex task. And you're right: the more complicated an application, the harder it is to verify its security. Complexity is the main enemy of security.
However, this is the best way to go, rather than trying to obscure insecure direct object reference problems.
If you are using these UUIDs in the query string, then this information within URLs may be logged in various locations, including the user's browser, the web server, and any forward or reverse proxy servers between the two endpoints. URLs may also be displayed on-screen, bookmarked, or emailed around by users. They may be disclosed to third parties via the Referer header when any off-site links are followed. Placing direct object references in the URL increases the risk that they will be captured by an attacker. An existing user of the application who later has their access to certain bits of data revoked will still be able to access that data using a previously bookmarked URL (or their browser history). Even where the ID is passed outside of the URL mechanism, a local attacker who knows (or has figured out) how your system works could have purposely saved IDs just for the occasion.
As said in other answers, GUIDs/UUIDs are not meant to be unguessable; they are just meant to be unique. Granted, the Java implementation does actually generate cryptographically secure random numbers. However, what if this implementation changes in future releases, or what if your system is ported elsewhere where this functionality is different? If you're going to do this, you might as well generate your own cryptographically secure random numbers using your own implementation to use as identifiers. If you have 128 bits of entropy in your identifiers, it is completely infeasible for anyone to ever guess them (even if they had all of the world's computing power).
However, for the above reasons I recommend you implement access checks instead.
You are trying to bypass authorisation controls by hoping that the key is unguessable. This is a security no-no. Depending on whom you ask, they may refer to it as an insecure direct object reference or a violation of the complete mediation principle.
As noted by F. Stephen Q, your assumption that UUIDs are unique does not imply that they are not predictable. The threat here is: if a user knows a few UUIDs, say his own, does that allow him to predict other people's UUIDs? This is a very real threat; see Cautionary note: UUIDs generally do not meet security requirements. In particular, note what the UUID RFC says:
Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example.
You can use UUIDs for keys, but you still need to do authorisation checks. When a user wants to access his data, the database should identify the owner of the data, and the server logic needs to enforce that the current user is the same as the database claims the owner is.
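As a sketch of what that enforcement could look like in a servlet (the DAO, its methods, and the session attribute name here are all made up for illustration):

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class EditDocumentServlet extends HttpServlet {

    /** Hypothetical persistence layer, shown as an interface for brevity. */
    interface DocumentDao {
        String findOwnerId(String documentId);          // null if the row doesn't exist
        void updateText(String documentId, String text);
    }

    private DocumentDao documents;                      // wired up elsewhere (DI, init(), ...)

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String currentUserId = (String) req.getSession().getAttribute("userId");
        String documentId = req.getParameter("id");     // UUID or sequential, it doesn't matter

        // The authorisation check: the database says who owns the row,
        // and the server refuses to act unless that is the current user.
        String ownerId = documents.findOwnerId(documentId);
        if (currentUserId == null || !currentUserId.equals(ownerId)) {
            resp.sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }

        documents.updateText(documentId, req.getParameter("text"));
    }
}
```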
I've been researching the idea of using a distributed file system with my dedicated servers instead of going with Amazon S3, and the results are nothing but massive headaches!
My project has the following characteristics/requirements:
User files are stored on dedicated servers. Each file is stored on 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (it only holds IDs/file metadata)
Files/data amount to around 50 TB. Naturally, the data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to set up & maintain
Handles data storage so that I only have to care about removing/adding servers if the need arises (i.e. adding new servers to the file system's server pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50 TB of user files on Amazon is way too expensive compared to using dedicated servers (provided they're good enough).
P.S. My app isn't live yet, so I'm open to suggestions if someone has a good idea that fits my case well.
EDIT: I'm not trying to make an S3 clone; I just need to use existing hosting infrastructure to build a small-scale cloud solution. My question is about finding the right distributed file system to handle/automate this.
We recently switched from an expensive storage solution to the open-source LizardFS for our distributed storage. It is quite simple to set up and scale once you understand the basic concepts.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview, but forget about shadow masters and metaloggers for now. What you need to know is that there are:
A master, which regulates the traffic (make sure it has enough CPU)
Chunkservers, which actually store the data. Use any kind of off-the-shelf hardware with a bunch of hard disks attached.
Clients, which are just simple mount points. So you can get a giant 50 TB mount if you want. The master tells the client where to find/store the files; the actual data is transferred straight between the client and the chunkservers.
You can add as many chunkservers as you want, and the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding hard drives or adding servers. They don't have to be actual bare-metal machines, but that is probably the cheapest.
There are 2 amazing features in LizardFS that allow geo-replication.
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): how important are your files? You can define, at the file or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archival purposes, and define a goal of 4 copies on SSD drives for all new files.
Those same goals can also be used to do geo-replication. You define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will try to fetch files from the closest server.
The ease of setting it up is what sold LizardFS to me. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology is/was, so I spent quite a lot of time researching who actually uses it.
Orange Poland (a large telecom provider) is one of the users.
And Cloudweavers/OpenNebula actually built a business around it, selling complete solutions.
Won't it take more than one person a few months a year to manage these servers? That will cost some money; then you have the cost of hosting the data yourself, and then you have the added huge cost that the business/system you are building is not obviously scalable. In addition, any likely investor will be turned away by a complex, home-grown data-hosting system. How will you ensure integrity/security on par with Amazon? Your maximum savings per year look like $30,000 or so.
You could save money by building a de-duplicated storage system where you only store the unique chunks of data (also see rsync). I don't know how redundant your data is, though.
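To illustrate the idea (not your stack specifically): content-addressed storage splits files into chunks keyed by their hash, so identical chunks are stored only once. A rough sketch, with a plain in-memory map standing in for whatever store you would actually use:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

public class DedupSketch {
    private static final int CHUNK_SIZE = 4 * 1024 * 1024; // 4 MiB chunks

    // In a real system this would be a distributed key-value/object store.
    private final Map<String, byte[]> chunkStore = new HashMap<>();

    /** Splits data into chunks and stores each unique chunk exactly once. */
    String[] store(byte[] data) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        int chunkCount = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        String[] manifest = new String[chunkCount];

        for (int i = 0; i < chunkCount; i++) {
            byte[] chunk = Arrays.copyOfRange(
                    data, i * CHUNK_SIZE, Math.min((i + 1) * CHUNK_SIZE, data.length));
            String key = HexFormat.of().formatHex(sha256.digest(chunk));
            chunkStore.putIfAbsent(key, chunk);  // duplicate chunks cost nothing extra
            manifest[i] = key;                   // the file becomes a list of chunk keys
        }
        return manifest;
    }

    public static void main(String[] args) throws Exception {
        DedupSketch s = new DedupSketch();
        byte[] file = "same content".getBytes(StandardCharsets.UTF_8);
        s.store(file);
        s.store(file);  // a second copy adds no new chunks
        System.out.println("Unique chunks stored: " + s.chunkStore.size()); // 1
    }
}
```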
I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.