Open-source production data for developers?

Open-source production data for developers? - security

I'm building a website that will be an open-source, user-contributed content kind of thing, and I think if developers had access to nightly production SQL dumps, they'd be more likely to check out the code from github and play with it.
In line with that idea, I'm considering either:
Not collecting private user information at all, using open-id for accounts and making heavy use of memcache for things like session authentication.
Anonymizing sensitive data before publishing
Sometimes I get carried away with "wouldn't it be cool if...?" ideas, so I'm hoping for a sanity check here. Any obvious flaws in either approach? Is this a sane idea?

Speaking generally, I think you should do both. Any private data you collect is simply a liability for you, and not just because you intend to publish your databases. The less you can collect, the better.
By the same token, however, you probably realize that it is not just IDs and passwords which are sensitive. Remember the AOL search data leak? Or the Netflix database publication? Even without having IDs, people managed to figure out the real identities of some of the accounts, simply by piecing together trails of user behavior, and corresponding that with data from other places. Some people are embarrassed by their search histories and their movie rentals. Go figure.
Therefore, I think the general rule should be to collect as little as possible, and anonymize what is left. Even if you don't store the identity of the person corresponding to a certain account, you may want to scramble what the various logins did.
On the other hand, there some cases where you simply don't care about this kind of privacy. In Wikipedia, for example, pretty much everything you can do on the site is public anyway. At least, everything which gets recorded in the database. If the information is already available through the API, there is no point in hiding it in a database download.

In addition to collecting less data and anonymizing the data you do collect, you could add a bit/flag for the users to select whether their data is included or not. You could make it a CC license flag to give users the warm'n'fuzzies while filling your need.

Sounds like a pretty good idea. The one thing you have to be careful with though is security, since hackers will know the exact schema of your DB. Although this isn't impossible to deal with, just look at most open source projects. But you will need to put a little extra emphasis on security since say a potential SQL injection is now made much easier.
Another thing is to make sure doubly that the sensitive data is anonymized. Also, some people may (wrongly) try and claim their copyrights on user submitted content is being violated, so you may want to specify a CC license or something just to make everything extra clear and prevent future headaches (even if you're right anyway).

Related

Does Microsoft have a recommended way to handle secrets in headers in HttpClient?

Very closely related: How to protect strings without SecureString?
Also closely related: When would I need a SecureString in .NET?
Extremely closely related (OP there is trying to achieve something very similar): C# & WPF - Using SecureString for a client-side HTTP API password
The .NET Framework has class called SecureString. However, even Microsoft no longer recommends its use for new development. According to the first linked Q&A, at least one reason for that is that the string will be in memory in plaintext anyway for at least some amount of time (even if it's a very short amount of time). At least one answer also extended the argument that, if they have access to the server's memory anyway, in practice security's probably shot anyway, so it won't help you. (The second linked Q&A implies that there was even discussion of dropping this from .NET Core entirely).
That being said, Microsoft's documentation on SecureString does not recommend a replacement, and the consensus on the linked Q&A seems to be that that kind of a measure wouldn't be all that useful anyway.
My application, which is an ASP.NET Core application, makes extensive use of API Calls to an external vendor using the HttpClient class. The generally-recommended best practice for HttpClient is to use a single instance rather than creating a new instance for each call.
However, our vendor requires that all API Calls include our API Key as a header with a specific name. I currently store the key securely, retrieve it in Startup.cs, and add it to our HttpClient instance's headers.
Unfortunately, this means that my API Key will be kept in plaintext in memory for the entire lifecycle of the application. I find this especially troubling for a web application on a server; even though the server is maintained by corporate IT, I've always been taught to treat even corporate networks as semi-hostile environments and not to rely purely on corporate firewalls for application security in such cases.
Does Microsoft have a recommended best practice for cases like this? Is this a potential exception to their recommendation against using SecureString? (Exactly how that would work is a separate question). Or is the answer on the other Q&A really correct in saying that I shouldn't be worried about plaintext strings living in memory like this?
Note: Depending on responses to this question, I may post a follow-up question about whether it's even possible to use something like SecureString as part of HttpClient headers. Or would I have to do something tricky like populate the header right before using it and then remove it from memory right afterwards? (That would create an absolute nightmare for concurrent calls though). If people think that I should do something like this, I would be glad to create a new question for that.

You are being WAY too paranoid.
Firstly, if a hacker gets root access to your web server, you have WAY bigger problems than your super-secret web app credentials being stolen. Way, way, way bigger problems. Once the hackers are on your side of the airtight hatchway, it is game over.
Secondly, once your infosec team detects the intrusion (if they don't, again, you've got WAY bigger problems) they're going to tell you and the first thing you're going to do is change every key and password you know of.
Thirdly, if a hacker does get root access to your webserver, their first thought isn't going to be "let's take a memory dump for later analysis". A dumpfile is rather large (will take time to transfer over the wire, and the network traffic might well be noticed) and (at least on Windows) hangs the process until it's complete (so you'd notice your web app was unresponsive) - both of which are likely to raise some red flags.
No, hackers are there to grab as much valuable information in the least amount of time, because they know their access could be discovered at any second. So they're going to go for the low-hanging fruit first - usernames and passwords. Then they'll move on to trying to find out what's connected to that server, and since your DB credentials are likely in a config file on that server, they will almost certainly switch their attentions to that far more interesting target.
So all things considered, your API key is pretty darn unlikely to be compromised - and even if it is, it won't be because of something you did or didn't do. There are far more productive ways of focusing your time than trying to secure something that already is (or should be) incredibly secure. And, at the end of the day, no matter how many layers of security you put in place... that API or SSL key is going to be raw, in memory, at some stage.

UUID on database level used as a security measure instead of a true rights control?

Can UUID on database level be used as a security measure instead of a true rights control?
Consider a web application where all servlets implements "normal" access control by having a session id connected to the user calling it (through the web client). All users are therefore authenticated.
The next level of security needed is if a authenticated user actually "owns" the data being changed. In a web application this could for example be editing some text in a form. The client makes sure a user, by accident, doesn’t do something wrong (JavaScript). The issue is of course is that any number of network tools could easily repeat the call made by the browser and, by only changing the ID, edit a different row in the database table behind the servlet that the user does not "own".
My question is if it would be sufficient to use UUID's as keys in the database table and thereby making it practically impossible to guess a valid ID (https://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates)? As far as I know similar approaches is used in Google Photos (http://www.theverge.com/2015/6/23/8830977/google-photos-security-public-url-privacy-protected) but I'm not sure it is 100% comparable.
Another option is off cause to have every servlet verify that the user is only performing an action on its own data, but in a big application with 200+ servlets and 50-100 tables this could be a very cumbersome task where mistakes could easily happen. In my mind this weakens the security far more, but I'm not sure if that is true.
I'm leaning towards the UUID solution, but I'm also curious if there are other obvious approaches to this problem that I ought to consider.
Update:
I should probably have clarified that my plan would be to use UUIDv4 which is supposed to be random. I know that entropy comes in to play here in regards to how random the UUID's actually are, but as far as I have read then Java (which is the selected platform/language) uses SecureRandom which is supposed to be "cryptographically strong" (link).
And in that case wiki states (link):
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.

Using UUIDs in this manner has two major issues:
If there are no additional authentication methods, any attacker could simply guess UUIDs until they find one belonging to someone else. Google Photos doesn't need to worry about this as much, because they only use UUIDs to obfuscate publicly-shared photo views; you still need to authenticate to modify the photos. This is especially dangerous because:
UUIDs are intended to be unique, not random. There are likely to be predictable patterns in your UUIDs that an attacker would be able to observe and take advantage of. In addition, even without a clear pattern, the number of UUIDs an attacker needs to test to find a valid one swiftly decreases as your userbase grows.
I will always recommend using secure, continuously-checked authentication. However, if you have a fairly small userbase, and you are only using this to obfuscate public data access, then using UUIDs in this manner might be alright. Even then, you should be using actual random strings, and not UUIDs.

Another option is off cause to have every servlet verify that the user
is only performing an action on its own data, but in a big application
with 200+ servlets and 50-100 tables this could be a very cumbersome
task where mistakes could easily happen. In my mind this weakens the
security far more, but I'm not sure if that is true.
With a large legacy application adding in security later is always a complex task. And you're right - the more complicated an application, the harder it is to verify security. Complexity is the main enemy of security.
However, this is the best way to go rather than by trying to obscure insecure direct object reference problems.
If you are using these UUIDs in the query string then this information within URLs may be logged in various locations, including the user's browser, the web server, and any forward or reverse proxy servers between the two endpoints. URLs may also be displayed on-screen, bookmarked or emailed around by users. They may be disclosed to third parties via the Referer header when any off-site links are followed. Placing direct object references into the URL increases the risk that they will be captured by an attacker. An existing user of the application that then has their access revoked to certain bits of data - they will still be able to access this data by using a previously bookmarked URL (or by using their browser history). Even where the ID is passed outside of the URL mechanism, a local attacker that knows (or has figured out) how your system works could have purposely saved IDs just for the occasion.
As said by other answers, GUIDs/UUIDs are not meant to be unguessable, they are just meant to be unique. Granted, the Java implementation does actually generate cryptographically secure random numbers. However, what if this implementation changes in future releases, or what if your system is ported elsewhere where this functionality is different? If you're going to do this, you might as well generate your own cryptographically secure random numbers using your own implementation to use as identifiers. If you have 128bits of entropy in your identifiers, it is completely infeasible for anyone ever to guess them (even if they had all of the world's computing power).
However, for the above reasons I recommend you implement access checks instead.

You are trying to bypass authorisation controls by hoping that the key is unguessable. This is a security no-no. Depending on whom you ask, they may refer to it as an insecure direct object reference or a violation of the complete mediation principle.
As noted by F. Stephen Q, your assumption that UUIDs are unique does not imply that they are not predictable. The threat here is that if a user knows a few UUIDs, say his own, does that allow him to predict other peoples' UUIDs? This is a very real threat, see: Cautionary note: UUIDs generally do not meet security requirements. Especially note what the UUID RFC says:
Do not assume that UUIDs are hard to guess; they should not be used as
security capabilities (identifiers whose mere possession grants
access), for example.
You can use UUIDs for keys, but you still need to do authorisation checks. When a user wants to access his data, the database should identify the owner of the data, and the server logic needs to enforce that the current user is the same as the database claims the owner is.

Corporate Espionage of Website Source Code

This may not be the most technical question, but I was just interested, nonetheless...
How does a giant company like Google keep from having their code stolen by employees? Maybe I'm wrong, but I would assume that their source code to their search algorithms (amongst other things) would be valuable to their competitors (i.e. Microsoft).
I guess I can best phrase it like this:
What's keeping an unscrupulous
employee who has sufficient clearance from
accessing Google's code repository for
a specific project and copying significant amounts of code
to a flash drive and taking it to their
competitors?

Fear of being sued?
Things within a company like Google are also compartmentalized. So not everybody has access to all code. If someone has access to code, you can bet that Google knows when they access it. I'm sure they have some kind of algorithm that looks and sees if somebody just downloads a lot of files very fast. The search algorithm isn't a small file obviously, it is a gigantic application.
All this would allow them to track who has stolen the code from within. There is also the fact that any self-respecting company or company with something to lose (i.e. Microsoft) would not take anything like this from somebody. They would probably even tell Google about it.

It is called protocol. The idea that only a few people get to know the code. In which then those few have to tell a major very embarrassing secret to the others. So then nobody can tell or else they get outed in the public. Which can be very simple like they like something, compared to as bashful as they are all the way to they killed somebody.

Many employers, including one that I've worked for, completely block flash drives.
In many cases, though, this is to protect non-technical confidential information.

Companies that are serious about protecting their assets will have access logging on their core systems and active scanning to detect suspicious patterns. Similar security is implemented for employees of government agencies (e.g. tax, social security) holding sensitive personal information. Users who access data outside of their assigned cases can be flagged and investigated.
I suspect (but don't know) that similar scanning could be implemented in high value source code repositories.
Some organizations block the use of removable media (It has been reported that some agencies have reacted to Wikileaks with such policies), in some cases by physically gluing up the USB/media ports. This restricts potential thiefs to network transfers of material which can be scanned.

I think companies such as Google will implement access control on their source code repository / version control system. So their employee would only be able to access source code in which they were involved. And their access could be revoked from previous repository if they're being assigned to different project. Its the same thing with normal internal documents, would a security-conscious company let documents be downloaded by any employee freely ?

I think codethis hit the nail on the head. Some fly-by-night operation may be interested, but Microsoft, Yahoo, etc - wouldn't touch stolen code with a ten foot pole. And the fly-by-night wouldn't have the infrastructure. If you didn't tell anybody it was stolen - it's not like you could get away with walking in to a company with an entire spider/searching algorithm on your thumbdrive and declare you wrote it last week.
The bigger threat is details of the search algorithm getting out. SEOers, as a whole, are rather shady - and many would kill for solid facts about how the algorithm ranked or downranked pages. Even then, Google has demonstrated the ability to change their ranking algorithms so quickly that it wouldn't much matter.
On the other hand, Google doesn't have that much super-secret code. Most of their cool stuff (MapReduce et.al) is publicly available (see Hadoop). This question is probably more applicable to a company like Adobe. Some of their Photoshop algorithms are really cool, and would probably hurt them if they got out - but again, no legit company would touch it.

Paranoid attitude: What's your degree about web security concerns?

this question can be associated to a subjective question, but this is not a really one.
When you develop a website, there is several points you must know: XSS attacks, SQL injection, etc.
It can be very very difficult (and take a long time to code) to secure all potential attacks.
I always try to secure my application but I don't know when to stop.
Let's take the same example: a social networking like Facebook. (Because a bank website must secure all its datas.)
I see some approaches:
Do not secure XSS, SQL injection... This can be really done when you trust your user: back end for a private enterprise. But do you secure this type of application?
Secure attacks only when user try to access non owned datas: This is for me the best approach.
Secure all, all, all: You secure all datas (owner or not): the user can't break its own datas and other user datas: this is very long to do and is it very useful?
Secure common attacks but don't secure very hard attacks (because it's too long to code comparing to the chance of being hacked).
Well, I don't know really what to do... For me, I try to do 1, 2, 4 but I don't know if it's the great choice.
Is there an acceptable risk to not secure all your datas? May I secure all datas but it takes me double time to code a thing? What's the enterprise approach between risk and "time is money"?
Thank you to share this because I think a lot of developers don't know what is the good limit.
EDIT: I see a lot of replies talking about XSS and SQL injection, but this is not the only things to take care about.
Let's take a forum. A thread can be write in a forum where we are moderator. So when you send data to client view, you add or remove the "add" button for this forum. But when a user tries to save a thread in server side, you must check that user has the right to dot it (you can't trust on client view security).
This is a very simple example, but in some of my apps, I've got a hierarchy of rights which can be very very difficult to check (need a lot of SQL queries...) but in other hand, it's really hard to find the hack (datas are pseudo encrypted in client view, there is a lot of datas to modify to make the hack runs, and the hacker needs a good understanding of my app rules to do a hack): in this case, may I check only surface security holes (really easy hack) or may I check very hard security holes (but it will decrease my performances for all users, and takes me a long time to develop).
The second question is: Can we "trust" (to not develop a hard and long code which decreases performance) on client view for very hard hack?
Here is another post talking of this sort of hack: (hibernate and collection checking) Security question: how to secure Hibernate collections coming back from client to server?

I think you should try and secure everything you can, the time spent doing this is nothing compared to the time needed to fix the mess done by someone exploiting a vulnerability you left somewhere.
Most things anyway are quite easy to fix:
sql injections have really nothing to do with sql, it's just string manipulation, so if you don't feel comfortable with that, just use prepared statements with bound parameters and forget about the problem
cross site exploit are easily negated by escaping (with htmlentities or so) every untrusted data before sending it out as output -- of course this should be coupled with extensive data filtering, but it's a good start
credentials theft: never store data which could provide a permanent access to protected areas -- instead save a hashed version of the username in the cookies and set a time limit to the sessions: this way an attacker who might happen to steal this data will have a limited access instead of permanent
never suppose that just because a user is logged in then he can be trusted -- apply security rules to everybody
treat everything you get from outside as potentially dangerous: even a trusted site you get data from might be compromised, and you don't want to fall down too -- even your own database could be compromised (especially if you're on a shared environment) so don't trust its data either
Of course there is more, like session hijacking attacks, but those are the first things you should look at.
EDIT regarding your edit:
I'm not sure I fully understand your examples, especially what you mean by "trust on client security". Of course all pages with restricted access must start with a check to see if the user has rights to see the content and optionally if he (or she) has the correct level of privilege: there can be some actions available to all users, and some others only available to a more restricted group (like moderators in a forum). All this controls have to be done on the server side, because you can never trust what the client sends you, being it data through GET, POST and even COOKIES. None of these are optional.

"Breaking data" is not something that should ever be possible, by the authorized user or anybody else. I'd file this under "validation and sanitation of user input", and it's something you must always do. If there's just the possibility of a user "breaking your data", it'll happen sooner or later, so you need to validate any and all input into your app. Escaping SQL queries goes into this category as well, which is both a security and data sanitation concern.
The general security in your app should be sound regardless. If you have a user management system, it should do its job properly. I.e. users that aren't supposed to access something should not be able to access it.
The other problem, straight up XSS attacks, has not much to do with "breaking data" but with unauthorized access to data. This is something that depends on the application, but if you're aware of how XSS attacks work and how you can avoid them, is there any reason not to?
In summary:
SQL injection, input validation and sanitation go hand in hand and are a must anyway
XSS attacks can be avoided by best-practices and a bit of consciousness, you shouldn't need to do much extra work for it
anything beyond that, like "pro-active" brute force attack filters or such things, that do cause additional work, depend on the application
Simply making it a habit to stick to best practices goes a long way in making a secure app, and why wouldn't you? :)
You need to see web apps as the server-client architecture they are. The client can ask a question, the server gives answers. The question is just a URL, sometimes with a bit of attached POST data.
Can I have /forum/view_thread/12345/ please?
Can I POST this $data to /forum/new_thread/ please?
Can I have /forum/admin/delete_all_users/ please?
Your security can't rely only on the client not asking the right question. Never.
The server always needs to evaluate the question and answer No when necessary.

All applications should have some degree of security. You generally don't ask for SSL on intranet websites, but you need to take care of SQL/XSS attacks.
All data your user enters into your application should be under your responsibility. You must make sure nobody unauthorized get access to it. Sometimes, a "not critical" information can pose a very security problem, because we're all lazy people.
Some time ago, a friend used to run a games website. Users create their profiles, forum , all that stuff. Then, some day, someone found a SQL injection open door somewhere. That attacker get all user and password information.
Not a big deal, huh? I mean, who cares about a player account into a website? But most users used same user/password to MSN, Counter Strike, etc. So become a big problem very fast.
Bottom line is: all applications should have some security concern. You should take a look into STRIDE to understand your attack vectors and take best action.

I personally prefer to secure everything at all times. It might be a paranoid approach, but when I see tons of websites throughout internet, that are vulnerable to SQL injection or even much simpler attacks, and they are not bothered to fix it until someone "hacks" them and steal their precious data, it makes me pretty much afraid. I don't really want to be the one responsible for leaked passwords or other user info.
Just ask someone with hacking experiences to check your application / website. It should give you a fair idea what's wrong and what should be updated.
You want to have strong API side ACL. Some days ago I saw a problem where a guy had secured every single UI, but the website was vulnerable through AJAX, just because his API (where he was sending requests) just trusted every single request to be checked. I could basically pull whole database through this bug.

I think it's helpful to distinguish between preventing code injection and plain data authorization.
In my opinion, all it takes is a few good coding habits to completely eliminate SQL injection. There is simply no excuse for it.
XSS injection is a little bit different - i think it can always be prevented, but it may not be trivial if your application features user generated content. By that I simply mean that it may not be as trivial to secure your app against XSS as it is compared to SQL injection. So I do not mean that it is ok to allow XSS - I still think there is no excuse for allowing it, it's just harder to prevent than SQL injection if your app revolves around user generated content.
So SQL injection and XSS are purely technical matters - the next level is authorization: how thoroughly should one shield of access to data that is no business of the current user. Here I think it really does depend on the application, and I can imagine that it makes sense to distinguish between: "user X may not see anything of user Y" vs "Not bothering user X with data of user Y would improve usability and make the application more convenient to use".

saving passwords inside your application code

I have a doubt concerning how to store a password for usage in my application. I need to encrypt/decrypt data on the fly, so the password will need to be somewhere. Options would be to have it hard-coded in my app or load it from a file.
I want to encrypt a license file for an application and one of the security steps involves the app being able to decrypt the license (other steps follow after). The password is never know to the user and only to me as e really doesn't need it!
What I am concerned is with hackers going through my code and retrieving the password that I have stored there and use it to hack the license breaking the first security barrier.
At this point I am not considering code obfuscation (eventually I will), so this is an issue.
I know that any solution that stores passwords is a security hazard but there's no way around it!
I considered assembling the password from multiple pieces before really needing it, but at some point the password is complete so a debugger and a well place breakpoint is all that is needed.
What approaches do you guys(and galls), use when you need to store your passwords hard-coded in your app?
Cheers

My personal opinion is the same as GregS above: it is a waste of time. The application will be pirated, no matter how much you try to prevent it. However...
Your best bet is to cut down on casual-piracy.
Consider that you have two classes of users. The normal user and the pirate. The pirate will go to great lengths to crack your application. The normal user just wants to use your application to get something done. You can't do anything about the pirate.
A normal user isn't going to know anything about cracking code ("uh...what's a hex editor?"). If it is easier for this type of person to buy the application than it is to pirate it, then they are more likely to buy it.
It looks like the solutions you have already considered will be effective against the normal user. And that's about all that you can do.

Decide now how much time/effort you want to spend on preventing piracy. If someone is determined, they're probably going to get your application to work anyway.

I know you don't want to hear it, but it's a waste of time, and if your app needs a hardcoded password then that is a flaw.

I don't know that there is any approach to solving this problem that would deter a hacker in any meaningful way. Keeping the secret a secret is one of cryptography's great problems.

An approach I have done in the past was to generate an unique ID during the install, it would get the HDD and MCU's SN and use it in a complex structure, then the user will send this number for our automated system and we reply back with another block of that, the app will now decrypt and compare this data on the fly during the use.
Yes I works but it still have the harded password, we have some layers for protection (ie. there are some techniques that prevents a mid-level hacker to understand our security system).
I would just recommend you to do a very complex system and try to hack it on your own, see if disassembly can lead to an easy path. Add some random calls to random subroutines, make it very alleatory, try to fake the use of registry keys and global variables, turn the hacker life in a hell so he will eventually give up.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string