I'm working on an instant messaging app, where users can receive files from their friends.
The names of the files received are set by the sender of the file, and multiple files can be sent together with the possibility of subdirectories. For example, two files sent together might be "1" and "sub/2" such that the downloaded results should be like "downloads/1" and "downloads/sub/2".
I'm worried about the security implications of this. Right off the top of my head, two potentially dangerous filenames would be something like "../../../somethingNasty" or "~/somethingNasty" for Unix-like users. Another potential issue that crosses my mind is filenames with characters that are unsupported on the target filesystem, but that seems much harder to handle and may be better to ignore?
I'm considering stripping ".." and "~" from received filenames, but this kind of blacklist approach, where I individually think of problem cases, hardly seems like a recipe for good security. What's the recommended way to sanitize filenames to ensure nothing sinister happens?
If it makes a difference, my app is written in C++ with the Qt framework.
It's wiser to replace ".." with, say, XXX and "~" with, say, YYY. This way you convert any invalid path into a perfectly valid path; i.e., if the user wants to upload "../../../somethingNasty", no problem: let him upload the file and store it in XXX/XXX/XXX/somethingNasty.
Or, even better, you can encode all non-alphanumeric characters (except slashes) as %XY, where XY is the hexadecimal code of the character. This way you would get %2E%2E/%2E%2E/%2E%2E/SomethingNasty.
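To make the encoding approach concrete, here is a minimal sketch, written in Java for illustration (the same idea ports directly to Qt/C++; the class and method names are mine). It percent-encodes every byte outside [A-Za-z0-9] and "/":

    import java.nio.charset.StandardCharsets;

    public class FileNameSanitizer {
        /** Percent-encodes every byte that is not [A-Za-z0-9] or '/'.
            "../../x" becomes "%2E%2E/%2E%2E/x": a harmless pair of
            subdirectory names instead of a traversal. */
        public static String sanitize(String name) {
            StringBuilder out = new StringBuilder();
            for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
                char c = (char) (b & 0xFF);
                boolean safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                        || (c >= '0' && c <= '9') || c == '/';
                if (safe) out.append(c);
                else out.append(String.format("%%%02X", b & 0xFF));
            }
            return out.toString();
        }
    }

Note that slashes are passed through, so you should also strip any leading "/" (to avoid absolute paths) and, as a final defence, canonicalize the resolved target and verify it still lies under your downloads directory before writing.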
In our company, we have to deal with a lot of user uploads, for example images and videos. Now I was wondering: how do you guys "deal with that" in terms of safety? Is it possible for an image to contain malicious content? Of course, there are the "unwanted" pixels, like porn or something. But that's not what I mean now. I mean images which "break" machines while they are being decoded, etc. I already saw this: How can a virus exist in an image.
Basically I was planning to do this:
Create a DMZ
Store the assets in a bucket (we use GCP here) which lives inside the DMZ
Then apply "malicious code"-detection on the file
If it turns out to be fine... then move the asset into the "real" landscape (the non-DMZ)
Now the 3rd part... what can I do here?
Applying a virus scanner
No problem with this; there are a lot of options here. It's a simple approach with a good chance of catching known viruses.
Do mime-type detection
Based on the first few bytes, I do MIME-type detection. For example, if someone sends us an "image.jpg" that is in fact an executable, we would detect this. Right? Is this safe enough? I was thinking about this package.
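To show what byte-level detection looks like, here is a hand-rolled sketch covering just three signatures (a real system should use a maintained detection library rather than a home-grown magic-number list; the names here are mine):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class MagicBytes {
        /** Guesses a type from the file's leading bytes, or returns null. */
        static String sniff(Path file) throws IOException {
            byte[] head = new byte[8];
            int n;
            try (InputStream in = Files.newInputStream(file)) {
                n = in.read(head);
            }
            if (n < 4) return null;
            if ((head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xD8
                    && (head[2] & 0xFF) == 0xFF) return "image/jpeg";      // JPEG: FF D8 FF
            if ((head[0] & 0xFF) == 0x89 && head[1] == 'P'
                    && head[2] == 'N' && head[3] == 'G') return "image/png"; // PNG: 89 'P' 'N' 'G'
            if (head[0] == 'M' && head[1] == 'Z')
                return "application/x-msdownload";                          // Windows executable
            return null;
        }
    }

The declared extension can then be compared against the sniffed type, and the upload rejected on a mismatch.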
What else???
Now... what else can I do? How do other big companies do this? I'm not really looking for answers in terms of orchestration, etc. I know how to use a DMZ, link it all together with a few pubsub topics, etc. I'm purely interested in what techniques to apply to really find out that an incoming asset is "safe".
What I would suggest is not to do this outside the DMZ; let it live within your DMZ, and give it all the regular security controls any other system in your data center has.
Besides the things you have outlined (virus scan, MIME-type detection), I would suggest a few additional checks:
Size limitation - You would not want anyone to bloat out all the space and choke your server (see the sketch after this list).
Throttling - Again, you may want to control the throughput, or at least have the ability to cap it at some maximum value.
Heuristic scan - Perhaps add a heuristics layer to the antivirus, rather than relying on simple signature scans.
File system access control - Make sure the file system access control is tight, so that even if something malicious comes in, it cannot propagate to other folders/paths.
Network control - Make sure all outbound connections are firewalled as well, just in case anything tries to make outward connections.
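For the size limitation, here is a sketch of the streaming check (the 25 MB limit and the names are illustrative policy choices, not from any framework):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class UploadLimits {
        static final long MAX_UPLOAD_BYTES = 25L * 1024 * 1024; // policy choice

        /** Copies the upload, aborting as soon as the limit is exceeded,
            so an oversized body never fully lands on disk. */
        static long copyWithLimit(InputStream in, OutputStream out) throws IOException {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > MAX_UPLOAD_BYTES)
                    throw new IOException("upload exceeds " + MAX_UPLOAD_BYTES + " bytes");
                out.write(buf, 0, n);
            }
            return total;
        }
    }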
"Five things everyone should know about Unicode" is a blog post showing how Unicode characters can be used as an attack vector for websites.
The main example given of such a real-world attack is a fake WhatsApp app submitted to the Google Play store using a Unicode non-printable space in the developer name, which made the name unique and allowed it to get past Google's filters. The Mongolian Vowel Separator (U+180E) is one such non-printable space character.
Another vulnerability is to use alternative Unicode characters that look similar. The Mimic tool shows how this can work.
An example I can think of is protecting usernames when registering a new user: you don't want two usernames to be identical, or even to look the same.
How do you protect against this? Is there a list of these characters out there? Should it be common practice to strip all of these types of characters from all form inputs?
What you are talking about is called a homoglyph attack.
There is a "confusables" list by Unicode here, and also have a look at this. There should be libraries based on these, or potentially other databases. One such library is this one, which you can use in Java or JavaScript. The same must exist for other languages as well, or you can write one.
The important thing, I think, is not to build your own database of confusables: the library or service part is easy to do on top of good data.
As for whether you should filter out similar-looking usernames, I think it depends. If there is an incentive for users to try to impersonate each other's usernames, maybe yes. For many other types of data, maybe there is no point. There is no generic best practice, I think, other than that you should assess the risk in your application, with your own data points.
Also, a different approach for a different problem, but what may often work for Unicode input validation is the \w word-character class in a regular expression, if your regex engine is Unicode-aware. In such an engine, \w should match all Unicode classes of word characters, i.e., letters, modifiers, and connectors in any language, but nothing else (no special characters). This does not protect against homoglyph attacks, but may protect against some injections while keeping your application Unicode-friendly.
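In Java, for instance, this behaviour of \w is switched on with the UNICODE_CHARACTER_CLASS flag; a minimal sketch:

    import java.util.regex.Pattern;

    public class WordCheck {
        // With UNICODE_CHARACTER_CLASS, \w matches letters, digits, marks and
        // connector punctuation in any script, not just [A-Za-z0-9_].
        static final Pattern WORD =
                Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

        static boolean isWordOnly(String s) {
            return WORD.matcher(s).matches();
        }
    }

With this, isWordOnly("élan") is true, while isWordOnly("a;b") is false.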
All sanitization works best when you have a whitelist of known safe values, and exclude all others.
ASCII is one such set of characters.
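As a sketch of that approach in Java (the exact character set and length bounds here are illustrative policy choices, not a recommendation):

    public class AsciiWhitelist {
        /** Accepts only a fixed set of known-safe ASCII characters. */
        static boolean isSafe(String username) {
            return username.matches("[A-Za-z0-9._-]{3,32}");
        }
    }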
This could be approached in various ways; however, each one may increase the number of false positives, annoying legitimate users. Also, none of them will work for 100% of cases (even if combined); they will just add an extra layer.
One approach would be to keep tables of characters that look similar and check whether duplicate names exist. What "looks similar" means is subjective in many cases, so building such a list might be tricky. This method might produce false positives on occasion.
Also, reversing the order of certain letters might trick many users. Checking for anagrams or very similar names can be achieved using algorithms like Jaro-Winkler and Levenshtein distance, i.e., checking whether a similar username/company name already exists (see the sketch below). Sometimes, however, a near-match might be due to a regional spelling of some word (e.g., "centre" vs "center"), or the name of some company might deliberately contain an anagram. This approach might further increase the number of false positives.
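For reference, Levenshtein distance is short to implement; here is a sketch of the standard dynamic-programming formulation (class and method names are mine):

    public class EditDistance {
        /** Minimum number of single-character insertions, deletions and
            substitutions needed to turn a into b. */
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }
    }

For example, levenshtein("paypal", "paypa1") is 1, so a threshold of 1 or 2 would flag the new name as suspiciously close to an existing one.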
Furthermore, as Jonathan mentioned, sanitisation is also a good approach; however, it might not protect against anagrams, and it may cause issues for legitimate users who want to use a special character.
As the OP also mentioned, special characters can be stripped. Other parts of the name might also need to be stripped, for example common tokens like "Inc.", ".com", etc.
Finally, the name can be restricted to contain characters from only one language, rather than a mixture of characters from various scripts (a more relaxed version might forbid mixing scripts within a single word, while allowing it across space-separated words). Requiring a capital first letter and lower case for the rest of the letters can further improve this approach, as certain lower-case letters (like "l") may look like upper-case ones (like "I") in certain fonts. Excluding certain symbols (like "|") enhances this further. This solution will annoy users who will no longer be able to use certain names.
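The single-script restriction can be approximated with Java's Character.UnicodeScript, treating script-neutral characters (digits, punctuation) as belonging to no script; a sketch:

    import java.lang.Character.UnicodeScript;
    import java.util.EnumSet;
    import java.util.Set;

    public class ScriptCheck {
        /** True if the name draws characters from more than one Unicode
            script, ignoring COMMON/INHERITED (digits, punctuation, marks). */
        static boolean isMixedScript(String name) {
            Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
            name.codePoints().forEach(cp -> {
                UnicodeScript s = UnicodeScript.of(cp);
                if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED)
                    scripts.add(s);
            });
            return scripts.size() > 1;
        }
    }

For example, "paypal" with a Cyrillic "а" swapped in mixes LATIN and CYRILLIC and would be rejected, while a purely Latin or purely Cyrillic name passes.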
A combination of some or all of the aforementioned approaches can also be used. The selection of methods, and how exactly they are applied (e.g., you may choose to forbid similar names, to require moderator approval when a name is similar, or to take no action but simply warn a moderator/administrator), depends on the scenario you are trying to solve.
I may have an innovative solution to this problem regarding usernames. Obviously, you want to allow ASCII characters, but in some special cases, other characters will be used (different language, as you said).
I think an intuitive way to allow both ASCII and other characters in a username, while staying protected against "Unicode vulnerabilities", would be something like this:
Allow all ASCII characters and disallow all others, except when there are x or more of these special characters in the username (which suggests the username is in another language); a sketch follows the examples below.
Take these for example:
Whatsapp, Inc + (U+180E) - Not allowed, only has 1 special character.
элч + (U+180E) - Allowed! It has more than x special characters (for example, 3). It can use the Mongolian separator since it's Mongolian.
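A sketch of that rule (the threshold x = 3 is the illustrative value from the examples above; names are mine):

    public class SpecialCharThreshold {
        static final int MIN_NON_ASCII = 3; // the "x" from the rule above

        /** Allows pure-ASCII names, or names with enough non-ASCII
            characters to plausibly be a genuine non-Latin name. */
        static boolean isAllowed(String name) {
            long nonAscii = name.codePoints().filter(cp -> cp > 127).count();
            return nonAscii == 0 || nonAscii >= MIN_NON_ASCII;
        }
    }

"Whatsapp, Inc" plus U+180E has one non-ASCII character and is rejected; "элч" plus U+180E has four and is allowed.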
Obviously, this does not protect you 100% from these types of vulnerabilities, but it is an efficient method I have been using, ESPECIALLY if you do not mention the existence of this algorithm on the "login" or "register" page: if attackers don't know the check exists, they cannot easily reverse engineer it and find a way to bypass it.
Sorry if this is not an answer you are looking for, just sharing my ideas.
Edit: or you could use an RNN (recurrent neural network) to detect the language and allow only characters specific to that language.
I have a client with a website that looks as if it has been hacked. Pages throughout the site will, seemingly at random, automatically redirect to a YouTube video. This happens for a while (not sure how long yet... still trying to figure that out) and then the redirect disappears. That may have something to do with our site caching, though. Regardless, the client isn't happy about it.
I'm searching the code base (this is a WordPress site, but this question was generic enough that I put it here instead of in the WordPress groups...) for "base64_decode", but not having any luck.
So, since I know the specific URL that the site gets forwarded to every time, I thought I'd search for the video ID in the YouTube URL. This method would also be pertinent when the hack-inserted Base64 string is assigned to a variable and then that variable is decoded (so a grep for "base64_decode" wouldn't necessarily turn up anything that looked suspicious).
So, what I'm wondering is if there's a way to search for a substring of a string that has been base64'd and then inserted into the code. Like, take the substring I'm searching for, base64 it, and then search the code base for the resultant string. (Maybe after manipulating it slightly?)
Is there a way to do that? Is that method even valid? I don't really have any idea how the whole base64 algorithm works, or if this is possible, so I thought I'd quickly throw the question out here to see if anyone else did.
Nothing to it (for somebody with the chutzpah to call himself "Programmer Dan").
Well, maybe a little. You have to know the encoding for the values 0 to 63.
In general, encoding to Base64 is done by taking three 8-bit bytes of plain text at a time, breaking those bits into four 6-bit numbers, and creating four characters of encoded text by converting each number (0 to 63) to a character. The encoded characters aren't completely arbitrary: they must be acceptable to pretty much ANY method of transmission, since that's the original reason for using Base64 encoding. The system I usually work with uses {A..Z, a..z, 0..9, +, /}, in that order.
If one wanted to be nasty (which one might expect in the case you're dealing with), one might change the order, or even the characters, during the process. Of course, if you have examples of the encoded Base64, you can see what the character set is (unless the encoding uses more than 64 characters). But you still have the possibility of things like changing the order as you encode or decode (simple rotation, for example). But, I digress. The question is about searching for encoded text, not deciphering deliberate obfuscation. I could tell you a lot about that, too.
Simple methodology:
1. Encode the plain text you're looking for. If the encoding results in one or two equal signs (padding) at the end, eliminate them and the last encoded character that precedes them. Search for the result.
2. Same as (1), except stick one blank on the front of your plain text. Eliminate the first two encoded characters. Search for the result.
3. Same as (2), except with two blanks on the front. This time, eliminate the first three encoded characters (at this offset, three are contaminated by the prefix). Search for the result.
These three searches will find all files containing the encoding of the plain text you're looking for.
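Here is one way to generate the three search strings mechanically; a sketch assuming the standard Base64 alphabet, using java.util.Base64 (class and method names are mine). The dummy prefix bytes can hold any value, because every output character they influence is dropped:

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class Base64Grep {
        /** The three Base64 fragments that match `plain` when it starts
            at byte offset 0, 1 or 2 within an encoded 3-byte group. */
        static String[] fragments(String plain) {
            byte[] p = plain.getBytes(StandardCharsets.UTF_8);
            String[] out = new String[3];
            for (int shift = 0; shift < 3; shift++) {
                byte[] buf = new byte[shift + p.length];  // leading bytes stay 0 (dummies)
                System.arraycopy(p, 0, buf, shift, p.length);
                String enc = Base64.getEncoder().withoutPadding().encodeToString(buf);
                int lead = (shift == 0) ? 0 : shift + 1;  // chars tainted by the dummies
                int tail = (buf.length % 3 == 0) ? 0 : 1; // char sharing bits with the next byte
                out[shift] = enc.substring(lead, enc.length() - tail);
            }
            return out;
        }

        public static void main(String[] args) {
            // Grep the code base for each printed fragment.
            for (String f : fragments("youtube.com/watch"))
                System.out.println(f);
        }
    }

The longer the plain text, the more selective the fragments; very short strings yield fragments of only a character or two and will match almost everything.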
This is all “air code”, meaning off the top of my head, at best. Some might suggest I pulled it out of somewhere else. I can think of three possible problems with this algorithm, excluding any issues of efficiency. But, that’s what you get at this price.
Let me know if you want the working version. Or send me yours. Good luck.
Cplusman
I’m trying to find the least painful way of transporting config files between different environments, and I have found many things that can break the system after the transport. I have a script that keeps the attribute values correct for the ones that are dependent on the environment, but here is a list of a couple of things that I’m not sure about. Maybe someone can shed some light on them.
What I want to do is to simply transport the XML file based on the steps from the book for OpenAM 9 (simply export/import to an XML file using ssoadm), but by analyzing the file in depth I find many differences that might break the system, so any help is appreciated.
In every XML file we have sections for ‘iplanet-am-auth-ldap-bind-passwd’ with a hash value under them, but one XML file is missing that line. I was wondering: if we add that line with the correct hash value, will it break the system, or won’t it matter as long as the hash matches the target environment?
Does the ‘iplanet-am-logging-buffer-size’ value have to match what was originally set up in the target environment, or will it be OK if we overwrite it with the value from the source XML file?
For some reason we have different links in delegation-rules with the same name, for example:
# environment1 - sms://dc=test-domain,dc=net/sunEntitlementService/1.0/application/ws/1/entitlement/entitlements
# environment2 - sms://dc=test-domain,dc=net/sunEntitlementService/1.0/application/ws/1/entitlement/decision
# environment3 - sms://*dc=test-domain,dc=net/sunIdentityRepositoryService/1.0/application/agent
It could be due to the way the server was set up a long time ago, or due to development processes over time (I don’t know), but my question is:
If the rule names are the same but some (or all) options/values differ between environments, and we overwrite them with the source file from a different environment, will this break things, or won’t it matter?
I'm looking at some code that converts user names to lower case, before storing them. I'm 90% sure this is ok, but are there systems out there that actually require case sensitivity on the user names (specifically in the health industry)?
Note: my particular code is not at the point of entry. We are taking user names from other systems. The worry I have is depending on those systems (which may or may not be under our control) to consistently pass us usernames in the same case as each other (when describing the same user).
Also of note - the code is:
userName.toLowerCase(Locale.ENGLISH)
Are all user names in English? Is this just so it matches collation in the database? Note that (in Java, at least) String.toLowerCase() is defined as String.toLowerCase(Locale.getDefault()).
Unix logins are case-sensitive...
Are there any other systems that do this?
toLowerCase has essentially one reason to accept a locale:
In most languages, the small letter i has a dot, so the capital letter I is transformed to an i with a dot.
But Turkish also has a capital letter İ with a dot above; this is transformed to a small (dotted) letter i.
The "regular" Turkish capital I is transformed to a small ı, without a dot.
So, unless your Turkish usernames are all called IiI1I1iiII, I would hardly worry about this.
Every language other than Turkish (and Azerbaijani, which shares the dotless i) has an identical toLowerCase implementation, so you could choose Locale.ENGLISH or Locale.GERMAN or whatever; just make sure you do not pick Turkish.
See the Javadoc for more detailed information.
Edit: thanks to Utku Karatas I could copy/paste the correct glyphs into this post.
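A quick demonstration of the Turkish special case (expected output in the comments):

    import java.util.Locale;

    public class TurkishI {
        public static void main(String[] args) {
            Locale tr = Locale.forLanguageTag("tr");
            System.out.println("INBOX".toLowerCase(Locale.ENGLISH)); // inbox
            System.out.println("INBOX".toLowerCase(tr));             // ınbox  (dotless ı)
            System.out.println("İZMİR".toLowerCase(tr));             // izmir  (dotted İ -> i)
        }
    }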
Using case-sensitive usernames/passwords is an easy way to increase security, so the question is how much you care about security vs. usability. Just keep in mind that the way you're looking at solving the case insensitivity may have some localization problems; if you don't care about those, then don't worry about it.
Lowercasing the user name using the English locale is bound to cause you problems. I would suggest lowercasing using the invariant culture.
It depends on context, but in the Informix dialect of SQL, there are 'owners' (basically equivalent to a schema in standard SQL), and how you write the owner name matters.
SELECT *
  FROM someone.sometable, "someone".sometable,
       SOMEONE.sometable, "SOMEONE".sometable;
The two quoted names are definitely different; the two unquoted names are mapped to the same name, which (depending on the database mode) could be either of the other two. There is some code around that does case conversion on the (unquoted) names. Fortunately, most of the time you don't need to specify the owner name, and when you do, you either write it without quotes and it all works, or you write it with quotes, stay consistent, and it all works. Occasionally, though, people like me have to really understand the details to get programs to work sanely despite all the hoops.
Also, (as Stephen noted) Unix logins are case-sensitive, and always have been. I believe Windows logins are mostly case-insensitive, but I don't experiment with that (there are too many ways to get screwed up on Windows without adding that sort of trickery to the game).
If you really want to confuse someone on Unix, give them a numeric user name (e.g. 123) but give them a different UID (e.g. 234).
Kerberos, which can be used in Windows environments too, has case sensitivity problems. You can configure it in a certain way to ensure that case sensitivity issues will not arise, but it can go the other way too.
If your only goal is differentiating one user from another, it seems logical that you would want more than case to be a factor.
I have never encountered a system that enforced case-sensitivity on usernames (nor would I want to).
Most likely the code forces them lowercase at the point of entry as an attempt to prevent case-sensitivity problems later.