I have a client with a website that looks as if it has been hacked. Pages throughout the site will, seemingly at random, automatically redirect to a YouTube video. This happens for a while (I'm not sure how long yet; still trying to figure that out) and then the redirect disappears. That may have something to do with our site caching, though. Regardless, the client isn't happy about it.
I'm searching the code base (this is a WordPress site, but this question was generic enough that I put it here instead of in the WordPress groups...) for "base64_decode" but not having any luck.
So, since I know the specific URL that the site is getting forwarded to every time, I thought I'd search for the video id that is in the YouTube URL. This method would also be pertinent when the hack-inserted Base64'd string is assigned to a variable and then that variable is decoded (so a grep for "base64_decode" wouldn't necessarily turn up anything that looked suspicious).
So, what I'm wondering is if there's a way to search for a substring of a string that has been base64'd and then inserted into the code. Like, take the substring I'm searching for, base64 it, and then search the code base for the resultant string. (Maybe after manipulating it slightly?)
Is there a way to do that? Is that method even valid? I don't really have any idea how the whole base64 algorithm works, or if this is possible, so I thought I'd quickly throw the question out here to see if anyone else did.
Nothing to it (for somebody with the chutzpah to call himself "Programmer Dan").
Well, maybe a little. You have to know the encoding for the values 0 to 63.
In general, encoding to Base64 is done by taking three 8-bit characters of plain text at a time, breaking those bits into four sets of 6-bit numbers, and creating four characters of encoded text by converting the numbers (0 to 63) to arbitrary characters. Actually, the encoded characters aren't completely arbitrary, as they must be acceptable to pretty much ANY method of transmission, since that's the original reason for using Base64 encoding. The system I usually work with uses {A..Z,a..z,0..9,+,/} in that order.
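To make that bit-splitting concrete, here is a tiny sketch in Python (using the standard alphabet; real code would just call base64.b64encode, this only illustrates the 3-bytes-to-4-characters step):

```python
# Illustration of Base64's 3-bytes-to-4-characters step,
# using the standard alphabet {A..Z, a..z, 0..9, +, /}.
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

def encode_group(three_bytes: bytes) -> str:
    # Pack 3 x 8 bits into one 24-bit integer, then peel off 4 x 6-bit values.
    n = int.from_bytes(three_bytes, "big")
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))

print(encode_group(b"Man"))   # "TWFu", the same as base64.b64encode(b"Man")
```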
If one wanted to be nasty (which one might expect in the case you're dealing with), one might change the order, or even the characters, during the process. Of course, if you have examples of the encoded Base64, you can see what the character set is (unless the encoding uses more than 64 characters). But you still have the possibility of things like changing the order as you encode or decode (simple rotation, for example). But, I digress. The question is about searching for encoded text, not deciphering deliberate obfuscation. I could tell you a lot about that, too.
Simple methodology:
1. Encode the plain text you're looking for. If the encoding results in one or two equal signs (padding) at the end, eliminate them and the last encoded character that precedes them. Search for the result.
2. Same as (1), except stick a blank on the front of your plain text. Eliminate the first two encoded characters. Search for the result.
3. Same as (2), except with two blanks on the front. This time, eliminate the first three encoded characters (the third still shares bits with the unknown byte that precedes your text). Search for the result.
These three searches will find all files containing the encoding of the plain text you're looking for.
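Here is how those three searches might look in Python, as a rough sketch only: it assumes the standard alphabet, uses a placeholder video id, and the glob pattern is just a guess at the code base layout.

```python
import base64
import glob

def base64_needles(needle: bytes):
    """One search pattern per possible 3-byte alignment of `needle`
    inside a larger Base64-encoded plain text."""
    patterns = []
    # (filler bytes to prepend, leading encoded characters to discard)
    for pad, drop in ((0, 0), (1, 2), (2, 3)):
        enc = base64.b64encode(b" " * pad + needle).decode()
        enc = enc.rstrip("=")                # '=' padding never appears mid-stream
        if (pad + len(needle)) % 3:          # last char shares bits with unknown bytes
            enc = enc[:-1]
        patterns.append(enc[drop:])
    return patterns

# "VIDEO_ID_HERE" is a placeholder; substitute the id from the YouTube URL.
needles = base64_needles(b"VIDEO_ID_HERE")

for path in glob.glob("**/*.php", recursive=True):
    text = open(path, encoding="utf-8", errors="ignore").read()
    for pattern in needles:
        if pattern in text:
            print(f"{path}: contains encoded fragment {pattern!r}")
```

Each hit still deserves a manual look, since a short needle can match by coincidence.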
This is all “air code”, meaning off the top of my head, at best. Some might suggest I pulled it out of somewhere else. I can think of three possible problems with this algorithm, excluding any issues of efficiency. But, that’s what you get at this price.
Let me know if you want the working version. Or send me yours. Good luck.
Cplusman
"Five things everyone should know about Unicode" is a blog post showing how Unicode characters can be used as an attack vector for websites.
The main example given of such a real-world attack is a fake WhatsApp app submitted to the Google Play store using a Unicode non-printable space in the developer name, which made the name unique and allowed it to get past Google's filters. The Mongolian Vowel Separator (U+180E) is one such non-printable space character.
Another vulnerability is to use alternative Unicode characters that look similar. The Mimic tool shows how this can work.
An example I can think of is to protect usernames when registering a new user. You don't want two usernames to be the same or for them to look the same either.
How do you protect against this? Is there a list of these characters out there? Should it be common practice to strip all of these types of characters from all form inputs?
What you are talking about is called a homoglyph attack.
There is a "confusables" list by Unicode here, and also have a look at this. There should be libraries based on these or potentially other databases. One such library is this one that you can use in Java or JavaScript. The same must exist for other languages as well, or you can write one.
The important thing, I think, is not to build your own database; the library or service part is easy to do on top of good data.
As for whether you should filter out similar looking usernames - I think it depends. If there is an interest for users to try and fake each other's usernames, maybe yes. For many other types of data, maybe there is no point in doing so. There is no generic best practice I think other than you should assess the risk in your application, with your datapoints.
Also, a different approach for a different problem, but what may often work for Unicode input validation is the \w word-character class in a regular expression, if your regex engine is Unicode-ready. In such an engine, \w should match all Unicode classes of word characters, i.e. letters, modifiers and connectors in any language, but nothing else (no special characters). This does not protect against homoglyph attacks, but may protect against some injections while keeping your application Unicode-friendly.
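A minimal sketch of that idea in Python, whose built-in re module treats \w as Unicode-aware for str patterns (the third-party regex module follows the full Unicode definition, including \p{...} classes, more closely):

```python
import re

# Accept names made only of Unicode word characters (letters and digits
# from any script, plus underscore); reject anything containing other symbols.
WORD_ONLY = re.compile(r"^\w+$")

for name in ["alice", "Алиса", "渡辺", "ali ce", "al<i>ce", "o'brien"]:
    print(name, "->", "ok" if WORD_ONLY.match(name) else "rejected")
```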
All sanitization works best when you have a whitelist of known safe values, and exclude all others.
ASCII is one such set of characters.
This could be approached in various ways, but each one might increase the number of false positives and annoy legitimate users. Also, none of them will work for 100% of cases (even if combined); they just add an extra layer.
One approach would be to have tables of characters that look similar and check whether duplicate names exist. What 'looks similar' is subjective in many cases, so building such a list might be tricky. This method might produce false positives on occasion.
Also, reversing the order of certain letters might trick many users. Checking for anagrams or very similar names can be achieved using algorithms like Jaro-Winkler and Levenshtein distance (i.e., checking whether a similar username/company name already exists). Sometimes, however, the similarity might be due to a regional spelling of a word (e.g., 'centre' vs 'center'), or the name of some company might deliberately contain an anagram. This approach might further increase the number of false positives.
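As an illustration of that kind of similarity check, here is a small sketch using the ratio from Python's standard difflib as a stand-in; dedicated libraries (e.g. jellyfish) provide the actual Jaro-Winkler and Levenshtein metrics, and the threshold here is an arbitrary choice.

```python
from difflib import SequenceMatcher

def too_similar(candidate: str, existing: str, threshold: float = 0.85) -> bool:
    # Normalize case so 'Centre' vs 'centre' compares as equal.
    ratio = SequenceMatcher(None, candidate.lower(), existing.lower()).ratio()
    return ratio >= threshold

existing_names = ["centre-point", "acme-inc", "jonathan"]
for candidate in ["center-point", "acem-inc", "zoe"]:
    clashes = [name for name in existing_names if too_similar(candidate, name)]
    print(candidate, "->", clashes or "no close match")
```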
Furthermore, as Jonathan mentioned, sanitisation is also a good approach; however, it might not protect against anagrams, and it may cause issues for legitimate users who want to use a particular special character.
As the OP also mentioned, special characters can be stripped. Other parts of the name might also need to be stripped, for example common suffixes like 'Inc.', '.com', etc.
Finally, the name can be restricted to contain characters from only one language rather than a mixture of characters from various languages (a more relaxed version might only forbid mixing characters within the same word, while allowing it across words separated by spaces). Requiring a capital first letter and lower case for the rest of the letters can further improve this approach, as certain lower-case letters (like 'l') may look like upper-case ones (like 'I') in certain fonts. Excluding certain symbols (like '|') enhances this further. This solution will annoy some users, who will not be able to use certain names.
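A rough sketch of that single-script restriction, approximating the script from Unicode character names (a real implementation would use proper script properties, for example the regex module's \p{Script=...} classes):

```python
import unicodedata

def scripts_used(name: str) -> set:
    """Approximate the set of scripts in a name via Unicode character names,
    e.g. 'LATIN SMALL LETTER A' -> 'LATIN'. Non-letters are ignored."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in name if ch.isalpha()}

# The second name hides a Cyrillic 'а' (U+0430) among Latin letters.
for name in ["paypal", "pаypal", "Алиса", "Tokyo東京"]:
    scripts = scripts_used(name)
    print(f"{name}: {scripts} -> {'reject (mixed)' if len(scripts) > 1 else 'ok'}")
```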
A combination of some/all aforementioned approaches can also be used. The selection of the methods and how exactly they will be applied (e.g., you may choose to forbid similar names, or to require moderator approval in case a name is similar, or to not take any action, but simply warn a moderator/administrator) depends on the scenario that you are trying to solve.
I may have an innovative solution to this problem regarding usernames. Obviously, you want to allow ASCII characters, but in some special cases, other characters will be used (different language, as you said).
I think an intuitive way to allow both ASCII and other characters to be used in a username, while being protected against "Unicode vulnerabilities", would be something like this:
Allow all ASCII characters and disallow other characters, except when there are x or more of these special characters in the username (which suggests the username is in another language).
Take for example this:
WhatsApp, Inc + (U+180E) - not allowed: it has only 1 special character.
элч + (U+180E) - allowed: it has at least x special characters (for example, with x = 3), and it may legitimately use the Mongolian separator since it is Mongolian.
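A minimal sketch of that counting rule; the threshold x and the definition of "special" (here simply non-ASCII) are arbitrary choices:

```python
def username_allowed(name: str, threshold: int = 3) -> bool:
    """Allow pure-ASCII names; allow non-ASCII characters only when there are
    at least `threshold` of them (suggesting a genuinely non-English name)."""
    special = sum(1 for ch in name if ord(ch) > 127)
    return special == 0 or special >= threshold

print(username_allowed("Whatsapp, Inc\u180e"))  # False: one lone U+180E
print(username_allowed("элч\u180e"))            # True: several non-ASCII characters
```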
Obviously, this does not protect you 100% from these types of vulnerabilities, but it is an efficient method I have been using, ESPECIALLY if you do not mention the existence of this check on the "login" or "register" page: if attackers don't know such an algorithm is protecting the website, they can't easily reverse engineer it and find a way to bypass it.
Sorry if this is not an answer you are looking for, just sharing my ideas.
Edit: Or you could use an RNN (recurrent neural network) to detect the language and allow only characters from that language.
How can I alter a string so that variations of approximate string matching can't match it with the original?
I made an IRC bot which runs a game based on the logfile of the channel. It prints quotes from the logs and players collect points by guessing "who said it". The channel is rather geeky, and it took no more than 30 minutes for one of the players to build a bot which wins the game every time. I realize manual cheating is also easy and impossible to defend against, but consider this a competition between automated bots. I want to update my bot so that no fully automated bot will be able to play the game :)
I've considered randomly deleting a character from the quotes, but agrep would still be able to match the string. I've considered replacing some of the characters by similar-looking alternate characters, but that would be trivial to reverse-engineer. I'm looking for ideas that will be harder to break.
Example line:
[14:15] <baobot> [QUOTE 13/15] Who famously declared "minulla ainakin paperin tekemisessä 1% ajasta menee algon suunnitteluun ja 99% menee paperin kirjoittamiseen"?
(The Finnish quote translates roughly to: "for me, at least, 1% of the time spent making a paper goes into designing the algorithm and 99% goes into writing the paper.")
Print your quote as ascii-art.
Use something similar to the command-line tools figlet or toilet (explanation).
For a quick example, see string2ascii-generator.
To get you started, you might want to copy the sourcecode from figlet.
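For instance, one quick way to try this from Python is to shell out to the figlet tool mentioned above (assuming figlet is installed and on the PATH; pyfiglet is a pure-Python alternative):

```python
import subprocess

quote = "who said it?"

# Render the quote as ASCII art via the figlet command-line tool.
result = subprocess.run(["figlet", quote], capture_output=True, text=True, check=True)
print(result.stdout)
```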
Anything that can be used to scramble can most likely be unscrambled. Below are some suggestions though for your experiment:
Humans can read words if the first and last letter are in place and the inner portion is scrambled.
You can also do substitution, such as leet speak, to replace some characters with numbers.
You might also be able to find characters in other languages that look similar to the letters used, which means you can randomly substitute them as well.
You can also try to randomize the positions of the spaces. So remove them from the original position then move them around, or remove them completely.
Reverse some words.
Find ways to phoneticize words... in English, "ph" sounds like "f", so you can find and replace some of them.
Try a combination of different things above, remove all spaces, CaMEl CaSE words, then do character substitutions, etc.
Overall, there are lots of ways to help make it harder, however if you follow a set pattern every time, then it'll be easier to program something to undo it. If you randomly do different things, so one input can yield several different outputs, then it'll be harder for someone to write a program to reverse the process.
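A small sketch combining a couple of those ideas (inner-letter shuffling plus randomized leet-style substitutions), so the same quote comes out differently each time it is asked:

```python
import random

LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

def scramble_word(word: str) -> str:
    # Keep the first and last letters, shuffle the middle (still human-readable).
    if len(word) > 3:
        middle = list(word[1:-1])
        random.shuffle(middle)
        word = word[0] + "".join(middle) + word[-1]
    # Randomly apply leet substitutions to roughly half the eligible characters.
    return "".join(LEET[c] if c in LEET and random.random() < 0.5 else c
                   for c in word)

def scramble(quote: str) -> str:
    return " ".join(scramble_word(w) for w in quote.split())

print(scramble("minulla ainakin paperin tekemisessa 1% ajasta menee algon suunnitteluun"))
```

Because the output is randomized, one quote has many possible renderings, which makes a single agrep pass less reliable, though a determined opponent can still model the transformations.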
Use Google translate.
For example, I ran your quote through Russian, then English, and back to Finnish, and got
Minulla on ainakin 1% ajasta kirjassa otetaan suunnittelussa Algon ja 99% menee kirjoituspaperia
I have no idea whether it is correct Finnish; as far as I can tell it is still somewhat recognizable. If you think it is too recognizable for an approximate search, do more intermediate translations.
Okay, so I was thinking today about Minecraft, a game which I'm sure many of you are familiar with. While my question isn't directly related to the game, I find it much simpler to describe my question using the game as an example.
My question is: is there any way a type of "seed" or string of characters can be used to recreate an instance of a program (not in the literal programming sense) by storing a code which, when re-entered into the program as a string at run time, recreates the data it once held (in fields, text boxes, canvases, for example) exactly as it was?
As I understand it, Minecraft takes the string of ASCII characters you enter (which are really just numbers) and performs a series of operations on it that evaluate to some kind of finite hash or number; this number (again, as I understand it) is the representation of the string you entered. A given string parsed by this algorithm will always evaluate to the same hash, just as 1 + 1 always equals 2, so a seed's value must always come out the same in the end. And so you have the ability to replicate worlds exactly by entering this sort of key, which is evaluated the same on every machine.
Now, if we can replicate worlds exactly like this, is it possible to take the idea to a more abstract level, like the following?
Say you have an application like Microsoft Word. Word saves the data you have entered as a file on your hard drive: it holds formatting data, the strings you've entered, the format of the file, all in a physical file. Now imagine if, when you finished your essay in Word, instead of saving it and bringing your laptop to school, you clicked "parse" and, instead of creating a file, were given a hash code. Now you go to school knowing you have to print it, so you log onto a computer and open Word. Instead of "open" there is now an option called "evaluate"; you click it, enter the hash your other computer produced, and it recreates the exact essay you had written.
Is this possible? If so, are there obvious implementations of this that I'm simply not thinking of, or that are so much a part of everyday life that I don't recognize them? And finally, if it is possible, what methods and algorithms would go into such a thing?
[EDIT]
I had to do some research on the anatomy of a seed, and I think this explains it well:
The limit is 32 characters or, for a numeric seed, 19 digits plus the minus sign. Numeric seeds can range from -9223372036854775808 to 9223372036854775807, which is a total of 18446744073709551616. Text strings entered will be "hashed" to one of the numeric seeds in the above range. The "Seed for the World Generator" window only allows 32 characters to be entered and will not show or use any more than that.
BUT, looking back on it, lossless compression IS EXACTLY what I was describing. After re-reading the wiki page, I remembered that (you are very correct) the seed only takes part in the generation; the final data is stored as a "physical" file on the HDD, which (again, you are correct) is raw, uncompressed data.
So in retrospect, I believe I was describing lossless compression, trying in my mind to figure out how the seed was able to replicate the exact same world, forgetting that the seed is only responsible for the generation, not for the saving or compression of the world data.
So thank you for your help, guys! It's really appreciated. I believe we can call this one solved!
There are several possibilities to achieve this "string" that recovers your data. However, they're not all applicable; it depends on the context.
An actual seed, which initializes, for example, a pseudo-random number generator, then allows you to recreate the same sequence of pseudo-random numbers (see this question).
This is possibly similar to what Minecraft relies on, because the whole process of how to create a world based on some choices (possibly pseudo-random choices) is known in advance. Even if we pretend that we have random numbers, computers are actually deterministic, which makes this possible.
If your document were generated randomly then this would be applicable: with the same seed, the same gibberish comes out.
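A minimal illustration of that determinism with Python's standard PRNG: the same seed string always reproduces the same "world", while a different seed gives a different one.

```python
import random

def generate_world(seed: str, size: int = 5):
    # Seed a private PRNG with the text seed and draw some values;
    # CPython's generator is deterministic, so the output depends only on the seed.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(size)]

print(generate_world("my world"))    # same list every run
print(generate_world("my world"))    # identical to the line above
print(generate_world("other seed"))  # a different "world"
```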
Some key-value dictionary, or hash map. Then the values have to be accessible by both sides, and the string is the key that allows you to retrieve the value.
Think, for example, of storing your Word file on an online server; your key is then the URL linking to your file.
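A tiny sketch of that key-value idea, using the content's own hash as the key (content-addressed storage); the in-memory dict here is just a stand-in for a real server:

```python
import hashlib

store = {}  # stand-in for an online server or database

def put(document: bytes) -> str:
    key = hashlib.sha256(document).hexdigest()  # a short string naming the content
    store[key] = document
    return key

def get(key: str) -> bytes:
    return store[key]

key = put(b"My essay, formatting and all...")
print(key)        # the short "code" you would carry to school
print(get(key))   # recovers the exact bytes, but only because the server still
                  # holds them; the key alone contains none of the content
```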
Compressing all the information that is in your data into the string. This is much harder, and there are strong limits due to the entropy of the data. See Shannon's source coding theorem for example.
You would be better off (as in, it would be easier) to just compress your file with a usual algorithm (zip or 7z or something else), rather than reimplementing it yourself, especially as soon as your document starts having fancy things (different styles, tables, pictures, unusual characters...)
With the simple hypothesis of 27 possible characters (26 letters and the space), Shannon himself shows in Prediction and Entropy of Printed English (Bell System Technical Journal, 30:1, January 1951, pp. 50-64, online version) that there is about 2.14 bits of entropy per letter in English. Even if each of your 32 characters carried a full 8 bits, that is only 256 bits, or roughly 120 letters of English at that rate.
While this is significantly better than the 8 bits we use for each ASCII character, it also shows it is very likely to be impossible to encode a document in English in less than a fourth of its size. Then you'd still have to add punctuation, and all the rest of the fuss.
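To see the practical side of that limit, you can compress a sample of English with an off-the-shelf algorithm and look at the bits per character you actually get (a rough sketch; the exact numbers depend on the text and the compressor):

```python
import zlib

text = (
    "Shannon estimated that printed English carries roughly two bits of "
    "information per letter once the long-range structure of the language "
    "is taken into account, far less than the eight bits per character "
    "that a plain ASCII file spends on it."
).encode()

compressed = zlib.compress(text, level=9)
bits_per_char = 8 * len(compressed) / len(text)
print(f"{len(text)} bytes -> {len(compressed)} bytes "
      f"({bits_per_char:.2f} bits per character)")
```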
I am working on a website conversion. I have a dump of the database backend as an sql file. I also have a scrape of the website from wget.
What I'm wanting to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.
Is there some tool or a script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results that would say something like
string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.
I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...) ) aren't going to match lines in the page where it actually populates (<div id='left_column'><div id='left_content'>...</div></div>).
I realize this is a computationally intensive task, but even letting it run over the weekend is fine.
I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.
Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?
So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.
The first answer is that I doubt you'll find a ready-built tool that will meet your needs.
The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.
I will go one step further and propose an architecture.
First, you need to be able to parse the .sql files in a meaningful way: something that goes line by line and returns table name, column name, and value. A StreamReader is probably best for this.
Second, you need a parser for your web pages that will go element by element and return each text node and the name of each parent element, all the way up to the html element and its parent filename. An XmlTextReader or similar streaming XML parser, such as SAXON, is probably best, as long as it will operate on invalid XML.
You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:
Assuming you have your two parsers that loop through each field (on the one hand) and each text node (on the other hand), pick one of the two parsers and have it go through each of the strings in its data source, calling the other parser to search the other data source for all possible matches, and logging the ones it finds.
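A naive version of that architecture might look like the sketch below (Python rather than .NET, just to show the shape of it). The INSERT parsing is deliberately simplistic, "table.sql" and the glob pattern are placeholders taken from the question, and real dumps and pages will need more care.

```python
import glob
import html
import re

# Crude extraction of quoted values from single-line INSERT statements.
# Real dumps (escaped quotes, multi-line inserts) need a proper SQL parser.
VALUE_RE = re.compile(r"'((?:[^'\\]|\\.)*)'")

def sql_values(dump_path: str):
    with open(dump_path, encoding="utf-8", errors="ignore") as fh:
        for lineno, line in enumerate(fh, 1):
            if line.lstrip().upper().startswith("INSERT INTO"):
                for value in VALUE_RE.findall(line):
                    if len(value) > 20:           # skip short, boring values
                        yield lineno, value

def page_text(page_path: str) -> str:
    with open(page_path, encoding="utf-8", errors="ignore") as fh:
        raw = fh.read()
    # Strip tags and unescape entities to approximate the rendered text.
    return html.unescape(re.sub(r"<[^>]+>", " ", raw))

pages = {path: page_text(path) for path in glob.glob("scrape/**/*.asp", recursive=True)}

for lineno, value in sql_values("table.sql"):
    needle = html.unescape(value).strip()
    for path, text in pages.items():
        if needle in text:
            print(f'"{needle[:50]}..." on line {lineno} in table.sql '
                  f'matches string in {path}')
```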
This cannot work, at least not reliably. In the best case you would fit every piece of data to its counterpart in your HTML files, but you would have many false positives; for example, user names that are actual words, etc.
Furthermore, text is often manipulated before it is displayed. Sites often capitalize titles, truncate texts for previews, etc.
AFAIK there is no such tool, and in my opinion there cannot exist one that solves your problem adequately.
Your best choice is to get the source code the site uses/used and analyze it. If that fails or is not possible, you have to analyze the database manually. Get as much content as possible from the URLs and try to fit the puzzle together.
When I search for something, I get results that have the same text and title.
Of course, there is always an original (that the others copy/leech from).
If you have expertise in search and crawling, how do you recommend that I remove these duplicates (in a feasible and efficient manner)?
Sounds like a programming question to me.
If you have a clear idea of what the stolen and original components of these pages are, and those differences are general enough that you can write a filter to separate them, then do that, hash the 'stolen' content, and compare hashes to determine whether two pages are the same.
I guess web-page thieves might go to some further code obfuscation to mess you up, including changing whitespace, so you might want to normalise the HTML before hashing, for instance removing any redundant whitespace, making all attributes use " quotes, etc.
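For instance, a minimal sketch of that normalize-then-hash idea (tag stripping, whitespace collapsing and lower-casing only; real pages would need more aggressive normalisation):

```python
import hashlib
import re

def content_fingerprint(page_html: str) -> str:
    # Drop tags, collapse whitespace, lowercase, then hash what remains.
    text = re.sub(r"<[^>]+>", " ", page_html)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

a = "<html><body><h1>My   Article</h1><p>Original text.</p></body></html>"
b = "<HTML>\n<body >\n<h1>My Article</h1>\n<p>Original text.</p>\n</body></HTML>"

print(content_fingerprint(a) == content_fingerprint(b))   # True: same fingerprint
```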
Here's a technique based on simhash.
Here's one that uses stopwords to work around ads.
Have you tried looking at the origin date of the site? After comparing the word strings to verify duplication, whitelist the one that is earlier.