Make a model to identify a string - string

I have a string like this
ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8
The first part of the string is a random 18 digit number in base64 format and the second is a unix timestamp in base64 too, while the last is an hmac.
I want to make a model to recognize a string like this.
How may I do it?

While I did not necessarily think deeply about it, this would be what comes to my mind first.
You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for problems like this but may even be worse, depending on a given approach.
Here, an exact solution can be achieved, simply by understanding the problem.
One way people often go about matching strings with a certain structure is with so-called regular expressions, or RegExp.
Regular expressions allow you to match string patterns of varying complexity.
To give a simple example in Python:
import re
your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
regexp_pattern = r"(.+)\.(.+)\.(.+)"
print(re.findall(regexp_pattern, your_string))
# [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]
Now one problem with this is knowing where your string starts and stops. Most of the time there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that each string you wanted to match is preceded by the word Token: , you could include that in your RegExp pattern r"Token: (.+)\.(.+)\.(.+)".
Another way to avoid mismatches is to define the pattern requirements more clearly. Right now we simply match any number of characters, with two . separating them into three sequences.
If you knew which implementation of base64 you were using, you could limit the set of allowed characters from . (thus any) to the alphabet used in your base64 implementation. Taking abcdefgh1234 as a stand-in for that alphabet, the pattern could be refined like this: r"([abcdefgh1234]+)\.([abcdefgh1234]+)\.(.+)".
The same applies to the HMAC code.
Furthermore, you could specify the allowed length of each substring.
For instance, you said you have 18 random digits. This would likely mean each digit is encoded as 1 byte, which gives 18*8 = 144 bits, which in base64 translates to 144/6 = 24 characters (each character encodes a sextet, thus 6 bits of information). The same could be done with the timestamp: a 32-bit timestamp would need 6 base64 characters (covering 36 bits, because 32 is not divisible by 6 and the last sextet is only partly used).
With this information, you could further refine the pattern:
r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"
In addition, the same could be applied to the HMAC code.
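As a rough sketch of how these refinements fit together, the snippet below assumes the URL-safe base64 alphabet (A-Z, a-z, 0-9, - and _) and unpadded parts; both are assumptions, so adjust them to whatever your implementation uses. The names TOKEN_RE and parse_token are made up for this illustration. It goes one step beyond the pattern and decodes the first two parts to check that they really contain digits and a timestamp.

```python
import base64
import re

# Assumption: URL-safe base64 alphabet, no padding characters.
B64 = r"[A-Za-z0-9_-]"
TOKEN_RE = re.compile(rf"({B64}{{24}})\.({B64}{{6}})\.({B64}+)")


def parse_token(candidate):
    """Return (digits, unix_timestamp, hmac_part), or None if nothing matches."""
    match = TOKEN_RE.fullmatch(candidate)
    if match is None:
        return None
    id_part, ts_part, hmac_part = match.groups()

    # 24 base64 characters decode to exactly 18 bytes, so no padding is needed.
    raw_id = base64.urlsafe_b64decode(id_part)
    if not raw_id.isdigit():
        return None

    # 6 base64 characters hold 4 bytes; add the "==" padding back for decoding.
    raw_ts = base64.urlsafe_b64decode(ts_part + "==")
    timestamp = int.from_bytes(raw_ts, "big")
    return raw_id.decode("ascii"), timestamp, hmac_part


print(parse_token("ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"))
```

Decoding after matching rejects strings that merely have the right shape, which a pattern alone cannot do.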
I leave it to you to read a bit about RegExp but I'd guess it is the easiest solution and certainly more appropriate than any kind of machine learning.

Related

How to use Unicode::Normalize to create most compatible windows-1252 encoded string?

I have a legacy app in Perl that processes XML, most likely encoded in UTF-8, and needs to store some data from that XML in a database which uses windows-1252 for historical reasons. Yes, this setup can't support all possible characters of the Unicode standard, but in practice I don't need to anyway and can try to be reasonably compatible.
The specific problem currently is a file containing LATIN SMALL LETTER U, COMBINING DIAERESIS (U+0075 U+0308), which makes Perl break the existing encoding of the Unicode string to windows-1252 with the following exception:
"\x{0308}" does not map to cp1252
I was able to work around that problem using Unicode::Normalize::NFKC, which creates the character U+00FC (ü), which maps perfectly fine to windows-1252. That led to another problem, of course, e.g. in the case of the character VULGAR FRACTION ONE HALF (½, U+00BD), because NFKC creates DIGIT ONE, FRACTION SLASH, DIGIT TWO (1/2, U+0031 U+2044 U+0032) for that and Perl dies again:
"\x{2044}" does not map to cp1252
According to normalization rules, this is perfectly fine for NFKC. I used that because I thought it would give me the most compatible result, but that was wrong. Using NFC instead fixed both problems, as both characters provide a normalization compatible with windows-1252 in that case.
This approach becomes even more problematic for characters for which a normalization compatible with windows-1252 is available in general, just not the one produced by NFC. One example is LATIN SMALL LIGATURE FI (ﬁ, U+FB01). According to its normalization rules, its representation after NFC is incompatible with windows-1252, while using NFKC this time results in two characters compatible with windows-1252: fi (U+0066 U+0069).
My current approach is to simply try encoding as windows-1252 as is; if that fails I use NFC and try again, if that fails I use NFKC and try again, and if that fails I give up for now. This works in the cases I'm currently dealing with, but obviously fails if all three characters of my examples above are present in a string at the same time. There's always one character then which results in windows-1252-incompatible output, regardless of the order of NFC and NFKC. The only question is which character breaks when.
BUT the important point is that each character by itself could be normalized to something being compatible with windows-1252. It only seems that there's no one-shot-solution.
So, is there some API I'm missing, which already converts in the most backwards compatible way?
If not, what's the approach I would need to implement myself to support all the above characters within one string?
It sounds like I would need to process each string Unicode character by Unicode character, normalize each one individually with whatever is most compatible with windows-1252, and then concatenate the results again. Is there some incremental Unicode-character parser available which already deals with combining characters and the like? Does a simple Unicode-character-based regular expression handle this already?
Unicode::Normalize provides additional functions to work on partial strings and such, but I must admit that I currently don't fully understand their purpose. The examples focus on concatenation as well, but from my understanding I first need some parsing to be able to normalize individual characters differently.
I don't think you're missing an API because a best-effort approach is rather involved. I'd try something like the following:
Normalize using NFC. This combines decomposed sequences like LATIN SMALL LETTER U, COMBINING DIAERESIS.
Extract all codepoints which aren't combining marks using the regex /\PM/g. This throws away all combining marks remaining after NFC conversion which can't be converted to Windows-1252 anyway. Then for each code point:
If the codepoint can be converted to Windows-1252, do so.
Otherwise try to normalize the codepoint with NFKC. If the NFKC mapping differs from the input, apply all steps recursively on the resulting string. This handles things like ligatures.
As a bonus: If the codepoint is invariant under NFKC, convert to NFD and try to convert the first codepoint of the result to Windows-1252. This converts characters like Ĝ to G.
Otherwise ignore the character.
There are of course other approaches that convert unsupported characters to ones that look similar, but they require creating mappings manually.
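The steps above can be sketched roughly as follows. This is an illustration in Python rather than Perl, with the standard unicodedata module standing in for Unicode::Normalize; the function names to_cp1252 and encode_codepoint are made up for the example, and the fallback order simply follows the list above.

```python
import unicodedata


def to_cp1252(text):
    """Best-effort conversion of a Unicode string to Windows-1252 bytes."""
    out = bytearray()
    # Step 1: NFC composes sequences such as u + COMBINING DIAERESIS.
    for ch in unicodedata.normalize("NFC", text):
        # Step 2: drop combining marks that survived composition.
        if unicodedata.category(ch).startswith("M"):
            continue
        out += encode_codepoint(ch)
    return bytes(out)


def encode_codepoint(ch):
    # Direct conversion, if the code point exists in Windows-1252.
    try:
        return ch.encode("cp1252")
    except UnicodeEncodeError:
        pass
    # NFKC handles ligatures and similar; recurse on the replacement string.
    compat = unicodedata.normalize("NFKC", ch)
    if compat != ch:
        return to_cp1252(compat)
    # Bonus step: decompose and keep only the base character.
    base = unicodedata.normalize("NFD", ch)[0]
    try:
        return base.encode("cp1252")
    except UnicodeEncodeError:
        return b""  # give up on this character


print(to_cp1252("u\u0308 \u00bd \ufb01 \u011c"))
```

Because each code point picks its own fallback, the decomposed ü, the ½ and the ﬁ ligature can all appear in the same string.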
Since it seems that you can convert individual characters as needed (to cp-1252 encoding), one way is to process character by character, as proposed, once a word fails the procedure.
The \X in Perl's regex matches a logical Unicode character, an extended grapheme cluster, either as a single codepoint or a sequence. So if you indeed can convert all individual (logical) characters into the desired encoding, then with
while ($word =~ /(\X)/g) { ... }
you can access the logical characters and apply your working procedure to each.
In case you can't handle all logical characters that may come up, piece together an equivalent of \X using specific character properties, for finer granularity with combining marks or such (like /((.)\p{Mn}?)/, or \p{Nonspacing_Mark}). The full, grand, list is in perluniprops.

Case-insensitive string comparison in Julia

I'm sure this has a simple answer, but how does one compare two strings and ignore case in Julia? I've hacked together a rather inelegant solution:
function case_insensitive_match(a::AbstractString, b::AbstractString)
    lowercase(a) == lowercase(b)
end
There must be a better way!
Efficiency Issues
The method that you have selected will indeed work well in most settings. If you are looking for something more efficient, you're not apt to find it. The reason is that capital vs. lowercase letters are stored with different bit encoding. Thus it isn't as if there is just some capitalization field of a character object that you can ignore when comparing characters in strings. Fortunately, the difference in bits between capital vs. lowercase is very small, and thus the conversions are simple and efficient. See this SO post for background on this:
How do uppercase and lowercase letters differ by only one bit?
Accuracy Issues
In most settings, the method that you have will work accurately. But if you encounter characters such as capital vs. lowercase Greek letters, it could fail. For that, you would be better off with the normalize function (see docs for details; in Julia 1.x it lives in the Unicode standard library as Unicode.normalize) with the casefold option:
using Unicode
Unicode.normalize("ad", casefold=true)
See this SO post in the context of Python which addresses the pertinent issues here and thus need not be repeated:
How do I do a case-insensitive string comparison?
Since it's talking about the underlying issues with utf encoding, it is applicable to Julia as well as Python.
See also this Julia Github discussion for additional background and specific examples of places where lowercase() can fail:
https://github.com/JuliaLang/julia/issues/7848
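To illustrate the kind of failure discussed in the linked Python post, here is a small sketch in Python. The helper names are made up for the example; it combines str.casefold with NFKD normalization, which is one common recipe rather than the only correct one.

```python
import unicodedata


def normalize_caseless(s):
    # Case folding handles one-to-many mappings such as ß -> ss;
    # NFKD then makes composed and decomposed forms compare equal.
    return unicodedata.normalize("NFKD", s.casefold())


def caseless_equal(a, b):
    return normalize_caseless(a) == normalize_caseless(b)


# Plain lowercasing does not treat these two spellings as equal ...
print("Straße".lower() == "STRASSE".lower())
# ... while case folding does.
print(caseless_equal("Straße", "STRASSE"))
```

The same idea carries over to Julia's normalize with casefold=true.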

Important algorithm involving random access to a string?

I am implementing a different string representation where accessing a string in non-sequential manner is very costly. To avoid this I try to implement certain position caches or character blocks so one can jump to certain locations and scan from there.
In order to do so, I need a list of algorithms where scanning a string from right to left or random access of its characters is required, so I have a set of test cases to do some actual benchmarking and to create a model I can use to find a local/global optimum for my efforts.
Basically I know of:
String.charAt
String.lastIndexOf
String.endsWith
One scenario where one needs right to left access of strings is extracting the file extension and the file name (item) of paths.
For random access I can find no algorithm at all, unless one has prefix tables and accesses the string more randomly, checking all those positions for strings longer than the prefix.
Does anyone know of other algorithms where either right-to-left or random access to string characters is required?
[Update]
The hash code of a String is calculated using every character, accessed from left to right, with the running value stored in a local primitive variable. So this is not a case of random access.
The MD5 and CRC algorithms likewise process the complete string. So I do not find any random access examples at all.
One interesting algorithm is Boyer-Moore searching, which involves both skipping forward by a variable number of characters and comparing backwards. If those two operations are not O(1), then KMP searching becomes more attractive, but BM searching is much faster for long search patterns (except in rare cases where the search pattern contains lots of repetitions of its own prefix). For example, BM shines for patterns which must be matched at word-boundaries.
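To make the access pattern concrete, here is a minimal sketch of the Horspool simplification of Boyer-Moore (only the bad-character shift, not the full algorithm). It indexes into the text at the end of the current window, compares backwards, and then jumps forward by a variable amount, which are exactly the two operations that would be costly in a representation without cheap random access.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For every pattern character except the last one: distance from its
    # rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos + m <= n:
        # Compare backwards, starting at the last character of the window.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Skip forward based on the text character under the window's end.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("here is a simple example", "example"))
```

Every text[...] lookup here is a random access, so this makes a reasonable benchmark case for position caches.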
BM can be implemented for certain variable-length encodings. In particular, it works fine with UTF-8 because misaligned false positives are impossible. With a larger class of variable-length encodings, you might still be able to implement a variant of BM which allows forward skips.
There are a number of algorithms which require the ability to reset the string pointer to a previously encountered point; one example is word-wrapping an input to a specific line length. Those won't be impeded by your encoding provided your API allows for saving a copy of an iterator.

What is a good method for obfuscating a base 64 string?

Base64 encoding is often used to obfuscate plaintext. I am wondering if there are any quick/easy ways of obfuscating a base64 string so that it is not easily recognizable as such. To do so, the method should obfuscate the padding characters (='s) such that they become some other symbol and are more dispersed.
Does anyone know of an easy (and easily reversible) way to do this?
You could use a shift cipher, but I am looking for something that's a little more comprehensive, for example if my shift cipher mapped = to a, someone might notice a string that frequently ends in a's.
The purpose is not to add security; it is simply to make base64 unrecognizable as base64. It also does not need to get past a security professional, just an individual who knows what base64 is and what it looks like (e.g. ='s at the end, etc.).
The method I describe would probably add non-base64 characters, like ^%$##!, to help confuse the reader.
Most of the replies seem to be on the topic of WHY I would want to do this, and the basic answer is that the operation would be completed numerous times (So I want something inexpensive), and done in a way where no password can be remembered (Why I don't XOR). Also the data isn't highly sensitive, and is just to be used as a method against the casual user, who might have knowledge of what a base 64 string is.
A couple of suggestions:
Strip any ending = (according to Wikipedia they are not needed) and then bitwise negate each byte. This will transform the text into mostly non-readable characters.
Loop over the data and xor each character with its position, modulo 256. This will eliminate any simple statistical analysis, since the mapping of each character depends on its position in the string.
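A minimal sketch of the second suggestion, combined with stripping the padding, might look like this. The function names are made up for the example, and the padding is restored from the length on the way back, which assumes standard base64 where the padded length is a multiple of 4.

```python
import base64


def obfuscate(b64_text):
    # Drop the trailing '=' padding, then XOR every byte with its position.
    stripped = b64_text.rstrip("=").encode("ascii")
    return bytes(b ^ (i % 256) for i, b in enumerate(stripped))


def deobfuscate(data):
    # XOR with the position again to undo it, then restore the padding.
    stripped = bytes(b ^ (i % 256) for i, b in enumerate(data)).decode("ascii")
    return stripped + "=" * (-len(stripped) % 4)


original = base64.b64encode(b"foobar1").decode("ascii")  # 'Zm9vYmFyMQ=='
hidden = obfuscate(original)
print(hidden)
print(deobfuscate(hidden) == original)
```

The output is arbitrary bytes rather than printable text, so this only fits channels that can carry binary data.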
In contrast to one of the points in Anders Abel's best answer, the = signs in the base64 strings seem to matter:
$ echo -n foobar | base64
Zm9vYmFy
$ echo -n foobar1 | base64
Zm9vYmFyMQ==
$ echo -n Zm9vYmFyMQ | base64 -D
foobar$ echo -n Zm9vYmFyMQ= | base64 -D
foobar$ echo -n Zm9vYmFyMQ== | base64 -D
foobar1$
What you are asking for is called "security by obscurity" and generally is a bad idea.
Base64 encoding was never designed or intended to be used to obfuscate text or data. It's used to encode binary data which needs to travel through some communication channel that allows only ASCII characters, like email messages, or to be part of XML, etc.
Better to use real encryption if you want to hide the data. In any case, if you need to pass the data as XML, etc. even after encrypting it, you may end up encoding it in Base64 again for transport purposes.
I suppose you could generate a small amount of random data, and then use that to encode the Base64 characters. Prepend the random data to the re-encoded Base64 data.
A very simple example: given an input string "Hello", generate a random number in the range 1-9 and use that as the offset to apply to each input character. Suppose you generate "5", then the re-encoded string would be "5Mjqqt". Or encode the offset as a letter rather than as a number (a=1, b=2, ...) Then the "=" padding will be translated to a different character each time.
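One way to sketch this idea is shown below. It is a variation on the example above rather than a literal implementation of it: the shift wraps around within the base64 alphabet plus '=', and the random offset is stored as one leading character from the same alphabet. The names ALPHABET, shift_encode and shift_decode are made up for the illustration.

```python
import random
import string

# Standard base64 alphabet plus the padding character (65 symbols).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="


def shift_encode(b64_text, offset=None):
    if offset is None:
        offset = random.randrange(1, len(ALPHABET))
    shifted = "".join(
        ALPHABET[(ALPHABET.index(c) + offset) % len(ALPHABET)] for c in b64_text
    )
    # Prepend the offset, itself written as a character of the alphabet.
    return ALPHABET[offset] + shifted


def shift_decode(encoded):
    offset = ALPHABET.index(encoded[0])
    return "".join(
        ALPHABET[(ALPHABET.index(c) - offset) % len(ALPHABET)] for c in encoded[1:]
    )


hidden = shift_encode("Zm9vYmFyMQ==")
print(hidden, shift_decode(hidden))
```

Since the offset changes on every call, the '=' padding maps to a different character each time.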
Or you could just drop the padding; according to the Wikipedia article, it's not really necessary.
(But consider whether this is really a necessary and sufficient thing to be doing in the first place. It's not clear from your question why you want to obfuscate base 64 data.)
Agreed with the responses suggesting use of encryption if your requirement is to actually keep someone who is determined to decode the data from reversing the process.
Otherwise, the answer somewhat depends on other constraints of your system, but a few ideas came to mind. If you're just concerned about the padding characters, and you have control over the process that generates the Base64 to begin with, you could choose some method of padding the data prior to conversion, thus eliminating the '=' characters from the output.
Along the same lines, you could use one of the variants like 'base64url' encoding (see http://en.wikipedia.org/wiki/Base64 for lots of good info on the variants), in which the pad character is optional and can be omitted.
After eliminating the '=' by one of these methods, you could perhaps do some sort of char-swapping on the generated Base64, just swapping every other character and leaving any final character in place. You could also perhaps substitute the upper- or lowercase letters with some other characters to make it look less like Base64 at a quick glance.
However, whatever idea you choose, just remember that it will not be a substitute for a real encryption scheme if you require real protection of that data.
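As an illustration of the padding-free variant plus character swapping, here is a small sketch. The function names are made up for the example, and the swap is just one possible reading of "swapping every other character": neighbouring characters are exchanged pairwise, and a leftover final character stays where it is.

```python
import base64


def swap_pairs(text):
    # Exchange characters 0<->1, 2<->3, ...; an odd final character is kept.
    chars = list(text)
    for i in range(0, len(chars) - 1, 2):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def hide(data):
    # base64url without the optional '=' padding, then swap neighbours.
    encoded = base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
    return swap_pairs(encoded)


def reveal(hidden):
    encoded = swap_pairs(hidden)  # swapping twice restores the original order
    encoded += "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(encoded)


print(hide(b"foobar1"), reveal(hide(b"foobar1")))
```

The result still uses only base64url characters, so it survives the same transport channels as the original encoding.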
Base64 is usually used when you want your data to go through some channel that can distort non-alphanumeric symbols, for example in XML. If that is your task too, your output will look similar to Base64 no matter what you try :)
If your channel handles binary data well, then just get the source text (decode the Base64 back), take its binary representation and apply some sort of xor. For example, xor every byte of the source with 37. The same operation will restore your text.
But it is still easily recognizable by anyone who has basic knowledge of cryptanalysis. If that is a problem, use real encryption.
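A sketch of that xor step, assuming the channel really can carry arbitrary bytes; the function name is made up and the key 37 is simply the value from the example above.

```python
import base64


def xor_bytes(data, key=37):
    # XOR is its own inverse, so the same call both hides and restores.
    return bytes(b ^ key for b in data)


source = base64.b64decode("Zm9vYmFyMQ==")  # back to the raw bytes b'foobar1'
hidden = xor_bytes(source)
print(hidden, xor_bytes(hidden))
```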

String recurring subsequences and compression

I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.
For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.
I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They are either for comparing two strings or for getting some kind of storage-optimal result.
My objective, on the other hand, is to reduce the genome to its "words" instead of "letters". ie, instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in dna.
If I could just find that online I would skip doing it myself but I can't see this question answered before in terms I could uncover. To anyone who can point me in the right direction many thanks.
Byte pair encoding does something pretty close to what you want.
Rather than searching directly for the longest repeated string (top-down),
each pass of byte pair encoding searches for repeated byte pairs (bottom-up).
But eventually it discovers the longest repeated string(*).
                    gatcatcgatc
1=at                g1c1cg1c
2=atc               g22g2
3=gatc 2=atc        323
As you can see, it has found the longest repeated string "gatc".
(*) byte pair encoding either eventually finds the longest repeated string,
or else it stops early after making (2^8 - uniquechars(source) ) substitutions.
I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source) ) or 2^12 or 2^16.
Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.
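A minimal sketch of this bottom-up procedure is shown below. It is not a tuned compressor: tokens are written as "#1", "#2", ... instead of spare byte values, so there is no built-in limit on the number of substitutions, and the optional max_rules parameter (made up for this example) plays the role of the early-stop condition.

```python
from collections import Counter


def replace_pair(seq, pair, token):
    # Replace non-overlapping occurrences of pair, scanning left to right.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def byte_pair_encode(text, max_rules=None):
    seq, rules = list(text), {}
    while max_rules is None or len(rules) < max_rules:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair occurs more than once
        token = f"#{len(rules) + 1}"
        rules[token] = pair
        seq = replace_pair(seq, pair, token)
    return seq, rules


def expand(token, rules):
    # Recursively turn a token back into the substring it stands for.
    if token not in rules:
        return token
    left, right = rules[token]
    return expand(left, rules) + expand(right, rules)


encoded, rules = byte_pair_encode("gatcatcgatc")
print(encoded, {t: expand(t, rules) for t in rules})
```

Expanding each token gives the "words" it stands for, which is the table you would want for a genome.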
Wikipedia: byte pair encoding
Stack Overflow: optimizing byte-pair encoding
