I'm doing some reverse engineering and came across this string in one field of a ProtoBuf-encoded HTTP request. The ProtoBuf binary structure is already decoded; this is the contents of one of its fields.
Does anybody recognize this encoding? It's not base 64 and doesn't appear to be escaped Unicode characters since there are regular non-escaped characters interspersed throughout.
\002\000\000\000P\030\326--\037\352Hx\232\244\322.\224\'\246\004P\3314\372g\274\366\362\337\277\226b\236\nr8\351 ]u\362\214\374\330O5\246Y+\276\005\212\234\017\216\333\312*\313g\357\267t\227\034\244_}\205jiO\261\271\304\013\224\373.zZ\224\230\260\004\2411\000\323\362\345K\300h\307\\\220\335\304\022\357\2230\355\375\032\210\330\2711\374\272\336\277kC]\334?\226\370w\262\023)4 D\273\344H\212\000\347}u\336lOp\237\3666\337j\002*s\033|\010\000\000%\3157/w\327\364\252)\235\245wQ\325+W\026*\215\357E\005\271w\002\246\216\325\002&e\217T!\242\376\307\321\267\016_\017Q\265p\007\035\367\324\216H\314\222\3244\004\353b\017\325\025N\017\205dk\257\237g\"\367\245\324*\204^\010\233\244\002\266\007\231\226w\006\2056\313\265_\236Y\270\nP\216\nq\373\330#\345,\271\241\177\331\271\023K\227\013\317d\335mg\255\266\232pp2d\253\332A:Gs!0>O\226\315_\264G\234\326\240\213\261\253\017\352\214\365\007{ \022\365r<\306\354\355]\320\010\2511\225\215M\276\366P\264\003\315F\314\301\244\350\034\316P\375\317v(\360\244\347\371<$$9\360\267\340H\372\362\271\307\357\215J\3433\215\331=\tqQ1\354\213\333R\331--\213Tc\352a\337\236\346[#-\266]\354\202\335\307\333\330\213\351b\254kt\304\210\276\300\013\322\306\242\000\037\177\354#[U|\302j\r{\243\247\257\000XY\344\020x*\310\363\242^\315\271\371\335\210\030\310\255S\240)\234C\'\200+\313\246oM\304\271^6s\345IG3.\306Xhf\235\244\004[\314Tb\373\360\023\3565\265\254\351\236\227b\337\264\207\302\nq\t\\\236\253e\271I\244u\004\204\220\336\367\333\337# \005\361IPe\270v\016\010RPd\306r\254\2651J\250P\320\312l\177g\036\021\276fC\0136\363\372\265\003v\243y\215w\266q\364a\025\210\264\251\227\333\235r\316\275\237r\256o\231\263\3358?\240\001\306: 
\211\223\375-\247=\361\207\022\321Gb\326\230x\342\203\014^\371\243\037N\000\376+\370\302\351\227F\025\025\017C\225x]\201c\370\373{\230\2656\222\334,\266\016Q\320\005-Y\203z\200\207\205\2667\264\320\027\250\3007,\303\204\006\357\036\254)\271\271!;\233\300P\250\220\306V/_EP\352\272-\242\276q\252C\337BV\022\357\2467y\025\377A\017u\\\335m\352\037\215\026f\354\3100\312\032\235i\333\312|h\255\266\376\234\345\\\361lC\n\022\341K\022cnU\217\'\222Hl\312\006;0\003V\006\255\256\016\262Z\220zo\002\004\316\370\317\371\220O^q\247\313g\301\376\354W\346\001F\262\233\354\024\004kzk\032\313\0132u\346R\013z:TQ\007\347\273\343\022&X\357\334\305\307;\221W\301\236\360Ap\311\t\024W\004i\221\301a\356}\036\362\002J\267R\335\371(\357\025<\322H\232\334a\375\215eSl\324\214P\367\377T\236\346\346\026\367h\214\275;\013\205\n\302%\\\017\227a\373\376\347\222\\\014cT\340\'\361\024t\t:\203c\314\361W\252\336+\376e\353\336\237\272\2745\315\354\356\272\037Z\246Z\277;\344j\271\022\273\274\025\367\037\257\372p\204\224\314\244\026&o\365\220\235`\365c\377\306\304)&f]q\241\252|d\270H\010?\300$\275\200^!\r\272_\237V\241=\245\020#\314\362\032\031\312t\037\0344\254\264\213Y\315:\215\271\222\277\332\007\220\t\357N]\361O\\\257\352<F\001{\214\317\226\314\'&\232\026\314\350\020\200\316\370\216\231\325\2574\373R\231\316\251\257\260z!\033\203\357\364\310\021\0029\000)\034\010\276Tr\336y0\376\232h~\332y-\354\327w\220\254\321\022\210\266\345\245\325gy\210\357\356\215P$\270\372\3169\365\022\357\225A\324\352\313\340\3445\247\267\352{\037\266\244\205\262\023\t\\\224\020\236\307C\241\371\214\345\216^\271\320\345?\0052\341TD\235j\370\306\236\274\254J\213 \377\212K\032\265\251\367,Q\331\0067ZE\235\253\256\311\022\320\232\205p\262\370\032h\255\304\304D\366\340\276\006\200\307S\230\340?\212jj\261\377r\337\223 
\305\217\310\344Xi()*\225z[Y\313_t}\331\240\000>\024:\3242\322\030\352ZWB\247`\320\340\243\204\224\312&\274\321qi\375\231\374\201\235\234{\344\367\002lO\350\363X\361\rh)\231\337r\361\306w\360B\271\013\233IoG\245~:X5%h\222\247J\\w\373\266\374\340\314\313\226\224\204:\250\363\243\265H|\003Y\263\023sZ7#\351)V]{\3065E\210t\207\353^\205q\211\003Yj\373\227Qb4v\2213TO\"S\301^\272\035\t\212|eJ\332t\243\177\274\016ni^ 8\273\317p(N\263j\375\254k\253h\206%ta*LM\270v\2473\220\263\366\211\302=Q\217~\0029\246\236\374\350\247%\221\001`B\337\321N\216wR\235\336\244.K;Y\330\033\372i<\3156{z\310\255\031\021wr{{\331F01\227\010\346B#\341\276\'\246\372S\250\356\222\370,\334h\217\025\334S\016\005\007/,\024\355\024V\246\007\036;\030\337\002c\254\304[\253\tN\331X|$[%*\242\353\254\227>\031\304\203\275\277``c\240\344\277\213\377\204\223\202\026#\367\271\302k\027\262\020H\024p\010\203\264iM\233F7\333\354\352\303\223\217Hi\'\375\010\302\035\013\273F(\032\272\377\252:8\213\304\036\264y\t\265\025\300\317\324za4\010I$Eu\310,\006y,^\3531\027g\343o\314j\270\3152gif\271(\037g\031\375\325\341\320\320\317HJ+\374>%\320\234V\317\332\232x\034x\233R9\245\346_r\307{\030y\234z\331zV\031\264\035\324\003\260AQ\024\217\230\213w\021\3205g\273\275nn\357\275\217?Kd\031\353yF\'\234\201\335+\177\350\001\340D\324\"\340\335\254\304\360=\301\'$\274\235e\032$N\345+\244WKC\204\342\024\307\3103\2722\024\216\002\221UbTn\233\244\261\347\303\340A\312l\317\263Gm\352\000v\245X\334\"\263\315z\374N\244\365\013\375\260\220\251\203\036gD\364p3i#n\016\031[,\336\300\0000\352\001NK()\214\023\222w\014B\242\220\206\034\333\256\265\331-\220\361F\203s\014S\236p\265\236\343g\020HR\235\325W\360\030(\374\341\000\261\315s\315vv\017]s\311o\033c\206\303\245\347\372C\345\207\244\207AL+\306c\026\001\307\3409\331\205\340\371\365\006\263\352kF\010\035K\354\225\035\341\014\360*\232\035\251\t\344\205\374\235\374\352\n}\262+\252\321\377\010G\215\263GA\230\364Z\037\323\351\220\226\272\002\207\254\241\263X\t_ 
N\307\326\350\246hI\223\223J&\373-\344\243\316\300m.FHmNdS?\tCf\001\252\307\346\205H\026\375)#\006\261g\036\307\252\205\000\027}\212_\021 )4\207#\213n\254H\205\036\325q\217\025\305\036\010J\017\320\257\203\226\025X\313\032,\003\341\003\023cw\375r\337\223\233>+\335\223\206\203}\035!\3100\242Tv\350\255\276\343&\220\213\361\354Ij\035\312_\273\233\333\327;\022\016\315a5\373\217t\324ZJ\202\304(~,B(\215\005E\341\375\036\260C\213\364\240\020\373\340\275\310\2048*\326\"^$\366\367\252#\201\355\000\273\010#`J\230\363\320\363L9\261\216\353((#;3\366oKR\021\nL\244a\244\376\032\304\376\001|\317c\222%c=\\\225\340I\225\301\277G\227\242\366\025\323y0\273\241\217E[\032=\253e\001\270q\005\241\374\276\267$\277Lj\3528\257z\247\242+}\304\254(\013\336g\230\237\270\212I#\245\247\271)\026i\346\366\342\021\005\373i\341`A\020|\367\337\312$`\241\322\007YaQ#\216cy&\371\206\223\264+g\0213b\315\217\371\364\013x\327\2478\0013\352\372\375E\233\352\200\213<\021puH\347x;\354\036\024\\\253_\340\200xH\353\350b\364\207\276*\323J\341\200\r\276]e\217\307\305\275\350\004V\300\272\271\010\345KM\330\2716$\030\225\223\322\347\325\260\331Ok\0340Y\241\276\353\223\276\253>\256\022\257CE\320\007D\236\201\026\214\177\036\277\347\031\001\254\240L\203\n\332\252c\211Y\031\310\212\r+J\274E0
Related
I wanted to decode my PUBG name and came across this site: http://ddecode.com/hexdecoder/
It decodes as I want, but now I want to know what technique they use, so I can use it in my project.
Input :
PSYCH%C3%98%E4%B9%82JOKER
Decoded String:
PSYCHØ乂JOKER
Here Is The result Url: http://ddecode.com/hexdecoder/?results=48d3b517a922349a1838240623f6e7c3
You should take a look at percent-encoding, which is a way to encode data so that it can be written validly in URLs. The characters after each % symbol are simply the hexadecimal values of the UTF-8 bytes that encode the special characters Ø乂.
0xC3 0x98 corresponds to Ø and 0xE4 0xB9 0x82 to 乂 in UTF-8.
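As a minimal sketch of the decoding described above, assuming Python's standard urllib.parse module is acceptable for your project (what the site actually runs is unknown), percent-decoding and re-encoding could look like this:

```python
from urllib.parse import quote, unquote

encoded = "PSYCH%C3%98%E4%B9%82JOKER"

# unquote() turns each %XX into a byte and decodes the result as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)

# The same byte values, decoded by hand, give the two special characters.
print(bytes.fromhex("C398").decode("utf-8"))
print(bytes.fromhex("E4B982").decode("utf-8"))

# quote() goes the other way: UTF-8 bytes written as %XX escapes.
print(quote(decoded) == encoded)
```

Most languages ship an equivalent pair of functions, so the technique is not tied to Python.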
By the way, you added the encryption badge and used the word in your question, but in this situation we cannot speak of decryption; you might want to take a look at the difference between these terms (encoding and encryption, for example).
I have the following problem:
From a SQL Server database I am reading data using python module pypyodbc and ODBC Driver 13 for SQL Server and writing to txt files.
Database contains all kinds of special characters and they read as:
'PR\xc3\x86KVAL'
The '\xc3\x86' part is a sequence of raw bytes and should be interpreted that way. The other characters should be interpreted as shown. In UTF-8, '\xc3\x86' decodes to Æ.
If I type the value as b'PR\xc3\x86KVAL', Python recognizes it as a bytes object and I can decode it to PRÆKVAL. See below:
s = b'PR\xc3\x86KVAL'
print(s)
bb = s.decode('utf-8')
print(bb)
The problem is that I don't know how to turn 'PR\xc3\x86KVAL' into a bytes object.
I want the value that has to be decoded to be a variable so that all data from the database can flow through it.
I also tried ast.literal_eval(r"b'PR\xc3\x86KVAL'"), but that does not work with a variable.
Since you start out with PR\xc3\x86KVAL as a text string and decode indeed expects a raw byte sequence, you need to convert the text string into a bytes object. But when converting from one "encoding" value to another, Python needs to know what encoding it is starting with!
The easiest way to do so is to explicitly encode the string, using an encoding that does not change the special characters. You must be careful, because it is quite possible that a character code gets translated to something else, destroying its meaning.
You can see that with a simple example: attempting to tell Python this should be plain ASCII fails, for an obvious reason.
>>> s = 'PR\xc3\x86KVAL'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
Even though there are more than 1,000 questions on Stack Overflow about this, the reason for the failure should be easy to understand. All an encoder/decoder pair does is translate each character from 'source' to 'destination'. This can only work if the character in question actually exists in both the 'source' and 'destination' encodings. Suppose you want to translate a Greek character β to a Russian б, then the source must be able to decode the Greek character (because that is what you entered it in) and the destination must be able to encode the Russian character.
So you must be careful to choose an encoding which does not change the character \x86 in your input string into Ж (which it would do when using cp866, for example).
Fortunately, as quoted from https://stackoverflow.com/a/2617930/2564301, there is an encoding that does not mess up things:
Pass data.decode('latin1') to the codec. latin1 maps bytes 0-255 to Unicode characters 0-255, which is kinda elegant.
and so this should work:
>>> s = 'PR\xc3\x86KVAL'.encode('latin1')
>>> print(s)
b'PR\xc3\x86KVAL'
Now s is a proper bytes object, so you can decode it at will:
>>> bb = s.decode('utf-8')
>>> print(bb)
PRÆKVAL
Done!
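Since the question asks for something that works on a variable, the two steps above can be wrapped in a small helper. This is only a sketch; the function name reinterpret_as_utf8 is made up for illustration.

```python
def reinterpret_as_utf8(text):
    """Treat each code point (0-255) of text as a byte and decode those bytes as UTF-8."""
    raw = text.encode("latin1")   # code points 0-255 map to identical byte values
    return raw.decode("utf-8")    # now interpret the bytes as UTF-8


value = 'PR\xc3\x86KVAL'           # e.g. a value read from the database
print(reinterpret_as_utf8(value))  # PRÆKVAL

# latin1 leaves every possible byte value unchanged in a round trip ...
all_bytes = bytes(range(256))
print(all_bytes.decode("latin1").encode("latin1") == all_bytes)  # True

# ... whereas cp866 reads byte 0x86 as a Cyrillic letter.
print(b"\x86".decode("cp866"))  # Ж
```

The helper raises UnicodeEncodeError if the input contains code points above 255, and UnicodeDecodeError if the resulting bytes are not valid UTF-8, so values that are already correct text may need to be skipped or caught.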
Given the following decoder, write the encoder. (The encoder should be written to compress whenever possible):
p14a8xkpq -> p14akkkkkkkkpq
(8xk gets decoded to kkkkkkkk. The only other requirement is that encodings be unambiguous)
Note that the String can have any possible ascii character
My approach would be to find sequences of repeating characters and replace them. For example, kkkkkkkk will be replaced by 8xk. However, the problem with this solution is that it's ambiguous: "8xk" may appear in the uncompressed string itself. I was thinking of using some special character to distinguish it, but the string can contain any possible character, so that does not really help.
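One possible way around the ambiguity, sketched here under assumptions the question does not confirm: the decoder is taken to read a maximal run of digits followed by 'x' and one character as a repeat, and to copy everything else literally. If the encoder then writes every digit of the input as a counted run (even a run of length 1), digits in the encoded text only ever appear as counts, so nothing can be misread. The function names are invented for illustration.

```python
from itertools import groupby

DIGITS = "0123456789"


def decode(encoded):
    """Assumed decoder: '<digits>x<char>' expands to a run; anything else is literal."""
    out = []
    i = 0
    while i < len(encoded):
        j = i
        while j < len(encoded) and encoded[j] in DIGITS:
            j += 1
        if j > i and j + 1 < len(encoded) and encoded[j] == "x":
            out.append(encoded[j + 1] * int(encoded[i:j]))
            i = j + 2
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


def encode(text):
    """Run-length encode; digits are always written as counted runs to stay unambiguous."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        if ch in DIGITS or n >= 4:      # "4xk" is the first run that is shorter than "kkkk"
            out.append(f"{n}x{ch}")
        else:
            out.append(ch * n)
    return "".join(out)


print(decode("p14a8xkpq"))          # p14akkkkkkkkpq
print(encode("8xk"))                # 1x8xk
print(decode(encode("8xk")))        # 8xk
```

The price of this choice is that isolated digits grow from one character to three, so it does not compress "whenever possible"; a tighter encoder would escape a digit only where the decoder would otherwise misread it.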
I have a file which contains some data (text copied and pasted from the "What You Will Learn" portion of this PDF). First, I converted the contents of the file to bits successfully. However, when I try to convert it back to the original format, some of the characters are not correctly converted, as shown below:
Cisco has
developed the Cisco Open Network Environment (ONE)
architecture as a multifaceted approach to network
programmability delivered across three pillars:
??)É¥ Í?н??ÁÁ±¥?Ñ¥½¸ÁɽÉ?µµ¥¹?¥¹Ñ?É???Ì?¡A%̤?)?áÁ½Í??¥É?ѱ佸Íݥѡ?Ì?¹É½ÕÑ?ÉÌѼ?Õµ?¹Ð?)?á¥ÍÑ¥¹?=Á?¹±½ÜÍÁ?¥?¥?Ñ¥½¹Ì* ¤&öGV7F?öâ×&VG?÷VäfÆ÷r6öçG&öÆÆW"æB÷VäfÆ÷r ¦vVçG0¨?HÝZ]HÙ??ÙXÝÈÈ[]?\??\X[Ý?\?^\Ë?\X[?Ù\?XÙ\Ë[??\ÛÝ\?ÙHÜ?Ú\Ý?][Û?Ø\X?[]Y\È[?H?]HÙ[
As you can see here some characters are converted successfully, others are not.
My code is below:
file = open("test.txt",'r')
myfile = ''.join(map(str,file))
l = []
for i in myfile:
    asc11 = ord(i)
    b = "{0:08b}".format(asc11)
    l.extend(int(y) for y in b)
string_bin = ''.join(map(str,l))
mydata = ''.join(chr(int(string_bin[i:i+8], 2)) for i in range(0,len(string_bin), 8))
print(mydata)
What is wrong with my code? What do I need to change to make it work properly?
What's Going On?
You are running into an encoding issue because some characters in the PDF are non-ASCII characters. For example, the bullet points are U+2022, which requires 3 bytes of storage in UTF-8.
When Python reads from your file, it doesn't know what encoding you used to write that data. Thus it reads bytes from the file and uses a character encoding to translate them into strs, which are stored using Python's own internal Unicode format. (This differs from Python 2, where open() returned raw bytes stored in a str which you could then manually decode to unicode.)
Thus, in Python 3, open() accepts a named encoding parameter, for example open("test.txt", 'r', encoding='ascii'). Because you don't specify the encoding when you call open(), you end up using your system's default encoding. For instance, on my laptop, the default encoding is CP1252 (Windows-1252, a close relative of Latin-1 but not identical to it). Yours may differ.
Whatever encoding Python uses to interpret your file, it then internally uses its own Unicode format to store your string. This means that your string may internally use multi-byte characters even if the original encoding did not. For example, my laptop uses CP1252 to interpret the UTF-8 bytes of U+2022 as â€¢, which is internally stored as U+00E2, U+20AC and U+00A2; the € is stored using a multi-byte character even though it was just one byte in the original file.
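The CP1252 misreading described above can be reproduced without a file. The following is a small sketch that works on the bytes directly:

```python
bullet = "\u2022"

# A bullet written to disk as UTF-8 occupies three bytes.
utf8_bytes = bullet.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\xa2'

# Reading those same bytes as CP1252 yields three unrelated characters.
misread = utf8_bytes.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00E2', 'U+20AC', 'U+00A2']

# The euro sign is one byte in CP1252 but three bytes in UTF-8.
print(len("\u20ac".encode("cp1252")), len("\u20ac".encode("utf-8")))  # 1 3
```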
Let's assume your computer is sane and uses UTF-8 by default (this explanation is similar for many multi-byte characters). When you reach a bullet point, it is stored as U+2022. When you call ord('\u2022') the result is 8226. When you then call "{0:08b}".format(8226) this returns "10000000100010". That's a 14-character string. Your parsing code assumes all of the ordinals will generate 8-character strings. Because of this, the "binary" output becomes misaligned. This means that when you then parse the binary string in 8-character segments, it gets thrown off and starts interpreting things as control characters and all sorts of foreign-language characters.
If you call open(..., encoding='ascii'), Python will actually throw an exception because the file contains bytes that are not valid ASCII.
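The misalignment and the ASCII failure can both be seen in a few lines. This sketch uses a three-character string rather than the original file:

```python
bullet = "\u2022"
code_point = ord(bullet)
bits = "{0:08b}".format(code_point)
print(code_point)        # 8226
print(bits, len(bits))   # 10000000100010 14

# One 14-bit entry among 8-bit entries shifts every later chunk boundary.
text = "a\u2022b"
stream = "".join("{0:08b}".format(ord(ch)) for ch in text)
print(len(stream))       # 30, not a multiple of 8

chunks = [stream[i:i + 8] for i in range(0, len(stream), 8)]
garbled = "".join(chr(int(chunk, 2)) for chunk in chunks)
print(garbled == text)   # False

# Decoding the bullet's UTF-8 bytes as ASCII raises an exception.
try:
    bullet.encode("utf-8").decode("ascii")
except UnicodeDecodeError as exc:
    print("ASCII decode failed:", exc.reason)
```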
Possible Solutions
I'm not sure exactly why you are converting the input string into the representation that you are using. It's not binary, as your question title would suggest. Rather, you've converted the data into a textual representation of its binary encoding.
Technically speaking, when you store encoded text to a file, it's stored using a binary representation. Python, and any text editor, has to decode those bytes into its internal character representation before it can display them as text. Thus, calling open("test.txt", "r", encoding="utf-8") reads the binary data out of your text file and converts it into Python's internal Unicode format. Similarly, calling myfile.encode('utf-8') will return the UTF-8 encoded bytes which can then be written to a file, network socket, etc.
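If a string of 0s and 1s is still wanted, one option that follows from the paragraph above is to build it from the UTF-8 bytes rather than from code points, since every byte fits in exactly 8 bits. This is a sketch with made-up function names, not the only way to do it:

```python
def text_to_bits(text):
    """Encode text as UTF-8 and write each byte as exactly 8 binary digits."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits):
    """Rebuild the bytes from 8-digit chunks and decode them as UTF-8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


sample = "\u2022 Cisco ONE"
encoded = text_to_bits(sample)
print(len(text_to_bits("\u2022")))      # 24 (three bytes of 8 bits)
print(bits_to_text(encoded) == sample)  # True
```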
If, however, you do need to use a format similar to what you are currently using, first, I still recommend you specify an encoding when you call open() (I recommend UTF-8). Then you can consider these options:
Detect and omit non-ASCII characters. They will have an ordinal >= 128.
Mimic UTF-16 or UTF-32 and output multi-byte output for all characters. For example, use "{0:032b}".format(asc11) and then parse the result in 32-character chunks. It's memory and storage inefficient, but it will preserve multi-byte characters.
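Both options can be sketched briefly. The function names below are invented for illustration, and the width of 32 follows the "{0:032b}" suggestion above:

```python
def drop_non_ascii(text):
    """Option 1: keep only characters whose ordinal is below 128."""
    return "".join(ch for ch in text if ord(ch) < 128)


def to_fixed_width_bits(text, width=32):
    """Option 2: write every code point with the same number of binary digits."""
    return "".join(format(ord(ch), f"0{width}b") for ch in text)


def from_fixed_width_bits(bits, width=32):
    """Parse the bit string back in chunks of the same width."""
    return "".join(chr(int(bits[i:i + width], 2)) for i in range(0, len(bits), width))


sample = "\u2022 pillars"
print(drop_non_ascii(sample))                   # " pillars" (bullet removed)
bits = to_fixed_width_bits(sample)
print(len(bits) == 32 * len(sample))            # True
print(from_fixed_width_bits(bits) == sample)    # True
```

The fixed-width form uses four times the space of plain ASCII bits, which is the inefficiency mentioned above.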
Regardless, I highly recommend reading the Dive Into Python 3 chapter about strings.
A user enters a string into an HTML form input in the browser. This string is saved in a database. How is the string encoded and decoded at each stage, based on the character encoding?
Flow as per technology stack used: browser --> ajax post --> spring mvc -->hibernate -->mysql db
You can expect the browser post to be URL-encoded UTF-8. Within the JVM, the string uses UTF-16, which roughly doubles its size if it is English text. Hibernate runs inside the JVM and does not really care about the encoding, although it does pass along the connection string described next (the hibernate.connection.url property).
The UTF-16 string is then translated by the JDBC driver which, in the case of MySQL, will use the characterEncoding property inside the connection string. It helps if this matches the encoding of the database declared in the CREATE DATABASE statement, avoiding another re-encoding.
Finally, "latin" is not a name of a specific character set or encoding. You probably mean ISO 8859-1, also known as Latin-1. This is not a good choice for a web server as it will not be able to represent most non-English strings. You should use UTF-8 in the database and in the connection string, ending up with UTF-8 -> UTF-16 -> UTF-8 which is a safe and reasonably efficient sequence (not counting any encoding that might be happening in the browser itself).
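The stack in question is Java, but the byte-level effect of each stage can be illustrated in a few lines of Python. This is only a sketch of the encodings involved, not of what Spring, Hibernate or the JDBC driver actually execute, and the sample text is made up:

```python
from urllib.parse import quote_plus, unquote_plus

text = "Grüße, 世界"

# Browser: form data is posted as percent-encoded UTF-8.
posted = quote_plus(text)
print(posted)

# Server: the framework decodes the post back into a string.
received = unquote_plus(posted)
print(received == text)  # True

# In memory (UTF-16) versus on the wire to the database (UTF-8).
english = "hello world"
print(len(english.encode("utf-16-le")), len(english.encode("utf-8")))  # 22 11

# Latin-1 cannot represent the non-Western characters at all.
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 failed:", exc.reason)
```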
If you decide to alter the database to use UTF-8, be careful about changing the encoding at table level, too. Each table may use its own encoding and it does not change automatically.