How to get packet.tcp.payload and packet.http.data as string? - python-3.x

The return values for these attributes are in hex format, separated by ':'.
E.g. 70:79:f6:2e or something like this.
When I try to decode it to a plain (human-readable) string, it doesn't work. What encoding is being used? I tried various methods such as codecs.decode(), binascii.unhexlify(), and bytes.fromhex(), as well as different encodings (ASCII and UTF-8). Nothing worked; any help is appreciated. I am using Python 3.6.

Thanks for your question! I believe you want to read the payload in chunks of two hex digits. The functions you tried can't parse the ':' delimiter out of the box. Splitting the string on the ':' delimiter, converting each value to a human-readable character, and joining the resulting list back into a string should do the trick.
hex_string = '70:79:f6:2e'
hex_split = hex_string.split(':')
hex_as_chars = map(lambda hex: chr(int(hex, 16)), hex_split)
human_readable = ''.join(hex_as_chars)
print(human_readable)
Is this what you have in mind?
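For what it's worth, bytes.fromhex() also works once the colons are stripped; just note that bytes such as 0xf6 are not valid ASCII/UTF-8, so you would need an error handler when decoding (a small sketch, not from the original answer):
hex_string = '70:79:f6:2e'
raw = bytes.fromhex(hex_string.replace(':', ''))  # b'py\xf6.'
print(raw.decode('ascii', errors='replace'))      # undecodable bytes become the U+FFFD replacement character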

Related

Convert bytes to string while replacing SOH character

Using Python 3.6. I have a bytes array coming over a socket (a FIX message) that contains the hex character 1, start of header, as a delimiter. The raw bytes look like
b'8=FIX.4.4\x019=65\x0135=0\x0152=20220809-21:37:06.893\x0149=TRADEWEB\x0156=ABFIXREPO\x01347=UTF-8\x0110=045\x01'
I want to store this bytes array to a file for logging. I have seen FIX messages where this delimiter is converted to ^A control character. The final string I would like to have is -
8=FIX.4.4^A9=65^A35=0^A52=20220809-21:37:06.893^A49=TRADEWEB^A56=ABFIXREPO^A347=UTF-8^A10=045^A
I have tried various ways to achieve this, for example repr(bytes) and (ord(b) in bytes), but could not get the result I want.
Any pointers are highly appreciated.
Thanks.
One way I can think of doing this is to split the decoded byte array on the delimiter and re-insert the control character in a loop.
data = b'8=FIX.4.4\x019=65\x0135=0\x0152=20220809-21:37:06.893\x0149=TRADEWEB\x0156=ABFIXREPO\x01347=UTF-8\x0110=045\x01'
# The message ends with \x01, so strip it before splitting to avoid a trailing empty field
fields = data.decode().strip("\x01").split("\x01")
print(fields)
ctl_char = "^A"
string = ""
for field in fields:
    string += field
    string += ctl_char
print(string)
Please note there may be a better way of doing it.
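For example (my suggestion, not part of the original answer), a single replace() call also works, since every \x01 maps directly to the ^A marker:
data = b'8=FIX.4.4\x019=65\x0135=0\x0152=20220809-21:37:06.893\x0149=TRADEWEB\x0156=ABFIXREPO\x01347=UTF-8\x0110=045\x01'
print(data.decode().replace("\x01", "^A"))  # 8=FIX.4.4^A9=65^A...^A10=045^A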

How to parse a configuration file (kind of a CSV format) using LUA

I am using Lua on a small ESP8266 chip, trying to parse a text string that looks like the one below. I am very new to Lua and have tried many similar scripts found on this forum.
data="
-- String to be parsed\r\n
Tiempo1,20\r\n
Tiempo2a,900\r\n
Hora2b,27\r\n
Tiempo2b,20\r\n
Hora2c,29\r\n
Tiempo2c,18\r\n"
My goal would be to parse the string, and return all the configuration pairs (name/value).
If needed, I can modify the syntax of the config file because it is created by me.
I have been trying something like this:
var1,var2 = data:match("([Tiempo2a,]), ([^,]+)")
But it is returning nil, nil. I think I am going about this the wrong way.
Thank you very much for any help.
You need to use gmatch and parse the values while excluding the non-printable characters (\r\n) at the end of each line, or use %d+ to match only the digits.
local data=[[
-- String to be parsed
Tiempo1,20
Tiempo2a,900
Hora2b,27
Tiempo2b,20
Hora2c,29
Tiempo2c,18]]
local t = {}
for k, v in data:gmatch("(%w-),([^%c]+)") do
  t[#t + 1] = { k, v }
  print(k, v)
end

How to use f'string bytes'string together? [duplicate]

I'm looking for a formatted byte string literal. Specifically, something equivalent to
name = "Hello"
bytes(f"Some format string {name}")
Possibly something like fb"Some format string {name}".
Does such a thing exist?
No. The idea is explicitly dismissed in the PEP:
For the same reason that we don't support bytes.format(), you may
not combine 'f' with 'b' string literals. The primary problem
is that an object's __format__() method may return Unicode data
that is not compatible with a bytes string.
Binary f-strings would first require a solution for
bytes.format(). This idea has been proposed in the past, most
recently in PEP 461. The discussions of such a feature usually
suggest either
adding a method such as __bformat__() so an object can control how it is converted to bytes, or
having bytes.format() not be as general purpose or extensible as str.format().
Both of these remain as options in the future, if such functionality
is desired.
In 3.6+ you can do:
>>> a = 123
>>> f'{a}'.encode()
b'123'
You were actually super close in your suggestion; if you add an encoding kwarg to your bytes() call, then you get the desired behavior:
>>> name = "Hello"
>>> bytes(f"Some format string {name}", encoding="utf-8")
b'Some format string Hello'
Caveat: This works in 3.8 for me, but a note at the bottom of the Bytes Objects section of the docs seems to suggest that this should work with any method of string formatting in all of 3.x (use str.format() for versions < 3.6, since that is when f-strings were added, though the OP specifically asks about 3.6+).
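For completeness, the pre-3.6 equivalent mentioned above would look like this (a small sketch using str.format()):
name = "Hello"
print(bytes("Some format string {}".format(name), encoding="utf-8"))  # b'Some format string Hello'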
From Python 3.6.2, this percent formatting for bytes works for some use cases:
print(b"Some stuff %a. Some other stuff" % my_byte_or_unicode_string)
But as AXO commented:
This is not the same. %a (or %r) will give the representation of the string, not the string itself. For example, b'%a' % b'bytes' will give b"b'bytes'", not b'bytes'.
Which may or may not matter, depending on whether you just need to present the formatted byte_or_unicode_string in a UI or you need to do further manipulation.
As noted here, you can format this way:
>>> name = b"Hello"
>>> b"Some format string %b World" % name
b'Some format string Hello World'
You can see more details in PEP 461.
Note that in your example you could simply do something like:
>>> name = b"Hello"
>>> b"Some format string " + name
b'Some format string Hello'
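If the value starts out as a str rather than bytes, encoding it first lets %b work as well (a small sketch, not from the original answers):
name = "Hello"
print(b"Some format string %b World" % name.encode("utf-8"))  # b'Some format string Hello World'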
This was one of the bigger changes made from Python 2 to Python 3: they handle Unicode and strings differently.
This is how you'd convert to bytes.
string = "some string format"
encoded = string.encode()
print(encoded)
And this is how you'd decode back to a string.
encoded.decode()
I got a better appreciation for the Python 2 versus Python 3 change to Unicode through this Coursera lecture by Charles Severance. You can watch the entire 17-minute video or fast forward to around 10:30 if you want to get to the differences between Python 2 and 3 and how they handle characters, specifically Unicode.
I understand your actual question is how you could format a string that has both strings and bytes.
inBytes = b"testing"
inString = 'Hello'
type(inString) #This will yield <class 'str'>
type(inBytes) #this will yield <class 'bytes'>
Here you can see that I have a string variable and a bytes variable.
This is how you would combine bytes and a string into one string (note that the bytes value must be decoded, not encoded).
formattedString = inString + ' ' + inBytes.decode()
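Building on that (just an illustration, not from the original answer), an f-string can produce the combined value directly, and the result can be converted back to bytes if needed:
inBytes = b"testing"
inString = 'Hello'
combined = f"{inString} {inBytes.decode()}"  # 'Hello testing'
print(combined.encode("utf-8"))              # b'Hello testing'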

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Is there anything I can do to tell RequestParser about the string encoding?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure RequestParser() tries to be too smart and can't handle the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to #lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes as UTF-8. So when it sees a non-Unicode input field (among other Unicode fields!), it tries to load it as Unicode and fails. As a result, all characters are U+FFFD (replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.
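As a small follow-up (my suggestion, not part of the original workaround): once text has been recovered as a str, writing the log file explicitly as UTF-8 keeps the Cyrillic characters readable instead of turning them into escapes:
import json
with open('log_msg.txt', 'w', encoding='utf-8') as f:
    json.dump({'text': text}, f, ensure_ascii=False)  # assumes `text` from the snippet above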

Encode to Unicode UTF-8 not working

My Code -
var utf8 = require('utf8');
var y = utf8.encode('एस एम एस गपशप');
console.log(y);
Input -
एस एम एस गपशप
Expecting Output - \xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x8F\xE0\xA4\xAE\x20\xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x97\xE0\xA4\xAA\xE0\xA4\xB6\xE0\xA4\xAA
Example Encoding using utf8.js
Output -
à¤à¤¸ à¤à¤® à¤à¤¸ à¤à¤ªà¤¶à¤ª
What am I doing wrong? Please help!
That code appears to be working. That output looks like UTF-8 bytes interpreted as some 8-bit character set, most likely ISO-8859-1, which is easily recognisable by the repeating patterns.
That example output is just how you would represent that string in source code.
Try this:
var utf8 = require('utf8');
var y = utf8.encode('एस');
console.log(y);
console.log('\xE0\xA4\x8F\xE0\xA4\xB8');
You will probably see the same output twice.
You can easily write some code to get those hexadecimal forms back using a lookup table and the charCodeAt function, but it is a rather unusual way to represent a string in JavaScript. JSON, for example, either uses the literal characters or \uXXXX escapes.
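To illustrate the point about the output being UTF-8 bytes read back as ISO-8859-1 (a quick sketch in Python only because the effect is language-independent; it is not part of the original answer):
text = 'एस एम एस गपशप'
utf8_bytes = text.encode('utf-8')
print(utf8_bytes[:6].hex())          # e0a48fe0a4b8 -- the \xE0\xA4\x8F\xE0\xA4\xB8 bytes for 'एस'
print(utf8_bytes.decode('latin-1'))  # the garbled à¤à¤¸ ... output from the question (C1 control bytes are invisible)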
