Python : stripping, converting bytes type - python-3.x

Under Python 3.10, I have a UDP socket that listens to a COM port.
I receive data like this:
b'SENDPKT: "STN1" "" "SH/DX\r"\x98\x00'
The SH/DX part before the "\r" can change and varies in length, and I need to extract it.
.strip('b\r') doesn't work.
Using .decode() and str(), I tried to convert these bytes to a string for easier manipulation, but that doesn't work either.
I get the error "invalid start byte" at position 27 for 0x98.
Any idea how I can solve this?
Thanks,

For input that contains bytes that are not valid UTF-8, you can try ignoring errors while decoding:
b = b'SENDPKT: "STN1" "" "SH/DX\r"\x98\x00'
s = b.decode(errors='ignore')
res = s[20:s.find('\r')] # 'SH/DX'
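The fixed offset 20 only holds while the station name keeps the same length. As a hedged alternative, the sketch below searches the raw bytes for the quoted field that ends in a carriage return and decodes only that field, so the trailing \x98\x00 never reaches the decoder. The function name extract_command is hypothetical, and the pattern assumes the packet layout shown in the question.

```python
import re

def extract_command(packet: bytes) -> str:
    """Return the text of the quoted field that ends with a carriage return."""
    # Work on bytes directly so undecodable trailing bytes are never decoded.
    match = re.search(rb'"([^"\r]*)\r"', packet)
    if match is None:
        raise ValueError("no command field found")
    return match.group(1).decode("ascii")

packet = b'SENDPKT: "STN1" "" "SH/DX\r"\x98\x00'
print(extract_command(packet))  # SH/DX
```

Because the match is anchored on the closing \r" rather than on a position, a longer or shorter command should be picked up without changes.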

Related

Groovy - String created from UTF8 bytes has wrong characters

The problem came up when getting the result of a web service that returns JSON with Greek characters in it; specifically, the name of the city of Mykonos. Whatever encoding or conversion I use, it is always displayed as: ΜΎΚΟxCE?ΟΣ. It should show: ΜΎΚΟΝΟΣ
With PowerShell I was able to verify that the web service is returning the correct characters.
I narrowed the problem down when the byte array gets converted to a String in Groovy. Below is code that reproduces the issue I have. myUTF8String holds the byte array I get from URLConnection.content.text. The UTF8 byte sequence to look at is 0xce, 0x9d. After converting this to a string and back to a byte array the byte sequence for that character is 0xce, 0x3f. The result of below code will show the difference at position 9 of the original byte array and the one from the converted string. For the below test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets;
def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex();
content = new String(bytes).getBytes()
for ( i = 0; i < content.length; i++ ) {
    if ( bytes[i] != content[i] ) {
        println "Different... at pos " + i
        hex = Long.toUnsignedString( bytes[i], 16).toUpperCase()
        print hex.substring(hex.length()-2,hex.length()) + " != "
        hex = Long.toUnsignedString( content[i], 16).toUpperCase()
        println hex.substring(hex.length()-2,hex.length())
    }
}
Thanks a lot
Andreas
You have to specify the charset name when building a String from bytes; otherwise the default Java charset is used, and it's not necessarily UTF-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem applies to String.getBytes(): pass a charset parameter to get the correct byte sequence.
Just change the following line in your code and the issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
As an option, you can set the default charset for the whole JVM instance with the following command-line parameter:
java -Dfile.encoding=UTF-8 <your application>
But be careful, because it will affect the whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
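The same round trip can be reproduced outside the JVM. The Python sketch below is only an illustration and assumes the default charset in the question was windows-1252 (cp1252), which the question does not state. In that code page the byte 0x9D has no character assigned, so decoding replaces it and encoding back produces a question mark.

```python
# Bytes from the question: the UTF-8 encoding of the Greek spelling of Mykonos.
raw = bytes.fromhex("ce9cce8ece9ace9fce9dce9fcea3")

# Assumption: the platform default charset was windows-1252 (cp1252).
# errors="replace" mimics the silent substitution of unmappable bytes.
lossy = raw.decode("cp1252", errors="replace").encode("cp1252", errors="replace")
diffs = [i for i, (a, b) in enumerate(zip(raw, lossy)) if a != b]
print(diffs, hex(lossy[9]))  # [9] 0x3f

# Naming the charset explicitly on both sides keeps the bytes intact.
roundtrip = raw.decode("utf-8").encode("utf-8")
print(roundtrip == raw)  # True
```

Under that assumption the only damaged byte is the one at position 9, which matches the 0xce, 0x3f sequence described in the question.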

Pickle in python3, error on concating string to bytes

I am converting some code from python2 to 3 and saw an error that the 2to3 did not catch on a line:
pickle.dumps(('predskew', predskewData[0])) + pickleSep
That produces an error in python3:
pickledPredskewData = pickle.dumps(('predskew', predskewData[0])) + pickleSep
TypeError: can't concat str to bytes
I know from other posts on Stack Overflow that I could perhaps use encode() or decode(); I just wasn't sure where or which. So I tried this in Python 2:
pickleSep = ":::::"
pickle.dumps(('predskew',0)) + pickleSep
Which produces:
"(S'predskew'\np0\nI0\ntp1\n.:::::"
Also,
pickle.dumps(('predskew',0)) + pickleSep.encode()
Gives the same result.
Now if I try the same line in python3, I get what 'looks' like vastly different output:
pickle.dumps(('predskew', 0)) + pickleSep.encode()
Gives the output of:
b'\x80\x04\x95\x10\x00\x00\x00\x00\x00\x00\x00\x8c\x08predskew\x94K\x00\x86\x94.:::::'
So I'm not sure my encode fix is the right approach, as the results look different (unless the print is just showing me the bytes themselves?).
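One way to check whether the encoded separator is harmless is to round-trip the data. The sketch below is a minimal check, not a confirmed fix from this thread: it appends a bytes separator, strips it again, and unpickles. The different-looking output comes from the pickle protocol: Python 2 defaults to the text-based protocol 0, while Python 3 defaults to a binary protocol. The name pickle_sep mirrors pickleSep from the question.

```python
import pickle

pickle_sep = b":::::"  # bytes literal, equivalent to ":::::".encode()

# Python 3 default (binary) protocol plus the separator.
payload = pickle.dumps(("predskew", 0)) + pickle_sep

# Strip the separator and load the pickle again.
body = payload[:-len(pickle_sep)]
restored = pickle.loads(body)
print(restored)  # ('predskew', 0)

# Protocol 0 is the older text-based format that Python 2 used by default.
legacy = pickle.dumps(("predskew", 0), protocol=0)
print(legacy)
```

If several pickles are joined with the separator and later split on it, keep in mind that a binary pickle may happen to contain the same byte sequence, so a length prefix can be a safer framing choice.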

From SSH not decoded from bytes to ASCII?

Good afternoon.
I get the example below from SSH:
b"rxmop:moty=rxotg;\x1b[61C\r\nRADIO X-CEIVER ADMINISTRATION\x1b[50C\r\nMANAGED OBJECT DATA\x1b[60C\r\n\x1b[79C\r\nMO\x1b[9;19HRSITE\x1b[9;55HCOMB FHOP MODEL\x1b[8C\r\nRXOTG-58\x1b[10;19H54045_1800\x1b[10;55HHYB"
I process it with ssh.recv(99999).decode('ASCII'),
but some characters are not decoded, for example:
\x1b[61C
\x1b[50C
\x1b[9;55H
\x1b[9;19H
The article below explains that these are ANSI escape codes, which appear because I use invoke_shell. Everything worked previously, until the service moved to another server.
Is there a simple way to get rid of junk values that come when you SSH using Python's Paramiko library and fetch output from CLI of a remote machine?
When I write to the file, I also get:
rxmop:moty=rxotg;[61C
RADIO X-CEIVER ADMINISTRATION[50C
MANAGED OBJECT DATA[60C
[79C
MO[9;19HRSITE[9;55HCOMB FHOP MODEL[8C
RXOTG-58[10;19H54045_1800[10;55HHYB
If you use PuTTY, everything is displayed cleanly.
I can't avoid invoke_shell because the connection hops from one server to another.
Sample code below:
# coding:ascii
import time
import paramiko
port = 22
data = ""
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# host, user and secret are defined elsewhere
client.connect(hostname=host, username=user, password=secret, port=port, timeout=10)
ssh = client.invoke_shell()
ssh.send("rxmop:moty=rxotg;\n")
# keep reading until "<" shows up in the output
while data.find("<") == -1:
    time.sleep(0.1)
    data += ssh.recv(99999).decode('ascii')
ssh.close()
client.close()
f = open('text.txt', 'w')
f.write(data)
f.close()
The normal output is below:
MO RSITE COMB FHOP MODEL
RXOTG-58 54045_1800 HYB BB G12
SWVERREPL SWVERDLD SWVERACT TMODE
B1314R081D TDM
CONFMD CONFACT TRACO ABISALLOC CLUSTERID SCGR
NODEL 4 POOL FLEXIBLE
DAMRCR CLTGINST CCCHCMD SWVERCHG
NORMAL UNLOCKED
PTA JBSDL PAL JBPTA
TGFID SIGDEL BSSWANTED PACKALG
H'0001-19B3 NORMAL
What can you recommend to get the normal output back, so that all characters are processed?
Regular expressions do not help: stripping the codes shifts the structure of each record, and the code later selects characters from fixed positions.
PS: using ssh.invoke_shell(term='xterm') doesn't work either.
There is an answer here:
How can I remove the ANSI escape sequences from a string in python
There are other ways...
https://unix.stackexchange.com/questions/14684/removing-control-chars-including-console-codes-colours-from-script-output
Essentially, you are 'screen-scraping' input, and you need to strip the ANSI codes. So, grab the input, and then strip the codes.
import re
import time
... (your ssh connection here)
data = ""
while data.find("<") == -1:
    time.sleep(0.1)
    chunk = ssh.recv(99999)
    data += chunk.decode('ascii')  # recv() returns bytes; decode before appending to the str
... (your ssh connection cleanup here)
ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
data = ansi_escape.sub('', data)
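Removing the escape codes outright loses the column alignment the question relies on, because sequences such as \x1b[9;19H move the cursor instead of printing spaces. The sketch below is a rough, hypothetical renderer (render_line and render are names made up for this example). It assumes each line can be laid out independently, that only the C (cursor forward) and H (cursor position) sequences matter, and it ignores the row part of H.

```python
import re

# CSI sequence: ESC [ parameters final-byte
CSI = re.compile(r'\x1b\[([0-9;]*)([@-~])')

def render_line(line: str) -> str:
    """Lay out one line, honouring cursor-forward (C) and cursor-position (H)."""
    cells = []  # one character per screen column
    col = 0     # current cursor column, 0-based

    def put(text: str) -> None:
        nonlocal col
        for ch in text:
            if col >= len(cells):
                cells.extend(" " * (col - len(cells) + 1))  # pad up to the cursor
            cells[col] = ch
            col += 1

    pos = 0
    for m in CSI.finditer(line):
        put(line[pos:m.start()])
        pos = m.end()
        params = [int(p) if p else 1 for p in m.group(1).split(";")]
        if m.group(2) == "C":    # move the cursor right by n columns
            col += params[0]
        elif m.group(2) == "H":  # row;col, 1-based; the row is ignored here
            col = (params[1] if len(params) > 1 else 1) - 1
        # any other sequence is dropped
    put(line[pos:])
    return "".join(cells).rstrip()

def render(data: str) -> str:
    return "\n".join(render_line(line) for line in data.split("\r\n"))

sample = "MO\x1b[9;19HRSITE\x1b[9;55HCOMB FHOP MODEL\x1b[8C\r\nRXOTG-58\x1b[10;19H54045_1800\x1b[10;55HHYB"
print(render(sample))
```

With the sample from the question, RSITE and 54045_1800 both start at column 19 and COMB and HYB at column 55, so fixed-position slicing should work again. A terminal emulator library would be the more complete option if the device sends other sequences.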

Unknown encoding CP500

I need to convert a String to a byte array by using the CP500 encoding.
I tried this line:
const byteArray = Buffer.from(someString, "cp500");
Which led to:
TypeError [ERR_UNKNOWN_ENCODING]: Unknown encoding: cp500
I googled "node cp500" and looked at this answer but I wasn't able to find any information pointing to cp500 support in node/javascript.
In addition, I can't find any mention of a plugin that supports this specific encoding.
Is there a way to get a buffer of bytes from a string in node.js with the cp500 encoding?
I used the codepage package that was pointed by Xaqron in a comment.
I had to import it as:
const codepage: typeof import('codepage').default = require('codepage');
Then, I used the package's encode function as follows in order to encode my string:
codepage.utils.encode(500, somestring, 'arr');
The first argument, 500, is the code page number, which selects the target encoding.
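To cross-check the bytes produced on the Node side, Python's standard library ships a cp500 codec. The sketch below is only a verification aid in a different language, not part of the original answer, and the sample string is made up.

```python
# Encode a sample string to CP500 (an EBCDIC code page) with the built-in codec.
text = "Hello"
encoded = text.encode("cp500")

print(list(encoded))            # byte values as a list of integers
print(encoded.decode("cp500"))  # Hello
```

Comparing this list with the array returned by the codepage package for the same input is a quick way to confirm that both sides agree on the code page.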

Decode UTF8 symbols

I have a string in swift:
let flag = "Cattì ò"
I am trying to convert the UTF8 symbols.
I have tried using
stringByRemovingPercentEncoding
but nothing changes. How can I convert the symbols properly?
Welcome to the encoding guessing game! It looks like somewhere along the way, your string was interpreted with the wrong code page. Here's one way to guess it:
let flag = "Cattì ò"
let encodings = [NSASCIIStringEncoding,
                 NSNEXTSTEPStringEncoding,
                 NSJapaneseEUCStringEncoding,
                 NSUTF8StringEncoding,
                 NSISOLatin1StringEncoding,
                 NSSymbolStringEncoding,
                 NSNonLossyASCIIStringEncoding,
                 NSShiftJISStringEncoding,
                 NSISOLatin2StringEncoding,
                 NSUnicodeStringEncoding,
                 NSWindowsCP1251StringEncoding,
                 NSWindowsCP1252StringEncoding,
                 NSWindowsCP1253StringEncoding,
                 NSWindowsCP1254StringEncoding,
                 NSWindowsCP1250StringEncoding,
                 NSISO2022JPStringEncoding,
                 NSMacOSRomanStringEncoding,
                 NSUTF16StringEncoding,
                 NSUTF16BigEndianStringEncoding,
                 NSUTF16LittleEndianStringEncoding,
                 NSUTF32StringEncoding,
                 NSUTF32BigEndianStringEncoding,
                 NSUTF32LittleEndianStringEncoding]
for encoding in encodings {
    if let bytes = flag.cStringUsingEncoding(encoding),
        flag_utf8 = String(CString: bytes, encoding: NSUTF8StringEncoding) {
        print("\(encoding): \(flag_utf8)")
    }
}
The array contains all the encodings that Cocoa supports.
From the results, it seems your string was encoded in NSISOLatin1StringEncoding (a.k.a. ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, which does not exactly match your desired result but is the closest among all code pages.
Other good candidates are NSWindowsCP1252StringEncoding and NSWindowsCP1254StringEncoding so I'd suggest you check with other strings.
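The guessing loop can be sketched in Python as well. The garbled input below is an assumption: it is built by decoding the UTF-8 bytes of "Cattì ò" as Latin-1, which is the kind of damage the answer describes. The candidate list and the name guess_repairs are hypothetical.

```python
CANDIDATES = ["ascii", "latin-1", "cp1252", "cp1254", "shift_jis"]

def guess_repairs(garbled: str) -> dict:
    """Re-encode with each candidate and keep results that decode as UTF-8."""
    results = {}
    for enc in CANDIDATES:
        try:
            results[enc] = garbled.encode(enc).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this candidate cannot explain the garbled text
    return results

# Assumed damage: UTF-8 bytes that were wrongly read as Latin-1.
garbled = "Cattì ò".encode("utf-8").decode("latin-1")
for enc, fixed in guess_repairs(garbled).items():
    print(enc, fixed)
```

Several code pages can survive the loop for a short sample like this one, which is why checking additional strings helps to narrow the guess down.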

Resources