Groovy - String created from UTF8 bytes has wrong characters - groovy

The problem came up when getting the result of a web service returning json with Greek characters in it. Actually it is the city of Mykonos. The challenge is whatever encoding or conversion I'm using it is always displayed as:ΜΎΚΟxCE?ΟΣ . But it should show: ΜΎΚΟΝΟΣ
With Powershell I was able to verify, that the web service is returning the correct characters.
I narrowed the problem down when the byte array gets converted to a String in Groovy. Below is code that reproduces the issue I have. myUTF8String holds the byte array I get from URLConnection.content.text. The UTF8 byte sequence to look at is 0xce, 0x9d. After converting this to a string and back to a byte array the byte sequence for that character is 0xce, 0x3f. The result of below code will show the difference at position 9 of the original byte array and the one from the converted string. For the below test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets;
def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex();
content = new String(bytes).getBytes()
for ( i = 0; i < content.length; i++ ) {
if ( bytes[i] != content[i] ) {
println "Different... at pos " + i
hex = Long.toUnsignedString( bytes[i], 16).toUpperCase()
print hex.substring(hex.length()-2,hex.length()) + " != "
hex = Long.toUnsignedString( content[i], 16).toUpperCase()
println hex.substring(hex.length()-2,hex.length())
}
}
Thanks a lot
Andreas

you have to specify charset name when building String from bytes otherwise default java charset will be used - and it's not necessary urf-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem with String.getBytes() - use charset parameter to get correct byte sequence.
Just change the following line in your code and issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
as an option you can set default charset for the whole JVM instance with the following command line parameter:
java -Dfile.encoding=UTF-8 <your application>
but be careful because it will affect whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F

Related

Python : stripping, converting bytes type

Under Python 3.10, I do have an UDP socket that listens to a COM port.
I do get datas like this :
b'SENDPKT: "STN1" "" "SH/DX\r"\x98\x00'
The infos SH/DX before the "\n" can change and has a different length and I need to extract them.
.strip('b\r') doesn't work.
Using .decode() and str(), I tried to convert this bytes datas to a string for easier manipulation, but that doesn't work either.
I get an error "invalid start byte at position 27 for 0x98
Any guess, how I can solve this ?
Thanks,
For sophisticated input you can try ignoring errors while decoding:
b = b'SENDPKT: "STN1" "" "SH/DX\r"\x98\x00'
s = b.decode(errors='ignore')
res = s[20:s.find('\r')] # 'SH/DX'

Includes return false even though the String is inside the string

I want to see if the String "Streamer" is inside, but it always returns false. The String isn't for sure inside the file. I think that the String is really a String. If I print the whole String some weird characters appear on top of the file.
Console Image
// index.js
let sini = fs.readFileSync("C:\\Windows\\Sandboxie.ini", 'utf-8');
console.log(sini.includes("Streamer"))
# Sandboxie.ini
#
# Sandboxie-Plus configuration file
#
[GlobalSettings]
Template=WindowsRasMan
Template=WindowsLive
Template=OfficeLicensing
Template=OfficeClickToRun
[UserSettings_0D06021F]
SbieCtrl_UserName=tomie
SbieCtrl_NextUpdateCheck=1655566649
SbieCtrl_WindowCoords=121,311,1237,632
SbieCtrl_ActiveView=40021
SbieCtrl_ProcessViewColumnWidths=250,70,300
SbieCtrl_HideMessage=2223,Eckernfoerde.mov [Mjolnir4285177803 / 115906982]
SbieCtrl_ExplorerWarn=n
SbieCtrl_BoxExpandedView=DefaultBox
[DefaultBox]
ConfigLevel=9
AutoRecover=y
BlockNetworkFiles=y
Template=OpenSmartCard
Template=OpenBluetooth
Template=SkipHook
Template=FileCopy
Template=qWave
Template=BlockPorts
Template=LingerPrograms
Template=AutoRecoverIgnore
RecoverFolder=%{374DE290-123F-4565-9164-39C4925E467B}%
RecoverFolder=%Personal%
RecoverFolder=%Desktop%
BorderColor=#00FFFF,ttl,6
Enabled=y
[Streamer360384]
Enabled=y
[Streamer127138]
Enabled=y

Groovy - how to Decode a base 64 string and save it to a pdf/doc in local directory

Question
I need help with decoding a base 64 string and saving it to a pdf/doc in local directory using groovy
This script should work in SOAP UI
The base64 string is 52854 characters long
I have tried the following
File f = new File("c:\\document1.doc")
FileOutputStream out = null
byte[] b1 = Base64.decodeBase64(base64doccontent);
out = new FileOutputStream(f)
try {
out.write(b1)
} finally {
out.close()
}
But - it gives me below error
No signature of method: static org.apache.commons.codec.binary.Base64.decodeBase64() is applicable for argument types: (java.lang.String) values: [base64stringlong] Possible solutions: decodeBase64([B), encodeBase64([B), encodeBase64([B, boolean)
Assuming the base64 encoded text is coming from a file, a minimal example for soapUI would be:
import com.itextpdf.text.*
import com.itextpdf.text.pdf.PdfWriter;
String encodedContents = new File('/path/to/file/base64Encoded.txt')
.getText('UTF-8')
byte[] b1 = encodedContents.decodeBase64();
// Save as a text file
new File('/path/to/file/base64Decoded.txt').withOutputStream {
it.write b1
}
// Or, save as a PDF
def document = new Document()
PdfWriter.getInstance(document,
new FileOutputStream('/path/to/file/base64Decoded.pdf'))
document.open()
document.add(new Paragraph(new String(b1)))
document.close()
The File.withOutputStream method will ensure the stream is closed when the closure returns.
Or, to convert the byte array to a PDF, I used iText. I dropped itextpdf-5.5.13.jar in soapUI's bin/ext directory and restarted and then it was available for Groovy.

Trouble in decoding the Path of the Blob file in Azure Search

I have set a Azure search for blob storage and since the path of the file is a key property it is encoded to Base 64 format.
While searching the Index, I need to decode the path and display it in the front end. But when I try to do that in few of the scenarios it throws error.
int mod4 = base64EncodedData.Length % 4;
if (mod4 > 0)
{
base64EncodedData += new string('=', 4 - mod4);
}
var base64EncodedBytes = System.Convert.FromBase64String(base64EncodedData);
return System.Text.Encoding.ASCII.GetString(base64EncodedBytes);
Please let me know what is the correct way to do it.
Thanks.
Refer to Base64Encode and Base64Decode mapping functions - the encoding details are documented there.
In particular, if you're using .NET, you should use HttpServerUtility.UrlTokenDecode method with UTF-8 encoding, not ASCII.

Decode UTF8 symbols

I have a string in swift:
let flag = "Cattì ò"
I am trying to convert the UTF8 symbols.
I have tried using
stringByRemovingPercentEncoding
but noting changes. How can I convert the symbols properly ?
Welcome to the encoding guessing game! Look like somewhere along the pathway, your string didn't get the correct code page. Here's one way to guess it:
let flag = "Cattì ò"
let encodings = [NSASCIIStringEncoding,
NSNEXTSTEPStringEncoding,
NSJapaneseEUCStringEncoding,
NSUTF8StringEncoding,
NSISOLatin1StringEncoding,
NSSymbolStringEncoding,
NSNonLossyASCIIStringEncoding,
NSShiftJISStringEncoding,
NSISOLatin2StringEncoding,
NSUnicodeStringEncoding,
NSWindowsCP1251StringEncoding,
NSWindowsCP1252StringEncoding,
NSWindowsCP1253StringEncoding,
NSWindowsCP1254StringEncoding,
NSWindowsCP1250StringEncoding,
NSISO2022JPStringEncoding,
NSMacOSRomanStringEncoding,
NSUTF16StringEncoding,
NSUTF16BigEndianStringEncoding,
NSUTF16LittleEndianStringEncoding,
NSUTF32StringEncoding,
NSUTF32BigEndianStringEncoding,
NSUTF32LittleEndianStringEncoding]
for encoding in encodings {
if let bytes = flag.cStringUsingEncoding(encoding),
flag_utf8 = String(CString: bytes, encoding: NSUTF8StringEncoding) {
print("\(encoding): \(flag_utf8)")
}
}
The array contains all the encodings that Cocoa supports.
From the results, it seems like your string was encoded in NSISOLatin1StringEncoding (a.k.a ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, not exactly match your desired result but is the closest among all code pages.
Other good candidates are NSWindowsCP1252StringEncoding and NSWindowsCP1254StringEncoding so I'd suggest you check with other strings.

Resources