Groovy - String created from UTF8 bytes has wrong characters


The problem came up when reading the result of a web service that returns JSON with Greek characters in it, namely the city of Mykonos. Whatever encoding or conversion I use, the name is always displayed as ΜΎΚΟxCE?ΟΣ, but it should show ΜΎΚΟΝΟΣ.
With PowerShell I was able to verify that the web service returns the correct characters.
I narrowed the problem down to the point where the byte array is converted to a String in Groovy. The code below reproduces the issue. myUTF8String holds, as a hex string, the bytes I get from URLConnection.content.text. The UTF-8 byte sequence to look at is 0xce, 0x9d. After converting the bytes to a String and back to a byte array, the sequence for that character is 0xce, 0x3f. The code below reports the difference at position 9 between the original byte array and the one obtained from the converted String. For this test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets;
def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex();
content = new String(bytes).getBytes()
for ( i = 0; i < content.length; i++ ) {
    if ( bytes[i] != content[i] ) {
        println "Different... at pos " + i
        hex = Long.toUnsignedString( bytes[i], 16).toUpperCase()
        print hex.substring(hex.length()-2,hex.length()) + " != "
        hex = Long.toUnsignedString( content[i], 16).toUpperCase()
        println hex.substring(hex.length()-2,hex.length())
    }
}
Thanks a lot
Andreas

You have to specify the charset name when building a String from bytes; otherwise the default Java charset is used, and that is not necessarily UTF-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem applies to String.getBytes(): pass the charset parameter to get the correct byte sequence.
Just change the following line in your code and the issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
As an option, you can set the default charset for the whole JVM instance with the following command-line parameter:
java -Dfile.encoding=UTF-8 <your application>
But be careful, because it affects the whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
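The effect of the charset argument can be sketched in plain Java, which also runs under Groovy with minor changes. This is only an illustration: the class CharsetRoundTrip and its helpers are hypothetical names, and US-ASCII stands in for "a default charset that cannot represent the input", since the actual default on the asker's machine is not known. On a JVM whose default is windows-1252, for instance, the byte 0x9d has no assigned character, which would be consistent with only that byte turning into 0x3f.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {

    // Parse a hex string such as "ce9d" into raw bytes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decode and re-encode with the same explicit charset.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] original = fromHex("ce9cce8ece9ace9fce9dce9fcea3");

        // Explicit UTF-8: the bytes survive the round trip unchanged.
        byte[] utf8 = roundTrip(original, StandardCharsets.UTF_8);
        System.out.println("UTF-8 lossless:    " + Arrays.equals(original, utf8));

        // A charset that cannot represent the input: bytes come back as '?' (0x3f).
        byte[] ascii = roundTrip(original, StandardCharsets.US_ASCII);
        System.out.println("US-ASCII lossless: " + Arrays.equals(original, ascii));

        // Shows which charset new String(bytes) and getBytes() would use implicitly.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
    }
}
```

Printing Charset.defaultCharset() on the affected machine is a quick way to confirm which charset the no-argument calls are picking up.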

Related

Python : stripping, converting bytes type

Under Python 3.10, I have a UDP socket that listens to a COM port.
I get data like this:
b'SENDPKT: "STN1" "" "SH/DX\r"\x98\x00'
The SH/DX part before the "\r" can change and has a varying length, and I need to extract it.
.strip('b\r') doesn't work.
Using .decode() and str(), I tried to convert these bytes to a string for easier manipulation, but that doesn't work either.
I get the error "invalid start byte" at position 27 for 0x98.
Any idea how I can solve this?
Thanks,

For input that contains bytes which are not valid UTF-8, you can try ignoring errors while decoding:
b = b'SENDPKT: "STN1" "" "SH/DX\r"\x98\x00'
s = b.decode(errors='ignore')
res = s[20:s.find('\r')] # 'SH/DX'

Includes return false even though the String is inside the string

I want to check whether the string "Streamer" is in the file, but includes always returns false. The string is definitely in the file, and I think the value really is a string. If I print the whole string, some weird characters appear at the top of the file.
// index.js
let sini = fs.readFileSync("C:\\Windows\\Sandboxie.ini", 'utf-8');
console.log(sini.includes("Streamer"))
# Sandboxie.ini
#
# Sandboxie-Plus configuration file
#
[GlobalSettings]
Template=WindowsRasMan
Template=WindowsLive
Template=OfficeLicensing
Template=OfficeClickToRun
[UserSettings_0D06021F]
SbieCtrl_UserName=tomie
SbieCtrl_NextUpdateCheck=1655566649
SbieCtrl_WindowCoords=121,311,1237,632
SbieCtrl_ActiveView=40021
SbieCtrl_ProcessViewColumnWidths=250,70,300
SbieCtrl_HideMessage=2223,Eckernfoerde.mov [Mjolnir4285177803 / 115906982]
SbieCtrl_ExplorerWarn=n
SbieCtrl_BoxExpandedView=DefaultBox
[DefaultBox]
ConfigLevel=9
AutoRecover=y
BlockNetworkFiles=y
Template=OpenSmartCard
Template=OpenBluetooth
Template=SkipHook
Template=FileCopy
Template=qWave
Template=BlockPorts
Template=LingerPrograms
Template=AutoRecoverIgnore
RecoverFolder=%{374DE290-123F-4565-9164-39C4925E467B}%
RecoverFolder=%Personal%
RecoverFolder=%Desktop%
BorderColor=#00FFFF,ttl,6
Enabled=y
[Streamer360384]
Enabled=y
[Streamer127138]
Enabled=y

Groovy - how to Decode a base 64 string and save it to a pdf/doc in local directory

Question
I need help decoding a Base64 string and saving it as a PDF/DOC file in a local directory using Groovy.
The script should work in SoapUI.
The Base64 string is 52854 characters long.
I have tried the following:
File f = new File("c:\\document1.doc")
FileOutputStream out = null
byte[] b1 = Base64.decodeBase64(base64doccontent);
out = new FileOutputStream(f)
try {
out.write(b1)
} finally {
out.close()
}
But it gives me the error below:
No signature of method: static org.apache.commons.codec.binary.Base64.decodeBase64() is applicable for argument types: (java.lang.String) values: [base64stringlong] Possible solutions: decodeBase64([B), encodeBase64([B), encodeBase64([B, boolean)

Assuming the base64 encoded text is coming from a file, a minimal example for soapUI would be:
import com.itextpdf.text.*
import com.itextpdf.text.pdf.PdfWriter;
String encodedContents = new File('/path/to/file/base64Encoded.txt')
.getText('UTF-8')
byte[] b1 = encodedContents.decodeBase64();
// Save as a text file
new File('/path/to/file/base64Decoded.txt').withOutputStream {
it.write b1
}
// Or, save as a PDF
def document = new Document()
PdfWriter.getInstance(document,
new FileOutputStream('/path/to/file/base64Decoded.pdf'))
document.open()
document.add(new Paragraph(new String(b1)))
document.close()
The File.withOutputStream method will ensure the stream is closed when the closure returns.
Or, to convert the byte array to a PDF, I used iText. I dropped itextpdf-5.5.13.jar in soapUI's bin/ext directory and restarted and then it was available for Groovy.
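If the Base64 text already encodes a complete PDF or DOC file, writing the decoded bytes straight to disk may be all that is needed, whereas wrapping them in an iText Paragraph treats them as text. A minimal sketch using only the JDK's java.util.Base64, assuming the input is standard (not URL-safe) Base64; the class name Base64ToFile and the method decodeToFile are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class Base64ToFile {

    // Decode Base64 text and write the raw bytes to the target file.
    // The MIME decoder is used because it ignores line breaks in the input.
    static Path decodeToFile(String base64Text, Path target) throws IOException {
        byte[] decoded = Base64.getMimeDecoder().decode(base64Text);
        return Files.write(target, decoded);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload; in practice this would be the long Base64 string.
        String encoded = Base64.getEncoder()
                .encodeToString("hello".getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("decoded", ".bin");
        decodeToFile(encoded, target);
        System.out.println(Files.size(target) + " bytes written to " + target);
    }
}
```

Because the method takes a String, it also sidesteps the signature mismatch from the question, where decodeBase64 only accepted a byte array.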

Trouble in decoding the Path of the Blob file in Azure Search

I have set up Azure Search for blob storage, and since the path of the file is a key property, it is encoded in Base64 format.
While searching the index, I need to decode the path and display it in the front end. But when I try to do that, it throws an error in a few scenarios.
int mod4 = base64EncodedData.Length % 4;
if (mod4 > 0)
{
base64EncodedData += new string('=', 4 - mod4);
}
var base64EncodedBytes = System.Convert.FromBase64String(base64EncodedData);
return System.Text.Encoding.ASCII.GetString(base64EncodedBytes);
Please let me know what is the correct way to do it.
Thanks.

Refer to Base64Encode and Base64Decode mapping functions - the encoding details are documented there.
In particular, if you're using .NET, you should use HttpServerUtility.UrlTokenDecode method with UTF-8 encoding, not ASCII.

Decode UTF8 symbols

I have a string in swift:
let flag = "CattÃ¬ Ã²"
I am trying to convert the UTF8 symbols.
I have tried using
stringByRemovingPercentEncoding
but nothing changes. How can I convert the symbols properly?

Welcome to the encoding guessing game! It looks like somewhere along the way your string was decoded with the wrong code page. Here's one way to guess it:
let flag = "CattÃ¬ Ã²"
let encodings = [NSASCIIStringEncoding,
NSNEXTSTEPStringEncoding,
NSJapaneseEUCStringEncoding,
NSUTF8StringEncoding,
NSISOLatin1StringEncoding,
NSSymbolStringEncoding,
NSNonLossyASCIIStringEncoding,
NSShiftJISStringEncoding,
NSISOLatin2StringEncoding,
NSUnicodeStringEncoding,
NSWindowsCP1251StringEncoding,
NSWindowsCP1252StringEncoding,
NSWindowsCP1253StringEncoding,
NSWindowsCP1254StringEncoding,
NSWindowsCP1250StringEncoding,
NSISO2022JPStringEncoding,
NSMacOSRomanStringEncoding,
NSUTF16StringEncoding,
NSUTF16BigEndianStringEncoding,
NSUTF16LittleEndianStringEncoding,
NSUTF32StringEncoding,
NSUTF32BigEndianStringEncoding,
NSUTF32LittleEndianStringEncoding]
for encoding in encodings {
if let bytes = flag.cStringUsingEncoding(encoding),
flag_utf8 = String(CString: bytes, encoding: NSUTF8StringEncoding) {
print("\(encoding): \(flag_utf8)")
}
}
The array contains all the encodings that Cocoa supports.
From the results, it seems your string was encoded in NSISOLatin1StringEncoding (a.k.a. ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, which does not exactly match your desired result but is the closest among all code pages.
Other good candidates are NSWindowsCP1252StringEncoding and NSWindowsCP1254StringEncoding, so I'd suggest you check with other strings.
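The repair that the loop arrives at (re-encode the garbled text as ISO-8859-1, then decode the resulting bytes as UTF-8) can also be sketched outside Cocoa. The Java version below is an illustration only, assuming ISO-8859-1 really was the wrong code page; the class name MojibakeRepair and the method repair are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Re-encode the garbled text as ISO-8859-1 to recover the original bytes,
    // then decode those bytes as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The garbled sample string from the question, written with Unicode
        // escapes so that the source file encoding does not matter.
        String flag = "Catt\u00C3\u00AC \u00C3\u00B2";
        System.out.println(repair(flag));
    }
}
```

For the sample string this should yield Cattì ò, the same result the answer reports for ISO-8859-1.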
