Vala: Reading UTF-8 string from bytes not recognizing multibyte characters

For the application that I am currently working on, I am required to read UTF-8 encoded strings from a binary file. These strings are not null-terminated, but rather are prefaced with a byte specifying their length.
When I attempt to read in such a string, all multibyte UTF-8 characters become ?. Find below a sample:
public void main(string[] args) {
File file = File.new_for_path("test.bin");
DataInputStream instream = new DataInputStream(file.read());
uint8[] chars = new uint8[instream.read_byte()];
instream.read(chars);
print(@"$((string) chars)\n");
}
This is, of course, a stripped-down sample. The actual binary files in question are encrypted, which is not reflected here. I run it against a sample file test.bin that contains the byte sequence 09 52 C3 AD 61 73 74 72 61 64, that is, Ríastrad in UTF-8 prefaced with its byte length. The expected output is thus Ríastrad, but the actual output is R?astrad.
Might anyone be able to shed some light on the problem and, perhaps, a solution?

You need to add Intl.setlocale (); to your code:
public void main(string[] args) {
Intl.setlocale ();
File file = File.new_for_path("test.bin");
DataInputStream instream = new DataInputStream(file.read());
uint8[] chars = new uint8[instream.read_byte()];
instream.read(chars);
print(@"$((string) chars)\n");
}
The default locale for print () is the C locale, which is US ASCII. Any character outside the US ASCII character range is presented as a ?. Using Intl.setlocale (); sets the locale to be the same as the machine running the program.
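For comparison, here is a minimal sketch of the same length-prefixed read in Java (the question itself uses Vala; the class and method names below are hypothetical). It only illustrates that the decoding step is unproblematic once the charset is named explicitly, which is consistent with the answer above: the bytes were fine and only the output side was replacing characters.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedReader {
    /** Reads one unsigned length byte, then that many bytes, and decodes them as UTF-8. */
    static String readString(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            int length = data.readUnsignedByte(); // 0..255, never negative
            byte[] raw = new byte[length];
            data.readFully(raw);                  // throws on a short read instead of returning early
            return new String(raw, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The sample bytes from the question: 09 52 C3 AD 61 73 74 72 61 64
        byte[] sample = {0x09, 0x52, (byte) 0xC3, (byte) 0xAD, 0x61, 0x73, 0x74, 0x72, 0x61, 0x64};
        String text = readString(new ByteArrayInputStream(sample));
        System.out.println(text.length()); // 8 characters decoded from 9 bytes
    }
}
```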


ByteBuffer to String & vice versa gives a different result

I have created two helper functions (one for ByteBuffer to String and one for the reverse):
public static Charset charset = Charset.forName("UTF-8");
public static String bb_to_str(ByteBuffer buffer, Charset charset){
System.out.println("Printing start");
byte[] bytes;
if(buffer.hasArray()) {
bytes = buffer.array();
} else {
bytes = new byte[buffer.remaining()];
buffer.get(bytes);
}
return new String(bytes, charset);
}
public static ByteBuffer str_to_bb(String msg, Charset charset){
return ByteBuffer.wrap(msg.getBytes(charset));
}
I have a data key that I am encrypting using AWS KMS, which gives me a ByteBuffer.
// Encrypt the data key using AWS KMS
ByteBuffer plaintext = ByteBuffer.wrap("ankit".getBytes(charset));
EncryptRequest req = new EncryptRequest().withKeyId(keyId);
req.setPlaintext(plaintext);
ByteBuffer ciphertext = kmsClient.encrypt(req).getCiphertextBlob();
// Convert the byte buffer to String
String cip = bb_to_str(ciphertext, charset);
Now the issue is that this does not work:
DecryptRequest req1 = new DecryptRequest().withCiphertextBlob(str_to_bb(cip, charset)).withKeyId(keyId);
but this works:
DecryptRequest req1 = new DecryptRequest().withCiphertextBlob(ciphertext).withKeyId(keyId);
What is wrong with my code?
The error is trying to convert an arbitrary byte array into a String in bb_to_str(ciphertext, charset);.
ciphertext does not in any reasonable way represent a readable string, and definitely doesn't use the charset that you specify (whichever one it is).
String is meant to represent Unicode text. Trying to use it to represent anything else will run into any number of problems (mostly related to encodings).
In some programming languages the string type is a binary string (i.e. doesn't strictly represent Unicode text), but those are usually the same languages that cause massive encoding confusions.
If you want to represent an arbitrary byte[] as a String for some reason, then you need to pick some encoding to represent it. Common choices are Base64 and hex strings. Base64 is more compact; a hex string is conceptually simpler but takes up more space for the same amount of input data.
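The difference can be sketched with the standard java.util.Base64 class. This is only an illustration under assumptions: the class and method names are hypothetical, and the five bytes stand in for a real KMS ciphertext blob. It compares the lossy route through a UTF-8 String with a Base64 round trip.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryAsText {
    /** Encodes the remaining bytes of the buffer as Base64 without moving its position. */
    static String toBase64(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes);
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Decodes Base64 text back into a buffer holding the original bytes. */
    static ByteBuffer fromBase64(String text) {
        return ByteBuffer.wrap(Base64.getDecoder().decode(text));
    }

    /** The lossy route from the question: decode arbitrary bytes as UTF-8, then re-encode. */
    static byte[] throughUtf8String(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for ciphertext: contains bytes that are not valid UTF-8.
        byte[] fakeCiphertext = {0x00, (byte) 0xFF, (byte) 0xC3, 0x28, (byte) 0x80};
        byte[] viaString = throughUtf8String(fakeCiphertext);
        byte[] viaBase64 = fromBase64(toBase64(ByteBuffer.wrap(fakeCiphertext))).array();
        System.out.println(Arrays.equals(fakeCiphertext, viaString)); // false
        System.out.println(Arrays.equals(fakeCiphertext, viaBase64)); // true
    }
}
```

Invalid UTF-8 sequences are replaced during decoding, so the original bytes cannot be recovered from the String. Note also that buffer.array() returns the whole backing array regardless of position and limit, which is why the sketch copies only the remaining bytes.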

Converting string to bytes32 to be used in solidity contract call

How can I convert a Perl string to bytes32 like the java function below:
public static Bytes32 stringToBytes32(String string) {
byte[] byteValue = string.getBytes();
byte[] byteValueLen32 = new byte[32];
System.arraycopy(byteValue, 0, byteValueLen32, 0, byteValue.length);
return new Bytes32(byteValueLen32);
}
Is there any module available in CPAN to do this?
It looks like all this function does is encode a string into bytes, then truncate/pad it so the result is exactly 32 bytes long.
The first part may be tricky because according to the documentation:
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified.
Perl doesn't really have a concept of a "default charset", but if you're willing to settle for UTF-8, it's not hard:
sub stringToBytes32 {
my ($str) = @_;
utf8::encode $str;
return pack 'a32', $str;
}
(See Encode::encode if you need a different encoding.)
pack is handy for producing data in binary formats. Here we use it to truncate/pad to 32 bytes.
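For reference, the same "encode explicitly, then truncate/pad to 32 bytes" behaviour can be sketched on the Java side as well. This is an assumption-laden illustration, not the original function: the class name is hypothetical, it fixes the charset to UTF-8 instead of the platform default, and it returns a plain byte[] so that it stays self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Bytes32Padding {
    /** Encodes the text as UTF-8, then zero-pads or truncates to exactly 32 bytes. */
    static byte[] toBytes32(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf pads with zero bytes when longer and cuts off the tail when shorter.
        return Arrays.copyOf(encoded, 32);
    }

    public static void main(String[] args) {
        System.out.println(toBytes32("hello").length);         // 32
        System.out.println(toBytes32("x".repeat(40)).length);  // 32
    }
}
```

As with pack 'a32', truncation works on bytes, so a multibyte character that straddles the 32-byte boundary is cut in the middle.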

Why is my data converted to ASCII using Serial.print function in arduino?

I am writing a small program to send data with an RN2483 transceiver, and I have realised that my data is converted to ASCII when I send it over serial. That is to say, I have the following code in the sender, and the data has to be HEX:
String aux = String(message.charAt(i),HEX);
dataToBeTx = "radio tx " + aux+ "\r\n";
Serial1.print(dataToBeTx);
On the receiver I read Serial1 until I get the message, which I receive properly. However, it is an ASCII representation of the HEX data, and I would like to have it as HEX. I mean, I send HI, which is converted to HEX (H I => 0x48 0x49); on the receiver, if I translate that value to HEX again I get different things than my H or I, so I guess it is being encoded in ASCII. How can I get rid of that?
Thanks in advance,
regards
It is very unclear what you are trying to achieve. The first line in your code converts a single character into a string in hexadecimal. For example:
void setup ()
{
Serial.begin (115200);
Serial.println ();
String aux = String('A', HEX);
Serial.print ("aux = ");
Serial.println (aux);
} // end of setup
void loop ()
{
} // end of loop
Output:
aux = 41
So the 'A' in my code (internally represented as 0x41) has now become two ASCII characters: 4 and 1. That is, a string which is two bytes long.
So, in a sense, you can say it is already in hex.
if I translate that value to HEX again I got different things than my H or I
Well, yes, if you translate it "again" then you would get 0x34 and 0x31.
Do you want to send A in this case, 41 or something else?
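The double conversion can be reproduced off the device. The following is a sketch in Java rather than Arduino code, with hypothetical names, meant only to show the arithmetic: hex-encoding text that is already hex digits yields the ASCII codes of those digits.

```java
import java.nio.charset.StandardCharsets;

public class HexTwice {
    /** Renders each byte as two upper-case hex digits. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String once = toHex("HI".getBytes(StandardCharsets.US_ASCII));
        String twice = toHex(once.getBytes(StandardCharsets.US_ASCII));
        System.out.println(once);  // 4849
        System.out.println(twice); // 34383439
    }
}
```

To get back to H and I, the receiver has to parse each pair of hex digits into a byte rather than encode the received text a second time.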

Converting wchar_t* to char* on iOS

I'm attempting to convert a wchar_t* to a char*. Here's my code:
size_t result = wcstombs(returned, str, length + 1);
if (result == (size_t)-1) {
int error = errno;
}
It indeed fails, and error is filled with 92 (ENOPROTOOPT) - Protocol not available.
I've even tried setting the locale:
setlocale(LC_ALL, "C");
And this one too:
setlocale(LC_ALL, "");
I'm tempted to just throw the characters with static casts!
It seems the issue was that the source string was encoded with a non-standard encoding (two ASCII characters for each wide character), which looked fine in the debugger but was clearly wrong internally. The error code produced is not documented, but it is equivalent to simply not being able to decode that piece of text.

How can I have Packed decimal and normal text in a single file?

I need to generate a fixed-width file with a few of the columns in packed decimal format and a few of the columns in normal number format. I was able to generate it. I zipped the file and passed it on to the mainframe team. They imported it, unzipped the file and converted it to EBCDIC. They were able to get the packed decimal columns without any problem, but the normal number fields seem to have been messed up and are unreadable. Is there something specific that I need to do when processing or zipping my file before sending it to the mainframe? I am using COMP3 packed decimal. I am currently working on Windows XP, but the real production will be on RHEL.
Thanks in advance for helping me out. This is urgent.
Edited on 06 June 2011:
This is how it looks when I switch on HEX.
. . . . . . . . . . A . .
333333333326004444
210003166750C0000
The 'A' in the first row has a slight accent so it is not the actual upper case A.
210003166 is the raw decimal. The value of the packed decimal before comp3 conversion is 000000002765000 (we can ignore the leading zeroes if required).
UPDATE 2 : 7th June 2011
This is how I am creating the file that gets loaded into the mainframe:
File contains two columns - Identification number & amount. Identification number doesn't require comp3 conversion and amount requires comp3 conversion. Comp3 conversion is performed at oracle sql end. Here is the query for performing the conversion:
Select nvl(IDENTIFIER,' ') as IDENTIFIER, nvl(utl_raw.cast_to_varchar2(comp3.convert(to_number(AMOUNT))),'0') as AMOUNT from TABLEX where IDENTIFIER = 123456789
After executing the query, I do the following in Java:
String query = "Select nvl(IDENTIFIER,' ') as IDENTIFIER, nvl(utl_raw.cast_to_varchar2(comp3.convert(to_number(AMOUNT))),'0') as AMOUNT from TABLEX where IDENTIFIER = 210003166"; // this is the select query with COMP3 conversion
ResultSet rs = getConnection().createStatement().executeQuery(sb.toString());
sb.delete(0, sb.length()-1);
StringBuffer appendedValue = new StringBuffer (200000);
while(rs.next()){
appendedValue.append(rs.getString("IDENTIFIER"))
.append(rs.getString("AMOUNT"));
}
File toWriteFile = new File("C:/transformedFile.txt");
FileWriter writer = new FileWriter(toWriteFile, true);
writer.write(appendedValue.toString());
//writer.write(System.getProperty(ComponentConstants.LINE_SEPERATOR));
writer.flush();
appendedValue.delete(0, appendedValue.length() -1);
The text file thus generated is manually zipped by a winzip tool and provided to the mainframe team. Mainframe team loads the file into mainframe and browses the file with HEXON.
Now, coming to the conversion of the upper four bits of the zoned decimal, should I be doing it before writing it to the file? Or am I to apply the flipping at the mainframe end? For now, I have done the flipping at the Java end with the following code:
public static String toZoned(String num) {
if (num == null) {
return "";
}
String ret = num.trim();
if (num.equals("") || num.equals("-") || num.equals("+")) {
// throw ...
return "";
}
char lastChar = ret.substring(ret.length() - 1).charAt(0);
//System.out.print(ret + " Char - " + lastChar);
if (lastChar < '0' || lastChar > '9') {
} else if (num.startsWith("-")) {
if (lastChar == '0') {
lastChar = '}';
} else {
lastChar = (char) (lastChar + negativeDiff);
}
ret = ret.substring(1, ret.length() - 1) + lastChar;
} else {
if (num.startsWith("+")) {
ret = ret.substring(1);
}
if (lastChar == '0') {
lastChar = '{';
} else {
lastChar = (char) (lastChar + positiveDiff);
}
ret = ret.substring(0, ret.length() - 1) + lastChar;
}
//System.out.print(" - " + lastChar);
//System.out.println(" -> " + ret);
return ret;
}
The identifier becomes 21000316F at the java end and that is what gets written to the file. I have passed on the file to mainframe team and awaiting the output with HEXON. Do let me know if I am missing something. Thanks.
UPDATE 3: 9th Jun 2011
Ok I have got mainframe results. I am doing this now.
public static void main(String[] args) throws FileNotFoundException {
// TODO Auto-generated method stub
String myString = new String("210003166");
byte[] num1 = new byte[16];
try {
PackDec.stringToPack("000000002765000",num1,0,15);
System.out.println("array size: " + num1.length);
} catch (DecimalOverflowException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (DataException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
byte[] ebc = null;
try {
ebc = myString.getBytes("Cp037");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
PrintWriter pw = new PrintWriter("C:/transformationTextV1.txt");
pw.printf("%x%x%x%x%x%x%x%x%x",ebc[0],ebc[1],ebc[2],ebc[3],ebc[4], ebc[5], ebc[6], ebc[7], ebc[8]);
pw.printf("%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x",num1[0],num1[1],num1[2],num1[3],num1[4], num1[5], num1[6], num1[7],num1[8], num1[9],num1[10], num1[11],num1[12], num1[13], num1[14],num1[15]);
pw.close();
}
And I get the following output:
Á.Á.Á.Á.Á.Á.Á.Á.Á.................Ä
63636363636363636333333333333333336444444444444444444444444444444444444444444444
62616060606361666600000000000276503000000000000000000000000000000000000000000000
I must be doing something very wrong!
UPDATE 4: 14th Jun 2011
This query was resolved after using James' suggestion. I am currently using the below code and it gives me the expected output:
public static void main(String[] args) throws IOException {
// TODO Auto-generated method stub
String myString = new String("210003166");
byte[] num1 = new byte[16];
try {
PackDec.stringToPack("02765000",num1,0,8);
} catch (DecimalOverflowException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (DataException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
byte[] ebc = null;
try {
ebc = myString.getBytes("Cp037");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
FileOutputStream writer = new FileOutputStream("C:/transformedFileV3.txt");
writer.write(ebc,0,9);
writer.write(num1,0,8);
writer.close();
}
As you are coding in Java and you require a mix of EBCDIC and COMP-3 in your output, you will need to do the Unicode to EBCDIC conversion in your own program.
You cannot leave this up to the file transfer utility, as it will corrupt your COMP-3 fields.
Luckily you are using Java, so it's easy using the getBytes method of the String class.
Working Example:
package com.tight.tran;
import java.io.*;
import name.benjaminjwhite.zdecimal.DataException;
import name.benjaminjwhite.zdecimal.DecimalOverflowException;
import name.benjaminjwhite.zdecimal.PackDec;
public class worong {
/**
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
// TODO Auto-generated method stub
String myString = new String("210003166");
byte[] num1 = new byte[16];
try {
PackDec.stringToPack("000000002765000",num1,0,15);
System.out.println("array size: " + num1.length);
} catch (DecimalOverflowException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (DataException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
byte[] ebc = null;
try {
ebc = myString.getBytes("Cp037");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
FileOutputStream writer = new FileOutputStream("C:/transformedFile.txt");
writer.write(ebc,0,9);
writer.write(num1,0,15);
writer.close();
}
}
Produces (for me!):
0000000: f2f1 f0f0 f0f3 f1f6 f600 0000 0000 0000 ................
0000010: 0000 0000 2765 000c 0d0a ....'e....
"They were able to get the packed decimal columns without any problem but the normal number fields seemed to have messed up " would seem to indicate that they did not translate ASCII to EBCDIC.
ASCII zero x'30' should translate to EBCDIC zero x'F0'. If this was not done then, depending on the EBCDIC code page, x'30' does not map to a valid character on most EBCDIC displays.
However, even if they did translate, you will have a different problem, as all or some of your COMP-3 data will be corrupted. The simple translate programs have no way to distinguish between character and COMP-3 data, so they will convert a number such as x'00303C' to x'00F04C', which will cause any mainframe program to bomb out with the dreaded "0C7 Decimal Arithmetic Exception" (culturally equivalent to "StackOverflow").
So basically you are in a lose/lose situation. I would suggest you ditch the packed decimals and use plain ASCII characters for your numbers.
The zipping should not cause you a problem, except that the file transfer utility was probably doing ASCII to EBCDIC conversion on the plain text file but not on the zipped file.
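The corruption described above can be reproduced with a small sketch. It rests on assumptions: the class name is hypothetical, a layout-unaware transfer is modelled as "decode every byte as a character, re-encode as Cp037", and the JDK is assumed to ship the extended charsets that provide Cp037.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BlindTranslation {
    /** Re-encodes every byte as EBCDIC, the way a transfer that knows nothing about the record layout would. */
    static byte[] asciiToEbcdic(byte[] bytes) {
        Charset ebcdic = Charset.forName("Cp037"); // assumption: extended charsets are available
        return new String(bytes, StandardCharsets.ISO_8859_1).getBytes(ebcdic);
    }

    public static void main(String[] args) {
        byte[] text = "0".getBytes(StandardCharsets.US_ASCII); // character data: x'30'
        byte[] packed = {0x00, 0x30, 0x3C};                    // COMP-3 data: +303

        // Character data is translated as intended: x'30' becomes x'F0'.
        System.out.println(Integer.toHexString(asciiToEbcdic(text)[0] & 0xFF)); // f0

        // The packed field goes through the same mapping and no longer holds its digits.
        System.out.println(Arrays.equals(packed, asciiToEbcdic(packed))); // false
    }
}
```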
"... converted to EBCDIC..." may be part of the problem.
Unless the mainframe conversion process is "aware" of the record layout it is
working with (ie. which columns contain binary, packed and/or character data),
it is going to mess something up because the mapping process is format dependent.
You have indicated the COMP-3 data are ok, I am willing to bet that either
the "converted to EBCDIC" doesn't do anything, or it is performing some sort of
ASCII to COMP-3 conversion on all of your data - thus messing up non COMP-3 data.
Once you get to the mainframe, this is what you should see:
COMP-3 - each byte contains 2 digits except the last (right most, least
significant). The least significant
byte contains only 1 decimal digit in the upper 4 bits and the sign field in the
lower 4 bits. Each decimal digit is recorded in hex (eg. 5 = B'0101')
Zoned Decimal (normal numbers) - each byte contains 1 decimal digit. The upper
four bits should always contain HEX F, except possibly the least significant
byte where the upper 4 bits may contain the sign and the lower 4 bits a digit. The
4 bit digit is recorded in hex (eg. 5 = B'0101')
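The two layouts just described can be sketched by hand in Java. This is a minimal illustration under assumptions: the class and method names are hypothetical, only unsigned input is handled, and the COMP-3 sign nibble is fixed to HEX C (positive). It is not a substitute for a tested library.

```java
public class MainframeDecimal {
    /** Packs an unsigned digit string into COMP-3 with the given number of digit positions. */
    static byte[] toComp3(String number, int digits) {
        if (!number.matches("\\d+") || number.length() > digits) {
            throw new IllegalArgumentException("expected at most " + digits + " decimal digits");
        }
        StringBuilder nibbles = new StringBuilder();
        for (int i = number.length(); i < digits; i++) {
            nibbles.append('0');               // leading zeros fill the high-order digits
        }
        nibbles.append(number).append('C');    // sign nibble goes last
        if (nibbles.length() % 2 != 0) {
            nibbles.insert(0, '0');            // pad to a whole number of bytes
        }
        byte[] out = new byte[nibbles.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(nibbles.charAt(2 * i), 16);
            int lo = Character.digit(nibbles.charAt(2 * i + 1), 16);
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /** Converts an unsigned digit string to zoned decimal: HEX F in the upper nibble, the digit in the lower. */
    static byte[] toZoned(String number) {
        if (!number.matches("\\d+")) {
            throw new IllegalArgumentException("expected decimal digits only");
        }
        byte[] out = new byte[number.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) (0xF0 | (number.charAt(i) - '0'));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toComp3("2765000", 15).length); // 8 bytes: 15 digits plus the sign nibble
        System.out.println(toZoned("210003166").length);   // 9 bytes: one per digit
    }
}
```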
You need to see what the un-zipped converted data look like on the mainframe.
Ask someone to "BROWSE" your file on the mainframe with "HEX ON" so you can
see what the actual HEX content of your file is. From there you should be able
to figure out what sort of hoops and loops you need to jump through to make this
work.
Here are a couple of links that may be of help to you:
IBM Mainframe Numeric data representation
ASCII to EBCDIC chart
Update: If the mainframe guys can see the correct digits when browsing with
"HEX ON" then there are two possible problems:
Digit is stored in the wrong nibble. The digit should be visible in the
lower 4 bits. If it is in the upper 4 bits, that is definitely a problem.
The non-digit nibble (upper 4 bits) does not contain HEX 'F' or valid sign value.
Unsigned digits always contain HEX 'F' in the upper 4 bits of the byte. If the number
is signed (eg. PIC S9(4) - or something like that), the upper 4 bits of the least
significant digit (last one) should contain HEX 'C' or 'D'.
Here is a bit of a screen shot of what BROWSE with 'HEX ON' should look like:
File Edit Edit_Settings Menu Utilities Compilers Test Help
VIEW USERID.TEST.DATA - 01.99 Columns 00001 00072
Command ===> Scroll ===> CSR
****** ***************************** Top of Data ******************************
000001 0123456789
FFFFFFFFFF44444444444444444444444444444444444444444444444444444444444444
012345678900000000000000000000000000000000000000000000000000000000000000
------------------------------------------------------------------------------
000002 |¬?"±°
012345678944444444444444444444444444444444444444444444444444444444444444
FFFFFFFFF000000000000000000000000000000000000000000000000000000000000000
------------------------------------------------------------------------------
000003 àíÃÏhr
012345678944444444444444444444444444444444444444444444444444444444444444
012345678900000000000000000000000000000000000000000000000000000000000000
------------------------------------------------------------------------------
The lines beginning with '000001', '000002' and '000003' show 'plain' text. The two lines below
each of them show the HEX representation of the character above it. The first line of HEX
shows the upper 4 bits, the second line the lower 4 bits.
Line 1 contains the number '0123456789' followed by blank spaces (HEX 40).
Line 2 shows junk because the upper and lower nibbles are flipped. The exact silly character
is just a matter of code page selection so do not get carried away with what you see.
Line 3 shows similar junk because both upper and lower nibbles contain a digit.
Line '000001' is the sort of thing you should see for unsigned zoned decimal numbers
on an IBM mainframe using EBCDIC (single byte character set).
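If it helps to check a file before handing it over, the two HEX rows of such a display can be approximated locally. The sketch below is an assumption-based helper with hypothetical names; it only prints the upper-nibble and lower-nibble rows for a byte array and does not reproduce the character row.

```java
public class HexOnRows {
    /** Builds the row showing the upper 4 bits of each byte. */
    static String upperNibbles(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(Character.toUpperCase(Character.forDigit((b >> 4) & 0x0F, 16)));
        }
        return sb.toString();
    }

    /** Builds the row showing the lower 4 bits of each byte. */
    static String lowerNibbles(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(Character.toUpperCase(Character.forDigit(b & 0x0F, 16)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Zoned decimal 210003166 followed by COMP-3 000000002765000 (sign C).
        byte[] record = {
            (byte) 0xF2, (byte) 0xF1, (byte) 0xF0, (byte) 0xF0, (byte) 0xF0,
            (byte) 0xF3, (byte) 0xF1, (byte) 0xF6, (byte) 0xF6,
            0x00, 0x00, 0x00, 0x00, 0x27, 0x65, 0x00, 0x0C
        };
        System.out.println(upperNibbles(record)); // FFFFFFFFF00002600
        System.out.println(lowerNibbles(record)); // 2100031660000750C
    }
}
```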
UPDATE 2
You added a HEX display to your question on June 6th. I think maybe there
were a couple of formatting issues. If this is what
you were trying to display, the following discussion might be of help to you:
..........A..
33333333326004444
210003166750C0000
You noted that this is a display of two "numbers":
210003166 in Zoned Decimal
000000002765000 in COMP-3
This is what an IBM mainframe would be expecting:
210003166 :Á : <-- Display character
FFFFFFFFF00002600 <-- Upper 4 bits of each byte
2100031660000750C <-- Lower 4 bits of each byte
Notice the differences between what you have and the above:
The upper 4 bits of the Zoned Decimal data in your display contain
a HEX '3', they should contain a HEX 'F'. The lower 4 bits contain the
expected digit. Get those upper 4 bits fixed
and you should be good to go. BTW... it looks to me that whatever 'conversion' you
have attempted to Zoned Decimal is having no effect. The bit patterns you have for
each digit in the Zoned Decimal correspond to digits in the ASCII character set.
In the COMP-3 field you indicated that the leading zeros could be truncated.
Sorry, but they are either part of the number or they are not! My display above
includes leading zeros. Your display appears to have truncated leading zeros and then padded
trailing bytes with spaces (HEX 40). This won't work! COMP-3 fields are defined
with a fixed number of digits and all digits must be represented - that means leading
zeros are required to fill out the high order digits of each number.
The Zoned Decimal fix should be pretty easy... The COMP-3 fix is probably just a
matter of not stripping leading zeros (otherwise it looks pretty good).
UPDATE 3...
How do you flip the 4 high order bits? I got the impression somewhere along the line that you might be doing your conversion via a Java program.
I, unfortunately, am a COBOL programmer, but I'll take a shot at it (don't
laugh)...
Based on what I have seen here, all you need to do is take each ASCII
digit and flip the high 4 bits to HEX F and the result will be the equivalent
unsigned Zoned Decimal EBCDIC digit. Try something like...
public static byte AsciiToZonedDecimal(byte b) {
//flip upper 4 bits to Hex F...
return (byte)(b | 0xF0);
}
Apply the above to each ASCII digit and the result should be an unsigned EBCDIC
Zoned Decimal number.
UPDATE 4...
At this point the answers provided by James Anderson should put you on the right track.
James pointed you to name.benjaminjwhite.zdecimal and
this looks like it has all the Java classes you need to convert your data. The
StringToZone method
should be able to convert the IDENTIFIER string you get back from Oracle into a byte array that you then append to the
output file.
I am not very familiar with Java but I believe Java Strings are stored internally as Unicode Characters which are 16 bits long. The EBCDIC
characters you are trying to create are only 8 bits long. Given this, you might be better off writing to the output file using byte arrays (as opposed to strings).
Just a hunch from a non Java programmer.
The toZoned method in your question above appears to only concern itself with the first
and last characters of the string. Part of the problem is that each and every character
needs to be converted - the 4 upper bits of each byte, except possibly the last, need to be patched to contain Hex F. The lower 4 bits contain one digit.
BTW... You can pick up the source for this Java utility class at: http://www.benjaminjwhite.name/zdecimal
It sounds like the problem is in the EBCDIC conversion. The packed decimal would use characters as byte values, and isn't subject to the transliterations EBCDIC <-> ASCII.
If they see control characters (or square markers on Windows), then they may be viewing ASCII data as EBCDIC.
If they see " ñòóôõö øù" in place of "0123456789" then they are viewing EBCDIC characters in a viewer using ANSI, or extended ASCII.
