EBCDIC code page does not convert lower case 'a' - groovy

I'm attempting to convert a string of characters from ASCII to EBCDIC using an IBM codepage. The conversion is correct except for lower case 'a' which is converted to an unprintable character.
Here is a piece of a groovy script running in Windows 7 that illustrates the problem.
groovy:000> letters='abcdABCD'
===> abcdABCD
groovy:000> String.format("%04x", new BigInteger(1, letters.getBytes()))
===> 6162636441424344
groovy:000> lettersx=new String(letters.getBytes('IBM500'))
===> ?éâä┴┬├─
groovy:000> String.format("%04x", new BigInteger(1, lettersx.getBytes()))
===> 3f828384c1c2c3c4
After converting to EBCDIC all the characters in the string are valid except the first one, a lower case 'a'. Try as I might I can't find any information on this problem. I've tried a number of IBM code pages with the same results (IBM01140, IBM1047 etc.)

The problem is in this expression:
new String(letters.getBytes('IBM500'))
letters.getBytes creates a byte-array containing (in hexadecimal):
81 82 83 84 C1 C2 C3 C4
but then you're immediately converting that back to a Unicode String using your platform default encoding:
new String( <byte-array> );
If you want the ordinal values of the characters in your String to be equal to the byte values, you must specify an encoding that does that, for example ISO-8859-1:
new String(letters.getBytes('IBM500'), "ISO-8859-1")
The encoding you're using does not define a character for byte 0x81, so it is replaced with ? (0x3F). You're most likely using Windows-1252.
Strings contain characters, not bytes. Java will always apply an encoding conversion when going from one to the other.
EDIT: responding to #mister270's comment:
Here's a program in Java to demonstrate:
public class Ebcdic
{
    public static void main(String[] args) throws Exception
    {
        String letters = "abcdABCD";
        byte[] ebcdic = letters.getBytes("IBM500");
        System.out.print("Ebcdic bytes:");
        for (byte b : ebcdic)
        {
            System.out.format(" %02X", b & 0xFF);
        }
        System.out.println();
        String lettersEbcdic = new String(ebcdic, "ISO-8859-1");
        System.out.print("Ebcdic bytes stored in chars:");
        for (char c : lettersEbcdic.toCharArray())
        {
            System.out.format(" %04X", (int) c);
        }
        System.out.println();
        System.out.println("Ebcdic bytes in chars printed using my default platform encoding: " + lettersEbcdic);
    }
}
Output is:
Ebcdic bytes: 81 82 83 84 C1 C2 C3 C4
Ebcdic bytes stored in chars: 0081 0082 0083 0084 00C1 00C2 00C3 00C4
Ebcdic bytes in chars printed using my default platform encoding: ????��ǎ
What this shows is that:
- the EBCDIC conversion into the byte array is occurring correctly using "IBM500"
- the "identity" conversion of bytes to chars using "ISO-8859-1" is occurring correctly
- my system doesn't have a mapping from Unicode character U+0081 etc. to my default platform character encoding, so it displays it as ?
Java (and so Groovy too) stores characters internally as Unicode, UTF-16 to be precise. If you want to encode them as EBCDIC, then they stop being characters and should no longer be held in Strings. EBCDIC is an 8-bit encoding, so each character can be stored in a byte. If you need to interface with a system that expects a particular encoding (in your case, EBCDIC), then that system really should accept bytes, not Strings; otherwise you end up with exactly this sort of confusion.
If you must use Strings to hold EBCDIC bytes, then you must use the ISO-8859-1 encoding whenever you use an InputStream or OutputStream (including System.out) to ensure that your EBCDIC codes are not "translated" from bytes to characters.
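For comparison, the same round trip can be sketched in Python; this is only an illustration, using the built-in cp500 codec (which corresponds to IBM500), latin-1 as the byte-for-byte identity mapping, and cp1252 standing in for the Windows default encoding mentioned above:
letters = "abcdABCD"
# Unicode -> EBCDIC: Python's cp500 codec corresponds to IBM500
ebcdic = letters.encode("cp500")
print(ebcdic.hex())                      # 81828384c1c2c3c4
# Bytes -> chars with the identity mapping: latin-1 maps byte 0xNN to U+00NN
as_chars = ebcdic.decode("latin-1")
print([hex(ord(c)) for c in as_chars])   # ['0x81', '0x82', '0x83', '0x84', '0xc1', '0xc2', '0xc3', '0xc4']
# Decoding with a Windows-1252-like charset instead: byte 0x81 has no mapping there,
# so the lowercase 'a' comes back as the replacement character
print(ebcdic.decode("cp1252", errors="replace"))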

Related

base64.decode: Invalid encoding before padding

I'm working on a Flutter project and I'm currently getting an error with some of the strings I try to decode using the base64.decode() method. I've created a short Dart snippet which reproduces the problem I'm facing with a specific string:
import 'dart:convert';
void main() {
  final message = 'RU5UUkVHQUdSQVRJU1==';
  print(utf8.decode(base64.decode(message)));
}
I'm getting the following error message:
Uncaught Error: FormatException: Invalid encoding before padding (at character 19)
RU5UUkVHQUdSQVRJU1==
I've tried decoding the same string with JavaScript and it works fine. Would be glad if someone could explain why am I getting this error, and possibly show me a solution. Thanks.
Base64 encoding breaks binary data into groups of 3 bytes (24 bits) and represents each group as four printable ASCII characters. It does that in essentially two steps.
The first step is to break the binary string down into 6-bit blocks. Base64 only uses 6 bits (corresponding to 2^6 = 64 characters) to ensure the encoded data is printable and human-readable. None of the special characters available in ASCII are used.
The 64 characters (hence the name Base64) are 10 digits, 26 lowercase characters and 26 uppercase characters, as well as the plus sign (+) and the forward slash (/). There is also a 65th character known as a pad, which is the equal sign (=). This character is used when the last segment of binary data doesn't contain a full 6 bits.
So RU5UUkVHQUdSQVRJU1== doesn't follow the encoding pattern.
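To illustrate, here is a quick Python sketch (using the standard base64 module; the sample inputs are my own) showing how the amount of padding depends on the input length:
import base64
print(base64.b64encode(b'Man'))  # b'TWFu' - 3 bytes, no padding
print(base64.b64encode(b'Ma'))   # b'TWE=' - 2 bytes, one pad character
print(base64.b64encode(b'M'))    # b'TQ==' - 1 byte, two pad characters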
Use the Underscore Character "_" as Padding and Decode With the Pad Bytes Deleted
For some reason, dart:convert's base64.decode chokes on strings padded with = and raises the "invalid encoding before padding" error. This happens even if you use the package's own padding method, base64.normalize, which pads the string with the correct padding character =.
= is indeed the correct padding character for base64 encoding. It is used to fill out base64 strings when fewer than 24 bits are available in the input group. See RFC 4648, Section 4.
However, RFC 4648 Section 5, which is a base64 encoding scheme for URLs, uses the underscore character _ as padding instead of = to be URL safe.
Using _ as the padding character will cause base64.decode to decode without error.
In order to further decode the generated list of bytes to UTF-8, you will need to delete the padding bytes or you will get an "Invalid UTF-8 byte" error.
See the code below. Here is the same code as a working dartpad.dev example.
import 'dart:convert';
void main() {
  //String message = 'RU5UUkVHQUdSQVRJU1=='; // as of dart 2.18.2 this will generate an "invalid encoding before padding" error
  //String message = base64.normalize('RU5UUkVHQUdSQVRJU1'); // will also generate the same error
  String message = 'RU5UUkVHQUdSQVRJU1';
  print("Encoded String: $message");
  print("Decoded String: ${decodeB64ToUtf8(message)}");
}
String decodeB64ToUtf8(String message) {
  message = padBase64(message); // pad with underscores => ('RU5UUkVHQUdSQVRJU1__')
  List<int> dec = base64.decode(message);
  // remove padding bytes
  dec = dec.sublist(0, dec.length - RegExp(r'_').allMatches(message).length);
  return utf8.decode(dec);
}
String padBase64(String rawBase64) {
  return (rawBase64.length % 4 > 0)
      ? rawBase64 + List.filled(4 - (rawBase64.length % 4), "_").join("")
      : rawBase64;
}
The string RU5UUkVHQUdSQVRJU1== is not a compliant base 64 encoding according to RFC 4648, which in section 3.5, "Canonical Encoding," states:
The padding step in base 64 and base 32 encoding can, if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders, which is described in the descriptions on padding below. If this property do not hold, there is no canonical representation of base-encoded data, and multiple base-encoded strings can be decoded to the same binary data. If this property (and others discussed in this document) holds, a canonical encoding is guaranteed.
In some environments, the alteration is critical and therefore decoders MAY chose to reject an encoding if the pad bits have not been set to zero. The specification referring to this may mandate a specific behaviour.
(Emphasis added.)
Here we will manually go through the base 64 decoding process.
Taking your encoded string RU5UUkVHQUdSQVRJU1== and performing the mapping from the base 64 character set (given in "Table 1: The Base 64 Alphabet" of the aforementioned RFC), we have:
R U 5 U U k V H Q U d S Q V R J U 1 = =
010001 010100 111001 010100 010100 100100 010101 000111 010000 010100 011101 010010 010000 010101 010001 001001 010100 110101 ______ ______
(using runs of underscores to represent the padding characters).
Now, grouping these by 8 instead of 6, we get
01000101 01001110 01010100 01010010 01000101 01000111 01000001 01000111 01010010 01000001 01010100 01001001 01010011 0101____ ________
E N T R E G A G R A T I S P
The important part is at the end, where there are some non-zero bits followed by padding. The Dart implementation is correctly determining that the padding doesn't make sense, given that the last four bits of the preceding character do not decode to zeros.
As a result, the decoding of RU5UUkVHQUdSQVRJU1== is ambiguous. Is it ENTREGAGRATIS or ENTREGAGRATISP? It's precisely this reason why the RFC states, "These pad bits MUST be set to zero by conforming encoders."
In fact, because of this, I'd argue that an implementation that decodes RU5UUkVHQUdSQVRJU1== to ENTREGAGRATIS without complaint is problematic, because it's silently discarding non-zero bits.
The RFC-compliant encoding of ENTREGAGRATIS is RU5UUkVHQUdSQVRJUw==.
The RFC-compliant encoding of ENTREGAGRATISP is RU5UUkVHQUdSQVRJU1A=.
This further highlights the ambiguity of your input RU5UUkVHQUdSQVRJU1==, which matches neither.
I suggest you check your encoder to determine why it's providing you with non-compliant encodings, and make sure you're not losing information as a result.
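You can check the canonical encodings yourself with any base64 implementation; for example, a minimal Python sketch:
import base64
print(base64.b64encode(b'ENTREGAGRATIS'))   # b'RU5UUkVHQUdSQVRJUw=='
print(base64.b64encode(b'ENTREGAGRATISP'))  # b'RU5UUkVHQUdSQVRJU1A='
# Neither matches 'RU5UUkVHQUdSQVRJU1==', which is why a strict decoder rejects it.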

Buffers in Node

Consider following Node code
const bufA = Buffer.from('tést');
bufA:
<Buffer 74 c3 a9 73 74>
Why were the four input characters translated to five bytes?
When you call Buffer.from(string), a couple of things happen:
- The encoding defaults to utf-8.
- The JS string, which is internally stored as an array of UCS-2 characters, is encoded into UTF-8.
In UTF-8, é is a multi-byte character, like most accented characters from Latin scripts. Here's more information on how the character is encoded in different systems: https://www.fileformat.info/info/unicode/char/e9/index.htm
As you can see, the UTF-8 representation of this character is 0xC3 0xA9, which corresponds to the second and third bytes c3 a9 in your Buffer.
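The same count can be reproduced in any language that exposes the encoding step; for instance, a quick Python sketch (purely illustrative):
text = 'tést'
data = text.encode('utf-8')
print(len(text), len(data))   # 4 5  -> four characters, five bytes
print(data.hex(' '))          # 74 c3 a9 73 74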
This also means that, when decoding Buffers (e.g. when concatenating data coming in from a Stream), some characters may fall on the buffer boundary and it may be impossible to decode the string until you have the remainder of the character (0xC3 on its own would be invalid). This is why code examples you find on the Web which do:
let result = '';
stream.on('data', function(buf) {
  // BUG! Does not account for multi-byte characters.
  result += buf.toString();
});
are almost always wrong - unless the Stream is already set up for handling encoding itself (then, you'll get strings, not buffers, when reading).
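To make the boundary problem concrete, here is a small Python sketch using an incremental UTF-8 decoder; in Node the equivalent fix is calling stream.setEncoding('utf8') or using the string_decoder module:
import codecs
# 'tést' split in the middle of the two-byte character é
chunks = [b'\x74\xc3', b'\xa9\x73\x74']
# chunks[0].decode('utf-8') would raise UnicodeDecodeError at the dangling 0xC3
decoder = codecs.getincrementaldecoder('utf-8')()
result = ''.join(decoder.decode(chunk) for chunk in chunks)
print(result)  # tést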

Capital text with space. What kind of text format is this?

I've got a list of songs that look like this.
GIFT〜白〜(冬恋/君の歌をうたう)【完全生産限定盤】
The Latin letters ＧＩＦＴ here look odd and I can't figure out how to make them read like normal text. For example, if you copy this word, it doesn't have spaces between the letters or anything, but it seems to be in a different text format.
Can someone help me convert this into normal text?
These are Unicode characters.
For example, the 'G' is
Unicode Character 'FULLWIDTH LATIN CAPITAL LETTER G' (U+FF27)
UTF-8 (hex) 0xEF 0xBC 0xA7 (efbca7)
see here
You can copy the string to Notepad++ and then convert it into hex code (Extensions/Converter/ASCII->HEX)
and get EFBCA7EFBCA9EFBCA6EFBCB4 for the word 'ＧＩＦＴ'.
Then search for "Unicode EFBCA7" to find the above information.
This can be converted into normal Latin characters. For example, in .NET there is the Normalize function:
using System;
using System.Text;

public class Program
{
    public static void Main()
    {
        Console.WriteLine("Unicode:");
        String text = "ＧＩＦＴ";
        Console.WriteLine(text);
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        foreach (var b in bytes)
            Console.Write("{0:X} ", b);
        Console.WriteLine("\nASCII:");
        String text2 = text.Normalize(NormalizationForm.FormKC);
        Console.WriteLine(text2);
        bytes = Encoding.UTF8.GetBytes(text2);
        foreach (var b in bytes)
            Console.Write("{0:X} ", b);
    }
}
try it on .Net Fiddle
This will print out:
Unicode:
GIFT
EF BC A7 EF BC A9 EF BC A6 EF BC B4
ASCII:
GIFT
47 49 46 54
Probably there are similar functions in other languages too. Now you know what you're looking for.
The search term "convert unicode FULLWIDTH LATIN" will help you.
See also here
If you can't find a function for that, you can also just do your own conversion; after all, the fullwidth character codes are just a fixed offset from the normal ASCII/UTF-8 Latin characters. See examples here.
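In Python, for instance, both approaches fit in a couple of lines; this is only a sketch (the constant 0xFEE0 is the offset between the fullwidth forms U+FF01..U+FF5E and ASCII 0x21..0x7E):
import unicodedata
text = '\uFF27\uFF29\uFF26\uFF34'                  # fullwidth 'GIFT'
print(unicodedata.normalize('NFKC', text))         # GIFT (plain ASCII letters)
# The offset approach: shift fullwidth code points down to their ASCII equivalents
print(''.join(chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E else c for c in text))  # GIFT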

How to Turn string into bytes?

Using Python 3, I've got a string which is displayed as bytes:
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to change it into readable Chinese characters by decoding it:
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print(result)
which shows like this, and it is what I want:
strategyName=百度
But if I save it in another string, it works differently:
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦é£é©ç­ç¥
Please help me understand why this is happening and how I can fix it.
Thanks a lot.
Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.
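For illustration, here is what each step actually does to those code points (a minimal sketch):
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'   # these escapes are code points U+00E7, U+0099, ..., not bytes
print(str0.encode('utf-8'))    # b'strategyName=\xc3\xa7\xc2\x99\xc2\xbe\xc3\xa5\xc2\xba\xc2\xa6'
# Each wrong character gets re-encoded as two bytes, so decoding that back as
# UTF-8 just reproduces the same six wrong characters (mojibake).
print(str0.encode('latin-1').decode('utf-8'))    # strategyName=百度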

Remove trailing "=" when base64 encoding

I am noticing that whenever I base64 encode a string, a "=" is appended at the end. Can I remove this character and then reliably decode it later by adding it back, or is this dangerous? In other words, is the "=" always appended, or only in certain cases?
I want my encoded string to be as short as possible, that's why I want to know if I can always remove the "=" character and just add it back before decoding.
The = is padding.
Wikipedia says
An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes); these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream).
If you control the other end, you could remove it when in transport, then re-insert it (by checking the string length) before decoding.
Note that the data will not be valid Base64 in transport.
Also, another user pointed out (relevant to PHP users):
Note that in PHP base64_decode will accept strings without padding, hence if you remove it to process it later in PHP it's not necessary to add it back. – Mahn Oct 16 '14 at 16:33
So if your destination is PHP, you can safely strip the padding and decode without fancy calculations.
I wrote part of Apache's commons-codec-1.4.jar Base64 decoder, and in that logic we are fine without padding characters. End-of-file and End-of-stream are just as good indicators that the Base64 message is finished as any number of '=' characters!
The URL-Safe variant we introduced in commons-codec-1.4 omits the padding characters on purpose to keep things smaller!
http://commons.apache.org/codec/apidocs/src-html/org/apache/commons/codec/binary/Base64.html#line.478
I guess a safer answer is, "depends on your decoder implementation," but logically it is not hard to write a decoder that doesn't need padding.
In JavaScript you could do something like this:
// if this is your Base64 encoded string
var str = 'VGhpcyBpcyBhbiBhd2Vzb21lIHNjcmlwdA==';
// make URL friendly:
str = str.replace(/\+/g, '-').replace(/\//g, '_').replace(/\=+$/, '');
// reverse to original encoding
if (str.length % 4 != 0) {
  str += ('===').slice(0, 4 - (str.length % 4));
}
str = str.replace(/-/g, '+').replace(/_/g, '/');
See also this Fiddle: http://jsfiddle.net/7bjaT/66/
= is added for padding. The length of a base64 string should be multiple of 4, so 1 or 2 = are added as necessary.
Read: No, you shouldn't remove it.
On Android I am using this:
Global
String CHARSET_NAME ="UTF-8";
Encode
String base64 = new String(
    Base64.encode(byteArray, Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP),
    CHARSET_NAME);
return base64.trim();
Decode
byte[] bytes = Base64.decode(base64String,
    Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP);
which is equivalent to this in Java:
Encode
private static String base64UrlEncode(byte[] input)
{
    Base64 encoder = new Base64(true);
    byte[] encodedBytes = encoder.encode(input);
    return StringUtils.newStringUtf8(encodedBytes).trim();
}
Decode
private static byte[] base64UrlDecode(String input) {
    byte[] originalValue = StringUtils.getBytesUtf8(input);
    Base64 decoder = new Base64(true);
    return decoder.decode(originalValue);
}
I have never had problems with trailing "=" and I am using Bouncy Castle as well.
If you're encoding whole bytes (which is what most people do), then the padding is redundant.
Base64 consumes the input 6 bits at a time and produces one output character (an 8-bit ASCII byte) for each 6-bit group, so only 64 of the 256 possible byte values are used.
If your input is 1 byte (8 bits), the output is 12 bits, the smallest multiple of 6 that 8 fits into, with 4 bits extra. If your input is 2 bytes, you have to output 18 bits, with 2 bits extra. Measured against multiples of 8, a multiple of 6 leaves a remainder of 0, 2 or 4 bits.
The padding says to ignore those extra four (==) or two (=) bits; it is there to tell the decoder that those bits are padding.
The padding isn't really needed when you're encoding bytes. A base64 decoder can simply ignore leftover bits that total less than 8. In this case, you're best off removing it.
The padding might be of some use for streaming and for arbitrary-length bit sequences, as long as they're a multiple of two. It might also be used for cases where people want to send only the last 4 bits when more bits remain but the remaining bits are all zero. Some people might want to use it to detect incomplete sequences, though it's hardly reliable for that. I've never seen this optimisation in practice. People rarely have these situations; most people use base64 for discrete byte sequences.
If you see answers suggesting you leave it on, that's not good advice if you're simply encoding bytes; it enables a feature for a set of circumstances you don't have. The only reason to keep it in that case might be to add tolerance for decoders that don't work without the padding. If you control both ends, that's a non-concern.
If you're using PHP the following function will revert the stripped string to its original format with proper padding:
<?php
$str = 'base64 encoded string without equal signs stripped';
$str = str_pad($str, strlen($str) + (4 - ((strlen($str) % 4) ?: 4)), '=');
echo $str, "\n";
Using Python you can remove base64 padding and add it back like this:
from math import ceil
stripped = original.rstrip('=')
original = stripped.ljust(ceil(len(stripped) / 4) * 4, '=')
Yes, there are valid use cases where padding is omitted from a Base 64 encoding.
The JSON Web Signature (JWS) standard (RFC 7515) requires Base 64 encoded data to omit
padding. It expects:
Base64 encoding [...] with all trailing '=' characters omitted (as permitted by Section 3.2) and without the inclusion of any line breaks, whitespace, or other additional characters. Note that the base64url encoding of the empty octet sequence is the empty string. (See Appendix C for notes on implementing base64url encoding without padding.)
The same applies to the JSON Web Token (JWT) standard (RFC 7519).
In addition, Julius Musseau's answer has indicated that Apache's Base 64 decoder doesn't require padding to be present in Base 64 encoded data.
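In Python, for example, a JWS-style unpadded base64url value can be produced and consumed like this (a minimal sketch, not tied to any JWT library):
import base64
def b64url_encode(data: bytes) -> str:
    # URL-safe alphabet, trailing '=' padding stripped as JWS requires
    return base64.urlsafe_b64encode(data).rstrip(b'=').decode('ascii')
def b64url_decode(text: str) -> bytes:
    # restore the padding to a multiple of 4 before decoding
    return base64.urlsafe_b64decode(text + '=' * (-len(text) % 4))
encoded = b64url_encode(b'hello')
print(encoded)                 # aGVsbG8  (the standard encoding would be aGVsbG8=)
print(b64url_decode(encoded))  # b'hello'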
I do something like this with Java 8+:
private static String getBase64StringWithoutPadding(String data) {
    if (data == null) {
        return "";
    }
    Base64.Encoder encoder = Base64.getEncoder().withoutPadding();
    return encoder.encodeToString(data.getBytes());
}
This method gets an encoder which leaves out padding.
As mentioned in other answers already, padding can be added back later if you need to decode the string.
For Android, you may have trouble if you want to use the android.util.Base64 class, since it doesn't let you run plain unit tests, only instrumented/integration tests, and those require the Android environment.
On the other hand, if you use java.util.Base64, the compiler warns you that your minimum SDK may be too low (below 26) to use it.
So I suggest Android developers use
implementation "commons-codec:commons-codec:1.13"
Encoding object
fun encodeObjectToBase64(objectToEncode: Any): String {
    val objectJson = Gson().toJson(objectToEncode).toString()
    return encodeStringToBase64(objectJson.toByteArray(Charsets.UTF_8))
}
fun encodeStringToBase64(byteArray: ByteArray): String {
    return Base64.encodeBase64URLSafeString(byteArray) // encode with no padding
}
Decoding to Object
fun <T> decodeBase64Object(encodedMessage: String, encodeToClass: Class<T>): T {
    val decodedBytes = Base64.decodeBase64(encodedMessage)
    val messageString = String(decodedBytes, StandardCharsets.UTF_8)
    return Gson().fromJson(messageString, encodeToClass)
}
Of course, you may skip the Gson parsing and pass your String, converted to a ByteArray, straight into the method.
