Encoder and Decoder

Encoder and Decoder - string

Is there another way to encode a byte[] to a String and decode it back to a byte[] without using Base64?
When I encode a byte[] to a String and then compress the String using LZW, I can't decode it back to a byte[] using Base64. Is there an encoder and decoder that can still decode the String after it has been modified by LZW?

A not very practical, but easy to implement and easily reversible, way to encode bytes as Unicode characters is to map each byte to (offset + byte_value), choosing the offset so that all 256 values fall into a valid Unicode block.
For example, the Unicode block 2200..22FF (Mathematical Operators) is a reasonable choice for this (C# sample):
char EncodeByte(byte x) { return (char)(0x2200 + x);}
byte DecodeByte(char x) { return (byte)(x - 0x2200);}
Note: regular LZW operates on sequences of bytes, so no encoding is necessary when you start from bytes.
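The same offset scheme can be sketched in Python to check that it round-trips every byte value. This is only an illustration: the names OFFSET, encode_bytes and decode_text are made up here, and the offset 0x2200 is taken from the C# sample above.

```python
OFFSET = 0x2200  # start of the Mathematical Operators block, as in the C# sample


def encode_bytes(data: bytes) -> str:
    """Map each byte to the character at code point OFFSET + byte value."""
    return "".join(chr(OFFSET + b) for b in data)


def decode_text(text: str) -> bytes:
    """Reverse encode_bytes by subtracting OFFSET from each code point."""
    return bytes(ord(ch) - OFFSET for ch in text)


# Round-trip all 256 possible byte values.
payload = bytes(range(256))
encoded = encode_bytes(payload)
assert decode_text(encoded) == payload
```

Each byte becomes one character, so the string is no shorter than the input; the only point is that the mapping is reversible.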

Related

Using Protocol Buffer to Serialize Bytes Python3

I am trying to serialize a bytes object - which is an initialization vector for my program's encryption. But, the Google Protocol Buffer only accepts strings. It seems like the error starts with casting bytes to string. Am I using the correct method to do this? Thank you for any help or guidance!
Or also, can I make the Initialization Vector a string object for AES-CBC mode encryption?
Code
Cast the bytes to a string
string_iv = str(bytes_iv, 'utf-8')
Serialize the string using SerializeToString():
serialized_iv = IV.SerializeToString()
Use ParseFromString() to recover the string:
IV.ParseFromString( serialized_iv )
And finally, UTF-8 encode the string back to bytes:
bytes_iv = bytes(IV.string_iv, encoding= 'utf-8')
Error
string_iv = str(bytes_iv, 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 3: invalid start byte

If you must cast an arbitrary bytes object to str, these are your options:
Simply call str() on the object. This gives its repr form, i.e. something that could be parsed as a bytes literal, e.g. "b'abc\x00\xffabc'".
Decode with "latin1". This always works, even though it technically makes no sense if the data isn't text encoded in Latin-1.
Use base64 or base85 encoding (the standard library has a base64 module which covers both).
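As a rough sketch, the three options look like this in Python 3. The sample bytes are made up (they just include values such as 0x9b that are not valid UTF-8), and repr() is used in place of str() because it produces the same text for a bytes object.

```python
import ast
import base64

raw = bytes([0x01, 0x02, 0x03, 0x9B, 0xFF])  # hypothetical IV-like bytes, not valid UTF-8

# Option 1: repr form, e.g. "b'\\x01\\x02...'"; ast.literal_eval reverses it
as_repr = repr(raw)
from_repr = ast.literal_eval(as_repr)

# Option 2: Latin-1 maps every byte 0..255 to the code point with the same number
as_latin1 = raw.decode("latin-1")
from_latin1 = as_latin1.encode("latin-1")

# Option 3: Base64 (or Base85) gives an ASCII-only string
as_b64 = base64.b64encode(raw).decode("ascii")
from_b64 = base64.b64decode(as_b64)
as_b85 = base64.b85encode(raw).decode("ascii")
from_b85 = base64.b85decode(as_b85)

assert from_repr == from_latin1 == from_b64 == from_b85 == raw
```

All three are reversible. They differ in how the intermediate string looks: Latin-1 keeps one character per byte but may contain control characters, while Base64 and Base85 are longer but printable.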

How to Turn string into bytes?

I'm using Python 3 and I have a string which is displayed as bytes:
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to turn it into readable Chinese characters by decoding it:
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print(result)
which prints this, and it is what I want:
strategyName=百度
But if I store it in a str instead, it behaves differently:
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦é£é©çç¥
Please help me understand why this is happening and how I can fix it.
Thanks a lot

Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.

Encode and Decode byte[] into string in java?

How do I encode a byte array into a string and decode the encoded string back into a byte array? Mainly, I want to save the encoded string on a server and later decode it back into a byte array to load an image from the server.

To encode and decode a byte[] in Java, you can use the sample code below. The calls shown match the android.util.Base64 API.
Encode Byte array into String
String strByteValue = Base64.encodeToString(byteArrayValue, Base64.URL_SAFE);
Decode String into Byte[]
byte decodedString[] = Base64.decode(strByteValue , Base64.URL_SAFE);

How to read CSV which is in Hindi font using C#.net?

I am trying to read data from csv and put that in drop down. This CSV is written in Hindi font (shusha.ttf).
While reading each line I am getting junk values.
string sFileName = "C://MyFile.csv";
Assembly assem = Assembly.GetCallingAssembly();
FileStream[] fss = assem.GetFiles();
if (!File.Exists(sFileName))
{
MessageBox.Show("Items File Not Present");
return false;
}
StreamReader sr = new StreamReader(sFileName);
string sItem = null;
bool isFirstLine = true;
do
{
sItem = sr.ReadLine();
if (sItem != null)
{
string[] arrItems = sItem.Split(',');
if (!isFirstLine)
{
listItems.Add(arrItems[0]);
}
isFirstLine = false;
}
} while (sItem != null);
return true;

You're not providing an encoding parameter to the StreamReader, so it is assuming a default encoding, which is not the encoding the file was written with.
Not all text files or csv files are the same. Encoding systems choose how to convert 'characters' (glyphs, word pictures, letters, whatever) to bytes to store in a computer.
There are many different encoding systems: ASCII, EBCDIC, UTF-8, UTF-16, UTF-32, etc.
You need to figure out which encoding the file was written with and pass that as the Encoding parameter to the StreamReader class.
I would have guessed that the file was written in UTF-8, since it's a pretty universal standard for non-English text. However, StreamReader defaults to UTF-8 when you don't provide an encoding, so the file is probably not UTF-8. It's possible it's UTF-16, or perhaps even some other completely different encoding.
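To illustrate the idea of passing the encoding explicitly, here is a small Python sketch (a rough analogue of giving StreamReader an Encoding, not the C# API itself). The file name, its contents and the choice of UTF-16 are all made up for the example; the real file's encoding still has to be determined.

```python
import csv
import os
import tempfile

# Write a small sample CSV in UTF-16 so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "items.csv")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write("name,qty\nचाय,2\nदूध,1\n")


def read_first_column(csv_path, encoding):
    """Return the first column of every row after the header line."""
    with open(csv_path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return [row[0] for row in rows[1:] if row]


# Reading with the encoding the file was written in gives the original text.
items = read_first_column(path, "utf-16")
print(items)
```

Reading the same file with a different encoding either raises a decoding error or silently produces junk characters, which matches the symptom described in the question.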
For the curious who want some background on Unicode: Unicode is a standard that assigns simple numbers to glyphs, ranging from ASCII to snowmen to Mandarin, etc. Unicode just gives each glyph a number, known as a code point. Unicode however is NOT an encoding: it doesn't say how to actually represent those code points as bytes.
UTF8 is a unicode encoding that can cover the entirety of the unicode space, as is UTF16 and UTF32. UTF8 writes 1 byte out for code points below a certain value, 2 bytes for code points below a certain higher value, and so on, and uses signaling bits in each byte to help indicate whether a code point was written using one, two, three, etc bytes.
Internally, for instance, C# represents strings using UTF16, which is why if you look at the raw memory for strings containing only ascii text, you'll see lots of '0' values - ascii doesn't need the other 8 bits, so the values end up being all 0.
Here's a link from wikipedia that explains how UTF8 packs bits from the code point value, with signalling bits, into bytes to store in memory: https://en.wikipedia.org/wiki/UTF-8
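The variable width of UTF-8 and the zero bytes in UTF-16 can be observed directly. The following Python sketch uses four arbitrarily chosen sample characters; the variable names are illustrative only.

```python
# Sample code points: U+0041 (A), U+00E9, U+767E, U+1F600
samples = ["A", "\u00e9", "\u767e", "\U0001F600"]

# Number of bytes UTF-8 uses for each code point
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]
for ch, width in zip(samples, utf8_widths):
    print(f"U+{ord(ch):04X} takes {width} byte(s) in UTF-8: {ch.encode('utf-8').hex()}")

# ASCII-only text in UTF-16 (little-endian, no BOM): every second byte is zero
utf16_bytes = "abc".encode("utf-16-le")
print(utf16_bytes)
```

The four samples take 1, 2, 3 and 4 bytes respectively in UTF-8, while "abc" in UTF-16 occupies six bytes, three of which are zero.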

Remove trailing "=" when base64 encoding

I am noticing that whenever I base64 encode a string, a "=" is appended at the end. Can I remove this character and then reliably decode it later by adding it back, or is this dangerous? In other words, is the "=" always appended, or only in certain cases?
I want my encoded string to be as short as possible, that's why I want to know if I can always remove the "=" character and just add it back before decoding.

The = is padding.
Wikipedia says:
An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes); these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream).
If you control the other end, you could remove it when in transport, then re-insert it (by checking the string length) before decoding.
Note that the data will not be valid Base64 in transport.
Also, another user pointed out (relevant to PHP users):
Note that in PHP base64_decode will accept strings without padding, hence if you remove it to process it later in PHP it's not necessary to add it back. – Mahn Oct 16 '14 at 16:33
So if your destination is PHP, you can safely strip the padding and decode without fancy calculations.

I wrote part of Apache's commons-codec-1.4.jar Base64 decoder, and in that logic we are fine without padding characters. End-of-file and End-of-stream are just as good indicators that the Base64 message is finished as any number of '=' characters!
The URL-Safe variant we introduced in commons-codec-1.4 omits the padding characters on purpose to keep things smaller!
http://commons.apache.org/codec/apidocs/src-html/org/apache/commons/codec/binary/Base64.html#line.478
I guess a safer answer is, "depends on your decoder implementation," but logically it is not hard to write a decoder that doesn't need padding.

In JavaScript you could do something like this:
// if this is your Base64 encoded string
var str = 'VGhpcyBpcyBhbiBhd2Vzb21lIHNjcmlwdA==';
// make URL friendly:
str = str.replace(/\+/g, '-').replace(/\//g, '_').replace(/\=+$/, '');
// reverse to original encoding
if (str.length % 4 != 0){
str += ('===').slice(0, 4 - (str.length % 4));
}
str = str.replace(/-/g, '+').replace(/_/g, '/');
See also this Fiddle: http://jsfiddle.net/7bjaT/66/

= is added for padding. The length of a base64 string should be a multiple of 4, so 1 or 2 = are added as necessary.
Read: No, you shouldn't remove it.

On Android I am using this:
Global
String CHARSET_NAME ="UTF-8";
Encode
String base64 = new String(
Base64.encode(byteArray, Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP),
CHARSET_NAME);
return base64.trim();
Decode
byte[] bytes = Base64.decode(base64String,
Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP);
which is equivalent to this in Java:
Encode
private static String base64UrlEncode(byte[] input)
{
Base64 encoder = new Base64(true);
byte[] encodedBytes = encoder.encode(input);
return StringUtils.newStringUtf8(encodedBytes).trim();
}
Decode
private static byte[] base64UrlDecode(String input) {
byte[] originalValue = StringUtils.getBytesUtf8(input);
Base64 decoder = new Base64(true);
return decoder.decode(originalValue);
}
I have never had problems with the trailing "=", and I am using Bouncycastle as well.

If you're encoding bytes (at fixed bit length), then the padding is redundant. This is the case for most people.
Base64 consumes 6 bits at a time and produces a byte of 8 bits that only uses six bits worth of combinations.
If your string is 1 byte (8 bits), you'll have an output of 12 bits as the smallest multiple of 6 that 8 will fit into, with 4 bits extra. If your string is 2 bytes, you have to output 18 bits, with two bits extra. For multiples of six against multiple of 8 you can have a remainder of either 0, 2 or 4 bits.
The padding tells the decoder to ignore those extra four (==) or two (=) bits.
The padding isn't really needed when you're encoding bytes. A base64 encoder can simply ignore left over bits that total less than 8 bits. In this case, you're best off removing it.
The padding might be of some use for streaming and arbitrary length bit sequences as long as they're a multiple of two. It might also be used for cases where people want to only send the last 4 bits when more bits are remaining if the remaining bits are all zero. Some people might want to use it to detect incomplete sequences though it's hardly reliable for that. I've never seen this optimisation in practice. People rarely have these situations, most people use base64 for discrete byte sequences.
If you see answers suggesting that you leave it on, that's not good advice if you're simply encoding bytes: it enables a feature for a set of circumstances you don't have. The only reason to have it on in that case might be to add tolerance for decoders that don't work without the padding. If you control both ends, that's a non-concern.
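The bit arithmetic above can be checked with a short Python sketch. The helper name padding_info is made up for this example; it compares the number of "=" characters produced by the standard base64 module with the number of unused bits in the last 6-bit group.

```python
import base64


def padding_info(n_bytes: int):
    """Return (number of '=' characters, unused bits in the last 6-bit group)."""
    encoded = base64.b64encode(bytes(n_bytes)).decode("ascii")
    pad_chars = len(encoded) - len(encoded.rstrip("="))
    # Bits needed to round 8 * n_bytes up to the next multiple of 6
    leftover_bits = (-8 * n_bytes) % 6
    return pad_chars, leftover_bits


for n in range(1, 7):
    pad_chars, leftover_bits = padding_info(n)
    print(f"{n} byte(s): {pad_chars} pad character(s), {leftover_bits} unused bit(s)")
```

For byte input the pad count is always half the number of unused bits, so it carries no information that the encoded length does not already give.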

If you're using PHP, the following snippet will restore the stripped string to its original format with proper padding:
<?php
$str = 'base64 encoded string with equal signs stripped';
$str = str_pad($str, strlen($str) + (4 - ((strlen($str) % 4) ?: 4)), '=');
echo $str, "\n";

Using Python you can remove base64 padding and add it back like this:
from math import ceil
stripped = original.rstrip('=')
original = stripped.ljust(ceil(len(stripped) / 4) * 4, '=')

Yes, there are valid use cases where padding is omitted from a Base 64 encoding.
The JSON Web Signature (JWS) standard (RFC 7515) requires Base 64 encoded data to omit padding. It expects:
Base64 encoding [...] with all trailing '='
characters omitted (as permitted by Section 3.2) and without the
inclusion of any line breaks, whitespace, or other additional
characters. Note that the base64url encoding of the empty octet
sequence is the empty string. (See Appendix C for notes on
implementing base64url encoding without padding.)
The same applies to the JSON Web Token (JWT) standard (RFC 7519).
In addition, Julius Musseau's answer has indicated that Apache's Base 64 decoder doesn't require padding to be present in Base 64 encoded data.
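A minimal Python sketch of URL-safe Base64 without padding, in the spirit of what the quoted text describes, could look like this. The function names b64url_encode and b64url_decode are made up here, and this is an illustration built on the standard base64 module rather than code from any JWS library.

```python
import base64


def b64url_encode(data: bytes) -> str:
    """URL-safe Base64 with all trailing '=' characters removed."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the padding implied by the length, then decode."""
    padded = text + "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(padded)


# The empty octet sequence encodes to the empty string.
assert b64url_encode(b"") == ""

# Round-trip inputs of several lengths, none of which keep any padding.
for n in range(8):
    data = bytes(range(n))
    token = b64url_encode(data)
    assert "=" not in token
    assert b64url_decode(token) == data
```

The decoder needs no separate record of how much padding was stripped, because the missing count follows from the length of the string modulo 4.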

I do something like this with java8+
private static String getBase64StringWithoutPadding(String data) {
if(data == null) {
return "";
}
Base64.Encoder encoder = Base64.getEncoder().withoutPadding();
return encoder.encodeToString(data.getBytes());
}
This method gets an encoder which leaves out padding.
As other answers have already mentioned, the padding can be recomputed from the string length and added back if you need it for decoding.

On Android you may run into trouble if you use the android.util.Base64 class, since it can't be used in plain unit tests, only in integration tests that run in the Android environment.
On the other hand, if you use java.util.Base64, the compiler warns you that your SDK level may be too low (below 26) to use it.
So I suggest Android developers use
implementation "commons-codec:commons-codec:1.13"
Encoding object
fun encodeObjectToBase64(objectToEncode: Any): String{
val objectJson = Gson().toJson(objectToEncode).toString()
return encodeStringToBase64(objectJson.toByteArray(Charsets.UTF_8))
}
fun encodeStringToBase64(byteArray: ByteArray): String{
return Base64.encodeBase64URLSafeString(byteArray).toString() // encode with no padding
}
Decoding to Object
fun <T> decodeBase64Object(encodedMessage: String, encodeToClass: Class<T>): T{
val decodedBytes = Base64.decodeBase64(encodedMessage)
val messageString = String(decodedBytes, StandardCharsets.UTF_8)
return Gson().fromJson(messageString, encodeToClass)
}
Of course, you may omit the Gson parsing and pass your String, converted to a ByteArray, straight into the method.
