ANTLR rule for numbers and hex - antlr4

I'm writing a parser for libretro datfiles. It's my first ANTLR grammar and parser, so please forgive the lack of best practices.
But man, how cool ANTLR is; I can't believe it was that simple to get me this far :)
game (
name "10-Yard Fight (World, set 1)"
year "1983"
developer "Irem"
rom ( name 10yard.zip size 62708 crc 8f401426 md5 040dc0827e184b8da9b91499f96c1fce sha1 8b15a2e54e9fdded3d61205bac6535ccc7b41271 )
)
And my grammar:
grammar Datafile;
game: GAME LP
gameBody+
RP;
gameBody: gameName
| year
| developer
| rom+
;
rom: ROM LP
size
| crc
| md5
| serial
| sha1
RP;
size: SIZE NUMBER;
crc: CRC HEX;
md5: MD5 HEX;
sha1: SHA1 HEX;
serial: SERIAL STRING;
gameName: NAME STRING;
year: YEAR STRING;
developer: DEVELOPER STRING;
//keywords
GAME: 'game';
ROM: 'rom';
NAME: 'name';
YEAR: 'year';
DEVELOPER: 'developer';
SIZE: 'size';
CRC: 'crc';
MD5: 'md5';
SHA1: 'sha1';
SERIAL: 'serial';
LP: '(';
RP: ')';
STRING : '"' ( '\\"' | . )*? '"';
WS : [ \t\r\n\u000C]+ -> skip;
NUMBER: DIGIT+ ([.,] DIGIT+)?;
HEX: [a-fA-F0-9]+;
fragment LETTER: [A-Za-z];
fragment DIGIT: [0-9];
So I'm not being very strict about capturing things like year/developer, using a generic STRING rule instead, because sometimes dates are strings in the dataset; I'll handle those in the parser logic.
But I bumped into a problem with the hex values, as they are not marked with a 0x prefix the way they are in certain languages. What's happening is that if I have a crc of 0001100, then the lexer captures it as a NUMBER and not a HEX.
Is there a way out of this, or do I just map size as well as crc/sha1/md5 to HEX and deal with it in the parser logic?
Another issue I bumped into was figuring out how to capture the filename (rom -> name), which could be any valid file name; I haven't been able to write a rule that captures that.
Hope someone can chime in :)

Without some special syntax for hex numbers, the Lexer will not be able to identify them as such. You only know that something should be interpreted as a hex number from the context. Parser rules give you context.
So:
hexNumber: HEX | NUMBER;
crc: CRC hexNumber;
md5: MD5 hexNumber;
sha1: SHA1 hexNumber;
Now you’ll have a HexNumberContext in your parse tree and will know how to interpret it.
You could also just do:
crc: CRC (HEX | NUMBER);
You can try both and look at the resulting parse tree and context classes to see which you prefer.
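As a rough sketch of that last step, assuming the Java target and ANTLR's usual generated names for this grammar (DatafileParser, DatafileBaseListener; adjust to whatever your build actually produces), a listener can always interpret the captured text as base-16, no matter which token the lexer happened to emit:
public class DatafileValueListener extends DatafileBaseListener {
    @Override
    public void exitCrc(DatafileParser.CrcContext ctx) {
        // Whether the lexer emitted HEX or NUMBER, the text is the same;
        // the parser rule supplies the context, so parse it as base-16 here.
        long crc = Long.parseLong(ctx.hexNumber().getText(), 16);
        System.out.printf("crc = %08x%n", crc);
    }
}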

I was thinking that you could use lexer modes to force the number to be lexed as a hex number. I'm not too happy about lexer modes as I'm still an ANTLR4 newbie, but I thought of something like this:
CRC : 'crc' -> pushMode(HexOnly);
MD5 : 'md5' -> pushMode(HexOnly);
SHA1: 'sha1' -> pushMode(HexOnly);
...
mode HexOnly;
HEX1: [a-fA-F0-9]+ -> popMode;
... any other HexOnly mode thing
In the example I used HEX1 because I find that when I have modes, invariably there is a token type that is also used elsewhere, and as I'm still learning to think "ANTLR4" I will hack my way to a solution until I learn a more elegant one.
I find lexer modes somewhat inelegant so far.
Cool, but they come with baggage.

Is this just a flawed grammar?

I was looking through a grammar for FOCAL and found someone had defined their numbers as follows:
number
: mantissa ('e' signed_)?
;
mantissa
: signed_
| (signed_ '.')
| ('.' signed_)
| (signed_ '.' signed_)
;
signed_
: PLUSMIN? INTEGER
;
PLUSMIN
: '+'
| '-'
;
I was curious because I thought this would mean that, for example, 1.-1 would get identified as a number by the grammar rather than subtraction. Would a branch with unsigned_ be worth it to prevent this issue? I guess this is more of a question for the author, but are there any benefits to structuring it this way (besides the obvious avoiding floats vs ints)?
It’s not necessarily flawed.
It does appear that it will recognize 1.-1 as a mantissa. However, that doesn’t mean that some post-parse validation doesn’t catch this problem.
It would be flawed if there’s an alternative, valid interpretation of 1.-1.
Sometimes, it’s just useful to recognize an invalid construct and produce a parse tree for “the only way to interpret this input”, and then you can detect it in a listener and give the user an error message that might be more meaningful than the default message that ANTLR would produce.
And, then again, it could also just be an oversight.
The `signed_` rule on the other hand, being:
signed_ : PLUSMIN? INTEGER;
Instead of
signed_ : PLUSMIN? INTEGER+;
does make this grammar somewhat suspect as a good example to work from.
Your analysis looks correct to me in saying that:
1.-1 is recognized as a number
a branch with unsigned_ could fix it
Saying it's "flawed" sounds like a value judgement, which doesn't seem relevant.
If it were for my own usage, I would prefer to:
recognize 0.-4 as an invalid number
recognize -.4 as a valid number
So I would prefer something like:
number
: signed_float ('e' signed_integer)?
;
signed_float
: PLUSMIN? unsigned_float
;
unsigned_float
: integer
| (integer '.')
| ('.' integer)
| (integer '.' integer)
;
signed_integer
: PLUSMIN? unsigned_integer
;
PLUSMIN
: '+'
| '-'
;

ANTLR v4: How do I capture an arbitrary trimmed string to the end of line/file after a certain token?

For curiosity's sake I'm learning ANTLR, in particular version 4, and I'm trying to create a simple grammar. I chose NES (Nintendo Entertainment System) Game Genie files for the very first attempt. Here is a sample Game Genie file for Jurassic Park found somewhere on the Internet:
GZUXXKVS Infinite ammo on pick-up
PAVPAGZE More bullets picked up from small dinosaurs
PAVPAGZA Fewer bullets picked up from small dinosaurs
GZEULOVK Infinite lives--1st 2 Levels only
ATVGZOSA Immune to most attacks
VEXASASA + VEUAXASA 3-ball bolas picked up
NEXASASA + NEUAXASA Explosive multi-shots
And here is a grammar I'm working on.
grammar NesGameGenie;
all: lines EOF;
lines: (anyLine? EOL+)* anyLine?;
anyLine: codeLine;
codeLine: code;
code: CODE (PLUS? CODE)*;
CODE: SHORT_CODE | LONG_CODE;
fragment SHORT_CODE: FF FF FF FF FF FF;
fragment LONG_CODE: FF FF FF FF FF FF FF FF;
fragment FF: [APZLGITYEOXUKSVN];
COMMENT: COMMENT_START NOEOL -> skip;
COMMENT_START: [#;];
EOL: '\r'? '\n';
PLUS: '+';
WS: [ \t]+ -> skip;
fragment NOEOL: ~[\r\n]*;
Well, it's ridiculously short and easy, but it still has two issues I can see:
The cheat descriptions cause recognition errors like line 1:16 token recognition error at: 'In' because there is no description rule provided in the grammar.
Adding the # symbol to a description will probably cause the rest of the line to be ignored. At least, AAAAAA Player #1 ammo only reports Player, and #1 ammo is unfortunately parsed as a comment, but I think it could be fixed once the description rule is introduced.
My previous attempts to add a description rule caused a lot of various errors, and the best I've found is a solution that produces no errors but is still not good:
...
codeLine: code description?;
...
description: PRINTABLE+;
...
PRINTABLE: [\u0020-\uFFFE];
...
Unfortunately every character is parsed as a single PRINTABLE, and what I'm looking for is a description rule that matches arbitrary text until the end of line (or file), including whitespace, but trimmed on the left and right. If I add + to the end of PRINTABLE, the whole document is considered invalid. I guess that PRINTABLE might be safely inlined into the description rule somehow, but description: ('\u0020' .. '\uFFFE')+; captures way more.
How should the description rule be declared so that it captures all characters to the end of the line right after the codes, trimming whitespace ([ \t]) on both the left and the right only? Simply speaking, I would like a grammar that parses into something like this (including the # character rather than treating it as a comment):
code=..., description="Infinite ammo on pick-up"
code=..., description="More bullets picked up from small dinosaurs"
code=..., description="Fewer bullets picked up from small dinosaurs"
code=..., description="Infinite lives--1st 2 Levels only"
code=..., description="Immune to most attacks"
code=..., description="3-ball bolas picked up"
code=..., description="Explosive multi-shots"
One more note, I'm using:
IntelliJ IDEA 2016.1.1 CE
IJ plugin: ANTLR v4 grammar plugin 1.8.1
IJ plugin: ANTLRWorks 1.3.1
Quite easy actually: just use lexer modes. Once you hit certain tokens, change the mode.
Here is the lexer grammar; the parser is easy based on that (filename is NesGameGenieLexer.g4):
lexer grammar NesGameGenieLexer;
CODE: [A-Z]+;
WS : [ ]+ -> skip, mode(COMMENT_MODE);
mode COMMENT_MODE;
PLUS: '+' (' ')* -> mode(DEFAULT_MODE);
fragment ANY_CHAR: [a-zA-Z_/0-9=.\-\\ ];
COMMENT: ANY_CHAR+;
NEWLINE: [\r\n] -> skip, mode(DEFAULT_MODE);
I've assumed that + can't be in comments. If you use the ANTLRWorks lexer debugger, you can see all the token types and token modes nicely highlighted.
And here is the parser grammar (filename is NesGameGenieParser.g4):
parser grammar NesGameGenieParser;
options {
tokenVocab=NesGameGenieLexer;
}
file: line+;
line : code comment
| code PLUS code comment;
code: CODE;
comment: COMMENT;
Here I've assumed that CODE is just a set of chars before PLUS, but obviously that's very easy to change :)
Having spent a whole sleepless night, I seem to have managed to write the lexer and parser grammars. No need to explain much; see the comments in the source code. In short:
The lexer:
lexer grammar NesGameGenieLexer;
COMMENT: [#;] ~[\r\n]+ [\r\n]+ -> skip;
CODE: (SHORT_CODE | LONG_CODE) -> mode(CODE_FOUND_MODE);
fragment SHORT_CODE: FF FF FF FF FF FF;
fragment LONG_CODE: FF FF FF FF FF FF FF FF;
fragment FF: [APZLGITYEOXUKSVN];
WS: [\t ]+ -> skip;
mode CODE_FOUND_MODE;
PLUS: [\t ]* '+' [\t ]* -> mode(DEFAULT_MODE);
// Skip inline whitespaces and switch to the description detection mode.
DESCRIPTION_LEFT_DELIMITER: [\t ]+ -> skip, mode(DESCRIPTION_FOUND_MODE);
NEW_LINE_IN_CODE_FOUND_MODE: [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode DESCRIPTION_FOUND_MODE;
// Greedily grab all non-CRLF characters and ignore trailing spaces - this is a trimming operation equivalent, I guess.
DESCRIPTION: ~[\r\n]*~[\r\n\t ]+;
// But then terminate the line and switch to the code detection mode.
// This operation is probably required because `DESCRIPTION: ... -> mode(CODE_FOUND_MODE)` doesn't seem to work
NEW_LINE_IN_DESCRIPTION_FOUND_MODE: [\r\n]+ -> skip, mode(DEFAULT_MODE);
The parser:
parser grammar NesGameGenieParser;
options {
tokenVocab = NesGameGenieLexer;
}
file
: line+
;
line
: code description?
| code (PLUS code)* description?
;
code
: CODE
;
description
: DESCRIPTION
;
It looks much more complicated than I thought it would be, but it seems to work exactly the way I want. Also, I'm not sure whether the grammars above are really well-written and idiomatic. Thanks to @cantSleepNow for the modes idea.
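For completeness, here is a minimal way to drive these two grammars and pull out the code/description pairs, assuming the Java target and ANTLR's usual generated class names (NesGameGenieLexer, NesGameGenieParser, NesGameGenieParserBaseListener); treat it as a sketch rather than tested code:
import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class GameGenieDemo {
    public static void main(String[] args) {
        String input = "GZUXXKVS Infinite ammo on pick-up\n"
                     + "VEXASASA + VEUAXASA 3-ball bolas picked up\n";
        NesGameGenieLexer lexer = new NesGameGenieLexer(CharStreams.fromString(input));
        NesGameGenieParser parser = new NesGameGenieParser(new CommonTokenStream(lexer));
        ParseTreeWalker.DEFAULT.walk(new NesGameGenieParserBaseListener() {
            @Override
            public void exitLine(NesGameGenieParser.LineContext ctx) {
                // One line may contain several codes joined by '+' and an optional description.
                List<String> codes = new ArrayList<>();
                for (NesGameGenieParser.CodeContext c : ctx.code()) {
                    codes.add(c.getText());
                }
                String description = ctx.description() != null ? ctx.description().getText() : "";
                System.out.println("codes=" + codes + ", description=\"" + description + "\"");
            }
        }, parser.file());
    }
}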

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointers greatly appreciated,
TIA - Alex
Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.
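If you want to see the effect of the reordering, a small token dump makes it visible. This is a sketch assuming the Java target, where the generated lexer for the combined grammar above would be named NOVIALexer:
import org.antlr.v4.runtime.*;

public class TokenDump {
    public static void main(String[] args) {
        NOVIALexer lexer = new NOVIALexer(
                CharStreams.fromString("INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES"));
        for (Token t : lexer.getAllTokens()) {
            // With FigurativeConstant declared before IdWord, SPACES is reported
            // as a FigurativeConstant token instead of an IdWord token.
            System.out.printf("%-20s %s%n",
                    NOVIALexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}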

Perl's default string encoding and representation

In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The \x{FB01} and \x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character é, which has the codepoint \x{E9}, is part of the string $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes why do I get the following behavior?
my $str = "Some arbitrary string\n";
if (Encode::is_utf8($str)) {
    print "YES str IS UTF8!\n";
}
else {
    print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?
In C, a string is a collection of octets, but Perl has two string storage formats:
String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7bab9c "\200"\0
CUR = 1
LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
CUR = 2
LEN = 12
The \x{FB01} and \x{E9} are code points.
Not quite; the numeric values inside the braces are codepoints. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the common one is the simple string literal. You might as well write:
use utf8;
my $string = "Can you ﬁnd my résumé?\n";
# ↑ ↑ ↑
And code points are encoded via an encoding scheme to a series of octets.
True, but so far your string is a string of characters, not a buffer of octets.
But how does this work?
Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.
Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret"; the details leak out occasionally.
Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.
UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
Encode::is_utf8
… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.
Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.
is_utf8 is a badly-named function that doesn't mean what you think it means or have anything to do with that. The answer to your question is that $string doesn't have an encoding, because it's not encoded. When you call Encode::encode with some encoding, the result of that will be a string that is encoded and has a known encoding.

Remove trailing "=" when base64 encoding

I am noticing that whenever I base64 encode a string, a "=" is appended at the end. Can I remove this character and then reliably decode it later by adding it back, or is this dangerous? In other words, is the "=" always appended, or only in certain cases?
I want my encoded string to be as short as possible, that's why I want to know if I can always remove the "=" character and just add it back before decoding.
The = is padding.
Wikipedia says:
An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes); these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream).
If you control the other end, you could remove it when in transport, then re-insert it (by checking the string length) before decoding.
Note that the data will not be valid Base64 in transport.
Also, another user pointed out (relevant to PHP users):
Note that in PHP base64_decode will accept strings without padding, hence if you remove it to process it later in PHP it's not necessary to add it back. – Mahn Oct 16 '14 at 16:33
So if your destination is PHP, you can safely strip the padding and decode without fancy calculations.
I wrote part of Apache's commons-codec-1.4.jar Base64 decoder, and in that logic we are fine without padding characters. End-of-file and End-of-stream are just as good indicators that the Base64 message is finished as any number of '=' characters!
The URL-Safe variant we introduced in commons-codec-1.4 omits the padding characters on purpose to keep things smaller!
http://commons.apache.org/codec/apidocs/src-html/org/apache/commons/codec/binary/Base64.html#line.478
I guess a safer answer is, "depends on your decoder implementation," but logically it is not hard to write a decoder that doesn't need padding.
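As a small illustration of that point, here is a sketch using commons-codec (the library mentioned above), which decodes unpadded input exactly the same as padded input; the class name and sample value are just for demonstration:
import org.apache.commons.codec.binary.Base64;

public class NoPaddingDecode {
    public static void main(String[] args) {
        // "fooba" encodes to "Zm9vYmE=" in standard Base64; commons-codec
        // decodes it the same way whether or not the trailing '=' is present.
        byte[] padded = Base64.decodeBase64("Zm9vYmE=");
        byte[] unpadded = Base64.decodeBase64("Zm9vYmE");
        System.out.println(new String(padded));   // fooba
        System.out.println(new String(unpadded)); // fooba
    }
}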
In JavaScript you could do something like this:
// if this is your Base64 encoded string
var str = 'VGhpcyBpcyBhbiBhd2Vzb21lIHNjcmlwdA==';
// make URL friendly:
str = str.replace(/\+/g, '-').replace(/\//g, '_').replace(/\=+$/, '');
// reverse to original encoding
if (str.length % 4 != 0) {
    str += ('===').slice(0, 4 - (str.length % 4));
}
str = str.replace(/-/g, '+').replace(/_/g, '/');
See also this Fiddle: http://jsfiddle.net/7bjaT/66/
= is added for padding. The length of a base64 string should be a multiple of 4, so one or two = characters are added as necessary.
Read: No, you shouldn't remove it.
On Android I am using this:
Global
String CHARSET_NAME ="UTF-8";
Encode
String base64 = new String(
Base64.encode(byteArray, Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP),
CHARSET_NAME);
return base64.trim();
Decode
byte[] bytes = Base64.decode(base64String,
Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP);
which is equivalent to this in Java:
Encode
private static String base64UrlEncode(byte[] input)
{
    Base64 encoder = new Base64(true);
    byte[] encodedBytes = encoder.encode(input);
    return StringUtils.newStringUtf8(encodedBytes).trim();
}
Decode
private static byte[] base64UrlDecode(String input) {
    byte[] originalValue = StringUtils.getBytesUtf8(input);
    Base64 decoder = new Base64(true);
    return decoder.decode(originalValue);
}
I never had problems with trailing "=", and I am using Bouncy Castle as well.
If you're encoding bytes (at fixed bit length), then the padding is redundant. This is the case for most people.
Base64 consumes 6 bits of input at a time and produces an 8-bit output character that only uses six bits' worth of combinations.
If your input is 1 byte (8 bits), you'll have an output of 12 bits, the smallest multiple of 6 that 8 fits into, with 4 bits extra. If your input is 2 bytes, you have to output 18 bits, with two bits extra. Matching multiples of six against multiples of eight, you can have a remainder of either 0, 2 or 4 bits.
The padding says to ignore those extra four (==) or two (=) bits; it is there to tell the decoder about them.
The padding isn't really needed when you're encoding bytes. A base64 decoder can simply ignore leftover bits that total less than 8 bits. In this case, you're best off removing it.
The padding might be of some use for streaming and arbitrary-length bit sequences, as long as they're a multiple of two. It might also be used for cases where people want to send only the last 4 bits when more bits remain but the remaining bits are all zero. Some people might want to use it to detect incomplete sequences, though it's hardly reliable for that. I've never seen this optimisation in practice. People rarely have these situations; most people use base64 for discrete byte sequences.
If you see answers suggesting to leave it on, that's not good advice if you're simply encoding bytes; it's enabling a feature for a set of circumstances you don't have. The only reason to keep it in that case might be to add tolerance for decoders that don't work without the padding. If you control both ends, that's a non-concern.
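To make those remainders concrete, here is a tiny sketch using java.util.Base64 showing how 1, 2 and 3 input bytes map to two, one and zero padding characters (the sample bytes are arbitrary):
import java.util.Base64;

public class PaddingDemo {
    public static void main(String[] args) {
        // 1, 2 and 3 input bytes leave 4, 2 and 0 spare bits in the last 6-bit group,
        // which standard Base64 marks with "==", "=" and no padding respectively.
        System.out.println(Base64.getEncoder().encodeToString(new byte[]{1}));        // AQ==
        System.out.println(Base64.getEncoder().encodeToString(new byte[]{1, 2}));     // AQI=
        System.out.println(Base64.getEncoder().encodeToString(new byte[]{1, 2, 3}));  // AQID
    }
}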
If you're using PHP the following function will revert the stripped string to its original format with proper padding:
<?php
$str = 'base64 encoded string without equal signs stripped';
$str = str_pad($str, strlen($str) + (4 - ((strlen($str) % 4) ?: 4)), '=');
echo $str, "\n";
Using Python you can remove base64 padding and add it back like this:
from math import ceil
stripped = original.rstrip('=')
original = stripped.ljust(ceil(len(stripped) / 4) * 4, '=')
Yes, there are valid use cases where padding is omitted from a Base 64 encoding.
The JSON Web Signature (JWS) standard (RFC 7515) requires Base 64 encoded data to omit padding. It expects:
Base64 encoding [...] with all trailing '=' characters omitted (as permitted by Section 3.2) and without the inclusion of any line breaks, whitespace, or other additional characters. Note that the base64url encoding of the empty octet sequence is the empty string. (See Appendix C for notes on implementing base64url encoding without padding.)
The same applies to the JSON Web Token (JWT) standard (RFC 7519).
In addition, Julius Musseau's answer has indicated that Apache's Base 64 decoder doesn't require padding to be present in Base 64 encoded data.
I do something like this with Java 8+:
private static String getBase64StringWithoutPadding(String data) {
    if (data == null) {
        return "";
    }
    Base64.Encoder encoder = Base64.getEncoder().withoutPadding();
    return encoder.encodeToString(data.getBytes());
}
This method gets an encoder which leaves out the padding.
As mentioned in other answers already, the padding can be added back by calculation if you need to decode the string later.
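For example, one way to do that calculation (a sketch; the class name and sample string are just for illustration) is to pad the length back up to the next multiple of 4 before decoding:
import java.util.Base64;

public class PaddingRoundTrip {
    public static void main(String[] args) {
        String unpadded = Base64.getEncoder().withoutPadding()
                .encodeToString("hello world".getBytes());
        // Re-append '=' until the length is a multiple of 4 again, then decode.
        StringBuilder padded = new StringBuilder(unpadded);
        while (padded.length() % 4 != 0) {
            padded.append('=');
        }
        System.out.println(new String(Base64.getDecoder().decode(padded.toString())));
    }
}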
For Android, you may have trouble if you want to use the android.util.Base64 class, since it doesn't let you run plain unit tests, only integration tests that use the Android environment.
On the other hand, if you use java.util.Base64, the compiler warns you that your SDK may be too low (below 26) to use it.
So I suggest Android developers use
implementation "commons-codec:commons-codec:1.13"
Encoding object
fun encodeObjectToBase64(objectToEncode: Any): String {
    val objectJson = Gson().toJson(objectToEncode).toString()
    return encodeStringToBase64(objectJson.toByteArray(Charsets.UTF_8))
}
fun encodeStringToBase64(byteArray: ByteArray): String {
    return Base64.encodeBase64URLSafeString(byteArray) // encode with no padding
}
Decoding to Object
fun <T> decodeBase64Object(encodedMessage: String, encodeToClass: Class<T>): T {
    val decodedBytes = Base64.decodeBase64(encodedMessage)
    val messageString = String(decodedBytes, StandardCharsets.UTF_8)
    return Gson().fromJson(messageString, encodeToClass)
}
Of course you may omit the Gson parsing and pass your String, transformed to a ByteArray, straight into the method.
