Split Erlang UTF8 binary by characters - string

How do I split an Erlang binary string, treating its data as UTF-8 characters?
Let's say we have a binary which should be split into two parts, where the first part contains the first two UTF-8 characters. Here are a few examples:
<<"ąčęė">> should become [<<"ąč">>, <<"ęė">>]
<<"あぁぅうぁ">> should become [<<"あぁ">>, <<"ぅうぁ">>]

To just split a UTF-8 encoded binary string into two parts, with the first part containing the first two characters and the second part the rest, you could use this function:
split_2(<<One/utf8, Two/utf8, Rest/binary>>) ->
    %% One and Two are now the Unicode codepoints of the first 2 characters.
    [<<One/utf8, Two/utf8>>, Rest].
Matching a binary segment with the utf8 type extracts the first UTF-8 encoded character and binds its Unicode codepoint as an integer, which is why we must rebuild the resultant binary of the first two characters. This function will fail if the binary does not start with two UTF-8 encoded characters.
The difference between a bitstring and a binary is that the size of a binary must be a multiple of 8 bits while a bitstring can be any size.

It's still not entirely clear to me, but I think this would do the trick (note that <<X:2/binary>> matches two bytes, not two characters; it only works here because each of these characters happens to be two bytes in UTF-8):
Eshell V6.2 (abort with ^G)
1> Input = <<"ąčęė">>.
<<"ąčęė">>
2> L = [X || <<X:2/binary>> <= Input].
[<<"ąč">>,<<"ęė">>]
3>
UPDATE: This one will split it into S, TheRest:
%% S is the number of bytes (not characters) you want
split_it(S, Bin) when S > 0 ->
    case Bin of
        <<P:S/binary, R/binary>> -> [P | split_it(infinity, R)];
        <<>> -> [];
        _ -> [Bin]
    end.

I happened to need a function like this, and here is what I ended up with:
trunc_utf8(Utf8s, Count) ->
    trunc_utf8(Utf8s, Count, <<>>).

trunc_utf8(<<>>, _Count, Acc) -> Acc;
trunc_utf8(_Utf8s, 0, Acc) -> Acc;
trunc_utf8(<<H/utf8, T/binary>> = _Utf8s, Count, Acc) ->
    trunc_utf8(T, Count - 1, <<Acc/binary, H/utf8>>).

Related

Binary Formatting Variables in TCL

I am trying to create a binary message to send over a socket, but I'm having trouble with the way TCL treats all variables as strings. I need to calculate the length of a string and know its value in binary.
set length [string length $message]
set binaryMessagePart [binary format s* { $length 0 }]
However, when I run this I get the error 'expected integer but got "$length"'. How do I get this to work and return the value for the integer 5 and not the char 5?
To calculate the length of a string, use string length. To calculate the length of a string in a particular encoding, convert the string to that encoding and use string length:
set enc "utf-8"; # Or whatever; you need to know this ahead of time for sanity's sake
set encoded [encoding convertto $enc $message]
set length [string length $encoded]
Note that the encoded length is in bytes, whereas the length prior to encoding is in characters. For some messages and some encodings, the difference can be substantial.
To compose a binary message with the length and the body of the message (a fairly common binary format), use binary format like this:
# Assumes the length is big-endian; for little-endian, use i instead of I
set binPart [binary format "Ia*" $length $encoded]
What you were doing wrong was using s*, which consumes a list of integers and produces a sequence of little-endian 16-bit values in the output string, while feeding it a list that was literally $length 0; braces suppress substitution, so the literal string $length was passed along, and that is not an integer (integers don't start with $). Using [list $length 0] to build the argument to s* would have worked, but that doesn't seem quite right for the context of the question.
In binary format, these are the common formats (there are many more):
a is for string data (mnemonically “ASCII”); this is binary string data, and you need to encode it first.
i and I are for 32-bit numbers (mnemonically “int” like in many programming languages, but especially C). Upper case is big-endian, lower case is little-endian.
s and S are for 16-bit numbers (mnemonically “short”).
c is for 8-bit numbers (mnemonically “char” from C).
w and W are for 64-bit numbers (mnemonically “wide integers”).
f and d are for IEEE binary floating point numbers (mnemonically “float” and “double” respectively, so 4 and 8 bytes).
All can be followed by an optional length, either a number or a *. For the number ones, instead of inserting a single number they insert a list of them (and so consume a list); numbers give fixed lengths, and * does “all the list”. For the string format indicator, a number uses a fixed number of bytes in the message (truncating or padding with zero bytes as necessary) and * does “all the string” (never truncating or padding).

Is there a way to check if a string is alphanumeric in erlang

I am collecting tweets from Twitter using Erlang and I am trying to save only the hashtags to a database. However, when I convert the bitstrings to list strings, all the non-Latin-letter tweets turn into strange symbols.
Is there any way to check if a string is only containing alphanumeric characters in erlang?
For Latin characters you can use this function:
is_alpha([Char | Rest]) when Char >= $a, Char =< $z ->
    is_alpha(Rest);
is_alpha([Char | Rest]) when Char >= $A, Char =< $Z ->
    is_alpha(Rest);
is_alpha([Char | Rest]) when Char >= $0, Char =< $9 ->
    is_alpha(Rest);
is_alpha([]) ->
    true;
is_alpha(_) ->
    false.
For other encodings, you can add clauses covering their character ranges.
There are three io_lib functions specifically for this:
io_lib:printable_list/1
io_lib:printable_latin1_list/1
io_lib:printable_unicode_list/1
Here is an example of one in use:
-spec show_message(WxParent, Message) -> ok
      when WxParent :: wx:wx_object(),
           Message :: unicode:chardata() | term().
show_message(WxParent, Message) ->
    Format =
        case io_lib:printable_unicode_list(Message) of
            true -> "~ts";
            false -> "~tp"
        end,
    Modal = wxMessageDialog:new(WxParent, io_lib:format(Format, [Message])),
    _ = wxMessageDialog:showModal(Modal),
    ok = wxMessageDialog:destroy(Modal).
Check out the io_lib docs: http://www.erlang.org/doc/man/io_lib.html#printable_list-1
Addendum
Because this subject isn't always easy to research in Erlang, a related but slightly broader Q&A might be of interest:
How to check whether input is a string in Erlang?
The easiest way is to use regular expressions.
StringAlphanum = "1234abcZXYM".
StringNotAlphanum = "1ZXYMÄ#kMp&?".
re:run(StringAlphanum, "^[0-9A-Za-z]+$").
>> {match,[{0,11}]}
re:run(StringNotAlphanum, "^[0-9A-Za-z]+$").
>> nomatch
You can easily make a function out of it...
isAlphaNum(String) ->
    case re:run(String, "^[0-9A-Za-z]+$") of
        {match, _} -> true;
        nomatch -> false
    end.
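The same regular-expression check translates directly to other languages; a minimal Go sketch with the same pattern (the alnum name is my own):

```go
package main

import (
	"fmt"
	"regexp"
)

// alnum matches strings consisting only of ASCII letters and digits,
// the same pattern used in the Erlang answer above.
var alnum = regexp.MustCompile(`^[0-9A-Za-z]+$`)

func main() {
	fmt.Println(alnum.MatchString("1234abcZXYM"))  // true
	fmt.Println(alnum.MatchString("1ZXYMÄ#kMp&?")) // false
}
```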
But, in my opinion, the better way would be to solve the underlying problem: the correct interpretation of Unicode binary strings.
If you want to represent Unicode characters correctly, do not use binary_to_list. Use the unicode module instead. Unicode binaries cannot be interpreted naively byte by byte; in UTF-8, for example, the most significant bits of the first byte determine whether a character spans multiple bytes.
I took the following example from this site; let's define a UTF-8 string:
Utf8String = <<195, 164, 105, 116, 105>>.
Interpreted naively as a list of bytes, it yields mojibake:
binary_to_list(Utf8String).
"Ã¤iti"
Interpreted with Unicode support:
unicode:characters_to_list(Utf8String, utf8).
"äiti"

How can I convert a character code to a string character in Lua?

How can I convert a character code to a string character in Lua?
E.g.
d = 48
-- this is what I want
str_d = "0"
You are looking for string.char:
string.char (···)
Receives zero or more integers. Returns a string with length equal to the number of arguments, in which each character has the internal numerical code equal to its corresponding argument.
Note that numerical codes are not necessarily portable across platforms.
For your example:
local d = 48
local str_d = string.char(d) -- str_d == "0"
For ASCII characters, you can use string.char.
For UTF-8 strings, you can use utf8.char (introduced in Lua 5.3) to get a character from its code point.
print(utf8.char(48)) -- 0
print(utf8.char(29790)) -- 瑞
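For comparison, Go expresses both conversions with a single rune-to-string conversion, since Go strings are UTF-8 encoded; a minimal sketch:

```go
package main

import "fmt"

func main() {
	// Codepoint to string, covering both the ASCII and the
	// multi-byte case (like Lua's string.char / utf8.char).
	fmt.Println(string(rune(48)))    // 0
	fmt.Println(string(rune(29790))) // 瑞
}
```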

Go's LeftStr, RightStr, SubStr

I believe there are no LeftStr(str,n) (take at most the n first characters), RightStr(str,n) (take at most the n last characters), and SubStr(str,pos,n) (take the first n characters after pos) functions in Go, so I tried to make my own:
// take at most n first characters
func Left(str string, num int) string {
	if num <= 0 {
		return ``
	}
	if num > len(str) {
		num = len(str)
	}
	return str[:num]
}

// take at most last n characters
func Right(str string, num int) string {
	if num <= 0 {
		return ``
	}
	max := len(str)
	if num > max {
		num = max
	}
	num = max - num
	return str[num:]
}
But I believe these functions will give incorrect output when the string contains Unicode characters. What's the fastest solution for these functions? Is using a for range loop the only way?
As already mentioned in the comments, combining characters, modifying runes, and other multi-rune "characters" can cause difficulties.
Anyone interested in Unicode handling in Go should probably read the Go Blog articles "Strings, bytes, runes and characters in Go" and "Text normalization in Go". In particular, the latter discusses the golang.org/x/text/unicode/norm package, which can help with some of this.
You can consider several increasingly accurate (or increasingly Unicode-aware) levels of splitting the first (or last) "n characters" from a string.
1. Just use n bytes. This may split in the middle of a rune, but it is O(1), very simple, and in many cases you know the input consists of only single-byte runes. E.g. str[:n].
2. Split after n runes. This may split in the middle of a character. It can be done easily, but at the expense of copying and converting, with just string([]rune(str)[:n]). You can avoid the conversion and copying by using the unicode/utf8 package's DecodeRuneInString (and DecodeLastRuneInString) functions to get the length of each of the first n runes in turn and then return str[:sum] (O(n), no allocation).
3. Split after the n'th "boundary". One way to do this is to use norm.NFC.FirstBoundaryInString(str) repeatedly, or norm.Iter, to find the byte position to split at, and then return str[:pos].
Consider the displayed string "cafés", which could be represented in Go code as "cafés", "caf\u00E9s", or "caf\xc3\xa9s", all of which result in the identical six bytes. Alternatively, it could be represented as "cafe\u0301s" or "cafe\xcc\x81s", which both result in the identical seven bytes.
The first "method" above may split those into "caf\xc3"+"\xa9s" and "cafe\xcc"+"\x81s".
The second may split them into "caf\u00E9"+"s" ("café"+"s") and "cafe"+"\u0301s" ("cafe"+"́s").
The third should split them into "caf\u00E9"+"s" and "cafe\u0301"+"s" (both shown as "café"+"s").
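The second approach (split after n runes, without the []rune copy) can be sketched with utf8.DecodeRuneInString; the LeftRunes name is my own:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// LeftRunes returns at most the n leading runes of s without
// allocating: it sums the byte widths of the first n runes
// and slices the original string at that byte position.
func LeftRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	pos := 0
	for i := 0; i < n && pos < len(s); i++ {
		_, size := utf8.DecodeRuneInString(s[pos:])
		pos += size
	}
	return s[:pos]
}

func main() {
	fmt.Println(LeftRunes("caf\u00E9s", 4))  // café
	fmt.Println(LeftRunes("cafe\u0301s", 4)) // cafe (the combining accent is rune 5)
}
```

As the output shows, this is still rune-based, not boundary-based: the decomposed form "cafe\u0301s" is split between the 'e' and its combining accent, which is exactly the caveat the second method carries.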

Erlang howto make a list from this binary <<"a,b,c">>

I have a binary <<"a,b,c">> and I would like to extract the information from this binary.
So I would like to have something like A=a, B=b and so on.
I need a general approach on this because the binary string always changes.
So it could be <<"aaa","bbb","ccc">>...
I tried to generate a list
erlang:binary_to_list(<<"a","b","c">>)
but I get string as a result.
"abc"
Thank you.
You did use the right method.
binary_to_list(Binary) -> [char()]
Returns a list of integers which correspond to the bytes of Binary.
There is no string type in Erlang: http://www.erlang.org/doc/reference_manual/data_types.html#id63119. The console just displays the lists in string representation as a courtesy, if all elements are in printable ASCII range.
You should read Erlang's "Bit Syntax Expressions" documentation to understand how to work on binaries.
Do not convert the whole binary into a list if you don't need it in list representation!
To extract the first three bytes you could use
<<A, B, C, Rest/binary>> = <<"aaa","bbb","ccc">>.
If you want to iterate over the binary data, you can use binary comprehension.
<< <<(F(X))>> || <<X>> <= <<"aaa","bbb","ccc">> >>.
Pattern matching is possible, too:
test(<<A, Tail/binary>>, Accu) -> test(Tail, Accu+A);
test(_, Accu) -> Accu.
882 = test(<<"aaa","bbb","ccc">>, 0).
Even for reading one UTF-8 character at once. So to convert a binary UTF-8 string into Erlang's "list of codepoints" format, you could use:
test(<<A/utf8, Tail/binary>>, Accu) -> test(Tail, [A|Accu]);
test(_, Accu) -> lists:reverse(Accu).
[97,97,97,600,99,99,99] = test(<<"aaa", 16#0258/utf8, "ccc">>, "").
(Note that <<"aaa","bbb","ccc">> = <<"aaabbbccc">>. Don't actually use the last code snippet; use the linked method instead.)
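For comparison, Go's []rune conversion performs the same UTF-8-to-codepoint decoding as the last Erlang function; a minimal sketch:

```go
package main

import "fmt"

func main() {
	s := "aaa\u0258ccc" // U+0258 (ɘ) occupies two bytes in UTF-8
	fmt.Println(len(s))    // 8 bytes
	fmt.Println([]rune(s)) // [97 97 97 600 99 99 99]
}
```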
