Strings in Erlang - what libraries and techniques should I be examining?

I am working on a project that will require internationalisation support down the track. I want to get started on the right foot with UTF support, and I was wondering what the best practice is for handling UTF in Erlang.
From my current research it seems there are a couple of issues with Erlang's built-in string handling for some use cases (JSON parsing being a good example).
I have been looking at Starling, and I read (somewhere) recently that it may be rolled into the standard Erlang release as the UTF 'standard'. Is this true? Are there other libraries or approaches I should be looking at?
From the comments:
EEP (Erlang Enhancement Proposal) 10, "Representing Unicode characters in Erlang", covers this in detail.

This page:
http://erlang.org/doc/highlights.html
...lists the highlights of release 5.7/OTP R13A. Note this passage:
1.2 Unicode support
Support for Unicode is implemented as described in EEP10. Formatting and reading of unicode data both from terminals and files is supported by the io and io_lib modules. Files can be opened in modes with automatic translation to and from different unicode formats. The module 'unicode' contains functions for conversion between external and internal unicode formats and the re module has support for unicode data. There is also language syntax for specifying string and character data beyond the ISO-latin-1 range.
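For a concrete feel of what that support provides, here is a minimal sketch of my own (not taken from the release notes; the module name is made up), assuming OTP R13A or later:
-module(utf_demo).
-export([demo/0]).

demo() ->
    %% String literals can contain code points beyond ISO-latin-1.
    Str = "h\x{E9}llo \x{263A}",
    %% Convert the internal list-of-code-points form to a UTF-8 binary and back.
    Bin = unicode:characters_to_binary(Str),
    Str = unicode:characters_to_list(Bin, utf8),
    %% The 't' modifier tells io to treat the data as Unicode; plain ~s assumes latin-1.
    io:format("~ts~n", [Str]),
    Bin.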
I don't like to make pronouncements on what best practices would be, but I often find it helpful to have a minimal, complete example to start generalizing from. Here's one that gets UTF-8 data into an Erlang application and sends it out again to a different context. Assuming you have a MySQL database with a table whose rows contain UTF-8 text, here's one way to get it out and pipe it to a web browser as JSON:
hg clone http://bitbucket.org/justin/webmachine/ webmachine-read-only
cd webmachine-read-only
make
./scripts/new_webmachine.erl mywebdemo /tmp
svn checkout http://erlang-mysql-driver.googlecode.com/svn/trunk/ erlang-mysql-driver-read-only
cd erlang-mysql-driver-read-only/src
cp * /tmp/mywebdemo/src
svn checkout http://mochiweb.googlecode.com/svn/trunk/ mochiweb-read-only
cp mochiweb-read-only/src/mochijson2.erl /tmp/mywebdemo/src
cd /tmp/mywebdemo
Edit src/mywebdemo_resource.erl so it looks like this:
-module(mywebdemo_resource).
-export([init/1, to_html/2]).
-include_lib("webmachine/include/webmachine.hrl").
init([]) -> {ok, undefined}.
to_html(ReqData, State) ->
    %% Add your connection string info here.
    mysql:start_link(pool_id, "database.host.com", 3306, "db_user", "db_password",
                     "db_name", fun(_A, _B, _C, _D) -> ouch end, utf8),
    {data, Res} = mysql:fetch(pool_id, "select * from table where IdWhatever = 13"),
    %% The pattern will need to be altered to match your table structure.
    [[_, Utf8Str, _]] = mysql:get_result_rows(Res),
    {mochijson2:encode({struct, [{Utf8Str, 100}]}), ReqData, State}.
Build everything and start the URL dispatcher:
make
./start.sh
Then execute the following in a web page (or something more convenient, like MozRepl):
var req = new XMLHttpRequest;
req.open('GET', "http://localhost:8000", false);
req.send(null);
eval("(" + req.responseText + ")");

As the previous poster mentioned, the latest release of Erlang supports UTF natively. If you can't use the latest, though, one thing I usually do is use binaries for string data. That keeps Erlang from mangling the bytes in a list, and it has the side effect of making lists of strings easier to handle as well.
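A small sketch of what that looks like (my own illustration; the module name is made up): a UTF-8 binary is a single term, so a list of strings stays a short list of binaries instead of flattening into one long list of integers.
-module(bin_strings).
-export([example/0]).

example() ->
    AsList   = "caf\x{E9}",                            %% list of integer code points
    AsBinary = unicode:characters_to_binary(AsList),   %% <<"caf", 195, 169>>, a single term
    Strings  = [AsBinary, <<"tea">>, <<"mate">>],      %% clearly a list of three strings
    {AsList, AsBinary, Strings}.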

Related

rsync vs fs.readStream - How to handle special characters

I'm using two approaches to backing up a database file: rsync, and a server-based API approach.
I'm getting slightly different results because certain high-numbered Unicode characters are handled differently, so the two backups end up just a little bit different.
The characters in question include ⸭ (U+2E2D), 猄 (U+7304) and 璣 (U+74A3), which make the voyage through rsync just fine but all become �� (two U+FFFD replacement characters) using the server/API approach.
Interestingly, not all higher-numbered Unicode characters get transformed into U+FFFD. 䋲 (U+42F2), the character at U+0698, and thousands of others are not converted and make it through just fine.
In fact there are only about 7 characters in the entire file that get transformed in transit.
I'm trying to get this to the point where there is no difference whatsoever.
Basically there's an occasional discrepancy in the handling of high-numbered Unicode.
In both cases the backup files created are UTF-8 with char(10) line feeds.
Here's the basic difference between the two backup approaches:
RSYNC APPROACH
rsync -avuP path/to/server/ActiveDb.sql path/to/Backup.sql
SERVER APPROACH
const Fs = require("fs");

const ThisStream = Fs.createReadStream("/path/to/ActiveDb.sql");
ThisStream.on("open", () => {
    ThisStream.pipe(Rr.Res); // Rr.Res: presumably the HTTP response object from the surrounding server code
});
On the backup machine
const Fs = require("fs");
const { spawn } = require("child_process");

const curl = spawn("curl", ["-d", "apicall=bkup", "https://dbserver.server"]);
curl.stdout.on("data", thisChunk => {
    Fs.appendFileSync("path/to/Backup.sql", thisChunk);
});

Adding json data to json column is adding escape characters

I am using a Postgres database. I have a table called names with a column named 'info' of type json. I am adding:
{ "info": "security" , description : "Sednit update: Analysis of Zebrocy: The Sednit group \u2013 also known as APT28, Fancy Bear, Sofacy or STRONTIUM \u2013 is a group of attackers operating since 2004, if not earlier, and whose main objective is to steal confidential information from specific targets.\n\nToward the end of 2015, we started seeing a new component deployed by the group; a downloader for the main Sednit backdoor, Xagent. Kaspersky mentioned this component for the first time in 2017 in their APT trend report and recently wrote an article where they quickly described it under the name Zebrocy.\n\nThis new component is a family of malware, comprising downloaders and backdoors written in Delphi and AutoIt. These components play the same role in the Sednit ecosystem as Seduploader; that of first-stage malware."}
Here I am using Node.js, with Sequelize as the ORM. When I save it to the table, I see "\\n" in place of "\n" and "\\u" in place of "\u". Can anyone help me avoid the characters being escaped when saving to the table?
In your JSON, description is of type string, so the newline/enter is converted to \n; that is the default behaviour, otherwise you would not get the newline back when you fetch the data again.
And \u is for Unicode; you are probably saving a smiley or other special character, so it gets converted to such an escape sequence.
So there is no bug; this is how it works.
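As an illustration of that point, here is a small sketch (mochijson2 is used only because it appeared earlier on this page; any JSON encoder behaves the same way): the backslash is produced when the value is serialized as a JSON string, not by the database.
-module(escape_demo).
-export([show/0]).

show() ->
    Value = {struct, [{<<"description">>, <<"line one\nline two">>}]},
    %% The encoded text carries an escaped form of the newline (e.g. \n)
    %% rather than the raw newline byte that is inside the binary above.
    iolist_to_binary(mochijson2:encode(Value)).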

How to check the available free space in Haxe?

I can't find a way to check the free space available on a device using Haxe, OpenFL, Lime or another library.
I would like to avoid downloading data that would exceed the size recommended for an app on each device.
What do you do to check that?
Try creating a file of that size! Then either delete it or reopen and write (not append) over its contents.
I don't know whether all platforms Haxe supports will work fine with this trick, but this algorithm is reported to work in many places and languages (I personally tested it in Ruby and saw the same suggestion for C++/.NET). To check whether X bytes of disk space are available:
open a new file for writing
seek X-1 bytes from the beginning
write a byte of data (whatever you want, 0, 42...)
close the file (probably unrelated to the task at hand, but don't forget to do that anyway)
If there's insufficient disk space, you'll likely get an exception at some point in this algorithm. You'll have to find out what errors to expect and process them properly.
Using ihx I've found that this works and requires nothing but the Haxe standard library:
haxe interactive shell v0.3.4
type "help" for help
>> import sys.io.*;
>> var f = File.write('loca', true)
sys.io.FileOutput : { __f => #abstract }
>> f.seek(39999, FileSeek.SeekBegin)
Void : null
>> f.writeByte(0)
Void : null
>> f.close()
Void : null
After these manipulations, I had a file named loca of exactly 40000 bytes in my working directory.
By the way, be careful when doing things like these in ihx since it re-runs the entire session with the last entered line appended each time.
Ongoing experimentation:
However, when there's insufficient disk space, it may not fail with errors. In this case you'll have to check the real size with sys.FileSystem.stat(path).size. And don't forget to delete the file if there's not enough space.
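For comparison, here is the same probe sketched in Erlang rather than Haxe (my own sketch; module and function names are made up), including the caveats above about cleaning up and about filesystems that allocate lazily:
-module(space_probe).
-export([has_space/2]).

%% has_space("/tmp/probe.bin", 40000) tries to back a 40000-byte file.
has_space(Path, Bytes) when Bytes > 0 ->
    Result = case file:open(Path, [write, raw]) of
                 {ok, Fd} ->
                     %% Write one byte at offset Bytes - 1, i.e. "seek and write".
                     %% On sparse filesystems this can succeed without really
                     %% reserving the space, so checking the resulting file size
                     %% afterwards is still worthwhile.
                     Ok = (file:pwrite(Fd, Bytes - 1, <<0>>) =:= ok),
                     ok = file:close(Fd),
                     Ok;
                 {error, _Reason} ->
                     false
             end,
    _ = file:delete(Path),
    Result.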

Perl program structure for parsing

I've got a question about program architecture.
Say you've got 100 different log files with different formats and you need to parse them and put that info into an SQL database.
My view of it is like this:
use a general config file like:
program1->name1("apache",/var/log/apache.log) (modulename,path to logfile1)
program2->name2("exim",/var/log/exim.log) (modulename,path to logfile2)
....
sqldb->configuration
use something like a module (1 file per program) type1.module (regexp, logstructure(somevariables), sql(tables and functions))
fork or thread processes (don't know what is better on Linux now) for different programs.
So the question is: is my view of this correct? Should I use one module per program (web/MTA/iptables), or is there a better way? I think some regexps would be the same, like date/time/IP/URL. What should I do with those? Or what have I missed?
example: exim4 MTA mainlog entry
2011-04-28 13:16:24 1QFOGm-0005nQ-Ig <= exim@mydomain.org.ua H=localhost (exim.mydomain.org.ua) [127.0.0.1]:51127 I=[127.0.0.1]:465 P=esmtpsa X=TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32 CV=no A=plain_server:spam S=763 id=1303985784.4db93e788cb5c@mydomain.org.ua T="test" from <exim@exim.mydomain.org.ua> for test@domain.ua
Everything of interest in that line is already parsed and will be put into the sqldb.incoming table. Right now I have a structure in Perl to hold every parsed variable, like $exim->{timestamp} or $exim->{host}->{ip}.
My program will do something like tail -f /file and parse it line by line.
Flexibility: let's say I want to add support for an Apache server (just timestamp, user IP and file downloaded). All I need to know is which logfile to parse, what the regexp should be and what the SQL structure should be. So I'm planning to handle this as a module: just fork or thread the main process with parameters (logfile, filetype). Maybe later I would add some options for what not to parse (maybe some log level is low and you just don't see much there).
I would do it like this:
Create a config file that is formatted like this: appname:logpath:logformatname
Create a collection of Perl classes that inherit from a base parser class.
Write a script which loads the config file and then loops over its contents, passing each iteration to its appropriate handler object.
If you want an example of steps 1 and 2, we have one in our project; see MT::FileMgr and MT::FileMgr::*.
The log-monitoring tool wots could do a lot of the heavy lifting for you here. It runs as a daemon, watching as many log files as you could want, running any combination of perl regexes over them and executing something when matches are found.
I would be inclined to modify wots itself (which its licence freely allows) to support a database write method - have a look at its existing handle_* methods.
Most of the hard work has already been done for you, and you can tackle the interesting bits.
I think File::Tail is a nice fit.
You can make an array of File::Tail objects and poll them with select like this:
use File::Tail;

# @files is the array of File::Tail objects mentioned above.
while (1) {
    ($nfound, $timeleft, @pending) =
        File::Tail::select(undef, undef, undef, $timeout, @files);
    unless ($nfound) {
        # timeout - do something else here, if you need to
    } else {
        foreach (@pending) {
            # here you can handle log messages depending on filename
            print $_->{"input"}." (".localtime(time).") ".$_->read;
        }
    }
}
(from the Perl File::Tail documentation)

TCL (thermal control language) [printer protocol] references

I'm working on support for TCL (thermal control protocol; stupid name, it's a printer protocol from FutureLogic), but I cannot find any resources about this protocol: what it is, how it works, nothing. On their site I only found this mention: http://www.futurelogic-inc.com/trademarks.aspx
Has anyone worked with it? Does anyone know where I can find the data sheet?
The protocol is documented on their website: http://www.futurelogic-inc.com/support/downloads/
If you are targeting the PSA66ST model, it supports a number of protocols: TCL, which is quite nice for delivering templated tickets, and line printing using the Epson ESC/P protocol.
This is all explained in the protocol document.
Oops, these links are incorrect and only correspond to marketing brochures. You will need to contact FutureLogic for the protocol documents, and probably also sign an NDA. Anyway, the information may guide you some more.
From what I can gather, it seems the FutureLogic thermal printers do not support general printing, but only printing using predefined templates stored in the printer's firmware. The basic command structure is a caret ^ followed by a one or two character command code, with arguments delimited using a pipe |, and the command ended with another caret ^. I've been able to reverse-engineer a few commands:
^S^ - Printer status
^Se^ - Extended printer status
^C|x|^ - Clear. Known arguments:
a - all
j - jam
^P|x|y0|...|yn|^ - Print fields y0 through yn using template x.
Data areas are defined in the firmware using a similar command format, command ^D|x|y0|...|yn|^, and templates are defined from data areas using command ^T|z|x0|...|xn|^.
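If it helps to see the framing spelled out, here is a tiny Erlang sketch of my own (Erlang only because that is where this page started; the module and function names are made up, and the format is just the reverse-engineered one described above, not anything from FutureLogic documentation):
-module(tcl_frame).
-export([command/2]).

%% command("S", [])                      -> "^S^"
%% command("P", ["A", "Hello", "World"]) -> "^P|A|Hello|World|^"
command(Code, []) ->
    lists:flatten(["^", Code, "^"]);
command(Code, Args) ->
    lists:flatten(["^", Code, "|", string:join(Args, "|"), "|^"]).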
