rsync vs fs.readStream - How to handle special characters - node.js

I'm using two approaches to back up a database file: rsync and a server-based API approach.
I'm getting slightly different results because of certain high-numbered Unicode characters, so the two backups end up just a little bit different.
The characters in question include ⸭ (U+2E2D), 猄 (U+7304), and 璣 (U+74A3), which make the voyage just fine through rsync, but which all become �� (two U+FFFD replacement characters) using the server / API approach.
Interestingly, not all higher-numbered Unicode characters get transformed into FFFDs. A 䋲 (U+42F2), a 0698, and thousands of others are not converted and make it through just fine.
In fact there are only about 7 characters in the entire file that get transformed in transit.
I'm trying to get this to the point where there is no difference whatsoever; basically there's an occasional discrepancy in handling high-numbered Unicode.
In both cases the backup files created are utf8 with char(10) line feeds.
Here's the basic difference between the two backup approaches:
RSYNC APPROACH
rsync -avuP path/to/server/ActiveDb.sql path/to/Backup.sql
SERVER APPROACH
const Fs = require("fs");
// Rr.Res is the HTTP response object for the incoming API request
const ThisStream = Fs.createReadStream("/path/to/ActiveDb.sql");
ThisStream.on("open", () => {
    ThisStream.pipe(Rr.Res);
});
On the backup machine:
const Fs = require("fs");
const spawn = require("child_process").spawn("curl", ["-d", "apicall=bkup", "https://dbserver.server"]);
spawn.stdout.on("data", thisChunk => {
    Fs.appendFileSync("path/to/Backup.sql", thisChunk);
});
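One plausible cause, offered as an assumption since the code above appends raw Buffers: if the chunks get decoded to strings per chunk anywhere along the way (for example via setEncoding or an implicit toString), any multi-byte UTF-8 sequence that straddles a chunk boundary decodes as replacement characters. A 3-byte character such as ⸭ split across two chunks yields exactly two U+FFFD characters, and since chunk boundaries fall at arbitrary byte offsets, this would also explain why only a handful of the file's high-numbered characters are affected. Here's a minimal sketch of the failure mode plus two boundary-safe alternatives (the paths and URL are the question's placeholders):

const Fs = require("fs");
const { spawn } = require("child_process");
const { StringDecoder } = require("string_decoder");

// Failure mode: decode each chunk independently, and a 3-byte character
// split across a chunk boundary turns into two U+FFFD replacement characters.
const bytes = Buffer.from("⸭", "utf8");   // <Buffer e2 b8 ad>
const chunkA = bytes.slice(0, 2);         // e2 b8 (incomplete sequence)
const chunkB = bytes.slice(2);            // ad (orphaned continuation byte)
console.log(chunkA.toString("utf8") + chunkB.toString("utf8")); // "��"

// Safe option 1: never decode at all; pipe the raw bytes straight to disk.
const curl = spawn("curl", ["-d", "apicall=bkup", "https://dbserver.server"]);
curl.stdout.pipe(Fs.createWriteStream("path/to/Backup.sql"));

// Safe option 2: if decoding is genuinely needed, StringDecoder buffers
// partial sequences across chunk boundaries instead of mangling them.
const decoder = new StringDecoder("utf8");
console.log(decoder.write(chunkA) + decoder.write(chunkB)); // "⸭"

Either way the bytes that reach disk are identical to the source, which is what rsync already guarantees.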

Related

Linux command file -i returns wrong value charset=unknown-8bit for a windows-1252 encoded file

I'm using nodejs and iconv-lite to create an HTTP response file in XML with charset windows-1252, but the file -i command cannot identify it as windows-1252.
Server side:
r.header('Content-Disposition', 'attachment; filename=teste.xml');
r.header('Content-Type', 'text/xml; charset=iso8859-1');
r.write(ICONVLITE.encode(`<?xml version="1.0" encoding="windows-1252"?><x>€Àáção</x>`, "win1252")); // euro symbol and Portuguese accented vowels
r.end();
The browser downloads the file and then I check it in Ubuntu 20.04 LTS:
file -i teste.xml
/tmp/teste.xml: text/xml; charset=unknown-8bit
When I use gedit to open it, the accented vowels appear fine but the euro symbol does not (all characters from 128 to 159 get messed up).
I checked in a Windows 10 VM and there it all goes well. In both Windows and Linux web browsers it also displays fine.
So, is it a problem with the file command? How do I check the right charset of a file in Linux?
Thank you
EDIT
The result file can be downloaded here.
2nd EDIT
I found one error! The code line:
r.header('Content-Type', 'text/xml; charset=iso8859-1');
must be:
r.header('Content-Type', 'text/xml; charset=Windows-1252');
It's important to understand what a character encoding is and isn't.
A text file is actually just a stream of bits; or, since we've mostly agreed that there are 8 bits in a byte, a stream of bytes. A character encoding is a lookup table (and sometimes a more complicated algorithm) for deciding what characters to show to a human for that stream of bytes.
For instance, the character "€" encoded in Windows-1252 is the string of bits 10000000. That same string of bits will mean other things in other encodings - most encodings assign some meaning to all 256 possible bytes.
If a piece of software knows that the file is supposed to be read as Windows-1252, it can look up a mapping for that encoding and show you a "€". This is how browsers are displaying the right thing: you've told them in the Content-Type header to use the Windows-1252 lookup table.
Once you save the file to disk, that "Windows-1252" label from the Content-Type header isn't stored anywhere. So any program looking at that file can see that it contains the string of bits 10000000, but it doesn't know what mapping table to look that up in. Nothing you do in the HTTP headers is going to change that - none of those are going to affect how it's saved on disk.
In this particular case the "file" command could look at the "encoding" marker inside the XML document, and find the "windows-1252" there. My guess is that it simply doesn't have that functionality. So instead it uses its general logic for guessing an encoding: it's probably something ASCII-compatible, because it starts with the bytes that spell <?xml in ASCII; but it's not ASCII itself, because it has bytes outside the range 00000000 to 01111111; anything beyond that is hard to guess, so output "unknown-8bit".
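The lookup-table idea is easy to see from Node itself; here's a small sketch using the same iconv-lite package as above (the byte values are standard for these encodings):

const iconv = require("iconv-lite");

// The single byte 0x80 means different things under different lookup tables.
const byte = Buffer.from([0x80]);
console.log(iconv.decode(byte, "win1252"));   // "€" - the euro sign
console.log(iconv.decode(byte, "iso88591"));  // U+0080, an invisible C1 control character
console.log(byte.toString("utf8"));           // "�" - 0x80 alone is not valid UTF-8

Windows-1252 and ISO 8859-1 agree everywhere except bytes 128 to 159, which is exactly the range that looked messed up in gedit.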

How Do I resolve "Illuminate\Queue\InvalidPayloadException: Unable to JSON encode payload. Error code: 5"

Trying out the queue system for a better user upload experience with Laravel-Excel.
.env was changed from 'sync' to 'database' and migrations run. All the necessary use statements are in place, yet the error above persists.
The exact error happens here:
Illuminate\Queue\Queue.php:97
$payload = json_encode($this->createPayloadArray($job, $queue, $data));
if (JSON_ERROR_NONE !== json_last_error()) {
    throw new InvalidPayloadException(
If I drop ShouldQueue, the file imports perfectly in-session (it's a large file, so there's a long wait for the user).
I've read many Stack Overflow, GitHub etc. comments on this, but I don't have the technical skills to deep-dive and fix my particular situation (most of them speak of UTF-8, but I don't know if that's an issue here; I changed the Excel save format to UTF-8 but it didn't fix it).
PS. Whilst running the migration, I got the error:
SQLSTATE[42000]: Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes (SQL: alter table `jobs` add index `jobs_queue_index`(`queue`))
I bypassed it by dropping the 'add index' part, so my jobs table is not indexed on queue, but I don't feel this is the cause.
One thing you can do when looking into json_encode() errors is to use the json_last_error_msg() function, which will give you a somewhat more readable error message.
In your case you're getting a '5' back, which is the JSON_ERROR_UTF8 error code. The error message back for this is a slightly more informative one:
'Malformed UTF-8 characters, possibly incorrectly encoded'
So we know it's encountering non-UTF-8 characters, even though you're saving the file specifically with UTF-8 encoding. At first glance you might think you need to convert the encoding yourself in code (like this answer), but in this case, I don't think that'll help. For Laravel-Excel, this seems to be a limitation of trying to queue-read .xls files - from the Laravel-Excel docs:
You currently cannot queue xls imports. PhpSpreadsheet's Xls reader contains some non-utf8 characters, which makes it impossible to queue.
In this case you might be stuck with a slow, non-queueable option, or need to convert your spreadsheet into a queueable format e.g. .csv.
The key length error on running the migration is unrelated. It has been around for a while and is a side-effect of using an older version of MySQL/MariaDB. Check out this answer and the Laravel documentation around index lengths - you need to add this to your AppServiceProvider::boot() method:
Schema::defaultStringLength(191);

The XML parser detected error code 302

I am using the XML-INTO op-code to parse a web service request. Every now and then I get errors in the logs
(RNX0351 - "The XML parser detected error code 302").
The help for a 302 is
302 The parser does not support the requested CCSID value or
the first character of the XML document was not '<'
To the best of my knowledge, the first character is "<", and the request is generated from a previous web service call, so I would be very surprised if the CCSID had changed.
The error is repeatable for a specific query, so it is almost certainly data related; I am just unsure how I would go about identifying the offending item.
Any thoughts on how to determine the issue, or better yet, how to overcome it?
cheers
CCSID is an AS400/iSeries/Power Systems attribute, and it applies across the whole IFS. It's like a declaration of what's inside the file, or in other words what its internal encoding "should be".
The encoding of the data content and the file's declared CCSID (the envelope) are supposed to match, and the box uses this attribute to display and handle the corresponding characters.
It sounds like you receive data in one encoding, but the file's CCSID doesn't match it.
Try changing the CCSID on your file (only the envelope), e.g. 37 (American), 500 (Latin-1), 819 (UTF-8), 850 (DOS), 1252 (Windows), and display the file afterwards. You can check first using ls -Sla yourfile in QSH or QP2TERM, or EDTF as well. CHGATTR allows you to change the CCSID, as does setccsid in QSH (again).
This approach helped me find related issues. Remember that although data may be visible on the four hundred, it may not be visible through a shared folder in Windows; that means the file CCSID and the content encoding don't match.
Hope it helps.
I've seen this error with XML data uploaded to AS400/iSeries/IBM i with FTP and CCSID 819 (ISO 8859-1 ASCII), where the file had some binary garbage in its first few positions. Changing the encoding to CCSID 1208 (UTF-8 with IBM PUA) using FTP "quote type c 1208" cleared the problem and XML-INTO was successful.
So, the suggestion for XML parser error 302 received when using XML-INTO is to look at the file (wrklnk ...), and if the first character is not "<" but some binary garbage, try CCSID 1208 for UTF-8.
Statements in this answer about what 819 is and which CCSID represents UTF-8 do not agree with the previous answer, but they are correct according to IBM documentation:
https://www-01.ibm.com/software/globalization/ccsid/ccsid819.html
https://www-01.ibm.com/software/globalization/ccsid/ccsid1208.html
I worked on this problem for a couple of hours; for me the solution was to use the option ccsid=UCS2 when using a data structure or variable to store the XML.
Something like this:
XML-INTO customer %XML( xmlSource : 'ccsid=UCS2');
I have the program running with CCSID 870; every conversion of the CCSID on the xmlSource field didn't work. The strange thing is that when I use the file with CCSID 850, everything works fine.
I mention that because this is the first page you find when searching for this problem. Maybe this helps someone.

Apache userdir + suEXEC + fcgid doesn't recognise dot separated useraccounts

I've setup Apache with suEXEC, fcgid and userdir to enhance overall website security.
Everything works except for user accounts with a "." in their account names. Before using suEXEC and fcgid this used to work, although that practice was discouraged many years ago.
For example: mydomain.com/~mytest/ works
mydomain.com/~my.test/ doesn't work
The error message that I get is "Bad Request Your browser sent a request that this server could not understand."
Is there a quick workaround for this, or am I doomed to recreate all the accounts without any separator in the account names?
Historically usernames were up to 8 characters long, started with a letter, and contained only lower case letters, underscore, and numbers. Some systems still make this assumption, and that is probably what is catching you out here.
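For what it's worth, that historical rule is easy to state as a pattern; here's a quick sketch (the regex is my restatement of the rule above, not anything Apache itself documents):

// Historical Unix username rule: 1-8 characters, starting with a lowercase
// letter, followed by lowercase letters, digits, or underscores. No dots.
const legacyUser = /^[a-z][a-z0-9_]{0,7}$/;
console.log(legacyUser.test("mytest"));   // true
console.log(legacyUser.test("my.test"));  // false - the dot breaks it

"my.test" fails exactly the kind of check that the stricter suEXEC/fcgid path seems to be applying.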

Strings in Erlang - what libraries and techniques should I be examining?

I am working on a project that will require internationalisation support down the track. I want to get started on the right foot with UTF support, and I was wondering what the best practice for handling UTF in Erlang is?
From my current research it seems there are a couple of issues with Erlang's built-in string handling for some use cases (JSON parsing being a good example).
I have been looking at Starling and read (somewhere) recently that it is possibly going to be rolled into the standard Erlang release as the UTF 'standard'. Is this true? Are there other libraries or approaches I should be looking at?
From the comments:
EEP (Erlang Enhancement Proposal) 10 details Representing Unicode characters in Erlang
This page:
http://erlang.org/doc/highlights.html
...lists highlights of release 5.7/OTP R13A. Note this passage:
1.2 Unicode support
Support for Unicode is implemented as described in EEP10. Formatting and reading of unicode data both from terminals and files is supported by the io and io_lib modules. Files can be opened in modes with automatic translation to and from different unicode formats. The module 'unicode' contains functions for conversion between external and internal unicode formats and the re module has support for unicode data. There is also language syntax for specifying string and character data beyond the ISO-latin-1 range.
I don't like to make pronouncements on what best practices would be, but I often find it helpful to have a minimal, complete example to start to generalize from. Here's one for getting UTF-8 into an Erlang application and sending it out again to a different context. Assuming you have a MySQL database with a table row field containing utf8 characters, here's one way to get it out and pipe it to a web browser as JSON:
hg clone http://bitbucket.org/justin/webmachine/ webmachine-read-only
cd webmachine-read-only
make
./scripts/new_webmachine.erl mywebdemo /tmp
svn checkout http://erlang-mysql-driver.googlecode.com/svn/trunk/ erlang-mysql-driver-read-only
cd erlang-mysql-driver-read-only/src
cp * /tmp/mywebdemo/src
svn checkout http://mochiweb.googlecode.com/svn/trunk/ mochiweb-read-only
cp mochiweb-read-only/src/mochijson2.erl /tmp/mywebdemo/src
cd /tmp/mywebdemo
Edit src/mywebdemo_resource.erl so it looks like this:
-module(mywebdemo_resource).
-export([init/1, to_html/2]).
-include_lib("webmachine/include/webmachine.hrl").

init([]) -> {ok, undefined}.

to_html(ReqData, State) ->
    %% add your connection string info
    mysql:start_link(pool_id, "database.host.com", 3306, "db_user", "db_password", "db_name", fun(_A, _B, _C, _D) -> ouch end, utf8),
    {data, Res} = mysql:fetch(pool_id, "select * from table where IdWhatever = 13"),
    %% pattern will need to be altered to match your table structure
    [[_, Utf8Str, _]] = mysql:get_result_rows(Res),
    {mochijson2:encode({struct, [{Utf8Str, 100}]}), ReqData, State}.
Build everything and start the url dispatcher:
make
./start.sh
Then execute the following in a web page (or something more convenient, like MozRepl):
var req = new XMLHttpRequest();
req.open('GET', "http://localhost:8000", false);
req.send(null);
eval("(" + req.responseText + ")");
As the previous poster mentioned, the latest release of Erlang supports Unicode natively. If you can't use the latest, though, one thing I usually do is use binaries for string data. It keeps Erlang from mangling the bytes in a list, and it has the side effect of making lists of strings easier to handle as well.
