I export a CSV in Scala/Spray and it works nicely on my Windows machine but fails on a Linux machine.
The responses from both OSes are identical:
Access-Control-Allow-Credentials:true
Access-Control-Allow-Headers:X-Requested-With, Cache-Control, Pragma, Origin, Authorization, Content-Type, Auth-Token
Access-Control-Allow-Methods:GET, POST, DELETE, OPTIONS, PUT
Access-Control-Allow-Origin:*
Access-Control-Expose-Headers:Auth-Token
Content-Disposition:attachment; filename=Enter report title.csv
Content-Length:229
Content-Type:text/csv; charset=ISO-8859-1
Date:Fri, 07 Feb 2014 22:17:40 GMT
Server:spray-can/1.2.0
I am wondering how the OS can make a difference here.
When exporting from Linux after the jar is deployed, diacritics are replaced with strange characters.
For instance, Café macchiato
is fine when exporting from Windows, but it appears as Café macchiato
when exporting from Linux.
To help Excel recognize the character encoding, you can add a UTF-8 BOM to the beginning of the file. For example:
import java.io.FileOutputStream

def prepareBomOutputStream(outputFile: String): FileOutputStream = {
  val os = new FileOutputStream(outputFile)
  // Write the UTF-8 byte order mark (0xEF 0xBB 0xBF) before any content
  os.write(0xEF)
  os.write(0xBB)
  os.write(0xBF)
  os
}
You can also check whether you get the exact same encoding in both cases, and not a subset of it. For instance, on Windows you might get ISO-8859-15 instead. You can most likely set the encoding explicitly in your CSV export code/library. To check the encoding on Linux you can use:
$ file -ib /tmp/test.csv
text/plain; charset=utf-8
or even something like hexdump.
Please never, ever use Excel for text-oriented files; it screws things up. Use an editor like vim or Notepad++, where you can inspect the bytes and actually see whether your data is correct or not.
Related
This question already has an answer here:
WinHttpRequest gzip response parsing
(1 answer)
Closed 1 year ago.
I am using Excel 2007 (12.0.4518.1014)
I have been using the WinHttpRequest Object to perform API GET Requests on a web service that hosts data for me.
Everything else works properly and it grabs the JSON formatted data from the web service and puts it into a string with the .ResponseText property.
The issue I am having is that inside that string, all of the Unicode characters are turned into gibberish like â?? instead of ✓ (U+2713), which means that when I do MyRange.Value = .ResponseText the cell value becomes â??.
If I set the GET request to ask for XML format, I get �?? instead of ✓.
I have confirmed, by repeating the GET request in Chrome, that the web service is outputting the correct Unicode symbol, and Chrome was able to show me ✓. So this is an issue with VBA or WinHttp.
Excel by itself is able to generate Unicode symbols, and VBA can do it as well with ChrW(10003).
How can I preserve Unicode symbols during a GET request? Is it possible with WinHttp, or do I need to change methods?
Edit:
Here are the headers during a standard response:
{
"access-control-allow-headers": "Content-Type",
"access-control-allow-methods": "GET, POST, PUT, DELETE, OPTIONS",
"access-control-allow-origin": "*",
"cache-control": "private",
"content-encoding": "gzip",
"content-security-policy": "frame-ancestors 'self', default-src * 'unsafe-inline' 'unsafe-eval' data: blob:;",
"content-type": "application/json",
"date": "Wed, 23 Jun 2021 18:08:53 GMT",
"expect-ct": "max-age=0;",
"referrer-policy": "strict-origin-when-cross-origin",
"strict-transport-security": "max-age=31536000; includeSubDomains; preload",
"vary": "Accept-Encoding",
"x-content-type-options": "nosniff",
"x-frame-options": "SAMEORIGIN",
"x-stackifyid": "V2|80002f92-0000-3100-b63f-84710c7967bb|C61313|CD10436"
}
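One detail worth noting in those headers: content-encoding is gzip, so the raw body is compressed and must be gunzipped before the bytes are decoded as UTF-8; skipping or misordering either step garbles exactly the non-ASCII characters. A sketch of the correct two-step round trip (Python here only to show the order of operations, not VBA):

```python
import gzip

# Simulate a gzipped UTF-8 response body containing the checkmark
body = gzip.compress("✓".encode("utf-8"))
# Correct order: decompress first, then decode the resulting bytes
text = gzip.decompress(body).decode("utf-8")
print(text)  # ✓
```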
Update: RESOLVED!
I have solved my problem with advice from #GSerg and wonderful insight from #JoelCoeHoorn. I will write up how it was solved here, since my question was closed.
WinHttpRequest was swapped out for an XMLHTTP object. This object can be used in VBA with commands similar to WinHttpRequest, as shown in the link about halfway down the page, but the XMLHTTP object was able to return Unicode characters with no issues.
To use it in VBA, you can create it with the line:
Dim http As Object
Set http = CreateObject("Microsoft.XMLHTTP")
And then you're ready to go with .open and .setRequestHeader and .Send similar to the WinHttpRequest object.
I know of five ways this can happen relative to an HTTP transaction:
The response has a header that declares which specific encoding is used. If the chosen encoding cannot represent all the code points used in the text, this is what you get.
This is also what you get if the text of the response is set directly and not mapped to the encoding specified in the header, so the encoding says the text should be different than what it is.
For historical reasons, some encodings are system dependent: the upper region of the encoding depends on the locally installed language packs/settings. So you can see this effect if the header chooses a system-specific encoding and the text is set on a system where this upper region is interpreted differently than on the client, even though both ends used the same encoding.
The fourth way this can happen is with UTF-8, when a byte order mark is used incorrectly, ignored, or interpreted as text.
Finally (and this is the most likely of these options to fit your situation), this can happen when an encoding is used in one place that is not supported in the other. VBA pre-dates the widespread adoption of Unicode and does not have good Unicode support, especially older versions of VBA, like you might encounter in, say, the long-unsupported Excel 2007.
These problems all tend to manifest only for Unicode characters and leave simple Latin characters alone, because many encodings handle the simple Latin characters in exactly the same way.
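The first two mechanisms are easy to see at the byte level: the UTF-8 encoding of ✓ (U+2713), read back through the Windows-1252 table, produces exactly the â-prefixed gibberish from the question. A small illustration (Python, used here only to demonstrate the mechanism):

```python
raw = "✓".encode("utf-8")     # U+2713 -> the three bytes 0xE2 0x9C 0x93
wrong = raw.decode("cp1252")  # each byte looked up in the Windows-1252 table
print(wrong)  # âœ“  (shown as â?? where the terminal lacks the glyphs)
```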
Using Node.js and iconv-lite to create an HTTP response file in XML with charset windows-1252, the file -i command cannot identify it as windows-1252.
Server side:
const ICONVLITE = require('iconv-lite');
// r is the HTTP response object inside the request handler
r.header('Content-Disposition', 'attachment; filename=teste.xml');
r.header('Content-Type', 'text/xml; charset=iso8859-1');
r.write(ICONVLITE.encode(`<?xml version="1.0" encoding="windows-1252"?><x>€Àáção</x>`, "win1252")); // euro sign and Portuguese accented vowels
r.end();
The browser downloads the file and then I check it on Ubuntu 20.04 LTS:
file -i teste.xml
/tmp/teste.xml: text/xml; charset=unknown-8bit
When I use gedit to open it, the accented vowels appear fine, but the euro symbol does not (all characters from 128 to 159 get messed up).
I checked in a Windows 10 VM and there everything goes well. In both Windows and Linux web browsers, it also all shows fine.
So, is it a problem with the file command? How can I check the right charset of a file on Linux?
Thank you
EDIT
The result file can be obtained here
2nd EDIT
I found one error! The code line:
r.header('Content-Type', 'text/xml; charset=iso8859-1');
must be:
r.header('Content-Type', 'text/xml; charset=Windows-1252');
It's important to understand what a character encoding is and isn't.
A text file is actually just a stream of bits; or, since we've mostly agreed that there are 8 bits in a byte, a stream of bytes. A character encoding is a lookup table (and sometimes a more complicated algorithm) for deciding what characters to show to a human for that stream of bytes.
For instance, the character "€" encoded in Windows-1252 is the string of bits 10000000. That same string of bits will mean other things in other encodings - most encodings assign some meaning to all 256 possible bytes.
If a piece of software knows that the file is supposed to be read as Windows-1252, it can look up a mapping for that encoding and show you a "€". This is how browsers are displaying the right thing: you've told them in the Content-Type header to use the Windows-1252 lookup table.
Once you save the file to disk, that "Windows-1252" label from the Content-Type header isn't stored anywhere. So any program looking at the file can see that it contains the string of bits 10000000, but it doesn't know which mapping table to look that up in. Nothing you do in the HTTP headers is going to change that; none of them affect how the file is saved on disk.
In this particular case the "file" command could look at the "encoding" marker inside the XML document, and find the "windows-1252" there. My guess is that it simply doesn't have that functionality. So instead it uses its general logic for guessing an encoding: it's probably something ASCII-compatible, because it starts with the bytes that spell <?xml in ASCII; but it's not ASCII itself, because it has bytes outside the range 00000000 to 01111111; anything beyond that is hard to guess, so output "unknown-8bit".
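To make the byte-level point concrete: in Windows-1252 the euro sign occupies the single byte 0x80, a slot that ISO-8859-1 assigns to an invisible C1 control character, which is exactly why the 128-159 range "gets messed up" under the wrong label. A quick sketch (Python, just to demonstrate the mapping):

```python
euro = "€".encode("cp1252")
print(euro)                                 # b'\x80'
# ISO-8859-1 maps byte 0x80 to the control character U+0080, not €
print(euro.decode("iso-8859-1") == "\x80")  # True
```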
I'm using two approaches to back up a database file: rsync and a server-based API approach.
I'm getting slightly different results because of certain high-numbered Unicode characters, so the two backups are just a little bit different.
The characters in question are, in one case, ⸭ (U+2E2D), 猄 (U+7304), and 璣 (U+74A3), which make the voyage just fine through rsync, but which all become �� (two U+FFFD characters) using the server/API approach.
Interestingly, not all higher-numbered Unicode characters get transformed into U+FFFDs. A 䋲 (U+42F2), a ژ (U+0698), and thousands of others are not converted and make it through just fine.
In fact, there are only about 7 characters in the entire file that get transformed in transit.
I'm trying to get this to the point where there is no difference whatsoever.
Basically, there's an occasional discrepancy in handling high-numbered Unicode.
In both cases the backup files created are UTF-8 with char(10) line feeds.
Here's the basic difference between the two backup approaches:
RSYNC APPROACH
rsync -avuP path/to/server/ActiveDb.sql path/to/Backup.sql
SERVER APPROACH
const Fs = require("fs");
const ThisStream = Fs.createReadStream("/path/to/ActiveDb.sql");
ThisStream.on("open", () => {
    ThisStream.pipe(Rr.Res); // Rr.Res is the outgoing HTTP response
});
On the backup machine
const spawn = require("child_process").spawn("curl", ["-d", "apicall=bkup", "https://dbserver.server"]);
spawn.stdout.on("data", thisChunk => {
    Fs.appendFileSync("path/to/Backup.sql", thisChunk);
});
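One plausible mechanism for the pairs of U+FFFDs (an assumption about the pipeline, not a confirmed diagnosis): if the API path ever converts bytes to a string chunk by chunk, a chunk boundary that falls inside a multi-byte UTF-8 sequence yields replacement characters, while characters that never straddle a boundary survive. Sketched in Python for clarity:

```python
data = "⸭".encode("utf-8")           # U+2E2D is three bytes: e2 b8 ad
chunk1, chunk2 = data[:2], data[2:]  # a chunk boundary mid-character
garbled = (chunk1.decode("utf-8", "replace")
           + chunk2.decode("utf-8", "replace"))
print(garbled)  # �� – two U+FFFD characters, as in the question
```

Piping raw buffers end to end (as rsync effectively does) avoids the problem, because the bytes are never decoded mid-stream.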
Hello everyone, I am making a game in Java using the Slick library. This is my first time doing it and I need some help. For the map I am using Tiled, and whenever I try to load the .tmx file I get this error:
Thu May 29 16:17:57 EDT 2014 ERROR:Unsupport tiled map type: base64,zlib (only gzip base64 supported)
So my best guess would be that it needs to be gzip-compressed base64. That makes no sense to me; can anyone help me out?
You need either to choose that option when you're creating the map in Tiled, or you can click, on the menu bar, Map >> Map Properties, and select the compression you want.
I am using the XML-INTO op-code to parse a web service request. Every now and then I get errors in the logs
(RNX0351 - "The XML parser detected error code 302").
The help text for a 302 is:
302 The parser does not support the requested CCSID value or
the first character of the XML document was not '<'
To the best of my knowledge, the first character is "<", and the request is generated from a previous web service call, so I would be very surprised if the CCSID has changed.
The error is repeatable for the specific query, so it is almost certainly data related; I am just unsure how I would go about identifying the offending item.
Any thoughts on how to determine the issue, or better yet, how to overcome it?
cheers
CCSID is an AS400/iSeries/Power Systems attribute, and it applies to the whole IFS. It's like a declaration of what is inside the file, or in other words what its internal encoding "should be".
The data content encoding and the file's declared encoding (the envelope) are supposed to match, and the box uses this attribute to display and handle the corresponding characters.
It sounds like you receive data in one encoding, but the file's CCSID doesn't match it.
Try changing the CCSID on your file (only the envelope), e.g. 37 (american), 500 (latin-1), 819 (utf-8), 850 (dos), 1252 (win), and display the file afterwards. You can check first using ls -Sla yourfile in QSH or QP2TERM, or EDTF as well. CHGATTR allows you to change the CCSID, as does setccsid in QSH (again).
This approach helped me find related issues. Remember that although data may be visible on the four hundred, it may not be visible through a shared folder in Windows; that means the file's CCSID and the content encoding don't match.
Hope it helps.
Hi, I've seen this error with XML data uploaded to AS400/iSeries/IBM i with FTP and CCSID 819 (ISO 8859-1 ASCII), where the file had some binary garbage in its first few positions. Changing the encoding to CCSID 1208 (UTF-8 with IBM PUA) using FTP "quote type c 1208" cleared the problem and XML-INTO was successful.
So, my suggestion for XML parser error 302 received when using XML-INTO is to look at the file (wrklnk ...), and if the first character is not "<" but instead some binary garbage, then try CCSID 1208 for UTF-8.
The statements in this answer about what 819 is and which CCSID represents UTF-8 do not agree with the previous answer, but they are correct, according to IBM documentation:
https://www-01.ibm.com/software/globalization/ccsid/ccsid819.html
https://www-01.ibm.com/software/globalization/ccsid/ccsid1208.html
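Following on from the wrklnk suggestion, the same first-character check can be done programmatically on the raw bytes. A small sketch (Python; the function name and sample inputs are made up for illustration):

```python
def xml_head_status(head: bytes) -> str:
    # Mirror the manual check: a parseable document should start with '<'
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8 BOM before '<'"
    if not head.lstrip().startswith(b"<"):
        return "binary garbage before '<'"
    return "ok"

print(xml_head_status(b'<?xml version="1.0"?>'))  # ok
print(xml_head_status(b"\xef\xbb\xbf<?xml"))      # utf-8 BOM before '<'
```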
I've been working on this problem for a couple of hours;
for me the solution was to use the option ccsid=UCS2 when using a data structure or variable to store the XML.
Something like this:
XML-INTO customer %XML( xmlSource : 'ccsid=UCS2');
My program runs with CCSID 870, and every conversion of the CCSID on the xmlSource field didn't work.
The strange thing is that when I use the file with CCSID 850, everything works fine.
I mention that because this is the first page you find when searching for this problem.
Maybe this helps someone.