I have the following code, which I use to export Unicode chars (Herbrew) to CSV, meant to be open by Excel/Google Sheets:
String csv = const ListToCsvConverter().convert(rows);
List<int> bytes = List.from(utf8.encode(csv));
// bytes.insert(0, unicodeBomCharacterRune );
return bytes;
If I export it to gmail, and open it in google sheets it works fine, but when I open it it with Excel it opens it as gibberish.
I'm well familiar with this issue which I solved in other ENV (php, js) by addind BOM char at the beginning of the file,
but when I do so in flutter (ie. uncomment the line bytes.insert(0, unicodeBomCharacterRune );) I get gibberish both in sheets and in excel.
Any idea how to over come this issue?
Here is how I solved it:
String csv = const ListToCsvConverter().convert(rows);
List<int> bytes = List.from(utf8.encode(csv));
bytes.insert(0, 0xBF );
bytes.insert(0, 0xBB );
bytes.insert(0, 0xEF );
return bytes;
The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.
NOTE: I insert the bytes in reverse order since I add them to the top
Related
The problem came up when getting the result of a web service returning json with Greek characters in it. Actually it is the city of Mykonos. The challenge is whatever encoding or conversion I'm using it is always displayed as:ΜΎΚΟxCE?ΟΣ . But it should show: ΜΎΚΟΝΟΣ
With Powershell I was able to verify, that the web service is returning the correct characters.
I narrowed the problem down when the byte array gets converted to a String in Groovy. Below is code that reproduces the issue I have. myUTF8String holds the byte array I get from URLConnection.content.text. The UTF8 byte sequence to look at is 0xce, 0x9d. After converting this to a string and back to a byte array the byte sequence for that character is 0xce, 0x3f. The result of below code will show the difference at position 9 of the original byte array and the one from the converted string. For the below test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets;
def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex();
content = new String(bytes).getBytes()
for ( i = 0; i < content.length; i++ ) {
if ( bytes[i] != content[i] ) {
println "Different... at pos " + i
hex = Long.toUnsignedString( bytes[i], 16).toUpperCase()
print hex.substring(hex.length()-2,hex.length()) + " != "
hex = Long.toUnsignedString( content[i], 16).toUpperCase()
println hex.substring(hex.length()-2,hex.length())
}
}
Thanks a lot
Andreas
you have to specify charset name when building String from bytes otherwise default java charset will be used - and it's not necessary urf-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem with String.getBytes() - use charset parameter to get correct byte sequence.
Just change the following line in your code and issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
as an option you can set default charset for the whole JVM instance with the following command line parameter:
java -Dfile.encoding=UTF-8 <your application>
but be careful because it will affect whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
I am working with a system that syncs files between two vendors. The tooling is written in Javascript and does a transformation on file names before sending it to the destination. I am trying to fix a bug in it that is failing to properly compare file names between the origin and destination.
The script uses the file name to check if it's on destination
For example:
The following file name contains a special character that has different encoding between source and destination.
source: Chinchón.jpg // hex code: ó
destination : Chinchón.jpg // hex code: 0xf3
The function that does the transformation is:
export const normalizeText = (text:string) => text
.normalize('NFC')
.replace(/\p{Diacritic}/gu, "")
.replace(/\u{2019}/gu, "'")
.replace(/\u{ff1a}/gu, ":")
.trim()
and the comparison is happening just like the following:
const array1 = ['Chinchón.jpg'];
console.log(array1.includes('Chinchón.jpg')); // false
Do I reverse the transformation before comparing? what's the best way to do that?
If i got your question right:
// prepare dictionary
const rawDictionary = ['Chinchón.jpg']
const dictionary = rawDictionary.map(x => normalizeText(x))
...
const rawComparant = 'Chinchón.jpg'
const comparant = normalizeText(rawComparant)
console.log(rawSources.includes(comparant))
I am trying to store variable string expressions from a file which contains special characters, like ø, æ , and å. Here is my code:
import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"
However the text is not stored properly. The data stored contains text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8
Edit:
The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.
With:
import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is h5py does see/interpret the strings as unicode - writing and reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
DATASET "text" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): "\37777777703\37777777670", "\37777777703\37777777646",
(2): "\37777777703\37777777645"
}
ATTRIBUTE "1" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
}
}
}
}
}
Note that in both case the datatype is marked UTF8
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
Let h5py (or other reader) worry about interpreting \37777777703\37777777670 as the proper unicode character.
You should try storing your data in UTF-8 format by doing the following:
To encode in utf-8 format (before storingwith h5py) do:
u"æ".encode("utf-8")
which returns:
'\xc3\xa6'
Then to decode you could use the string decode like this:
'\xc3\xa6'.decode("utf-8")
which would return:
æ
Hope it helps!
EDIT
When you open files and you want them to be in utf-8, you can use the encoding parameter on the read file method:
f = open(fname, encoding="utf-8")
This should help properly encoding the original file.
Source: python-notes
I have a string in swift:
let flag = "Cattì ò"
I am trying to convert the UTF8 symbols.
I have tried using
stringByRemovingPercentEncoding
but noting changes. How can I convert the symbols properly ?
Welcome to the encoding guessing game! Look like somewhere along the pathway, your string didn't get the correct code page. Here's one way to guess it:
let flag = "Cattì ò"
let encodings = [NSASCIIStringEncoding,
NSNEXTSTEPStringEncoding,
NSJapaneseEUCStringEncoding,
NSUTF8StringEncoding,
NSISOLatin1StringEncoding,
NSSymbolStringEncoding,
NSNonLossyASCIIStringEncoding,
NSShiftJISStringEncoding,
NSISOLatin2StringEncoding,
NSUnicodeStringEncoding,
NSWindowsCP1251StringEncoding,
NSWindowsCP1252StringEncoding,
NSWindowsCP1253StringEncoding,
NSWindowsCP1254StringEncoding,
NSWindowsCP1250StringEncoding,
NSISO2022JPStringEncoding,
NSMacOSRomanStringEncoding,
NSUTF16StringEncoding,
NSUTF16BigEndianStringEncoding,
NSUTF16LittleEndianStringEncoding,
NSUTF32StringEncoding,
NSUTF32BigEndianStringEncoding,
NSUTF32LittleEndianStringEncoding]
for encoding in encodings {
if let bytes = flag.cStringUsingEncoding(encoding),
flag_utf8 = String(CString: bytes, encoding: NSUTF8StringEncoding) {
print("\(encoding): \(flag_utf8)")
}
}
The array contains all the encodings that Cocoa supports.
From the results, it seems like your string was encoded in NSISOLatin1StringEncoding (a.k.a ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, not exactly match your desired result but is the closest among all code pages.
Other good candidates are NSWindowsCP1252StringEncoding and NSWindowsCP1254StringEncoding so I'd suggest you check with other strings.
I tried to decode the audio using ffmpeg with the following code:
NSMutableData *finalData = [NSMutableData data];
......
while(av_read_frame(pFormatCtx, &packet) >= 0){
if(packet.stream_index == videoStream)
{
int consumed = avcodec_decode_audio4(pCodecCtx, pFrame, &got_frame_ptr, &packet);
if(got_frame_ptr)
{
[finalData appendBytes:(pFrame->data)[0] length:(pFrame->linesize)[0]];
}
}
av_free_packet(&packet);
}
......
[finalData writeToFile:path atomically:YES];
Bu the saved file can't be played, even I changed the file extension to wav. When I look into it in HexEdit (a Hex editor), I found there are many zero bytes. For example the content of the file before offset 0x970 are all zero. Is there any error in my code? Any help will be appreciated.
Actually the decode result is good. The zero bytes in the file is normal, because the decode result is PCM data. I tried to import the data into Adobe Audition, it can be played. FYI.