Storing string datasets in hdf5 with unicode - python-3.x

I am trying to store variable-length strings from a file which contains special characters, like ø, æ, and å. Here is my code:
import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"
However the text is not stored properly. The data stored contains text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8
Edit:
The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.

With:
import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py does see and interpret the strings as unicode, both when writing and when reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
DATASET "text" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): "\37777777703\37777777670", "\37777777703\37777777646",
(2): "\37777777703\37777777645"
}
ATTRIBUTE "1" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
}
}
}
}
}
Note that in both cases the datatype is marked UTF8:
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
Let h5py (or other reader) worry about interpreting \37777777703\37777777670 as the proper unicode character.
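The long octal escapes in the h5dump output are not corruption. Each one appears to be a single byte printed as a sign-extended 32-bit value in octal, so its low 8 bits are the byte that was stored. A minimal sketch, assuming that reading of the dump format is right (the helper name decode_h5dump_escapes is hypothetical and not part of h5py or the HDF5 tools), recovers the original text:

```python
import re


def decode_h5dump_escapes(text: str) -> str:
    """Turn h5dump-style 11-digit octal escapes back into UTF-8 text.

    Assumption: each escape is one byte shown as a sign-extended
    32-bit value, so masking with 0xFF yields the stored byte.
    """
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-7]{11})", text):
        # Copy the plain text that precedes this escape.
        out += text[pos:match.start()].encode("utf-8")
        # Keep only the low 8 bits of the octal value.
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return out.decode("utf-8")


dumped = (
    r"some text with \37777777703\37777777670, "
    r"\37777777703\37777777646, \37777777703\37777777645"
)
print(decode_h5dump_escapes(dumped))  # prints: some text with ø, æ, å
```

For example, \37777777703\37777777670 masks down to the bytes 0xC3 0xB8, which is the UTF-8 encoding of ø.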

You should try storing your data in UTF-8 format by doing the following:
To encode as UTF-8 (before storing with h5py), do:
"æ".encode("utf-8")
which returns:
b'\xc3\xa6'
Then to decode, call decode on the bytes object (in Python 3, str has no decode method):
b'\xc3\xa6'.decode("utf-8")
which returns:
æ
Hope it helps!
EDIT
When you open text files that are encoded in UTF-8, you can pass the encoding parameter to open:
f = open(fname, encoding="utf-8")
This should help decode the original file correctly.
Source: python-notes
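To see why the encoding argument matters, here is a small self-contained sketch. It writes a temporary file as UTF-8 and reads it back twice, once with the right encoding and once with latin-1. The file name sample.txt and the variable names are only for illustration.

```python
import os
import tempfile

text = "some text with ø, æ, å"

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "sample.txt")
    # Write the text as UTF-8 bytes.
    with open(fname, "w", encoding="utf-8") as f:
        f.write(text)
    # Reading with the matching encoding round-trips the text.
    with open(fname, encoding="utf-8") as f:
        as_utf8 = f.read()
    # Reading with a different encoding turns each 2-byte
    # character into two unrelated characters.
    with open(fname, encoding="latin-1") as f:
        as_latin1 = f.read()

print(as_utf8 == text)  # prints: True
print(as_latin1)
```

The second read does not raise an error, which is why a wrong encoding often goes unnoticed until the text is displayed.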

Related

Groovy - String created from UTF8 bytes has wrong characters

The problem came up when getting the result of a web service returning JSON with Greek characters in it, specifically the city of Mykonos. Whatever encoding or conversion I use, it is always displayed as: ΜΎΚΟxCE?ΟΣ. But it should show: ΜΎΚΟΝΟΣ
With Powershell I was able to verify, that the web service is returning the correct characters.
I narrowed the problem down when the byte array gets converted to a String in Groovy. Below is code that reproduces the issue I have. myUTF8String holds the byte array I get from URLConnection.content.text. The UTF8 byte sequence to look at is 0xce, 0x9d. After converting this to a string and back to a byte array the byte sequence for that character is 0xce, 0x3f. The result of below code will show the difference at position 9 of the original byte array and the one from the converted string. For the below test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets;
def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex();
content = new String(bytes).getBytes()
for ( i = 0; i < content.length; i++ ) {
if ( bytes[i] != content[i] ) {
println "Different... at pos " + i
hex = Long.toUnsignedString( bytes[i], 16).toUpperCase()
print hex.substring(hex.length()-2,hex.length()) + " != "
hex = Long.toUnsignedString( content[i], 16).toUpperCase()
println hex.substring(hex.length()-2,hex.length())
}
}
Thanks a lot
Andreas
You have to specify the charset name when building a String from bytes; otherwise the default Java charset will be used, and that is not necessarily UTF-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem applies to String.getBytes(): pass a charset parameter to get the correct byte sequence.
Just change the following line in your code and the issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
As an option, you can set the default charset for the whole JVM instance with the following command line parameter:
java -Dfile.encoding=UTF-8 <your application>
But be careful, because it will affect the whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
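The same effect can be reproduced outside the JVM. The sketch below is a Python illustration, not Groovy, and it assumes the default charset in the question was windows-1252, which the question does not state. Under that assumption the byte 0x9d has no character assigned, so it is lost on decode and comes back as 0x3f ("?") on encode.

```python
raw = bytes.fromhex("ce9cce8ece9ace9fce9dce9fcea3")

# Correct round trip: name the charset explicitly in both directions.
good = raw.decode("utf-8").encode("utf-8")

# Lossy round trip through cp1252 (assumed default charset).
# Byte 0x9d is undefined there, so it becomes a replacement
# character and is written back as "?".
bad = raw.decode("cp1252", errors="replace").encode("cp1252", errors="replace")

# Positions where the lossy round trip changed a byte.
diffs = [i for i, (a, b) in enumerate(zip(raw, bad)) if a != b]

print(raw.decode("utf-8"))   # prints: ΜΎΚΟΝΟΣ
print(good == raw)           # prints: True
print(diffs, hex(bad[9]))    # prints: [9] 0x3f
```

The single difference at position 9 matches what the loop in the question reports.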

Python - Reading YAML file with escape characters and escape them

I have a YAML file with LaTeX strings in its entries, in particular with many un-escaped escape signs \. The file could look like this:
content:
  - "explanation" : "\text{Explanation 1} "
    "formula" : "\exp({{a}}^2) = {{d}}^2 - {{b}}^2"
  - "explanation" : "\text{Explanation 2}"
    "formula" : "{{b}}^2 = {{d}}^2 - \exp({{a}}^2) "
The desired output form (in python) looks like that:
config = {
"content" : [
{"explanation" : "\\text{Now} ",
"formula" : "\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"},
{"explanation" : "\\text{With}",
"formula" : "{{a}}^2 = {{d}}^2 + 3 ++ {{b}}^2"}
]
}
where the \ have been escaped, but not the "{" and "}" as you would have when using re.escape(string).
import yaml
from yaml import Loader
path = "config.yml"
with open(path, "r", encoding='latin1') as stream:
    config1 = yaml.safe_load(stream)
with open(path, "r", encoding='utf-8') as stream:
    config2 = yaml.safe_load(stream)
# Codecs
import codecs
with codecs.open(path, "r", encoding='unicode_escape') as stream:
    config3 = yaml.safe_load(stream)
with codecs.open(path, "r", encoding='latin1') as stream:
    config4 = yaml.safe_load(stream)
with codecs.open(path, 'r', encoding='utf-8') as stream:
    config5 = yaml.safe_load(stream)
# Read the whole file into a string first
with open(path, "r", encoding='utf-8') as stream:
    stream = stream.read()
    config6 = yaml.safe_load(stream)
with open(path, "r", encoding='utf-8') as stream:
    config7 = yaml.load(stream, Loader=Loader)
None of these solutions seems to work, e.g. the "unicode-escape" option still reads in
\x1bxp({{a}}^2) instead of \exp({{a}}^2).
What can I do? (The dictionary entries are later given to a Latex-Parser but I can't escape all the \ signs by hand.).
\n, \e and \t are all special characters when double-quoted in YAML, and if you're going to treat them literally you're basically asking the YAML parser to blindly treat double-quoted text as plain text, which means that you're going to have to write your own non-conforming YAML parser.
Instead of writing a parser from the ground up, however, an easier approach would be to customize an existing YAML parser by monkey-patching the method that scans double-quoted texts and making it the same as the method that scans plain texts. In case of PyYAML, that can be done with a simple override:
yaml.scanner.Scanner.fetch_double = yaml.scanner.Scanner.fetch_plain
If you want to avoid affecting other parts of the code that may parse YAML normally, you can use unittest.mock.patch as a context manager to patch the fetch_double method temporarily just for the loader call:
import yaml
from unittest.mock import patch
with patch('yaml.scanner.Scanner.fetch_double', yaml.scanner.Scanner.fetch_plain):
    with open('config.yml') as stream:
        config = yaml.safe_load(stream)
With your sample input, config would become:
{
'content': [
{'"explanation"': '"\\text{Explanation 1} "',
'"formula"': '"\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"'},
{'"explanation"': '"\\text{Explanation 2}"',
'"formula"': '"{{b}}^2 = {{d}}^2 - \\exp({{a}}^2) "'}
]
}
Demo: https://replit.com/#blhsing/WaryDirectWorkplaces
Note that this approach comes with the obvious consequence that you lose all the capabilities of double quotes within the same call. If the configuration file has other double-quoted texts that need proper escaping, this will not parse them correctly. But if the configuration file has only the kind of input you posted in your question, it will help parse it in the way you prefer without having to modify the code that generates such an (improper) YAML file (since presumably you're asking this question because you don't have the authorization to modify the code that generates the YAML file).
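A different route, not the one described in this answer, is to leave the parser alone and double every backslash in the raw text before loading it, since \\ is the escape for a literal backslash inside double-quoted YAML. The sketch below shows only that preprocessing step with the standard library. The helper name double_backslashes is hypothetical, and it assumes backslashes occur only inside double-quoted scalars; in single-quoted or plain scalars the doubled backslash would stay doubled.

```python
def double_backslashes(raw_yaml: str) -> str:
    """Escape every backslash so a YAML parser reads it literally.

    Assumption: backslashes appear only in double-quoted scalars.
    """
    return raw_yaml.replace("\\", "\\\\")


raw = r'"formula" : "\exp({{a}}^2) = {{d}}^2 - {{b}}^2"'
print(double_backslashes(raw))
# prints: "formula" : "\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"

# The result would then be handed to the parser, for example:
#   config = yaml.safe_load(double_backslashes(text_read_from_file))
```

This keeps keys free of the literal quote characters that the patched scanner leaves in place, at the cost of breaking any escape sequence in the file that was intentional.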

Finding and replacing special chars in a file

I'm trying to find and replace some special chars in a file encoded in ISO-8859-1, then write the result to a new file encoded in UTF-8:
package inv
class MigrationScript {
static main(args) {
new MigrationScript().doStuff();
}
void doStuff() {
def dumpfile = "path to input file";
def newfileP = "path to output file"
def file = new File(dumpfile)
def newfile = new File(newfileP)
def x = [
"þ":"ş",
"ý":"ı",
"Þ":"Ş",
"ð":"ğ",
"Ý":"İ",
"Ð":"Ğ"
]
def r = file.newReader("ISO-8859-1")
def w = newfile.newWriter("UTF-8")
r.eachLine{
line ->
x.each {
key, value ->
if(line.find(key)) println "found a special char!"
line = line.replaceAll(key, value);
}
w << line + System.lineSeparator();
}
w.close()
}
}
My input file content is:
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"
Problem is my code never finds the specified characters. The groovy script file itself is encoded in UTF-8. I'm guessing that may be the cause of the problem, but then I can't encode it in ISO-8859-1 because then I can't write "Ş" "Ğ" etc in it.
I took your code sample, ran it with an input file encoded in ISO-8859-1, and it worked as expected. Can you double-check that your input file is actually encoded in ISO-8859-1? Here is what I did:
I took file content from your question and saved it (using SublimeText) to a file /tmp/test.txt using Save -> Save with Encoding -> Western (ISO 8859-1)
I checked file encoding with following Linux command:
file -i /tmp/test.txt
/tmp/test.txt: text/plain; charset=iso-8859-1
I set the dumpfile variable to /tmp/test.txt and the newfile variable to /tmp/test_2.txt
I ran your code and saw in the console:
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
I checked encoding of the Groovy file in IntelliJ IDEA - it was UTF-8
I checked encoding of the output file:
file -i /tmp/test_2.txt
/tmp/test_2.txt: text/plain; charset=utf-8
I checked the content of the output file:
cat /tmp/test_2.txt
"ş": "ı": "Ş":" "ğ":" "İ":" "Ğ":"
I don't think it matters, but I have used the most recent Groovy 2.4.13
I'm guessing that your input file is not encoded properly. Do double-check the encoding of the file: when I save the same content with UTF-8 encoding, your program does not work as expected and I don't see any "found a special char!" entry in the console. When I display the contents of the ISO-8859-1 file, I see something like this:
cat /tmp/test.txt
"�": "�": "�":" "�":" "�":" "�":"%
If I save the same content with UTF-8, I see the readable content of the file:
cat /tmp/test.txt
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"%
Hope it helps in finding source of the problem.
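For comparison, the same migration can be sketched in Python. This is an illustration of the read-as-ISO-8859-1, replace, write-as-UTF-8 flow, not a port of the Groovy script; the function name migrate and the temporary file names are hypothetical.

```python
import os
import tempfile

# Same character mapping as in the question.
REPLACEMENTS = str.maketrans(
    {"þ": "ş", "ý": "ı", "Þ": "Ş", "ð": "ğ", "Ý": "İ", "Ð": "Ğ"}
)


def migrate(src_path: str, dst_path: str) -> None:
    """Read ISO-8859-1 text, swap the characters, write UTF-8."""
    with open(src_path, encoding="iso-8859-1") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.translate(REPLACEMENTS))


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    dst_path = os.path.join(tmp, "output.txt")
    # Create an input file that really is ISO-8859-1 encoded.
    with open(src_path, "wb") as f:
        f.write("þýÞðÝÐ\n".encode("iso-8859-1"))
    migrate(src_path, dst_path)
    with open(dst_path, encoding="utf-8") as f:
        result = f.read()

print(result)  # prints: şıŞğİĞ
```

If the input file were actually UTF-8, the decode step would yield two characters per letter and none of the mapping keys would match, which is the symptom described in the question.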

Decode UTF8 symbols

I have a string in swift:
let flag = "Cattì ò"
I am trying to convert the UTF8 symbols.
I have tried using
stringByRemovingPercentEncoding
but nothing changes. How can I convert the symbols properly?
Welcome to the encoding guessing game! It looks like somewhere along the way, your string didn't get the correct code page. Here's one way to guess it:
let flag = "Cattì ò"
let encodings = [NSASCIIStringEncoding,
NSNEXTSTEPStringEncoding,
NSJapaneseEUCStringEncoding,
NSUTF8StringEncoding,
NSISOLatin1StringEncoding,
NSSymbolStringEncoding,
NSNonLossyASCIIStringEncoding,
NSShiftJISStringEncoding,
NSISOLatin2StringEncoding,
NSUnicodeStringEncoding,
NSWindowsCP1251StringEncoding,
NSWindowsCP1252StringEncoding,
NSWindowsCP1253StringEncoding,
NSWindowsCP1254StringEncoding,
NSWindowsCP1250StringEncoding,
NSISO2022JPStringEncoding,
NSMacOSRomanStringEncoding,
NSUTF16StringEncoding,
NSUTF16BigEndianStringEncoding,
NSUTF16LittleEndianStringEncoding,
NSUTF32StringEncoding,
NSUTF32BigEndianStringEncoding,
NSUTF32LittleEndianStringEncoding]
for encoding in encodings {
if let bytes = flag.cStringUsingEncoding(encoding),
flag_utf8 = String(CString: bytes, encoding: NSUTF8StringEncoding) {
print("\(encoding): \(flag_utf8)")
}
}
The array contains all the encodings that Cocoa supports.
From the results, it seems your string was encoded in NSISOLatin1StringEncoding (a.k.a. ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, which does not exactly match your desired result but is the closest among all code pages.
Other good candidates are NSWindowsCP1252StringEncoding and NSWindowsCP1254StringEncoding so I'd suggest you check with other strings.
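The same guessing loop can be sketched in Python: re-encode the garbled text with each candidate encoding and try to read the resulting bytes as UTF-8. The garbled sample here is built on purpose by misreading UTF-8 as latin-1, so it is an assumption for illustration and not the string from the question. The candidate list and the name guess_repairs are also hypothetical.

```python
CANDIDATES = [
    "ascii", "latin-1", "cp1250", "cp1251", "cp1252",
    "cp1253", "cp1254", "mac_roman", "shift_jis", "euc_jp",
]


def guess_repairs(garbled: str) -> dict:
    """Return {encoding: repaired text} for every candidate that works."""
    repairs = {}
    for name in CANDIDATES:
        try:
            # Undo the wrong decode, then decode properly as UTF-8.
            repairs[name] = garbled.encode(name).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this candidate cannot explain the garbling
    return repairs


# Simulated damage: UTF-8 bytes wrongly decoded as latin-1.
garbled = "Cattì ò".encode("utf-8").decode("latin-1")

for name, fixed in guess_repairs(garbled).items():
    print(f"{name}: {fixed}")
```

Several code pages can produce the same repaired text for a short sample, so it is worth checking longer strings before settling on one.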

Converting "=?UTF 8?.." (RFC 2047) to a regular string in golang

I'm using an API and it's returning something like this for other language text:
=?UTF 8?B?2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG?= =?UTF 8?B?INit2YHYuCDYp9mE2YLYsdin2ZPZhiDYp9mE2YPYsdmK2YUg2YjZgQ==?= =?UTF 8?B?2YfZhdmHINmF2YXYpyDYp9mU2YXZhNin2Ycg2KfZhNi52YTYp9mF?= =?UTF 8?B?2Kkg2LnYqNivINin2YTZhNmHINin2YTYutiv2YrYp9mGLnBkZg==?=
Is this a common format? How would I go about converting this to a regular string in golang?
Golang usually handles multiple languages well, but I'm not sure about how to go about converting.
Since Go 1.5 you can use mime.WordDecoder.DecodeHeader:
package main
import (
"fmt"
"mime"
)
func main() {
dec := new(mime.WordDecoder)
header, err := dec.DecodeHeader("=?UTF-8?B?2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG?= =?UTF-8?B?INit2YHYuCDYp9mE2YLYsdin2ZPZhiDYp9mE2YPYsdmK2YUg2YjZgQ==?= =?UTF-8?B?2YfZhdmHINmF2YXYpyDYp9mU2YXZhNin2Ycg2KfZhNi52YTYp9mF?= =?UTF-8?B?2Kkg2LnYqNivINin2YTZhNmHINin2YTYutiv2YrYp9mGLnBkZg==?=")
if err != nil {
panic(err)
}
fmt.Println(header)
// Output: لخطوات التي تجمع بين حفظ القرآن الكريم وفهمه مما أملاه العلامة عبد الله الغديان.pdf
}
If you are using an older version of Go, you can use my replacement library: https://github.com/alexcesaro/quotedprintable
Apparently your API is returning data encoded in RFC 2047 format. Basically, this defines the following:
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
Which means your charset is UTF-8 (very handy, since this is Go's native character set), and your encoding is Base64. The text you have to decode is the one between the "B?" and the "?=". So all you have to do is take that text and call:
base64.StdEncoding.DecodeString(text)
to get the original UTF-8 string.
There is a decodeRFC2047Word() function in the net/mail package of the Go stdlib, supporting encodings B and Q and charsets UTF-8, US-ASCII and ISO-8859-1. Unfortunately it's not exported, but you're free to take as much inspiration from it as you need ;)
BTW: I just noticed the charset in your example strings is UTF 8, which is a bit odd, since the official name of the encoding is UTF-8.
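For readers outside Go, the same decoding can be sketched with Python's standard library. email.header.decode_header splits a header into (data, charset) pairs, and the helper below (decode_rfc2047 is a hypothetical name) joins them into one string. The sample uses the first encoded word from the question with the charset spelled UTF-8, and checks the result against a plain Base64 decode of the payload.

```python
import base64
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode RFC 2047 encoded words into a regular string."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Unencoded fragments come back without a charset.
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


payload = "2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG"
word = "=?UTF-8?B?" + payload + "?="

decoded = decode_rfc2047(word)
print(decoded)
# The manual route: Base64-decode the text between "B?" and "?=".
print(decoded == base64.b64decode(payload).decode("utf-8"))  # prints: True
```

The manual Base64 route only covers the B encoding with a known charset; a full decoder also has to handle the Q encoding and other charsets.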
