Finding and replacing special chars in a file - groovy

I'm trying to find and replace some special chars in a file encoded in ISO-8859-1, then write the result to a new file encoded in UTF-8:
package inv

class MigrationScript {
    static main(args) {
        new MigrationScript().doStuff()
    }

    void doStuff() {
        def dumpfile = "path to input file"
        def newfileP = "path to output file"
        def file = new File(dumpfile)
        def newfile = new File(newfileP)
        def x = [
            "þ": "ş",
            "ý": "ı",
            "Þ": "Ş",
            "ð": "ğ",
            "Ý": "İ",
            "Ð": "Ğ"
        ]
        def r = file.newReader("ISO-8859-1")
        def w = newfile.newWriter("UTF-8")
        r.eachLine { line ->
            x.each { key, value ->
                if (line.find(key)) println "found a special char!"
                line = line.replaceAll(key, value)
            }
            w << line + System.lineSeparator()
        }
        w.close()
    }
}
My input file content is:
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"
The problem is that my code never finds the specified characters. The Groovy script file itself is encoded in UTF-8. I'm guessing that may be the cause, but I can't encode the script in ISO-8859-1 either, because then I couldn't write "Ş", "Ğ", etc. in it.

I took your code sample, ran it with an input file encoded in ISO-8859-1, and it worked as expected. Can you double-check that your input file is actually encoded in ISO-8859-1? Here is what I did:
I took the file content from your question and saved it (using SublimeText) to /tmp/test.txt via Save -> Save with Encoding -> Western (ISO 8859-1).
I checked the file encoding with the following Linux command:
file -i /tmp/test.txt
/tmp/test.txt: text/plain; charset=iso-8859-1
I set the dumpfile variable to the /tmp/test.txt file and the output path (newfileP) to /tmp/test_2.txt.
I ran your code and saw in the console:
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
I checked the encoding of the Groovy file in IntelliJ IDEA - it was UTF-8.
I checked the encoding of the output file:
file -i /tmp/test_2.txt
/tmp/test_2.txt: text/plain; charset=utf-8
I checked the content of the output file:
cat /tmp/test_2.txt
"ş": "ı": "Ş":" "ğ":" "İ":" "Ğ":"
I don't think it matters, but I used the most recent Groovy, 2.4.13.
I'm guessing that your input file is not encoded properly. Do double-check the file's encoding - when I save the same content with UTF-8 encoding, your program does not work as expected and I don't see any "found a special char!" entries in the console. When I display the contents of the ISO-8859-1 file, I see something like this:
cat /tmp/test.txt
"�": "�": "�":" "�":" "�":" "�":"%
If I save the same content with UTF-8, I see the readable content of the file:
cat /tmp/test.txt
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"%
Hope it helps in finding the source of the problem.

Related

Groovy - String created from UTF8 bytes has wrong characters

The problem came up when getting the result of a web service that returns JSON with Greek characters in it - actually the city of Mykonos. The challenge is that whatever encoding or conversion I use, it is always displayed as ΜΎΚΟxCE?ΟΣ, but it should show ΜΎΚΟΝΟΣ.
With PowerShell I was able to verify that the web service is returning the correct characters.
I narrowed the problem down to the point where the byte array gets converted to a String in Groovy. Below is code that reproduces the issue. myUTF8String holds the byte array I get from URLConnection.content.text. The UTF-8 byte sequence to look at is 0xce, 0x9d. After converting this to a String and back to a byte array, the byte sequence for that character is 0xce, 0x3f. The code below will show the difference at position 9 between the original byte array and the one from the converted String. For this test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets

def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex()
content = new String(bytes).getBytes()
for (i = 0; i < content.length; i++) {
    if (bytes[i] != content[i]) {
        println "Different... at pos " + i
        hex = Long.toUnsignedString(bytes[i], 16).toUpperCase()
        print hex.substring(hex.length() - 2, hex.length()) + " != "
        hex = Long.toUnsignedString(content[i], 16).toUpperCase()
        println hex.substring(hex.length() - 2, hex.length())
    }
}
Thanks a lot
Andreas
You have to specify the charset name when building a String from bytes; otherwise the default Java charset will be used, and it's not necessarily UTF-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem exists with String.getBytes() - use the charset parameter to get the correct byte sequence.
Just change the following line in your code and the issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
As an option, you can set the default charset for the whole JVM instance with the following command-line parameter:
java -Dfile.encoding=UTF-8 <your application>
But be careful, because it will affect the whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
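As a cross-language illustration (a sketch of my own in Python, not part of the original answer), the explicit-charset round trip over the question's exact bytes is lossless:
data = bytes.fromhex("ce9cce8ece9ace9fce9dce9fcea3")  # UTF-8 bytes for "ΜΎΚΟΝΟΣ"
text = data.decode("utf-8")            # explicit charset: correct result
print(text)                            # ΜΎΚΟΝΟΣ
print(text.encode("utf-8") == data)    # True - no byte is altered either way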

Python - Reading YAML file with escape characters and escape them

I have a YAML file with LaTeX strings in its entries, in particular with many unescaped escape signs \. The file could look like this:
content:
  - "explanation" : "\text{Explanation 1} "
    "formula" : "\exp({{a}}^2) = {{d}}^2 - {{b}}^2"
  - "explanation" : "\text{Explanation 2}"
    "formula" : "{{b}}^2 = {{d}}^2 - \exp({{a}}^2) "
The desired output (in Python) looks like this:
config = {
    "content": [
        {"explanation": "\\text{Now} ",
         "formula": "\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"},
        {"explanation": "\\text{With}",
         "formula": "{{a}}^2 = {{d}}^2 + 3 ++ {{b}}^2"}
    ]
}
where the \ characters have been escaped, but not the "{" and "}" (which re.escape(string) would also escape). Here is what I have tried:
path = "config.yml"
with open(path, "r",encoding = 'latin1') as stream:
config1 = yaml.safe_load(stream)
with open(path, "r",encoding = 'utf-8') as stream:
config2 = yaml.safe_load(stream)
# Codecs
import codecs
with codecs.open(path, "r",encoding='unicode_escape') as stream:
config3 = yaml.safe_load(stream)
with codecs.open(path, "r",encoding='latin1') as stream:
config4 = yaml.safe_load(stream)
with codecs.open(path, 'r', encoding='utf-8') as stream:
config5 = yaml.safe_load(stream)
#
with open(path, "r", encoding = 'utf-8') as stream:
stream = stream.read()
config6 = yaml.safe_load(stream)
with open(path, "r", encoding = 'utf-8') as stream:
config7 = yaml.load(stream,Loader = Loader)
None of these attempts works; e.g., the unicode_escape option still reads in
\x1bxp({{a}}^2) instead of \exp({{a}}^2).
What can I do? (The dictionary entries are later passed to a LaTeX parser, but I can't escape all the \ signs by hand.)
\n, \e and \t are all special characters when double-quoted in YAML, and if you're going to treat them literally, you're basically asking the YAML parser to blindly treat double-quoted text as plain text - which means you would have to write your own non-conforming YAML parser.
Instead of writing a parser from the ground up, however, an easier approach is to customize an existing YAML parser by monkey-patching the method that scans double-quoted text, making it the same as the method that scans plain text. In the case of PyYAML, that can be done with a simple override:
yaml.scanner.Scanner.fetch_double = yaml.scanner.Scanner.fetch_plain
If you want to avoid affecting other parts of the code that may parse YAML normally, you can use unittest.mock.patch as a context manager to patch the fetch_double method temporarily, just for the loader call:
import yaml
from unittest.mock import patch

with patch('yaml.scanner.Scanner.fetch_double', yaml.scanner.Scanner.fetch_plain):
    with open('config.yml') as stream:
        config = yaml.safe_load(stream)
With your sample input, config would become:
{
    'content': [
        {'"explanation"': '"\\text{Explanation 1} "',
         '"formula"': '"\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"'},
        {'"explanation"': '"\\text{Explanation 2}"',
         '"formula"': '"{{b}}^2 = {{d}}^2 - \\exp({{a}}^2) "'}
    ]
}
Demo: https://replit.com/#blhsing/WaryDirectWorkplaces
Note that this approach comes with the obvious consequence that you lose all the capabilities of double quotes for the whole call. If the configuration file has other double-quoted text that needs proper escaping, this will not parse it correctly. But if the file contains only the kind of input you posted in your question, it will parse it the way you prefer, without your having to modify the code that generates such (improper) YAML - presumably you're asking this question because you aren't authorized to change that code.
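One visible consequence above is that the surrounding quotes become part of the keys and values themselves (e.g. the key is '"explanation"'). If that matters, here is a minimal post-processing sketch of my own (assuming exactly the structure shown above):
def strip_quotes(s):
    # Drop one layer of surrounding double quotes left over from the patch.
    if len(s) >= 2 and s.startswith('"') and s.endswith('"'):
        return s[1:-1]
    return s

config["content"] = [
    {strip_quotes(k): strip_quotes(v) for k, v in entry.items()}
    for entry in config["content"]
]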

Tkinter - converting a result

I am converting Base64 to a string (this part is complete), and I want to be able to edit the converted string entry and have the edit show up as amended Base64. That is: input the commented string below, press Enter, set "alg" to None in the second box, and have that amendment appear in the top/bottom box.
Note: I added the third box to compare the Base64 results - it would be easier to have the results in the top box change to reflect any amendments.
import base64
import binascii
from tkinter import *

# eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9

root = Tk()
root.title("Converter")
root.geometry('290x200+250+60')

def b64decode():
    s = strentry.get()
    try:
        return res1.set(base64.b64decode(s).decode())
    except:
        return res.set("Error")

def string_to_b64():
    data = res1.get()
    encodedBytes = base64.b64encode(data.encode("utf-8"))
    encodedStr = str(encodedBytes, "utf-8")
    return res3.set(encodedStr)

strentry = StringVar()
res = StringVar()
res1 = StringVar()
res3 = StringVar()

Enter_Value_Entry = Entry(root, text="", textvariable=strentry)
Enter_Value_Entry.pack()
Enter_Value_Entry2 = Entry(root, textvariable=res1)
Enter_Value_Entry2.pack()
Enter_Value_Entry2 = Entry(root, textvariable=res3)
Enter_Value_Entry2.pack()
Error = Label(root, textvariable=res)
Error.pack()

root.bind("<Return>", lambda event: b64decode() == string_to_b64())
root.mainloop()
It is not advisable to try to edit Base64 unless you handle every byte in groups of 8 and 6 bits.
You say you want to target just a chunk, so here is what such a change must take into account:
text as txt {"alg" :"HS25 6","ty p":"JW T"}
text as b64 eyJhbGci OiJIUzI1 NiIsInR5 cCI6IkpX VCJ9
change bytes of {"alg" to {"nul" no problem only first block is changed
text as txt {"nul" :"HS25 6","ty p":"JW T"}
text as b64 eyJudWwi OiJIUzI1 NiIsInR5 cCI6IkpX VCJ9
change bytes of {"alg" = {None BIG PROBLEM everything needs changing as 1bit was altered
text as txt {None: "HS256 ","typ ":"JWT "}
text as b64 e05vbmU6 IkhTMjU2 IiwidHlw IjoiSldU In0=
you MUST only change a block of 6x8 chars between boundary divisions for a new block of 8x6 chars between the same boundaries.
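To see the 6-byte/8-character correspondence concretely, a tiny Python sketch of my own (not part of the original answer):
import base64
# Every 6 input bytes encode to exactly 8 Base64 characters, so an edit that
# keeps 6-byte boundaries only disturbs the matching 8-character blocks.
print(base64.b64encode(b'{"alg"'))  # b'eyJhbGci' - the first block above
print(base64.b64encode(b'{"nul"'))  # b'eyJudWwi' - only this block changes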
To make changes, it's best done in the related application: for images, decode the Base64 to binary, edit with an image editor, then re-encode back to Base64. If you don't have the fast native base64 tool as used on macOS or Linux, then for the poor Windows man here is a drag-and-drop (or command line: add the filename) base64.cmd you can adapt:
IF /I %~x1==.b64 goto Decode
rem todo IF /I %~x1==.hex goto HexDecode
FOR %%X in (.jpg .pdf .png) do IF /I %~x1==%%X goto Encode
:Help
echo Format unexpected, accepts .b64 .jpg .pdf OR .png
pause&exit /b
:Encode an accepted file to Base64 and trim off both -----BEGIN / END CERTIFICATE-----
Rem NOTE:- WILL overwrite, also previous .ext will be changed to lower case and .b64 appended
certutil -encode "%~pdn1%~x1" %temp%\tmp.b64 && findstr /v /c:-- %temp%\tmp.b64 > "%~pdn1%~x1.b64" && del %temp%\tmp.b64
Rem EXIT since we must not / cannot over-write the source
pause&exit /b
:Decode a .Base64 file to binary removing the appended .b64 NOTE:- will NOT allow overwrite, thus error.
certutil -decode "%~pdn1.b64" "%~pdn1"
pause&exit /b
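In code, the safest route is the same decode -> edit -> re-encode round trip. A minimal Python sketch of my own, reproducing the {"nul" example above:
import base64

token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9"
text = base64.b64decode(token).decode("utf-8")
print(text)  # {"alg":"HS256","typ":"JWT"}

# Edit the decoded text, then re-encode; no block-boundary bookkeeping needed.
edited = text.replace('"alg"', '"nul"')
print(base64.b64encode(edited.encode("utf-8")).decode("ascii"))
# eyJudWwiOiJIUzI1NiIsInR5cCI6IkpXVCJ9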

Writing bit string '00000000' to binary file outputs incorrect file format

I am writing to a binary file 8 bits (big-endian) at a time. The problem I am having is that whenever a sequence of '00000000' is written to the file, the format of the .bin file is not what I expected.
Code snippet:
file = open("example.bin", "wb")
for sample in block:
file.write((int(sample,2).to_bytes(1, 'big')))
Input with correct results:
block = ["11000000", "00100010", "01000000", "10000000"]
example.bin: À"#€
Input with incorrect results:
block = ["11000000", "00100010", "00001000", "00000000"]
example.bin: c022 0800
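For what it's worth, a note of my own: the write loop produces the intended bytes in both cases (int("00000000", 2).to_bytes(1, 'big') is b'\x00'), and the output c022 0800 looks like a hex-dump view of exactly the expected bytes 0xC0 0x22 0x08 0x00 - editors often fall back to such a view when a file contains NUL bytes. A minimal read-back sketch to confirm what is on disk:
# Print the raw bytes that were actually written, independent of how a
# text editor chooses to render the file.
with open("example.bin", "rb") as f:
    print(f.read().hex())  # the second block list yields "c0220800"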

Storing string datasets in hdf5 with unicode

I am trying to store variable-length string expressions from a file which contain special characters, like ø, æ, and å. Here is my code:
import h5py as h5

file = h5.File('deleteme.hdf5', 'a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text", (1,), dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"
However, the text is not stored properly. The stored data contains the text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8
Edit:
The output was from h5dump. The answer below verified that the characters are properly stored as UTF-8.
With:
import numpy as np
import h5py as h5

file = h5.File('deleteme.hdf5', 'w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text", (3,), dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()

file = h5.File('deleteme.hdf5', 'r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py does see/interpret the strings as unicode, both when writing and when reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
   DATASET "text" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777703\37777777670", "\37777777703\37777777646",
      (2): "\37777777703\37777777645"
      }
      ATTRIBUTE "1" {
         DATATYPE  H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
         }
      }
   }
}
}
Note that in both cases the datatype is marked UTF-8:
DATATYPE H5T_STRING {
   STRSIZE H5T_VARIABLE;
   STRPAD H5T_STR_NULLTERM;
   CSET H5T_CSET_UTF8;
   CTYPE H5T_C_S1;
}
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
Let h5py (or other reader) worry about interpreting \37777777703\37777777670 as the proper unicode character.
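Those octal escapes are just sign-extended byte values; a small Python sketch of my own (not from the answer) showing that \37777777703\37777777670 is exactly the UTF-8 encoding of "ø":
# The low 8 bits of each octal escape recover the raw UTF-8 bytes.
b = bytes([0o37777777703 & 0xFF, 0o37777777670 & 0xFF])
print(b)                  # b'\xc3\xb8'
print(b.decode("utf-8"))  # ø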
You should try storing your data in UTF-8 format by doing the following:
To encode in UTF-8 format (before storing with h5py) do:
u"æ".encode("utf-8")
which returns:
'\xc3\xa6'
Then to decode you could use the string decode like this:
'\xc3\xa6'.decode("utf-8")
which would return:
æ
Hope it helps!
EDIT
When you open files and you want them read as UTF-8, you can use the encoding parameter when opening the file:
f = open(fname, encoding="utf-8")
This should help read the original file with the proper encoding.
Source: python-notes
