Python - Reading YAML file with escape characters and escape them - python-3.x

I have a YAML file with LaTeX strings in its entries, in particular with many unescaped backslashes (\). The file could look like this:
content:
  - "explanation" : "\text{Explanation 1} "
    "formula" : "\exp({{a}}^2) = {{d}}^2 - {{b}}^2"
  - "explanation" : "\text{Explanation 2}"
    "formula" : "{{b}}^2 = {{d}}^2 - \exp({{a}}^2) "
The desired output (in Python) looks like this:
config = {
    "content" : [
        {"explanation" : "\\text{Now} ",
         "formula" : "\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"},
        {"explanation" : "\\text{With}",
         "formula" : "{{a}}^2 = {{d}}^2 + 3 ++ {{b}}^2"}
    ]
}
where the \ have been escaped, but not the "{" and "}", as would happen when using re.escape(string). I have tried the following:
import codecs
import yaml
from yaml import Loader
path = "config.yml"
# Built-in open with different encodings
with open(path, "r", encoding='latin1') as stream:
    config1 = yaml.safe_load(stream)
with open(path, "r", encoding='utf-8') as stream:
    config2 = yaml.safe_load(stream)
# Codecs
with codecs.open(path, "r", encoding='unicode_escape') as stream:
    config3 = yaml.safe_load(stream)
with codecs.open(path, "r", encoding='latin1') as stream:
    config4 = yaml.safe_load(stream)
with codecs.open(path, 'r', encoding='utf-8') as stream:
    config5 = yaml.safe_load(stream)
# Reading the whole file into a string first
with open(path, "r", encoding='utf-8') as stream:
    stream = stream.read()
    config6 = yaml.safe_load(stream)
# Full loader instead of safe_load
with open(path, "r", encoding='utf-8') as stream:
    config7 = yaml.load(stream, Loader=Loader)
None of these solutions seems to work; e.g. the "unicode_escape" option still reads in
\x1bxp({{a}}^2) instead of \exp({{a}}^2).
What can I do? (The dictionary entries are later given to a LaTeX parser, but I can't escape all the \ signs by hand.)

\n, \e and \t are all escape sequences when double-quoted in YAML, and if you're going to treat them literally you're basically asking the YAML parser to blindly treat double-quoted text as plain text, which means that you're going to have to write your own non-conforming YAML parser.
Instead of writing a parser from the ground up, however, an easier approach would be to customize an existing YAML parser by monkey-patching the method that scans double-quoted texts and making it the same as the method that scans plain texts. In case of PyYAML, that can be done with a simple override:
yaml.scanner.Scanner.fetch_double = yaml.scanner.Scanner.fetch_plain
If you want to avoid affecting other parts of the code that may parse YAML normally, you can use unittest.mock.patch as a context manager to patch the fetch_double method temporarily just for the loader call:
import yaml
from unittest.mock import patch
with patch('yaml.scanner.Scanner.fetch_double', yaml.scanner.Scanner.fetch_plain):
    with open('config.yml') as stream:
        config = yaml.safe_load(stream)
With your sample input, config would become:
{
    'content': [
        {'"explanation"': '"\\text{Explanation 1} "',
         '"formula"': '"\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"'},
        {'"explanation"': '"\\text{Explanation 2}"',
         '"formula"': '"{{b}}^2 = {{d}}^2 - \\exp({{a}}^2) "'}
    ]
}
Demo: https://replit.com/#blhsing/WaryDirectWorkplaces
Note that this approach comes with the obvious consequence that you lose all the capabilities of double quotes within the same call. If the configuration file has other double-quoted texts that need proper escaping, this will not parse them correctly. But if the configuration file has only the kind of input you posted in your question, it will help parse it in the way you prefer without having to modify the code that generates such an (improper) YAML file (since presumably you're asking this question because you don't have the authorization to modify the code that generates the YAML file).
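If every backslash in the file is meant literally, a less invasive alternative to patching the scanner is to double the backslashes in the raw text before parsing. This swaps in a different technique from the one above and is only a sketch under that assumption: the helper names escape_backslashes and load_latex_yaml are made up for illustration, and the approach would break any escape sequence the file uses on purpose.

```python
def escape_backslashes(raw_text: str) -> str:
    """Double every backslash so YAML reads "\\\\exp" as a literal backslash + "exp"."""
    return raw_text.replace("\\", "\\\\")


def load_latex_yaml(path: str):
    """Hypothetical helper: read the file, escape backslashes, then parse it."""
    import yaml  # PyYAML, imported lazily so the helper above works without it

    with open(path, encoding="utf-8") as stream:
        return yaml.safe_load(escape_backslashes(stream.read()))


# One line taken from the sample file in the question.
sample_line = r'"formula" : "\exp({{a}}^2) = {{d}}^2 - {{b}}^2"'
print(escape_backslashes(sample_line))
```

Unlike the patched scanner, this keeps normal double-quote handling, so the quotes are not left as part of the keys and values.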

Related

manim - Reading in the LaTex string from a txt/tex file

Using the community version of manim, I would like to create an example like this one:
from manim import *
class MovingFrameBox(Scene):
    def construct(self):
        text=MathTex(
            "\\frac{d}{dx}f(x)g(x)=","f(x)\\frac{d}{dx}g(x)","+",
            "g(x)\\frac{d}{dx}f(x)"
        )
        self.play(Write(text))
        framebox1 = SurroundingRectangle(text[1], buff = .1)
        framebox2 = SurroundingRectangle(text[3], buff = .1)
        self.play(
            Create(framebox1),
        )
        self.wait()
        self.play(
            ReplacementTransform(framebox1,framebox2),
        )
        self.wait()
but with the LaTeX string for MathTex read in from a .tex file.
There is an error when reading in the .tex file and turning it into a MathTex object, in particular because all the original LaTeX commands such as \int only have one backslash, which needs to be escaped, e.g. \ -> \\.
Does anyone have experience with this?
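One thing worth checking first, offered as a sketch rather than a diagnosis: the doubled backslash in "\\frac" is Python source syntax only. Text read from a file already holds single backslashes, which is the same string that the literal "\\frac{d}{dx}" produces. The file name formula.tex and its content below are made up for illustration; the resulting string could then be passed to MathTex.

```python
import os
import tempfile

# Write a hypothetical .tex snippet exactly as it would appear in a LaTeX file.
tex_path = os.path.join(tempfile.mkdtemp(), "formula.tex")
with open(tex_path, "w", encoding="utf-8") as handle:
    handle.write(r"\frac{d}{dx}f(x)g(x)=" + "\n")

# Read it back; strip() removes the trailing newline.
with open(tex_path, encoding="utf-8") as handle:
    tex_string = handle.read().strip()

# The text from the file equals the escaped Python literal.
print(tex_string == "\\frac{d}{dx}f(x)g(x)=")  # prints True
```

If an error still occurs, the cause may lie elsewhere, for example in the file's encoding or in how the string is split into MathTex parts.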

Stripping a string of newline markers

I store configuration data (paths to specific files) inside a file named app.cfg that looks like this:
path/to/config.json
path/to/default/folder
and I query those items with the following Python code:
with open("app.cfg","r",newline='') as config:
    data = config.readlines()
    PathToConfig = data[0]
    DefaultPath = data[1]
    config.close()
But when I use PathToConfig in my script, the path stored in this variable cannot be used because there is a \n at the end of the string.
I tried to fix this issue by using PathToConfig = data[0].rstrip(), but there is still a \n at the end of the string.
How can I strip the newline marker from this string?
You should be able to solve it with .rstrip("\n"), which strips the trailing "\n":
Create app.cfg:
with open("app.cfg","w",newline='') as config:
    config.writelines("""path/to/config.json
path/to/default/folder""")
app.cfg looks like this:
path/to/config.json
path/to/default/folder
Read the contents from the file:
with open("app.cfg","r",newline='') as config:
    data = config.readlines()
    PathToConfig = data[0].rstrip("\n")
    DefaultPath = data[1]
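As a variation on the same idea, and only a sketch: str.splitlines() drops the line endings while splitting, and it handles both "\n" and "\r\n". That can matter here because newline='' leaves Windows line endings untranslated. The temporary directory and the file content below are assumptions made for the demonstration.

```python
import os
import tempfile

# Create a throwaway app.cfg with Windows-style line endings.
cfg_path = os.path.join(tempfile.mkdtemp(), "app.cfg")
with open(cfg_path, "w", newline="") as config:
    config.write("path/to/config.json\r\npath/to/default/folder\r\n")

# splitlines() removes "\n" and "\r\n" alike, so no rstrip call is needed.
with open(cfg_path, "r", newline="") as config:
    data = config.read().splitlines()

path_to_config, default_path = data[0], data[1]
print(repr(path_to_config), repr(default_path))
```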

jmeter - how to create a new csv from the original csv with groovy

I have a csv file (UTF-8 with BOM) like this
NAME,F1,F2,
test1,field1,field2
test2,field1,field2
test3,field1,field2
test4,field1,field2
test5,field1,field2
test6,field1,field2
I would like to discard the first three data rows and create a new CSV (UTF-8 with BOM):
NAME,F1,F2,
test4,field1,field2
test5,field1,field2
test6,field1,field2
I got some ideas from the page and wrote this in a JSR223 PreProcessor:
def originalCsvFile = new File('g:/Workspace/1.csv')
def newCsvFile = new File('g:/Workspace/2.csv')
originalCsvFile.readLines().take(5).each {line ->
    newCsvFile.withWriter('UTF-8') { writer ->
        writer.writeLine line
    }
}
The above code does not work.
Ideally the new CSV path would be stored in a variable, so that I can use that variable in the JMeter CSV Data Set Config.
Do you realize that:
the take(5) function returns the first 5 lines of the list
the newCsvFile.withWriter function overwrites the file with the new data each time it's called
So I believe you're looking for something like this:
originalCsvFile.readLines().eachWithIndex { line, index ->
    if (index == 0 || index > 3) {
        newCsvFile.withWriterAppend('UTF-8') { writer ->
            writer.writeLine line
        }
    }
}
More information on Groovy scripting in JMeter: Apache Groovy - Why and How You Should Use It
Not as elegant, perhaps, but this is how I would do it:
List li = originalCsvFile.readLines()
newCsvFile.append(li[0] + "\n", 'UTF-8') //headers
li[4..-1].each { newCsvFile.append(it + "\n", 'UTF-8') }

Finding and replacing special chars in a file

I'm trying to find and replace some special chars in a file encoded in ISO-8859-1, then write the result to a new file encoded in UTF-8:
package inv
class MigrationScript {
    static main(args) {
        new MigrationScript().doStuff();
    }
    void doStuff() {
        def dumpfile = "path to input file";
        def newfileP = "path to output file"
        def file = new File(dumpfile)
        def newfile = new File(newfileP)
        def x = [
            "þ":"ş",
            "ý":"ı",
            "Þ":"Ş",
            "ð":"ğ",
            "Ý":"İ",
            "Ð":"Ğ"
        ]
        def r = file.newReader("ISO-8859-1")
        def w = newfile.newWriter("UTF-8")
        r.eachLine{
            line ->
            x.each {
                key, value ->
                if(line.find(key)) println "found a special char!"
                line = line.replaceAll(key, value);
            }
            w << line + System.lineSeparator();
        }
        w.close()
    }
}
My input file content is:
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"
Problem is my code never finds the specified characters. The groovy script file itself is encoded in UTF-8. I'm guessing that may be the cause of the problem, but then I can't encode it in ISO-8859-1 because then I can't write "Ş" "Ğ" etc in it.
I took your code sample, ran it with an input file encoded in ISO-8859-1, and it worked as expected. Can you double-check that your input file is actually encoded in ISO-8859-1? Here is what I did:
I took file content from your question and saved it (using SublimeText) to a file /tmp/test.txt using Save -> Save with Encoding -> Western (ISO 8859-1)
I checked file encoding with following Linux command:
file -i /tmp/test.txt
/tmp/test.txt: text/plain; charset=iso-8859-1
I set the dumpfile variable to /tmp/test.txt and the newfileP variable to /tmp/test_2.txt
I ran your code and saw in the console:
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
I checked encoding of the Groovy file in IntelliJ IDEA - it was UTF-8
I checked encoding of the output file:
file -i /tmp/test_2.txt
/tmp/test_2.txt: text/plain; charset=utf-8
I checked the content of the output file:
cat /tmp/test_2.txt
"ş": "ı": "Ş":" "ğ":" "İ":" "Ğ":"
I don't think it matters, but I have used the most recent Groovy 2.4.13
I'm guessing that your input file is not encoded properly. Do double-check the encoding of the file: when I save the same content with UTF-8 encoding, your program does not work as expected and I don't see any "found a special char!" entry in the console. When I display the contents of the ISO-8859-1 file I see something like this:
cat /tmp/test.txt
"�": "�": "�":" "�":" "�":" "�":"%
If I save the same content with UTF-8, I see the readable content of the file:
cat /tmp/test.txt
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"%
Hope it helps in finding source of the problem.
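The effect described above can be reproduced without any files. The sketch below is in Python rather than Groovy, purely for brevity, and only illustrates the byte-level reason: when the input is really UTF-8 but is read as ISO-8859-1, each special character turns into two different characters, so the lookup keys never match.

```python
original = "þ"

# Case 1: the file really is ISO-8859-1, so one byte maps back to one character.
latin1_bytes = original.encode("iso-8859-1")
read_correctly = latin1_bytes.decode("iso-8859-1")

# Case 2: the file is actually UTF-8 but is read as ISO-8859-1.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes, repr(read_correctly))  # one byte, original character
print(utf8_bytes, repr(misread))           # two bytes, two unrelated characters
print(original in misread)                 # prints False: the key is not found
```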

Storing string datasets in hdf5 with unicode

I am trying to store variable string expressions from a file which contains special characters, like ø, æ, and å. Here is my code:
import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"
However the text is not stored properly. The data stored contains text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8
Edit:
The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.
With:
import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py does see/interpret the strings as unicode, both when writing and when reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
   DATASET "text" {
      DATATYPE H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777703\37777777670", "\37777777703\37777777646",
      (2): "\37777777703\37777777645"
      }
      ATTRIBUTE "1" {
         DATATYPE H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE SCALAR
         DATA {
         (0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
         }
      }
   }
}
}
Note that in both cases the datatype is marked UTF8:
DATATYPE H5T_STRING {
   STRSIZE H5T_VARIABLE;
   STRPAD H5T_STR_NULLTERM;
   CSET H5T_CSET_UTF8;
   CTYPE H5T_C_S1;
}
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
Let h5py (or other reader) worry about interpreting \37777777703\37777777670 as the proper unicode character.
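The escapes look like the UTF-8 bytes of each character printed as sign-extended 32-bit values in octal: 37777777703 is 0xFFFFFFC3, whose low byte 0xC3 is the first UTF-8 byte of ø, and 37777777670 ends in 0xB8, the second. That reading is an inference from the numbers, not a documented h5dump format. Under that assumption, a small helper (decode_h5dump_escapes is a made-up name) recovers the text:

```python
import re


def decode_h5dump_escapes(dumped: str) -> str:
    """Turn h5dump-style octal escapes back into text.

    Assumes each escape is one byte, sign-extended to 32 bits, in octal.
    Limitation: an octal digit that directly follows an escape as literal
    text would be swallowed by the pattern.
    """
    raw = bytearray()
    position = 0
    for match in re.finditer(r"\\([0-7]+)", dumped):
        raw += dumped[position:match.start()].encode("utf-8")
        raw.append(int(match.group(1), 8) & 0xFF)  # keep only the low byte
        position = match.end()
    raw += dumped[position:].encode("utf-8")
    return raw.decode("utf-8")


dumped = (
    r"some text with \37777777703\37777777670, "
    r"\37777777703\37777777646, \37777777703\37777777645"
)
print(decode_h5dump_escapes(dumped))  # prints: some text with ø, æ, å
```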
You should try storing your data in UTF-8 format by doing the following:
To encode in UTF-8 (before storing with h5py) do:
u"æ".encode("utf-8")
which, in Python 3, returns a bytes object:
b'\xc3\xa6'
Then to decode, call decode on the bytes object like this:
b'\xc3\xa6'.decode("utf-8")
which would return:
æ
Hope it helps!
EDIT
When you open files and you want them to be read as UTF-8, you can use the encoding parameter of the built-in open function:
f = open(fname, encoding="utf-8")
This should help decode the original file properly.
Source: python-notes
