Jinja2 NativeTemplate rendering strings with double quotes issue - python-3.x

I'm reading a path value from Postgres DB (column type String).
For example:
path | "G:\Shared drives\2 test\2021\08.2021\test.xlsx"
The problem is that some of the nested directories in the path start with a digit (as above), and Python treats a backslash followed by digits as an octal escape sequence.
\2 is converted to \x02.
\08 (treated as \0 followed by 8) is converted to \x008.
\2021 (treated as \202 followed by 1) is converted to \x821.
print(repr('"G:\Shared drives\2 test\2021\08.2021\test.xlsx"'))
> '"G:\\Shared drives\x02 test\x821\x008.2021\test.xlsx"'
How can I stop Python from interpreting these escape sequences and make it treat the value as a raw string?
Expected result:
'"G:\\Shared drives\\2 test\\2021\\08.2021\\test.xlsx"'
Edit:
It seems that the value is correct on the DB side, and also when I read it.
The path is getting corrupted once I render it with Jinja2 NativeTemplate (specifically native template).
import jinja2
env = jinja2.nativetypes.NativeEnvironment()
path = '"G:\\Shared drives\\2 test\\2021\\08.2021\\test.xlsx"'
print(path)
> "G:\Shared drives\2 test\2021\08.2021\test.xlsx"
t = env.from_string('{{ path }}')
result = t.render(path=path)
print(result) # result is broken
> G:\Shared drives test18.2021 est.xlsx
print(repr(result))
> 'G:\\Shared drives\x02 test\x821\x008.2021\test.xlsx'
If I remove the double quotes from the path string, the render output is valid:
path = 'G:\\Shared drives\\2 test\\2021\\08.2021\\test.xlsx'
result = t.render(path=path)
print(result) # works without double quotes
> G:\Shared drives\2 test\2021\08.2021\test.xlsx
Update:
The source of the issue is the native_concat method used by Jinja's NativeEnvironment.
The method returns the literal_eval of the rendered output.
try:
    return literal_eval(out)
Reproduce:
from ast import literal_eval
path = '"G:\\Shared drives\\2 test\\2021\\08.2021\\test.xlsx"'
print(literal_eval(path))
> G:\Shared drives test18.2021 est.xlsx
print(repr(literal_eval(path)))
> 'G:\\Shared drives\x02 test\x821\x008.2021\test.xlsx'
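A minimal sketch of why the double quotes matter, using only ast.literal_eval (no Jinja involved). With the quotes, the text is a valid Python string literal, so its escape sequences are processed a second time. Without them, the text is not a valid literal and literal_eval raises; judging by the try block quoted above, Jinja presumably falls back to the unmodified string in that case, though that fallback is an assumption here. The sample path is made up and shortened so that it contains only valid escapes.

```python
from ast import literal_eval

# The text "C:\2021\08" WITH surrounding double quotes: a valid string literal.
quoted = '"C:\\2021\\08"'
# The same text WITHOUT quotes: not a valid Python literal at all.
unquoted = 'C:\\2021\\08'

# The quoted text is parsed as a literal, so \202 and \0 become octal escapes.
parsed = literal_eval(quoted)
print(repr(parsed))

# The unquoted text cannot be parsed, so literal_eval raises instead.
try:
    literal_eval(unquoted)
    unquoted_is_literal = True
except (ValueError, SyntaxError):
    unquoted_is_literal = False
print(unquoted_is_literal)
```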

\ starts an escape sequence in Python string literals. To prevent this, you have roughly four options:
Properly escape your string. That is, double up every path separator backslash: '"G:\\Shared drives\\2 test\\2021\\08.2021\\test.xlsx"'
Use raw strings to prevent escape sequences from being interpreted: r'G:\Shared drives\2 test\2021\08.2021\test.xlsx'
Use forward slashes instead of backslashes, which are valid in Windows paths: 'G:/Shared drives/2 test/2021/08.2021/test.xlsx'
Don’t use string literals in your Python code; instead, read the string from some other source (e.g. user input or a file).

s = "C:\Program Files\norton\appx"
print(s)
s = r"C:\Program Files\norton\appx"
print(s)

I've raised the issue to Jinja, and got the following answer:
https://github.com/pallets/jinja/issues/1518#issuecomment-950326763
"This is due to the way escapes work when parsing Python strings. You need to double up the \\"
Indeed, doubling the backslashes did the trick:
path = '"G:\\\\Shared drives\\\\2 test\\\\2021\\\\08.2021\\\\test.xlsx"'
print(literal_eval(path))
# output:
> G:\Shared drives\2 test\2021\08.2021\test.xlsx
And if I want to keep the double quotes:
path = '"\\"G:\\\\Shared drives\\\\2 test\\\\2021\\\\08.2021\\\\test.xlsx\\""'
print(literal_eval(path))
# output
> "G:\Shared drives\2 test\2021\08.2021\test.xlsx"
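Doubling the backslashes by hand is error-prone. As a hedged alternative, repr() produces a string literal that literal_eval turns back into the original value, so the escaping can be generated rather than typed. The sketch below only checks the round trip through ast.literal_eval. Whether passing repr(path) to render() behaves the same way inside NativeEnvironment was not verified here.

```python
from ast import literal_eval

path = 'G:\\Shared drives\\2 test\\2021\\08.2021\\test.xlsx'

# repr() adds quotes and doubles every backslash, giving a valid literal.
literal = repr(path)
round_tripped = literal_eval(literal)
print(round_tripped == path)

# The same holds when the value itself contains double quotes.
quoted_path = '"' + path + '"'
quoted_round_tripped = literal_eval(repr(quoted_path))
print(quoted_round_tripped == quoted_path)
```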

Related

Regex With Lookahead For Fixed Length String

import re
strings = [
    r"C:\Photos\Selfies\1|",
    r"C:\HDPhotos\Landscapes\2|",
    r"C:\Filters\Pics\12345678|",
    r"C:\Filters\Pics2\00000000|",
    r"C:\Filters\Pics2\00000000|XAV7"
]
for string in strings:
    matchptrn = re.match(r"(?P<file_path>.*)(?!\d{8})", string)
    if matchptrn:
        print("FILE PATH = "+matchptrn.group('file_path'))
I am trying to get this regular expression with a lookahead to work the way I thought it would. Examples of lookaheads on most websites seem to be pretty basic string matches, e.g. not matching 'bar' if it is preceded by 'foo' as an example of a negative lookbehind.
My goal is to capture the actual file path in the group file_path only if the string does NOT have an 8-digit number just before the pipe symbol |, and to match anything after the pipe symbol in another group (something I haven't implemented here).
So in the above example it should match only the first two strings
C:\Photos\Selfies\1
C:\HDPhotos\Landscapes\2
In case of the last string
C:\Filters\Pics2\00000000|XAV7
I'd like to match C:\Filters\Pics2\00000000 in <file_path> and match XAV7 in another named group.
(This is something I can figure out on my own if I get some help with the negative lookahead)
Currently <file_path> matches everything, which makes sense since (.*) is greedy.
I want it to only capture if the last part of the string before the pipe symbol is NOT an 8-digit number.
OUTPUT OF CODE SNIPPET PASTED BELOW
FILE PATH = C:\Photos\Selfies\1|
FILE PATH = C:\HDPhotos\Landscapes\2|
FILE PATH = C:\Filters\Pics\12345678|
FILE PATH = C:\Filters\Pics2\00000000|
FILE PATH = C:\Filters\Pics2\00000000|XAV7
Making this modification of \\
matchptrn = re.match(r"(?P<file_path>.*)\\(?!\d{8})", string)
if matchptrn:
    print("FILE PATH = "+matchptrn.group('file_path'))
makes things worse as the output is
FILE PATH = C:\Photos\Selfies
FILE PATH = C:\HDPhotos\Landscapes
FILE PATH = C:\Filters
FILE PATH = C:\Filters
FILE PATH = C:\Filters
Can someone please explain this as well ?
You can use
^(?!.*\\\d{8}\|$)(?P<file_path>.*)\|(?P<suffix>.*)
See the regex demo.
Details
^ - start of a string
(?!.*\\\d{8}\|$) - fail the match if the string contains \ followed by eight digits and then | at the end of string
(?P<file_path>.*) - Group "file_path": any zero or more chars other than line break chars as many as possible
\| - a pipe
(?P<suffix>.*) - Group "suffix": the rest of the string, any zero or more chars other than line break chars, as many as possible.
See the Python demo:
import re
strings = [
    r"C:\Photos\Selfies\1|",
    r"C:\HDPhotos\Landscapes\2|",
    r"C:\Filters\Pics\12345678|",
    r"C:\Filters\Pics2\00000000|",
    r"C:\Filters\Pics2\00000000|XAV7"
]
for string in strings:
    matchptrn = re.match(r"(?!.*\\\d{8}\|$)(?P<file_path>.*)\|(?P<suffix>.*)", string)
    if matchptrn:
        print("FILE PATH = {}, SUFFIX = {}".format(*matchptrn.groups()))
Output:
FILE PATH = C:\Photos\Selfies\1, SUFFIX =
FILE PATH = C:\HDPhotos\Landscapes\2, SUFFIX =
FILE PATH = C:\Filters\Pics2\00000000, SUFFIX = XAV7
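The question also asked why the two original attempts behave the way they do. A short sketch, reusing one string from the question: in the first attempt the greedy .* consumes the whole string, and a negative lookahead at the very end always succeeds because nothing follows. In the second attempt .* backtracks to the last backslash that is not followed by eight digits, which can be several directories back.

```python
import re

sample = r"C:\Filters\Pics\12345678|"

# Attempt 1: .* takes everything; at end of string (?!\d{8}) trivially holds.
first = re.match(r"(?P<file_path>.*)(?!\d{8})", sample)
print(first.group("file_path"))

# Attempt 2: the backslash before 12345678 is rejected by the lookahead,
# so the regex backtracks to the previous backslash (the one before "Pics").
second = re.match(r"(?P<file_path>.*)\\(?!\d{8})", sample)
print(second.group("file_path"))
```

This is why the lookahead has to be anchored at the start of the string and describe the whole "backslash, eight digits, pipe, end" tail, as in the pattern from the answer.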

Absolute folder path in python

I cannot get this code to work for me:
import os
# Define folder to search
searchFolder = "C:\Users\rohrl\OneDrive\Python\PictureCompare\MixedPictures"
os.chdir(searchFolder)
print(os.curdir)
I keep on getting a Unicode error on line 4. What am I doing wrong? I'm on a Windows PC.
The "\" character in Python is a string escape, and it introduces shortcuts for certain string characters. For example the string "\n" doesn't contain the characters \ and n. It contains a newline character. Windows paths often cause this trouble in Python. When Python sees "\U", it expects a Unicode escape of the form \UXXXXXXXX (eight hex digits), and "\Users" isn't one, hence the error.
You can use raw strings in Python by prefixing the string with r.
searchFolder = r"C:\Users\rohrl\OneDrive\Python\PictureCompare\MixedPictures"
Or you can get in the habit of using double \\. Python reads \\ as a single \.
searchFolder = "C:\\Users\\rohrl\\OneDrive\\Python\\PictureCompare\\MixedPictures"
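A small illustration of the points above. The shortened path and the variable names are made up for the example.

```python
# "\n" is one character (a newline); the raw version keeps both characters.
newline = "\n"
raw_newline = r"\n"
print(len(newline), len(raw_newline))

# A valid \U escape needs exactly eight hex digits and yields one character.
emoji = "\U0001F600"
print(len(emoji))

# A raw string and a string with doubled backslashes hold the same text.
raw_path = r"C:\Users\rohrl"
escaped_path = "C:\\Users\\rohrl"
print(raw_path == escaped_path)
```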
You need to escape the backslash - or use slashes.
Also, I suggest you look at the pathlib library (it does not help much in this short example, but pathlib makes working with file system objects more Pythonic):
import os
import pathlib
# variant 1 - raw string
str_search_folder = r"C:\Users\rohrl\OneDrive\Python\PictureCompare\MixedPictures"
# variant 2 - escaping the backslash
str_search_folder = "C:\\Users\\rohrl\\OneDrive\\Python\\PictureCompare\\MixedPictures"
# variant 3 - my preferred, use slashes
str_search_folder = "C:/Users/rohrl/OneDrive/Python/PictureCompare/MixedPictures"
path_search_dir = pathlib.Path(str_search_folder)
os.chdir(path_search_dir)
# variant 1 - os.getcwd() returns the current directory (os.curdir is just the constant ".")
print(os.getcwd())
# variant 2
print(pathlib.Path.cwd())

python replace \\ with \ in stringpath automatically

How can I replace "\" in a path string with "\\" in Python? I know \ is the escape character, and r'\' and r"\" don't work either, neither in str.replace() nor in re.sub().
If your objective is to get the correct path you can use the raw string:
r"C:\Users"
# will return
Out[2]: 'C:\\Users'
# in the console
#however if you print it, it will print this:
print(r"C:\Users")
C:\Users
If you want to combine parts of the path dynamically, I recommend the os module (standard library).
Use it like this:
import os
path = os.path.join(r"first_part_of_path", r"other_part_of_path", "filename.xlsx")
From Python's documentation: "The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation."
https://docs.python.org/3/library/re.html
Below may be what you are looking for:
import re
x=r'this, is a \test'
re.subn(r'\\', r'\\\\', x) # replaces each \ with \\; returns (new_string, count)
From the standard library, you could use os.path.normpath
Example:
import os
myDir = r"path\to\dir"
normalized = os.path.normpath(myDir)
Which enables the following :
>>> normalized
'path\\to\\dir'
>>> print(normalized)
path\to\dir
>>> str(normalized)
'path\\to\\dir'
>>> repr(normalized)
"'path\\\\to\\\\dir'"
I just realized that a path written as a normal string literal, e.g.
path_str="E:\neural network\Pytorch"
can be changed to
path_str=path_str.encode('unicode-escape').decode().replace('\\\\', '\\')
which recovers the backslashes automatically (at least for this example), without manually rewriting the string as
path_str=r"E:\neural network\Pytorch"
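For the literal question of turning each \ into \\, here is a minimal sketch with made-up variable names. A single backslash is written "\\" in a normal literal, because a raw string cannot end in an odd number of backslashes (which is why r'\' is a syntax error). In re.sub, both the pattern and the replacement need their own level of escaping.

```python
import re

path = r"E:\neural network\Pytorch"

# "\\" is ONE backslash; "\\\\" is TWO.
doubled = path.replace("\\", "\\\\")
print(doubled)

# With re.sub: the pattern r"\\" matches one backslash, and the
# replacement r"\\\\" expands to two backslashes.
doubled_via_regex = re.sub(r"\\", r"\\\\", path)
print(doubled_via_regex == doubled)
```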

How does ruamel.yaml determine the encoding of escaped byte sequences in a string?

I am having trouble figuring out where to modify or configure ruamel.yaml's loader to get it to parse some old YAML with the correct encoding. The essence of the problem is that an escaped byte sequence in the document seems to be interpreted as latin1, and I have no earthly clue where it is doing that, after some source diving here. Here is a code sample that demonstrates the behavior (this in particular was run in Python 3.6):
from ruamel.yaml import YAML
yaml = YAML()
yaml.load('a:\n b: "\\xE2\\x80\\x99"\n') # Note that this is a str (that is, unicode) with escapes for the byte escapes in the YAML document
# ordereddict([('a', ordereddict([('b', 'â\x80\x99')]))])
Here are the same bytes decoded manually, just to show what it should parse to:
>>> b"\xE2\x80\x99".decode('utf8')
'’'
Note that I don't really have any control over the source document, so modifying it to produce the correct output with ruamel.yaml is out of the question.
ruamel.yaml doesn't interpret individual strings, it interprets the
stream it gets handed, i.e. the argument to .load(). If that
argument is a byte-stream or a file-like object then its encoding is
determined based on the BOM, defaulting to UTF-8. But again: that is
at the stream level, not at individual scalar content after
interpreting escapes. Since you hand .load() Unicode (as this is
Python 3) that "stream" needs no further decoding. (Although
irrelevant for this question: it is done in the reader.py:Reader methods stream and
determine_encoding)
The hex escapes (of the form \xAB) will just put a specific hex
value in the type the loader uses to construct the scalar, that is
the value for key 'b', and that is a normal Python 3 str, i.e. Unicode in
one of its internal representations. You get the â in your
output because \xE2 is taken as the code point U+00E2, which is â,
not as the first byte of a UTF-8 sequence.
So you won't "find" the place where ruamel.yaml decodes that
byte-sequence, because that is already assumed to be Unicode.
So the thing to do is to double decode your double quoted
scalars (you only have to address those, as plain, single quoted, and
literal/folded scalars cannot have the hex escapes). There are various
points at which you can try to do that, but I think
constructor.py:RoundTripConstructor.construct_scalar and
scalarstring.py:DoubleQuotedScalarString are the best candidates. The former of those might take some digging to find, but the latter is actually the type you'll get if you inspect
that string after loading when you add the option to preserve quotes:
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(type(data['a']['b']))
which prints:
<class 'ruamel.yaml.scalarstring.DoubleQuotedScalarString'>
Knowing that, you can inspect that rather simple wrapper class:
class DoubleQuotedScalarString(ScalarString):
    __slots__ = ()
    style = '"'
    def __new__(cls, value, anchor=None):
        # type: (Text, Any) -> Any
        return ScalarString.__new__(cls, value, anchor=anchor)
"update" the only method there (__new__) to do your double
decoding (you might have to put in additional checks to not double decode all
double quoted scalars):
import ruamel.yaml
def my_new(cls, value, anchor=None):
    # type information only needed if using mypy
    # value is of type 'str': encode to bytes "without conversion" (latin-1), then decode as UTF-8
    value = value.encode('latin_1').decode('utf-8')
    return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value, anchor=anchor)
ruamel.yaml.scalarstring.DoubleQuotedScalarString.__new__ = my_new
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(data)
which gives:
ordereddict([('a', ordereddict([('b', '’')]))])
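The "additional checks" mentioned above could look roughly like the following. This is only a sketch of the latin-1/UTF-8 round trip in plain Python; the helper name redecode_utf8 is hypothetical and not part of ruamel.yaml. It leaves a string unchanged whenever the round trip is not possible, so ordinary double quoted scalars are not damaged.

```python
def redecode_utf8(value: str) -> str:
    """Reinterpret code points 0-255 as UTF-8 bytes, if that is possible.

    Hypothetical helper: returns the input unchanged when the string
    contains characters outside latin-1 or is not valid UTF-8 as bytes.
    """
    try:
        return value.encode("latin_1").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return value


# What the loader produced for "\xE2\x80\x99" in the question.
fixed = redecode_utf8("\xe2\x80\x99")
print(fixed == "\u2019")

# Strings that are not UTF-8 byte sequences in disguise are left alone.
print(redecode_utf8("caf\xe9") == "caf\xe9")
print(redecode_utf8("plain ascii") == "plain ascii")
```

A function like this could be called inside the replacement __new__ instead of the bare encode/decode chain, at the cost of silently passing through values that could not be converted.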

I wanted to rename '\' with '\\' in python example:C:\Users\Performance_stats.xlsx with C:\\Users\\Performance_stats.xlsx

I tried to replace '\' with '\\' using the code below, but it shows the error
SyntaxError: unexpected character after line continuation character
String = "C\users\stat.csv"
String.replace('\','\\')
SyntaxError: unexpected character after line continuation character
Can someone help me get the output
"C\\users\\stat.csv" with the replace function?
You should make use of pathlib in Python.
from pathlib import Path, PureWindowsPath
# I've explicitly declared my path as being in Windows format, so its backslashes are treated as path separators.
filename = PureWindowsPath("source_data\\text_files\\raw_data.txt")
# Convert path to the right format for the current operating system
correct_path = Path(filename)
print(correct_path)
# prints "source_data/text_files/raw_data.txt" on Mac and Linux
# prints "source_data\text_files\raw_data.txt" on Windows
To read more, you can refer to this article.
You need to escape each \
String = "C\\users\\stat.csv"
x=String.replace('\\','\\\\')
print (x==String)
print (x)
Output as follows:
False
C\\users\\stat.csv
