The problem begins when I stumble upon Unicode characters, for example in árbol. Right now I handle this by asking whether the character at position i, that is string(i:i), is less than 127. If so, it belongs to the ASCII table, and I know for sure that string(i:i) is a complete single character. Otherwise (>= 127), and for my example 'árbol', string(1:2) is the complete character.
I think the way I'm handling the strings solves the problem for my practical purposes (handling files in Spanish, Polish and Russian), but in the case of Chinese text, where characters may take up to 4 bytes, I would have problems.
Is there a way in Fortran to single out Unicode characters inside a string?
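For reference, the ad-hoc rule in the question generalizes: in UTF-8 the lead byte alone determines how many bytes a character occupies. A sketch of that byte-level logic, in Python rather than Fortran, just to illustrate the rule:

```python
def utf8_char_len(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:      # 0xxxxxxx: ASCII, single byte
        return 1
    if lead_byte >= 0xF0:     # 11110xxx: 4-byte sequence
        return 4
    if lead_byte >= 0xE0:     # 1110xxxx: 3-byte sequence
        return 3
    if lead_byte >= 0xC0:     # 110xxxxx: 2-byte sequence
        return 2
    raise ValueError("continuation byte, not the start of a character")

# Walk a UTF-8 byte string character by character, as the question does
word = "árbol".encode("utf-8")   # b'\xc3\xa1rbol'
i = 0
chars = []
while i < len(word):
    n = utf8_char_len(word[i])
    chars.append(word[i:i+n])
    i += n
# chars == [b'\xc3\xa1', b'r', b'b', b'o', b'l']
```

The same test the question applies (byte < 127 means a complete character) is the first branch; the remaining branches cover the 2-, 3- and 4-byte cases, including Chinese text.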
gfortran does not currently support non-ASCII characters in UTF-8 encoded files, see here. You can find the corresponding bug report here.
As a workaround, you can specify the Unicode character in hex notation: char(int(z'00E1'), ucs4), or '\u00E1'. The latter requires the compile option -fbackslash to enable the evaluation of the backslash.
program character_kind
  use iso_fortran_env
  implicit none
  integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
  character(kind=ucs4, len=20) :: string

  ! string = ucs4_'árbol' ! This does not work
  ! string = char(int(z'00E1'), ucs4) // ucs4_'rbol' ! This works
  string = ucs4_'\u00E1rbol' ! This is also working

  open (output_unit, encoding='UTF-8')
  print *, string(1:1)
  print *, string
end program character_kind
ifort seems not to support ISO_10646 at all: selected_char_kind ('ISO_10646') returns -1. With ifort 15.0.0 I get the same message as described here.
I'm working on a simple localization function for my scripts and, although it's starting to work quite well so far, I don't know how to keep escape/special characters from showing up in the UI as part of the text after feeding the widgets with the strings returned by f:read().
For example, if a certain line of Strings.ES.txt contains: Ignorar \"Etiquetas de capa\", I'd expect the backslashes not to end up showing, just as when I feed the widget with a normal string between double quotes like: "Ignorar \"Etiquetas de capa\"", or at least to have a way to avoid it. I've been trial-and-erroring with the tostring() and load() functions and different (surely nonsense 🙄) concatenations like: load(tostring("[[" .. f:read()" .. ]]")) and such without any success, so here I am again...
Does anyone know if there is a way to get escape characters in a string returned by f:read() to still behave as special as when they are found in a regular one?
I don't know how to avoid [e]scape/special characters to be shown in UI as part of the text
What you want is to "unescape" or "unquote" a string to interpret escape sequences as if it were parsed as a quoted string by Lua.
[...] with the strings returned by f:read() [...]
The fact that this string was obtained using f:read() can be ignored; all that matters is that it is a string literal without quotes using quoted string escapes.
I've been trial-and-erroring with tostring() and load() functions and different [...] concatenations like: load(tostring("[[" .. f:read()" .. ]]")) and such without any success [...]
This is almost how to do it, except you chose the wrong string literal type: "long" strings using pairs of square brackets ([ and ]) do not interpret escape sequences at all. They are intended for including long, raw, possibly multiline strings in Lua programs and often come in handy when you need to represent literal strings with backslashes (e.g. regular expressions; not to be confused with Lua patterns, which use % for escapes and lack the basic alternation operator of regular expressions).
If you instead use single or double quotes to wrap the string, it will work fine:
local function unescape_string(escaped)
	return assert(load(('return "%s"'):format(escaped)))()
end
This will produce a tiny Lua program (a "chunk") for each string, which consists simply of return "<contents>". Recall that Lua chunks are just functions, so you can simply call the chunk to obtain the value of the string it returns. That way, Lua interprets the escape sequences for us. The same approach is often used to read data serialized as Lua code.
Note also the use of assert for error handling: load returns nil, err if there is a syntax error. To deal with this gracefully, we can wrap the call to load in assert: assert returns its first argument (the chunk returned by load) if it is truthy; otherwise, if it is falsy (e.g. nil in this case), assert errors, using its second argument as an error message. If you omit the assert and your input causes a syntax error, you will instead get a cryptic "attempt to call a nil value" error.
You probably want to do additional validation, especially if these escaped strings are user-provided; otherwise a malicious string like str"; os.execute("...") trivially opens a remote code execution (RCE) vulnerability, allowing an attacker both to execute Lua (e.g. to block with while 1 do end, slow down, or hijack your application) and to run shell commands using os.execute. To guard against this, searching for an unescaped closing quote should be sufficient (syntax errors, e.g. through invalid escapes, will still be possible, but RCE should not be, barring Lua interpreter bugs):
local function unescape_string(escaped)
	-- match start & end of sequences of zero or more backslashes followed by a double quote
	for from, to in escaped:gmatch'()\\*()"' do
		-- number of preceding backslashes must be odd for the double quote to be escaped
		assert((to - from) % 2 ~= 0, "unescaped double quote")
	end
	return assert(load(('return "%s"'):format(escaped)))()
end
Alternatively, a more robust (but also more complex) and presumably more efficient way of unescaping this would be to manually implement escape sequences through string.gsub; that way you get full control, which is more suitable for user-provided input:
-- Single-character backslash escapes of Lua 5.1 according to the reference manual: https://www.lua.org/manual/5.1/manual.html#2.1
local escapes = {a = '\a', b = '\b', f = '\f', n = '\n', r = '\r', t = '\t', v = '\v', ['\\'] = '\\', ["'"] = "'", ['"'] = '"'}
local function unescape_string(escaped)
	-- parentheses drop gsub's second return value (the substitution count)
	return (escaped:gsub("\\(.)", escapes))
end
You may implement escapes here as you see fit; for example, this misses decimal escapes, which could easily be implemented as escaped:gsub("\\(%d%d?%d?)", string.char) (this relies on the coercion of strings to numbers in string.char and passes a replacement function as the second argument to string.gsub).
This function can finally be used straightforwardly as unescape_string(f:read()).
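The table-driven approach is not Lua-specific. For comparison, a rough Python sketch of the same idea (the function name and escape table mirror the Lua code above; decimal escapes are included):

```python
import re

# Single-character escapes, mirroring the Lua table above
ESCAPES = {'a': '\a', 'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r',
           't': '\t', 'v': '\v', '\\': '\\', "'": "'", '"': '"'}

def unescape_string(escaped):
    def repl(match):
        seq = match.group(1)
        if seq.isdigit():
            # decimal escape such as \65 -> 'A'
            return chr(int(seq))
        # single-character escape; unknown sequences pass through unchanged
        return ESCAPES.get(seq, seq)
    # a backslash followed by 1-3 digits, or by any single character
    return re.sub(r'\\(\d{1,3}|.)', repl, escaped)

# unescape_string(r'Ignorar \"Etiquetas de capa\"') == 'Ignorar "Etiquetas de capa"'
```

As with the Lua version, this gives full control over which escapes are honored, which is what you want for user-provided input.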
This problem might be very simple but I find it a bit confusing & that is why I need help.
With relevance to this question I posted that got solved, I got a new issue that I just noticed.
Source code:
from PyQt5 import QtCore, QtWidgets

app = QtWidgets.QApplication([])

def scroll():
    # QtCore.QRegularExpression(r'\b'+'cat'+'\b')
    item = listWidget.findItems(r'\bcat\b', QtCore.Qt.MatchRegularExpression)
    for d in item:
        print(d.text())

window = QtWidgets.QDialog()
window.setLayout(QtWidgets.QVBoxLayout())
listWidget = QtWidgets.QListWidget()
window.layout().addWidget(listWidget)
cats = ["love my cat", "catirization", "cat in the clouds", "catść"]
for i, cat in enumerate(cats):
    QtWidgets.QListWidgetItem(f"{i} {cat}", listWidget)
btn = QtWidgets.QPushButton('Scroll')
btn.clicked.connect(scroll)
window.layout().addWidget(btn)
window.show()
app.exec_()
Output GUI:
Now as you can see I am just trying to print out the text data based on the regex r"\bcat\b" when I press the "Scroll" button and it works fine!
Output:
0 love my cat
2 cat in the clouds
3 catść
However... as you can see with #3, it should not be printed because it obviously does not match the mentioned regular expression, which is r"\bcat\b". It matches anyway, and I think it has something to do with that special foreign character ść that makes it a match (which it shouldn't, right?).
I'm expecting an output like:
0 love my cat
2 cat in the clouds
Research I have tried
I found this question and it says something about this \p{L} & based on the answer it means:
If all you want to match is letters (including "international"
letters) you can use \p{L}.
To be honest I'm not so sure how to apply that with PyQt5, but I still made some attempts, e.g. changing the regex to r'\b'+r'\p{cat}'+r'\b'. However, I got this error.
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
Obviously the error says it's not a valid regex. Can someone educate me on how to solve this issue? Thank you!
In general, when you need to make your shorthand character classes and word boundaries Unicode-aware, you need to pass the QRegularExpression.UseUnicodePropertiesOption option to the regex compiler. See the QRegularExpression.UseUnicodePropertiesOption reference:
The meaning of the \w, \d, etc., character classes, as well as the meaning of their counterparts (\W, \D, etc.), is changed from matching ASCII characters only to matching any character with the corresponding Unicode property. For instance, \d is changed to match any character with the Unicode Nd (decimal digit) property; \w to match any character with either the Unicode L (letter) or N (digit) property, plus underscore, and so on. This option corresponds to the /u modifier in Perl regular expressions.
In Python, you could declare it as
rx = QtCore.QRegularExpression(r'\bcat\b', QtCore.QRegularExpression.UseUnicodePropertiesOption)
However, since QListWidget.findItems does not support a QRegularExpression as an argument and only accepts the regex as a string object, you can only use the (*UCP) PCRE verb as an alternative:
r'(*UCP)\bcat\b'
Make sure you define it at the regex beginning.
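To see what (*UCP) changes, Python's stdlib re module can illustrate the same distinction without any Qt dependency: its \b is Unicode-aware by default, and the re.ASCII flag reproduces the ASCII-only behaviour seen in the question (a sketch, not PyQt code):

```python
import re

words = ["love my cat", "catirization", "cat in the clouds", "catść"]

# ASCII-only \b: 'ś' is not a word character, so a boundary is (wrongly)
# found after the 't' in "catść" and the item matches
ascii_hits = [w for w in words if re.search(r'\bcat\b', w, re.ASCII)]

# Unicode-aware \b (Python's default, analogous to PCRE's (*UCP)):
# 'ś' counts as a letter, so there is no boundary and "catść" is rejected
unicode_hits = [w for w in words if re.search(r'\bcat\b', w)]

# ascii_hits   == ['love my cat', 'cat in the clouds', 'catść']
# unicode_hits == ['love my cat', 'cat in the clouds']
```

This mirrors the before/after of adding (*UCP) to the pattern passed to findItems.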
I have a list of tuples, produced by some function, which looks like:
[{"a","ą"},
{"ą","a"},
{"a","o"},
{"o","e"}]
But when I print it, I see in terminal:
[{"a",[261]},
{[261],"a"},
{"a","o"},
{"o","e"}]
I usually print it with this command:
io:format("~p~n", [functionThatGeneratesListOfTuples()]),
So far I found that you need to use ~ts when printing Unicode strings, so I tried this:
Pairs = functionThatGeneratesListOfTuples(),
PairsStr = io_lib:format("~p", [Pairs]),
io:format("~ts~n", [PairsStr]),
Is there any way to get the Unicode strings represented appropriately?
The heuristics for detecting lists-of-integers as strings only recognize Latin-1 characters by default, so [65,66,67] is printed as "ABC" but [665,666,667] is printed as "[665,666,667]" even if you use ~tp. You have to start Erlang as erl +pc unicode to make it accept printable unicode code points above 255. In that mode, [665,666,667] is printed as "ʙʚʛ" with ~tp (but not with ~p).
See http://erlang.org/doc/man/io.html#printable_range-0 for more info, and also this recent improvement of the documentation, which will be included in OTP 21: https://github.com/erlang/otp/pull/1737/files
Python 3.6
I converted a string from utf8 to this:
b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au'
I now want that chunk of ascii back into string form, so there is no longer the little b for bytes at the beginning.
BUT I don't want it converted back to UTF-8; I want that same sequence of characters that you see above in my Python string.
How can I do so? All I can find are ways of converting bytes to string along with encoding or decoding.
The (wrong) answer is quite simple:
chr(asciiCode)
In your special case:
myString = ""
for char in b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au':
    myString += chr(char)
print(myString)
gives:
æ没æçµ#xn--ssdcsrs-2e1xt16k.com.au
Maybe you are also interested in the right answer? It will probably not please you, because it says you ALWAYS have to deal with encoding/decoding ... because myString is now both UTF-8 and ASCII at the same time (exactly as it already was before you "converted" it to ASCII).
Notice that how myString shows up when you print it will depend on the implicit encoding/decoding used by print.
In other words ...
there is NO WAY to avoid encoding/decoding
but there is a way of doing it in a non-explicit way.
I suppose that reading my answer provided HERE: Converting UTF-8 (in literal) to Umlaute will help you much in understanding the whole encoding/decoding thing.
What you have there is not ASCII, as it contains for instance the byte \xe6, which is higher than 127. It's still UTF-8.
The representation of the string (with the 'b' at the start, then a ', then a '\', ...), that is ASCII. You get it with repr(yourstring). But the content of the string that you're printing is UTF-8.
But I don't think you need to turn that back into a UTF-8 string; it may depend on the rest of your code.
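For completeness, a small sketch of the relationships described above. The chr-per-byte loop from the other answer is exactly a Latin-1 decode, and the original text comes back with a plain UTF-8 decode:

```python
data = b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au'

# Interpreting each byte as a code point (what chr(byte) does) is exactly
# a Latin-1 decode: the resulting str has one code point per input byte.
as_latin1 = data.decode('latin-1')

# The ASCII representation with the leading b'...' is just repr(data);
# the underlying content is still UTF-8 and decodes to the original text.
as_utf8 = data.decode('utf-8')      # '我没有电#xn--ssdcsrs-2e1xt16k.com.au'

# The Latin-1 round trip is lossless: as_latin1.encode('latin-1') == data
```

Either way, some decode happens; the only choice is which one, which is the point of the answer above.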
The following code
if (!cfile.Open(fileName, CFile::modeCreate | CFile::modeReadWrite)) {
    return;
}
ggg.Format(_T("0 \r\n"));
cfile.Write(ggg, ggg.GetLength());
ggg.Format(_T("SECTION \r\n"));
cfile.Write(ggg, ggg.GetLength());
produces the following:
0 SECTI
clearly this is wrong: (a) \r\n is ignored, and (b) the word SECTION is cut off.
Can someone please tell me what I am doing wrong?
The same code without _T() in VC6 produces the correct results.
Thank you
a.
Apparently, you are building a Unicode build; CString (presumably that's what ggg is) holds a sequence of wchar_t characters, each two bytes large. ggg.GetLength() is the length of the string in characters.
However, CFile::Write takes the length in bytes, not in characters. You are passing half the number of bytes actually taken by the string, so only half the number of characters gets written.
Have you considered changing lines like:
cfile.Write(ggg, ggg.GetLength());
to:
cfile.Write(ggg, ggg.GetLength() * sizeof(TCHAR));
Write needs the number of bytes (not characters). Since a Unicode (wide) character is 2 bytes wide here, you need to account for that. sizeof(TCHAR) is the number of bytes each character takes for a given build: 1 for an ANSI build, 2 for a Unicode build. Multiplying that by the string length gives the correct number of bytes.
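The truncation can be reproduced outside MFC. A small Python sketch (using UTF-16-LE as a stand-in for the Windows wide-character layout; the names are mine) shows why passing the character count as a byte count yields "SECTI":

```python
text = "SECTION \r\n"                 # 10 characters, as in the question

wide = text.encode('utf-16-le')       # each character occupies 2 bytes
# len(wide) == len(text) * 2, i.e. the sizeof(TCHAR) factor in a Unicode build

# CFile::Write(ggg, ggg.GetLength()) passes the CHARACTER count as the
# byte count, so only the first half of the characters reach the file:
truncated = wide[:len(text)]
half = truncated.decode('utf-16-le')  # 'SECTI', the symptom in the question
```

Supplying len(wide) (the byte count) instead of len(text) is the Python analogue of multiplying by sizeof(TCHAR).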
Information on TCHAR can be found on MSDN documentation here. In particular it is defined as:
The _TCHAR data type is defined conditionally in Tchar.h. If the symbol _UNICODE is defined for your build, _TCHAR is defined as wchar_t; otherwise, for single-byte and MBCS builds, it is defined as char. (wchar_t, the basic Unicode wide-character data type, is the 16-bit counterpart to an 8-bit signed char.)
TCHAR and _TCHAR in your usage should be synonymous. However I believe these days Microsoft recommends including <tchar.h> and using _TCHAR. What I can't tell you is if _TCHAR existed on VC 6.0.
If you use the method above and build for Unicode, your output files will be in Unicode; if you build for ANSI, the output will be 8-bit ASCII.
Want CFile::Write to output ASCII no matter what? Read on...
If you want all text written to the file as 8-bit ASCII, you are going to have to use one of the macros for conversion, in particular CT2A. More on the macros can be found in this MSDN article. Each macro can be broken up by name; CT2A says convert the generic character string (equivalent to W when _UNICODE is defined, equivalent to A otherwise) to ASCII per the chart at the link. So whether you build for Unicode or ANSI, it would output ASCII. Your code would look something like:
ggg.Format(_T("0 \r\n"));
cfile.Write(CT2A(ggg), ggg.GetLength());
ggg.Format(_T("SECTION \r\n"));
cfile.Write(CT2A(ggg), ggg.GetLength());
Since the macro converts everything to ASCII, CString's GetLength() will suffice.