Regex deprecation warning confusion [duplicate] - python-3.x

The following fragment of code comes from my GitHub repository.
It opens a binary file, and extracts the text within <header> tags. These are the crucial lines:
gbxfile = open(filename,'rb')
gbx_data = gbxfile.read()
gbx_header = b'(<header)((?s).*)(</header>)'
header_intermediate = re.findall(gbx_header, gbx_data)
The script works, but it raises the following DeprecationWarning:
DeprecationWarning: Flags not at the start of the expression b'(<header)((?s).*)(</' (truncated)
header_intermediate = re.findall(gbx_header, gbx_data)
What is the correct use of the regular expression in gbx_header, so that this warning is not displayed?

You can check the Python bug tracker, Issue 39394; the warning was introduced in Python 3.6.
The point is that Python's re module now deprecates global inline modifiers that are not at the start of the pattern. In Python 2.x, you can use your pattern without any problems or warnings, as (?s) is silently applied to the whole regular expression under the hood. Since this is not always the expected behavior, the Python developers decided to emit a warning.
Note that you can now use inline modifier groups in Python re, such as (?s:.*); see restrict 1 word as case sensitive and other as case insensitive in python regex | (pipe).
So, the solutions are
Putting (?s) (or any other inline modifier) at the start of the pattern: (?s)(<header)(.*)(</header>)
Using the re option, re.S / re.DOTALL instead of (?s), re.I / re.IGNORECASE instead of (?i), etc.
Using workarounds (instead of ., use [\w\W]/[\d\D]/[\s\S] if you do not want to use (?s) or re.S/re.DOTALL).
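A minimal sketch of the solutions listed above, applied to the pattern from the question. The sample bytes in gbx_data are a hypothetical stand-in for the real file contents, and the fourth variant uses the scoped modifier group (?s:...) mentioned earlier. All four patterns should extract the same header text without triggering the warning:

```python
import re

# Hypothetical sample standing in for the binary file contents.
gbx_data = b'\x00\x01<header type="demo">\nline one\nline two\n</header>\x02'

patterns = {
    # 1. Inline flag moved to the very start of the pattern.
    "flag at start": re.compile(rb'(?s)(<header)(.*)(</header>)'),
    # 2. Flag passed as an argument instead of inline.
    "re.DOTALL argument": re.compile(rb'(<header)(.*)(</header>)', re.DOTALL),
    # 3. Character-class workaround that matches any byte, newlines included.
    "[\\s\\S] workaround": re.compile(rb'(<header)([\s\S]*)(</header>)'),
    # 4. Modifier scoped to one group (Python 3.6+).
    "scoped (?s:...) group": re.compile(rb'(<header)((?s:.*))(</header>)'),
}

results = {name: rx.findall(gbx_data) for name, rx in patterns.items()}
for name, found in results.items():
    print(name, found)
```

Passing re.DOTALL as an argument is probably the easiest variant to read, since the flag cannot end up in the wrong position.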

Related

Python regular expressions with Foreign characters in python PyQT5

This problem might be very simple but I find it a bit confusing & that is why I need help.
With relevance to this question I posted that got solved, I got a new issue that I just noticed.
Source code:
from PyQt5 import QtCore,QtWidgets
app=QtWidgets.QApplication([])
def scroll():
    #QtCore.QRegularExpression(r'\b'+'cat'+'\b')
    item = listWidget.findItems(r'\bcat\b', QtCore.Qt.MatchRegularExpression)
    for d in item:
        print(d.text())
window = QtWidgets.QDialog()
window.setLayout(QtWidgets.QVBoxLayout())
listWidget = QtWidgets.QListWidget()
window.layout().addWidget(listWidget)
cats = ["love my cat","catirization","cat in the clouds","catść"]
for i,cat in enumerate(cats):
    QtWidgets.QListWidgetItem(f"{i} {cat}", listWidget)
btn = QtWidgets.QPushButton('Scroll')
btn.clicked.connect(scroll)
window.layout().addWidget(btn)
window.show()
app.exec_()
Output GUI:
Now as you can see I am just trying to print out the text data based on the regex r"\bcat\b" when I press the "Scroll" button and it works fine!
Output:
0 love my cat
2 cat in the clouds
3 catść
However, as you can see, item #3 should not be printed because it obviously does not match the regular expression r"\bcat\b". Yet it is printed, and I think it has something to do with the special foreign characters ść, which somehow make it a match (which it shouldn't be, right?).
I'm expecting an output like:
0 love my cat
2 cat in the clouds
Researches I have tried
I found this question and it says something about this \p{L} & based on the answer it means:
If all you want to match is letters (including "international"
letters) you can use \p{L}.
To be honest, I'm not sure how to apply that with PyQT5. Still, I made some attempts and tried changing the regex to r'\b'+r'\p{cat}'+r'\b'. However, I got this error:
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
Obviously the error says it's not a valid regex. Can someone educate me on how to solve this issue? Thank you!
In general, when you need to make your shorthand character classes and word boundaries Unicode-aware, you need to pass the QRegularExpression.UseUnicodePropertiesOption option to the regex compiler. See the QRegularExpression.UseUnicodePropertiesOption reference:
The meaning of the \w, \d, etc., character classes, as well as the meaning of their counterparts (\W, \D, etc.), is changed from matching ASCII characters only to matching any character with the corresponding Unicode property. For instance, \d is changed to match any character with the Unicode Nd (decimal digit) property; \w to match any character with either the Unicode L (letter) or N (digit) property, plus underscore, and so on. This option corresponds to the /u modifier in Perl regular expressions.
In Python, you could declare it as
rx = QtCore.QRegularExpression(r'\bcat\b', QtCore.QRegularExpression.UseUnicodePropertiesOption)
However, since QListWidget.findItems does not accept a QRegularExpression as an argument and only takes the regex as a string, you can use the (*UCP) PCRE verb as an alternative:
r'(*UCP)\bcat\b'
Make sure you put it at the very start of the pattern.
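The difference between ASCII-only and Unicode-aware word boundaries can be reproduced without Qt. The sketch below is only an analogy using Python's own re module, not the Qt API: re.ASCII mimics the default QRegularExpression behavior described above, while the default str mode of re behaves like a pattern with (*UCP) enabled.

```python
import re

cats = ["love my cat", "catirization", "cat in the clouds", "catść"]

# ASCII-only \w and \b: "ś" is not a word character, so a boundary
# is found between "t" and "ś" and "catść" matches.
ascii_rx = re.compile(r'\bcat\b', re.ASCII)

# Unicode-aware \w and \b (the default for str patterns in Python 3):
# "ś" is a letter, so there is no boundary after "cat" in "catść".
unicode_rx = re.compile(r'\bcat\b')

ascii_hits = [c for c in cats if ascii_rx.search(c)]
unicode_hits = [c for c in cats if unicode_rx.search(c)]

print(ascii_hits)
print(unicode_hits)
```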

How to change a string into a variable

I want to write some data to a file. I saved the filename in a variable. I want to use % formatting to substitute the variable into the path, but it gives an error:
IndentationError: unindent does not match any outer indentation level
writeafile = open('N:\myfile\%s.txt' , "a") % (variable)
Assuming we are talking about Python here, you should move variable next to the 'N:\\myfile\\%s.txt' string for correct syntax, like so:
writeafile = open("N:\\myfile\\%s.txt" % variable, "a")
However, this style of formatting is not recommended by the Python documentation:
The formatting operations described here exhibit a variety of quirks that lead to a number of common errors (such as failing to display tuples and dictionaries correctly). Using the newer formatted string literals, the str.format() interface, or template strings may help avoid these errors. Each of these alternatives provides their own trade-offs and benefits of simplicity, flexibility, and/or extensibility.
Source
So, I'd suggest using f-strings, which have been available since Python 3.6. The double \\ is intentional here; otherwise Python treats the single backslash as the start of an escape sequence and you'll get undesired results.
writeafile = open(f"N:\\myfile\\{variable}.txt", "a")
Alternatively, you could also use str.format():
writeafile = open("N:\\myfile\\{name}.txt".format(name=variable), "a")
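As a quick check, the three formatting styles above build the same filename. The sketch below uses a temporary directory instead of the N: drive so it can run anywhere; the value of variable is a hypothetical example, and the with block is an addition that closes the file automatically.

```python
import os
import tempfile

variable = "report"            # hypothetical filename stem
base = tempfile.mkdtemp()      # stands in for N:\myfile

# The same path built with %-formatting, an f-string, and str.format().
by_percent = os.path.join(base, "%s.txt" % variable)
by_fstring = os.path.join(base, f"{variable}.txt")
by_format = os.path.join(base, "{name}.txt".format(name=variable))

# Open in append mode; the file is closed when the block exits.
with open(by_fstring, "a") as writeafile:
    writeafile.write("some data\n")
```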

Non capturing branch reset regex in NodeJS

https://regex101.com/r/UXnhTy/1
var date = /(?|(Sat)ur|(Sun))day/;
console.log(date.exec("Sunday"));
This fails with:
SyntaxError: Invalid regular expression: /(?|(Sat)ur|(Sun))day/: Invalid group
Is there a version of NodeJS that supports this? Or some library out there that does?
I tested this with NodeJS v8.12.0.
Not really. An advanced alternative regex library for JavaScript is XRegExp, but it doesn't have the feature you're after - not even as an addon.
A simpler regex feature that is supported by XRegExp is named capture groups, so you can write:
var days = XRegExp('(?:(?<d>Sat)ur|(?<d>Sun))day', 'gi');
You can't use numbers as group names, but named groups should fit your needs: they allow backreferences (using \k<d>), replacement (${d}), capturing (match.d), and all the features of a regular numbered group.
Named capture groups are supported natively in ES2018: ES2018 Regular Expression Updates.
According to node.green, named capture groups are supported by Node.js ≥10.3.0, or by ≥8.6.0 with the --harmony flag.

Invalid Syntax in Python 3.5.2 using Codeacademy functions

I'm trying to code in Python 3. So far I've used codeacademy to copy and paste the functions I've wanted. Unless what codeacademy uses is python 2 (which it's not, I've checked). So I'm curious why it highlights len and says invalid syntax.
print ('Have you thought today?')
original = raw_input('Yes or No:')
If len(original) > 2:
    print ('When?')
You capitalized If. Python is case sensitive, so you must use the correct keyword, which is lowercase if.
Also, raw_input() was renamed to input() in Python 3.
Python 3 doesn't have raw_input().
Change
original = raw_input('Yes or No:')
to:
original = input('Yes or No:')
And, also use if in place of If.
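Putting both fixes together gives something like the sketch below. Wrapping the logic in a function with a read parameter is an addition made here so the snippet can be exercised without typing at a prompt; the names ask and read are hypothetical.

```python
def ask(read=input):
    """Ask the question and follow up if the answer is longer than 2 characters."""
    print('Have you thought today?')
    original = read('Yes or No:')   # input() replaces Python 2's raw_input()
    if len(original) > 2:           # lowercase "if" is the keyword
        print('When?')
        return True
    return False
```

Calling ask() with no arguments reads from the keyboard, as in the original script.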

Determine the firefox version available for update using Python

I am looking for a snippet that checks which version is available to download as an update.
I use Python 3.x, so it would be nice if anyone has a hint on how I can check the version available on the server. The output should be a variable holding the Firefox version number, for example 22.0.
I am using Linux as my operating system.
To be clear:
I don't want to know which version is already installed on my system. I want to know which version I can update to.
So far i got the following code:
import os

def firefox_version_remote():
    firefox_version_fresh = os.popen("curl -s -l ftp.mozilla.org/pub/mozilla.org/firefox/releases/latest/linux-i686/de/").read()
    # short name for firefox version num fresh
    fvnf = " "
    for i in firefox_version_fresh:
        # keep every character that is not a letter
        if not i.isalpha():
            fvnf = fvnf + i
    return fvnf.strip()
This returns -22.0..2 where it should return 22.0.
Have you considered using a regular expression to match the numbers you're trying to extract? That would be a lot easier. Something like this:
matches = re.findall(r'\d+(?:\.\d+)+', firefox_version_fresh)
if matches:
    fvnf = matches[0]
That's assuming the version is of the form x.y potentially followed by more sub versions (e.g. x.y.z).
\d+ is one or more digits
(?: )+ is one or more of everything in the parentheses. The ?: tells the compiler that it's a non-capturing group, i.e. you're not interested in extracting the data inside the parentheses as a separate group.
\.\d+ matches a dot followed by one or more digits
So the whole expression can be described as one or more digits followed by one or more occurrences of a dot and one or more digits.
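A self-contained sketch of the same pattern, run against sample strings rather than the live server. The filenames are hypothetical examples of what the directory listing might contain, and extract_version is a helper name introduced here for illustration.

```python
import re

# One or more digits, followed by one or more ".digits" groups.
VERSION_RE = re.compile(r'\d+(?:\.\d+)+')

def extract_version(listing):
    """Return the first x.y[.z...] version found in the text, or None."""
    matches = VERSION_RE.findall(listing)
    return matches[0] if matches else None

# Hypothetical listing lines; the trailing "2" in "bz2" is not matched
# because it is not followed by a dot and more digits.
print(extract_version("firefox-22.0.tar.bz2"))
print(extract_version("firefox-17.0.11esr.tar.bz2"))
print(extract_version("no version here"))
```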
