I'm trying to clean up messy data in a data frame using regex. The column 'packaging' holds categorical values with inconsistent spellings: for example 'plastic', 'plastique', 'Plasticpackage', or 'Karton', 'carton', 'Carton'. Each group means the same thing ('plastic' or 'Carton'), so I'm trying to normalize these values with .replace and regex. My code looks like this:
dict1={r'[cK]arton':'Carton',r'\W*((?i)plasti(?-i))\W*':'Plastique',r'[cC]onserve':'Conserve'}
df['packaging'].replace(dict1,inplace=True,regex=True)
However, when I execute it, I get the error: missing : at position 18
I have checked, and rows 1 to 18 contain 17 missing values, not only row 18, so why do I get this error? Should I tell Python to ignore NA values? The replace() function does not seem to have an option to ignore NA.
Thank you very much in advance
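The error is raised by the regex compiler, not by the missing values: position 18 is a character offset inside the second pattern (the closing parenthesis of (?-i)), not a row of the data frame.
You cannot use inline modifiers such as (?i) in Python re at a non-initial position in a regex. Besides, re does not support the bare (?-i) notation (to disable the effect of the preceding (?i)).
Instead, you can use an inline modifier group, (?i:...).
So, you need to fix the regex definitions like this: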
dict1={
r'[cK]arton':'Carton',
r'\W*(?i:plasti)\W*':'Plastique',
r'[cC]onserve':'Conserve'
}
Or, r'\W*(?i:plasti)\W*' can also be written as r'(?i)\W*plasti\W*'.
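A quick way to check the corrected pattern outside pandas is with plain re. This is only a sketch: the sample string comes from the question, and the second pattern (covering the whole value with .*) is an assumption about what the asker ultimately wants, since re.sub only replaces the part of the string that the pattern matched.

```python
import re

# Scoped inline flag: only "plasti" is matched case-insensitively.
scoped = re.compile(r'\W*(?i:plasti)\W*')

# Only the matched substring is replaced, so the rest of the value survives.
partial = scoped.sub('Plastique', 'Plasticpackage')

# Assumption: the whole value should collapse to a single label,
# so this variant lets the pattern cover the entire string.
whole = re.compile(r'(?i).*plasti.*')
full = whole.sub('Plastique', 'Plasticpackage')

print(partial)  # Plastiquecpackage
print(full)     # Plastique
```

If the whole-value behaviour is what you need, the same .* idea can be applied to the other keys of dict1.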
Related
I am trying to formulate a regex to get the IDs from the two example strings below:
/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details
/drugs/2/drug-19906/magnesium-moxide-tablet/details
In the first case, I should get 19904-5106 and in the second case 19906.
So far I have tried several patterns; the closest I could get is [drugs/2/drug]-.*\d, but it returns g-19904-5106 and g-19906.
Can anyone help me get rid of the "g-"?
Thank you in advance.
When writing a regex, look at the patterns in your data so that you can match them precisely. For example, if you know that your desired IDs always appear in something resembling ABCD-1234-5678, where 1234-5678 is the ID you want, then you can use that. If you also know that your IDs are always digits, then you can refine the search even more. Note that square brackets define a character class, so [drugs/2/drug] matches a single character from that set, which is why your result starts with g-.
For your example, using a regex string like
.+?-(\d+(?:-\d+)*)
should do the trick. In a Python script that would look something like the following:
import re
match = re.search(r'.+?-(\d+(?:-\d+)*)', my_string)
if match:
    my_id = match.group(1)
The pattern may vary depending on the depth and complexity of your examples, but that works for both of the ones you provided.
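To see the pattern at work on both strings from the question, here is a small self-contained sketch. The helper name extract_id is made up for illustration.

```python
import re

# Lazily skip ahead to the first hyphen that is followed by digits,
# then capture the digits plus any further "-digits" groups.
ID_PATTERN = re.compile(r'.+?-(\d+(?:-\d+)*)')

def extract_id(path):
    """Return the captured ID, or None when the path has no match."""
    match = ID_PATTERN.search(path)
    return match.group(1) if match else None

paths = [
    '/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details',
    '/drugs/2/drug-19906/magnesium-moxide-tablet/details',
]
ids = [extract_id(p) for p in paths]
print(ids)  # ['19904-5106', '19906']
```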
This is the closest I could find: \d+|.\d+-.\d+
This problem might be very simple, but I find it a bit confusing, which is why I need help.
Following up on an earlier question of mine that was solved, I have just noticed a new issue.
Source code:
from PyQt5 import QtCore,QtWidgets
app=QtWidgets.QApplication([])
def scroll():
    #QtCore.QRegularExpression(r'\b'+'cat'+'\b')
    item = listWidget.findItems(r'\bcat\b', QtCore.Qt.MatchRegularExpression)
    for d in item:
        print(d.text())
window = QtWidgets.QDialog()
window.setLayout(QtWidgets.QVBoxLayout())
listWidget = QtWidgets.QListWidget()
window.layout().addWidget(listWidget)
cats = ["love my cat","catirization","cat in the clouds","catść"]
for i,cat in enumerate(cats):
    QtWidgets.QListWidgetItem(f"{i} {cat}", listWidget)
btn = QtWidgets.QPushButton('Scroll')
btn.clicked.connect(scroll)
window.layout().addWidget(btn)
window.show()
app.exec_()
Output GUI:
As you can see, I am just trying to print the text of the items that match the regex r"\bcat\b" when I press the "Scroll" button, and it works fine!
Output:
0 love my cat
2 cat in the clouds
3 catść
However, as you can see, item #3 should not be printed because it obviously does not match the regular expression r"\bcat\b". Yet it is printed, and I think it has something to do with the foreign characters ść, which make it a match (which they shouldn't, right?).
I'm expecting an output like:
0 love my cat
2 cat in the clouds
What I have tried
I found this question, which mentions \p{L}; according to the answer:
If all you want to match is letters (including "international"
letters) you can use \p{L}.
To be honest, I'm not sure how to apply that with PyQt5. I still made some attempts, such as changing the regex to r'\b'+r'\p{cat}'+r'\b', but I got this error:
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
Obviously the error says it's not a valid regex. Can someone educate me on how to solve this issue? Thank you!
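In general, when you need to make your shorthand character classes and word boundaries Unicode-aware, you need to pass the QRegularExpression.UseUnicodePropertiesOption option to the regex compiler. Without it, \w and \b only treat ASCII letters, digits and the underscore as word characters, so ś counts as a non-word character and \b matches between t and ś in catść. See the QRegularExpression.UseUnicodePropertiesOption reference: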
The meaning of the \w, \d, etc., character classes, as well as the meaning of their counterparts (\W, \D, etc.), is changed from matching ASCII characters only to matching any character with the corresponding Unicode property. For instance, \d is changed to match any character with the Unicode Nd (decimal digit) property; \w to match any character with either the Unicode L (letter) or N (digit) property, plus underscore, and so on. This option corresponds to the /u modifier in Perl regular expressions.
In Python, you could declare it as
rx = QtCore.QRegularExpression(r'\bcat\b', QtCore.QRegularExpression.UseUnicodePropertiesOption)
However, since QListWidget.findItems does not accept a QRegularExpression as an argument and only takes the pattern as a string, you can use the (*UCP) PCRE verb as an alternative:
r'(*UCP)\bcat\b'
Make sure you put it at the very beginning of the pattern.
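The same ASCII-versus-Unicode distinction can be reproduced without Qt, using Python's own re module. This is only an analogy, not PyQt5 code: re.ASCII stands in for the default PCRE behaviour, and the default str behaviour of re stands in for (*UCP).

```python
import re

cats = ["love my cat", "catirization", "cat in the clouds", "catść"]

# ASCII-only word characters: "ś" is a non-word character,
# so \b matches between "t" and "ś".
ascii_rx = re.compile(r'\bcat\b', re.ASCII)

# Unicode-aware word characters (the default for str patterns):
# "ś" is a letter, so there is no boundary after "cat".
unicode_rx = re.compile(r'\bcat\b')

ascii_hits = [c for c in cats if ascii_rx.search(c)]
unicode_hits = [c for c in cats if unicode_rx.search(c)]

print(ascii_hits)    # ['love my cat', 'cat in the clouds', 'catść']
print(unicode_hits)  # ['love my cat', 'cat in the clouds']
```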
I am trying to build an IF formula but I am getting an error that I have too many arguments. Any idea how to fix this?
=IF(BZ190=$C$163,$C$163, IF(BZ190=$C$163*$C$165,$C$163*$C$166, IF(BZ190=$C$163*$C$166,$C$163*$C$167, IF(BZ190=$C$163*$C$167,Z190=$C$163*$C$168, IF(BZ190=$C$163*$C$168,$C$163*$C$169, IF(BZ190=$C$163*$C$169,$C$163*$C$170, IF(BZ190=$C$163*$C$170,$C$163*$C$171, IF(BZ190=$C$163*$C$171,$C$163*$C$172, IF(BZ190=$C$163*$C$172,$C$163*$C$173, IF(BZ190=$C$163*$C$173,$C$163*$C$174, IF(BZ190=$C$163*$C$174,$C$163*$C$175, IF(BZ190=$C$163*$C$175,$C$163), IF(AND((SUM(BZ190:BZ$190)-CA$168)<0,BZ190=""),$C$163*$C$165,"")))))))))))
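The second-to-last IF clause looks like this:
IF(BZ190=$C$163*$C$175,$C$163),
                             ^
                             |
                             | this bracket is superfluous
That bracket closes the IF too early, so the IF(AND(...)) that follows becomes a fourth argument of the enclosing IF, which is what triggers the "too many arguments" error. There should not be a bracket at the end, it should just be: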
IF(BZ190=$C$163*$C$175,$C$163,
But I have another point here: imagine that, in six months or a year, you need to modify something. How will you find out what all those cell references mean? Therefore I'd advise you to use names, something like:
$C$163 equals "interest_rate"
$C$165 equals "student_income"
...
Like this, your formula will become something like:
IF(BZ190=interest_rate,interest_rate,
IF(BZ190=interest_rate * student_income, ...
This will be much clearer to read and to maintain. And, oh, before I forget: writing the formula across multiple lines (one IF clause per line) also increases readability and maintainability.
I want to use the print command below in many places in my script, but I need to keep replacing "Survived" with some other string.
print(df.Survived.value_counts())
Can I automate this by formatting the variable name the same way as a string? So if I want to replace "Survived" with "different", can I use something like:
var = 'different'
text = 'df.{}.value_counts()'.format(var)
print(text)
Unfortunately this prints "df.different.value_counts()" as a string, while I need to print the value of df.different.value_counts().
I'm pretty sure a lot of IDEs have a feature called refactoring, which lets you change a piece of code or a string everywhere it occurs.
In VS Code, for example, you can select part of the code, right-click, and choose Change All Occurrences. This replaces the exact text on every line where it occurs.
But if you want to do what you proposed, then eval('df.{}.value_counts()'.format(var)) is an option, although it is insecure and dangerous if var can come from untrusted input. Note that ast.literal_eval is not a substitute here: it only evaluates literals (strings, numbers, tuples, lists, dicts and so on), so it raises an error for an attribute access or a method call.
A safer way is to look the attribute up by name with getattr:
text = getattr(df, var).value_counts()
print(text)
Found the way: print(df[var].value_counts())
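The underlying idea is to look a column up by a name held in a variable instead of building source code as a string. The sketch below avoids pandas so that it runs anywhere: the SimpleNamespace object and the value_counts helper are hypothetical stand-ins for a DataFrame and its method. With a real DataFrame, df[var].value_counts() or getattr(df, var).value_counts() plays the same role.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-in for a DataFrame: each attribute holds a column.
df = SimpleNamespace(
    Survived=[0, 1, 1, 0, 1],
    different=['a', 'b', 'a'],
)

def value_counts(frame, column):
    """Count values of the column whose name is given as a string."""
    return Counter(getattr(frame, column))

var = 'different'
counts = value_counts(df, var)
print(counts)  # Counter({'a': 2, 'b': 1})
```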
I'm writing a Python script to scrape another website, and I have run into a problem that I have not been able to solve.
Say I have set it up to replace a particular string with something else.
word_replace_1 = 'dv'
namelist = soup.title.string.replace(word_replace_1,'11dv')
The script works fine when the titles are dv234, dv123, etc.
The output will be 11dv234, 11dv123.
However, if the titles are a mix of dv234 and dvab123, the script also turns dvab123 into 11dvab123, even though I did not ask for dvab to be replaced. What should I do here?
Also, if the title is a combination of letters, numbers and Korean characters, say DAV123ㄱㄴㄷ,
how should I make it output only DAV123 and add a - between the letters and the numbers?
Python - making a function that would add "-" between letters
This gives me the idea to add - between all characters, but is there a way to add - only between a letter and a number?
The only way I can think of at the moment is to build a table of replacements, for example something like this
word_replace_3 = 'a1'
word_replace_4 = 'a2'
.......
and then print them out as
namelist3 = soup.title.string.replace(word_replace_3,'a-1').replace(word_replace_4,'a-2')
This is slow and inefficient. What would be the best way to solve this?
Thanks.
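One possible direction, sketched here with the re module rather than str.replace. Both helper names are made up for illustration, and the rules are assumptions read off the examples above: "dv" only gets the prefix when a digit follows it directly, and a title is reduced to its first run of ASCII letters followed by digits.

```python
import re

def prefix_dv(title):
    # Add "11" only where "dv" is directly followed by a digit,
    # so "dvab123" is left untouched.
    return re.sub(r'dv(?=\d)', '11dv', title)

def clean_code(title):
    # Keep the first run of ASCII letters followed by digits and
    # join the two parts with a hyphen; everything else is dropped.
    match = re.search(r'([A-Za-z]+)(\d+)', title)
    if match is None:
        return None
    return f'{match.group(1)}-{match.group(2)}'

print(prefix_dv('dv234'))         # 11dv234
print(prefix_dv('dvab123'))       # dvab123
print(clean_code('DAV123ㄱㄴㄷ'))  # DAV-123
```

A single pattern replaces the whole table of a1, a2, ... replacements, because the character classes describe every letter and digit combination at once.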