How can I remove all characters inside angle brackets python? - python-3.x

How can I remove all characters inside angle brackets, including the brackets, in a string? How can I also remove all the text between ("\r\n") and ("." + "any 3 characters")? Is this possible? I am currently using the solution by @xkcdjerry,
e.g.
import re
body = """Dear Students roads etc. you place a tree take a snapshot, then when you place a\r\nbuilding, take a snapshot. Place at least 5-6 objects and then have 5-6\r\nsnapshots. Please keep these snapshots with you as everyone will be asked\r\nto share them during the class.\r\n\r\nI am attaching one PowerPoint containing instructions and one video of\r\nexplanation for your reference.\r\n\r\nKind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n<https://drive.google.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>"""
d = re.compile("\r\n.+?\\....")
body = d.sub('', body)
a = re.compile("<.*?>")
body = a.sub('', body)
print(body)
For some reason the output is fine, except that it has:
gle.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>
randomly attached to the end. How can I fix it?

Answer
Your problem can be solved by a regex:
Put this into the shell:
import re
a=re.compile("<.*?>")
a.sub('',"Keep this part of the string< Remove this part>Keep This part as well")
Output:
'Keep this part of the stringKeep This part as well'
Second question:
import re
a=re.compile("\r\n.*?\\..{3}")
a.sub('',"Hello\r\nFilename.png")
Output:
'Hello'
Breakdown
Regex is a robust way of finding, replacing, and mutating small strings inside bigger ones. For further reading, consult https://docs.python.org/3/library/re.html. Meanwhile, here is a breakdown of the regex syntax used in this answer:
. means any character (except a newline, by default).
*? means as many of the preceding item as needed, but as few as possible (non-greedy match).
So .*? means any number of characters, but as few as possible.
Note: The reason there is a \\. in the second regex is that a literal . in the match needs to be escaped with a \, which in turn needs to be escaped as \\ inside a normal (non-raw) Python string literal.
The methods:
re.compile(pattern: str) compiles a regex for further use.
regex.sub(repl: str, string: str) replaces every match of regex in string with repl.
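A likely cause of the leftover text in the question: the filename pattern runs first, and "\r\n.+?\\...." can match from the "\r\n" before the URL up to the first dot inside it plus three characters (".goo"). That consumes the opening <, so "<.*?>" no longer finds a complete pair. Below is a minimal sketch of one way around this, assuming the bracketed text is removed first and that a filename is a whitespace-free token ending in a dot and three word characters. The function name strip_brackets_and_filenames is made up for illustration.

```python
import re

def strip_brackets_and_filenames(text):
    # Step 1: drop <...> first, so dots inside a bracketed URL
    # can no longer be mistaken for a file extension.
    text = re.sub(r"<[^<>]*>", "", text)
    # Step 2: drop "\r\n" + optional spaces/tabs + a token without
    # whitespace that ends in "." and exactly three word characters.
    text = re.sub(r"\r\n[ \t]*\S+\.\w{3}\b", "", text)
    return text

sample = "Teacher Name\r\n zoom_0.mp4\r\n<https://drive.google.com/file/view?usp=drive_web>"
print(repr(strip_brackets_and_filenames(sample)))  # 'Teacher Name\r\n'
```

Requiring \S+ instead of .+? keeps ordinary sentences that merely contain a full stop from being treated as filenames.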
Hope it helps.

Related

How can I transform RegEx categories into plain RegEx?

This question is based on this question. During coding, I got some new things popping up and because the initial question is properly answered, I want to describe my issues in this question.
My goal is to have a RegEx which filters out everything except what meets these requirements:
Alphanumeric allowed
Non-Latin characters, e.g. Chinese or Japanese, allowed
.,-?!"'=$|<>[]{} allowed
Works with NodeJS 8.9.4
While implementing the answer from the main question, I found out that it only works with newer Node versions (because of the supported ES version). Sadly, our project runs on 8.9.4, which can't be changed in any way, so upgrading is not an option.
I've started searching around and found this page: https://github.com/slevithan/xregexp/blob/master/tools/output/categories.js
With the help of another question, I've tried to put together something which matches my requirements. I came up with:
/[^\(?:[A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0560-\u0588\u05D0-\u05EA\u05EF-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u0860-\u086A\u0870-\u0887\u0889-\u088E\u08A0-\u08C9\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u09FC\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C5D\u0C60\u0C61\u0C80\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D04-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D54-\u0D56\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E86-\u0E8A\u0E8C-\u0EA3\u0EA5\u0EA7-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\
u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u1711\u171F-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1878\u1880-\u1884\u1887-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4C\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1C80-\u1C88]+/g
My current example string is:
Test=😕查看
°°^ Marting 10202029 Offline!"§$%&/()!"§$%&/()After this we want to keep the allowed special chars: .,-?!"'=$|<>[]{}
Somehow, the answer from the first question works better than the categories I parsed. So there must be something wrong that I'm unable to find.
In the end, I want to put everything inside a var.replace() call, to replace everything bad with a single whitespace.
For testing, I'm using: https://regexr.com/
You can either use regexpu to transpile the regex into an ES6-compliant regex, or you may go to the Unicode Utilities: UnicodeSet page and get the code point ranges manually.
In your case, paste [^\p{L}\p{N}] into the Input field, check Abbreviate and Escape, then click Show Set. Add .,?!"'=$|<>[\]{}- at the end of the character class. Then double the backslashes (and escape the ' or ", whichever is your string literal delimiter; I escaped ' below), put the result inside the pattern_from_uu variable definition in the JavaScript code below, and define the regex with const reg = new RegExp(pattern, "gu") or const reg = new RegExp(pattern, "u"):
const pattern_from_uu = '[^0-9A-Za-z\\u00AA\\u00B2\\u00B3\\u00B5\\u00B9\\u00BA\\u00BC-\\u00BE\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02C1\\u02C6-\\u02D1\\u02E0-\\u02E4\\u02EC\\u02EE\\u0370-\\u0374\\u0376\\u0377\\u037A-\\u037D\\u037F\\u0386\\u0388-\\u038A\\u038C\\u038E-\\u03A1\\u03A3-\\u03F5\\u03F7-\\u0481\\u048A-\\u052F\\u0531-\\u0556\\u0559\\u0560-\\u0588\\u05D0-\\u05EA\\u05EF-\\u05F2\\u0620-\\u064A\\u0660-\\u0669\\u066E\\u066F\\u0671-\\u06D3\\u06D5\\u06E5\\u06E6\\u06EE-\\u06FC\\u06FF\\u0710\\u0712-\\u072F\\u074D-\\u07A5\\u07B1\\u07C0-\\u07EA\\u07F4\\u07F5\\u07FA\\u0800-\\u0815\\u081A\\u0824\\u0828\\u0840-\\u0858\\u0860-\\u086A\\u0870-\\u0887\\u0889-\\u088E\\u08A0-\\u08C9\\u0904-\\u0939\\u093D\\u0950\\u0958-\\u0961\\u0966-\\u096F\\u0971-\\u0980\\u0985-\\u098C\\u098F\\u0990\\u0993-\\u09A8\\u09AA-\\u09B0\\u09B2\\u09B6-\\u09B9\\u09BD\\u09CE\\u09DC\\u09DD\\u09DF-\\u09E1\\u09E6-\\u09F1\\u09F4-\\u09F9\\u09FC\\u0A05-\\u0A0A\\u0A0F\\u0A10\\u0A13-\\u0A28\\u0A2A-\\u0A30\\u0A32\\u0A33\\u0A35\\u0A36\\u0A38\\u0A39\\u0A59-\\u0A5C\\u0A5E\\u0A66-\\u0A6F\\u0A72-\\u0A74\\u0A85-\\u0A8D\\u0A8F-\\u0A91\\u0A93-\\u0AA8\\u0AAA-\\u0AB0\\u0AB2\\u0AB3\\u0AB5-\\u0AB9\\u0ABD\\u0AD0\\u0AE0\\u0AE1\\u0AE6-\\u0AEF\\u0AF9\\u0B05-\\u0B0C\\u0B0F\\u0B10\\u0B13-\\u0B28\\u0B2A-\\u0B30\\u0B32\\u0B33\\u0B35-\\u0B39\\u0B3D\\u0B5C\\u0B5D\\u0B5F-\\u0B61\\u0B66-\\u0B6F\\u0B71-\\u0B77\\u0B83\\u0B85-\\u0B8A\\u0B8E-\\u0B90\\u0B92-\\u0B95\\u0B99\\u0B9A\\u0B9C\\u0B9E\\u0B9F\\u0BA3\\u0BA4\\u0BA8-\\u0BAA\\u0BAE-\\u0BB9\\u0BD0\\u0BE6-\\u0BF2\\u0C05-\\u0C0C\\u0C0E-\\u0C10\\u0C12-\\u0C28\\u0C2A-\\u0C39\\u0C3D\\u0C58-\\u0C5A\\u0C5D\\u0C60\\u0C61\\u0C66-\\u0C6F\\u0C78-\\u0C7E\\u0C80\\u0C85-\\u0C8C\\u0C8E-\\u0C90\\u0C92-\\u0CA8\\u0CAA-\\u0CB3\\u0CB5-\\u0CB9\\u0CBD\\u0CDD\\u0CDE\\u0CE0\\u0CE1\\u0CE6-\\u0CEF\\u0CF1\\u0CF2\\u0D04-\\u0D0C\\u0D0E-\\u0D10\\u0D12-\\u0D3A\\u0D3D\\u0D4E\\u0D54-\\u0D56\\u0D58-\\u0D61\\u0D66-\\u0D78\\u0D7A-\\u0D7F\\u0D85-\\u0D96\\u0D9A-\\u0DB1\\u0DB3-\\u0DBB\\u0DBD\\u0DC0-\\u0DC6\\u0DE6-\\u0DEF\\u0
E01-\\u0E30\\u0E32\\u0E33\\u0E40-\\u0E46\\u0E50-\\u0E59\\u0E81\\u0E82\\u0E84\\u0E86-\\u0E8A\\u0E8C-\\u0EA3\\u0EA5\\u0EA7-\\u0EB0\\u0EB2\\u0EB3\\u0EBD\\u0EC0-\\u0EC4\\u0EC6\\u0ED0-\\u0ED9\\u0EDC-\\u0EDF\\u0F00\\u0F20-\\u0F33\\u0F40-\\u0F47\\u0F49-\\u0F6C\\u0F88-\\u0F8C\\u1000-\\u102A\\u103F-\\u1049\\u1050-\\u1055\\u105A-\\u105D\\u1061\\u1065\\u1066\\u106E-\\u1070\\u1075-\\u1081\\u108E\\u1090-\\u1099\\u10A0-\\u10C5\\u10C7\\u10CD\\u10D0-\\u10FA\\u10FC-\\u1248\\u124A-\\u124D\\u1250-\\u1256\\u1258\\u125A-\\u125D\\u1260-\\u1288\\u128A-\\u128D\\u1290-\\u12B0\\u12B2-\\u12B5\\u12B8-\\u12BE\\u12C0\\u12C2-\\u12C5\\u12C8-\\u12D6\\u12D8-\\u1310\\u1312-\\u1315\\u1318-\\u135A\\u1369-\\u137C\\u1380-\\u138F\\u13A0-\\u13F5\\u13F8-\\u13FD\\u1401-\\u166C\\u166F-\\u167F\\u1681-\\u169A\\u16A0-\\u16EA\\u16EE-\\u16F8\\u1700-\\u1711\\u171F-\\u1731\\u1740-\\u1751\\u1760-\\u176C\\u176E-\\u1770\\u1780-\\u17B3\\u17D7\\u17DC\\u17E0-\\u17E9\\u17F0-\\u17F9\\u1810-\\u1819\\u1820-\\u1878\\u1880-\\u1884\\u1887-\\u18A8\\u18AA\\u18B0-\\u18F5\\u1900-\\u191E\\u1946-\\u196D\\u1970-\\u1974\\u1980-\\u19AB\\u19B0-\\u19C9\\u19D0-\\u19DA\\u1A00-\\u1A16\\u1A20-\\u1A54\\u1A80-\\u1A89\\u1A90-\\u1A99\\u1AA7\\u1B05-\\u1B33\\u1B45-\\u1B4C\\u1B50-\\u1B59\\u1B83-\\u1BA0\\u1BAE-\\u1BE5\\u1C00-\\u1C23\\u1C40-\\u1C49\\u1C4D-\\u1C7D\\u1C80-\\u1C88\\u1C90-\\u1CBA\\u1CBD-\\u1CBF\\u1CE9-\\u1CEC\\u1CEE-\\u1CF3\\u1CF5\\u1CF6\\u1CFA\\u1D00-\\u1DBF\\u1E00-\\u1F15\\u1F18-\\u1F1D\\u1F20-\\u1F45\\u1F48-\\u1F4D\\u1F50-\\u1F57\\u1F59\\u1F5B\\u1F5D\\u1F5F-\\u1F7D\\u1F80-\\u1FB4\\u1FB6-\\u1FBC\\u1FBE\\u1FC2-\\u1FC4\\u1FC6-\\u1FCC\\u1FD0-\\u1FD3\\u1FD6-\\u1FDB\\u1FE0-\\u1FEC\\u1FF2-\\u1FF4\\u1FF6-\\u1FFC\\u2070\\u2071\\u2074-\\u2079\\u207F-\\u2089\\u2090-\\u209C\\u2102\\u2107\\u210A-\\u2113\\u2115\\u2119-\\u211D\\u2124\\u2126\\u2128\\u212A-\\u212D\\u212F-\\u2139\\u213C-\\u213F\\u2145-\\u2149\\u214E\\u2150-\\u2189\\u2460-\\u249B\\u24EA-\\u24FF\\u2776-\\u2793\\u2C00-\\u2CE4\\u2CEB-\\u2CEE\\u2CF2\\u2CF3\\u2CFD\\u2D00-\\u2D25\\u2D27\\u2D2D\
\u2D30-\\u2D67\\u2D6F\\u2D80-\\u2D96\\u2DA0-\\u2DA6\\u2DA8-\\u2DAE\\u2DB0-\\u2DB6\\u2DB8-\\u2DBE\\u2DC0-\\u2DC6\\u2DC8-\\u2DCE\\u2DD0-\\u2DD6\\u2DD8-\\u2DDE\\u2E2F\\u3005-\\u3007\\u3021-\\u3029\\u3031-\\u3035\\u3038-\\u303C\\u3041-\\u3096\\u309D-\\u309F\\u30A1-\\u30FA\\u30FC-\\u30FF\\u3105-\\u312F\\u3131-\\u318E\\u3192-\\u3195\\u31A0-\\u31BF\\u31F0-\\u31FF\\u3220-\\u3229\\u3248-\\u324F\\u3251-\\u325F\\u3280-\\u3289\\u32B1-\\u32BF\\u3400-\\u4DBF\\u4E00-\\uA48C\\uA4D0-\\uA4FD\\uA500-\\uA60C\\uA610-\\uA62B\\uA640-\\uA66E\\uA67F-\\uA69D\\uA6A0-\\uA6EF\\uA717-\\uA71F\\uA722-\\uA788\\uA78B-\\uA7CA\\uA7D0\\uA7D1\\uA7D3\\uA7D5-\\uA7D9\\uA7F2-\\uA801\\uA803-\\uA805\\uA807-\\uA80A\\uA80C-\\uA822\\uA830-\\uA835\\uA840-\\uA873\\uA882-\\uA8B3\\uA8D0-\\uA8D9\\uA8F2-\\uA8F7\\uA8FB\\uA8FD\\uA8FE\\uA900-\\uA925\\uA930-\\uA946\\uA960-\\uA97C\\uA984-\\uA9B2\\uA9CF-\\uA9D9\\uA9E0-\\uA9E4\\uA9E6-\\uA9FE\\uAA00-\\uAA28\\uAA40-\\uAA42\\uAA44-\\uAA4B\\uAA50-\\uAA59\\uAA60-\\uAA76\\uAA7A\\uAA7E-\\uAAAF\\uAAB1\\uAAB5\\uAAB6\\uAAB9-\\uAABD\\uAAC0\\uAAC2\\uAADB-\\uAADD\\uAAE0-\\uAAEA\\uAAF2-\\uAAF4\\uAB01-\\uAB06\\uAB09-\\uAB0E\\uAB11-\\uAB16\\uAB20-\\uAB26\\uAB28-\\uAB2E\\uAB30-\\uAB5A\\uAB5C-\\uAB69\\uAB70-\\uABE2\\uABF0-\\uABF9\\uAC00-\\uD7A3\\uD7B0-\\uD7C6\\uD7CB-\\uD7FB\\uF900-\\uFA6D\\uFA70-\\uFAD9\\uFB00-\\uFB06\\uFB13-\\uFB17\\uFB1D\\uFB1F-\\uFB28\\uFB2A-\\uFB36\\uFB38-\\uFB3C\\uFB3E\\uFB40\\uFB41\\uFB43\\uFB44\\uFB46-\\uFBB1\\uFBD3-\\uFD3D\\uFD50-\\uFD8F\\uFD92-\\uFDC7\\uFDF0-\\uFDFB\\uFE70-\\uFE74\\uFE76-\\uFEFC\\uFF10-\\uFF19\\uFF21-\\uFF3A\\uFF41-\\uFF5A\\uFF66-\\uFFBE\\uFFC2-\\uFFC7\\uFFCA-\\uFFCF\\uFFD2-\\uFFD7\\uFFDA-\\uFFDC\\U00010000-\\U0001000B\\U0001000D-\\U00010026\\U00010028-\\U0001003A\\U0001003C\\U0001003D\\U0001003F-\\U0001004D\\U00010050-\\U0001005D\\U00010080-\\U000100FA\\U00010107-\\U00010133\\U00010140-\\U00010178\\U0001018A\\U0001018B\\U00010280-\\U0001029C\\U000102A0-\\U000102D0\\U000102E1-\\U000102FB\\U00010300-\\U00010323\\U0001032D-\\U0001034A\\U00010350-\\U000
10375\\U00010380-\\U0001039D\\U000103A0-\\U000103C3\\U000103C8-\\U000103CF\\U000103D1-\\U000103D5\\U00010400-\\U0001049D\\U000104A0-\\U000104A9\\U000104B0-\\U000104D3\\U000104D8-\\U000104FB\\U00010500-\\U00010527\\U00010530-\\U00010563\\U00010570-\\U0001057A\\U0001057C-\\U0001058A\\U0001058C-\\U00010592\\U00010594\\U00010595\\U00010597-\\U000105A1\\U000105A3-\\U000105B1\\U000105B3-\\U000105B9\\U000105BB\\U000105BC\\U00010600-\\U00010736\\U00010740-\\U00010755\\U00010760-\\U00010767\\U00010780-\\U00010785\\U00010787-\\U000107B0\\U000107B2-\\U000107BA\\U00010800-\\U00010805\\U00010808\\U0001080A-\\U00010835\\U00010837\\U00010838\\U0001083C\\U0001083F-\\U00010855\\U00010858-\\U00010876\\U00010879-\\U0001089E\\U000108A7-\\U000108AF\\U000108E0-\\U000108F2\\U000108F4\\U000108F5\\U000108FB-\\U0001091B\\U00010920-\\U00010939\\U00010980-\\U000109B7\\U000109BC-\\U000109CF\\U000109D2-\\U00010A00\\U00010A10-\\U00010A13\\U00010A15-\\U00010A17\\U00010A19-\\U00010A35\\U00010A40-\\U00010A48\\U00010A60-\\U00010A7E\\U00010A80-\\U00010A9F\\U00010AC0-\\U00010AC7\\U00010AC9-\\U00010AE4\\U00010AEB-\\U00010AEF\\U00010B00-\\U00010B35\\U00010B40-\\U00010B55\\U00010B58-\\U00010B72\\U00010B78-\\U00010B91\\U00010BA9-\\U00010BAF\\U00010C00-\\U00010C48\\U00010C80-\\U00010CB2\\U00010CC0-\\U00010CF2\\U00010CFA-\\U00010D23\\U00010D30-\\U00010D39\\U00010E60-\\U00010E7E\\U00010E80-\\U00010EA9\\U00010EB0\\U00010EB1\\U00010F00-\\U00010F27\\U00010F30-\\U00010F45\\U00010F51-\\U00010F54\\U00010F70-\\U00010F81\\U00010FB0-\\U00010FCB\\U00010FE0-\\U00010FF6\\U00011003-\\U00011037\\U00011052-\\U0001106F\\U00011071\\U00011072\\U00011075\\U00011083-\\U000110AF\\U000110D0-\\U000110E8\\U000110F0-\\U000110F9\\U00011103-\\U00011126\\U00011136-\\U0001113F\\U00011144\\U00011147\\U00011150-\\U00011172\\U00011176\\U00011183-\\U000111B2\\U000111C1-\\U000111C4\\U000111D0-\\U000111DA\\U000111DC\\U000111E1-\\U000111F4\\U00011200-\\U00011211\\U00011213-\\U0001122B\\U00011280-\\U00011286\\U00011288\\U0001128A-\\U0001128D\\U0
001128F-\\U0001129D\\U0001129F-\\U000112A8\\U000112B0-\\U000112DE\\U000112F0-\\U000112F9\\U00011305-\\U0001130C\\U0001130F\\U00011310\\U00011313-\\U00011328\\U0001132A-\\U00011330\\U00011332\\U00011333\\U00011335-\\U00011339\\U0001133D\\U00011350\\U0001135D-\\U00011361\\U00011400-\\U00011434\\U00011447-\\U0001144A\\U00011450-\\U00011459\\U0001145F-\\U00011461\\U00011480-\\U000114AF\\U000114C4\\U000114C5\\U000114C7\\U000114D0-\\U000114D9\\U00011580-\\U000115AE\\U000115D8-\\U000115DB\\U00011600-\\U0001162F\\U00011644\\U00011650-\\U00011659\\U00011680-\\U000116AA\\U000116B8\\U000116C0-\\U000116C9\\U00011700-\\U0001171A\\U00011730-\\U0001173B\\U00011740-\\U00011746\\U00011800-\\U0001182B\\U000118A0-\\U000118F2\\U000118FF-\\U00011906\\U00011909\\U0001190C-\\U00011913\\U00011915\\U00011916\\U00011918-\\U0001192F\\U0001193F\\U00011941\\U00011950-\\U00011959\\U000119A0-\\U000119A7\\U000119AA-\\U000119D0\\U000119E1\\U000119E3\\U00011A00\\U00011A0B-\\U00011A32\\U00011A3A\\U00011A50\\U00011A5C-\\U00011A89\\U00011A9D\\U00011AB0-\\U00011AF8\\U00011C00-\\U00011C08\\U00011C0A-\\U00011C2E\\U00011C40\\U00011C50-\\U00011C6C\\U00011C72-\\U00011C8F\\U00011D00-\\U00011D06\\U00011D08\\U00011D09\\U00011D0B-\\U00011D30\\U00011D46\\U00011D50-\\U00011D59\\U00011D60-\\U00011D65\\U00011D67\\U00011D68\\U00011D6A-\\U00011D89\\U00011D98\\U00011DA0-\\U00011DA9\\U00011EE0-\\U00011EF2\\U00011FB0\\U00011FC0-\\U00011FD4\\U00012000-\\U00012399\\U00012400-\\U0001246E\\U00012480-\\U00012543\\U00012F90-\\U00012FF0\\U00013000-\\U0001342E\\U00014400-\\U00014646\\U00016800-\\U00016A38\\U00016A40-\\U00016A5E\\U00016A60-\\U00016A69\\U00016A70-\\U00016ABE\\U00016AC0-\\U00016AC9\\U00016AD0-\\U00016AED\\U00016B00-\\U00016B2F\\U00016B40-\\U00016B43\\U00016B50-\\U00016B59\\U00016B5B-\\U00016B61\\U00016B63-\\U00016B77\\U00016B7D-\\U00016B8F\\U00016E40-\\U00016E96\\U00016F00-\\U00016F4A\\U00016F50\\U00016F93-\\U00016F9F\\U00016FE0\\U00016FE1\\U00016FE3\\U00017000-\\U000187F7\\U00018800-\\U00018CD5\\U00018D00-\\U00018
D08\\U0001AFF0-\\U0001AFF3\\U0001AFF5-\\U0001AFFB\\U0001AFFD\\U0001AFFE\\U0001B000-\\U0001B122\\U0001B150-\\U0001B152\\U0001B164-\\U0001B167\\U0001B170-\\U0001B2FB\\U0001BC00-\\U0001BC6A\\U0001BC70-\\U0001BC7C\\U0001BC80-\\U0001BC88\\U0001BC90-\\U0001BC99\\U0001D2E0-\\U0001D2F3\\U0001D360-\\U0001D378\\U0001D400-\\U0001D454\\U0001D456-\\U0001D49C\\U0001D49E\\U0001D49F\\U0001D4A2\\U0001D4A5\\U0001D4A6\\U0001D4A9-\\U0001D4AC\\U0001D4AE-\\U0001D4B9\\U0001D4BB\\U0001D4BD-\\U0001D4C3\\U0001D4C5-\\U0001D505\\U0001D507-\\U0001D50A\\U0001D50D-\\U0001D514\\U0001D516-\\U0001D51C\\U0001D51E-\\U0001D539\\U0001D53B-\\U0001D53E\\U0001D540-\\U0001D544\\U0001D546\\U0001D54A-\\U0001D550\\U0001D552-\\U0001D6A5\\U0001D6A8-\\U0001D6C0\\U0001D6C2-\\U0001D6DA\\U0001D6DC-\\U0001D6FA\\U0001D6FC-\\U0001D714\\U0001D716-\\U0001D734\\U0001D736-\\U0001D74E\\U0001D750-\\U0001D76E\\U0001D770-\\U0001D788\\U0001D78A-\\U0001D7A8\\U0001D7AA-\\U0001D7C2\\U0001D7C4-\\U0001D7CB\\U0001D7CE-\\U0001D7FF\\U0001DF00-\\U0001DF1E\\U0001E100-\\U0001E12C\\U0001E137-\\U0001E13D\\U0001E140-\\U0001E149\\U0001E14E\\U0001E290-\\U0001E2AD\\U0001E2C0-\\U0001E2EB\\U0001E2F0-\\U0001E2F9\\U0001E7E0-\\U0001E7E6\\U0001E7E8-\\U0001E7EB\\U0001E7ED\\U0001E7EE\\U0001E7F0-\\U0001E7FE\\U0001E800-\\U0001E8C4\\U0001E8C7-\\U0001E8CF\\U0001E900-\\U0001E943\\U0001E94B\\U0001E950-\\U0001E959\\U0001EC71-\\U0001ECAB\\U0001ECAD-\\U0001ECAF\\U0001ECB1-\\U0001ECB4\\U0001ED01-\\U0001ED2D\\U0001ED2F-\\U0001ED3D\\U0001EE00-\\U0001EE03\\U0001EE05-\\U0001EE1F\\U0001EE21\\U0001EE22\\U0001EE24\\U0001EE27\\U0001EE29-\\U0001EE32\\U0001EE34-\\U0001EE37\\U0001EE39\\U0001EE3B\\U0001EE42\\U0001EE47\\U0001EE49\\U0001EE4B\\U0001EE4D-\\U0001EE4F\\U0001EE51\\U0001EE52\\U0001EE54\\U0001EE57\\U0001EE59\\U0001EE5B\\U0001EE5D\\U0001EE5F\\U0001EE61\\U0001EE62\\U0001EE64\\U0001EE67-\\U0001EE6A\\U0001EE6C-\\U0001EE72\\U0001EE74-\\U0001EE77\\U0001EE79-\\U0001EE7C\\U0001EE7E\\U0001EE80-\\U0001EE89\\U0001EE8B-\\U0001EE9B\\U0001EEA1-\\U0001EEA3\\U0001EEA5-\\U0001EEA9\\
U0001EEAB-\\U0001EEBB\\U0001F100-\\U0001F10C\\U0001FBF0-\\U0001FBF9\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B738\\U0002B740-\\U0002B81D\\U0002B820-\\U0002CEA1\\U0002CEB0-\\U0002EBE0\\U0002F800-\\U0002FA1D\\U00030000-\\U0003134A.,?!"\'=$|<>[\\]{}-]';
let pattern = pattern_from_uu.replace(/\\U000([a-f\d]+)/gi, "\\u{$1}");
console.log("Your regex is:\n/" + pattern + "/gu");
const texts = ["Test=😕查看","°°^ Marting 10202029 Offline!\"§$%&/()!\"§$%&/()After this we want to keep the allowed special chars: .,-?!\"'=$|<>[]{}"];
const reg = new RegExp(pattern, "gu");
for (const text of texts) {
  console.log(text.replace(reg, ""));
}
See the generated regex demo.

Extracting string from Regex in Pandas for large dataset

We have a csv file which contains log entries in each row.
We need to extract the thread names from each log entry into a separate column.
What would be the fastest way to implement this?
The approach below (string functions) also seems to take a lot of time for large datasets.
We have csv files with a minimum of 100K entries in each csv file.
This is the piece of code which extracts the thread name:
df['thread'] = df.message.str.extract(pat = '(\[(\w+.)+?\]|$)')[0]
Below is a sample log entry. We are picking out:
[c.a.j.sprint_planning_resources.listener.RunAsyncEvent]
using the regex above.
2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed: com.atlassian.jira.event.issue.IssueEvent#5c8703d0[issue=ABC-61381,comment=<null>,worklog=<null>,changelog=[GenericEntity:ChangeGroup][issue,1443521][author,JIRAUSER39166][created,2020-12-01 05:07:36.377][id,15932782],eventTypeId=2,sendMail=true,params={eventsource=action, baseurl=https://min.com},subtasksUpdated=true,spanningOperation=Optional.empty]
Does anyone know a better/faster method to implement this?
The \[(\w+.)+?\] is a very inefficient pattern that may cause catastrophic backtracking due to the nested quantifiers with an unescaped . that matches any char, and thus also matches what \w does.
You can use
df['thread'] = df['message'].str.extract(r'\[(\w+(?:\.\w+)*)]', expand=False).fillna("")
See this regex demo. Note there is no need to add $ as an alternative, since .fillna("") will replace the NA with an empty string.
The regex matches
\[ - a [ char
(\w+(?:\.\w+)*) - Capturing group 1: one or more word chars followed by zero or more sequences of a . and one or more word chars
] - a ] char.
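The pattern breakdown above can be tried outside pandas with the standard re module. This is a minimal sketch, assuming an abridged copy of the question's log line; THREAD_RE and extract_thread are illustrative names, not part of pandas. Note that the capturing group returns the thread name without the surrounding brackets.

```python
import re

# Same pattern as in the answer: "[", dotted word chars (captured), "]".
THREAD_RE = re.compile(r"\[(\w+(?:\.\w+)*)]")

def extract_thread(message):
    # Return the first bracketed dotted name, or "" when there is none
    # (mirroring the .fillna("") step in the pandas version).
    match = THREAD_RE.search(message)
    return match.group(1) if match else ""

# Abridged version of the sample log entry from the question.
sample = (
    "2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN "
    "/browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] "
    "Event processed: [issue=ABC-61381,comment=<null>]"
)
print(extract_thread(sample))  # c.a.j.sprint_planning_resources.listener.RunAsyncEvent
print(repr(extract_thread("no thread name here")))  # ''
```

Because each \w+ run must be separated by a literal dot, the pattern has only one way to match a given name, which is what avoids the backtracking problem of (\w+.)+?.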
Your regex takes a whopping 8,572 steps to complete, see https://regex101.com/r/5c3vi7/1
You can use this regex to significantly cut down the regex processing to 4 steps:
\[[^\]]+\]
Do notice the absence of the /g modifier
https://regex101.com/r/6522P8/1

Python3 strip() gives unexpected result

It's a weird problem
to_be_stripped="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120"
And two strings below:
s1="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120\\[Content_Types].xml"
s2="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120\\_rels\.rels"
When I use the command below:
s1.strip(to_be_stripped)
s2.strip(to_be_stripped)
I get these outputs:
'[Content_Types].x'
'_rels\\.'
If I use lstrip(), they will be:
'[Content_Types].xml'
'_rels\\.rels'
These are the right outputs.
However, if we replace all occurrences of ProjectKnown with zeus_pipeline:
to_be_stripped="D:\\Users\\UserKnown\\PycharmProjects\\zeus_pipeline\\PT\\collections\\120"
And:
s2="D:\\Users\\UserKnown\\PycharmProjects\\zeus_pipeline\\PT\\collections\\120\\_rels\.rels"
s2.lstrip(to_be_stripped) will be '.rels'
If I use / instead of \\, nothing goes wrong. I am wondering why this problem happens.
strip isn't meant to remove full strings exactly. Rather, you give it a string, and every character in that string is removed from the start and end of the string to be stripped. The argument is treated as a set of characters, so their order doesn't matter.
In your case, the variable to_be_stripped contains the characters m and l, so those are stripped from the end of s1. However, it doesn't contain the character x, so the stripping stops there and no characters beyond that are removed.
Check out this question. The accepted answer is probably more extensive than you need - I like another user's suggestion of using replace instead of strip. This would look like:
s1.replace(to_be_stripped, "")
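To make the difference concrete, here is a small sketch that contrasts strip with removing an exact prefix. The helper remove_prefix is a hypothetical name introduced for illustration; on Python 3.9 and later the built-in str.removeprefix does the same job.

```python
def remove_prefix(text, prefix):
    # Remove prefix only if text really starts with it, as a whole string.
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

to_be_stripped = "D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120"
s1 = to_be_stripped + "\\[Content_Types].xml"

# strip() treats its argument as a set of characters, so the trailing
# "m" and "l" are removed as well; "x" is not in the set and stops it.
print(s1.strip(to_be_stripped))                 # [Content_Types].x

# Exact prefix removal (including the path separator) keeps the rest intact.
print(remove_prefix(s1, to_be_stripped + "\\")) # [Content_Types].xml
```

Unlike replace, this only touches the start of the string, so a repeated occurrence of the same path later in the text would be left alone.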

BluePrism count number of times a character exists in a string

I have a calculation which will remove a blank space and replace with a full stop. This is correct for 90% of my cases. However, sometimes two blanks will appear in my value. For the second space I want to delete it. Is this possible?
I think it may be possible using a code stage, but I am not sure what the code would be.
My current calculation is Replace([Item Data.Name], " ", ".")
Example data: John B Smith. I want the result to be John.BSmith
For anything that you'd like to do with strings, there is a tool called Regular Expressions (regex). I encourage you to play with it, because it's a really powerful tool in the hands of an RPA developer.
To replace the second space in any string with a "." you can use the following action.
Object: Utility - Strings
Action: Regex - Find and Replace
Input:
Regex Pattern: "(?<= .*) "
Text: "John B Smith"
Replacement: "."
The above action is not a standard Blueprism one, so it has to be added to your VBO. The action looks as follows:
The VB.net code for that action is as follows:
' Build the regex once from the Regex_Pattern input
Dim R As New Regex(Regex_Pattern, RegexOptions.Singleline)
' Replace every match in Text with replacement_string
replacement_result = R.Replace(Text, replacement_string)
There might be a need for some additional assemblies, so please see below a printscreen of references and namespaces used in my object:
I resolved this issue by using the Utility - Strings object and the split text action. I split my name by space. This output a collection which I was then able to loop through, adding a full stop after the first instance but trimming the other instances.
Please see screenshot
I think the simplest solution, if you know the text will contain either one or two consecutive spaces, would be
Replace(Replace(Text, "  ", " "), " ", ".")
First replace the double white space with a single one, and then replace each single white space with a dot (.).

Lua Pattern for extracting/replacing value in / /

I have a string like hello /world today/
I need to replace /world today/ with /MY NEW STRING/
Reading the manual I have found
newString = string.match("hello /world today/","%b//")
which I can use with gsub to replace. But is there also an elegant way to return just the text between the slashes? I know I could just trim it, but I wondered if there was a pattern.
Try something like one of the following:
slashed_text = string.match("hello /world today/", "/([^/]*)/")
slashed_text = string.match("hello /world today/", "/(.-)/")
slashed_text = string.match("hello /world today/", "/(.*)/")
This works because string.match returns any captures from the pattern, or the entire matched text if there are no captures. The key then is to make sure that the pattern has the right amount of greediness, remembering that Lua patterns are not a complete regular expression language.
The first two should match the same texts. In the first, I've expressly required that the pattern match as many non-slashes as possible. The second (thanks lhf) matches the shortest span of any characters at all followed by a slash. The third is greedier; it matches the longest span of characters that can still be followed by a slash.
The %b// in the original question doesn't have any advantages over /.-/ since the two delimiters are the same character.
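The difference in greediness only shows up when the text contains more than one slashed segment. The sketch below illustrates this by analogy in Python's re module rather than in Lua, assuming the usual correspondence that Lua's .- behaves like the lazy .*? here; the sample string is made up.

```python
import re

text = "a /b/ c /d/"

# As many non-slash characters as possible between two slashes.
no_slashes = re.search(r"/([^/]*)/", text).group(1)
# Shortest span followed by a slash (lazy, like Lua's ".-").
lazy = re.search(r"/(.*?)/", text).group(1)
# Longest span that can still be followed by a slash (greedy).
greedy = re.search(r"/(.*)/", text).group(1)

print(no_slashes, lazy, greedy, sep=" | ")  # b | b | b/ c /d
```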
Edit: Added a pattern suggested by lhf, and more explanations.
