Hello, I am trying to scrape the author image URL from a given video link using urllib, but because the URL length varies, my match runs on into other properties such as the width and height:
https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj","width":400,"height"
instead of the author image link that I want:
https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj
The code I have so far:
import re
import urllib.request
def get_author(uri):
    html = urllib.request.urlopen(uri)
    author_image = re.findall(r'yt3.ggpht.com/(\S{99})', html.read().decode())
    return f"https://yt3.ggpht.com/{author_image[1]}"
Sorry for my bad English, thanks in advance =)
If you are not sure about the length of the match, do not hardcode the number of characters to be matched. {99} is not going to work with arbitrary strings.
Besides, you are matching a string inside marked-up text, so you need to be sure you only match up to the delimiting character. If that is a " character, then match until that character.
Also, dots are special in regex, and you need to escape them to match literal dots.
Finally, findall matches all occurrences. If you only need the first one, re.search stops at the first match and avoids the extra work.
So, a fix could look like
import re
import urllib.request
def get_author(uri):
    html = urllib.request.urlopen(uri)
    # Match from the host name up to (but not including) the closing double quote
    author_image = re.search(r'yt3\.ggpht\.com/[^"]+', html.read().decode())
    if author_image:
        return f"https://{author_image.group()}"
    return None # you will need to handle this condition in your later code
Here, author_image is the regex match object. If there is a match, you prepend https:// to the matched text (author_image.group()) and return it; otherwise, you return some default value that you can check for later in the code (here, None).
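To see the pattern at work without making a network request, here is a minimal sketch that runs it against a short sample string. The sample is built from the fragment shown in the question; the helper name extract_author_image and the leading "url":" part of the sample are assumptions made for illustration.

```python
import re

# Same pattern as in the answer: literal dots escaped, match stops at the first "
AVATAR_RE = re.compile(r'yt3\.ggpht\.com/[^"]+')

def extract_author_image(html_text):
    """Return the first avatar URL found in html_text, or None (hypothetical helper)."""
    match = AVATAR_RE.search(html_text)
    return f"https://{match.group()}" if match else None

# Sample text modelled on the fragment from the question (the "url" key is assumed)
sample = ('"url":"https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw'
          '=s400-c-k-c0x00ffffff-no-rj","width":400,"height"')

print(extract_author_image(sample))
print(extract_author_image("no avatar here"))  # None
```

Because [^"]+ stops at the first double quote, the result no longer depends on how long the URL happens to be.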
This is the part of the code I have copied to see the output,
def check(string,sub_str):
    if(string.find(sub_str)==-1):
        print('no')
    else:
        print('yes')
# driver code for testing the above function
string='geeks for geeks'
sub_str='geeks'
check(string,sub_str)
I specifically wanted to understand how this expression works:
if(string.find(sub_str)==-1): . Also, this code finds a substring in a given string. Can someone tell me whether this is the optimal way? I know it is tutorial code, but I have an easier way to find substrings, and I wanted to know whether that would make passing test cases easier than the code above. Anyway, thanks for your answers.
The method find() returns the index of the string you are looking for. The string in front of find() is the one in which you are looking for the second string.
SentenceThatIsCompletelySearched.find(ForThisPartHere)
If the string you are looking for is present, find() returns its index (the position in the sentence where the first occurrence starts, counting from 0).
If the string is not inside the sentence, find() returns -1.
So in your case you are checking whether sub_str is inside string: if it is not present (a return value of -1) you print "no", and if it is present you print "yes".
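The return values described above can be checked directly. This is a small sketch using the strings from the question; the function name contains is made up for illustration, and the in operator is shown as the more common way to test membership when the index itself is not needed.

```python
def contains(string, sub_str):
    """Return True when sub_str occurs in string (hypothetical helper)."""
    # find() returns the index of the first occurrence, or -1 when there is none
    return string.find(sub_str) != -1

text = 'geeks for geeks'

print(text.find('geeks'))  # 0  (first occurrence starts at index 0)
print(text.find('for'))    # 6
print(text.find('xyz'))    # -1 (not present)

# The `in` operator gives the same yes/no answer without the index
print('for' in text)       # True
print(contains(text, 'xyz'))  # False
```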
How can I remove all characters inside angle brackets, including the brackets themselves, from a string? How can I also remove all the text between "\r\n" and a "." followed by any 3 characters? Is this possible? I am currently using the solution by @xkcdjerry.
E.g.:
body = """Dear Students roads etc. you place a tree take a snapshot, then when you place a\r\nbuilding, take a snapshot. Place at least 5-6 objects and then have 5-6\r\nsnapshots. Please keep these snapshots with you as everyone will be asked\r\nto share them during the class.\r\n\r\nI am attaching one PowerPoint containing instructions and one video of\r\nexplanation for your reference.\r\n\r\nKind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n<https://drive.google.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>"""
import re
d = re.compile("\r\n.+?\\....")
body = d.sub('', body)
a = re.compile("<.*?>")
body = a.sub('', body)
print(body)
For some reason the output is fine except that it has:
gle.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>
randomly attached to the end. How can I fix it?
Answer
Your problem can be solved by a regex:
Put this into the shell:
import re
a=re.compile("<.*?>")
a.sub('',"Keep this part of the string< Remove this part>Keep This part as well")
Output:
'Keep this part of the stringKeep This part as well'
Second question:
import re
a = re.compile("\r\n.*?\\..{3}")
a.sub('',"Hello\r\nFilename.png")
Output:
'Hello'
Breakdown
Regex is a robust way of finding, replacing, and mutating small strings inside bigger ones. For further reading, consult https://docs.python.org/3/library/re.html. Meanwhile, here is a breakdown of the regex syntax used in this answer:
. means any character (except a newline, by default).
*? means zero or more of the preceding item, but as few as possible (a non-greedy match).
So .*? means any number of characters, but as few as possible.
Note: The reason there is a \\. in the second regex is that a literal . in the pattern needs to be escaped with a \, which in turn needs to be written as \\ inside a regular (non-raw) Python string.
The methods:
re.compile(pattern:str) compiles a regex for further use.
regex.sub(repl:str,string:str) replaces every match of regex in string with repl.
Hope it helps.
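A likely cause of the leftover text reported in the question is the order of the two substitutions: when the "\r\n ... dot plus 3 characters" pattern runs first, it can also consume the opening < of a link on the following line, so the angle-bracket pattern no longer finds a complete <...> pair. The sketch below illustrates this with a shortened sample; the sample text and the drive.example.com URL are made up for illustration.

```python
import re

angle = re.compile(r"<.*?>")               # anything inside angle brackets
attachment = re.compile(r"\r\n.*?\..{3}")  # from \r\n up to a dot plus 3 characters

# Shortened, made-up sample in the same shape as the body in the question
text = ("Kind regards,\r\nTeacher Name\r\n zoom_0.mp4"
        "\r\n<https://drive.example.com/file/d/abc/view>")

# Order used in the question: the attachment pattern eats "\r\n<https://drive.exa",
# so the "<" is gone before the angle-bracket pattern gets to run
wrong = angle.sub("", attachment.sub("", text))

# Removing the bracketed link first leaves nothing for the second pattern to break
right = attachment.sub("", angle.sub("", text))

print(repr(wrong))
print(repr(right))
```

Swapping the order is only one possible fix; tightening the second pattern so that it cannot match across a < would be another.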
This problem might be very simple but I find it a bit confusing & that is why I need help.
Following up on a previous question of mine that got solved, I just noticed a new issue.
Source code:
from PyQt5 import QtCore,QtWidgets
app=QtWidgets.QApplication([])
def scroll():
    #QtCore.QRegularExpression(r'\b'+'cat'+'\b')
    item = listWidget.findItems(r'\bcat\b', QtCore.Qt.MatchRegularExpression)
    for d in item:
        print(d.text())
window = QtWidgets.QDialog()
window.setLayout(QtWidgets.QVBoxLayout())
listWidget = QtWidgets.QListWidget()
window.layout().addWidget(listWidget)
cats = ["love my cat","catirization","cat in the clouds","catść"]
for i,cat in enumerate(cats):
    QtWidgets.QListWidgetItem(f"{i} {cat}", listWidget)
btn = QtWidgets.QPushButton('Scroll')
btn.clicked.connect(scroll)
window.layout().addWidget(btn)
window.show()
app.exec_()
As you can see, I am just trying to print the text of the items that match the regex r"\bcat\b" when I press the "Scroll" button, and it mostly works:
Output:
0 love my cat
2 cat in the clouds
3 catść
However, item #3 should not be printed, because it obviously does not match the regular expression r"\bcat\b". It is printed anyway, and I think it has something to do with the special foreign characters ść, which somehow make it a match (which they shouldn't, right?).
I'm expecting an output like:
0 love my cat
2 cat in the clouds
What I have tried
I found this question, which mentions \p{L}. According to the answer:
If all you want to match is letters (including "international" letters) you can use \p{L}.
To be honest, I'm not sure how to apply that with PyQt5. I still made some attempts and tried changing the regex to r'\b'+r'\p{cat}'+r'\b'. However, I got this error:
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
Obviously the error says it's not a valid regex. Can someone educate me on how to solve this issue? Thank you!
In general, when you need to make your shorthand character classes and word boundaries Unicode-aware, you need to pass the QRegularExpression.UseUnicodePropertiesOption option to the regex compiler. See the QRegularExpression.UseUnicodePropertiesOption reference:
The meaning of the \w, \d, etc., character classes, as well as the meaning of their counterparts (\W, \D, etc.), is changed from matching ASCII characters only to matching any character with the corresponding Unicode property. For instance, \d is changed to match any character with the Unicode Nd (decimal digit) property; \w to match any character with either the Unicode L (letter) or N (digit) property, plus underscore, and so on. This option corresponds to the /u modifier in Perl regular expressions.
In Python, you could declare it as
rx = QtCore.QRegularExpression(r'\bcat\b', QtCore.QRegularExpression.UseUnicodePropertiesOption)
However, since QListWidget.findItems does not accept a QRegularExpression as an argument and only takes the regex as a string, you can use the (*UCP) PCRE verb as an alternative:
r'(*UCP)\bcat\b'
Make sure you place it at the very beginning of the pattern.
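The difference between ASCII-only and Unicode-aware word boundaries can be reproduced without Qt. The sketch below is only an analogy using Python's own re module, not Qt's regex engine: in Python 3, str patterns are Unicode-aware by default, and the re.ASCII flag switches \w and \b to ASCII-only behaviour, which is similar to what happens in Qt without (*UCP). The item strings are taken from the question.

```python
import re

items = ["0 love my cat", "1 catirization", "2 cat in the clouds", "3 catść"]

# ASCII-only: "ś" is not a word character, so there is a boundary right after "cat"
ascii_rx = re.compile(r"\bcat\b", re.ASCII)

# Unicode-aware (Python's default for str): "ś" is a letter, so no boundary after "cat"
unicode_rx = re.compile(r"\bcat\b")

ascii_hits = [s for s in items if ascii_rx.search(s)]
unicode_hits = [s for s in items if unicode_rx.search(s)]

print(ascii_hits)    # includes '3 catść'
print(unicode_hits)  # only items 0 and 2
```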
I have an example string, text_var = 'ndTail7-40512-1', and I want to split it at the first place where a number is followed by a -, BUT I want to keep the number. Currently, I have print(re.split('\d*(?=-)',text_var,1)) and my output is ['ndTail', '-40512-1']. However, I want to keep the number that triggers the split, so the result should look like ['ndTail', '7-40512-1']. Any help?
We can try using re.findall here:
import re
text_var = 'ndTail7-40512-1'
matches = re.findall(r'(.*?)(\d-.*$)', text_var)
print(matches[0])
This prints:
('ndTail', '7-40512-1')
Sometimes it can be easier to use re.findall rather than re.split.
The regex pattern used here says to:
(.*?) match AND capture all content up to, but not including,
(\d-.*$) the first digit which is followed by a hyphen;
match and capture this content all the way to the end of the input
Note that we are using re.findall which typically has the potential to return multiple matches. However, in this case, our pattern matches to the end of the input, so we are left with just a single tuple containing the two desired capture groups.
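If you would rather stay with re.split, one possible alternative is to split on a zero-width lookahead, so that nothing is consumed and the digit stays in the second part. This is a sketch that assumes Python 3.7 or later, where re.split supports patterns that match an empty string.

```python
import re

text_var = 'ndTail7-40512-1'

# (?=\d-) matches the empty position right before a digit followed by "-";
# maxsplit=1 stops after the first such position
parts = re.split(r'(?=\d-)', text_var, maxsplit=1)
print(parts)  # ['ndTail', '7-40512-1']
```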
Hey guys, I tried looking at previous questions, but they don't answer this the way my teacher wants it answered. Basically, I need to get a string from user input and check whether it has:
at least one of [!,@,#,$,%,^,&,*,(,)] (a non-letter, non-numeric character)
  o Create a list for these special characters
I have no idea how to write a function (def) to do this. Please help!
You should probably look into Regular expressions. Regular expressions allow you to do many string operations in a concise way. Specifically, you'll want to use re.findall() in order to find all special characters in your string and return them. You can check if the returned list has length 0 to check if there were any special characters at all.
With regards to building the regular expression to find special characters itself... I'm sure you can figure that part out ;)
Please try the below
import re
inputstring = input("Enter String: ")
print(inputstring)
# "Valid" here means the string contains only letters, digits and underscores
print("Valid" if re.match("^[a-zA-Z0-9_]*$", inputstring) else "Invalid")
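Since the assignment asks for a list of the special characters, here is a minimal sketch of both approaches mentioned in the answers: a plain membership test against a list, and re.findall with a character class built from that list. The exact contents of SPECIAL_CHARS and the function names are assumptions; adjust them to whatever the assignment specifies.

```python
import re

# Assumed list of special characters; change it to match the assignment
SPECIAL_CHARS = ["!", "@", "#", "$", "%", "^", "&", "*", "(", ")"]

def has_special_char(text):
    """Return True if text contains at least one character from SPECIAL_CHARS."""
    return any(ch in SPECIAL_CHARS for ch in text)

def find_special_chars(text):
    """Return every special character found in text, in order of appearance."""
    # re.escape keeps characters such as ^ or ( from being treated as regex syntax
    pattern = "[" + re.escape("".join(SPECIAL_CHARS)) + "]"
    return re.findall(pattern, text)

print(has_special_char("password"))      # False
print(has_special_char("pa$$word!"))     # True
print(find_special_chars("pa$$word!"))   # ['$', '$', '!']
```

An empty list from find_special_chars means no special characters were present, which is the length-0 check described in the first answer.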