Converting an HTML+hex email address to a readable string in Python 3

I've been trying to find an online converter, or a Python 3 function, that converts email addresses in the mixed HTML+hex format, such as: %69%6efo ---> info
%69 : i
%6e : n
&#64 : @
(source: http://www.asciitable.com/)
...and so on.
None of the following sites convert hex and HTML codes when both are combined in the same "word":
https://www.motobit.com/util/charset-codepage-conversion.asp
https://www.binaryhexconverter.com/ascii-text-to-binary-converter
https://www.dcode.fr/ascii-code
http://www.unit-conversion.info/texttools/ascii/
https://mothereff.in/binary-ascii
I'd appreciate any recommendations.
Txs.

Try html.unescape() or HTMLParser#unescape, depending on which version of Python you are using: https://stackoverflow.com/a/2087433/2675670
Since this is a mix of hex values and regular characters, I think we have to come up with a custom solution:
word = "%69%6efo"
while word.find("%") >= 0:
    index = word.find("%")
    ascii_value = word[index+1:index+3]
    hex_value = int(ascii_value, 16)
    letter = chr(hex_value)
    word = word.replace(word[index:index+3], letter)
print(word)
Maybe there's a more streamlined "Pythonic" way of doing this, but it works for the test input.
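A possibly more streamlined route, sketched under the assumption that the input mixes URL percent-escapes (such as %69) with HTML character references (such as &#64;): the standard library's urllib.parse.unquote handles the former and html.unescape the latter. The helper name decode_obfuscated and the sample address are made up for illustration.

```python
import html
from urllib.parse import unquote


def decode_obfuscated(text: str) -> str:
    """Decode %XX escapes first, then HTML character references."""
    return html.unescape(unquote(text))


print(decode_obfuscated("%69%6efo"))
print(decode_obfuscated("%69%6efo&#64;example.com"))
```

Decoding the percent-escapes first means an HTML reference that was itself percent-encoded still gets resolved in the second step.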

Related

Get number from string in Python

I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting [] list only.
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do is called a capture using regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex string r'(\d+)', which means "capture a run of digits", search through the url. Then captured holds the first captured group, which is 1987.
If you don't want to use this, you can still use your .split() method, but this time pass / as the separator, for example url.split('/').
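A minimal sketch of the split-based route, assuming the ID is always the last path segment of the URL (the variable names last_segment and record_id are illustrative only):

```python
url = "www.mylocalurl.com/edit/1987"

# Split once from the right so only the final path segment is separated.
last_segment = url.rsplit("/", 1)[-1]

# Guard against URLs whose last segment is not purely numeric.
record_id = int(last_segment) if last_segment.isdigit() else None
print(record_id)
```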

How to extract text between specific letters from a string in Python(3.9)?

How can I extract a value from a string in Python when the value sits inside a longer text, between two known markers?
e.g.
"Kahoot : ID:1234567 Name:RandomUSERNAME"
I want it to receive the 1234567 and the RandomUSERNAME in 2 different variables.
One approach I thought of is to capture everything after "ID:" up to the next space, and everything after "Name:" up to the end of the text.
How do I code this?
If I haven't explained this correctly, tell me; I don't know how to ask/format this question. Sorry for any inconvenience!
If the text always follows the same format you could just split the string. Alternatively, you could use regular expressions using the re library.
Using split:
string = "Kahoot : ID:1234567 Name:RandomUSERNAME"
string = string.split(" ")
id = string[2][3:]
name = string[3][5:]
print(id)
print(name)
Using re:
import re
string = "Kahoot : ID:1234567 Name:RandomUSERNAME"
id = re.search(r'(?<=ID:).*?(?=\s)', string).group(0)
name = re.search(r'(?<=Name:).*', string).group(0)
print(id)
print(name)
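As a variation on the re approach, both values can be pulled out in one pass with named groups. This is only a sketch and assumes the ID contains no whitespace and that Name: runs to the end of the string; the group names id and name are arbitrary choices.

```python
import re

text = "Kahoot : ID:1234567 Name:RandomUSERNAME"

# One pattern, two named groups: non-space characters after "ID:",
# then everything after "Name:".
pattern = re.compile(r"ID:(?P<id>\S+)\s+Name:(?P<name>.+)")

match = pattern.search(text)
if match:
    user_id = match.group("id")
    name = match.group("name")
    print(user_id)
    print(name)
```

Checking match before calling group() avoids an AttributeError when the text does not follow the expected format.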

How to use f'string bytes'string together? [duplicate]

I'm looking for a formatted byte string literal. Specifically, something equivalent to
name = "Hello"
bytes(f"Some format string {name}")
Possibly something like fb"Some format string {name}".
Does such a thing exist?
No. The idea is explicitly dismissed in the PEP:
For the same reason that we don't support bytes.format(), you may
not combine 'f' with 'b' string literals. The primary problem
is that an object's __format__() method may return Unicode data
that is not compatible with a bytes string.
Binary f-strings would first require a solution for
bytes.format(). This idea has been proposed in the past, most
recently in PEP 461. The discussions of such a feature usually
suggest either
adding a method such as __bformat__() so an object can control how it is converted to bytes, or
having bytes.format() not be as general purpose or extensible as str.format().
Both of these remain as options in the future, if such functionality
is desired.
In 3.6+ you can do:
>>> a = 123
>>> f'{a}'.encode()
b'123'
You were actually super close in your suggestion; if you add an encoding kwarg to your bytes() call, then you get the desired behavior:
>>> name = "Hello"
>>> bytes(f"Some format string {name}", encoding="utf-8")
b'Some format string Hello'
Caveat: This works in 3.8 for me, but the note at the bottom of the Bytes Objects section in the docs seems to suggest that this should work with any method of string formatting in all of 3.x (using str.format() for versions <3.6, since f-strings were only added in 3.6, but the OP specifically asks about 3.6+).
Since Python 3.5 (PEP 461), percent formatting for bytes works for some use cases:
print(b"Some stuff %a. Some other stuff" % my_byte_or_unicode_string)
But as AXO commented:
This is not the same. %a (or %r) will give the representation of the string, not the string itself. For example b'%a' % b'bytes' will give b"b'bytes'", not b'bytes'.
Which may or may not matter depending on if you need to just present the formatted byte_or_unicode_string in a UI or if you potentially need to do further manipulation.
As noted here, you can format this way:
>>> name = b"Hello"
>>> b"Some format string %b World" % name
b'Some format string Hello World'
You can see more details in PEP 461
Note that in your example you could simply do something like:
>>> name = b"Hello"
>>> b"Some format string " + name
b'Some format string Hello'
This was one of the bigger changes made from Python 2 to Python 3: they handle Unicode text and byte strings differently.
This is how you'd convert to bytes.
string = "some string format"
encoded = string.encode()
print(encoded)
This is how you'd decode back to a string.
encoded.decode()
I gained a better appreciation of how Unicode handling changed between Python 2 and 3 from this Coursera lecture by Charles Severance. You can watch the entire 17 minute video or fast forward to somewhere around 10:30 if you want to get to the differences between Python 2 and 3 and how they handle characters, specifically Unicode.
I understand your actual question is how you could format a string that has both strings and bytes.
inBytes = b"testing"
inString = 'Hello'
type(inString) #This will yield <class 'str'>
type(inBytes) #this will yield <class 'bytes'>
Here you can see that I have a string variable and a bytes variable.
This is how you would combine bytes and a string into one string: decode the bytes first.
formattedString = inString + ' ' + inBytes.decode()
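A short sketch of both directions, assuming UTF-8 is the right encoding for the data (the variable names are illustrative): decode the bytes to build a str, or encode the str to build bytes.

```python
in_bytes = b"testing"
in_string = "Hello"

# Combine into a str: decode the bytes side first.
as_text = f"{in_string} {in_bytes.decode('utf-8')}"

# Combine into bytes: encode the str side first.
as_bytes = in_string.encode("utf-8") + b" " + in_bytes

print(as_text)
print(as_bytes)
```

Mixing the two types directly with + raises a TypeError, so one side always has to be converted explicitly.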

Why is pyperclip not copying result of phone numbers to clipboard

I'm a beginner learning python with Automate The Boring Stuff by Al Sweigart.
I'm currently on the part where he creates a program that uses regular expressions to extract emails and phone numbers from documents so they can be pasted into another document.
Below is the script:
#! python3
import re
import pyperclip
# Create a regex for phone numbers
phoneRegex = re.compile(r'''
# 08108989212
(\d{11}) # Full phone number
''', re.VERBOSE)
# Create a regex for email addresses
emailRegex = re.compile(r'''
# some.+_thing@(\d{2,5}))?.com
[a-zA-Z0-9_.+] + # name part
@ # @ symbol
[a-zA-Z0-9_.+] + # domain name part
''', re.VERBOSE)
#Get the text off the clipboard
text = pyperclip.paste()
# TODO: Extract the email/phone from this text
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
print(extractedPhone)
print(extractedEmail)
# Copy the extracted email/phone to the clipboard
results = '\n'.join(allPhoneNumbers) + '\n' + '\n'.join(extractedEmail)
pyperclip.copy(results)
The script is expected to extract both phone numbers and email addresses and print them to the terminal, which it does. It is also expected to copy the extracted phone numbers and email addresses to the clipboard automatically, so they can be pasted into another text editor or Word document.
The problem is that it copies the email addresses correctly, but each phone number turns into 0 when pasted.
What am I not getting right?
Please pardon the errors in my English.
for library: phonenumbers (pypi, source)
Python version of Google's common library for parsing, formatting,
storing and validating international phone numbers.
I think you will need to use this to format those phone numbers.
To be more specific, you'll need to install the package using:
pip install phonenumbers
The main object that the library deals with is a PhoneNumber object. You can create this from a string representing a phone number using the parse function, but you also need to specify the country that the phone number is being dialled from (unless the number is in E.164 format, which is globally unique).
>>> import phonenumbers
>>> x = phonenumbers.parse("+442083661177", None)
>>> print(x)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> type(x)
<class 'phonenumbers.phonenumber.PhoneNumber'>
>>> y = phonenumbers.parse("020 8366 1177", "GB")
>>> print(y)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> x == y
True
>>> z = phonenumbers.parse("00 1 650 253 2222", "GB") # as dialled from GB, not a GB number
>>> print(z)
Country Code: 1 National Number: 6502532222 Leading Zero(s): False
More information can be found here: https://pypi.org/project/phonenumbers/
The problem is that you don't need this part of your code:
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
Because the phone regex has a single group, findall() returns a list of strings, so allPhoneNumber[0] is just the first character of each number (here always 0), and the list ends up holding only those characters.
Then change the result as follows:
results = '\n'.join(extractedPhone) + '\n' + '\n'.join(extractedEmail)
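The effect can be reproduced without the clipboard. The sketch below uses simplified one-line versions of the two patterns and a made-up sample text; pyperclip is left out so the snippet runs anywhere.

```python
import re

# Simplified stand-ins for the patterns in the question.
phone_regex = re.compile(r"(\d{11})")
email_regex = re.compile(r"[a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+")

text = "Call 08108989212 or 08012345678, or write to some.one@example.com"

# With a single group, findall() returns a list of plain strings.
phones = phone_regex.findall(text)
emails = email_regex.findall(text)

# Indexing each string with [0] keeps only its first character.
first_chars = [phone[0] for phone in phones]

# Joining the full strings gives the text intended for the clipboard.
results = "\n".join(phones + emails)
print(first_chars)
print(results)
```

In the original script, results would then be passed to pyperclip.copy() as before.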

Unable to remove string from text I am extracting from html

I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However the text I get back often contains several &#13 strings (there is a ; at the end of this string but this editor won't allow the full string to be entered (strange!)). I have tried using the python replace function, I have also tried using regular expression's replace function, I have also tried using the unicode encode and decode functions. None of these approaches have worked. For the replace and Regular Expression approaches I just get back my original text with the &#13 strings still present and with the unicode encode decode approach I get back the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using that takes the initial URL and using readability extracts the main article. I have left in all my commented out code that corresponds to the different approaches I have tried to remove the &#13; string. It appears as though &#13; is interpreted to be u'\xa9'.
import urllib
from readability.readability import Document
def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'"," ")
    #print re.sub("&#13;",'',readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii','ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are " ,readable_article.count("&#13;"),"&#13;'s"
    #print readable_article.encode( sys.stdout.encoding , '' )
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    s2 = s1.replace("\n","")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have a URL that I have been testing the code with included inside the def. If anybody has any ideas on how to remove this &#13 I would appreciate the help. Thanks, George
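One possible direction, written as a Python 3 sketch rather than a fix for the Python 2 code above: str.replace returns a new string instead of modifying the original, so its result has to be kept. Separately, u'\xa9' is the copyright sign, so the UnicodeEncodeError probably comes from a different character than &#13;. The function name strip_carriage_returns and the sample markup are invented for illustration.

```python
import html


def strip_carriage_returns(markup: str) -> str:
    """Remove &#13; references and literal carriage returns from markup."""
    # str.replace returns a new string; the original is left unchanged.
    without_entity = markup.replace("&#13;", "")
    # Decode the remaining character references, then drop any literal \r.
    return html.unescape(without_entity).replace("\r", "")


sample = "<p>First line&#13;\nSecond line &copy; 2013</p>"
cleaned = strip_carriage_returns(sample)
print(cleaned)
```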
