Why is pyperclip not copying result of phone numbers to clipboard - python-3.x

I'm a beginner learning Python with Automate the Boring Stuff by Al Sweigart.
I'm currently on the part where he writes a program that uses regular expressions to extract email addresses and phone numbers from a document so they can be pasted into another document.
Below is the script:
#! python3
import re
import pyperclip
# Create a regex for phone numbers
phoneRegex = re.compile(r'''
# 08108989212
(\d{11}) # Full phone number
''', re.VERBOSE)
# Create a regex for email addresses
emailRegex = re.compile(r'''
# some.+_thing@(\d{2,5}))?.com
[a-zA-Z0-9_.+] + # name part
@ # @ symbol
[a-zA-Z0-9_.+] + # domain name part
''', re.VERBOSE)
#Get the text off the clipboard
text = pyperclip.paste()
# TODO: Extract the email/phone from this text
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
print(extractedPhone)
print(extractedEmail)
# Copy the extracted email/phone to the clipboard
results = '\n'.join(allPhoneNumbers) + '\n' + '\n'.join(extractedEmail)
pyperclip.copy(results)
The script is expected to extract the phone numbers and email addresses and print them to the terminal, which it does. It is also expected to copy them to the clipboard automatically, so they can be pasted into a text editor or a Word document.
The problem is that only the email addresses are copied correctly; each phone number turns into 0 when pasted.
What am I not getting right?
Please pardon the errors in my English.

There is a library for this: phonenumbers (PyPI, source)
Python version of Google's common library for parsing, formatting,
storing and validating international phone numbers.
I think you will need to use this to format those phone numbers.
To be more specific, you'll need to install the package using:
pip install phonenumbers
The main object that the library deals with is a PhoneNumber object. You can create this from a string representing a phone number using the parse function, but you also need to specify the country that the phone number is being dialled from (unless the number is in E.164 format, which is globally unique).
>>> import phonenumbers
>>> x = phonenumbers.parse("+442083661177", None)
>>> print(x)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> type(x)
<class 'phonenumbers.phonenumber.PhoneNumber'>
>>> y = phonenumbers.parse("020 8366 1177", "GB")
>>> print(y)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> x == y
True
>>> z = phonenumbers.parse("00 1 650 253 2222", "GB") # as dialled from GB, not a GB number
>>> print(z)
Country Code: 1 National Number: 6502532222 Leading Zero(s): False
More information can be found here: https://pypi.org/project/phonenumbers/

The problem is this part of your code, which you don't need:
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
Because the phone regex has a single group, findall already returns a list of strings, so allPhoneNumber[0] is just the first character of each number (always 0 here). The loop therefore builds a list of zeros.
Remove it and build the result from extractedPhone directly:
results = '\n'.join(extractedPhone) + '\n' + '\n'.join(extractedEmail)
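A minimal sketch of why the loop produced zeros, using the same single-group pattern as the question. The sample text and the second phone number are made up for illustration, and the clipboard calls are left out so the snippet runs anywhere:

```python
import re

# Same shape as the question's phone regex: exactly one capturing group.
phone_regex = re.compile(r'(\d{11})')

# Hypothetical input text; the second number is invented for the example.
text = "Call 08108989212 or 08031234567."

# With one group, findall returns a list of strings, not a list of tuples.
matches = phone_regex.findall(text)

# This is what the original loop collected: the first character of each string.
first_chars = [m[0] for m in matches]

# Joining the matches directly keeps the full numbers.
results = '\n'.join(matches)

print(matches)      # ['08108989212', '08031234567']
print(first_chars)  # ['0', '0']
```

Indexing with [0] only makes sense when the pattern has two or more groups, because then each match is a tuple and element 0 is the first group.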

Related

Extracting Unstructured Addresses and email ids as variables from scraped text - Python

I am a novice in Python, so please pardon me if this seems like a simple problem. The code below successfully scrapes a webpage. Is there a way to extract addresses, email IDs and contact numbers from this text and put them in a DataFrame? I have found two ways to do so:
Regex - but it may not work, as I have many websites to scrape and the addresses may not always follow a regular pattern.
Pyap - it caters only to US and Canadian addresses.
Is there another way, apart from these two, to fetch the required details?
import requests
from bs4 import BeautifulSoup
link = input("ENTER WEBPAGE")  # for example, I am currently using https://glg.it/contact-us/
response = requests.get(link)
details = response.text
scraped_details = BeautifulSoup(details, "html.parser")
pretty1 = scraped_details.prettify()
print(pretty1)
Thanks for any help !!
Regex can still be used if you adapt the expression so that it matches most of the address formats you expect:
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(pattern, string) returns a list of all the matches found.
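As a rough illustration of the pattern explained above, here it is applied to a small piece of text. The sample text is invented for the example, and the pattern only covers US-style "number street, city, ST 12345" addresses:

```python
import re

# The pattern from the answer, compiled once for reuse.
ADDRESS_RE = re.compile(r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}")

# Hypothetical scraped text; only the first line contains an address.
sample = (
    "Head office: 44 West 22nd Street, New York, NY 12345\n"
    "Email us at info@example.com\n"
)

addresses = ADDRESS_RE.findall(sample)
print(addresses)  # ['44 West 22nd Street, New York, NY 12345']

# Limitation: the number part allows at most 3 digits, so a 4-digit
# house number is only captured from its last three digits.
clipped = ADDRESS_RE.findall("1600 Amphitheatre Parkway, Mountain View, CA 94043")
```

Because "." does not match a newline by default, each match stays within one line of text, which helps when the scraped page puts one address per line.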

converting html+hex email address to readable string Python 3

I've been trying to find an online converter, or a Python 3 function, to convert email addresses in a mixed HTML + hex format, such as: %69%6efo ---> info
%69 : i
%6e : n
&#64 : @
(source: http://www.asciitable.com/)
...and so on..
None of the following sites converts hex and HTML codes when they are combined in the same "word":
https://www.motobit.com/util/charset-codepage-conversion.asp
https://www.binaryhexconverter.com/ascii-text-to-binary-converter
https://www.dcode.fr/ascii-code
http://www.unit-conversion.info/texttools/ascii/
https://mothereff.in/binary-ascii
I'd appreciate any recommendations.
Txs.
Try html.unescape() or HTMLParser#unescape, depending on which version of Python you are using: https://stackoverflow.com/a/2087433/2675670
Since this is a mix of hex values and regular characters, I think we have to come up with a custom solution:
word = "%69%6efo"
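while word.find("%") >= 0:
    index = word.find("%")
    # the two hex digits that follow the percent sign
    ascii_value = word[index+1:index+3]
    hex_value = int(ascii_value, 16)
    letter = chr(hex_value)
    # replace the whole %XX escape with the decoded character
    word = word.replace(word[index:index+3], letter)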
print(word)
Maybe there's a more streamlined "Pythonic" way of doing this, but it works for the test input.
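One possibly more streamlined route, sketched with the standard library only: urllib.parse.unquote decodes the %XX escapes and html.unescape decodes HTML character references such as &#64;. The helper name decode_obfuscated and the sample address are made up for this example:

```python
import html
from urllib.parse import unquote


def decode_obfuscated(text):
    """Decode %XX escapes first, then HTML character references."""
    return html.unescape(unquote(text))


# Hypothetical input mixing both encodings with plain characters.
decoded = decode_obfuscated("%69%6efo&#64;example.com")
print(decoded)  # info@example.com
```

Plain characters pass through both functions unchanged, so the mixed input does not need to be split up first.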

Trying to make a phone book using python

I am trying to make a phone book using these instructions
Write a program that creates 2 lists: one of names and one of phone numbers. Give these variables appropriate names (for example names and numbers). Using a for loop, have the user enter 3 names and 3 numbers of people for the phone book. Next: display the entries from the phone book, name and then number. Use a for loop. Next, ask the user to enter a name. Store their input in a variable. Use a search to see if the name is entered in the name list. If the name is in the name list, print the number. If not have the program respond, “Name not found.”
Your output should look like:
Name Number
sally 11
bob 22
carl 33  
Number you are looking for is: 11
All I want to know is how to build a simple list from data entered by the user, so that I can answer this question.
Pseudocode is
# LOOP THREE TIMES
#     names = GET INPUT name
#     numbers = GET INPUT number
# END LOOP
# LOOP THREE TIMES
#     PRINT (name) in names, (number) in numbers
# END LOOP
# searchName = GET INPUT "Enter a name for Search"
# IF searchName IN names THEN
#     PRINT matching number
#     LOOP names
#         IF searchName == name THEN
#             foundIndex = name(index)
#             searchPhoneNumber = phoneNumber[foundIndex]
#         END IF
#     END LOOP
#     PRINT searchPhoneNumber
# ELSE
#     PRINT "Name Not Found"
# END IF
Use this:
names = []
phone_numbers = []
num = 3
for i in range(num):
    name = input("Name: ")
    phone_number = input("Phone Number: ")  # to store it as an int instead: int(input("Phone Number: "))
    names.append(name)
    phone_numbers.append(phone_number)
print("\nName\t\t\tPhone Number\n")
for i in range(num):
    print("{}\t\t\t{}".format(names[i], phone_numbers[i]))
search_term = input("\nEnter search term: ")
print("Search result:")
if search_term in names:
    index = names.index(search_term)
    phone_number = phone_numbers[index]
    print("Name: {}, Phone Number: {}".format(search_term, phone_number))
else:
    print("Name Not Found")
To add a name or number to the appropriate list, use the list's append method, i.e.
numberlist.append(number_that_was_input)
or
namelist.append(name_that_was_input)
and as @cricket007 so eloquently states, we do like to see that you at least try to do things for yourself.
To receive input from the user, use the input() function.
Example:
name = input('type in name')
print(name)
#Outputs the name you typed.
To add that value to a list, use append.
Example:
my_list = []  # Initialize the list first.
my_list.append(name)  # This adds the contents of the variable name to your list.
# my_list now looks like this: ["user817205"]
Since you have to do this 3 times, it's smart to use a for loop.
You can run a block of code 3 times using the following:
for _ in range(3):
    pass  # replace this with the code you want to repeat 3 times
PS: Remember that you only need to initialize your list once, so keep my_list = [] outside the for loop.
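Putting input, append and the for loop together might look like the sketch below. The function names build_phone_book and look_up and the read parameter are inventions for this example; read defaults to the built-in input and exists only so the function can be exercised without typing:

```python
def build_phone_book(read=input, count=3):
    """Ask for `count` name/number pairs and return two parallel lists."""
    names = []      # initialized once, outside the loop
    numbers = []
    for _ in range(count):
        names.append(read("Name: "))
        numbers.append(read("Number: "))
    return names, numbers


def look_up(names, numbers, wanted):
    """Return the number stored for `wanted`, or None if the name is absent."""
    if wanted in names:
        # index() gives the position of the name; the number sits at the same position
        return numbers[names.index(wanted)]
    return None
```

Calling names, numbers = build_phone_book() prompts the user six times in total; look_up(names, numbers, "sally") then returns the matching number, or None when the program should print "Name not found."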

Replacing a substring AFTER a character in a python pandas dataframe

I'm new to pandas and am having a lot of trouble with this and haven't found a solution, despite my searches. Hoping one of you can help me.
I have a pandas DataFrame with a column of emails that I'm trying to clean up. Some examples are:
>>> email['EMAIL']
0              testing@...com
1                         NaN
2           I.am.ME@GAMIL.COM
3    FIRST.LAST.NAME@MAIL.CMO
4    EMAIL+REMOVE@TESTING.COM
Name: EMAIL, dtype: object
There are a number of things I'm trying to do here:
1) replace misspelled endings (e.g. CMO) with correct spellings (e.g. COM)
2) replace misspelled domain names with correct spellings
3) replace multiple periods with just 1 period AFTER the '@' symbol.
4) remove all periods before the '@' sign if they have a gmail account
5) remove all characters after the "+" symbol up to the '@' symbol
So, from the example above I would have returned:
>>> email['EMAIL']
0                testing@.com
1                         NaN
2             IamME@GMAIL.COM
3    FIRST.LAST.NAME@MAIL.COM
4           EMAIL@TESTING.COM
Name: EMAIL, dtype: object
I've tried a number of different approaches and keep running into errors. Here's one of my best guesses so far, for removing multiple periods after the '@' symbol.
def remove_periods(email):
    email_split = email['EMAIL'].str.split('@')
    ending = email_split.str.get(-1)
    ending = ending.str.replace('\.{2,}', '.')
    emailupdate = email_split.str[:-1]
    emailupdate.append(ending)
    email_split.str.get()
    return '@'.join(emailupdate)
email['EMAIL'].apply(remove_periods)
I could post my other attempts too, but they all return errors as well.
Thanks a lot for the help!
import numpy as np
import pandas as pd
pd.options.display.width = 1000
email = pd.DataFrame({'EMAIL':[
    'testing@...com', np.nan, 'I.am.ME@GAMIL.COM', 'FIRST.LAST.NAME@MAIL.CMO',
    'EMAIL+REMOVE@TESTING.COM', 'gamil@bar...com', 'noperiods@localhost']})
email[['NAME', '@', 'ADDR']] = email['EMAIL'].str.rpartition('@')
# Note: regex=True is passed explicitly below because pandas 2.0+ treats
# the pattern as a literal string by default.
# 1) replace misspelled endings (e.g. CMO) with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)CMO$', 'COM', regex=True)
# 2) replace misspelled domain names with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
# 3) replace multiple periods with just 1 period AFTER the '@' symbol.
email['ADDR'] = email['ADDR'].str.replace(r'[.]{2,}', '.', regex=True)
# 4) remove all periods before the '@' sign if they have a gmail account
mask = email['ADDR'].str.contains(r'(?i)^GMAIL[.]COM$') == True
email.loc[mask, 'NAME'] = email.loc[mask, 'NAME'].str.replace(r'[.]', '', regex=True)
# 5) remove all characters after the "+" symbol up to the '@' symbol
email['NAME'] = email['NAME'].str.replace(r'[+].*', '', regex=True)
# put it back together. You could reassign to email['EMAIL'] if you wish.
email['NEW_EMAIL'] = email['NAME'] + email['@'] + email['ADDR']
# clean up intermediate columns
# email = email.drop(columns=['NAME', '@', 'ADDR'])
print(email)
yields
                      EMAIL             NAME     @         ADDR                 NEW_EMAIL
0            testing@...com          testing     @         .com              testing@.com
1                       NaN              NaN  None         None                       NaN
2         I.am.ME@GAMIL.COM            IamME     @    GMAIL.COM           IamME@GMAIL.COM
3  FIRST.LAST.NAME@MAIL.CMO  FIRST.LAST.NAME     @     MAIL.COM  FIRST.LAST.NAME@MAIL.COM
4  EMAIL+REMOVE@TESTING.COM            EMAIL     @  TESTING.COM         EMAIL@TESTING.COM
5           gamil@bar...com            gamil     @      bar.com             gamil@bar.com
6       noperiods@localhost        noperiods     @    localhost       noperiods@localhost
The NAME column holds everything before the last @.
The ADDR column holds everything after the last @.
I left the NAME and ADDR columns visible (and did not overwrite the original EMAIL column)
so it would be easier to understand the intermediate steps.
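To see what the split into NAME and ADDR does for each row, here is a small sketch using Python's built-in str.rpartition, which pandas' Series.str.rpartition applies element by element. The sample addresses are invented for illustration:

```python
# Hypothetical addresses: a normal one, one with two separators, one with none.
samples = ["first.last@mail.com", "odd@name@example.com", "noatsign"]

# rpartition splits at the LAST occurrence of the separator and always
# returns three parts: (before, separator, after).
parts = [address.rpartition("@") for address in samples]

for address, (name, sep, addr) in zip(samples, parts):
    print(address, "->", repr(name), repr(sep), repr(addr))
```

When the separator is missing, the built-in returns two empty strings followed by the whole input, so such a value ends up entirely in the last part rather than the first.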

Generating word boundaries from string with no spaces

I'm starting the process of developing an algorithm to determine the gender of an individual based on their email address. I can have emails such as the following:
johnsonsam@example.com
samjohnson@example.com
sjohnson@example.com
john@example.com
My plan is to try to do an index search against the most common first and last names based on the US census. This is meant to apply to the US demographic. However, I think it would be much more efficient if I could first decompose the above e-mail addresses into the following:
<wb>johnson</wb><wb>sam</wb>@example.com
<wb>sam</wb><wb>johnson</wb>@example.com
<wb>s</wb><wb>johnson</wb>@example.com
<wb>john</wb>@example.com
Are there any algorithms (preferably in Python) that you know of that can do this annotation? Any other suggestions towards solving this are welcome.
The problem you've described is called "word segmentation." The wordsegment package will do this for you. It uses the Google Web Trillion Word Corpus, and works well even on names.
To install it:
pip install wordsegment
Here's an example program:
import sys
from wordsegment import load, segment
def main():
    load()  # recent wordsegment versions need the corpus data loaded before segmenting
    for line in sys.stdin:
        print('%s -> %s' % (line.strip(), segment(line)))
if __name__ == '__main__':
    main()
And here's the output on some examples (assuming you've already separated out the part before the "@" in the email address):
johnsonsam -> ['johnson', 'sam']
samjohnson -> ['sam', 'johnson']
sjohnson -> ['s', 'johnson']
john -> ['john']
johnson_sam -> ['johnson', 'sam']
You could try using lists of names from census data and see if that gives you even better performance. For more information about how you might implement the algorithm yourself with a custom list of words, see the "Word Segmentation" section of this chapter by Norvig: Natural Language Corpus Data.
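As a rough sketch of the "custom list of words" idea, the function below splits a username into known names plus single-letter initials and keeps the split with the fewest pieces. That scoring rule is a simple stand-in chosen for this example, not the probability-based scoring Norvig describes, and KNOWN_NAMES and segment_names are hypothetical names; a real run would load census name lists instead of three hard-coded names:

```python
from functools import lru_cache

# Hypothetical, tiny name list standing in for census data.
KNOWN_NAMES = frozenset({"john", "johnson", "sam"})


def segment_names(user, names=KNOWN_NAMES):
    """Split `user` into known names and single characters, fewest pieces first."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best split of user[start:], as a tuple of pieces.
        if start == len(user):
            return ()
        candidates = [
            (user[start:end],) + best(end)
            for end in range(start + 1, len(user) + 1)
            # a single character is always allowed (initial or unknown letter)
            if end - start == 1 or user[start:end] in names
        ]
        return min(candidates, key=len)

    return list(best(0))


for user in ["johnsonsam", "samjohnson", "sjohnson", "john"]:
    print(user, "->", segment_names(user))
```

Preferring fewer pieces makes "johnson" win over "john" followed by loose letters. A string with no known names simply falls apart into single characters, which a caller could treat as "no match".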
Here's a basic start. You also need to consider separators (such as dots and underscores), middle names, and initials.
import re
def is_name_list(cands, refs):
    # every candidate longer than one character must be a known name;
    # single characters are let through as possible initials
    for c in cands:
        if len(c) > 1 and c not in refs:
            return False
    return True
emails = [
    'johnsonsam@example.com',
    'samjohnson@example.com',
    'sjohnson@example.com',
    'john@example.com'
]
names = ['john', 'sam', 'johnson']
for e in emails:
    print('\n' + e)
    at_ind = e.index('@')
    user = e[0:at_ind]
    finals = []  # collected once per address so duplicates can be filtered
    for n in names:
        parts = filter(None, user.split(n))
        if is_name_list(parts, names):
            all_parts = re.split('(' + n + ')', user)
            strs = ["<wb>" + s + "</wb>" for s in all_parts if s != '']
            if len(strs) > 0:
                # keep the domain outside the <wb> tags
                final = ''.join(strs) + e[at_ind:]
                if final not in finals:
                    finals.append(final)
    print(finals)

Resources