Extracting Unstructured Addresses and email ids as variables from scraped text - Python - python-3.x

I am a novice in Python, so please pardon me if this seems to be a simple problem. The code below successfully scrapes a webpage. Is there a way to extract addresses, email ids and contact numbers from this text and put them in a dataframe? I have found two ways to do so:
REGEX - But it may not work, as I have many websites to scrape and the addresses may not always follow a regular pattern.
Pyap - It caters only to US and Canadian addresses.
Is there a way, apart from the above two, to fetch the required details?
import requests
from bs4 import BeautifulSoup
url = input("ENTER WEBPAGE")  # for example, I am currently using "https://glg.it/contact-us/"
response = requests.get(url)
details = response.text
scraped_details = BeautifulSoup(details, "html.parser")
pretty1 = scraped_details.prettify()
print(pretty1)
Thanks for any help !!

A regex can be used by adapting the expression so that it matches most of the address formats:
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) returns a list with all the occurrences found.
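For the email ids and contact numbers mentioned in the question, regular expressions tend to be more dependable than for postal addresses. The following is a minimal sketch, not a definitive implementation: the sample text, both patterns and the helper name extract_contacts are assumptions made for illustration, and the phone pattern in particular will likely need adjusting for the sites being scraped.

```python
import re

# Hypothetical sample text standing in for the scraped page content.
sample_text = """
Contact us at info@example.com or sales@example.org.
Phone: +39 02 1234 5678, Fax: (212) 555-0199
"""

# Deliberately loose patterns (assumptions, not complete validators).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\(?\d[\d\s().-]{6,}\d")


def extract_contacts(text):
    """Return the e-mail addresses and phone-like strings found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


contacts = extract_contacts(sample_text)
print(contacts["emails"])
print(contacts["phones"])
```

The two lists can then be loaded into a dataframe, for example with one row per match.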

Related

Get number from string in Python

I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting [] list only.
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do is called a capture using regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex string r'(\d+)', which means "capture any run of digits", search through the url. Then captured holds the first captured group, which is 1987.
If you don't want to use this, you can keep your .split() method but this time pass / as the separator, for example url.split('/').
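A minimal sketch of the split-on-slash approach, reusing the URL from the question; the variable name ids is only illustrative:

```python
url = "www.mylocalurl.com/edit/1987"

# Split on "/" and keep only the pieces made up entirely of digits.
ids = [int(part) for part in url.split("/") if part.isdigit()]
print(ids)
```

This is the original list comprehension with only the separator changed, so it returns a list of integers rather than a single string.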

find all website addresses in the input text (Python)

I need to find all website addresses in the input text and print them in the order they appear in the text, each on a new line. A website address starts with "https://", "http://" or "www.".
I used split on the string, but I can't return the words that start with 'www'.
Can someone explain to me how can I solve this?
Sample Input 1:
WWW.GOOGLE.COM uses 100-percent renewable energy sources and www.ecosia.com plants a tree for every 45 searches!
Sample Output 1:
WWW.GOOGLE.COM
www.ecosia.com
text = input()
text = text.lower()
words = text.split(" ")
for word in words:
A better way is to use a regular expression:
import re
url_regex = r"(?i)(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})"
raw_string = "WWW.GOOGLE.COM uses 100-percent renewable energy sources and www.ecosia.com plants a tree for every 45 searches!"
urls = re.findall(url_regex, raw_string)
What I would do is look for "www", because we know every URL here begins with that and ends with a space, then put everything in a list and print it. Python has a lot of string functions in its library, but I don't know many of them.
text = " www.GOOGLE.COM uses 100-percent renewable energy sources and www.ecosia.com plants a tree for every 45 searches! "
lowered = text.lower()  # compare case-insensitively, but keep the original text for output
all_url = []
i = 0
while i < len(text) - 3:
    if lowered[i:i + 4] == "www.":
        k = i
        # advance to the next space (or the end of the string)
        while k < len(text) and text[k] != " ":
            k += 1
        all_url.append(text[i:k])
        i = k
    else:
        i += 1
for url in all_url:
    print(url)
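The split-based attempt from the question can also be completed without a regex. This is a sketch under the assumption that addresses are separated from other words by whitespace; the names prefixes and addresses are illustrative, and trailing punctuation attached to an address is not stripped.

```python
text = "WWW.GOOGLE.COM uses 100-percent renewable energy sources and www.ecosia.com plants a tree for every 45 searches!"

# Prefixes named in the task; str.startswith accepts a tuple of options.
prefixes = ("https://", "http://", "www.")

# Compare a lowercased copy of each word, but keep the original spelling.
addresses = [word for word in text.split() if word.lower().startswith(prefixes)]

for address in addresses:
    print(address)
```

Lowercasing only for the comparison keeps WWW.GOOGLE.COM in its original case, as the sample output requires.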

Why is pyperclip not copying result of phone numbers to clipboard

I'm a beginner learning python with Automate The Boring Stuff by Al Sweigart.
I'm currently on the part where he creates a program that uses regular expressions to extract emails and phone numbers from documents so they can be pasted into another document.
Below is the script:
#! python3
import re
import pyperclip
# Create a regex for phone numbers
phoneRegex = re.compile(r'''
# 08108989212
(\d{11}) # Full phone number
''', re.VERBOSE)
# Create a regex for email addresses
emailRegex = re.compile(r'''
# some.+_thing@(\d{2,5}))?.com
[a-zA-Z0-9_.+] + # name part
@ # @ symbol
[a-zA-Z0-9_.+] + # domain name part
''', re.VERBOSE)
#Get the text off the clipboard
text = pyperclip.paste()
# TODO: Extract the email/phone from this text
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
print(extractedPhone)
print(extractedEmail)
# Copy the extracted email/phone to the clipboard
results = '\n'.join(allPhoneNumbers) + '\n' + '\n'.join(extractedEmail)
pyperclip.copy(results)
The script is expected to extract the phone numbers and email addresses and print them to the terminal, which it does. It is also expected to copy the extracted phone numbers and email addresses to the clipboard automatically, so they can be pasted into another text editor or Word document.
The problem is that it copies the email addresses correctly, but each phone number becomes 0 when pasted.
What am I not getting right?
Please pardon the errors in my English.
There is a library for this, phonenumbers:
Python version of Google's common library for parsing, formatting,
storing and validating international phone numbers.
I think you will need to use this to format those phone numbers.
To be more specific, you'll need to install the package using:
pip install phonenumbers
The main object that the library deals with is a PhoneNumber object. You can create this from a string representing a phone number using the parse function, but you also need to specify the country that the phone number is being dialled from (unless the number is in E.164 format, which is globally unique).
>>> import phonenumbers
>>> x = phonenumbers.parse("+442083661177", None)
>>> print(x)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> type(x)
<class 'phonenumbers.phonenumber.PhoneNumber'>
>>> y = phonenumbers.parse("020 8366 1177", "GB")
>>> print(y)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> x == y
True
>>> z = phonenumbers.parse("00 1 650 253 2222", "GB")  # as dialled from GB, not a GB number
>>> print(z)
Country Code: 1 National Number: 6502532222 Leading Zero(s): False
More information can be found here: https://pypi.org/project/phonenumbers/
The problem is that you don't need this part of your code:
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
Because the regex has a single group, findall() already returns a list of strings, so allPhoneNumber[0] is just the first character of each extracted phone number (always 0 here).
Remove that loop and change the result as follows:
results = '\n'.join(extractedPhone) + '\n' + '\n'.join(extractedEmail)
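The behaviour can be reproduced without the clipboard. In this sketch the sample text and the second phone number are made up for illustration, and the regex is written without re.VERBOSE for brevity:

```python
import re

phone_regex = re.compile(r"(\d{11})")

# Hypothetical clipboard contents.
text = "Call 08108989212 or 08031234567 today."

# With exactly one group, findall() returns a list of strings, not tuples.
extracted = phone_regex.findall(text)

# Indexing a string with [0] yields its first character only.
first_chars = [number[0] for number in extracted]

# Joining the strings directly keeps the full numbers.
joined = "\n".join(extracted)

print(extracted)
print(first_chars)
print(joined)
```

Indexing with [0] only makes sense when the pattern has several groups and findall() returns tuples, as in the book's original multi-group phone regex.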

How to strip whitespace from element in list

I read a file, scraped all the artist names from it and put them in a list. I'm trying to pull one artist from the list, then remove the spaces and shuffle all the letters (a word scramble).
artist_names = []
rand_artist = artist_names[random.randrange(len(artist_names))]  # Picks a random artist from the list
print(rand_artist)
However, when I print rand_artist, I sometimes get an artist with two or three words, such as "A Northern Chorus" or "The Beatles". I would like to remove the whitespace between the words and then shuffle the letters.
First replace the whitespace with empty strings, then turn the string into a list of characters. Since I guess you want them to be lowercase, I included that as well.
import random
s = "A Northern Chorus".replace(' ','').lower()
l=list(s)
random.shuffle(l)
print(l)
Also, you can use random.choice(artist_names) instead of randrange().
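Putting both suggestions together, and joining the shuffled letters back into one string, might look like the sketch below. The two artist names are sample data taken from the question, and the name scrambled is illustrative.

```python
import random

# Sample data; in the question this list is filled from a file.
artist_names = ["A Northern Chorus", "The Beatles"]

rand_artist = random.choice(artist_names)

# Drop the spaces, lowercase, and split into single characters.
letters = list(rand_artist.replace(" ", "").lower())

# shuffle() works in place and returns None.
random.shuffle(letters)

# Join the characters back into a single scrambled word.
scrambled = "".join(letters)
print(scrambled)
```

Note that replace(' ', '') removes only plain spaces; "".join(rand_artist.split()) would also remove tabs and newlines.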

How to extract information from a text file that is located on a web page in python

I am a total beginner and I'm trying to do the following. I need to open a text file from a web page which contains a small list like that below.
name lastname M 0909
name lastname C 0909
name lastname F 0909
name lastname M 0909
name lastname M 0909
What I need to do is count how many capital M letters there are and how many different capital letters there are (here there are 3: M, F and C) and print the results. Then I need to create a new text file, write only the names into it and save it on my hard drive. So far I have only figured out how to open the list from the web page.
import urllib.request
url = 'http://mypage.com/python/textfile.txt'
with urllib.request.urlopen(url) as myfile:
    for i in myfile:
        i = i.decode("ISO-8859-1")
        print(i, end=" ")
But that is all I know. I tried using count(), but it counts only one line at a time: it counts how many capital M letters are in one line (1) but does not add them together for the whole text (3). Any help would be appreciated, thank you.
I don't know exactly what you are doing, but try this:
import urllib.request
url = 'http://mypage.com/python/textfile.txt'
with urllib.request.urlopen(url) as myfile:
    number_of_M = 0
    set_of_big_letters = set()
    for i in myfile:
        i = i.decode("ISO-8859-1")
        name, lastname, big_letter, _ = i.split(' ')  # if they are separated by a single space
        set_of_big_letters.add(big_letter)
        if big_letter == 'M':
            number_of_M += 1
print(number_of_M)
print(len(set_of_big_letters))
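The code above does not cover the second half of the question, writing only the names to a new file. A minimal sketch of both steps follows, assuming the lines have already been downloaded and decoded; the sample lines, the output location and the file name names.txt are all hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical lines standing in for the decoded lines of the downloaded file.
lines = [
    "anna smith M 0909",
    "ben jones C 0909",
    "cara brown F 0909",
    "dan white M 0909",
    "eve black M 0909",
]

letter_counts = Counter()
names = []
for line in lines:
    # split() without arguments tolerates repeated spaces and trailing newlines.
    first, last, letter, _ = line.split()
    letter_counts[letter] += 1
    names.append(f"{first} {last}")

print(letter_counts["M"])   # how many lines carry the letter M
print(len(letter_counts))   # how many different letters occur

# Write only the names, one per line; the temp directory is just a safe default.
output_path = Path(tempfile.gettempdir()) / "names.txt"
output_path.write_text("\n".join(names) + "\n", encoding="utf-8")
```

A Counter keeps a running total per letter across all lines, which is what calling count() on each line separately could not do.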
