find all website addresses in the input text (Python) - python-3.x

I need to find all website addresses in the input text and print them in the order they appear in the text, each on a new line. An address starts with "https://", "http://" or "www.".
I used split on the string, but I can't get the words that start with 'www'.
Can someone explain how I can solve this?
Sample Input 1:
WWW.GOOGLE.COM uses 100-percent renewable energy sources and www.ecosia.com plants a tree for every 45 searches!
Sample Output 1:
WWW.GOOGLE.COM
www.ecosia.com
text = input()
text = text.lower()
words = text.split(" ")
for word in words:

A better way is to use a regular expression (regex).
You can learn more good regex patterns from this
import re
url_regex = r"(?i)(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})"
raw_string = "WWW.GOOGLE.COM uses 100-percent renewable energy sources and www.ecosia.com plants a tree for every 45 searches!"
urls = re.findall(url_regex, raw_string)
# print each address on its own line, in the order found
for url in urls:
    print(url)
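For comparison, the split-based attempt from the question can also be completed without a regex. This is only a minimal sketch: the names `PREFIXES` and `find_addresses` are made up for illustration, and it assumes addresses are separated from other words by whitespace.

```python
# Prefixes that mark a website address (taken from the task statement).
PREFIXES = ("https://", "http://", "www.")

def find_addresses(text):
    """Return the whitespace-separated words that start with a known prefix."""
    # Compare against a lower-cased copy so the original spelling is kept.
    return [word for word in text.split() if word.lower().startswith(PREFIXES)]

sample = ("WWW.GOOGLE.COM uses 100-percent renewable energy sources and "
          "www.ecosia.com plants a tree for every 45 searches!")
for address in find_addresses(sample):
    print(address)
```

Lower-casing the whole input first, as in the question, would also change the printed addresses, which is why the comparison is done per word here. Punctuation attached to an address (for example a trailing comma) is kept by this sketch.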

What I would do is look for "www", because we know every URL here begins with that and ends with a space, then put every match in an array and print it. Python has a lot of string functions in its library, but I don't know many of them.
text = " www.GOOGLE.COM uses 100-percent renewable energy sources and www.ecosia.com plants a tree for every 45 searches! "
tmp = ""
all_url = []
for i in range(len(text) - 3):
    # compare case-insensitively so "WWW" is found as well
    if text[i:i + 3].lower() == "www":
        k = i + 4  # skip "www."
        # copy characters up to the next space (the text ends with a space)
        while text[k] != " ":
            tmp = tmp + text[k]
            k += 1
        all_url.append(tmp)
        tmp = ""
for url in all_url:
    print("www." + url)

Related

Extracting Unstructured Addresses and email ids as variables from scraped text - Python

I am a novice in Python, so please pardon me if this seems to be a simple problem. The code below successfully scrapes a webpage. Is there a way to extract addresses, email IDs and contact numbers from this text and put them in a dataframe? I have found two ways to do so:
REGEX - But it may not work, as I have many websites to scrape and the addresses may not always be structured in a regular pattern.
Pyap - It caters only to US and Canadian addresses.
Is there a way apart from the above two to fetch the required details?
import requests
from bs4 import BeautifulSoup
link = input("ENTER WEBPAGE") # for example, I am currently using this webpage: https://glg.it/contact-us/
response = requests.get(link)
details = response.text
scraped_details = BeautifulSoup(details, "html.parser")
pretty1 = scraped_details.prettify()
print(pretty1)
Thanks for any help !!
Regex can be used by adapting an expression that matches most of the address formats:
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) will return a list with all the occurrences found.
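As a quick check of the pattern explained above, here is a small sketch that runs it on a sample sentence. The sentence itself is made up for illustration (only the address comes from the answer), and the name `ADDRESS_RE` is hypothetical.

```python
import re

# Pattern from the answer above, written as a raw string.
ADDRESS_RE = re.compile(r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}")

# Made-up sample text containing the address used in the answer.
txt = "Our office: 44 West 22nd Street, New York, NY 12345. Call us any time."

addresses = ADDRESS_RE.findall(txt)
print(addresses)
```

Note that this pattern only fits US-style addresses with a street number of up to three digits; other formats would need their own expressions.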

How do I perform a regular expression on multiple .txt files in a folder (Python)?

I'm trying to open 32 .txt files, extract some text from them (using RegEx) and then save them as individual files again (later on in the project I'm hoping to collate them together). I've tested the RegEx on a single file and it seems to work:
import os
import re
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation planning\Manual scrape\Finished years proper')
with open('1988.txt') as txtfile:
    text = txtfile.read()
#print(len(text)) #sentences in text
start = r'Body\n\n\n'
docs = re.findall(start, text)
print('Found the start of %s documents.' % len(docs))
end = r'Load-Date:'
docs = re.findall(end, text)  # count the end markers before reporting them
print('Found the end of %s documents.' % len(docs))
regex = start+r'(.+?)'+end
articles = re.findall(regex, text, re.S)
print('You have now parsed the 154 articles so only the body of content remains. All metadata has been removed.')
print('Here is an example of a parsed article:', articles[0])
Now I want to perform the exact same thing on all my .txt files in that folder, but I can't figure out how to. I've been playing around with For loops but with little success. Currently I have this:
import os
import re
finished_years_proper= os.listdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
print('There are %s .txt files in this folder.' % len(finished_years_proper))
if i.endswith(".txt"):
    with open(finished_years_proper + i, 'r') as all_years:
        for line in all_years:
            start = r'Body\n\n\n'
            docs = re.findall(start, all_years)
            end = r'Load-Date:'
            docs = re.findall(end, all_years)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, all_years, re.S)
However, I'm returning a type error:
File "C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Method\Python\untitled1.py", line 15, in <module>
with open(finished_years_proper + i, 'r') as all_years:
TypeError: can only concatenate list (not "str") to list
I'm unsure how to proceed... I've seen on other forums that I should convert something into a string, but I'm not sure what to convert or even if this is the right way to proceed. Any help with this would be really appreciated!
After taking Benedictanjw's suggestion into my code, I've ended up with this:
all_years = []
for fyp in finished_years_proper: #fyp is each text file in folder
    with open(fyp, 'r') as year:
        for line in year: #line is each element in each text file in folder
            start = r'Body\n\n\n'
            docs = re.findall(start, line)
            end = r'Load-Date:'
            docs = re.findall(end, line)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, line, re.S)
            all_years.append(articles) #append strings to reflect RegEx
parsed_documents = all_years.append(articles)
print(parsed_documents) #returns None. Apparently this is okay.
Does the 'None' mean that the parsing of each file is successful (as in it emulates the result I had when I tested the RegEx on a single file)? And if so, how can I visualise my output without returning None. Many thanks in advance!!
The problem shows because finished_years_proper is a list and in your line:
with open(finished_years_proper + i, 'r') as all_years:
you are trying to concatenate i with that list. I presume you had accidentally defined i elsewhere as a string. I guess you probably want to do something like:
all_years = []
for fyp in finished_years_proper:
    with open(fyp, 'r') as year:
        for line in year:
            ... # your regex search on year
            all_years.append(xxx)
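Two details are worth adding to the follow-up question. First, `list.append` always returns `None`, so `parsed_documents` says nothing about whether parsing succeeded; the results are in the list itself. Second, a pattern containing `Body\n\n\n` spans several lines, so it cannot match when the file is searched one line at a time. A minimal sketch of one way around both, assuming each file fits in memory (the names `ARTICLE_RE` and `parse_folder` are hypothetical):

```python
import re
from pathlib import Path

# Same start and end markers as in the question; re.S lets "." match newlines.
ARTICLE_RE = re.compile(r'Body\n\n\n(.+?)Load-Date:', re.S)

def parse_folder(folder):
    """Return {file name: [article bodies]} for every .txt file in folder."""
    parsed = {}
    for path in sorted(Path(folder).glob('*.txt')):
        text = path.read_text()  # read the whole file, not line by line
        parsed[path.name] = ARTICLE_RE.findall(text)
    return parsed
```

Calling `parse_folder` on the folder and printing `len(articles)` for each entry of the returned dictionary gives a per-file count comparable to the single-file test.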

Why is my code working but infinitely printing?

I am downloading two Excel spreadsheets: one is a list of TLDs and their definitions, the other is a list of the top one million websites. Part of my function is to find the top position of a particular TLD based on user input. The code 'works', however it is printing the rank/key infinitely and I don't know why. Any help is appreciated.
def parseCSV(tlds, top):
    tldDict = {
        " ":" "
    }
    topDict = {
        " ":" "
    }
    #opening CSV file using string from parameter
    file = open(r"C:\Users\bubba\OneDrive\Documents\Foundation Programming\tlds.csv", 'r')
    file2 = open(r"C:\Users\bubba\OneDrive\Documents\Foundation Programming\top 1m.csv", 'r')
    #iterating through file and splitting each line by comma
    #this is put into a dictionary
    for i in file.readlines():
        x=i.split(",")
        tldDict[x[0]]=x[1]
    file.close()
    for i in file2.readlines():
        x=i.split(",")
        topDict[x[0]]=x[1]
    file2.close()
    return [tldDict,topDict]
def getTopTLD(tld, tldDict, topDict):
    matched_keys = []
    for key, pair in topDict.items():
        if tld in pair:
            matched_keys.append(key) # Simple append statement
            print('The position of this tld is:', min(matched_keys))
    return matched_keys
tldDict, topDict = parseCSV("tlds","top-1m")
while True:
    tld = input("Enter tld ")
    if tld == "exit":
        break
    getTopTLD(tld, tldDict, topDict)
console example:
Enter tld com
The position of this tld is: 1
The position of this tld is: 1
The position of this tld is: 1
The position of this tld is: 1
The position of this tld is: 1
I thought it might be the while loop but if I remove it the result is the same.
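The repeated output would be consistent with the `print` call sitting inside the `for` loop, so that it runs once per matching site; this is an assumption, since the indentation of the posted code is ambiguous. A hedged sketch that collects the ranks first and reports once is shown below. The function name `get_top_position` and the sample dictionary are made up for illustration. It also converts ranks to integers, because `min` on strings compares them alphabetically.

```python
def get_top_position(tld, top_dict):
    """Return the best (lowest) rank whose domain ends with .tld, or None."""
    suffix = "." + tld.lower()
    ranks = [
        int(rank)
        for rank, domain in top_dict.items()
        # skip the " " placeholder entry and compare without the trailing newline
        if rank.strip().isdigit() and domain.strip().lower().endswith(suffix)
    ]
    return min(ranks) if ranks else None

# Made-up sample in the same shape as topDict (rank string -> domain line).
top = {" ": " ", "1": "google.com\n", "2": "youtube.com\n",
       "10": "bbc.co.uk\n", "9": "gov.uk\n"}
print('The position of this tld is:', get_top_position("uk", top))
```

Matching on the suffix rather than with `in` avoids counting a domain such as one that merely contains "com" somewhere in its name.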

python3/email: parsing a list of email addresses with embedded commas?

I know how to use email.utils.parseaddr() to parse an email address. However, I want to parse a list of multiple email addresses, such as the address portion of this header:
Cc: "abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>
In general, I know I can split on a regex like \s*,\s* to get the individual addresses, but in my example, the name portion of one of the addresses contains a comma, and this regex therefore will split the header incorrectly.
I know how to manually write state-machine-based code to properly split that address into pieces, and I also know how to code a complicated regex that would match each email address. I'm not asking for help in writing such code. Rather, I'm wondering if there are any existing python modules which I can use to properly split this email address list, so I don't have to "re-invent the wheel".
Thank you in advance.
Borrowing the answer from this question, "How do you extract multiple email addresses from an RFC 2822 mail header in python?":
msg = 'Cc: "abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>'
import email.utils
print(email.utils.getaddresses([msg]))
produces:
[('abc', 'foo#bar.com'), ('www, xxyyzz', 'something#else.com')]
This is not elegant in the least and I'm sure someone will come along and improve upon this. However, this works for me and hopefully gives you an idea of how this can be done.
The split method is what you're looking for here I believe. In the simplest terms, you take your string and choose a character to split upon. This will separate the string into a list that you can iterate over assuming the split key selection is found. If it's not found then the string is a one element list.
emails = 'Cc: "abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>'
emails
Out[37]:
'Cc: "abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>'
In [38]:
emails = emails.split(' ')
new_emails = []
for e in emails:
    if '#' in e:
        new_email = e.replace('<', '')
        new_email = new_email.replace('>', '')
        new_email = new_email.replace(',', '')
        new_emails.append(new_email)
print(new_emails)
['foo#bar.com', 'something#else.com']
If you want to use regex to do this, someone smarter than I will have to help.
I know I can do something like the following, but again, I'm hoping that there is already an existing package which could do this for me ...
#!/usr/bin/python3
import email.utils
def getaddrs(text):
    def _yieldaddrs(text):
        inquote = False
        curaddr = ''
        for x in text:
            if x == '"':
                inquote = not inquote
                curaddr += x
            elif x == ',':
                if inquote:
                    curaddr += x
                else:
                    yield(curaddr)
                    curaddr = ''
            else:
                curaddr += x
        if curaddr:
            yield(curaddr)
    return [email.utils.parseaddr(x) for x in _yieldaddrs(text)]
addrstring = '"abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>'
print('{}'.format(getaddrs(addrstring)))
# Prints this ...
# [('abc', 'foo#bar.com'), ('www, xxyyzz', 'something#else.com')]

Expected str instance, int found. How do I change an int to str to make this code work?

I'm trying to write code that analyses a sentence that contains multiple words and no punctuation. I need it to identify individual words in the sentence that is entered and store them in a list. My example sentence is 'ask not what your country can do for you ask what you can do for your country'. I then need the original position of each word to be written to a text file. This is my current code, with parts taken from other questions I've found, but I just can't get it to work.
myFile = open("cat2numbers.txt", "wt")
list = [] # An empty list
sentence = "" # Sentence is equal to the sentence that will be entered
print("Writing to the file: ", myFile) # Telling the user what file they will be writing to
sentence = input("Please enter a sentence without punctuation ") # Asking the user to enter a sentence
sentence = sentence.lower() # Turns everything entered into lower case
words = sentence.split() # Splitting the sentence into single words
positions = [words.index(word) + 1 for word in words]
for i in range(1,9):
    s = repr(i)
    print("The positions are being written to the file")
    d = ', '.join(positions)
    myFile.write(positions) # write the places to myFile
    myFile.write("\n")
    myFile.close() # closes myFile
print("The positions are now in the file")
The error I've been getting is TypeError: sequence item 0: expected str instance, int found. Could someone please help me, it would be much appreciated
The error stems from .join due to the fact you're joining ints on strings.
So the simple fix would be using:
d = ", ".join(map(str, positions))
which maps the str function on all the elements of the positions list and turns them to strings before joining.
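To make the fix concrete, here is a small sketch using the example sentence from the question. The variable name `joined` is just for illustration.

```python
sentence = "ask not what your country can do for you ask what you can do for your country"
words = sentence.split()

# Position of the first occurrence of each word, counting from 1.
positions = [words.index(word) + 1 for word in words]

# str.join only accepts strings, so convert each int first.
joined = ", ".join(map(str, positions))
print(joined)
# prints: 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 3, 9, 6, 7, 8, 4, 5
```

Passing `positions` to `join` directly raises the `TypeError` quoted in the question, because its items are ints.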
That won't solve all your problems, though. You have used a for loop for some reason, in which you .close the file after writing. In subsequent iterations you'll get an error for attempting to write to a file that has been closed.
There are other things: list = [] is unnecessary, and using the name list should be avoided because it shadows the built-in type; the initialization of sentence is unnecessary too, since you don't need to initialize variables like that. Additionally, if you want to ask for 8 sentences (the for loop), put your loop before doing your work.
All in all, try something like this:
with open("cat2numbers.txt", "wt") as f:
    print("Writing to the file: ", f.name) # Telling the user what file they will be writing to
    for i in range(8):
        sentence = input("Please enter a sentence without punctuation ").lower() # Asking the user to enter a sentence
        words = sentence.split() # Splitting the sentence into single words
        positions = [words.index(word) + 1 for word in words]
        f.write(", ".join(map(str, positions))) # write the places to the file
        f.write("\n")
print("The positions are now in the file")
This uses the with statement, which handles closing the file for you behind the scenes.
As I see it, in the for loop you try to write to the file, then close it, and then write to the closed file again. Couldn't this be the problem?
