Generating word boundaries from string with no spaces - nlp

I'm starting the process of developing an algorithm to determine the gender of an individual based on their email address. I can have emails such as the following:
johnsonsam#example.com
samjohnson#example.com
sjohnson#example.com
john#example.com
My plan is to do a lookup against the most common first and last names from the US census; this is meant to apply to the US demographic. However, I think it would be much more efficient if I could first decompose the above email addresses into the following:
<wb>johnson</wb><wb>sam</wb>#example.com
<wb>sam</wb><wb>johnson</wb>#example.com
<wb>s</wb><wb>johnson</wb>#example.com
<wb>john</wb>#example.com
Are there any algorithms (preferably in Python) that you know of that can do this annotation? Any other suggestions towards solving this are welcome.

The problem you've described is called "word segmentation." The wordsegment package will do this for you. It uses the Google Web Trillion Word Corpus, and works well even on names.
To install it:
pip install wordsegment
Here's an example program:
import sys
import wordsegment

def main():
    for line in sys.stdin:
        print('%s -> %s' % (line.strip(), wordsegment.segment(line)))

if __name__ == '__main__':
    main()
And here's the output on some examples (assuming you've already separated out the part before the "#" in the email address):
johnsonsam -> ['johnson', 'sam']
samjohnson -> ['sam', 'johnson']
sjohnson -> ['s', 'johnson']
john -> ['john']
johnson_sam -> ['johnson', 'sam']
You could try using lists of names from census data and see if that gives you even better performance. For more information about how you might implement the algorithm yourself with a custom list of words, see the "Word Segmentation" section of this chapter by Norvig: Natural Language Corpus Data.
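If you do go the custom-list route, here is a minimal sketch of a recursive segmenter, assuming a hypothetical NAMES set standing in for your census data (it simply prefers the segmentation with the fewest pieces, whereas Norvig's version scores candidate segmentations by word probability):
from functools import lru_cache

# Hypothetical stand-in for a census-derived name list.
NAMES = frozenset({'sam', 'john', 'johnson'})

def segment_names(text, words=NAMES):
    """Split text into known words or single letters, preferring fewer pieces."""
    @lru_cache(maxsize=None)
    def best(s):
        if not s:
            return ()
        options = []
        for i in range(1, len(s) + 1):
            prefix, rest = s[:i], s[i:]
            if prefix in words or len(prefix) == 1:
                tail = best(rest)
                if tail is not None:
                    options.append((prefix,) + tail)
        return min(options, key=len) if options else None
    return list(best(text) or [])

print(segment_names('samjohnson'))   # ['sam', 'johnson']
print(segment_names('sjohnson'))     # ['s', 'johnson']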

Here's a basic start; you also need to consider separators (such as dots, underscores, etc.), middle names, and initials.
import re

def is_name_list(cands, refs):
    # Every multi-character piece left over after splitting must be a known name.
    for c in cands:
        if len(c) > 1 and c not in refs:
            return False
    return True

emails = [
    'johnsonsam#example.com',
    'samjohnson#example.com',
    'sjohnson#example.com',
    'john#example.com'
]
names = ['john', 'sam', 'johnson']

for e in emails:
    print('\n' + e)
    at_ind = e.index('#')
    user = e[0:at_ind]
    finals = []
    for n in names:
        parts = filter(None, user.split(n))
        if is_name_list(parts, names):
            # Split while keeping the matched name as its own piece.
            all_parts = re.split('(' + n + ')', user)
            strs = ["<wb>" + s + "</wb>" for s in all_parts if s != '']
            if len(strs) > 0:
                final = ''.join(strs) + e[at_ind:]
                if final not in finals:
                    finals.append(final)
    print(finals)

Related

How to split strings from .txt file into a list, sorted from A-Z without duplicates?

For instance, the .txt file contains two lines of comma-separated names:
John, George, Tom
Mark, James, Tom,
Output should be:
[George, James, John, Mark, Tom]
The following will create the list and store each item as a string.
def test(path):
    filename = path
    with open(filename) as f:
        f = f.read()
    f_list = f.split('\n')
    for i in f_list:
        if i == '':
            f_list.remove(i)
    res1 = []
    for i in f_list:
        res1.append(i.split(', '))
    res2 = []
    for i in res1:
        res2 += i
    res3 = [i.strip(',') for i in res2]
    for i in res3:
        if res3.count(i) != 1:
            res3.remove(i)
    res3.sort()
    return res3

print(test('location/of/file.txt'))
Output:
['George', 'James', 'John', 'Mark', 'Tom']
Your file opening is fine, although the 'r' mode is redundant since it's the default. You claim it isn't, but it is; see the documentation for open().
You have not described what task is, so I have no idea what's going on there. I will assume that it is correct.
Rather than populating a list and doing a membership test on every iteration - which is O(n^2) in time - can you think of a different data structure that guarantees uniqueness? Google will be your friend here. Once you discover this data structure, you will not have to perform membership checks at all. (If you're struggling with the hint: the answer is a set.)
The input data format is not rigorously defined. Separators may be commas or commas with trailing spaces, and may appear (or not) at the end of the line. Consider making an appropriate regular expression and using its splitting feature to split individual lines, though normal splitting and stripping may be easier to start.
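For example (a quick sketch, not part of the reviewed code), a regular-expression split that tolerates spaces around commas and a trailing separator might look like:
import re

line = 'Mark, James, Tom,'
# split on commas with optional surrounding whitespace, then drop empty pieces
words = [w for w in re.split(r'\s*,\s*', line.strip()) if w]
print(words)  # ['Mark', 'James', 'Tom']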
In the following example code, I've:
ignored task since you've said that that's fine;
separated actual parsing of file content from parsing of in-memory content to demonstrate the function without a file;
used a set comprehension to store unique results of all split lines; and
passed a generator expression to sorted that drops empty strings.
from io import StringIO
from typing import TextIO, List

def parse(f: TextIO) -> List[str]:
    words = {
        word.strip()
        for line in f
        for word in line.split(',')
    }
    return sorted(
        word for word in words if word != ''
    )

def parse_file(filename: str) -> List[str]:
    with open(filename) as f:
        return parse(f)

def test():
    f = StringIO('John, George , Tom\nMark, James, Tom, ')
    words = parse(f)
    assert words == [
        'George', 'James', 'John', 'Mark', 'Tom',
    ]
    f = StringIO(' Han Solo, Boba Fet \n')
    words = parse(f)
    assert words == [
        'Boba Fet', 'Han Solo',
    ]

if __name__ == '__main__':
    test()
I came up with a very simple solution, in case anyone needs it:
def unique_sorted(x):
    # x is an open file object; strip any commas stuck to the words
    lines = [word.strip(',') for word in x.read().split()]
    lines.sort()
    new_list = []
    for word in lines:
        if word not in new_list:
            new_list.append(word)
    return new_list
with open("text.txt", "r") as fl:
list_ = set()
for line in fl.readlines():
line = line.strip("\n")
line = line.split(",")
[list_.add(_) for _ in line if _ != '']
print(list_)
I think that you missed a comma after Tom in the first line.
You can avoid the use of a loop by using the split method:
content = file.read()
my_list = content.split(",")
To remove duplicates from your list, you can transform it into a set:
my_list = list(set(my_list))
Then you can sort it using sorted, so the final code is:
with open("file.txt", "r") as file:
    content = file.read()
    my_list = content.replace("\n", ",").replace(" ", "").split(",")
    # drop empty strings left by trailing commas before sorting
    result = sorted(s for s in set(my_list) if s)
You can add a key to your sort function.
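For example (an illustrative key choice, not from the original answer), sorting case-insensitively:
result = sorted((s for s in set(my_list) if s), key=str.lower)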

Searching a .txt file for multiple list terms and outputting counts/%/words

I'm trying to search a given .txt file (source_filename) for several lists of terms and report, for each list, the count of matching terms, the % of the text those terms account for, and the exact words found.
How should I set up the count/reporting feature?
#build GUI for text file selection
import PySimpleGUI as sg

window_rows = [[sg.Text('Please select a .txt file for analysis')],
               [sg.InputText(), sg.FileBrowse()],
               [sg.Submit(), sg.Cancel()]]

window = sg.Window('Cool Tool Name', window_rows)
event, values = window.Read()
window.Close()

source_filename = values[0]

#written communication term list
dwrit = ('write', 'written', 'writing', 'email', 'memo')
written = dwrit
#oral communication term list
doral = ('oral', 'spoken', 'talk', 'speech,')
oral = doral
#visual communication term list
dvis = ('visual', 'sight')
visual = dvis
#auditory communication term list
daud = ('hear', 'hearing', 'heard')
auditory = daud
#multimodal communication term list
dmm = ('multimodal', 'multi-modal', 'mixed media', 'audio and visual')
multimodal = dmm

#define all term lists
communication = (dwrit, doral, dvis, daud, dmm)

#search lists
from collections import Counter
with open(source_filename, encoding='ISO-8859-1') as f:
    for line in f:
        Counter.update(line.lower().split())

print(Counter(communication))
The problem is, I'm printing out all the terms in all lists right now, but I'm not actually searching the document just for those listed terms and ignoring all other terms...
The ideal output would look like:
Written: [number, %, words]
Oral: [number, %, words]
Visual: [number, %, words]
Auditory: [number, %, words]
Multimodal: [number, %, words]
The Counter is a dictionary keyed off the thing you are counting. That is why you are seeing every single word: you haven't looked up just the words (as keys in the Counter) that you are interested in. Below is an example of how you might do one of the lists; this is a pattern you can use for the other lists.
Try this:
from collections import Counter

c = Counter()

#search lists
with open(source_filename, encoding='ISO-8859-1') as f:
    for line in f:
        c.update(line.lower().split())

# how many of the "written" terms appear at least once in the text
written_words = len([x for x in written if x in c.keys()])
print(f'Written: [{written_words}, {written_words / len(c.keys()) * 100}%]')
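To extend the same pattern to all five lists and get closer to the ideal output, one possible sketch (reusing the term tuples from the question and the c counter above; it counts total occurrences and, as an assumption, uses total word occurrences as the percentage base):
categories = {
    'Written': written,
    'Oral': oral,
    'Visual': visual,
    'Auditory': auditory,
    'Multimodal': multimodal,
}

total = sum(c.values())  # total word occurrences in the document
for label, terms in categories.items():
    found = [t for t in terms if t in c]   # which listed terms appear at all
    count = sum(c[t] for t in found)       # total occurrences of those terms
    pct = 100 * count / total if total else 0
    print(f'{label}: [{count}, {pct:.2f}%, {found}]')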

Extract characters within certain symbols

I have extracted text from an HTML file, and have the whole thing in a string.
I am looking for a method to loop through the string, extract only the values that are within square brackets, and put those strings in a list.
I have looked into several questions, among them this one: Extract character before and after "/"
But I am having a hard time modifying it. Can someone help?
Solved!
Thank you for all your input; I will definitely look more into regex. I managed to do what I wanted in a pretty manual way (it may not be beautiful):
#remove all html code and append to string
html_string = ''
for i in html_file:
    html_string += str(html2text.html2text(i))

#set this boolean if current character is either [ or ]
add = False
clean_string = ''

#extract only values within [ or ], based on add = T/F
for i in html_string:
    if i == '[':
        add = True
    if i == ']':
        add = False
        clean_string += str(i)
    if add == True:
        clean_string += str(i)

#split string into list without square brackets
clean_string_list = clean_string.split('][')
The HTML file I am trying to get as pure text (and a dataframe later on) instead of HTML is my personal Facebook data that I have downloaded.
Try out this regex; given a string, it will place all the text inside [ ] into a list.
import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
>>> ['eggs', 'hello']
Also this is a great reference for building your own regex.
https://regex101.com
EDIT: If you have nested square brackets here is a function that will handle that case.
import re
test ='spam[eg[nested]gs][hello]'
def square_bracket_text(test_text, found):
    """Find text enclosed in square brackets within a string"""
    matches = re.findall(r'\[(\w+)\]', test_text)
    if matches:
        found.extend(matches)
        for word in found:
            test_text = test_text.replace('[' + word + ']', '')
        square_bracket_text(test_text, found)
    return found

match = []
print(square_bracket_text(test, match))
>>>['nested', 'hello', 'eggs']
hope it helps!
You can also use re.finditer() for this, see below example.
Let's suppose we have word characters inside the brackets, so the regular expression will be \[\w+\].
If you wish, check it at https://rextester.com/XEMOU85362.
import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
g = re.finditer(r"\[\w+\]", s)

l = list()  # or, l = []
for m in g:
    text = m.group(0)
    l.append(text[1:-1])

print(l)  # ['Programmer', 'Excellent']

python3/email: parsing a list of email addresses with embedded commas?

I know how to use email.utils.parseaddr() to parse an email address. However, I want to parse a list of multiple email addresses, such as the address portion of this header:
Cc: "abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>
In general, I know I can split on a regex like \s*,\s* to get the individual addresses, but in my example, the name portion of one of the addresses contains a comma, and this regex therefore will split the header incorrectly.
I know how to manually write state-machine-based code to properly split that address into pieces, and I also know how to code a complicated regex that would match each email address. I'm not asking for help in writing such code. Rather, I'm wondering if there are any existing python modules which I can use to properly split this email address list, so I don't have to "re-invent the wheel".
Thank you in advance.
Borrowing the answer from this question How do you extract multiple email addresses from an RFC 2822 mail header in python?
msg = 'Cc: "abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>'
import email.utils
print(email.utils.getaddresses([msg]))
produces:
[('abc', 'foo#bar.com'), ('www, xxyyzz', 'something#else.com')]
This is not elegant in the least and I'm sure someone will come along and improve upon this. However, this works for me and hopefully gives you an idea of how this can be done.
The split method is what you're looking for here, I believe. In the simplest terms, you take your string and choose a character to split on. This separates the string into a list that you can iterate over, assuming the split character is found. If it's not found, then the result is a one-element list.
emails = 'Cc: "abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>'
emails = emails.split(' ')

new_emails = []
for e in emails:
    if '#' in e:
        new_email = e.replace('<', '')
        new_email = new_email.replace('>', '')
        new_email = new_email.replace(',', '')
        new_emails.append(new_email)

print(new_emails)
# ['foo#bar.com', 'something#else.com']
If you want to use regex to do this, someone smarter than I will have to help.
I know I can do something like the following, but again, I'm hoping that there is already an existing package which could do this for me ...
#!/usr/bin/python3
import email.utils

def getaddrs(text):
    def _yieldaddrs(text):
        inquote = False
        curaddr = ''
        for x in text:
            if x == '"':
                inquote = not inquote
                curaddr += x
            elif x == ',':
                if inquote:
                    curaddr += x
                else:
                    yield(curaddr)
                    curaddr = ''
            else:
                curaddr += x
        if curaddr:
            yield(curaddr)
    return [email.utils.parseaddr(x) for x in _yieldaddrs(text)]

addrstring = '"abc" <foo#bar.com>, "www, xxyyzz" <something#else.com>'
print('{}'.format(getaddrs(addrstring)))

# Prints this ...
# [('abc', 'foo#bar.com'), ('www, xxyyzz', 'something#else.com')]

Delta words between two TXT files

I would like to count the delta of words between two files.
file_1.txt has the content: One file with some text and words.
file_2.txt has the content: One file with some text and additional words to be found.
The diff command on Unix systems gives the following info; difflib can give similar output.
$ diff file_1.txt file_2.txt
1c1
< One file with some text and words.
---
> One file with some text and additional words to be found.
Is there an easy way to find the words added or removed between two files, or at least between two lines, as git diff --word-diff does?
First of all, you need to read your files into strings with open(), where 'file_1.txt' is the path to your file and 'r' means read mode.
Do the same for the second file, and don't forget to close() your files when you're done!
Use the split(' ') method to split the strings you have just read into lists of words.
file_1 = open('file_1.txt', 'r')
text_1 = file_1.read().split(' ')
file_1.close()
file_2 = open('file_2.txt', 'r')
text_2 = file_2.read().split(' ')
file_2.close()
Next, you need to get the difference between the text_1 and text_2 lists.
There are many ways to do it.
1)
You can use the Counter class from the collections library.
Pass your lists to the class's constructor, find the difference by subtraction in both directions, call the elements() method to get the elements, and use list() to turn the result into a list.
from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)
difference = list((text_count_1 - text_count_2).elements()) + list((text_count_2 - text_count_1).elements())
Here is the way to calculate the delta words.
from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)
delta = len(list((text_count_2 - text_count_1).elements())) \
- len(list((text_count_1 - text_count_2).elements()))
print(delta)
2)
Use the Differ class from the difflib library. Pass both lists to the compare() method of Differ and then iterate over the result with a for loop.
from difflib import Differ

difference = []
for d in Differ().compare(text_1, text_2):
    difference.append(d)
Then you can count the delta words like this.
from difflib import Differ

delta = 0
for d in Differ().compare(text_1, text_2):
    status = d[0]
    if status == "+":
        delta += 1
    elif status == "-":
        delta -= 1

print(delta)
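If you also want output marked up like git diff --word-diff, rather than just a count, here is a rough sketch using difflib.SequenceMatcher; the {+...+} and [-...-] markers imitate git's word-diff notation:
from difflib import SequenceMatcher

def word_diff(old_line, new_line):
    old_words, new_words = old_line.split(), new_line.split()
    out = []
    matcher = SequenceMatcher(None, old_words, new_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            out.extend(old_words[i1:i2])
        else:
            if i1 != i2:
                out.append('[-' + ' '.join(old_words[i1:i2]) + '-]')
            if j1 != j2:
                out.append('{+' + ' '.join(new_words[j1:j2]) + '+}')
    return ' '.join(out)

print(word_diff('One file with some text and words.',
                'One file with some text and additional words to be found.'))
# One file with some text and [-words.-] {+additional words to be found.+}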
3)
You can write the difference function yourself. For example:
def get_diff(list_1, list_2):
    d = []
    for item in list_1:
        if item not in list_2:
            d.append(item)
    return d

difference = get_diff(text_1, text_2) + get_diff(text_2, text_1)
I think there are other ways to do this, but I will limit myself to these three.
Once you have the difference list, you can format the output however you wish.
...and here is yet another way to do this with dict():
#!/usr/bin/python3
import sys

def loadfile(filename):
    h = dict()
    f = open(filename)
    for line in f.readlines():
        words = line.split(' ')
        for word in words:
            h[word.strip()] = 1
    return h

first = loadfile(sys.argv[1])
second = loadfile(sys.argv[2])

print("in both first and second")
for k in first.keys():
    if k and k in second.keys():
        print(k)
