Python program that extracts most frequently found names from .csv file - python-3.x

I have created a program that generates 5000 random names, SSNs, cities, addresses, and emails and stores them in a fakeprofile.csv file. I am trying to extract the most common names from the file. The program runs without errors, but it fails to extract the most frequent names.
Here's the code:
import re
import statistics
file_open = open('fakeprofile.csv').read()
frequent_names = re.findall('[A-Z][a-z]*', file_open)
print(frequent_names)
Sample in the file:
Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225
Reynoldsmouth, VA 72465 stevenserin#stein.biz
Nicole Duffy 212-38-9009 West Timothy 51077 Phillips Ports Apt. 314
Hubbardville, IN 06723 kaitlinthomas#bennett-carter.com
Stephanie Lewis 442-20-1279 Jacquelineshire 650 Gutierrez Forge Apt. 839
West Christianbury, TN 13654 ukelley#gmail.com
Michael Harris 108-81-3733 East Toddberg 14387 Douglas Mission Suite 038
Garciaview, WI 58624 kshields#yahoo.com
Aaron Moreno 171-30-7715 Port Taraburgh 56672 Wagner Path
Lake Christopher, VA 37884 lucasscott#nguyen.info
Alicia Zimmerman 286-88-9507 Barberstad 5365 Heath Extensions Apt. 731
South Randyburgh, NJ 79367 daniellewebb#yahoo.com
Brittney Mcmillan 334-44-0321 Lisahaven PSC 3856, Box 2428
APO AE 03215 kevin95#hotmail.com
Amanda Perkins 327-31-6610 Perryville 8750 Hurst Harbor Apt. 929
Sample output:
', 'Lake', 'Brianna', 'P', 'A', 'Michael', 'Smith', 'Harveymouth', 'Patricia', 'Tunnel', 'West', 'William', 'G', 'A', 'Charles', 'Perkins', 'Lake', 'Marie', 'Lisa', 'Overpass', 'Suite', 'Kennedymouth', 'C', 'A', 'Barbara', 'Perez', 'Billyshire', 'Joshua', 'Village', 'Cindymouth', 'W', 'I', 'Curtis', 'Simmons', 'North', 'Mitchellport', 'Gordon', 'Crest', 'Suite', 'Jacksonburgh', 'C', 'O', 'Cameron', 'Berg', 'South', 'Dean', 'Christina', 'Coves', 'Williamton', 'T', 'N', 'Maria', 'Williams', 'North', 'Judith', 'Carson', 'Overpass', 'Apt', 'West', 'Amandastad', 'N', 'M', 'Hannah', 'Dennis', 'Rodriguezmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Laura', 'Richardson', 'Lake', 'Kayla', 'Johnson', 'Place', 'Suite', 'Port', 'Jennifermouth', 'N', 'H', 'John', 'Lawson', 'Hintonhaven', 'Thomas', 'Via', 'Mossport', 'N', 'J', 'Jennifer', 'Hill', 'East', 'Phillip', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Cody', 'Jackson', 'Lake', 'Jessicamouth', 'Snyder', 'Ways', 'Apt', 'New', 'Stacey', 'M', 'E', 'Ryan', 'Friedman', 'Shahburgh', 'Jerry', 'Pike', 'Suite', 'Toddfort', 'N', 'V', 'Kathleen', 'Fox', 'Ferrellmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'P', 'Michael', 'Thompson', 'Port', 'Jessica', 'Boone', 'Spurs', 'Suite', 'Port', 'Ashleyland', 'C', 'O', 'Christopher', 'Marsh', 'North', 'Catherine', 'Scott', 'Trail', 'Apt', 'Baileyburgh', 'F', 'L', 'Richard', 'Rangel', 'New', 'Anna', 'Ray', 'Drive', 'Apt', 'Nunezland', 'I', 'A', 'Connor', 'Stanton', 'Troyshire', 'Rodgers', 'Hill', 'West', 'Annmouth', 'N', 'H', 'James', 'Medina',
My issue is that I cannot extract the most frequently found first names while skipping the stray capital letters. Instead, I extract every capitalized word (including the unnecessary single capital letters); the output above is a small sample of everything extracted. I noticed that the first names are always on the odd rows, and I am trying to capture the most frequent first names from those odd rows.
The fakeprofile.csv file was created by this program:
import csv
import faker
from faker import Faker
fake = Faker()
name = fake.name(); print(name)
ssn = fake.ssn(); print(ssn)
city = fake.city(); print(city)
address = fake.address(); print(address)
email = fake.email(); print(email)
profile = fake.simple_profile()
for i,j in profile.items():
    print('{}: {}'.format(i,j))
print('Name: {}, SSN: {}, City: {}, Address: {}, Email: {}'.format(name,ssn,city,address,email))
with open('fakeprofile.csv', 'w') as f:
    for i in range(0,5001):
        print(f'{fake.name()} {fake.ssn()} {fake.city()} {fake.address()} {fake.email()}', file=f)

Does this achieve what you want?
import collections, re
# Read in all lines into a list
with open('fakeprofile.csv') as f:
    lines = f.readlines()
# Throw out every other line
lines = [line for i, line in enumerate(lines) if i%2 == 0]
# Keep only first word of each line
names = [line.split()[0] for line in lines]
# Find most common names
n = 3
frequent_names = collections.Counter(names).most_common(n)
# Display most common names
for name, count in frequent_names:
    print(name, count)
To do the counting it uses collections.Counter together with its most_common() method.
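As a quick check of the approach above, here is a minimal sketch that runs the same steps on a few in-memory lines instead of fakeprofile.csv. The sample_lines list is copied from the sample in the question and assumes each record spans exactly two lines, as it does there.

```python
import collections

# Sample lines from the question: each record spans two lines
sample_lines = [
    "Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225",
    "Reynoldsmouth, VA 72465 stevenserin#stein.biz",
    "Nicole Duffy 212-38-9009 West Timothy 51077 Phillips Ports Apt. 314",
    "Hubbardville, IN 06723 kaitlinthomas#bennett-carter.com",
    "Alicia Zimmerman 286-88-9507 Barberstad 5365 Heath Extensions Apt. 731",
    "South Randyburgh, NJ 79367 daniellewebb#yahoo.com",
]

# Keep every other line (the ones that start a record), then take the first word
first_names = [line.split()[0] for line in sample_lines[::2]]

# Count the first names and keep the two most common ones
top_names = collections.Counter(first_names).most_common(2)
print(top_names)
```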

I think it would be better to use the pandas library for the CSV manipulation (collecting the desired information) and then apply a Python collection such as Counter(df['name']) to it. Otherwise, could you give us more information about the CSV file?
Thank you.
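On the file format: the sample suggests the file is space-separated text rather than a true CSV, and each address contains a line break. A minimal sketch, assuming you can change the generator script, is to write one properly quoted row per profile and then count the first names from a name column. It uses the standard csv module instead of pandas to avoid a dependency. The rows are taken from the sample in the question, the column names are assumptions, and an in-memory buffer stands in for the file.

```python
import collections
import csv
import io

# Rows copied from the question's sample; each address contains a newline
rows = [
    ("Alicia Walters", "419-52-4141", "Yorkstad",
     "66616 Schultz Extensions Suite 225\nReynoldsmouth, VA 72465",
     "stevenserin#stein.biz"),
    ("Nicole Duffy", "212-38-9009", "West Timothy",
     "51077 Phillips Ports Apt. 314\nHubbardville, IN 06723",
     "kaitlinthomas#bennett-carter.com"),
    ("Alicia Zimmerman", "286-88-9507", "Barberstad",
     "5365 Heath Extensions Apt. 731\nSouth Randyburgh, NJ 79367",
     "daniellewebb#yahoo.com"),
]

# The buffer stands in for open('fakeprofile.csv', 'w', newline='')
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["name", "ssn", "city", "address", "email"])  # assumed header
writer.writerows(rows)  # fields with embedded newlines are quoted automatically

# Read the records back; each profile is one record despite the newline
buffer.seek(0)
reader = csv.DictReader(buffer)
first_names = [record["name"].split()[0] for record in reader]

counts = collections.Counter(first_names)
print(counts.most_common(1))
```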

So the main problem is that your regexp captures every capitalized word, including single capital letters.
You are only interested in the first word on each odd line.
You can do something along these lines:
# either use a dict to count or a list to transform as counter.
dico_count = {}
with open('fakeprofile.csv') as file_open: # use of context manager
    # enumerate keeps line_number in step with the loop
    for line_number, line in enumerate(file_open, start=1): # iterates all the lines
        if line_number % 2 != 0: # odd line
            spt = line.strip().split()
            dico_count[spt[0]] = dico_count.get(spt[0], 0) + 1
# (name, count) pairs sorted by count, highest first
frequent_name_counter = sorted(dico_count.items(), key=lambda x: x[1], reverse=True)

Related

Remove redundant sublists within list in python

Hello everyone, I have a list of lists such as:
list_of_values=[['A','B'],['A','B','C'],['D','E'],['A','C'],['I','J','K','L','M'],['J','M']]
and I would like to keep, within that list, only the sublists that are not contained in a longer sublist.
For instance, in sublist1 ['A','B'], A and B are also present in sublist2 ['A','B','C'], so I remove sublist1.
The same goes for sublist4.
Sublist6 is also removed because J and M are present in the longer sublist5.
At the end I should get:
list_of_no_redundant_values=[['A','B','C'],['D','E'],['I','J','K','L','M']]
Another example:
list_of_values=[['A','B'],['A','B','C'],['B','E'],['A','C'],['I','J','K','L','M'],['J','M']]
Expected output:
[['A','B','C'],['B','E'],['I','J','K','L','M']]
Does anyone have an idea?
mylist=[['A','B'],['A','C'],['A','B','C'],['D','E'],['I','J','K','L','M'],['J','M']]
def remove_subsets(lists):
    outlists = lists[:]
    for s1 in lists:
        for s2 in lists:
            if set(s1).issubset(set(s2)) and (s1 is not s2):
                outlists.remove(s1)
                break
    return outlists
print(remove_subsets(mylist))
This should result in [['A', 'B', 'C'], ['D', 'E'], ['I', 'J', 'K', 'L', 'M']]
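One caveat: because the function above compares sublists by identity, two equal sublists would each count as a subset of the other and both would be removed. A minimal alternative sketch, assuming the items are hashable and that a sublist should be dropped only when it is a proper subset of another one (the function name is illustrative):

```python
def remove_proper_subsets(lists):
    """Keep only sublists that are not a proper subset of another sublist."""
    sets = [set(sub) for sub in lists]
    # `a < b` on sets is True only when a is a proper subset of b
    return [sub for sub, s in zip(lists, sets)
            if not any(s < other for other in sets)]

# The two examples from the question
first = [['A', 'B'], ['A', 'B', 'C'], ['D', 'E'], ['A', 'C'],
         ['I', 'J', 'K', 'L', 'M'], ['J', 'M']]
second = [['A', 'B'], ['A', 'B', 'C'], ['B', 'E'], ['A', 'C'],
          ['I', 'J', 'K', 'L', 'M'], ['J', 'M']]

print(remove_proper_subsets(first))
print(remove_proper_subsets(second))
```

This builds a new list instead of removing from a copy, so the original order is preserved and duplicates are kept.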

Generate custom alpha numeric sequence

I am trying to generate custom alpha numeric sequence.
The sequence would be like this :
AA0...AA9 AB0...AB9 AC0...AC9..and so on..
In short, there are 3 places to fill..
On the first place, the values can go from A to Z.
On the second place, the values can go from A to Z.
On the last place, the value can go from 0 to 9.
Code :
s= list('AA0')
for i in range(26):
    for j in range(26):
        for k in range(10):
            if k<10:
                print(s[0]+s[1]+str(k))
        s[1]= chr(ord(s[1])+1)
    s[0]= chr(ord(s[0])+1)
I was able to generate the sequence up to AZ9, but after that I get the sequence below.
It should be BA0...BZ9.
B[0
B[1
B[2
B[3
B[4
B[5
B[6
B[7
B[8
B[9
B\0
B\1
B\2
B\3
B\4
B\5
B\6
This is a way to do just that:
from itertools import product
from string import ascii_uppercase, digits
for a, b, d in product(ascii_uppercase, ascii_uppercase, digits):
    print(f'{a}{b}{d}')
string.ascii_uppercase is just 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', string.digits is '0123456789', and itertools.product then iterates over all combinations.
Instead of digits you could use range(10) just as well.
You can use itertools.product:
>>> import itertools
>>> letters = [chr(x) for x in range(ord('A'), ord('Z')+1)]
>>> letters
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
>>> combinations = ["".join(map(str, x)) for x in itertools.product(letters, letters, range(10))]
>>> combinations
['AA0', 'AA1', 'AA2', 'AA3', 'AA4', 'AA5', 'AA6', 'AA7', 'AA8', 'AA9', 'AB0', 'AB1', 'AB2', 'AB3', 'AB4', 'AB5', 'AB6', 'AB7', 'AB8', 'AB9', 'AC0', 'AC1', 'AC2', 'AC3', 'AC4', 'AC5', 'AC6', 'AC7', 'AC8', 'AC9', 'AD0', 'AD1', 'AD2', 'AD3', 'AD4', 'AD5', 'AD6', 'AD7', 'AD8', 'AD9', 'AE0', 'AE1', 'AE2', 'AE3', 'AE4', 'AE5', 'AE6', 'AE7', 'AE8', 'AE9', 'AF0', 'AF1', 'AF2', 'AF3', 'AF4', 'AF5', 'AF6', 'AF7', 'AF8', 'AF9', 'AG0', 'AG1', 'AG2', 'AG3', 'AG4', 'AG5', 'AG6', 'AG7', 'AG8', 'AG9', 'AH0', 'AH1', 'AH2', 'AH3', 'AH4', 'AH5', 'AH6', 'AH7', 'AH8', 'AH9', 'AI0', 'AI1', 'AI2', 'AI3', 'AI4', 'AI5', 'AI6', 'AI7', 'AI8', 'AI9', 'AJ0', 'AJ1', 'AJ2', 'AJ3', 'AJ4', 'AJ5', 'AJ6', 'AJ7', 'AJ8', 'AJ9', 'AK0'...]
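For comparison, the loop in the question fails because s[1] is never reset to 'A' after it reaches 'Z', so chr() moves on to '[' and '\'. A minimal sketch of the same nested-loop idea that iterates over the letters directly instead of mutating a list (the function name is only illustrative):

```python
from string import ascii_uppercase

def generate_codes():
    """Yield 'AA0', 'AA1', ..., 'ZZ9' in order."""
    for first in ascii_uppercase:
        for second in ascii_uppercase:
            for digit in range(10):
                yield f'{first}{second}{digit}'

codes = list(generate_codes())
# Show the items around the point where the original loop went wrong
print(codes[258:262])
```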

Why Gensim most similar in doc2vec gives the same vector as the output?

I am using the following code to get the ordered list of user posts.
model = doc2vec.Doc2Vec.load(doc2vec_model_name)
doc_vectors = model.docvecs.doctag_syn0
doc_tags = model.docvecs.offset2doctag
for w, sim in model.docvecs.most_similar(positive=[model.infer_vector('phone_comments')], topn=4000):
    print(w, sim)
    fw.write(w)
    fw.write(" (")
    fw.write(str(sim))
    fw.write(")")
    fw.write("\n")
fw.close()
However, I am also getting the vector "phone comments" (which I use to find the nearest neighbours) in about 6th place of the list. Is there a mistake in my code, or is it an issue in Gensim (because a vector cannot be a neighbour of itself)?
EDIT
Doc2vec model training code
######Preprocessing
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for key, value in my_d.items():
    value = re.sub("[^1-9a-zA-Z]"," ", value)
    words = value.lower().split()
    tags = key.replace(' ', '_')
    docs.append(analyzedDocument(words, tags.split(' ')))
sentences = [] # Initialize an empty list of sentences
######Get n-grams
#Get list of lists of tokenised words. 1 sentence = 1 list
for item in docs:
    sentences.append(item.words)
#identify bigrams and trigrams (trigram_sentences_project)
trigram_sentences_project = []
bigram = Phrases(sentences, min_count=5, delimiter=b' ')
trigram = Phrases(bigram[sentences], min_count=5, delimiter=b' ')
for sent in sentences:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
    trigram_sentences_project.append(trigrams_)
paper_count = 0
for item in trigram_sentences_project:
    docs[paper_count] = docs[paper_count]._replace(words=item)
    paper_count = paper_count+1
# Train model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 5, workers = 4, iter = 20)
#Save the trained model for later use to take the similarity values
model_name = user_defined_doc2vec_model_name
model.save(model_name)
The infer_vector() method expects a list-of-tokens, just like the words property of the text examples (TaggedDocument objects, usually) that were used to train the model.
You're supplying a simple string, 'phone_comments', which will look to infer_vector() like the list ['p', 'h', 'o', 'n', 'e', '_', 'c', 'o', 'm', 'm', 'e', 'n', 't', 's']. Thus your origin vector for the most_similar() is probably garbage.
Further, you're not getting back the input 'phone_comments', you're getting back the different string 'phone comments'. If that's a tag-name in the model, then that must have been a supplied tag during model training. Its superficial similarity to phone_comments may be meaningless - they're different strings.
(But it may also hint that your training had problems, too, and trained the text that should have been words=['phone', 'comments'] as words=['p', 'h', 'o', 'n', 'e', ' ', 'c', 'o', 'm', 'm', 'e', 'n', 't', 's'] instead.)
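To illustrate the point about tokens without needing a trained model, here is a minimal sketch in plain Python. It only shows how a bare string looks when iterated as a sequence, and how to build a token list that mirrors the preprocessing in the question's training code (strip unwanted characters, lowercase, split on whitespace). The sample text is an assumption.

```python
import re

text = 'phone comments'  # assumed sample text

# Iterating a string yields single characters, not words
as_characters = list(text)

# Mirror the question's preprocessing: strip, lowercase, split into words
tokens = re.sub("[^1-9a-zA-Z]", " ", text).lower().split()

print(as_characters)
print(tokens)
# A list like `tokens` is what would be passed to model.infer_vector(...)
```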

Is there any way to force ipython to interpret utf-8 symbols?

I'm using ipython notebook.
What I want to do is search a literal string for any Spanish accented letters (ñ,á,é,í,ó,ú,Ñ,Á,É,Í,Ó,Ú) and change them to their closest representation in the English alphabet.
I decided to write down a simple function and give it a go:
def remove_accent(n):
    listn = list(n)
    for i in range(len(listn)):
        if listn[i] == 'ó':
            listn[i] = 'o'
    return listn
It seemed simple: check whether the accented character is there and change it to its closest representation. So I went ahead and tested it, getting the following output:
in []: remove_accent('whatever !## ó')
out[]: ['w',
'h',
'a',
't',
'e',
'v',
'e',
'r',
' ',
'!',
'#',
'#',
' ',
'\xc3',
'\xb3']
I've tried to change the default encoding from ASCII (I presume, since I'm getting two positions for the accented character instead of one: '\xc3', '\xb3') to UTF-8, but this didn't work. What I would like to get is:
in []: remove_accent('whatever !## ó')
out[]: ['w',
'h',
'a',
't',
'e',
'v',
'e',
'r',
' ',
'!',
'#',
'#',
' ',
'o']
P.S.: this wouldn't be so bad if the accented character yielded just one position instead of two; I would just need to change the if condition, but I haven't found a way to do that either.
Your problem is that you are getting two bytes for the 'ó' character instead of one character. Therefore, try converting the string to unicode first (this is Python 2 code), so that every character occupies a single position, as follows:
def remove_accent(n):
    n_unicode=unicode(n,"UTF-8")
    listn = list(n_unicode)
    for i in range(len(listn)):
        if listn[i] == u'ó':
            listn[i] = 'o'.encode('utf-8')
        else:
            listn[i]=listn[i].encode('utf-8')
    return listn
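On Python 3, where str is already Unicode, a more general sketch is to decompose each character and drop the combining marks with the standard unicodedata module. This is an alternative technique, not the approach of the answer above. It assumes Python 3 and that stripping all combining marks (not only the Spanish ones) is acceptable. It returns a string rather than a list.

```python
import unicodedata

def remove_accents(text):
    """Return text with accents stripped, e.g. 'ó' -> 'o', 'Ñ' -> 'N'."""
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks and keep everything else
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_accents('whatever !## ó'))
print(remove_accents('ñáéíóúÑÁÉÍÓÚ'))
```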

Musical note string (C#-4, F-3, etc.) to MIDI note value, in Python

The code in my answer below converts musical notes in strings, such as C#-4 or F-3, to their corresponding MIDI note values.
I am posting this because I am tired of trying to dig it up online every time I need it. I'm sure I'm not the only one who can find a use for it. I just wrote this up; it is tested and correct. It's in Python, but I feel that it is pretty close to universally understandable.
#Input is string in the form C#-4, Db-4, or F-3. If your implementation doesn't use the hyphen,
#just replace the line :
#    letter = midstr.split('-')[0].capitalize()
#with:
#    letter = midstr[:-1]
def MidiStringToInt(midstr):
    Notes = [["C"],["C#","Db"],["D"],["D#","Eb"],["E"],["F"],["F#","Gb"],["G"],["G#","Ab"],["A"],["A#","Bb"],["B"]]
    answer = 0
    i = 0
    #Note
    # capitalize() keeps flats such as "Db" matchable; upper() would turn them into "DB"
    letter = midstr.split('-')[0].capitalize()
    for note in Notes:
        for form in note:
            if letter == form:
                answer = i
                break
        i += 1
    #Octave
    answer += (int(midstr[-1]))*12
    return answer
NOTES_FLAT = ['C', 'Db', 'D', 'Eb', 'E', 'F', 'Gb', 'G', 'Ab', 'A', 'Bb', 'B']
NOTES_SHARP = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
def NoteToMidi(KeyOctave):
    # KeyOctave is formatted like 'C#3'
    key = KeyOctave[:-1] # eg C, Db
    octave = KeyOctave[-1] # eg 3, 4
    answer = -1
    try:
        if 'b' in key:
            pos = NOTES_FLAT.index(key)
        else:
            pos = NOTES_SHARP.index(key)
    except ValueError: # list.index raises ValueError for an unknown key
        print('The key is not valid', key)
        return answer
    answer += pos + 12 * (int(octave) + 1) + 1
    return answer
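The two functions above use different octave offsets: the first maps C-4 to 48, while the second maps C4 to 60. A minimal sketch that makes the convention an explicit parameter, assuming input of the form 'C#4' or 'Db4' (no hyphen separator) and defaulting to C4 = 60. The function and parameter names are illustrative only.

```python
import re

# Semitone offset of each natural note within an octave
SEMITONES = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def note_to_midi(note, middle_c_octave=4):
    """Convert e.g. 'C#4' or 'Db4' to a MIDI number.

    middle_c_octave is the octave number whose C maps to MIDI note 60.
    """
    match = re.fullmatch(r'([A-Ga-g])([#b]?)(-?\d+)', note)
    if match is None:
        raise ValueError(f'Not a valid note: {note!r}')
    letter, accidental, octave = match.groups()
    shift = {'#': 1, 'b': -1, '': 0}[accidental]
    return SEMITONES[letter.upper()] + shift + 12 * (int(octave) - middle_c_octave) + 60

print(note_to_midi('C4'), note_to_midi('A4'), note_to_midi('Db4'))
```

Passing middle_c_octave=5 reproduces the first function's numbering, where C in octave 4 is 48.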
