Extracting acronyms from each string in list index - python-3.x

I have a list of multi-word strings imported from a file (the other posts I found only dealt with single words or ints). I am having trouble using nested loops to split the words at each index into their own list and then take the first letter of each word to build an acronym.
I have tried picking apart each index and processing it in a second loop to get the first letter of each word, but the closest I got was pulling the first letter of each whole string in the original list.
text = (infile.read()).splitlines()
acronym = []
separator = "."
for i in range(len(text)):
    substring = [text[i]]
    for j in range(len(substring)):
        substring2 = [substring[j][:1]]
        acronym.append(substring2)
print("The Acronym is: ", separator.join(acronym))
Happy Path: The list of multi-word strings will be translated into acronyms, listed one per line.
Example of what should be output at the end: D.O.D. \n N.S.A. \n etc.
What's happened so far: Earlier I got it to take the first letter of the first word of every index at the sentence level, but I haven't figured out how to nest the loops to get to the individual words of each index.
Useful knowledge: the format after splitlines() (since people couldn't read this) is a list whose entries look like this: ['Department of Defense', 'National Security Agency', ...]

What you have is kind of a mess. If you are going to be re-using code, it is often better to just make it into a function. Try this out.
def get_acronym(the_string):
    # split() with no argument ignores repeated spaces, so word[0] never hits an empty string
    words = the_string.split()
    return_string = ""
    for word in words:
        return_string += word[0]
    return return_string

text = ['Department of Defense', 'National Security Agency']
for agency in text:
    print("The acronym is: " + get_acronym(agency))

I figured out how to do it from a file. File format was like this:
['This is Foo', 'Coming from Bar', 'Bring Your Own Device', 'Department of Defense']
So if this also helps anyone, enjoy~
# iname holds the path to the input file
with open(iname, 'r') as infile:
    text = infile.read().splitlines()
print("The Strings To Become Acronyms Are As Followed: \n", text, "\n")
acronyms = []
for string in text:
    words = string.split()
    letters = [word[0] for word in words]
    acronyms.append(".".join(letters).upper())
print("The Acronyms For These Strings Are: \n", acronyms)
This code outputs like this:
The Strings To Become Acronyms Are As Followed:
['This is Foo', 'Coming from Bar', 'Bring Your Own Device', 'Department of Defense']
The Acronyms For These Strings Are:
['T.I.F', 'C.F.B', 'B.Y.O.D', 'D.O.D']
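The question asked for a trailing dot (D.O.D.) and one acronym per line, which the output above does not quite produce. A minimal sketch of that variant follows; the function name acronym and the separator parameter are made up for illustration, and the list stands in for the lines read from the file.

```python
def acronym(phrase, separator="."):
    """Return the upper-cased first letters of each word, each followed by separator."""
    letters = [word[0].upper() for word in phrase.split()]
    # Appending the separator to every letter gives the trailing dot as well
    return "".join(letter + separator for letter in letters)


# Stand-in for the list produced by infile.read().splitlines()
lines = ["Department of Defense", "National Security Agency"]

for line in lines:
    print(acronym(line))
```

Because split() is called without an argument, blank lines simply yield an empty string instead of raising an IndexError.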

Related

How to search if every word in string starts with any of the word in list using python

I am trying to filter sentences in a pandas DataFrame with 50 million records using a keyword search. A sentence should be kept if any word in it starts with any of these keywords:
WordsToCheck=['hi','she', 'can']
text_string1="my name is handhit and cannary"
text_string2="she can play!"
If I do something like this:
if any(key in text_string1 for key in WordsToCheck):
    print(text_string1)
I get a false positive, because handhit contains hi in the last part of the word (hit).
How can I avoid all such false positives in my result set?
Secondly, is there a faster way to do this in Python? I am currently using the apply function.
I am following this link so that my question is not a duplicate: How to check if a string contains an element from a list in Python
If the case is important you can do something like this:
def any_word_starts_with_one_of(sentence, keywords):
    for kw in keywords:
        match_words = [word for word in sentence.split(" ") if word.startswith(kw)]
        if match_words:
            return kw
    return None

keywords = ["hi", "she", "can"]
sentences = ["Hi, this is the first sentence", "This is the second"]
for sentence in sentences:
    if any_word_starts_with_one_of(sentence, keywords):
        print(sentence)
If case is not important replace line 3 with something like this:
match_words = [word for word in sentence.split(" ") if word.lower().startswith(kw.lower())]
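Another option, sketched here as an alternative rather than as part of the answer above, is a single compiled regular expression anchored on a word boundary (\b), so a keyword only matches at the start of a word. The helper name build_prefix_pattern is hypothetical. Whether this is faster on 50 million rows would need measuring; the same pattern string could be tried with pandas' Series.str.contains instead of apply.

```python
import re


def build_prefix_pattern(keywords, ignore_case=False):
    """Compile a regex that matches any keyword at the start of a word."""
    # re.escape guards against keywords that contain regex metacharacters
    alternation = "|".join(re.escape(kw) for kw in keywords)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\b(?:" + alternation + ")", flags)


pattern = build_prefix_pattern(["hi", "she", "can"])

# "hi" sits in the middle of "handhit", so there is no word boundary before it
print(bool(pattern.search("handhit")))
# "she" and "can" both start a word here
print(bool(pattern.search("she can play!")))
```

Note that the original example sentence still matches, because cannary genuinely starts with can.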

How do I search for a substring in a string then find the character before the substring in python

I am making a small project in Python that lets you make notes and then read them using specific arguments. I attempted to write an if statement that checks whether the string has a comma in it. If it does, my Python file should find the comma, take the character right before it, and turn it into an integer so it can read out the notes the user created in a specific user-defined range.
If that didn't make sense, then basically I want to find out which line of code is causing this not to work and to return nothing even though notes.txt has content.
Here is what I have in my python file:
if "," not in no_cs: # no_cs is the string I am searching through
    user_out = int(no_cs[6:len(no_cs) - 1])
    notes = open("notes.txt", "r") # notes.txt is the file that stores all the notes the user makes
    notes_lines = notes.read().split("\n") # this is supposed to split all the notes into a list
    try:
        print(notes_lines[user_out])
    except IndexError:
        print("That line does not exist.")
    notes.close()
elif "," in no_cs:
    user_out_1 = int(no_cs.find(',') - 1)
    user_out_2 = int(no_cs.find(',') + 1)
    notes = open("notes.txt", "r")
    notes_lines = notes.read().split("\n")
    print(notes_lines[user_out_1:user_out_2]) # this is SUPPOSED to list all notes in a specific range but doesn't
    notes.close()
Now here is the notes.txt file:
note
note1
note2
note3
and lastly here is what I am getting in console when I attempt to run the program and type notes(0,2)
>>> notes(0,2)
jeffv : notes(0,2)
[]
A great way to do this is to use Python's str.partition() method. It splits a string at the first occurrence of a separator and returns a tuple of three parts: index 0 is the text before the separator, index 1 is the separator itself, and index 2 is the text after the separator:
# The whole string we wish to search.. Let's use a
# Monty Python quote since we are using Python :)
whole_string = ("We interrupt this program to annoy you and make things "
                "generally more irritating.")
# Here is the first word we wish to split from the entire string
first_split = 'program'
# now we use partition to pick what comes after the first split word
substring_split = whole_string.partition(first_split)[2]
# now we use python to give us the first character after that first split word
first_character = str(substring_split)[0]
# since the above is a space, let's also show the second character so
# that it is less confusing :)
second_character = str(substring_split)[1]
# Output
print("Here is the whole string we wish to split: " + whole_string)
print("Here is the first split word we want to find: " + first_split)
print("Now here is the first word that occurred after our split word: " + substring_split)
print("The first character after the substring split is: " + first_character)
print("The second character after the substring split is: " + second_character)
output
Here is the whole string we wish to split: We interrupt this program to annoy you and make things generally more irritating.
Here is the first split word we want to find: program
Now here is the first word that occurred after our split word: to annoy you and make things generally more irritating.
The first character after the substring split is:
The second character after the substring split is: t
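Applied to the notes question, partition can pull out the numbers on either side of the comma. If no_cs is exactly notes(0,2), then no_cs.find(',') returns 7, the position of the comma rather than the digit next to it, so the original slice becomes notes_lines[6:8] and is empty. The sketch below assumes the command always looks like notes(N) or notes(N,M) and that the end of the range is inclusive; parse_range is a hypothetical helper name.

```python
def parse_range(command):
    """Return (start, end) from a command such as 'notes(0,2)' or 'notes(3)'."""
    # Text between the parentheses, e.g. "0,2"
    inner = command[command.index("(") + 1:command.rindex(")")]
    start_text, separator, end_text = inner.partition(",")
    start = int(start_text)
    # Without a comma, partition leaves separator empty, so treat it as a single line
    end = int(end_text) if separator else start
    return start, end


# Stand-in for the lines read from notes.txt
notes_lines = ["note", "note1", "note2", "note3"]

start, end = parse_range("notes(0,2)")
selected = notes_lines[start:end + 1]  # assumption: the end index is inclusive
print(selected)
```

Unlike reading a single character next to the comma, this also handles multi-digit line numbers such as notes(10,12).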

How to find the number of common words in a text file and delete them in python?

The task is to:
First, find the total number of words in a text file.
Second, delete the common words such as a, an, and, to, in, at, but, ... (it is allowed to write a list of these words).
Third, find the number of remaining words (unique words).
Make a list of them.
The file name should be used as the parameter of the function.
I have done the first part of the question:
import re
file = open('text.txt', 'r', encoding = 'latin-1')
word_list = file.read().split()
for x in word_list:
    print(x)
res = len(word_list)
print ('The number of words in the text:' + str(res))
def uncommonWords (file):
    uncommonwords = (list(file))
    for i in uncommonwords:
        i += 1
        print (i)
The code prints everything up to the number of words, and nothing appears after that.
You can do it like this:
# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])
# set to collect unique words
words_in_file = set()
with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)
# remove common words from word list
unique_words = words_in_file - stop_words
print(list(unique_words))
First, you may want to get rid of punctuation. As shown in this answer, you can do the following, where text is your list of words:
nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if nonPunct.match(w)]
Then you could do:
from collections import Counter
counts = Counter(filtered)
You can then access the list of unique words with list(counts.keys()), and you can choose to ignore the words you don't want (common_words being your own list) with:
[word for word in list(counts.keys()) if word not in common_words]
Hope this answers your question.
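Putting the pieces together with the file name as the function parameter, as the task requires, might look like the sketch below. The names COMMON_WORDS, summarize_text and summarize_file are invented for illustration, the common-word list is only a sample, and lower-casing every word is an assumption about how words should be compared.

```python
import re
from collections import Counter

# Sample list of common words; extend it as the task allows
COMMON_WORDS = {"a", "an", "and", "to", "in", "at", "but", "the", "is"}


def summarize_text(text, common_words=COMMON_WORDS):
    """Return (total word count, sorted list of remaining unique words)."""
    # findall keeps runs of letters, digits and apostrophes, dropping punctuation
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(word for word in words if word not in common_words)
    return len(words), sorted(counts)


def summarize_file(filename, common_words=COMMON_WORDS):
    """Same as summarize_text, but reads the text from filename."""
    with open(filename, encoding="latin-1") as text_file:
        return summarize_text(text_file.read(), common_words)


total, remaining = summarize_text("The cat and the dog. A cat!")
print("The number of words in the text:", total)
print("Remaining unique words:", remaining, "count:", len(remaining))
```

Calling summarize_file('text.txt') would apply the same steps to the file from the question.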

How can I remove all POS tags except for 'VBD' and 'VBN' from my CSV file?

I want to remove words tagged with the specific part-of-speech tags VBD and VBN from my CSV file. But, I'm getting the error "IndexError: list index out of range" after entering the following code:
for word in POS_tag_text_clean:
    if word[1] !='VBD' and word[1] !='VBN':
        words.append(word[0])
My CSV file has 10 reviews from 10 people, and the column name is Comment.
Here is my full code:
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

df_Comment = pd.read_csv("myfile.csv")

def clean(text):
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()
    tagged = nltk.pos_tag(text)
    text = text.rstrip()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

text_clean = []
for text in df_Comment['Comment']:
    text_clean.append(clean(text).split())
print(text_clean)
POS_tag_text_clean = [nltk.pos_tag(t) for t in text_clean]
print(POS_tag_text_clean)
words=[]
for word in POS_tag_text_clean:
    if word[1] !='VBD' and word[1] !='VBN':
        words.append(word[0])
How can I fix the error?
It is a bit hard to understand your problem without an example and the corresponding outputs, but it might be this:
Assuming that text is a string, text_clean will be a list of lists of strings, where every string represents a word. After the part-of-speech tagging, POS_tag_text_clean will therefore be a list of lists of tuples, each tuple containing a word and its tag.
If I'm right, then your last loop actually loops over items from your dataframe instead of words, as the name of the variable suggests. If an item has only one word (which is not so unlikely, since you filter a lot in clean()), your call to word[1] will fail with an error similar to the one you report.
Instead, try this code:
words = []
for item in POS_tag_text_clean:
    words_in_item = []
    for word in item:
        if word[1] !='VBD' and word[1] !='VBN':
            words_in_item.append(word[0])
    words.append(words_in_item)
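The nested structure is easier to see on a small hand-written example. The sketch below does not call nltk at all: tagged_reviews is made-up data shaped like the list of lists of (word, tag) tuples described above, and drop_tags is a hypothetical helper that applies the same filter as a nested list comprehension.

```python
# Hand-written (word, tag) pairs standing in for the nltk.pos_tag output
tagged_reviews = [
    [("food", "NN"), ("arrived", "VBD"), ("cold", "JJ")],
    [("liked", "VBD")],  # a one-word review, the case that broke word[1]
    [],                  # a review that was emptied by the cleaning step
]

EXCLUDED_TAGS = {"VBD", "VBN"}


def drop_tags(tagged_items, excluded=EXCLUDED_TAGS):
    """Keep the words of each review whose tag is not in excluded."""
    return [
        [word for word, tag in item if tag not in excluded]
        for item in tagged_items
    ]


filtered = drop_tags(tagged_reviews)
print(filtered)
```

Unpacking each tuple as word, tag in the inner loop avoids indexing into a review by mistake, which is what raised the IndexError.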

Expected str instance, int found. How do I change an int to str to make this code work?

I'm trying to write code that analyses a sentence containing multiple words and no punctuation. I need it to identify the individual words in the entered sentence and store them in a list. My example sentence is 'ask not what your country can do for you ask what you can do for your country'. I then need the original position of each word to be written to a text file. This is my current code, with parts taken from other questions I've found, but I just can't get it to work:
myFile = open("cat2numbers.txt", "wt")
list = [] # An empty list
sentence = "" # Sentence is equal to the sentence that will be entered
print("Writing to the file: ", myFile) # Telling the user what file they will be writing to
sentence = input("Please enter a sentence without punctuation ") # Asking the user to enter a sentence
sentence = sentence.lower() # Turns everything entered into lower case
words = sentence.split() # Splitting the sentence into single words
positions = [words.index(word) + 1 for word in words]
for i in range(1,9):
    s = repr(i)
    print("The positions are being written to the file")
    d = ', '.join(positions)
    myFile.write(positions) # write the places to myFile
    myFile.write("\n")
    myFile.close() # closes myFile
print("The positions are now in the file")
The error I've been getting is TypeError: sequence item 0: expected str instance, int found. Could someone please help me? It would be much appreciated.
The error stems from .join: you're joining a list of ints, and str.join only accepts strings.
So the simple fix would be using:
d = ", ".join(map(str, positions))
which maps the str function on all the elements of the positions list and turns them to strings before joining.
That won't solve all your problems, though. You have used a for loop for some reason, and inside it you .close the file after writing. In subsequent iterations you'll get an error for attempting to write to a file that has been closed.
There are other things: list = [] is unnecessary, and using the name list should be avoided because it shadows the built-in; the initialization of sentence is unnecessary too. Additionally, if you want to ask for 8 sentences (the for loop), put your loop before doing your work.
All in all, try something like this:
with open("cat2numbers.txt", "wt") as f:
    print("Writing to the file: ", f.name) # Telling the user what file they will be writing to
    for i in range(8): # ask for 8 sentences, as the original range(1,9) did
        sentence = input("Please enter a sentence without punctuation ").lower() # Asking the user to enter a sentence
        words = sentence.split() # Splitting the sentence into single words
        positions = [words.index(word) + 1 for word in words]
        f.write(", ".join(map(str, positions))) # write the places to the file
        f.write("\n")
print("The positions are now in the file")
This uses the with statement, which handles closing the file for you behind the scenes.
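The position logic can be checked on its own, without a file or input(). The sketch below swaps words.index for a dictionary that remembers where each word was first seen, which gives the same numbers with a single pass over the words; position_line is a hypothetical helper name.

```python
def position_line(sentence):
    """Return the 1-based first-occurrence position of every word, comma-separated."""
    words = sentence.lower().split()
    first_seen = {}
    for index, word in enumerate(words, start=1):
        # setdefault only stores the position the first time a word appears
        first_seen.setdefault(word, index)
    # Positions are ints, so convert each one before joining
    return ", ".join(str(first_seen[word]) for word in words)


example = "ask not what your country can do for you ask what you can do for your country"
print(position_line(example))
```

The returned string can be passed straight to f.write, followed by a newline.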
As I see it, in the for loop you write to the file, then close it, and then write to the closed file again. Couldn't this be the problem?
