I am working on some file operations with Python.
I have two text files. The first contains lines of bigram word-embedding results, such as apple_pie 0.3434 0.6767 0.2312. The second contains the corresponding unigram results; for apple_pie it has apple 0.2334 0.3412 0.123 pie 0.976 0.75654 0.2312.
I want to append the apple and pie unigram results to the apple_pie bigram results, so each output line becomes something like:
apple_pie 0.3434 0.6767 0.2312 0.2334 0.3412 0.123 0.976 0.75654 0.2312. Does anybody know how to do this? Thanks...
bigram = open("bigram.txt", 'r')
unigram = open("unigram.txt", 'r')
combine = open("combine.txt", 'w')

bigram_lines = bigram.readlines()
unigram_lines = unigram.readlines()

iteration = 0
while iteration < len(bigram_lines):
    # Split each line into numeric tokens (those containing '.') and
    # word tokens; strip() drops the trailing newline so it does not
    # end up glued to the last number.
    num_list_bigram = []
    text_list_bigram = []
    for item in bigram_lines[iteration].strip().split(" "):
        if "." in item:
            num_list_bigram.append(item)
        else:
            text_list_bigram.append(item)
    num_list_unigram = []
    text_list_unigram = []
    for item in unigram_lines[iteration].strip().split(" "):
        if "." in item:
            num_list_unigram.append(item)
        else:
            text_list_unigram.append(item)
    iteration += 1
    # Bigram word(s) first, then bigram numbers, then unigram numbers.
    com_list = text_list_bigram + num_list_bigram + num_list_unigram
    for item in com_list:
        combine.write(item + " ")
    combine.write("\n")

bigram.close()
unigram.close()
combine.close()
Hopefully this will work for you
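A more compact equivalent is sketched below; it keeps the same assumption that line i of bigram.txt pairs with line i of unigram.txt, but uses with so the files are closed automatically and zip to walk both files together:

# Same logic as above: numeric tokens (those containing '.') are kept
# from the unigram line, and the whole bigram line is copied through.
with open("bigram.txt") as bigram, open("unigram.txt") as unigram, \
        open("combine.txt", "w") as combine:
    for b_line, u_line in zip(bigram, unigram):
        b_tokens = b_line.split()
        u_nums = [t for t in u_line.split() if "." in t]
        combine.write(" ".join(b_tokens + u_nums) + "\n")

For lines in the word-then-numbers format shown in the question, this produces the same output, minus the trailing space.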
I am trying to filter sentences from my pandas DataFrame, which has 50 million records, using a keyword search: a sentence should match if any word in it starts with one of these keywords.

WordsToCheck = ['hi', 'she', 'can']
text_string1 = "my name is handhit and cannary"
text_string2 = "she can play!"

If I do something like this:

if any(key in text_string1 for key in WordsToCheck):
    print(text_string1)

I get a false positive, because 'hi' matches inside 'handhit'.
How can I smartly avoid all such false positives in my result set?
Secondly, is there a faster way to do this in Python? I am currently using the apply function.
I have already read this question, so mine is not a duplicate: How to check if a string contains an element from a list in Python
If case matters, you can do something like this:

def any_word_starts_with_one_of(sentence, keywords):
    for kw in keywords:
        # Collect the words in the sentence that start with this keyword.
        match_words = [word for word in sentence.split(" ") if word.startswith(kw)]
        if match_words:
            return kw
    return None

keywords = ["hi", "she", "can"]
sentences = ["Hi, this is the first sentence", "This is the second"]
for sentence in sentences:
    if any_word_starts_with_one_of(sentence, keywords):
        print(sentence)
If case is not important, replace the match_words line with something like this:

match_words = [word for word in sentence.split(" ") if word.lower().startswith(kw.lower())]
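On the speed question: for 50 million rows, a vectorized regex via Series.str.contains is usually much faster than a Python-level apply. A minimal sketch, assuming a hypothetical column name sentence (adjust to your DataFrame); re.escape guards against any regex metacharacters in the keywords:

import re
import pandas as pd

WordsToCheck = ['hi', 'she', 'can']
df = pd.DataFrame({"sentence": ["my name is handhit and cannary", "she can play!"]})

# \b anchors each keyword to the start of a word, so 'hi' no longer
# matches inside 'handhit'; the scan runs in compiled code rather than
# calling a Python function per row.
pattern = r'\b(?:' + '|'.join(re.escape(kw) for kw in WordsToCheck) + ')'
filtered = df[df["sentence"].str.contains(pattern, case=False, regex=True, na=False)]
print(filtered)

Note that both rows match here: 'cannary' legitimately starts with 'can', while 'handhit' no longer triggers a false positive on 'hi'.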
The question is:
Firstly, find the number of all words in a text file.
Secondly, delete the common words like a, an, and, to, in, at, but, ... (it is allowed to write a list of these words).
Thirdly, find the number of the remaining words (unique words) and make a list of them.
The file name should be used as the parameter of the function.
I have done the first part of the question:
import re

file = open('text.txt', 'r', encoding='latin-1')
word_list = file.read().split()
for x in word_list:
    print(x)
res = len(word_list)
print('The number of words in the text: ' + str(res))

def uncommonWords(file):
    uncommonwords = list(file)
    for i in uncommonwords:
        i += 1
        print(i)
The code prints everything up to the number of words, and nothing appears after that.
You can do it like this:

# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])

# set to collect unique words
words_in_file = set()

with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)

# remove common words from word list
unique_words = words_in_file - stop_words
print(list(unique_words))
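Since the assignment also asks for the total word count and requires the file name as the function's parameter, the same idea can be wrapped up as follows; a sketch only, where the stop-word list is just an example and the encoding is taken from the question:

def count_words(filename):
    # Common words to ignore; extend this set as needed.
    stop_words = {"a", "an", "and", "to", "in", "at", "but"}
    with open(filename, encoding="latin-1") as text_file:
        words = text_file.read().split()
    unique_words = set(words) - stop_words
    print("Total number of words:", len(words))
    print("Number of remaining (unique) words:", len(unique_words))
    return list(unique_words)

word_list = count_words("text.txt")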
First, you may want to get rid of punctuation: as shown in this answer, you could do:
nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if nonPunct.match(w)]
Then you could do:
from collections import Counter
counts = Counter(filtered)
You can then access the list of unique words with list(counts.keys()), and you can choose to ignore the words you don't want with:
[word for word in list(counts.keys()) if word not in common_words]
Hope this answers your question.
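Putting those pieces together into one runnable sketch (common_words here is only an example list; text.txt and the latin-1 encoding are taken from the question):

import re
from collections import Counter

with open('text.txt', encoding='latin-1') as f:
    text = f.read().split()

# Drop tokens that contain no letters or digits (pure punctuation).
nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if nonPunct.match(w)]

counts = Counter(filtered)
common_words = ['a', 'an', 'and', 'to', 'in', 'at', 'but']
uncommon = [word for word in counts if word not in common_words]
print(len(uncommon), uncommon)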
I am currently trying to compare two text files, to see if they have any words in common.
The text files are:
ENGLISH.TXT
circle
table
year
competition
FRENCH.TXT
bien
competition
merci
air
table
My current code gets them to print, and I've removed all the unnecessary curly brackets and so on, but I can't get the results to print on different lines.
List = open("english.txt").readlines()
List2 = open("french.txt").readlines()
anb = set(List) & set(List2)
anb = str(anb)
anb = (str(anb)[1:-1])
anb = anb.replace("'","")
anb = anb.replace(",","")
anb = anb.replace('\\n',"")
print(anb)
The output is expected to put each result on its own line.
Currently happening:
Competition Table
Expected:
Competition
Table
Thanks in advance!
- Xphoon
Hi, I'd suggest trying two things as good practice:
1) Use "with" for opening files:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    ## your Python operations on the files
2) Try to use f-strings if you're using Python 3:
print(f"Hello\nWorld!")
File read using "open()" vs "with open()"
This post explains very well why to use the "with" statement :)
And additionally to the f-strings: if you want to print out variables, do it like this:
print(f"{variable[index]}\n{variable2[index2]}")
This should print out Hello and World! on separate lines (assuming that is what the variables hold).
Here is one solution, including converting between sets and lists:

with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]

# Set intersection keeps only the words that occur in both files.
anb = set(english_words) & set(french_words)
anb_list = list(anb)
for item in anb_list:
    print(item)
Here is another solution, keeping the words in lists:

with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]

for english_word in english_words:
    for french_word in french_words:
        if english_word == french_word:
            print(english_word)
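Of the two, the set-intersection version scales better: the nested loops make len(english_words) * len(french_words) comparisons, while the set intersection is roughly linear in the total number of words. For word lists this small, either is fine.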
I want to remove words tagged with the part-of-speech tags VBD and VBN from my CSV file. But I'm getting the error "IndexError: list index out of range" after entering the following code:

for word in POS_tag_text_clean:
    if word[1] != 'VBD' and word[1] != 'VBN':
        words.append(word[0])

My CSV file has 10 reviews from 10 people, and the column name is Comment.
Here is my full code:

df_Comment = pd.read_csv("myfile.csv")

def clean(text):
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()
    tagged = nltk.pos_tag(text)
    text = text.rstrip()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    stop_free = " ".join([i for i in text.lower().split() if ((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

text_clean = []
for text in df_Comment['Comment']:
    text_clean.append(clean(text).split())
print(text_clean)

POS_tag_text_clean = [nltk.pos_tag(t) for t in text_clean]
print(POS_tag_text_clean)

words = []
for word in POS_tag_text_clean:
    if word[1] != 'VBD' and word[1] != 'VBN':
        words.append(word[0])
How can I fix the error?
It is a bit hard to understand your problem without an example and the corresponding outputs, but it might be this:
Assuming that text is a string, text_clean will be a list of lists of strings, where every string represents a word. After the part-of-speech tagging, POS_tag_text_clean will therefore be a list of lists of tuples, each tuple containing a word and its tag.
If I'm right, then your last loop actually loops over items from your dataframe instead of words, as the name of the variable suggests. If an item has only one word (which is not so unlikely, since you filter a lot in clean()), your call to word[1] will fail with an error similar to the one you report.
Instead, try this code:

words = []
for item in POS_tag_text_clean:
    words_in_item = []
    for word in item:
        # word is a (token, tag) tuple produced by nltk.pos_tag
        if word[1] != 'VBD' and word[1] != 'VBN':
            words_in_item.append(word[0])
    words.append(words_in_item)
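The same filtering can also be written as a nested list comprehension, which keeps the per-review structure in a single expression (same assumption as above: each item is a list of (word, tag) tuples):

# Keep every word whose tag is neither VBD nor VBN, per review.
words = [[w for w, tag in item if tag not in ('VBD', 'VBN')]
         for item in POS_tag_text_clean]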
I'm a newbie to Python as well as to this forum. Below is the question.
The file is as shown in the attached image:
File Format.
I'm able to split the text in the Text2 column and write it to different rows with the code below:

import csv
import pandas as pd

myfile = open('Output.csv', 'w')
wr = csv.writer(myfile, lineterminator='\n')
df = pd.read_excel("Input.xlsx")
sentence = []
for txt in df['Text2']:
    sentence.append(txt.split('.'))
for phrase in sentence:
    for words in phrase:
        wr.writerow([words])

I need help on how to map the sentences, which are of variable length, to the key, and on how to achieve the specific format shown in the attached image file.
Also, the writerow function starts writing in the first row; how can I specify that it should begin at column three?
Any help on this is much appreciated!!
Try this:

myfile = open('Output.csv', 'w')
wr = csv.writer(myfile, lineterminator='\n')

entries = {}
for k, txt1, txt2 in df.values:
    sentences = [s.strip() for s in txt2.split('.') if len(s.strip()) > 0]
    # sentences = [s.strip() + '.' for s in txt2.split('.') if len(s.strip()) > 0]
    entries[k] = [txt1, sentences]

for k in entries.keys():
    txt1, txt2 = entries[k]
    wr.writerow([k, txt1, txt2[0]])
    for s in txt2[1:]:
        wr.writerow(['', '', s])

myfile.close()

Use the alternative sentences = ... line (the one commented out in the code above) if you want a dot at the end of each sentence in the CSV file. From your example image it is not clear what should happen to the dot (sometimes it appears in the output and sometimes it does not).
Also, if so desired, the code can be further simplified by combining the two loops into one:

myfile = open('Output.csv', 'w')
wr = csv.writer(myfile, lineterminator='\n')

for k, txt1, txt2 in df.values:
    sentences = [s.strip() for s in txt2.split('.') if len(s.strip()) > 0]
    wr.writerow([k, txt1, sentences[0]])
    for s in sentences[1:]:
        wr.writerow([None, '', s])

myfile.close()
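As a side note, the csv module's documentation recommends opening the file with newline='' rather than passing a lineterminator, and a with block closes the file automatically. A minimal variant of the same loop with those changes (Python 3 assumed):

import csv
import pandas as pd

df = pd.read_excel("Input.xlsx")

# newline='' prevents blank rows on Windows; 'with' closes the file
# even if an exception is raised mid-write.
with open('Output.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile)
    for k, txt1, txt2 in df.values:
        sentences = [s.strip() for s in txt2.split('.') if s.strip()]
        wr.writerow([k, txt1, sentences[0]])
        for s in sentences[1:]:
            wr.writerow(['', '', s])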