Find common words between two big unstructured text files

I have two big unstructured text files which cannot fit into memory. I want to find the common words between them.
What would be the most efficient (time and space) way?
Thanks

I gave it these two files:
pi_poem
Now I will a rhyme construct
By chosen words the young instruct
I do not like green eggs and ham
I do not like them Sam I am
pi_prose
The thing I like best about pi is the magic it does with circles.
Even young kids can have fun with the simple integer approximations.
The code is simple. The first loop reads the first file line by line, sticking the words into a lexicon set; only the distinct words are held in memory, not the whole file. The second loop reads the second file; each word it finds in the first file's lexicon goes into a set of common words.
Does that do what you need? You'll need to adapt it for punctuation, and you'll probably want to remove the extra printing once you have it changed over.
lexicon = set()
with open("pi_poem", 'r') as text:
    for line in text:  # iterate line by line; avoids reading the whole file at once
        for word in line.split():
            lexicon.add(word)  # sets ignore duplicates, so no membership test is needed
print(lexicon)
common = set()
with open("pi_prose", 'r') as text:
    for line in text:
        for word in line.split():
            if word in lexicon:
                common.add(word)
print(common)
Output:
{'and', 'am', 'instruct', 'ham', 'chosen', 'young', 'construct', 'Now', 'By', 'do', 'them', 'I', 'eggs', 'rhyme', 'words', 'not', 'a', 'like', 'Sam', 'will', 'green', 'the'}
{'I', 'the', 'like', 'young'}
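For comparison, here is a minimal sketch of the same idea using a set intersection; words_of is a hypothetical helper, and it still only keeps the distinct words of each file in memory:
def words_of(path):
    with open(path) as f:
        return {word for line in f for word in line.split()}

common = words_of("pi_poem") & words_of("pi_prose")
print(common)  # {'I', 'the', 'like', 'young'}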

Related

Write a text file and go to a new line when there is punctuation

I have a file, called input.txt, structured as follows:
Hi Mark, my name is Lucas! I was born in Paris in 1998.
I currently live in Berlin.
My goal is to lowercase the text, replace numbers and punctuation with \n (eliminating any excess ones), remove the stopwords, and write the result to a new file called output.txt.
So, if
stopwords = ['my', 'is', 'i', 'was', 'in'],
output.txt should be
hi mark
name lucas
born paris
currently live berlin
But if I use the following code
import re
import string

stopwords = ['my', 'is', 'i', 'was', 'in']
with open('input.txt', 'r', encoding='utf-8-sig') as file:
    new_file = open('output.txt', 'w', encoding='utf-8-sig')
    for line in file:
        corpus = line.lower()
        corpus = corpus.strip().replace('’', '\'')
        corpus = re.compile('[0-9{}]'.format(re.escape(string.punctuation))).sub('\n', corpus).replace('\n ', '\n').replace(' \n', '\n')
        corpus = re.sub(r'\n+', '\n', corpus).strip()
        corpus = ' '.join(w for w in corpus.split() if w not in stopwords)  # (1)
        new_file.write(corpus)
        new_file.write('\n')
    new_file.close()
I get
hi mark name lucas born paris
currently live berlin
How can I fix it, perhaps by changing only line (1) of the code?
Thanks for your help.
This code should do what you want:
import re

# STOPWORDS
stopwords = ["my", "is", "i", "was", "in"]
# The comprehension below builds a regex pattern for each word that
# requires a space behind and in front of it for a match. This
# prevents matching lone letters contained within other words.
stopwords = [rf"(?<=\s){stopword}(?=\s)" for stopword in stopwords]

# GET INPUT FROM FILE
with open("input.txt", "r") as input_txt:
    text = input_txt.read()

# FORMAT TEXT
text = re.sub("’", "'", text).lower()
text = re.sub(r"[^a-z\u00E0-\u00FF ]", "\n", text)
text = re.sub("|".join(stopwords), "", text)
text = re.sub("[ ]+", " ", text)
text = re.sub("\n[ ]+", "\n", text)
text = re.sub(r"\n+", "\n", text).strip()

# WRITE OUTPUT TO FILE
with open("output.txt", "w") as output_txt:
    output_txt.write(text)
Output
hi mark
name lucas
born paris
currently live berlin
As per your request, this now handles the case where sentences are separated by a line break but no punctuation or number. The issue before was that the .split() we were using to separate words also removed line breaks. To get around this I had to use quite a bit of regex, so the answer became more complicated.
However, I hope it does what you want; let me know if it works or if you need help understanding any of it.
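As a quick illustration of that split() point: with no separator argument, str.split() treats a newline like any other whitespace, so the line boundaries vanish:
s = "hi mark\nname lucas"
print(s.split())       # ['hi', 'mark', 'name', 'lucas'] -- the line break is lost
print(s.splitlines())  # ['hi mark', 'name lucas'] -- line structure preserved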

Define a list of words and check whether any of those words exist in the body of text

Can you please help me with the logic of the following question? I would like to define a list of different words and check if the words exist in a text; if so, I would like the word returned, and if the words are not part of the text, I would like a message to be returned.
The code I have is the following:
def search_word():
    with open('testing.txt', 'r') as textFile:
        word_list = ['account', 'earn', 'free', 'links', 'click', 'cash', 'extra', 'win', 'bonus', 'card']
        for line in textFile.read():
            for word in word_list:
                if word in line:
                    print(word)
                else:
                    print('The text does not include any predefined words.')

search_word()
The output I get is the else statement. I know the issue is with the "for line in textFile.read()" line, but I am trying to understand why the logic does not work in the code above.
I get the right result with the following code, after moving "fileText = textObject.read()" before the for loop.
def search_word():
    with open('email.txt', 'r') as textObject:
        fileText = textObject.read()
        word_list = ['account', 'earn', 'free', 'links', 'click', 'cash', 'Extra', 'win', 'bonus', 'card']
        for word in word_list:
            if word in fileText:
                print(word)
            else:
                print(word, '- the email does not include any predefined spam words.')

search_word()
I would appreciate your help with understanding the difference in the logic.
Thanks.
Lois
Assume testing.txt contains only the single word account; read() returns your text file as one string, like 'account\n'.
for line in textFile.read(): iterates over every character (including spaces and newlines) in that string, something like ['a', 'c', 'c', 'o', 'u', 'n', 't', '\n']. Then
for word in word_list: compares the words in word_list to every character: 10 words against 'a', then 10 words against 'c', ... and finally 10 words against '\n'. Not a single comparison matches, so the else statement executes 80 times (10 words x 8 characters).
With fileText = textObject.read() and no loop over the text, for word in word_list: simply compares each word in word_list to the whole string: 'account' against 'account\n', 'earn' against 'account\n', ... 'card' against 'account\n'. This time you get only 10 results, one for each word.
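A minimal snippet makes the difference concrete (assuming testing.txt contains just the word account):
with open('testing.txt', 'r') as f:
    text = f.read()             # 'account\n'
print(list(text))               # ['a', 'c', 'c', 'o', 'u', 'n', 't', '\n']
print('account' in text)        # True -- substring test against the whole string
print('account' in list(text))  # False -- no single character equals 'account'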

Anonymize files with an array of specified words, Python

So I'm writing an anonymizer and I'm having trouble figuring out how to replace a name in a text file. I have an array of names that should get anonymized, referred to here as text. Here's my code; it should go through another file and check if the words match, and if so, the word should get replaced. As programming is still a foreign language to me, I would love to read a comprehensive answer.
for words in fin_message:
    if words == text:
        new_list = words.replace(text, "xxx")
        print(new_list)
    else:
        print(words)
Since text is a list, you can't directly compare it to words, but you can test whether the word is in text:
...
if words in text:
    print("xxx")
...
This will, however, print the words in the text file one by one. If instead you want to print the text file as-is, except for the replacements, you can iterate over the lines of the file and, inside each line, over the banned names. Something like this:
banned_words = ["Peter", "Paul", "Mary"]
with open("my_file.txt") as f:
    for line in f:
        for forbidden in banned_words:
            line = line.replace(forbidden, "xxx")  # replace() returns a new string, so assign it back
        print(line, end="")  # the line already ends with a newline

How to split concatenated strings of this kind: "howdoIsplitthis?"

Suppose I have a string such as this:
"IgotthistextfromapdfIscraped.HowdoIsplitthis?"
And I want to produce:
"I got this text from a pdf I scraped. How do I split this?"
How can I do it?
It turns out that this task is called word segmentation, and there is a python library that can do that:
>>> from wordsegment import load, segment
>>> load()
>>> segment("IgotthistextfromapdfIscraped.HowdoIsplitthis?")
['i', 'got', 'this', 'text', 'from', 'a', 'pdf', 'i', 'scraped', 'how',
'do', 'i', 'split', 'this']
Short answer: no realistic chance.
Long answer:
The only hint for where to split the string is finding valid words in it. So you need a dictionary of the expected language, containing not only the root words but also all their inflected forms. Then you can try to find a sequence of these words that matches the characters of your string.
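To make that idea concrete, here is a minimal sketch of the dictionary-matching approach with a tiny hypothetical word list and simple dynamic programming; a real implementation would need a full dictionary and some way to rank competing segmentations:
def word_break(s, words):
    best = {0: []}  # best[i] holds one segmentation of s[:i] into dictionary words
    for i in range(1, len(s) + 1):
        for j in list(best):
            if j < i and s[j:i] in words and i not in best:
                best[i] = best[j] + [s[j:i]]
    return best.get(len(s))  # None if no segmentation exists

print(word_break("howdoisplitthis", {"how", "do", "i", "split", "this"}))
# ['how', 'do', 'i', 'split', 'this']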

Python 3 - Replacing words in a sentence with its index value

I am writing a program that takes a sentence input from the user as a string (str1Input) and then writes that sentence to a file. After splitting the sentence into a list (str1), the program identifies the unique words and writes them to the same file. I then need to replace each word in the sentence (str1Input) with its index value, by reading from the file containing the sentence.
For instance, if I had "i like to code in python and code things" as str1Input, I would use str1Input.split(), which turns it into ['i', 'like', 'to', 'code', 'in', 'python', 'and', 'code', 'things']. After finding the unique values I would get: {'and', 'code', 'like', 'i', 'things', 'python', 'to', 'in'}.
I have a problem, as I am unsure how I would read from the file to replace each word in the sentence with the index value of that word. Here is the code so far:
str1Input = input("Please enter a sentence: ")
str1 = str1Input.split()
str1Write = open('str1.txt','w')
str1Write.write(str1Input)
str1Write.close()
print("The words in that sentence are: ",str1)
unique_words = set(str1)
print("Here are the unique words in that sentence: " ,unique_words)
If anyone could help me with this that would be greatly appreciated, thanks!
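A minimal sketch of one way to finish this, assuming the sentence was already written to str1.txt as above: read the sentence back, build an ordered list of the unique words, and map each word to its position in that list.
with open('str1.txt', 'r') as f:
    sentence = f.read()
words = sentence.split()
unique_words = list(dict.fromkeys(words))  # unique words in first-seen order
indices = [unique_words.index(word) for word in words]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 3, 7] for the example sentence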
