How to split concatenated strings of this kind: "howdoIsplitthis?"

Suppose I have a string such as this:
"IgotthistextfromapdfIscraped.HowdoIsplitthis?"
And I want to produce:
"I got this text from a pdf I scraped. How do I split this?"
How can I do it?

It turns out that this task is called word segmentation, and there is a Python library, wordsegment, that can do it:
>>> from wordsegment import load, segment
>>> load()
>>> segment("IgotthistextfromapdfIscraped.HowdoIsplitthis?")
['i', 'got', 'this', 'text', 'from', 'a', 'pdf', 'i', 'scraped', 'how',
'do', 'i', 'split', 'this']
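The segmenter returns lowercase tokens with the punctuation dropped, so restoring capitalization and sentence punctuation is up to you. A minimal sketch of the rejoining step, working from the token list above (the `rejoin` helper is mine, not part of wordsegment):

```python
# Tokens as returned by wordsegment's segment() on the example string.
tokens = ['i', 'got', 'this', 'text', 'from', 'a', 'pdf', 'i', 'scraped',
          'how', 'do', 'i', 'split', 'this']

def rejoin(tokens):
    """Join tokens into a sentence, restoring 'I' and capitalizing the start."""
    words = ['I' if t == 'i' else t for t in tokens]
    sentence = ' '.join(words)
    return sentence[0].upper() + sentence[1:]

print(rejoin(tokens))
# -> I got this text from a pdf I scraped how do I split this
```

Recovering the original sentence boundaries (the period before "How") would need extra heuristics, since the segmenter discards them.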

Short answer: not reliably.
Long answer:
The only hint for where to split the string is finding valid words in it. So you need a dictionary of the expected language containing not only the root words but also all inflected forms. Then you can try to find a sequence of these words that matches the characters of your string.
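That search can be sketched with dynamic programming over a dictionary. This is a toy version with a hand-made word set; real segmenters such as wordsegment rank candidate splits by word frequency instead of accepting the first match:

```python
def segment(text, dictionary):
    """Find any split of `text` into dictionary words via dynamic
    programming; returns a list of words, or None if no split exists."""
    n = len(text)
    # best[i] holds a segmentation of text[:i], or None if none exists.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in dictionary:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

words = {'how', 'do', 'i', 'split', 'this'}
print(segment('howdoisplitthis', words))
# -> ['how', 'do', 'i', 'split', 'this']
```

With an ambiguous dictionary (e.g. both 'a' and 'am'), several splits may exist; a frequency model is what lets a real segmenter pick the plausible one.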

Related

Define a list of words and check whether any of those words exist in the body of text

Can you please help me with the logic of the following question? I would like to define a list of words and check whether any of them exist in a text; if so, I would like the word returned, and if not, I would like a message returned.
The code I have is the following:
def search_word():
    with open('testing.txt', 'r') as textFile:
        word_list = ['account', 'earn', 'free', 'links', 'click', 'cash', 'extra', 'win', 'bonus', 'card']
        for line in textFile.read():
            for word in word_list:
                if word in line:
                    print(word)
                else:
                    print('The text does not include any predefined words.')

search_word()
The output I get is the else statement. I know the issue is with the "for line in textFile.read()" line, but I am trying to understand why the logic in the above code does not work.
I get the right result by changing to the following code, moving the "fileText = textObject.read()" before the for loop:
def search_word():
    with open('email.txt', 'r') as textObject:
        fileText = textObject.read()
        word_list = ['account', 'earn', 'free', 'links', 'click', 'cash', 'Extra', 'win', 'bonus', 'card']
        for word in word_list:
            if word in fileText:
                print(word)
            else:
                print(word, '- the email does not include any predefined spam words.')

search_word()
I would appreciate your help with understanding the difference in the logic.
Thanks.
Lois
Assume testing.txt contains only the single word account; read() returns your text file as one string, like 'account\n'.
for line in textFile.read(): iterates over every character (including spaces and newlines) in that string, something like ['a', 'c', 'c', 'o', 'u', 'n', 't', '\n']. Then
for word in word_list: compares the words in word_list to every character: 10 words against 'a', then 10 words against 'c', ... and finally 10 words against '\n'. Not a single comparison matches, so the else branch executes 80 times (10 words × 8 characters).
With fileText = textObject.read() and no loop over characters, for word in word_list: compares each word in word_list against the whole string: 'account' to 'account\n', 'earn' to 'account\n', ..., 'card' to 'account\n'. This time you get only 10 results for the 10 words you have.
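The difference is easy to see without a file at all; a short demonstration of both behaviors on the same string:

```python
text = 'account\n'  # what read() returns for a one-word file

# Iterating a string yields single characters, so 'account' never matches.
chars = [c for c in text]
print(chars)               # -> ['a', 'c', 'c', 'o', 'u', 'n', 't', '\n']
print('account' in chars)  # -> False

# Substring membership on the whole string is what the fixed code relies on.
print('account' in text)   # -> True
```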

How to eliminate new line character from text file when opened in python?

I am trying to take words from a stopwords.txt file and append them as strings to a Python list.
stopwords.txt
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
My Code :
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word)
List stopwords output:
['a\n',
'about\n',
'above\n',
'after\n',
'again\n',
'against\n',
'all\n',
'am\n',
'an\n',
'and\n',
'any\n',
'are\n',
"aren't\n",
'as\n',
'at\n',
'be\n',
'because\n',
'been\n',
'before\n',
'being\n']
Desired Output :
['a',
'about',
'above',
'after',
'again',
'against',
'all',
'am',
'an',
'and',
'any',
'are',
"aren't",
'as',
'at',
'be',
'because',
'been',
'before',
'being']
Is there any method that eliminates the '\n' characters from stopword, or any other way to reach the desired output?
Instead of
stopwords.append(word)
do
stopwords.append(word.strip())
The str.strip() method strips whitespace of any kind (spaces, tabs, newlines, etc.) from the start and end of the string. You can give it an argument to strip a specific set of characters, or use lstrip() or rstrip() to strip only the front or back of the string, but for this case plain strip() should suffice.
You can use the .strip() method. It removes any leading and trailing occurrences of the characters passed as an argument:
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word.strip("\n"))
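For what it's worth, the loop can be avoided entirely with str.splitlines(), which splits on line endings without keeping them; a sketch on a string standing in for the file contents:

```python
raw = "a\nabout\nabove\naren't\n"  # sample file contents
stopwords = raw.splitlines()       # line endings are not included
print(stopwords)
# -> ['a', 'about', 'above', "aren't"]
```

With a real file this becomes `stopwords = f.read().splitlines()` inside a `with open("stopwords.txt") as f:` block.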

How to append a string into a list without its new line "\n" in Python 3?

I have a text file something like this (suppose A and B are persons and below text is a conversation between them):
A: Hello
B: Hello
A: How are you?
B: I am good. Thanks and you?
I added this conversation into a list that returns below result:
[['A', 'Hello\n'], ['A', 'How are you?\n'], ['B', 'Hello\n'], ['B', 'I am good. Thanks and you?\n']]
I use these commands in a loop:
new_sentence = line.split(': ', 1)[1]
attendees_and_sentences[index].append(person)
attendees_and_sentences[index].append(new_sentence)
print(attendees_and_sentences) # with this command I get the above result
print(attendees_and_sentences[0][1]) # if I run this one, then I don't get "\n" in the sentence.
The problem is those "\n" characters on my result screen. How can I get rid of them?
Thank you.
You can use Python's str.rstrip method.
For example:
>>> 'my string\n'.rstrip()
'my string'
And if you want to trim the trailing newline while preserving other whitespace, you can specify the characters to remove, like so:
>>> 'my string \n'.rstrip('\n')
'my string '
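Applied to the conversation example, the newline can be dropped at the moment each line is parsed; a sketch with the file's lines inlined as a list (sample data, not the asker's actual loop):

```python
lines = ['A: Hello\n', 'B: Hello\n', 'A: How are you?\n']
attendees_and_sentences = []
for line in lines:
    # Split "A: Hello\n" into person and sentence, then drop the newline.
    person, sentence = line.split(': ', 1)
    attendees_and_sentences.append([person, sentence.rstrip('\n')])
print(attendees_and_sentences)
# -> [['A', 'Hello'], ['B', 'Hello'], ['A', 'How are you?']]
```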

Strip Punctuation From String in Python

I'm working with documents, and I need the words isolated, without punctuation. I know how to use string.split(" ") to get the individual words, but the punctuation baffles me.
Here is an example using a regex; the result is
['this', 'is', 'a', 'string', 'with', 'punctuation']
import re

s = " ,this ?is a string! with punctuation. "
pattern = re.compile(r'\w+')
result = pattern.findall(s)
print(result)
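If you would rather delete the punctuation than extract the words, str.translate with string.punctuation does a similar job without a regex; a sketch (note that it glues contractions like "aren't" into "arent", whereas \w+ splits them into 'aren' and 't'):

```python
import string

s = " ,this ?is a string! with punctuation. "
# Build a translation table that maps every punctuation character to None.
table = str.maketrans('', '', string.punctuation)
print(s.translate(table).split())
# -> ['this', 'is', 'a', 'string', 'with', 'punctuation']
```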

Find common words between two big unstructured text files

I have two big unstructured text files that can NOT fit into memory. I want to find the words common to both.
What would be the most efficient way, in time and space?
Thanks
I used these two files:
pi_poem
Now I will a rhyme construct
By chosen words the young instruct
I do not like green eggs and ham
I do not like them Sam I am
pi_prose
The thing I like best about pi is the magic it does with circles.
Even young kids can have fun with the simple integer approximations.
The code is simple. The first loop reads the first file, line by line, sticking the words into a lexicon set. The second loop reads the second file; each word it finds in the first file's lexicon goes into a set of common words.
Does that do what you need? You'll need to adapt it for punctuation, and you'll probably want to remove the extra printing once you have it changed over.
lexicon = set()
with open("pi_poem", 'r') as text:
    for line in text:
        for word in line.split():
            lexicon.add(word)
print(lexicon)

common = set()
with open("pi_prose", 'r') as text:
    for line in text:
        for word in line.split():
            if word in lexicon:
                common.add(word)
print(common)
Output:
{'and', 'am', 'instruct', 'ham', 'chosen', 'young', 'construct', 'Now', 'By', 'do', 'them', 'I', 'eggs', 'rhyme', 'words', 'not', 'a', 'like', 'Sam', 'will', 'green', 'the'}
{'I', 'the', 'like', 'young'}
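If only the unique words (rather than the files themselves) have to fit in memory, the same idea can be written more compactly with set intersection; a sketch over in-memory lines standing in for the two files:

```python
poem = ["Now I will a rhyme construct",
        "I do not like green eggs and ham"]
prose = ["The thing I like best about pi is the magic it does with circles.",
         "Even young kids can have fun with the simple integer approximations."]

# Build the first file's word set, then intersect with the second file's.
lexicon = {word for line in poem for word in line.split()}
common = lexicon & {word for line in prose for word in line.split()}
print(sorted(common))
# -> ['I', 'like']
```

With real files, replace the inlined lists with line iteration over the open file objects; only the word sets are held in memory.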
