Splitting a sentence in Python using iteration - python-3.x

I have a challenge in my class: split a sentence into a list of separate words using iteration. I can't use any .split functions. Does anybody have any ideas?

sentence = 'how now brown cow'
words = []
wordStartIndex = 0
for i in range(len(sentence)):
    if sentence[i] == ' ':
        # A space ends the current word, unless that word is empty
        if i > wordStartIndex:
            words.append(sentence[wordStartIndex:i])
        wordStartIndex = i + 1
# Add the final word, unless the sentence ends with a space
if wordStartIndex < len(sentence):
    words.append(sentence[wordStartIndex:])
for w in words:
    print('word = ' + w)
This already skips leading and repeated spaces, but it still needs tweaking for punctuation.
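One possible way to handle punctuation is to walk the sentence character by character and treat anything that is not a letter or digit as a separator. This is only a sketch: the function name split_words and the choice of str.isalnum() as the definition of a word character are assumptions, not part of the original answer.

```python
def split_words(sentence):
    """Split on any non-alphanumeric character without calling str.split."""
    words = []
    current = ''
    for ch in sentence:
        if ch.isalnum():
            # Still inside a word: keep collecting characters
            current += ch
        elif current:
            # A separator ends the word collected so far
            words.append(current)
            current = ''
    # Flush the last word if the sentence does not end with a separator
    if current:
        words.append(current)
    return words

print(split_words('  how now,  brown cow? a'))
```

Because empty runs are never appended, leading, trailing, and repeated separators need no special case.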

I never miss an opportunity to drag out itertools.groupby():
from itertools import groupby
sentence = 'How now brown cow?'
words = []
for isalpha, characters in groupby(sentence, str.isalpha):
    if isalpha:  # characters are letters
        words.append(''.join(characters))
print(words)
OUTPUT
% python3 test.py
['How', 'now', 'brown', 'cow']
%
Now go back and define what you mean by 'word', e.g. what do you want to do about hyphens, apostrophes, etc.
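As an illustration of how that definition could be changed, the key function passed to groupby() can be swapped for a custom one. The sketch below assumes that hyphens and apostrophes belong to words; is_word_char and split_words are hypothetical names chosen for this example.

```python
from itertools import groupby

def is_word_char(ch):
    # Assumption: letters, hyphens and apostrophes are all part of a word
    return ch.isalpha() or ch in "-'"

def split_words(sentence):
    # groupby() yields consecutive runs of characters sharing the same key;
    # keep only the runs whose key is True
    return [''.join(chars)
            for is_word, chars in groupby(sentence, is_word_char)
            if is_word]

print(split_words("Don't stop the well-known brown cow!"))
```

Note that with this key a free-standing hyphen or apostrophe would also be reported as a word, so a stricter rule may be needed for real text.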

Related

Move a character or word to a new line

Given a string, how do I move part of the string onto a new line without moving the rest of the line or its characters?
'This' and 'this' word should go in the next line
Output:
> and word should go in the next line
This this
This is just an example of the output I want; the words themselves may differ. To be clearer: say I have some string elements in an array, and I have to move the second and third word of each element to a new line while printing the rest of the line as is. I've tried using \n and a for loop, but that also moves the rest of the string to a new line.
['This and this', 'word should go', 'in the next']
Output:
> This word in
and this should go the next
So the 2nd and 3rd words of each element are moved without affecting the rest of the line. Is it possible to do this without much complication? I'm aware of the format method, but I don't know how to use it in this situation.
For your first example, in case you don't know the order of the target words in advance, I would use a dictionary to store the indices of the found words. Then you can sort those to put the found words in the second line in the same order as they appeared in the text:
targets = ['this', 'This']
source = 'This and this word should go in the next line.'
target_ixs = {source.find(target): target for target in targets}
line2 = ' '.join([target_ixs[i] for i in sorted(target_ixs)])
line1 = source
for target in targets:
    line1 = line1.replace(target, '')
# Collapse the double spaces left behind by the removed words
line1 = line1.replace('  ', ' ').lstrip()
result = line1 + '\n' + line2
print(result)
and word should go in the next line.
This this
Your second example is easier, because you already know which parts of the strings to put in the second line, so you just need to split each string into a list of words and select from those:
source = ['This and this', 'word should go', 'in the next']
source_lists = [s.split() for s in source]
line1 = ' '.join([source_list[0] for source_list in source_lists])
line2 = ' '.join([' '.join(source_list[1:]) for source_list in source_lists])
result = line1 + '\n' + line2
print(result)
This word in
and this should go the next
You can probably do quite a bit without much complication using the regular expression library and some python language features. That being said, it depends on how complex the rules are for determining what words go where. Typically, you want to start with a string and "tokenize" it into the constituent words. See the code example below:
import re
sentence = "This and this word should go in the next line"
all_words = re.split(r'\W+', sentence)
# Keep the matches as a list so the membership test below compares whole words
matched = re.findall(r"\bthis\b", sentence, re.IGNORECASE)
matched_words = " ".join(matched)
unmatched_words = " ".join([word for word in all_words if word not in matched])
print(f"{unmatched_words}\n{matched_words}")
> and word should go in the next line
This this
Final Thoughts:
I am by no means a regex ninja, so there may be even cleverer things that can be done with just regex patterns and functions. Hopefully this gives you some food for thought at least.
Got it:
data = ['This and this', 'word should go', 'in the next']
first_line = []
second_line = []
for item in data:
    item = item.split(' ')
    first_word = item[0]
    item.remove(first_word)
    others = " ".join(item)
    first_line.append(first_word)
    second_line.append(others)
print(" ".join(first_line) + "\n" + " ".join(second_line))
My Solution:
input_data = ['This and this', 'word should go ok', 'this next']
I've slightly altered your test string to better test the code.
# Example 1
# Print all words in input_data, moving any word matching the
# string "this" (match is case insensitive) to the next line.
print('Example 1')
lines = ([], [])
for words in input_data:
    for word in words.split():
        lines[word.lower() == 'this'].append(word)
result = ' '.join(lines[0]) + '\n' + ' '.join(lines[1])
print(result)
The code in example 1 sorts each word into the 2-element tuple, lines. The key part is the boolean expression that performs the string comparison: False selects lines[0] and True selects lines[1].
# Example 2
# Print all words in input_data, moving the second and third
# word in any string to the next line.
from itertools import count
print('\nExample 2')
lines = ([], [])
for words in input_data:
    for q in zip(count(), words.split()):
        lines[q[0] in (1, 2)].append(q[1])
result = ' '.join(lines[0]) + '\n' + ' '.join(lines[1])
print(result)
This solution is basically the same as the first. I zip each word with an integer so that the word's position is known when the boolean expression is evaluated, which again sorts the words into their appropriate list in lines.
As you can see, this solution is fairly flexible and can be adjusted to fit a number of scenarios.
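To illustrate that flexibility, the two examples can be folded into one helper that takes the rule as a function. This is a sketch only: split_lines and its goes_to_second_line parameter are hypothetical names, and enumerate() is used in place of zip(count(), ...).

```python
def split_lines(items, goes_to_second_line):
    """Sort every word into line 0 or line 1 according to a rule.

    goes_to_second_line(position, word) returns True for words that
    should be moved to the second line.
    """
    lines = ([], [])
    for item in items:
        for position, word in enumerate(item.split()):
            # bool() guarantees the index is exactly False (0) or True (1)
            lines[bool(goes_to_second_line(position, word))].append(word)
    return ' '.join(lines[0]) + '\n' + ' '.join(lines[1])

input_data = ['This and this', 'word should go ok', 'this next']
# Rule from Example 1: move every "this", ignoring case
print(split_lines(input_data, lambda pos, word: word.lower() == 'this'))
# Rule from Example 2: move the second and third word of each string
print(split_lines(input_data, lambda pos, word: pos in (1, 2)))
```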
Good luck, and I hope this helped!

Variable length lookahead/behind regex when you don't know exactly what you're matching (python)

I have some transcriptions that unfortunately contain lots of occurrences of words separated by a period but no space (i.e. word.word).
Is there a way to use regex to separate these, but leave decimals and abbreviations such as U.K. or U.S.A alone? I'm planning to tokenize the text, so I want the word.word occurrences to be counted as separate words, but I don't want to mess up abbreviations, decimals, or any other places where the period is part of the word. Since I want to replace these specific word.word periods with a space but leave all others alone (replacing them with a space would break up the abbreviation), my first thought was something like this:
text = re.sub("(?<!\d){2,}\.(?!\d){2,}", " ", text)
That is: look for periods that are surrounded by two or more non-digits on each side, and replace just the period with a space. But it seems that variable-length lookbehind/lookahead isn't really a thing you can do. I've tested this in some regex testers and it still matches the letter abbreviations above, although it does not match decimals.
Is there another way to write what I've thought about or another way to approach this? I've gotten somewhat mentally stuck in this solution and I can't find another way that will do close to what I'm looking to do - can it even be done?
Thank you!
Ok, so :D
I have written this code and given it the string "i.would.like.to.visit.the.U.S.A.or.the.u.k.while.i.am.eating.a.banana.b" (the trailing b is there on purpose, to make sure single letters are not deleted for no reason). The output was:
['i', 'would', 'like', 'to', 'visit', 'the', 'USA', 'or', 'the', 'uk', 'while', 'i', 'am', 'eating', 'a', 'banana', 'b'].
The code is:
text = "i.would.like.to.visit.the.U.S.A.or.the.u.k.while.i.am.eating.a.banana.b"
def split(string: str):
    string = string.split(".")
    length = len(string) - 1
    obj = enumerate(string)
    together = []
    for index, word in obj:
        sub = []
        if index and len(word) == 1 and index < length:
            idx = index
            while len(string[idx]) == 1:
                sub.append((string[idx], idx))
                idx += 1
                next(obj)
            together.append(sub)
    if together:
        deleted = 0
        for sub in together:
            if len(sub) > 1:
                string[sub[0][1] - deleted:sub[-1][1] + 1 - deleted] = ["".join(x[0] for x in sub)]
                deleted += len(sub) - 1
    return string
print(split(text))
You can change "".join(x[0] for x in sub) to ".".join(x[0] for x in sub) in order to keep the dots (U.S.A instead of USA).
If you just want to add a space when the period has two or more characters on each side that are neither digits nor periods, the following is what you are looking for.
text = re.sub(r"([^\d.]{2})\.([^\d.]{2})", r"\1. \2", text)
Example:
"This sentence ends.The following is an abbreviation A.B.C." becomes
"This sentence ends. The following is an abbreviation A.B.C."
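A possible variation, sketched here as an assumption rather than as part of the answer above, uses fixed-width lookarounds instead of capture groups. A lookbehind of exactly two characters is allowed by Python's re module, and because lookarounds consume nothing, consecutive cases such as ab.cd.ef are all handled. The class [^\W\d_] matches a single letter, so spaces next to the period no longer count; separate_fused_words is a hypothetical name.

```python
import re

# A period with exactly-checked two letters before it and two letters after it.
# [^\W\d_] means "a word character that is not a digit or underscore", i.e. a letter.
FUSED_PERIOD = re.compile(r"(?<=[^\W\d_]{2})\.(?=[^\W\d_]{2})")

def separate_fused_words(text):
    # Only the period itself is matched, so it is replaced by ". "
    return FUSED_PERIOD.sub(". ", text)

print(separate_fused_words("This sentence ends.The following is an abbreviation A.B.C."))
print(separate_fused_words("Pi is 3.14 in the U.S.A. today"))
```

Replacing ". " with " " in the substitution would drop the period entirely, which is what the question originally asked for.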

How to remove less frequent words from pandas dataframe

How do I remove words that appear fewer than x times (for example, fewer than 3 times) from a pandas DataFrame? I use nltk to remove non-English words, but the result is not good. I assume that words appearing fewer than 3 times are non-English words.
input_text=["this is th text one tctst","this is text two asdf","this text will be remove"]
def clean_non_english(text):
    text = " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in words or not w.isalpha())
    return text
Dataset['text'] = Dataset['text'].apply(lambda x: clean_non_english(x))
Desired output
input_text=["this is text ","this is text ","this is text"]
So any word that appears in the list fewer than 3 times should be removed.
Try this
import numpy as np
input_text = ["this is th text one tctst", "this is text two asdf", "this text will be remove"]
all_ = [x for y in input_text for x in y.split(' ')]
a, b = np.unique(all_, return_counts=True)
to_remove = a[b < 3]
output_text = [' '.join(np.array(y.split(' '))[~np.isin(y.split(' '), to_remove)])
               for y in input_text]
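The same counting idea can also be sketched with only the standard library, using collections.Counter. The function name drop_rare_words and the min_count parameter are assumptions made for this example.

```python
from collections import Counter

def drop_rare_words(texts, min_count=3):
    """Remove every word that occurs fewer than min_count times across all texts."""
    # Count each word over the whole collection, not per sentence
    counts = Counter(word for text in texts for word in text.split())
    return [' '.join(word for word in text.split() if counts[word] >= min_count)
            for text in texts]

input_text = ["this is th text one tctst", "this is text two asdf", "this text will be remove"]
print(drop_rare_words(input_text))
```

With this sample data "is" occurs only twice, so it is removed along with the other rare words. For a DataFrame column, the function could presumably be applied to the column converted to a list and the result assigned back.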

Python Join String to Produce Combinations For All Words in String

If my string is this: 'this is a string', how can I produce all possible combinations by joining each word with its neighboring word?
What this output would look like:
this is a string
thisis a string
thisisa string
thisisastring
thisis astring
this isa string
this isastring
this is astring
What I have tried:
s = 'this is a string'.split()
for i, l in enumerate(s):
    ''.join(s[0:i])+' '.join(s[i:])
This produces:
'this is a string'
'thisis a string'
'thisisa string'
'thisisastring'
I realize I need to change the s[0:i] part because it's statically anchored at 0 but I don't know how to move to the next word is while still including this in the output.
A simpler (and 3x faster than the accepted answer) way to use itertools product:
import itertools
s = 'this is a string'
s2 = s.replace('%', '%%').replace(' ', '%s')
for i in itertools.product((' ', ''), repeat=s.count(' ')):
    print(s2 % i)
You can also use itertools.product():
import itertools
s = 'this is a string'
words = s.split()
for t in itertools.product(range(2), repeat=len(words)-1):
    print(''.join([words[i]+t[i]*' ' for i in range(len(t))])+words[-1])
Well, it took me a little longer than I expected... this is actually trickier than I thought :)
The main idea:
The number of spaces when you split the string is the length of the split array minus 1. In our example there are 3 spaces:
'this is a string'
     ^  ^ ^
We'll take a binary representation of all the options to have/not have either one of the spaces, so in our case it'll be:
000
001
010
011
100
101
...
and for each option we'll generate the sentence respectively, where 111 represents all 3 spaces: 'this is a string' and 000 represents no-space at all: 'thisisastring'
def binaries(n):
    res = []
    # There are 2 ** n ways to keep or drop each of the n spaces
    for x in range(2 ** n):
        tmp = bin(x)
        res.append(tmp.replace('0b', '').zfill(n))
    return res
def generate(arr, bins):
    res = []
    for bits in bins:
        tmp = arr[0]
        i = 1
        for digit in list(bits):
            if digit == '1':
                tmp = tmp + " " + arr[i]
            else:
                tmp = tmp + arr[i]
            i += 1
        res.append(tmp)
    return res
def combinations(string):
    s = string.split(' ')
    bins = binaries(len(s) - 1)
    res = generate(s, bins)
    return res
print(combinations('this is a string'))
# ['thisisastring', 'thisisa string', 'thisis astring', 'thisis a string', 'this isastring', 'this isa string', 'this is astring', 'this is a string']
UPDATE:
I now see that Amadan thought of the same idea - kudos for being quicker than me to think about! Great minds think alike ;)
The easiest is to do it recursively.
Terminating condition: Schrödinger join of a single element list is that word.
Recurring condition: say that L is the Schrödinger join of all the words but the first. Then the Schrödinger join of the list consists of all elements from L with the first word directly prepended, and all elements from L with the first word prepended with an intervening space.
(Assuming you are missing thisis astring by accident. If it is deliberate, I am sure I have no idea what the question is :P )
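The recursive description above can be sketched as follows. The function name schrodinger_join is borrowed from the wording of the answer, and the code is an illustration, not the answerer's own implementation.

```python
def schrodinger_join(words):
    """Return every way of joining neighbouring words with or without a space."""
    # Terminating condition: a single word can only be joined one way
    if len(words) == 1:
        return [words[0]]
    first = words[0]
    # L in the description: all joins of the words after the first
    rest = schrodinger_join(words[1:])
    glued = [first + tail for tail in rest]         # first word directly prepended
    spaced = [first + ' ' + tail for tail in rest]  # first word plus a space
    return glued + spaced

results = schrodinger_join('this is a string'.split())
for line in results:
    print(line)
print(len(results))
```

Each extra word doubles the number of results, so four words give 2 ** 3 = 8 strings.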
Another, non-recursive way you can do it is to enumerate all numbers from 0 to 2^(number of words - 1) - 1, then use the binary representation of each number as a selector whether or not a space needs to be present. So, for example, the abovementioned thisis astring corresponds to 0b010, for "nospace, space, nospace".

Find first of many elements in python - text.find(a , b , c)

I want to check if one of the words in "a" is within "text"
text = "testing if this works"
a = ['asd' , 'test']
print text.find(a)
How can I do this?
Thanks
If you want to check whether any of the words in a is in text, use, well, any:
any(word in text for word in a)
If you want to know the number of words in a that occur in text, you can simply add them:
print('Number of words in a that match text: %s' %
      sum(word in text for word in a))
If you want to match only full words (i.e. you don't want test to match the word testing), split the text into words, as in:
words = set(text.split())
any(word in words for word in a)
In [20]: wordset = set(text.split())
In [21]: any(w in wordset for w in a)
Out[21]: False
Regexes can be used to search for multiple match patterns in a single pass:
>>> import re
>>> a = ['asd' , 'test']
>>> regex = re.compile('|'.join(map(re.escape, sorted(a, key=len, reverse=True))))
>>> print(bool(regex.search(text))) # determine whether there are any matches
True
>>> print(regex.findall(text)) # extract all matching text
['test']
>>> regex.search(text).start() # find the position of the first match
0
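If only the earliest position is needed and regular expressions are not wanted, str.find() can be called once per candidate and the smallest index kept. This is a minimal sketch; find_first and its (-1, None) return value for "no match" are choices made for this example.

```python
def find_first(text, candidates):
    """Return (index, word) for the earliest occurrence of any candidate.

    Returns (-1, None) when none of the candidates occurs in text.
    """
    # Skip candidates that do not occur, so find() never contributes -1
    hits = [(text.find(word), word) for word in candidates if word in text]
    return min(hits) if hits else (-1, None)

text = "testing if this works"
print(find_first(text, ['asd', 'test']))
print(find_first(text, ['works', 'this']))
print(find_first(text, ['asd']))
```

Like the substring checks above, this matches test inside testing; it does not restrict matches to whole words.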

Resources