How to find sentence clauses that match word sequences? python - python-3.x

I have a large number of sentences from which I want to extract clauses/ segments that match certain word combinations. I have the following code that works, but it only works with one string of one word. I cannot find a way to extend it to work with multiple strings and strings of two words. I thought this was simple and asked by others before me, but could not find the answer. Can anybody help me?
This is my code:
import pandas as pd
df = pd.read_csv('text.csv')
identifiers = ('what')
sentence = df['A']
for i in sentence:
i = i.split()
if identifiers in i:
index = i.index(identifiers)
print(i[index:])
Give a sentence like this:
"Given that I want to become an entrepreneur, I am wondering what collage to attend."
and a list of two-word identifiers such as this:
identifiers = [('I am', 'I can' ..., 'I will')] # There could be dozens
how can I achieve a result like this?
I am wondering what collage to attend.
I tried: extending the code above, using isin() and something like if any([x in i for x in identifiers]) but no solution. Any suggestions?

It does not work for multiple-word phrases because you used split. Since it splits on spaces (by default), logically there won't be any single element left containing a space.
You can use in immediately to test if a certain string contains any other:
>>> sentence = "Given that I want to become an entrepreneur, I am wondering what collage to attend."
>>> identifiers = ['I am', 'I can', 'I will']
>>> for i in identifiers:
... if i in sentence:
... print (sentence[sentence.index(i):])
...
I am wondering what collage to attend.
Your attempt any([x in sentence for x in identifiers]), for these strings, shows
[True, False, False]
and while it gives some useful result, but still not the index, it would require another loop over this result to actually print the index. (And the any part is not necessary unless you specifically and only want to know if a sentence contains such a phrase.)
But the [x in sentence ..] list comprehension only yields a list of True and False, with which you cannot do anything, so it's a dead end.
But it suggests an alternative:
>>> [sentence.index(x) for x in identifiers if x in sentence]
[45]
which leads us to a list of results:
>>> [sentence[sentence.index(x):] for x in identifiers if x in sentence]
['I am wondering what collage to attend.']
If you add 'I want' to your list of identifiers, you still get a correct result, now consisting of two sentence fragments (both all the way up to the end):
['I am wondering what collage to attend.', 'I want to become an entrepreneur, I am wondering what collage to attend.']
(For fun and while I'm at it: if you want to clip off the excess at the first comma, add a regexp that matches everything except a comma:
>>> [re.match(r'^([^,]+)', sentence[sentence.index(x):]).groups(0)[0] for x in identifiers if x in sentence]
['I am wondering what collage to attend.', 'I want to become an entrepreneur']
Never mind the groups(0)[0] part at the end of that regex, it's just to coerce the SRE_Match object back into a regular string.)

Related

Python - how to recursively search a variable substring in texts that are elements of a list

let me explain better what I mean in the title.
Examples of strings where to search (i.e. strings of variable lengths
each one is an element of a list; very large in reality):
STRINGS = ['sftrkpilotndkpilotllptptpyrh', 'ffftapilotdfmmmbtyrtdll', 'gftttepncvjspwqbbqbthpilotou', 'htfrpilotrtubbbfelnxcdcz']
The substring to find, which I know is for sure:
contained in each element of STRINGS
is also contained in a SOURCE string
is of a certain fixed LENGTH (5 characters in this example).
SOURCE = ['gfrtewwxadasvpbepilotzxxndffc']
I am trying to write a Python3 program that finds this hidden word of 5 characters that is in SOURCE and at what position(s) it occurs in each element of STRINGS.
I am also trying to store the results in an array or a dictionary (I do not know what is more convenient at the moment).
Moreover, I need to perform other searches of the same type but with different LENGTH values, so this value should be provided by a variable in order to be of more general use.
I know that the first point has been already solved in previous posts, but
never (as far as I know) together with the second point, which is the part of the code I could not be able to deal with successfully (I do not post my code because I know it is just too far from being fixable).
Any help from this great community is highly appreciated.
-- Maurizio
You can iterate over the source string and for each sub-string use the re module to find the positions within each of the other strings. Then if at least one occurrence was found for each of the strings, yield the result:
import re
def find(source, strings, length):
for i in range(len(source) - length):
sub = source[i:i+length]
positions = {}
for s in strings:
# positions[s] = [m.start() for m in re.finditer(re.escape(sub), s)]
positions[s] = [i for i in range(len(s)) if s.startswith(sub, i)] # Using built-in functions.
if not positions[s]:
break
else:
yield sub, positions
And the generator can be used as illustrated in the following example:
import pprint
pprint.pprint(dict(find(
source='gfrtewwxadasvpbepilotzxxndffc',
strings=['sftrkpilotndkpilotllptptpyrh',
'ffftapilotdfmmmbtyrtdll',
'gftttepncvjspwqbbqbthpilotou',
'htfrpilotrtubbbfelnxcdcz'],
length=5
)))
which produces the following output:
{'pilot': {'ffftapilotdfmmmbtyrtdll': [5],
'gftttepncvjspwqbbqbthpilotou': [21],
'htfrpilotrtubbbfelnxcdcz': [4],
'sftrkpilotndkpilotllptptpyrh': [5, 13]}}

Doubts about string

So, I'm doing an exercise using python, and I tried to use the terminal to do step by step to understand what's happening but I didn't.
I want to understand mainly why the conditional return just the index 0.
Looking 'casino' in [Casinoville].lower() isn't the same thing?
Exercise:
Takes a list of documents (each document is a string) and a keyword.
Returns list of the index values into the original list for all documents containing the keyword.
Exercise solution
def word_search(documents, keyword):
indices = []
for i, doc in enumerate(documents):
tokens = doc.split()
normalized = [token.rstrip('.,').lower() for token in tokens]
if keyword.lower() in normalized:
indices.append(i)
return indices
My solution
def word_search(documents, keyword):
return [i for i, word in enumerate(doc_list) if keyword.lower() in word.rstrip('.,').lower()]
Run
>>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
Expected output
>>> word_search(doc_list, 'casino')
>>> [0]
Actual output
>>> word_search(doc_list, 'casino')
>>> [0, 2]
Let's try to understand the difference.
The "result" function can be written with list-comprehension:
def word_search(documents, keyword):
return [i for i, word in enumerate(documents)
if keyword.lower() in
[token.rstrip('.,').lower() for token in word.split()]]
The problem happens with the string : "Casinoville" at index 2.
See the output:
print([token.rstrip('.,').lower() for token in doc_list[2].split()])
# ['casinoville']
And here is the matter: you try to ckeck if a word is in the list. The answer is True only if all the string matches (this is the expected output).
However, in your solution, you only check if a word contains a substring. In this case, the condition in is on the string itself and not the list.
See it:
# On the list :
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()])
# False
# On the string:
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()][0])
# True
As result, in the first case, "Casinoville" isn't included while it is in the second one.
Hope that helps !
The question is "Returns list of the index values into the original list for all documents containing the keyword".
you need to consider word only.
In "Casinoville" case, word "casino" is not in, since this case only have word "Casinoville".
When you use the in operator, the result depends on the type of object on the right hand side. When it's a list (or most other kinds of containers), you get an exact membership test. So 'casino' in ['casino'] is True, but 'casino' in ['casinoville'] is False because the strings are not equal.
When the right hand side of is is a string though, it does something different. Rather than looking for an exact match against a single character (which is what strings contain if you think of them as sequences), it does a substring match. So 'casino' in 'casinoville' is True, as would be casino in 'montecasino' or 'casino' in 'foocasinobar' (it's not just prefixes that are checked).
For your problem, you want exact matches to whole words only. The reference solution uses str.split to separate words (the with no argument it splits on any kind of whitespace). It then cleans up the words a bit (stripping off punctuation marks), then does an in match against the list of strings.
Your code never splits the strings you are passed. So when you do an in test, you're doing a substring match on the whole document, and you'll get false positives when you match part of a larger word.

Overlapping values of strings in Python

I am building a puzzle word game in Python. I have the correct puzzle word, and the guessed puzzle word. I want to build a third string which shows the correct letters in the guessed puzzle in the correct puzzle word, and _ at the position of the incorrect letters.
For example, say the correct word is APPLE and the guessed word is APTLE
then i want to have a third string: AP_L_
The guessed word and correct word are guaranteed to be 3 to 5 characters long, but the guessed word is not guaranteed to be the same length as the correct word
For example, correct word is TEA and the guessed word is TEAKO, then the third string should be TEA__ because the players guessed the last two letters incorrectly.
Another example, correct word is APPLE and guessed word is POP, the third string should be:
_ _ P_ _ (without space separation)
I can successfully get the matched indexes of the correct and guessed word; however, I am having problems building the third string. I just learned that strings in Python are immutable and that i cannot assign something like str1[index] = str2[index]
I have tried many things, including using lists, but i am not getting the correct answer. The attached code is my most recent attempt, would you please help me solve this?
Thank you
find the match between puzzle_word and guess
def matcher(str_a, str_b):
#find indexes where letters overlap
matched_indexes = [i for i, (a, b) in enumerate(zip(str_a, str_b)) if a == b]
result = []
for i in str_a:
result.append('_')
for value in matched_indexes:
result[value].replace('_', str_a[value])
print(result)
matcher("apple", "allke")
the output result right now is list of five "_"
cases:
correct word is APPLE and the guessed word is APTLE third
string: AP_L_
correct word is TEA and the guessed word is TEAKO,
third string should be TEA__
correct word is APPLE and guessed
word is POP, third string should be _ _ P_ _
You can use itertools.zip_longest here to always make sure you pad out to the longest word provided and then create a new string by joining the matching characters or otherwise a _. eg:
from itertools import zip_longest
correct_and_guess = [
('APPLE', 'APTLE'),
('TEA', 'TEAKO'),
('APPLE', 'POP')
]
for correct, guess in correct_and_guess:
# If characters in same positions match - show character otherwise `_`
new_word = ''.join(c if c == g else '_' for c, g in zip_longest(correct, guess, fillvalue='_'))
print(correct, guess, new_word)
Will print the following:
APPLE APTLE AP_LE
TEA TEAKO TEA__
APPLE POP __P__
Couple of things here.
str.replace() does not replace inline; as you noted strings are immutable, so you have to assign the result of replace:
result[value] = result[value].replace('_', str_a[value])
However, there's no point doing this since you can just assign to the list element:
result[value] = str_a[value]
And finally you can assign a list of the length of str_a without the for loop, which might be more readable:
result = ['_'] * len(str_a)

Python get character position matches between 2 strings

I'm looking to encode text using a custom alphabet, while I have a decoder for such a thing, I'm finding encoding more difficult.
Attempted string.find, string.index, itertools and several loop attempts. I would like to take the position, convert it to integers to add to a list. I know its something simple I'm overlooking, and all of these options will probably yield a way for me to get the desired results, I'm just hitting a roadblock for some reason.
alphabet = '''h8*jklmnbYw99iqplnou b'''
toencode = 'You win'
I would like the outcome to append to a list with the integer position of the match between the 2 string. I imagine the output to look similar to this:
[9,18,19,20,10,13,17]
Ok, I just tried a bit harder and got this working. For anyone who ever wants to reference this, I did the following:
newlist = []
for p in enumerate(flagtext):
for x in enumerate(alphabet):
if p[1] == x[1]:
newlist.append(x[0])
print newlist

how would i look for the shortest unique subsequence from a set of words in python?

If i have a set of similar words such as:
\bigoplus
\bigotimes
\bigskip
\bigsqcup
\biguplus
\bigvee
\bigwedge
...
\zebra
\zeta
i would like to find the shortest unique set of letters that would characterize each word uniquely
i.e.
\bigop:
\bigoplus
\bigot:
\bigotimes
\bigsk:
\bigskip
EDIT: notice the unique sequence identifier always starts from the begining of the word. I writting an app that gives snippet suggestions when typing. So in general users will start typing from the start of the word
and so on, the sequence needs only be as long as is enough to characterize a word uniquely.
EDIT: but needs to start from the begining of the word.
The characterization always begins from the beginning of the word.
My thoughts:
i was thinking of sorting the words, and grouping based on the fist alphabetical letter, then probably use a longest common subsequence algorithm to find the longest subsequence in common, take its length and use length+1 chars for that unique substring, but im stuck since the algorithms i know for longest subsequence will usually only take two parameters at a time, and i may have more than two words in each group starting with a particular alphabetical letter.
Im i solving an already solved probelem? google was no help.
I'm assuming you want to find the prefixes that uniquely identify the strings, because if you could pick any subsequence, then for example om would be enough to identify \bigotimes in your example.
You can make use of the fact that for a given word, the word with the longest common prefix will be adjacent to it in lexicographical order.
Since your dictionary seems to be sorted already, you can figure out the solution for every word by finding the longest prefix that disambiguates it from both its neighbors.
Example:
>>> lst = r"""
... \bigoplus
... \bigotimes
... \bigskip
... \bigsqcup
... \biguplus
... \bigvee
... \bigwedge
... """.split()
>>> lst.sort() # necessary if lst is not already sorted
>>> lst = [""] + lst + [""]
>>> def cp(x): return len(os.path.commonprefix(x))
...
>>> { lst[i]: 1 + max(cp(lst[i-1:i+1]), cp(lst[i:i+2])) for i in range(1,len(lst)-1) }
{'\\bigvee': 5,
'\\bigsqcup': 6,
'\\biguplus': 5,
'\\bigwedge': 5,
'\\bigotimes': 6,
'\\bigoplus': 6,
'\\bigskip': 6}
The numbers indicate how long the minimal uniquely identifying prefix of a word is.
Thought I'd dump this here since it was the most similar to a question I was about to ask:
Looking for a better solution (will report back when I find one) to iterating through a sequence of strings, trying to map the shortest unique string for/to each.
For example, in a sequence of:
['blue', 'black', 'bold']
# 'blu' --> 'blue'
# 'bla' --> 'black'
# 'bo' --> 'bold'
Looking to improve upon my first, feeble solution. Here's what I came up with:
# Note: Iterating through the keys in a dict, mapping shortest
# unique string to the original string.
shortest_unique_strings = {}
for k in mydict:
for ix in range(len(k)):
# When the list-comp only has one item.
# 'key[:ix+1]' == the current substring
if len([key for key in mydict if key.startswith(key[:ix+1])]) == 1:
shortest_unique_strings[key[:ix+1]] = k
break
Note: On improving efficiency: we should be able to remove those keys/strings that have already been found, so that successive searches don't have to repeat on those items.
Note: I specifically refrained from creating/using any functions outside of built-ins.

Resources