I have text.txt which contains words and numbers.
1. I want to use "for" and "if" with cells or matrices that contain strings.
d = reshape(textread('text.txt','%s','delimiter','\t'),2,2)'
if d(1,1)=='apple'
    k=1
end
This didn't work.
2. I want to do something like this.
for m=1:size(text)
    for n=1:size(text2)
        if text2(n,3) contains 'apple' (such as "My apple his, her")
            if last word of text2(n,3) is text(m,2)
                output(m,1)=text(m,2)
                output(m,2)=text(m,1)
            end
        end
    end
end
This is written in a non-MATLAB way, but you can see what I want to do.
This type of work would be easy if text and text2 were matrices of numbers, but they contain strings.
How can I do the same thing I would do for numerical matrices?
I think you're looking for the string manipulation utilities. Here are a few:
% Case-sensitive comparison
strcmp('apple', 'apple') == true
strcmp('apple', 'orange') == false
% Case-INsensitive comparison
strcmpi('apple', 'AppLe') == true
strcmpi('apple', 'orange') == false
% Compare first N characters, case sensitive
strncmp('apple', 'apples are delicious', 5) == true
% Compare first N characters, case INsensitive
strncmpi('apple', 'AppLes are delicious', 5) == true
% find strings in other strings
strfind('I have apples and oranges', 'apple')
ans =
8 % match found at 8th character
% find any kind of pattern in (a) string(s)
regexp(...)
Well I'm not going to even begin on regexp...Just read doc regexp :)
The problem is that มาก technically is in มาก็. Because มาก็ is มาก + ็.
So when I do
"แชมพูมาก็เยอะ".replace("มาก", " X ")
I end up with
แชมพู X ็เยอะ
And what I want
แชมพู X เยอะ
What I really want is to force the last character ก็ to count as a single character, so that มาก no longer matches มาก็.
While I haven't found a proper solution, I was able to find a workaround. I split each string into separate (combined) characters via regex, then compare the resulting lists to each other.
import regex  # third-party "regex" package, which supports \X

# Check whether list s occurs as a contiguous slice of list l
def is_slice_in_list(s, l):
    len_s = len(s)  # so we don't recompute the length of s on every iteration
    return any(s == l[i:len_s + i] for i in range(len(l) - len_s + 1))

def is_word_in_string(w, s):
    # \X matches one combined character (base character plus its marks)
    a = regex.findall(r'\X', w)
    b = regex.findall(r'\X', s)
    return is_slice_in_list(a, b)

assert is_word_in_string("มาก็", "พูมาก็เยอะ") == True
assert is_word_in_string("มาก", "พูมาก็เยอะ") == False
The regex will split like this:
พู ม า ก็ เ ย อ ะ
ม า ก
And as it compares ก็ to ก the function figures the words are not the same.
I will mark this as answered, but if there is a nicer or more "proper" solution I will choose that one.
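One possible alternative that avoids the third-party regex package is to accept a match only when it is not immediately followed by a combining mark. This is a rough sketch under stated assumptions, not a full grapheme-cluster implementation: the name replace_whole_clusters is made up for illustration, and it assumes that marks such as ็ are reported by unicodedata.category with a category starting with "M".

```python
import unicodedata


def replace_whole_clusters(text, word, replacement):
    """Replace `word` in `text`, skipping matches that end mid-cluster.

    A match is rejected when the character right after it is a Unicode
    mark (category M*), because that mark belongs to the last matched
    base character.
    """
    if not word:
        return text  # nothing to search for
    out = []
    i = 0
    n = len(word)
    while i < len(text):
        if text.startswith(word, i):
            following = text[i + n:i + n + 1]  # "" at end of text
            is_mark = bool(following) and unicodedata.category(following).startswith("M")
            if not is_mark:
                out.append(replacement)
                i += n
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

With this sketch, replacing "มาก" in "แชมพูมาก็เยอะ" leaves the string unchanged, while replacing "มาก็" gives "แชมพู X เยอะ".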
Say we want to check whether a given character, here the dot '.', appears consecutively in a string s.
For example, 'test.2.1' does not have consecutive dots, whereas 'test..2.2a...' has consecutive dots. In fact, for the second example, we should not even bother checking the rest of the string after the first occurrence of consecutive dots.
I have come up with the following simple method:
def consecutive_dots(s):
    count = 0
    for c in s:
        if c == '.':
            count += 1
            if count == 2:
                return True
        else:
            count = 0
    return False
I was wondering if the above is the most 'pythonic' and efficient way to achieve this goal.
You can just use the in operator to check whether two consecutive dots (that is, the string "..") appear in the string s:
def consecutive_dots(s):
    return '..' in s
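If the same check is needed for other characters or longer runs, the idea generalizes directly. A minimal sketch, assuming char is a single character; the name has_run and its parameters are made up for illustration:

```python
def has_run(s, char=".", run=2):
    """Return True if `char` occurs at least `run` times in a row in `s`."""
    # Repeating the character builds the substring to look for, e.g. "..".
    return char * run in s
```

The substring search stops at the first match, so the rest of the string is not examined once a run is found.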
I have some transcriptions that unfortunately contain lots of occurrences of words separated by a period but no space (i.e. word.word).
Is there a way to use regex to separate these, but leave other tokens like decimals and abbreviations such as U.K. or U.S.A alone? I'm planning to tokenize the text, so I want the word.word occurrences to be counted as separate words, but I don't want to mess up abbreviations, decimals, or any other places where the period is part of the word. Since I want to replace these specific word.word periods with a space but leave all others alone (or at least not replace them with a space, because that would break up the abbreviation), my first thought was something like this:
text = re.sub("(?<!\d){2,}\.(?!\d){2,}", " ", text)
That is, look for periods that are surrounded by two or more non-digits on each side, and then just replace the period with a space. But it seems that variable-length lookbehind/lookahead isn't really a thing you can do. I've tested this out in some regex testers and it still matches the letter abbreviations above, although it does not match decimals.
Is there another way to write what I've thought about or another way to approach this? I've gotten somewhat mentally stuck in this solution and I can't find another way that will do close to what I'm looking to do - can it even be done?
Thank you!
Ok, so :D
I have written this code and given it the string "i.would.like.to.visit.the.U.S.A.or.the.u.k.while.i.am.eating.a.banana.b" (the trailing b is there on purpose, to make sure single letters aren't deleted for no reason), and the output was:
['i', 'would', 'like', 'to', 'visit', 'the', 'USA', 'or', 'the', 'uk', 'while', 'i', 'am', 'eating', 'a', 'banana', 'b'].
The code is:
text = "i.would.like.to.visit.the.U.S.A.or.the.u.k.while.i.am.eating.a.banana.b"

def split(string: str):
    string = string.split(".")
    length = len(string) - 1
    obj = enumerate(string)
    together = []
    for index, word in obj:
        sub = []
        if index and len(word) == 1 and index < length:
            idx = index
            while len(string[idx]) == 1:
                sub.append((string[idx], idx))
                idx += 1
                next(obj)
            together.append(sub)
    if together:
        deleted = 0
        for sub in together:
            if len(sub) > 1:
                string[sub[0][1] - deleted:sub[-1][1] + 1 - deleted] = ["".join(x[0] for x in sub)]
                deleted += len(sub) - 1
    return string

print(split(text))
You can change "".join(x[0] for x in sub) to ".".join(x[0] for x in sub) in order to keep the dots (U.S.A instead of USA).
If you are just trying to add a space when both sides of the period have two or more characters that are neither digits nor periods, the following is what you are looking for.
text = re.sub(r"([^\d.]{2})\.([^\d.]{2})", r"\1. \2", text)
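A possible variant of this substitution uses fixed-width lookarounds, which Python's re module does accept. Because lookarounds do not consume the neighbouring characters, a chain such as ab.cd.ef gets both periods handled, whereas the capturing version, as far as I can tell, skips the second one. This is only a sketch: the name split_joined_words, the replacement parameter, and the choice to require letters (rather than any non-digit) on both sides are my own assumptions.

```python
import re

# [^\W\d_] means "a word character that is not a digit or underscore",
# i.e. a letter. Both lookarounds are fixed width (exactly two letters).
_JOINED_PERIOD = re.compile(r"(?<=[^\W\d_]{2})\.(?=[^\W\d_]{2})")


def split_joined_words(text, replacement=". "):
    """Rewrite periods that sit between two runs of 2+ letters."""
    return _JOINED_PERIOD.sub(replacement, text)
```

Passing replacement=" " drops the period entirely, which is closer to what the question asks for; abbreviations like U.K. and decimals like 3.14 are left alone in both cases.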
Example:
"This sentence ends.The following is an abbreviation A.B.C." becomes
"This sentence ends. The following is an abbreviation A.B.C."
I have short strings (tweets) from which I must extract all mentions and return them as a list, including repeats.
extract_mentions('.#AndreaTantaros-supersleuth! You are a true journalistic professional. Keep up the great work! #MakeAmericaGreatAgain')
[AndreaTantaros]
How do I remove all text after the first punctuation character that follows '#'? (In this case it would be '-'.) Note that the punctuation can vary. Please do not use regex.
I have used the following:
def extract_mentions(tweet):
    tweet_list = tweet.split()
    mention_list = []
    for word in tweet_list:
        if '#' in word:
            x = word.index('#')
            y = word[x+1:]
            if not y.isalnum():
                # strips only the last character of the word
                y = word[x+1:-1]
            mention_list.append(y)
    return mention_list
This would only work for instances with one extra character.
import string

def extract_mentions(s, delimeters = string.punctuation + string.whitespace):
    mentions = []
    begin = s.find('#')
    while begin >= 0:
        # scan forward to the first delimiter after the '#'
        end = begin + 1
        while end < len(s) and s[end] not in delimeters:
            end += 1
        mentions.append(s[begin+1:end])
        begin = s.find('#', end)
    return mentions

>>> print(extract_mentions('.#AndreaTantaros-supersleuth! You are a true journalistic professional. Keep up the great work! #MakeAmericaGreatAgain'))
['AndreaTantaros', 'MakeAmericaGreatAgain']
Use the string.punctuation constant to get all punctuation chars.
Remove the first characters while they are punctuation (else the answer would be empty string all the time). Then find the first punctuation char.
This uses 2 loops with opposite conditions and a set for better speed.
z = ".#AndreaTantaros-supersleuth! You are a true journalistic professional. Keep up the great work! #MakeAmericaGreatAgain"
import string
# skip leading punctuation: find position of first non-punctuation
spun = set(string.punctuation)  # faster if searched from a set
start_pos = 0
while start_pos < len(z) and z[start_pos] in spun:
    start_pos += 1
# then advance to the next punctuation character (or the end of the string)
end_pos = start_pos
while end_pos < len(z) and z[end_pos] not in spun:
    end_pos += 1
print(z[start_pos:end_pos])
Just use regexp to match and extract part of the text.
Essentially, I have two strings of equal length, let's say 'AGGTCT' and 'AGGCCT' for example's sake. I want to compare them position by position and get a readout of where they do not match. So here I would hope to get 1 out, because there is only one position where they do not match (position 4). If anyone has ideas for the positional comparison code, that would help me a lot to get started.
Thank you!!
Use the following syntax to get the number of dissimilar characters for strings of equal size:
sum( str1 ~= str2 )
If you want to be case insensitive, use:
sum( lower(str1) ~= lower(str2) )
The expression str1 ~= str2 performs char-by-char comparison of the two strings, yielding a logical vector of the same size as the strings, with true where they mismatch (using ~=) and false where they match. To get your result simply sum the number of true values (mismatches).
EDIT: if you want to count the number of matching chars, you can either:
Use the "equal to" == operator (instead of the "not equal to" ~= operator):
sum( str1 == str2 )
Or subtract the number of mismatches from the total number:
numel(str1) - sum( str1 ~= str2 )
You can compare all the elements of the strings:
r = all(seq1 == seq2)
This compares char by char and returns true if all the elements in the resulting array are true. If the strings can have different sizes, you may want to compare the sizes first. An alternative, which instead returns true when the strings differ, is
r = any(seq1 ~= seq2)
Another solution is to use strcmp:
r = strcmp(seq1, seq2)
I would just like to point out that what you are asking for is the Hamming distance (since you ask for alternatives, the article contains links to some). This is already discussed here. In short, the built-in command pdist can do it.