Finding unique letters in a string - python-3.x

I am trying to find out how to find the amount of unique letters in a string.
I know how to find the amount of unique characters in a string by using the code below.
But what if I want to find the amount of unique letters, not characters, excluding punctuation,in the string?
import string
s = 'AabC'
s = s.lower()
print(sum(1 for c in string.ascii_lowercase if s.count(c) == 1))

First, you can filter out all non-letter characters, then you can convert it into a set and check the length.
s = 'AabC123qwer!!>>??'
unique = set(filter(str.isalpha, s.lower()))
print(len(unique))
7

Related

How to remove the alphanumeric characters from a list and split them in the result?

'''def tokenize(s):
string = s.lower().split()
getVals = list([val for val in s if val.isalnum()])
result = "".join(getVals)
print (result)'''
tokenize('AKKK#eastern B!##est!')
Im trying for the output of ('akkkeastern', 'best')
but my output for the above code is - AKKKeasternBest
what are the changes I should be making
Using a list comprehension is a good way to filter elements out of a sequence like a string. In the example below, the list comprehension is used to build a list of characters (characters are also strings in Python) that are either alphanumeric or a space - we are keeping the space around to use later to split the list. After the filtered list is created, what's left to do is make a string out of it using join and last but not least use split to break it in two at the space.
Example:
string = 'AKKK#eastern B!##est!'
# Removes non-alpha chars, but preserves space
filtered = [
char.lower()
for char in string
if char.isalnum() or char == " "
]
# String-ifies filtered list, and splits on space
result = "".join(filtered).split()
print(result)
Output:
['akkkeastern', 'best']

Can we replace an integer to English letter in a document python

I have a document and it contain numbers in between is there a way I can replace all the numbers to the English equivalent ?
eg:
My age is 10. I am in my 7th grade.
expected-o/p :
My age is Ten and I am in my seventh grade.
Thanks in advance
You'll want to take a look at num2words.
You'll have to construct regexp to catch the numbers you want to replace and pass them to num2words. Based on example provided, you also might need the ordinal flag.
import re
from num2words import num2words
# this is just an example NOT ready to use code
text = "My age is 10. I am in my 7th grade."
to_replace = set(re.findall('\d+', text)) # find numbers to replace
longest = sorted(to_replace, key=len, reverse=True) # sort so longest are replaced first
for m in longest:
n = int(m) # convert from string to number
result = num2words(n) # generate text representation
text = re.sub(m, result, text) # substitute in the text
print(text)
edited to reflect that OP wants to catch all digits

Get digits at end of string in a pythonic way

I'm using python 3.x. I'm trying to get the (int) number at the end of a string with format
string_example_1 = l-45-98-567-567-12
string_example_2 = s-89-657
or in general, a single lowercase letter followed by a number of integers separated by '-'. What I need is to get the last number (12 and 657 in these cases). I have archived this with the function
def ending(the_string):
out = ''
while the_string[-1].isdigit():
out = the_string[-1] + out
the_string = the_string[:-1]
return out
but I'm sure there must be a more pythonic way to do this. In a previous instance I check manually that the string starts the way I like by doing something like
if st[0].isalpha() and st[1]=='-' and st[2].isdigit():
statement...
I would just split the string on -, take the last of the splits and convert it to an integer.
string_example_1 = "l-45-98-567-567-12"
string_example_2 = "s-89-657"
def last_number(s):
return int(s.split("-")[-1])
print(last_number(string_example_1))
# 12
print(last_number(string_example_2))
# 657
Without regular expressions, you could reverse the string, take elements from the string while they're still numbers, and then reverse the result. In Python:
from itertools import takewhile
def extract_final_digits(s):
return int(''.join(reversed(list(takewhile(lambda c: c.isdigit(), reversed(s))))))
But the simplest is to just split on a delimiter and take the final element in the split list.

Is there a way to substring, which is between two words in the string in Python?

My question is more or less similar to:
Is there a way to substring a string in Python?
but it's more specifically oriented.
How can I get a par of a string which is located between two known words in the initial string.
Example:
mySrting = "this is the initial string"
Substring = "initial"
knowing that "the" and "string" are the two known words in the string that can be used to get the substring.
Thank you!
You can start with simple string manipulation here. str.index is your best friend there, as it will tell you the position of a substring within a string; and you can also start searching somewhere later in the string:
>>> myString = "this is the initial string"
>>> myString.index('the')
8
>>> myString.index('string', 8)
20
Looking at the slice [8:20], we already get close to what we want:
>>> myString[8:20]
'the initial '
Of course, since we found the beginning position of 'the', we need to account for its length. And finally, we might want to strip whitespace:
>>> myString[8 + 3:20]
' initial '
>>> myString[8 + 3:20].strip()
'initial'
Combined, you would do this:
startIndex = myString.index('the')
substring = myString[startIndex + 3 : myString.index('string', startIndex)].strip()
If you want to look for matches multiple times, then you just need to repeat doing this while looking only at the rest of the string. Since str.index will only ever find the first match, you can use this to scan the string very efficiently:
searchString = 'this is the initial string but I added the relevant string pair a few more times into the search string.'
startWord = 'the'
endWord = 'string'
results = []
index = 0
while True:
try:
startIndex = searchString.index(startWord, index)
endIndex = searchString.index(endWord, startIndex)
results.append(searchString[startIndex + len(startWord):endIndex].strip())
# move the index to the end
index = endIndex + len(endWord)
except ValueError:
# str.index raises a ValueError if there is no match; in that
# case we know that we’re done looking at the string, so we can
# break out of the loop
break
print(results)
# ['initial', 'relevant', 'search']
You can also try something like this:
mystring = "this is the initial string"
mystring = mystring.strip().split(" ")
for i in range(1,len(mystring)-1):
if(mystring[i-1] == "the" and mystring[i+1] == "string"):
print(mystring[i])
I suggest using a combination of list, split and join methods.
This should help if you are looking for more than 1 word in the substring.
Turn the string into array:
words = list(string.split())
Get the index of your opening and closing markers then return the substring:
open = words.index('the')
close = words.index('string')
substring = ''.join(words[open+1:close])
You may want to improve a bit with the checking for the validity before proceeding.
If your problem gets more complex, i.e multiple occurrences of the pair values, I suggest using regular expression.
import re
substring = ''.join(re.findall(r'the (.+?) string', string))
The re should store substrings separately if you view them in list.
I am using the spaces between the description to rule out the spaces between words, you can modify to your needs as well.

How to calculate word co-occurence

I have a string of characters of length 50 say representing a sequence abbcda.... for alphabets taken from the set A={a,b,c,d}.
I want to calculate how many times b is followed by another b (n-grams) where n=2.
Similarly, how many times a particular character is repeated thrice n=3 consecutively, say in the input string abbbcbbb etc so here the number of times b occurs in a sequence of 3 letters is 2.
To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
In order to find number of occurrences of an arbitrary sequence use the first element in first parenthesis and the rest after equality sign, to find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))
Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);
You may try this piece of code that uses ismember (doc).
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%digram case
index_digram = ismember(str, 'aa');
%trigram case
index_trigram = ismember(str, 'aaa');
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);
this will find all n-grams and count them:
numberOfGrams = 5;
s = char(floor(rand(1,1000)*4)+double('a'));
ngrams = cell(1);
for n = 2:numberOfGrams
strLength = size(s,2)-n+1;
indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
grams = s(indices);
gramNumbers = (double(grams)-double('a'))*((ones(1,n)*n).^(0:n-1))';
[uniqueGrams, gramInd] = unique(gramNumbers);
count=hist(gramNumbers,uniqueGrams);
ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
edit:
the result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears

Resources