Change Letters in A String One at a Time (Pandas,Python3) - python-3.x

I have a list of words in Pandas (DF)
Words
Shirt
Blouse
Sweater
What I'm trying to do is swap out certain letters in those words with letters from my dictionary one letter at a time.
so for example:
mydict = {"e":"q,w",
"a":"z"}
would create a new list that first replaces all the "e" in a list one at a time, and then iterates through again replacing all the "a" one at a time:
Words
Shirt
Blouse
Sweater
Blousq
Blousw
Swqater
Swwater
Sweatqr
Sweatwr
Swezter
I've been looking around at solutions here: Mass string replace in python?
and have tried the following code, but it changes all instances of "e" at once instead of doing so one at a time. Any help?
mydict = {"e":"q,w"}
s = DF
for k, v in mydict.items():
    for j in v:
        s['Words'] = s["Words"].str.replace(k, j)
DF["Words"] = s
this doesn't seem to work either:
s = DF.replace({"Words": {"e": "q","w"}})

This answer is very similar to Brian's answer, but a little bit sanitized and the output has no duplicates:
words = ["Words", "Shirt", "Blouse", "Sweater"]
md = {"e": "q,w", "a": "z"}
md = {k: v.split(',') for k, v in md.items()}
newwords = []
for word in words:
    newwords.append(word)
    for c in md:
        occ = word.count(c)
        pos = 0
        for _ in range(occ):
            pos = word.find(c, pos)
            for r in md[c]:
                tmp = word[:pos] + r + word[pos+1:]
                newwords.append(tmp)
            pos += 1
Content of newwords:
['Words', 'Shirt', 'Blouse', 'Blousq', 'Blousw', 'Sweater', 'Swqater', 'Swwater', 'Sweatqr', 'Sweatwr', 'Swezter']
Prettyprint:
Words
Shirt
Blouse
Blousq
Blousw
Sweater
Swqater
Swwater
Sweatqr
Sweatwr
Swezter
Any errors are a result of the current time. ;)
Update (explanation)
tl;dr
The main idea is to find the occurrences of the character in the word one after another. For each occurrence we then replace it with each replacement character (again one after another). Each resulting word gets added to the output list.
I will try to explain everything step by step:
words = ["Words", "Shirt", "Blouse", "Sweater"]
md = {"e": "q,w", "a": "z"}
Well. Your basic input. :)
md = {k: v.split(',') for k, v in md.items()}
A simpler way to deal with the replacement dictionary. md now looks like {"e": ["q", "w"], "a": ["z"]}. Now we don't have to handle "q,w" and "z" differently: the replacement step is the same for both, regardless of the fact that "a" has only one replacement character.
newwords = []
The new list to store the output in.
for word in words:
newwords.append(word)
We have to do those actions for each word (I assume the reason is clear). We also append the word directly to our just-created output list (newwords).
for c in md:
c as short for character. So for each character we want to replace (all keys of md), we do the following stuff.
occ = word.count(c)
occ for occurrences (yeah. count would fit as well :P). word.count(c) returns the number of occurrences of the character/string c in word. So "Sweater".count("o") => 0 and "Sweater".count("e") => 2.
We use this here to know how often we have to look at word to get all those occurrences of c.
pos = 0
Our startposition to look for c in word. Comes into use in the next loop.
for _ in range(occ):
For each occurrence. Since the loop counter has no value for us here, we "discard" it by naming it _. At this point we don't know where c is in word. Yet.
pos = word.find(c, pos)
Oh. Look. We found c. :) word.find(c, pos) returns the index of the first occurrence of c in word, starting at pos. At the beginning, this means from the start of the string => the first occurrence of c. But with this call we already update pos. This plus the last line (pos += 1) moves our search window for the next round to start just behind the previous occurrence of c.
for r in md[c]:
Now you see why we updated md previously: we can easily iterate over it now (a md[c].split(',') on the old md would do the job as well). So we are doing the replacement now for each of the replacement characters.
tmp = word[:pos] + r + word[pos+1:]
The actual replacement. We store it in tmp (for debugging reasons). word[:pos] gives us word up to the (current) occurrence of c (excluding c). r is the replacement. word[pos+1:] adds the remaining word (again without c).
newwords.append(tmp)
Our so created new word tmp now goes into our output-list (newwords).
pos += 1
The already mentioned adjustment of pos to "jump over c".
Additional question from OP: Is there an easy way to dictate how many letters in the string I want to replace [(meaning e.g. multiple at a time)]?
Surely. But I have currently only a vague idea on how to achieve this. I am going to look at it, when I got my sleep. ;)
words = ["Words", "Shirt", "Blouse", "Sweater", "multipleeee"]
md = {"e": "q,w", "a": "z"}
md = {k: v.split(',') for k, v in md.items()}
num = 2 # this is the number of replaces at a time.
newwords = []
for word in words:
    newwords.append(word)
    for char in md:
        for r in md[char]:
            pos = multiples = 0
            current_word = word
            while current_word.find(char, pos) != -1:
                pos = current_word.find(char, pos)
                current_word = current_word[:pos] + r + current_word[pos+1:]
                pos += 1
                multiples += 1
                if multiples == num:
                    newwords.append(current_word)
                    multiples = 0
                    current_word = word
Content of newwords:
['Words', 'Shirt', 'Blouse', 'Sweater', 'Swqatqr', 'Swwatwr', 'multipleeee', 'multiplqqee', 'multipleeqq', 'multiplwwee', 'multipleeww']
Prettyprint:
Words
Shirt
Blouse
Sweater
Swqatqr
Swwatwr
multipleeee
multiplqqee
multipleeqq
multiplwwee
multipleeww
I added multipleeee to demonstrate how the replacement works: for num = 2 the first two occurrences are replaced, then the next two, and so on. So the replaced parts never overlap. If you want something like ['multiplqqee', 'multipleqqe', 'multipleeqq'] instead, you have to store the position of the "first" occurrence of char. You can then restore pos to that position in the if multiples == num: block.
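The overlapping variant described above could be sketched as follows. This is only one possible approach, not the code from the answer: the helper name replace_windows is made up for illustration, and it assumes a single-character char. It collects the positions of char first and then slides a window of num consecutive occurrences over them.

```python
def replace_windows(word, char, repl, num):
    """Replace every window of `num` consecutive occurrences of `char`.

    Hypothetical helper for illustration; assumes `char` is one character.
    """
    # Indices of every occurrence of char in word
    positions = [i for i, c in enumerate(word) if c == char]
    results = []
    # Slide a window of size num over the occurrence positions
    for start in range(len(positions) - num + 1):
        letters = list(word)
        for p in positions[start:start + num]:
            letters[p] = repl
        results.append("".join(letters))
    return results


print(replace_windows("multipleeee", "e", "q", 2))
# ['multiplqqee', 'multipleqqe', 'multipleeqq']
```

Words with fewer than num occurrences simply produce an empty list, which matches the behaviour of the loop above for "Blouse".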
If you got further questions, feel free to ask. :)

Because you need to replace letters one at a time, this doesn't sound like a good problem to solve with pandas, since pandas is about doing everything at once (vectorized operations). I would dump out your DataFrame into a plain old list and use list operations:
words = DF.to_dict()["Words"].values()
for find, replace in reversed(sorted(mydict.items())):
    for word in words:
        occurrences = word.count(find)
        if not occurrences:
            print(word)
            continue
        start_index = 0
        for i in range(occurrences):
            for replace_char in replace.split(","):
                modified_word = list(word)
                index = modified_word.index(find, start_index)
                modified_word[index] = replace_char
                modified_word = "".join(modified_word)
                print(modified_word)
            start_index = index + 1
Which gives:
Words
Shirt
Blousq
Blousw
Swqater
Swwater
Sweatqr
Sweatwr
Words
Shirt
Blouse
Swezter
Instead of printing the words, you can append them to a list and re-create a DataFrame if that's what you want to end up with.
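As a rough sketch of that last suggestion, the words can be collected in a list by a small function. The name expand_words is hypothetical, and the sketch assumes that every key in the dictionary is a single character. The output order follows the question (each word followed by its variants), not the order of the loop above.

```python
def expand_words(words, mapping):
    """Return each word followed by its one-letter-at-a-time replacements.

    Hypothetical helper; assumes the keys of `mapping` are single characters
    and the values are comma-separated replacement characters.
    """
    result = []
    for word in words:
        result.append(word)
        for find, replacements in mapping.items():
            for index, letter in enumerate(word):
                if letter != find:
                    continue
                # One new word per replacement character at this position
                for repl in replacements.split(","):
                    result.append(word[:index] + repl + word[index + 1:])
    return result


mydict = {"e": "q,w", "a": "z"}
new_words = expand_words(["Shirt", "Blouse", "Sweater"], mydict)
print(new_words)
# With pandas available, pd.DataFrame({"Words": new_words}) would turn
# the list back into a DataFrame.
```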

If you are looping, you need to update s at each cycle of the loop. You also need to loop over v.
mydict = {"e":"q,w"}
s=deduped
for k, v in mydict.items():
    for j in v:
        s = s.replace(k, j)
Then reassign it to your dataframe:
df["Words"] = s
If you can write this as a function that takes a 1-D array (list, NumPy array, etc.), you can apply it to any column with df.apply().

Related

Recursive function how to manage output

I'm working on a project for creating some word list. I have a word and some rules, for example, this char % is for digit, while this one ^ for special character, for example January%%^ should create things like:
January00!
January01!
January02!
January03!
January04!
January05!
January06!
etc.
For now I'm trying to do it with only digit and create a recursive function, because people can add as many digits and special characters as they want
January^%%%^% (for example)
This is the first function I have created:
month = "January"
nbDigit = "%%%"
def addNumber(month : list, position: int):
    for i in range(position, len(month)):
        for j in range(0,10):
            month[position] = j
            if(position == len(month)-1):
                print (''.join(str(v) for v in month))
            if position < len(month):
                if month[position+1] == "%":
                    addNumber(month, position+1)
The problem is for each % that I have there is another output (three %, three times as output January000-January999/January000-January999/January000-January999).
When I tried to add the new function special character it's even worse, because I can't manage the output since every word can't end with a special character or digit. (AddSpecialChar is also a recursive function).
I believe what you are looking for is the following:
month = 'January'
nbDigit = "%%"
def addNumbers(root: str, mask: str)-> list:
    # create a list of words using root followed by digits
    rslt = []
    mxNmb = 0
    for i in range(len(mask)):
        mxNmb += 9 * 10**i
    mxNmb += 1
    for i in range(mxNmb):
        word = f"{root}{((str(i).rjust(len(mask), '0')))}"
        rslt.append(word)
    return rslt
this will produce:
['January00',
'January01',
'January02',
'January03',
'January04',
'January05',
'January06',
'January07',
'January08',
'January09',
'January10',
'January11',
'January12',
'January13',
'January14',
'January15',
'January16',
'January17',
'January18',
'January19',
'January20',
'January21',
'January22',
'January23',
'January24',
'January25',
'January26',
'January27',
'January28',
'January29',
'January30',
'January31',
'January32',
'January33',
'January34',
'January35',
'January36',
'January37',
'January38',
'January39',
'January40',
'January41',
'January42',
'January43',
'January44',
'January45',
'January46',
'January47',
'January48',
'January49',
'January50',
'January51',
'January52',
'January53',
'January54',
'January55',
'January56',
'January57',
'January58',
'January59',
'January60',
'January61',
'January62',
'January63',
'January64',
'January65',
'January66',
'January67',
'January68',
'January69',
'January70',
'January71',
'January72',
'January73',
'January74',
'January75',
'January76',
'January77',
'January78',
'January79',
'January80',
'January81',
'January82',
'January83',
'January84',
'January85',
'January86',
'January87',
'January88',
'January89',
'January90',
'January91',
'January92',
'January93',
'January94',
'January95',
'January96',
'January97',
'January98',
'January99']
Adding another position to the nbDigit variable will produce the numeric sequence from 000 to 999
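For masks that mix digits and special characters, as in the question, one possible sketch uses itertools.product instead of recursion. The function name expand_mask and the set of special characters "!@#" are assumptions made for illustration; the question does not say which special characters are allowed.

```python
from itertools import product

DIGITS = "0123456789"
SPECIALS = "!@#"  # assumption: the real set of special characters is not given


def expand_mask(root, mask, specials=SPECIALS):
    """Expand a mask where % stands for a digit and ^ for a special character."""
    pools = []
    for symbol in mask:
        if symbol == "%":
            pools.append(DIGITS)
        elif symbol == "^":
            pools.append(specials)
        else:
            pools.append(symbol)  # any other character is kept literally
    # product() yields one tuple per combination, the last position varying fastest
    return [root + "".join(combo) for combo in product(*pools)]


words = expand_mask("January", "%%^")
print(len(words))   # 10 * 10 * 3 = 300
print(words[:3])    # ['January00!', 'January00@', 'January00#']
```

Because every position contributes exactly one pool, each word is produced once, which avoids the repeated output described in the question.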

How to count strings in specified field within each line of one or more csv files

Writing a Python program (ver. 3) to count strings in a specified field within each line of one or more csv files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
Where the error occurs:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos] ##<---LINE generating error--->##
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
Judging from the output, I'd say that the fields in the csv file do not influence the count of the string. If string uniqueness is case-insensitive, remember to use yourstring.lower() so that matches in different cases are counted as one. Also keep in mind that if your text is large, the number of unique strings you find could be very large as well, so some sort of sorting must be in place to make sense of it! (Or else it might be a long list of random counts, with a large portion of it being just 1s.)
Now, to get a count of unique strings using the collections module is an easy way to go.
file = open('yourfile.txt', encoding="utf8")
a= file.read()
#if you have some words you'd like to exclude
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['<media','omitted>','it\'s','two','said']))
# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split(): #use the delimiter you want (a comma I think?)
    # replace punctuation so they arent counted as part of a word
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace("\"","")
    word = word.replace("!","")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict should contain each word and its frequency. After that just sort it using collections and print it out.
import collections
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
I hope this solves your problem. Lemme know if you face problems.
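The TypeError in the question is what Python raises when a list is indexed with a string, which would fit field numbers that are still strings after being read from the command line. A minimal sketch of the counting step, assuming 1-based field numbers given as "1,2,3,4" and a header row in each file; count_fields and the in-memory sample are illustrative, not taken from the original script:

```python
import csv
from collections import Counter


def count_fields(lines, fields):
    """Count the values found in the given 1-based field positions.

    Hypothetical helper; `lines` is any iterable of CSV text lines,
    for example an open file object.
    """
    # "1,2,3,4" -> [0, 1, 2, 3]: list indices must be integers
    indices = [int(f) - 1 for f in fields.split(",")]
    counts = Counter()
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        for idx in indices:
            if idx < len(row):
                counts[row[idx].strip()] += 1
    return counts


sample = [
    "Field1, Field2, Field3, Field4",
    "A, B, C, D",
    "A, E, F, G",
    "Z, E, C, D",
    "Z, W, C, Q",
]
counts = count_fields(sample, "1,2,3,4")
for value, n in counts.most_common():
    print(value, n)
```

For several files, the same function could be called once per open file and the resulting Counter objects added together.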

Combining words in a dictionary to match a single word

I'm working on a problem where I need to check how many words in a dictionary can be combined to match a single word.
For example:
Given the string "hellogoodsir", and the dictionary: {hello, good, sir, go, od, e, l}, the goal is to find all the possible combinations to form the string.
In this case, the result would be hello + good + sir, and hello + go + od + sir, resulting in 3 + 4 = 7 words used, or 1 + 1 = 2 combinations.
What I've come up with is simply to put all the words starting with the first character ("h" in this instance) in one hashmap (startH), and the rest in another hashmap (endH). I then go through every single word in the startH hashmap, and check if "hellogoodsir" contains the new word (start + end), where end is every word in the endH hashmap. If it does, I check if it equals the word to match, and then increments the counter with the value of the number for each word used. If it contains it, but doesn't equal it, I call the same method (recursion) using the new word (i.e. start + end), and proceed to try to append any word in the end hashmap to the new word to get a match.
This is obviously very slow for large number of words (and a long string to match). Is there a more efficient way to solve this problem?
As far as I know, this is an O(n^2) algorithm, but I'm sure this can be done faster.
Let's start with your solution. It is neither linear nor quadratic time; it's actually exponential time. A counterexample that shows this is:
word = "aaa...a"
dictionary = {"a", "aa", "aaa", ..., "aa...a"}
Since your solution is going through each possible matching, and there is exponential number of such in this example - the solution is exponential time.
However, that can be done more efficiently (quadratic time in the worst case) with Dynamic Programming, by following the recursive formula:
D[0] = 1 # the empty prefix can be formed in exactly one way
D[i] = sum { D[j] | word.Substring(j,i) is in the dictionary | 0 <= j < i }
Calculating each D[i] (given the previous ones are already known) is done in O(i)
This sums to total O(n^2) time, with O(n) extra space.
Quick note: By iterating the dictionary instead of all (i,j) pairs for each D[i], you can achieve O(k) time for each D[i], which ends up as O(n*k), where k is the dictionary size. This can be optimized for some cases by traversing only potentially valid strings - but for the same counter example as above, it will result in O(n*k).
Example:
dictionary = {hello, good, sir, go, od, e, l}
string = "hellogoodsir"
D[0] = 1
D[1] = 0 (no substring h)
D[2] = 0 (no substring he, d[1] = 0 for e)
...
D[5] = 1 (hello is the only valid string in dictionary)
D[6] = 0 (no dictionary string ending with g)
D[7] = D[5], because string.substring(5,7)="go" is in dictionary
D[8] = 0, no substring ending with "oo"
D[9] = 2: D[7] for "od", and D[5] for "good"
D[10] = D[11] = 0 (no strings in dictionary ending with "si" or "s")
D[12] = D[9] = 2 for substring "sir"
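The recurrence can be sketched in Python as follows. The function name count_decompositions is made up for illustration, and the sketch only counts the combinations; it does not count the words used. Note that Python string slicing and hashing add their own cost on top of the two loops.

```python
def count_decompositions(word, dictionary):
    """Count the ways to write `word` as a concatenation of dictionary words."""
    n = len(word)
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix can be formed in exactly one way
    for i in range(1, n + 1):
        for j in range(i):
            # word[j:i] is the candidate last word of the prefix word[:i]
            if ways[j] and word[j:i] in dictionary:
                ways[i] += ways[j]
    return ways[n]


dictionary = {"hello", "good", "sir", "go", "od", "e", "l"}
print(count_decompositions("hellogoodsir", dictionary))  # 2
```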
My suggestion would be to use a prefix tree. The nodes beneath the root would be h, g, s, o, e, and l. You will need nodes for terminating characters as well, to differentiate between go and good.
To find all matches, use a Breadth-first-search approach. The state you will want to keep track of is a composition of: the current index in the search-string, the current node in the tree, and the list of words used so far.
The initial state should be 0, root, []
While the list of states is not empty, dequeue the next state, and see if the index matches any of the keys of the children of the node. If so, modify a copy of the state and enqueue it. Also, if any of the children are the terminating character, do the same, adding the word to the list in the state.
I'm not sure about the time complexity of this algorithm, but it should be much faster.
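A minimal sketch of this idea, under some assumptions of my own: the trie is a nested dict, the terminating marker is the key None (which cannot collide with a character), and each state also carries the partial word read so far so the finished word can be recorded. The names build_trie and find_combinations are hypothetical.

```python
from collections import deque

END = None  # assumption: terminator key marking "a word ends at this node"


def build_trie(words):
    """Build a nested-dict prefix tree from the dictionary words."""
    root = {}
    for w in words:
        if not w:
            continue  # skip empty words, they would loop forever at the root
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_combinations(text, words):
    """Return every list of dictionary words that concatenates to `text`."""
    root = build_trie(words)
    results = []
    # State: (index into text, current trie node, words used, partial word)
    queue = deque([(0, root, [], "")])
    while queue:
        index, node, used, partial = queue.popleft()
        if END in node:
            # A dictionary word ends here: record it and restart at the root
            finished = used + [partial]
            if index == len(text):
                results.append(finished)
            else:
                queue.append((index, root, finished, ""))
        if index < len(text) and text[index] in node:
            ch = text[index]
            queue.append((index + 1, node[ch], used, partial + ch))
    return results


words = {"hello", "good", "sir", "go", "od", "e", "l"}
for combo in find_combinations("hellogoodsir", words):
    print(" + ".join(combo))
```

Unlike the dynamic programming count, this enumerates every combination, so its running time grows with the number of combinations found.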

Return number of alphabetical substrings within input string

I'm trying to generate code to return the number of substrings within an input that are in sequential alphabetical order.
i.e. Input: 'abccbaabccba'
Output: 2
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x):
    for i in range(len(x)):
        for j in range (len(x)+1):
            s = x[i:j+1]
            l = 0
            if s in alphabet:
                l += 1
    return l
print (cake('abccbaabccba'))
So far my code will only return 1. Based on tests I've done on it, it seems it just returns a 1 if there are letters in the input. Does anyone see where I'm going wrong?
You are getting the output 1 every time because your code resets the count to l = 0 on every pass through the loop.
If you fix this, you will get the answer 96, because you are including a lot of redundant checks on empty strings ('' in alphabet returns True).
If you fix that, you will get 17, because your test string contains substrings of length 1 and 2, as well as 3+, that are also substrings of the alphabet. So, your code needs to take into account the minimum substring length you would like to consider—which I assume is 3:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x, minLength=3):
    l = 0
    for i in range(len(x)):
        # carefully specify both the start and end values of the loop that determines where your substring will end;
        # j is the exclusive end of the slice, so it has to be able to reach len(x)
        for j in range(i+minLength, len(x)+1):
            s = x[i:j]
            if s in alphabet:
                print(repr(s))
                l += 1
    return l
print (cake('abccbaabccba'))
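If what is wanted is the number of maximal alphabetical runs, not the number of all qualifying substrings, a single pass over the string is enough. This is a sketch under that assumption; the name count_alphabetical_runs is made up, and the check uses consecutive character codes, so it assumes lowercase a-z input.

```python
def count_alphabetical_runs(text, min_length=3):
    """Count maximal runs of consecutive letters with at least min_length letters.

    Assumes lowercase a-z input: the test is on consecutive character codes.
    """
    count = 0
    run = 1 if text else 0  # length of the run ending at the current character
    for prev, cur in zip(text, text[1:]):
        if ord(cur) == ord(prev) + 1:
            run += 1
        else:
            if run >= min_length:
                count += 1
            run = 1
    if run >= min_length:  # a run that reaches the end of the string
        count += 1
    return count


print(count_alphabetical_runs('abccbaabccba'))  # 2
```

For a string like 'abcd' this returns 1 (one maximal run), whereas counting all substrings of length 3 or more would give 3.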

Matlab. Find the indices of a cell array of strings with characters all contained in a given string (without repetition)

I have one string and a cell array of strings.
str = 'actaz';
dic = {'aaccttzz', 'ac', 'zt', 'ctu', 'bdu', 'zac', 'zaz', 'aac'};
I want to obtain:
idx = [2, 3, 6, 8];
I have written a very long code that:
finds the elements with length not greater than length(str);
removes the elements with characters not included in str;
finally, for each remaining element, checks the characters one by one
Essentially, it's an almost brute force code and runs very slowly. I wonder if there is a simple way to do it fast.
NB: I have just edited the question to make clear that characters can be repeated n times if they appear n times in str. Thanks Shai for pointing it out.
You can sort the strings and then match them using regular expression. For your example the pattern will be ^a{0,2}c{0,1}t{0,1}z{0,1}$:
u = unique(str);
t = ['^' sprintf('%c{0,%d}', [u; histc(str,u)]) '$'];
s = cellfun(@sort, dic, 'uni', 0);
idx = find(~cellfun('isempty', regexp(s, t)));
I came up with this :
>> g=@(x,y) sum(x==y) <= sum(str==y);
>> h=@(t)sum(arrayfun(@(x)g(t,x),t))==length(t);
>> f=cellfun(@(x)h(x),dic);
>> find(f)
ans =
2 3 6
g & h: check if number of count of each letter in search string <= number of count in str.
f : finally use g and h for each element in dic
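For comparison, the same per-letter counting check can be sketched in Python with collections.Counter. This is not a translation of the MATLAB code above, only an illustration of the idea with the data from the question; the helper name matching_indices is made up, and the indices are 1-based to match MATLAB.

```python
from collections import Counter


def matching_indices(source, candidates):
    """Return 1-based indices of candidates whose letters all fit into source."""
    available = Counter(source)
    result = []
    for position, candidate in enumerate(candidates, start=1):
        # Counter subtraction keeps only positive counts, so an empty result
        # means no letter is used more often than it occurs in source
        if not (Counter(candidate) - available):
            result.append(position)
    return result


dic = ['aaccttzz', 'ac', 'zt', 'ctu', 'bdu', 'zac', 'zaz', 'aac']
print(matching_indices('actaz', dic))  # [2, 3, 6, 8]
```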
