How to reconstruct strings in "edit_distance_problem"?

Suppose you are given the dp table for string X = "AGGGCT" and string Y = "AGGCA", where
m = length of X + 1
n = length of Y + 1
dp[m][n] =
0 1 2 3 4 5
1 0 1 2 3 4
2 1 0 1 2 3
3 2 1 0 1 2
4 3 2 1 1 2
5 4 3 2 1 2
6 5 4 3 2 2
and you want to reconstruct three strings as follows
string row1 = "AGGGCT" ;
string row2 = "||| | " ;
string row3 = "AGG-CA" ;
How can the strings row1, row2 and row3 be reconstructed? If possible, please post code in C/C++/Java.

I think this page can be a good starting point:
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java
You have to make a few modifications, but the core idea is to record, in the "min" step, which case was chosen for each (i, j). Before returning, you can walk backwards through the matrix step by step, starting at distance[str1.length()][str2.length()]. If a diagonal step leaves the distance unchanged, the characters match and you show a |; if the step is diagonal but the distance differs, it was a change step; if the step is vertical or horizontal, it was a remove or an add.
You can store this backwards information in a string and display it in reverse order afterwards.
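The backward walk described above can be sketched as follows. This is a minimal Python sketch rather than a definitive implementation: the function name align is hypothetical, the table is rebuilt inside the function so the example is self-contained, and instead of storing the chosen case it re-derives each step by comparing neighbouring cells. The tie-breaking order (vertical, then diagonal, then horizontal) is an assumption; a different order can return a different but equally optimal alignment.

```python
def align(x, y):
    """Return (row1, row2, row3) for one optimal alignment of x and y."""
    m, n = len(x), len(y)
    # dp[i][j] = edit distance between x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # remove from x
                           dp[i][j - 1] + 1,         # add from y
                           dp[i - 1][j - 1] + cost)  # match or change

    row1, row2, row3 = [], [], []
    i, j = m, n
    # Walk back from the bottom-right corner, collecting columns in reverse.
    while i > 0 or j > 0:
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            # vertical step: x[i-1] is aligned with a gap
            row1.append(x[i - 1]); row2.append(' '); row3.append('-')
            i -= 1
        elif (i > 0 and j > 0 and
              dp[i][j] == dp[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)):
            # diagonal step: match ('|') or change (' ')
            row1.append(x[i - 1])
            row2.append('|' if x[i - 1] == y[j - 1] else ' ')
            row3.append(y[j - 1])
            i -= 1
            j -= 1
        else:
            # horizontal step: y[j-1] is aligned with a gap
            row1.append('-'); row2.append(' '); row3.append(y[j - 1])
            j -= 1
    return tuple(''.join(reversed(r)) for r in (row1, row2, row3))


row1, row2, row3 = align("AGGGCT", "AGGCA")
print(row1)  # AGGGCT
print(row2)  # ||| |
print(row3)  # AGG-CA
```

With this tie-breaking order the three rows match the ones in the question.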


Cumulative count using grouping, sorting, and condition

I want a cumulative count that keeps increasing only while column c is zero, grouped by column a and sorted by b. For any other number, the count resets to 1.
Here is a sample:
df = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
'b':[1,2,3,4,1,2,3,4],
'c':[10,0,0,5,1,0,1,0]}
)
I tried the following code. It works for a single zero, but when zeros appear several times in a row, shift does not see the newly assigned values, so it has to be run repeatedly, once per zero in the longest run:
df.loc[df.c == 0 ,'n'] = df.n.shift(1)+1
I also tried the following loop. It works on a small data frame, but on a large one it takes a long time and does not finish:
for ind in df.index:
    if df.loc[ind,'c'] == 0 :
        df.loc[ind,'new'] = df.loc[ind-1,'new']+1
    else :
        df.loc[ind,'new'] = 1
The desired result
a b c n
0 1 1 10 1
1 1 2 0 2
2 1 3 0 3
3 1 4 5 1
4 2 1 1 1
5 2 2 0 2
6 2 3 1 1
7 2 4 0 2
Try using cumsum to create a group variable, then use groupby.cumcount to create the new column:
df.sort_values(['a', 'b'], inplace=True)
df['n'] = df['c'].groupby([df.a, df['c'].ne(0).cumsum()]).cumcount() + 1
df
a b c n
0 1 1 10 1
1 1 2 0 2
2 1 3 0 3
3 1 4 5 1
4 2 1 1 1
5 2 2 0 2
6 2 3 1 1
7 2 4 0 2
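To see why this works, it can help to look at the intermediate group variable on its own. The sketch below uses the sample frame from the question; the name block is hypothetical and only introduced for illustration. Every non-zero value in c starts a new block, and the zeros that follow it stay in the same block, so counting rows within each (a, block) pair gives the running count.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2],
                   'b': [1, 2, 3, 4, 1, 2, 3, 4],
                   'c': [10, 0, 0, 5, 1, 0, 1, 0]})
df = df.sort_values(['a', 'b'])

# True wherever c is non-zero; the running sum bumps the block id there.
block = df['c'].ne(0).cumsum()
print(block.tolist())    # [1, 1, 1, 2, 3, 3, 4, 4]

# Number the rows inside each (a, block) pair, starting from 1.
df['n'] = df.groupby(['a', block]).cumcount() + 1
print(df['n'].tolist())  # [1, 2, 3, 1, 1, 2, 1, 2]
```

Grouping by a as well keeps a run of zeros at the start of one group from continuing the block that ended the previous group.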

How do I convert alphanumeric to digits while keeping actual digits in the string intact

I want to convert a column that has alphanumeric values to digits
0 newyork2510 2
1 boston76w2 1
2 chicago785dw 1
3 san891dwn39210114 1
4 f2391rpg 1
so that
0 newyork2510 2
should look like
0 14523251518112510 2
and similarly for the rest of the second column.
So far I can only do the following, which converts letters to numbers:
for character in input:
    number = ord(character) - 96
    output.append(number)
Can you help me?
Try replace with regex=True:
# map the char to string integers
char_map = {chr(i): str(j) for j,i in enumerate(range(ord('a'), ord('z')+1), 1)}
# apply the mapping to the column
df['col1'] = df['col1'].replace(char_map, regex=True)
Output:
col0 col1 col2
0 0 14523251518112510 2
1 1 2151920151476232 1
2 2 38931715785423 1
3 3 191148914231439210114 1
4 4 6239118167 1
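The same mapping can be checked on a single string without pandas. This is a minimal sketch using the standard re module; the function name letters_to_numbers is hypothetical, and like the mapping above it assumes only lowercase a to z need converting (uppercase letters and other characters are left untouched).

```python
import re

def letters_to_numbers(text):
    # Replace each lowercase letter with its 1-based alphabet position;
    # digits and all other characters pass through unchanged.
    return re.sub(r'[a-z]', lambda m: str(ord(m.group()) - 96), text)

print(letters_to_numbers('newyork2510'))  # 14523251518112510
print(letters_to_numbers('f2391rpg'))     # 6239118167
```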

Function coverage () in R

I want to understand what the function coverage does to an IRanges object, for example in the code below:
ir <- IRanges (1:3, width = 3)
ir
IRanges object with 3 ranges and 0 metadata columns:
start end width
[1] 1 3 3
[2] 2 4 3
[3] 3 5 3
coverage (ir)
integer-Rle of length 5 with 5 runs
Lengths: 1 1 1 1 1
Values : 1 2 3 2 1
Why do the values go up as 1 2 3 and then back down as 2 1?
I figured it out.
coverage counts, for each position from 1 up to the end of the last range, how many ranges cover that position.
For example:
ir <- IRanges (4:6, width = 3)
First, we lay out the positions for that IRanges object, starting from 1, which is not included in any range, and ending at 8, which is the end of the last range.
Second, we count how many ranges of ir cover each of these positions from 1 to 8:
count = c (0,0,0,1,2,3,2,1)
Rle (count)
numeric-Rle of length 8 with 6 runs
Lengths: 3 1 1 1 1 1
Values : 0 1 2 3 2 1
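The counting rule above can be reproduced by hand. The following is a plain Python sketch of the idea, not the IRanges implementation: the helper names coverage and rle are hypothetical, ranges are given as start positions plus a common width, and positions are 1-based as in R.

```python
from itertools import groupby

def coverage(starts, width):
    # One counter per position from 1 to the end of the last range.
    last = max(starts) + width - 1
    counts = [0] * last
    for start in starts:
        for pos in range(start, start + width):
            counts[pos - 1] += 1
    return counts

def rle(values):
    # Run-length encode as (length, value) pairs.
    return [(len(list(run)), value) for value, run in groupby(values)]

print(coverage([1, 2, 3], 3))       # [1, 2, 3, 2, 1]
print(coverage([4, 5, 6], 3))       # [0, 0, 0, 1, 2, 3, 2, 1]
print(rle(coverage([4, 5, 6], 3)))  # [(3, 0), (1, 1), (1, 2), (1, 3), (1, 2), (1, 1)]
```

The values rise and then fall because the middle positions are covered by all three overlapping ranges, while the outer positions are covered by fewer.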

Pandas Flag Rows with Complementary Zeros

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':[0,4,4,4],
'B':[0,4,4,0],
'C':[0,4,4,4],
'D':[4,0,0,4],
'E':[4,0,0,0],
'Name':['a','a','b','c']})
df
A B C D E Name
0 0 0 0 4 4 a
1 4 4 4 0 0 a
2 4 4 4 0 0 b
3 4 0 4 4 0 c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows if they have complementary zero patterns (as with rows 0, 1, and 2) AND have the same name (just for rows 0 and 1). It uses the name of the rows that match.
The desired result is as follows:
A B C D E Name Match_Flag
0 0 0 0 4 4 a a
1 4 4 4 0 0 a a
2 4 4 4 0 0 b NaN
3 4 0 4 4 0 c NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
The reason why rows 0 and 1 are "complementary" is that they have opposite patterns of zeros in their columns; 0,0,0,4,4 vs, 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if 2 such rows are indeed complementary and they have the same name, I'd like for them to be flagged with that same name under the "Match_Flag" column.
You can identify a complement if its element-wise product with the other row is zero everywhere and its element-wise sum with that row is nowhere zero.
import numpy as np

def complements(df):
    v = df.drop('Name', axis=1).values
    n = v.shape[0]
    # all pairs of distinct rows (i < j)
    row, col = np.triu_indices(n, 1)
    # ensure two rows are complete
    # their sum contains no zeros
    c = ((v[row] + v[col]) != 0).all(1)
    complete = set(row[c]).union(col[c])
    # ensure two rows do not overlap
    # their product is zero everywhere
    o = (v[row] * v[col] == 0).all(1)
    non_overlap = set(row[o]).union(col[o])
    # we are a complement iff we do
    # not overlap and we are complete
    complement = list(non_overlap.intersection(complete))
    # return slice
    return df.Name.iloc[complement]
Then groupby('Name') and apply our function
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)
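The pairwise test itself can be stated without NumPy. This is a small sketch of one way to phrase it, with a hypothetical helper name is_complementary: two rows are treated as complementary when, in every column, exactly one of them is zero. For non-negative data this agrees with the product and sum conditions above; it is stated this way here only because it does not depend on values cancelling out.

```python
def is_complementary(r1, r2):
    # In every column exactly one of the two values must be zero.
    return all((a == 0) != (b == 0) for a, b in zip(r1, r2))

print(is_complementary([0, 0, 0, 4, 4], [4, 4, 4, 0, 0]))       # True
print(is_complementary([0, 0, 0, 4, 2], [65, 770, 23, 0, 0]))   # True
print(is_complementary([4, 4, 4, 0, 0], [4, 0, 4, 4, 0]))       # False
```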

How can I implement a grouping algorithm in J?

I'm trying to implement A006751 in J. It's pretty easy to do in Haskell, something like:
concat . map (\g -> concat [show $ length g, [g !! 0]]) . group . show
(Obviously that's not complete, but it's the basic heart of it. I spent about 10 seconds on that, so treat it accordingly.) I can implement any of this fairly easily in J, but the part that eludes me is a good, idiomatic J algorithm that corresponds to Haskell's group function. I can write a clumsy one, but it doesn't feel like good J.
Can anyone implement Haskell's group in good J?
Groups are usually done with the /. adverb.
1 1 2 1 </. 'abcd'
┌───┬─┐
│abd│c│
└───┴─┘
As you can see, it's not sequential. Just make your key sequential like so (essentially determining if an item is different from the next, and do a running sum of the resulting 0's and 1's):
neq =. 13 : '0, (}. y) ~: (}: y)'
seqkey =. 13 : '+/\neq y'
(seqkey 1 1 2 1) </. 'abcd'
┌──┬─┬─┐
│ab│c│d│
└──┴─┴─┘
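For readers who do not know J, the two steps above can be sketched in Python. The names seqkey and key_boxes are hypothetical and only loosely mirror the J verbs: seqkey marks where an item differs from its predecessor and takes a running sum, and key_boxes collects items that share a key, in order of first appearance.

```python
from itertools import accumulate

def seqkey(items):
    # 0 for the first item, then 1 wherever an item differs from the
    # previous one; the running sum gives each run its own key.
    changed = [0] + [int(a != b) for a, b in zip(items[1:], items[:-1])]
    return list(accumulate(changed))

def key_boxes(keys, items):
    # Collect items under equal keys, keeping first-appearance order.
    boxes = {}
    for key, item in zip(keys, items):
        boxes.setdefault(key, []).append(item)
    return [''.join(box) for box in boxes.values()]

print(key_boxes([1, 1, 2, 1], 'abcd'))          # ['abd', 'c']
print(seqkey([1, 1, 2, 1]))                     # [0, 0, 1, 2]
print(key_boxes(seqkey([1, 1, 2, 1]), 'abcd'))  # ['ab', 'c', 'd']
```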
What I need then is a function which counts the items (#), and tells me what they are ({. to just pick the first). I got some inspiration from nubcount:
diffseqcount =. 13 : ',(seqkey y) (#,{.)/. y'
diffseqcount 2
1 2
diffseqcount 1 2
1 1 1 2
diffseqcount 1 1 1 2
3 1 1 2
If you want the nth result, just use power:
diffseqcount(^:10) 2 NB. 10th result
1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 2 2 2 1 1 2
I agree that /. (Key) is the best general method for applying verbs to groups in J. An alternative in this case, where we need to group consecutive numbers that are the same, is dyadic ;. (Cut):
1 1 0 0 1 0 1 <(;.1) 3 1 1 1 2 2 3
┌─┬─────┬───┬─┐
│3│1 1 1│2 2│3│
└─┴─────┴───┴─┘
We can form the frets to use as the left argument as follows:
1 , 2 ~:/\ 3 1 1 1 2 2 3 NB. inserts ~: in the running sets of 2 numbers
1 1 0 0 1 0 1
Putting the two together:
(] <;.1~ 1 , 2 ~:/\ ]) 3 1 1 1 2 2 3
┌─┬─────┬───┬─┐
│3│1 1 1│2 2│3│
└─┴─────┴───┴─┘
Using the same mechanism as suggested previously:
,#(] (# , {.);.1~ 1 , 2 ~:/\ ]) 3 1 1 1 2 2 3
1 3 3 1 2 2 1 3
If you are looking for a nice J implementation of the look-and-say sequence then I'd suggest the one on Rosetta Code:
las=: ,#((# , {.);.1~ 1 , 2 ~:/\ ])&.(10x&#.inv)#]^:(1+i.#[)
5 las 1 NB. left arg is sequence length, right arg is starting number
11 21 1211 111221 312211
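As a cross-check of the outputs above, here is a short Python sketch of the same grouping and count-then-value step. It is not a translation of the J code: group, look_and_say and las are hypothetical names, and itertools.groupby plays the role of Haskell's group because it also splits only on consecutive equal items.

```python
from itertools import groupby

def group(seq):
    # Split into runs of consecutive equal items.
    return [list(run) for _, run in groupby(seq)]

def look_and_say(term):
    # For each run of digits emit its length followed by the digit.
    return ''.join(str(len(run)) + run[0] for run in group(term))

def las(count, start):
    # The first `count` terms that follow the starting number.
    terms, term = [], str(start)
    for _ in range(count):
        term = look_and_say(term)
        terms.append(term)
    return terms

print(group([3, 1, 1, 1, 2, 2, 3]))  # [[3], [1, 1, 1], [2, 2], [3]]
print(las(5, 1))                     # ['11', '21', '1211', '111221', '312211']
print(las(3, 2))                     # ['12', '1112', '3112']
```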
