Deleting duplicate answers containing special characters in the survey - string

Hello, I am dealing with the following problem. I have a survey in which respondents could mark several answers and also add their own. I am trying to get the unique answers so that I can count their occurrences. For example, suppose we have 3 answers: a, b, c. Person 1 marked answer a, person 2 marked answers b and c, and person 3 marked a and c. I would like to receive the result: "a" was marked 2 times. To do that, I'm trying to delete duplicate answers and create a macro variable that stores the unique answers: a, b, c.
I have already renamed all of the survey questions to v1-v&n_que., where n_que is a macro variable that holds the number of questions in the survey. I was trying to split all of the answers into a table (using the previous example, I would get a column with the following values: a, b, c, a, c). Then I wanted to sort this data and remove duplicates. I've tried the following:
%macro coll_ans(lib, tab);
%do _i_ = 1 %to &n_que. %by 1;
%global betav&_i_.;
proc sql noprint;
select distinct v&_i_. into :betav&_i_. separated by ', '
from &lib..&tab.
where v&_i_. ^= ' ';
quit;
data a&_i_.;
%do _j_ = 1 %to %sysfunc(countw(%quote(&&&betav&_i_.), ',')) %by 1;
text = %scan(%quote(&&&betav&_i_), &_j_., ',');
output;
%end;
run;
%end;
%mend coll_ans;
It's worth mentioning that if somebody picked more than one answer, for example a and b, the answers are separated by a comma. That's why I picked this separator, to keep the records consistent.
I have tried almost everything: changing %quote to %bquote or %superq, writing && instead of &&&, and I keep getting the following error (the first of 40 others):
ERROR: The function NO is unknown, or cannot be accessed.
"NO" is one of the answers to the first question in the survey; the full answer is: NO (go to the 9th question). The whole survey is in Polish, but I am using the right encoding, so I don't believe that is causing the problem (hopefully).
I will be grateful for any advice, because I have hit a wall.

Guessing you have a data set like:
data have ;
input id v1 : $8. v2 : $8.;
cards ;
1 a a
2 b,c b
3 a,c c
;
You can transpose the data set to make it have one record per ID-variable-value.
data tran (keep=id VarName Value);
set have ;
array vars{*} v1 v2 ;
do i=1 to dim(vars) ;
Varname=vname(vars{i}) ;
do j=1 to countw(vars{i},',') ;
Value=scan(vars{i},j,',') ;
output ;
end ;
end ;
run ;
The output data set looks like:
id Varname Value
1 v1 a
1 v2 a
2 v1 b
2 v1 c
2 v2 b
3 v1 a
3 v1 c
3 v2 c
You can then use PROC FREQ or SQL to get the counts.
proc freq data=tran ;
tables varname*value/missing list ;
run ;
Outputs
Varname Value Frequency
v1 a 2
v1 b 1
v1 c 2
v2 a 1
v2 b 1
v2 c 1
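The same split-then-count logic can be sketched outside SAS. The following is only an illustration in Python, not part of the original answer; the `rows` dictionary and the `count_answers` helper are hypothetical names that mirror the example data set above.

```python
from collections import Counter

# Hypothetical rows mirroring the example data set: id -> {question: answer string}
rows = {
    1: {"v1": "a", "v2": "a"},
    2: {"v1": "b,c", "v2": "b"},
    3: {"v1": "a,c", "v2": "c"},
}

def count_answers(rows):
    """Count how often each answer was marked, per question."""
    counts = Counter()
    for answers in rows.values():
        for question, text in answers.items():
            # One comma-separated field can hold several marked answers.
            for value in text.split(","):
                value = value.strip()
                if value:  # skip blanks left by empty answers
                    counts[(question, value)] += 1
    return counts

counts = count_answers(rows)
for (question, value), n in sorted(counts.items()):
    print(question, value, n)
```

Splitting every field first and counting afterwards avoids building a list of distinct answers in advance, which is the step that caused trouble in the macro version.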

First of all, it would be better if you posted the format in which you receive the survey data, as this would dictate the simplest/fastest approach overall.
Also, as a general rule, it's best to get the inputs & outputs right in non-macro SAS code and then use macros to optimize the process etc. It's easier to debug that way, even for someone who's been using macros for a long time... :)
That said, from your Proc SQL code, it appears that:
a. you're receiving the answers in a single delimited text field, such as "a,b,c" or "b,c" or "a,b,z"
*** example data;
data work.answers;
length answer $10.;
input answer;
datalines;
a,b,c
a
b
b,c
NO
a,b,z
n
run;
*** example valid answer entries;
data work.valid;
length valid $10.;
input valid;
datalines;
a
b
c
NO
YES
run;
b. you want to validate each answer entry and generate counts, like:
NO 1
YES 0
a 3
b 4
c 2
Many ways to do this in SAS but for parsing tokenized text data, a de-duplicated lookup table using the hash object is handy. Code below also prints the following to the log for debugging/verification...
answer=a,b,c num_answers=3 val=a val=b val=c validated=a,b,c
answer=a num_answers=1 val=a validated=a
answer=b num_answers=1 val=b validated=b
answer=b,c num_answers=2 val=b val=c validated=b,c
answer=NO num_answers=1 val=NO validated=NO
answer=a,b,z num_answers=3 val=a val=b val=z -invalid validated=a,b, validated=a,b,
answer=n num_answers=1 val=n -invalid validated= validated=
Once you've mastered the declaration syntax for the hash object, it's quite logical and relatively fast. And of course you can add validation rules - such as upper case & lower case entries ...
*** first, de-duplicate your lookup table. ;
proc sort data=work.valid nodupkey;
by valid;
run;
data _null_;
length valid $10. answer_count 4. count 4. validated $10.;
retain count 0;
*** initialize & load hash object ;
if _N_ = 1 then do;
declare hash h(multidata: 'n', ordered: 'y');
rc = h.defineKey('valid');
rc = h.defineData('valid','count');
rc = h.defineDone();
do until(eof1);
set work.valid end=eof1;
h.add();
end;
end;
*** now process questions/answers;
do until(eof);
*** read each answer;
set answers end=eof;
num_answers=countw(answer);
putlog answer= num_answers= @;
*** parse each answer entry;
validated=answer;
do i=1 to num_answers;
val=scan(answer,i);
putlog val= @;
*** (optional) keep track of total #answers: valid + invalid;
answer_count+1;
*** check answer entry in lookup table;
rc= h.find(key:val);
*** if entry NOT in lookup table, remove from validated answer;
if rc ne 0 then do;
putlog "-invalid " @;
validated=tranwrd(validated,trim(val),' ');
end;
*** if answer found, increment counter in lookup table;
else do;
count+1;
h.replace();
end;
end;
putlog validated=;
end;
*** save table of answer counts to disk;
if eof then h.output(dataset: 'work.counts');
run;
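For readers who want to check the lookup-and-count logic independently of the hash object syntax, here is a rough Python sketch under a few assumptions: answers are split on commas only (the SAS code uses the default COUNTW/SCAN delimiters), and invalid entries are dropped from the validated string instead of being blanked in place. The function name `validate_and_count` is hypothetical.

```python
def validate_and_count(answers, valid):
    """Return (counts per valid entry, validated answer strings)."""
    # De-duplicated lookup table with a zero counter per valid entry.
    counts = {v: 0 for v in sorted(set(valid))}
    validated = []
    for answer in answers:
        kept = []
        for token in answer.split(","):
            token = token.strip()
            if token in counts:      # entry found in the lookup table
                counts[token] += 1
                kept.append(token)
            # entries not in the lookup table are simply skipped
        validated.append(",".join(kept))
    return counts, validated

answers = ["a,b,c", "a", "b", "b,c", "NO", "a,b,z", "n"]
valid = ["a", "b", "c", "NO", "YES"]
counts, validated = validate_and_count(answers, valid)
print(counts)
print(validated)
```

With the example data above, the counts agree with the table shown earlier in this answer (NO 1, YES 0, a 3, b 4, c 2).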

Related

GREP-like function to retrieve text in SAS

I want to retrieve specific text within a column in a SAS file.
The file would look like the following:
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
I want to retrieve just the information after the colon for Admission Code and Exit Code and put them in their own columns. The 'codes' can be any combination of letters, numbers, and blank spaces. The new data would look like the following:
Patient Location AdmissionCode ExitCode
001 B 123456 X 98765W
002 C 4567 WY 76543Z
003 D 67890 L 4321Z
I'm not familiar with the functions in SAS, but maybe the logic would look something like the following:
data want;
set have;
do i = 1 to dim(infoTxt)
AdmissionCode = substring(string1, regexpr(":", string) + 1);
ExitCode = substring(string2, regexpr(":", string) + 1);
run;
In the code above, string1 would represent the first line of text in infoTxt and string2 would represent the second line of text infoTxt.
SAS can utilize Perl regular expressions through the family of functions that start with PRX. The tip sheet is a great summary if you are familiar with regular expressions.
PRXMATCH and PRXPOSN can test a regex pattern with capture groups and retrieve the group text.
data have;
input;
text = _infile_;
datalines;
Admission Code: 123456 X Exit Code: 98765W
Admission Code: 4567 WY Exit Code: 76543Z
Admission Code: 67890 L Exit Code: 4321Z
run;
data want;
set have;
if _n_ = 1 then do;
retain rx;
rx = prxparse ('/Admission Code: (.*)Exit Code:(.*)/');
end;
length AdmissionCode ExitCode $50;
if prxmatch(rx,text) then do;
AdmissionCode = prxposn(rx, 1, text);
ExitCode = prxposn(rx, 2, text);
end;
drop rx;
run;
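The capture-group idea is not specific to SAS. As a hedged illustration, the same pattern can be tried in Python's `re` module; the helper name `extract_codes` is hypothetical, and the stripping of surrounding blanks is an addition that the SAS step above does not perform.

```python
import re

# Same pattern as in the PRXPARSE call, without the Perl-style slashes.
PATTERN = re.compile(r"Admission Code: (.*)Exit Code:(.*)")

def extract_codes(text):
    """Return (admission_code, exit_code), or None if the pattern does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Group 1 and group 2 correspond to prxposn(rx, 1, text) and prxposn(rx, 2, text).
    return match.group(1).strip(), match.group(2).strip()

for line in [
    "Admission Code: 123456 X Exit Code: 98765W",
    "Admission Code: 4567 WY Exit Code: 76543Z",
    "Admission Code: 67890 L Exit Code: 4321Z",
]:
    print(extract_codes(line))
```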
I like a RegEX with a capture buffer as much as the next guy but you could also use input statement features to read this data.
data info;
infile cards n=2 firstobs=2;
input #1 patient:$3. location :$1. @'Admission Code: ' AdmissionCode &$16. #2 @'Exit Code: ' ExitCode &$16.;
cards;
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
;;;;
run;
proc print;
run;
There may be a solution out there that does it all in one data step. This creates two steps to deal with the admission and exit being on different rows-- first a data step, then a join to get it back together.
SAS does have regex syntax, but I used SAS character functions instead. substr has 3 arguments: string, start position, and length. The length is optional, and I've omitted it so that it grabs everything after the start position. retain is used to fill in the patient and location in the second row of each group.
data admission exit;
set grep;
retain patient2 location2;
if patient ne '' then do;
patient2=patient;
location2=location;
admissioncode=substr(infoTxt,find(infoTxt,":")+2);
output admission;
end;
else do;
exitcode=substr(infoTxt,find(infoTxt,":")+2);
output exit;
end;
run;
proc sql;
create table dat as select a.patient2 as patient,a.location2 as location,a.admissioncode,b.exitcode
from admission a
left join exit b on a.patient2=b.patient2 and a.location2=b.location2
;
quit;
Provided that you always have the same pattern of colons and line breaks, I think you can do this with scan:
admission_code = scan(infoTxt, 2, '3A0A0D'x);
exit_code = scan(infoTxt, 4, '3A0A0D'x);
This uses the hex literal '3A0A0D'x to specify :, line feed and carriage return as delimiters for the scan function.

Is it possible to create a string of alphabet characters (separated by ands) based on the number of members of a global variable in SAS?

I am looking to build a system that merges data sets based on common values using the merge function.
However I know that sometimes I will have to merge 2 data sets, and sometimes I will have to merge 20 data sets.
so the generic merge function looks as follows
DATA [data_set];
merge
[loop over data to be merged]
by [factors which I am merging by]
if [a and b]
format [sort order]
run;
The problem is that after the if the [a and b] obviously needs to be a string generated to be the length equal to the number of tables being merged. If I want to merge 2 tables [a and b] is great but if I want to merge 3 tables it would have to be [a and b and c].
Is there a way for me to generate a string [a and b and.... N] where is produced based on the length of a global variable?
Hopefully my question is clear, I cannot provide the actual code I am using as it contains sensitive information. I will try to provide more information / answer questions as best I can if there is something I missed.
Using the makedata function from @Richard
%let MergeData = X1 X2 X3 X4 X5 X6 X7 X8 X9 X10;
DATA X_COMB;
merge
%internalmacro1(
%nrstr(
#L1#;
)
,L1 = &MergeData.
);
by id;
if ; /* create a string based on %&MergeData length, of # and # and # ... where # = a,b,c,d */
run;
but where that last if exists I want the number of items in the %MergeData set, I think I need to make an array and convert to hex for the values maybe? There must be some equivalent hex value for 'a b c' etc.
The problem is that people are going to add and remove items to the set &MergeData so the merge I'm trying to create needs to scale to the size of the number of data sets being input? Sorry I can't provide more!
Rosie Method:
proc import
datafile = "FilePath\Alphabet.csv"
DBMS = csv
OUT = AlphabetConversion;
;
data desiredstatements;
set AlphabetConversion (obs=26); /*This limits the observations used*/
run;
proc sql;
select AlphabetConversion into :dynamiccode from desiredstatements separated by " and ";
quit;
%put &dynamiccode.; /*Check the log to see what you got and make sure it's the code you want */
Quentin Method:
DATA Merged_Data;
merge
mydata1 (in=_mydata1) mydata2(in=_mydata2) mydata3 (in=_mydata3);
by ID;
if _mydata1 and _mydata2 and _mydata3;
This structure is fine for a merge where you get to specify all of your inputs. My problem is that I am trying to write a macro that will take mydata1 and mydata2 sometimes, and mydata1-mydata20 other times. I don't know how to make the if _mydata1 and _mydata2 .... _mydata20, when it has 20 data sets to be merged, and _mydata1 and _mydata2 when it only has two.
Perhaps you can provide some example code, based on this data:
%macro makedata;
%local i;
data %do i = 1 %to 10; x&i(keep=id x&i) %end;;
do id = 1 to 42;
array x(10);
do _n_ = 1 to dim(x);
x(_n_) = id * 100 + _n_;
end;
output;
end;
run;
%mend;
%makedata;
data want;
merge ... fill in the rest ...;
... fill in the rest ...
run;
Given a bunch of data sets like:
data mydata1 ;
input id x1 ;
cards ;
1 10
2 20
3 30
;
data mydata2 ;
input id x2 ;
cards ;
1 100
3 300
;
data mydata3 ;
input id x3 ;
cards ;
1 1000
2 2000
3 3000
;
You can merge them together and keep only the records that match between all three data sets like:
data all ;
merge
mydata1 (in=_mydata1)
mydata2 (in=_mydata2)
mydata3 (in=_mydata3)
;
by id ;
if _mydata1 and _mydata2 and _mydata3 ;
run ;
If you look at the above step, it's clear that there are two lists. A list of data sets on the merge statement, and a list of variables on the IF statement. You can use the macro language to generate that step. When you call the macro you would pass it a list of data sets to be merged. The macro would then generate the DATA step with the merge.
Here's a macro, which uses macro looping to generate the two lists:
%macro innermerge
(data= /*space-delimited list of data sets to be merged*/
,by= /*space-delimited list of BY variables for merge*/
,out= /*output data set*/
)
;
%local i data_i ;
data &out ;
merge
%do i=1 %to %sysfunc(countw(&data,%str( ))) ;
%let data_i=%scan(&data,&i,%str( )) ;
&data_i (in=_&data_i)
%end ;
;
by &by ;
if
%do i=1 %to %sysfunc(countw(&data,%str( ))) ;
%let data_i=%scan(&data,&i,%str( )) ;
%if &i>1 %then %do ;
and
%end ;
_&data_i
%end ;
;
run ;
%mend;
Use like:
%innermerge
(data=mydata1 mydata2
,by=id
,out=want
)
MPRINT(INNERMERGE): data want ;
MPRINT(INNERMERGE): merge mydata1 (in=_mydata1) mydata2 (in=_mydata2) ;
MPRINT(INNERMERGE): by id ;
MPRINT(INNERMERGE): if _mydata1 and _mydata2 ;
MPRINT(INNERMERGE): run ;
NOTE: There were 3 observations read from the data set WORK.MYDATA1.
NOTE: There were 2 observations read from the data set WORK.MYDATA2.
NOTE: The data set WORK.WANT has 2 observations and 3 variables.
%innermerge
(data=mydata1 mydata2 mydata3
,by=id
,out=want
)
MPRINT(INNERMERGE): data want ;
MPRINT(INNERMERGE): merge mydata1 (in=_mydata1) mydata2 (in=_mydata2) mydata3 (in=_mydata3) ;
MPRINT(INNERMERGE): by id ;
MPRINT(INNERMERGE): if _mydata1 and _mydata2 and _mydata3 ;
MPRINT(INNERMERGE): run ;
NOTE: There were 3 observations read from the data set WORK.MYDATA1.
NOTE: There were 2 observations read from the data set WORK.MYDATA2.
NOTE: There were 3 observations read from the data set WORK.MYDATA3.
NOTE: The data set WORK.WANT has 2 observations and 4 variables.
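To make the "two lists from one input" idea concrete, here is a small Python sketch that builds the same DATA step text from a list of data set names. It is only an illustration of the code-generation pattern, not a replacement for the macro; `build_inner_merge` is a hypothetical helper.

```python
def build_inner_merge(datasets, by, out):
    """Build the text of a DATA step that keeps only rows present in every data set."""
    # List 1: the data sets on the MERGE statement, each with an IN= flag.
    merge_list = " ".join(f"{ds} (in=_{ds})" for ds in datasets)
    # List 2: the IN= flags joined with "and" for the subsetting IF.
    condition = " and ".join(f"_{ds}" for ds in datasets)
    return "\n".join([
        f"data {out} ;",
        f"merge {merge_list} ;",
        f"by {by} ;",
        f"if {condition} ;",
        "run ;",
    ])

print(build_inner_merge(["mydata1", "mydata2", "mydata3"], by="id", out="want"))
```

Both lists are derived from the same input, so adding or removing a data set changes the MERGE statement and the IF condition together.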
You're close with your attempt of the code I gave you earlier, but the one key thing (other than me giving you the wrong order for the SQL statement - sorry about that, fixed below) is that you are reading in a CSV and assuming it has headers, whereas I would prefer to define the variable in the code. And then you are naming the dataset name when you should be naming the variable in the proc SQL - try this:
/*File must have one letter per row*/
%let file=\\hefce-sas\nuser\user\thomaro\SAS\Temp\Alphabet.csv;
data AlphabetConversion;
infile "&file."
delimiter=',' missover dsd;
format letter $1.;
input letter $;
run;
/*I've explicitly defined this so it runs but this needs to be dynamic - see below*/
%let numstatements = 5;
data desiredstatements;
set AlphabetConversion (obs=&numstatements.); /*This limits the observations used*/
run;
proc sql;
select letter into :dynamiccode separated by ' and ' from desiredstatements;
quit;
/*Check the log to see what you got and make sure it's the code you want */
%put &dynamiccode.;
This construct is very useful in general for dynamically creating code.
Now there is the question of dynamically defining numstatements - this should be based on the line at the top of your code:
%let MergeData = X1 X2 X3 X4 X5 X6 X7 X8 X9 X10;
So I would suggest you want a macro function that counts words - there isn't one of these, but you can turn the normal function countw into a macro function using %sysfunc() - so you would want to replace the %let above with something like this:
%let numstatements = %sysfunc(countw(&mergedata.));
%put The number of datasets to be merged is &numstatements.;
Now (if I've understood your problem correctly) when you want to run the code with a different number of datasets, all you need to do is change your %let MergeData= at the top of the code and you're good to go.
If I've understood your question, does the following achieve what you want?:
%let ds_list = a b c d;
%let ds_and_ds = %sysfunc(tranwrd(&ds_list,%str( ),%str( and )));
%put ds_list = &ds_list;
%put ds_and_ds = &ds_and_ds;
If not, then please give an example of the data set list into which you want to insert "and" between each data set.

Finding location of specified substring in a specified string (MATLAB)

I have a simple question that I need help with. My code, I believe, is almost complete, but I'm having trouble with a specific line of code.
I have an assignment question (2 parts) that asks me to determine whether a protein (string) has the specified motif (substring) at a particular location (location). This is the first part, and the function looks like this:
function output = Motif_Match(motif,protein,location)
%This code will print a '1' if the motif occurs in the protein starting
%at the given location, else it will print a '0'
for k = 1:location %Iterates through specified location
if protein(1, [k, k+1]) == motif; % if the location matches the protein and motif
output = 1;
else
output = 0;
end
end
This part I was able to get correctly, and example of this is as follows:
p = 'MGNAAAAKKGN'
m = 'GN'
Motif_Match(m,p,2)
ans =
1
The second part of the question, which I am stuck on, is to take the motif and protein and return a vector containing the locations at which the motif occurs in the protein. To do this, I am using calls to my previous code and I am not supposed to use any functions that make this easy such as strfind, find, hist, strcmp etc.
My code for this, so far, is:
function output = Motif_Find(motif,protein)
[r,c] = size(protein)
output = zeros(r,c)
for k = 1:c-1
if Motif_Match(motif,protein,k) == 1;
output(k) = protein(k)
else
output = [];
end
end
I believe something is wrong at line 6 of this code. My thinking is that I want the output to give me the locations and that the code on this line is incorrect, but I can't seem to think of anything else. An example of what should happen is as follows:
p = 'MGNAAAAKKGN';
m = 'GN';
Motif_Find(m,p)
ans =
2 10
So my question is, how can I get my code to give me the locations? I've been stuck on this for quite a while and can't seem to get anywhere with this. Any help will be greatly appreciated!
Thank you all!
You are very close.
output(k) = protein(k)
should be
output(k) = k
This is because we want just the location k of the match. Using protein(k) gives us the character at position k in the protein string.
The very last thing I would do is return only the nonzero elements. The easiest way is to use the find command with no arguments besides the vector/matrix.
So after your loop, just do this:
output = find(output); %returns the indices of the nonzero elements
Edit:
I just noticed another problem. output = []; sets output to an empty array, which isn't what you want. I think what you meant was output(k) = 0; this is why you weren't getting the result you expected. But since you already initialized the whole array to zeros, you don't need that branch at all. All together, the code should look like this. I also replaced your size with length, since your proteins are linear sequences, not 2-D matrices.
function output = Motif_Find(motif,protein)
protein_len = length(protein)
motif_len = length(motif)
output = zeros(1,protein_len)
%notice the loop bound now uses motif_len. think of it this way: if the
%motif length is 4, we don't need to test the last 3 start positions
for k = 1:protein_len-motif_len + 1
if Motif_Match(motif,protein,k) == 1;
output(k) = k;
%we don't really need these lines, since the array already has zeros
%else
% output(k) = 0;
end
end
%returns only nonzero elements
output = find(output);
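The control flow above can be cross-checked with a short sketch in another language. The Python version below is an assumption-laden translation, not the assignment solution: it keeps MATLAB's 1-based locations, and `motif_match` compares the whole motif at the given location, which is what the loop relies on.

```python
def motif_match(motif, protein, location):
    """Return 1 if motif occurs in protein starting at the 1-based location, else 0."""
    start = location - 1  # convert the 1-based location to a 0-based index
    return 1 if protein[start:start + len(motif)] == motif else 0

def motif_find(motif, protein):
    """Return the 1-based start positions at which motif occurs in protein."""
    # The motif cannot start later than this position and still fit.
    last_start = len(protein) - len(motif) + 1
    return [k for k in range(1, last_start + 1)
            if motif_match(motif, protein, k) == 1]

print(motif_find("GN", "MGNAAAAKKGN"))
```

Collecting matching positions directly in a list plays the same role as filling a zero vector and filtering it with find afterwards.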

Recognize relevant string information by checking the first characters

I have a table with 2 columns. In column 1, I have a string information, in column 2, I have a logical index
%% Tables and their use
T={'A2P3';'A2P3';'A2P3';'A2P3 with (extra1)';'A2P3 with (extra1) and (extra 2)';'A2P3 with (extra1)';'B2P3';'B2P3';'B2P3';'B2P3 with (extra 1)';'A2P3'};
a={1 1 0 1 1 0 1 1 0 1 1 }
T(:,2)=num2cell(1);
T(3,2)=num2cell(0);
T(6,2)=num2cell(0);
T(9,2)=num2cell(0);
T=table(T(:,1),T(:,2));
class(T.Var1);
class(T.Var2);
T.Var1=categorical(T.Var1)
T.Var2=cell2mat(T.Var2)
class(T.Var1);
class(T.Var2);
if T.Var1=='A2P3' & T.Var2==1
disp 'go on'
else
disp 'change something'
end
UPDATES:
I will update this section as soon as I know how to copy my workspace into a code format
** still don't know how to do that but here it goes
*** why working with tables is a double edged sword (but still cool): I have to be very aware of the class inside the table to refer to it in an if else construct, here I had to convert two columns to categorical and to double from cell to make it work...
Here is what my data looks like:
I want to have this:
if T.Var1=='A2P3*************************' & T.Var2==1
disp 'go on'
else
disp 'change something'
end
I managed to tell MATLAB to do what I wish, but the whole point of this post is: how do I tell MATLAB to ignore what comes after A2P3 in the string, where the string length is variable? Otherwise it would be very tiring to look up every single piece of string information that follows A2P3 (and B2P3, etc.) just to say that.
How do I do that?
Assuming you are working with T (cell array) as listed in your code, you may use this code to detect the successful matches -
%%// Slightly different than yours
T={'A2P3';'NotA2P3';'A2P3';'A2P3 with (extra1)';'A2P3 with (extra1) and (extra 2)';'A2P3 with (extra1)';'B2P3';'B2P3';'NotA2P3';'B2P3 with (extra 1)';'A2P3'};
a={1 1 0 1 1 0 1 1 0 1 1 }
T(:,2)=num2cell(1);
T(3,2)=num2cell(0);
T(6,2)=num2cell(0);
T(9,2)=num2cell(0);
%%// Get the comparison results
col1_comps = ismember(char(T(:,1)),'A2P3') | ismember(char(T(:,1)),'B2P3');
comparisons = ismember(col1_comps(:,1:4),[1 1 1 1],'rows').*cell2mat(T(:,2))
One quick solution would be to make a function that takes 2 strings and checks whether the first one starts with the second one.
Later Edit:
The function will look like this:
for i = 0, i < second string's length, i = i + 1
if the first string's character at index i doesn't equal the second string's character at index i
return false
after the for, return true
This assumes the second string's length is never greater than the first's. Otherwise, call the function with the arguments swapped.
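A minimal sketch of that prefix check, written in Python for illustration. The name `starts_with` and the sample `labels`/`flags` lists are hypothetical, and this version returns false when the prefix is longer than the text instead of swapping the arguments.

```python
def starts_with(text, prefix):
    """Return True if text begins with prefix, comparing character by character."""
    if len(prefix) > len(text):
        return False  # a longer prefix can never match
    for i in range(len(prefix)):
        if text[i] != prefix[i]:
            return False
    return True

# Hypothetical sample of the two table columns from the question.
labels = ["A2P3", "A2P3 with (extra1)", "B2P3", "NotA2P3"]
flags = [1, 1, 1, 0]

results = [
    "go on" if starts_with(label, "A2P3") and flag == 1 else "change something"
    for label, flag in zip(labels, flags)
]
print(results)
```

In MATLAB itself, the built-in strncmp(s, 'A2P3', 4) performs the same first-n-characters comparison and also accepts a cell array of strings.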

Removing minimal letters from a string 'A' to remove all instances of string 'B'

If we have string A of length N and string B of length M, where M < N, can I quickly compute the minimum number of letters I have to remove from string A so that string B does not occur as a substring in A?
If we have tiny string lengths, this problem is pretty easy to brute force: you just iterate a bitmask from 0 to 2^N and see if B occurs as a substring in this subsequence of A. However, when N can go up to 10,000 and M can go up to 1,000, this algorithm obviously falls apart quickly. Is there a faster way to do this?
Example: A=ababaa B=aba. Answer=1. Removing the second a in A will result in abbaa, which does not contain B.
Edit: User n.m. posted a great counter example: aabcc and abc. We want to remove the single b, because removing any a or c will create another instance of the string abc.
Solve it with dynamic programming. Let dp[i][j] be the minimum number of removals such that A[0...i-1] ends with B[0...j-1] and does not contain B; dp[i][j] = infinity means that state is impossible. Then
if (A[i-1] == B[j-1])
dp[i][j] = min(dp[i-1][j-1], dp[i-1][j])
else dp[i][j] = dp[i-1][j]
return min(dp[N][0], dp[N][1], ..., dp[N][M-1]);
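One way to make the dynamic programming idea precise is to track, for each prefix of A, how many characters of B the kept text currently matches, using the KMP failure function to handle partial matches. The Python sketch below follows that formulation; it is not the exact recurrence written above, and the names `build_transitions` and `min_removals` are hypothetical. It runs in roughly N*M steps after building the transition table.

```python
def build_transitions(b):
    """next_state[j][ch]: matched prefix length of b after reading ch in state j."""
    m = len(b)
    fail = [0] * (m + 1)  # fail[j]: length of the longest proper border of b[:j]
    k = 0
    for j in range(1, m):
        while k > 0 and b[j] != b[k]:
            k = fail[k]
        if b[j] == b[k]:
            k += 1
        fail[j + 1] = k
    next_state = []
    for j in range(m):  # states 0..m-1; state m (a full match) is forbidden
        row = {}
        for ch in set(b):
            k = j
            while k > 0 and b[k] != ch:
                k = fail[k]
            if b[k] == ch:
                k += 1
            row[ch] = k
        next_state.append(row)
    return next_state

def min_removals(a, b):
    """Minimum number of characters to delete from a so that b is not a substring."""
    if not b:
        raise ValueError("b must be non-empty")
    m = len(b)
    next_state = build_transitions(b)
    inf = float("inf")
    dp = [inf] * m  # dp[j]: fewest deletions so far with j characters of b matched
    dp[0] = 0
    for ch in a:
        new = [cost + 1 for cost in dp]       # option 1: delete ch, state unchanged
        for j, cost in enumerate(dp):
            if cost == inf:
                continue
            k = next_state[j].get(ch, 0)      # option 2: keep ch, follow the automaton
            if k < m and cost < new[k]:       # never allow a complete match of b
                new[k] = cost
        dp = new
    return min(dp)

print(min_removals("ababaa", "aba"))
print(min_removals("aabcc", "abc"))
```

Only the current row of the table is kept, so memory stays proportional to M plus the transition table.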
Can you do a graph search on the string A. This is probably too slow for large N and special input but it should work better than an exponential brute force algorithm. Maybe a BFS.
I'm not sure this question is still of interest to anyone, but I have an idea that might work.
Once we decide that the problem is not finding the substring but deciding which letter is most convenient to remove from string A, the solution appears pretty simple to me: if you find an occurrence of B in A, the best thing you can do is remove a char inside that occurrence, close to the right boundary, say the one before the last. The reason is that if B ends the same way it starts, removing a char at the beginning removes only one occurrence of B, while a char near the end can remove two at once.
Algorithm in pseudocode:
String A, B;
int occ_posit = 0;
N = B.length();
occ_posit = A.getOccurrencePosition(B); // pseudo function that get the first occurence of B into A and returns the offset (1° byte = 1), or 0 if no occurence present.
while (occ_posit > 0) // while there are B into A
{
if (B.firstchar == B.lastchar) // if B starts as it ends
{
if (A.charat[occ_posit] == A.charat[occ_posit+1])
A.remove[occ_posit - 1]; // no reason to remove A[occ_posit] here
else
A.remove[occ_posit]; // here we remove the last char, so we could remove 2 occurencies at the same time
}
else
{
int x = occ_posit + N - 1;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x]; // B does not ends as it starts, so if there are overlapping instances they overlap by more than one char. Removing the first that is not equal to the char following B instance, we kill both occurrencies at once.
}
}
Let's explain with an example:
A = "123456789000987654321"
B = "890"
read this as a table:
occ_posit: 123456789012345678901
A = "123456789000987654321"
B = "890"
first occurrence is at occ_posit = 8. B does not end as it starts, so it get into the second loop:
int x = 8 + 3 - 1 = 10;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
The while finds that A.charat[11] matches A.charat[10] (="0"), so x becomes 9, and then the while exits because A.charat[10] does not match A.charat[9]. A then becomes:
A = "12345678000987654321"
with no more occurrences in it.
Let's try with another:
A = "abccccccccc"
B = "abc"
first occurrence is at occ_posit = 1. B does not end as it starts, so it get into the second loop:
int x = 1 + 3 - 1 = 3;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
The while finds that A.charat[4] matches A.charat[3] (="c"), so x becomes 2, and then the while exits because A.charat[3] does not match A.charat[2]. A then becomes:
A = "accccccccc"
let's try with overlapping:
A = "abcdabcdabff"
B = "abcdab"
the algorithm results in: A = "abcdacdabff" that has no more occurencies.
finally, one letter overlap:
A = "abbabbabbabba"
B = "abba"
B end as it starts, so it enters the first if:
if (A.charat[occ_posit] == A.charat[occ_posit+1])
A.remove[occ_posit - 1]; // no reason to remove A[occ_posit] here
else
A.remove[occ_posit]; // here we remove the last char, so we could remove 2 occurencies at the same time
that lets the last "a" of B instance to be removed. So:
1° step: A= "abbbbabbabba"
2° step: A= "abbbbabbbba" and we are done.
Hope this helps
EDIT: Please note that the algorithm must be corrected a little so that it does not fail when the search gets close to the end of A, but this is just an easy programming issue.
Here's a sketch I've come up with.
First, if A contains any symbols that are not found in B, split up A into a bunch of smaller strings containing only those characters found in B. Apply the algorithm on each of the smaller strings, then glue them back together to get the total result. This really functions as an optimization.
Next, check if A contains any of B. If there isn't, you're done. If A = B, then delete all of them.
I think a relatively greedy algorithm may work.
First, mark all of the symbols in A which belong to at least one occurrence of B. Let A = aabcbccabcaa, B = abc. Bolding indicates these marked characters:
a abc bcc abc aa. If there's an overlap, mark all possible. This operation is naively approximately (A-B) operations, but I believe it can be done in around (A/B) operations.
Consider the deletion of each marked letter in A: a abc bcc abc aa.
Check whether the deletion of that marked letter decreases the number of marked letters. You only need to check the substrings which could possibly be affected by the deletion of the letter. If B has a length of 4, only the substrings starting at the following locations would need to be deleted if x were being checked:
-------x------
^^^^
Any further left or right will exist regardless of the presence of x.
For instance:
Marking the [a] in the following string: a [a]bc bcc abc aa.
Its deletion yields abcbccabcaa, which when marked produces abc bcc abc aa, which has an equal number of marked characters. Since only the relative number is required for this operation, it can be done in approximately 2B time for each selected letter. For each, assign the relative difference between the two. Pick an arbitrary one which is maximal and delete it. Repeat until done. Each pass is roughly up to 2AB operations, for a maximum of A passes, giving a total time of about 2A^2 B.
In the above example, these values are assigned:
aabcbccabcaa
033 333
So arbitrarily deleting the first marked b gives you: aacbccabcaa. If you repeat the process, you get:
aacbccabcaa
333
The final result is done.
I believe the algorithm is correctly minimal. I think it is true that whenever A requires only one deletion, the algorithm must be optimal. In that case, the letter which reduces the most possible matches (ie: all of them) should be best. I can come up with no such proof, though. I'd be interested in finding any counter-examples to optimality.
Find the indices of each occurrence of the substring in the main string.
Then, using a dynamic programming algorithm (so memoize intermediate values), remove each letter that is part of an occurrence from the main string, add 1 to the count, and repeat.
You can find the letters because they lie between each match index and that index + length of B.
A = ababaa
B = aba
count = 0
indices = (0, 2)
A = babaa, aabaa, abbaa, abbaa, abaaa, ababa
B = aba
count = 1
(2nd abbaa is memoized)
indices = (1), (1), (), (), (0), (0, 2)
answer = 1
You can take it a step further and try to memoize the match indices of substrings, but that might not actually be a performance gain.
Not sure on the exact bounds, but shouldn't take too long computationally.

Resources