How to find a string inside a string and print it?

How do I print a string that appears inside another string?
I have row data like this:
City_name
Abcde/def/Report_names/names
I want to print only the string that comes after Report_names.
The length of Abcde/def may vary from line to line.
Report_names/names is the standard naming format.
So I want to print only the text appearing after /Report_names/.

For simple exercises like this, it is best to learn and then use standard string functions, INSTR and SUBSTR. instr will find the position of the first letter of a substring you are searching for within a longer string. So if you search for 'Report_names/' you will find the position of R. The first letter of "names" is 13 positions to the right of that. That will be the first letter of the substring you actually want, which is the "names". With this you should be able to understand what the query below does:
-- begin TEST DATA (not part of the solution to the problem)
with
test_data ( str ) as (
select 'City_name Abcde/def/Report_names/Ann, Helen, Mary' from dual
)
-- end of TEST DATA; SQL query begins BELOW THIS LINE
-- use your actual table and column names instead of "test_data" and "str"
select str, substr( str, instr(str, 'Report_names/') + 13 ) as names
from test_data
;
STR NAMES
------------------------------------------------- --------------------
City_name Abcde/def/Report_names/Ann, Helen, Mary Ann, Helen, Mary
1 row selected.
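For readers who want to check the position arithmetic outside the database, the same idea can be sketched in Python. This is only an illustration, not part of the SQL solution: the helper name text_after is hypothetical, and unlike the query above it returns None when the marker is missing instead of returning a misleading substring.

```python
def text_after(s, marker="Report_names/"):
    """Return everything in s that follows the first occurrence of marker."""
    pos = s.find(marker)          # like INSTR, but 0-based and -1 when not found
    if pos == -1:
        return None               # marker absent: nothing to extract
    return s[pos + len(marker):]  # len(marker) is the 13 used in the query


row = "City_name Abcde/def/Report_names/Ann, Helen, Mary"
print(text_after(row))
```

Using len(marker) rather than a hard-coded 13 keeps the offset correct if the marker text ever changes. The search is case-sensitive, as INSTR is.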

Related

Overlapping values of strings in Python

I am building a puzzle word game in Python. I have the correct puzzle word, and the guessed puzzle word. I want to build a third string which shows the correct letters in the guessed puzzle in the correct puzzle word, and _ at the position of the incorrect letters.
For example, say the correct word is APPLE and the guessed word is APTLE
then i want to have a third string: AP_L_
The guessed word and correct word are guaranteed to be 3 to 5 characters long, but the guessed word is not guaranteed to be the same length as the correct word
For example, correct word is TEA and the guessed word is TEAKO, then the third string should be TEA__ because the players guessed the last two letters incorrectly.
Another example, correct word is APPLE and guessed word is POP, the third string should be:
_ _ P_ _ (without space separation)
I can successfully get the matched indexes of the correct and guessed word; however, I am having problems building the third string. I just learned that strings in Python are immutable and that I cannot assign something like str1[index] = str2[index].
I have tried many things, including using lists, but I am not getting the correct answer. The attached code is my most recent attempt. Would you please help me solve this?
Thank you
# find the match between puzzle_word and guess
def matcher(str_a, str_b):
    # find indexes where letters overlap
    matched_indexes = [i for i, (a, b) in enumerate(zip(str_a, str_b)) if a == b]
    result = []
    for i in str_a:
        result.append('_')
    for value in matched_indexes:
        result[value].replace('_', str_a[value])
    print(result)
matcher("apple", "allke")
The output right now is a list of five "_".
Cases:
correct word is APPLE and the guessed word is APTLE, third string: AP_L_
correct word is TEA and the guessed word is TEAKO, third string should be TEA__
correct word is APPLE and guessed word is POP, third string should be _ _ P_ _
You can use itertools.zip_longest here to always make sure you pad out to the longest word provided and then create a new string by joining the matching characters or otherwise a _. eg:
from itertools import zip_longest
correct_and_guess = [
('APPLE', 'APTLE'),
('TEA', 'TEAKO'),
('APPLE', 'POP')
]
for correct, guess in correct_and_guess:
    # If characters in same positions match - show character otherwise `_`
    new_word = ''.join(c if c == g else '_' for c, g in zip_longest(correct, guess, fillvalue='_'))
    print(correct, guess, new_word)
Will print the following:
APPLE APTLE AP_LE
TEA TEAKO TEA__
APPLE POP __P__
Couple of things here.
str.replace() does not replace inline; as you noted strings are immutable, so you have to assign the result of replace:
result[value] = result[value].replace('_', str_a[value])
However, there's no point doing this since you can just assign to the list element:
result[value] = str_a[value]
And finally you can assign a list of the length of str_a without the for loop, which might be more readable:
result = ['_'] * len(str_a)
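Putting these fixes together, a corrected matcher might look like the sketch below. One assumption goes beyond the fixes above: the result is padded to the length of the longer word, so that the TEA/TEAKO and APPLE/POP cases from the question come out as requested. The function also returns a string instead of printing a list.

```python
def matcher(str_a, str_b):
    """Show letters of str_a that str_b guessed in the right position, '_' elsewhere."""
    # Assumption: pad to the longer word so extra or missing letters show as '_'
    result = ['_'] * max(len(str_a), len(str_b))
    # zip stops at the shorter word; positions past it simply stay '_'
    for i, (a, b) in enumerate(zip(str_a, str_b)):
        if a == b:
            result[i] = a  # assign to the list element directly
    return ''.join(result)


print(matcher("apple", "allke"))  # a___e
print(matcher("TEA", "TEAKO"))    # TEA__
print(matcher("APPLE", "POP"))    # __P__
```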

SAS finding an uppercase word within a string

I have a string which contains one word in uppercase somewhere within it. I want to extract that one word into a new variable using SAS.
I think I need to find a way to code up finding a word which contains two or more uppercase letters (as the start of a sentence would begin with an uppercase letter).
i.e. How do I create the variable 'word':
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
Hope someone can help and the question is clear
Thanks in advance
Consider a regex match/replace with a negative lookahead that covers two types of matches:
runs of two or more uppercase letters or spaces (the two-character minimum avoids matching the capitalized first letter of a sentence): (([A-Z ]){2,})
runs of two or more uppercase letters or periods (same two-character minimum): (([A-Z.]){2,})
CAVEAT: This solution also matches the pronoun I, which is technically a valid match since it is an all-uppercase one-letter word. As it is the only such word in English, consider a tranwrd() replace for this special case. Relatedly, this solution matches ALL uppercase words in the string, not just one.
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
data example;
set example;
pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
wordextract = prxchange(pattern_num, -1, txtString);
wordextract = tranwrd(wordextract, " I ", "");
drop pattern_num;
run;
txtString                             word     wordextract
This is one EXAMPLE. Of what I need.  EXAMPLE  EXAMPLE
THIS is another.                      THIS     THIS
etc ETC                               ETC      ETC
SAS has a prxsubstr() function call that finds the starting position and length of a substring that matches a given regex pattern within a given string. Here's a sample solution using the prxsubstr() function call:
data solution;
set example;
/* Build a regex pattern of the word to search for, and hang on to it */
/* (The regex below means: word boundary, then two or more capital letters,
then word boundary. Word boundary here means the start or the end of a string
of letters, digits and/or underscores.) */
if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
retain pattern_num;
/* Get the starting position and the length of the word to extract */
call prxsubstr(pattern_num, txtString, mypos, mylength);
/* If a word matching the regex pattern is found, extract it */
if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm
Regex word boundary info: http://www.regular-expressions.info/wordboundaries.html
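To experiment with the word-boundary pattern without a SAS session, the same regex can be tried in Python. This is only a sketch for testing the pattern; the helper name first_upper_word is hypothetical, and Python's re module stands in for the prx functions.

```python
import re

# Word boundary, two or more capital letters, word boundary
UPPER_WORD = re.compile(r"\b[A-Z]{2,}\b")


def first_upper_word(text):
    """Return the first all-uppercase word of length >= 2, or None if there is none."""
    match = UPPER_WORD.search(text)
    return match.group(0) if match else None


for line in ["This is one EXAMPLE. Of what I need.",
             "THIS is another.",
             "etc ETC"]:
    print(first_upper_word(line))
```

Because the pattern requires at least two capitals, neither the leading "This" nor the one-letter word "I" is matched.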

Extracting a specific word and a number of tokens on each side of it from each string in a column in SAS?

Extracting a specific word and a number of tokens on each side of it from each string in a column in SAS EG ?
For example,
row1: the sun is nice
row2: the sun looks great
row3: the sun left me
Is there a code that would produce the following result column (2 words where sun is the first):
SUN IS
SUN LOOKS
SUN LEFT
and possibly a second column with COUNT in case of duplicate matches.
So if there was 20 SUN LOOKS then it they would be grouped and have a count of 20.
Thanks
I think you can use the functions findw() and scan() to do what you want. Both of those functions operate on the concept of word boundaries. findw() returns the position of the word in the string. Once you know the position, you can use scan() in a loop to get the next word or words following it.
Here is a simple example to show you the concept. It is by no means a finished or polished solution, but is intended to point you in the right direction. The input data set (text) contains the sentences you provided in your question with slight modifications. The data step finds the word "sun" in the sentence and creates a variable named fragment that contains 3 words ("sun" + the next 2 words).
data text2;
set text;
length fragment $15;
word = 'sun'; * search term;
fragment_len = 3; * number of words in target output;
word_pos = findw(sentence, word, ' ', 'e');
if word_pos then do;
do i = 0 to fragment_len-1;
fragment = catx(' ', fragment, scan(sentence, word_pos+i));
end;
end;
run;
Here is a partial print of the output data set.
You can use a combination of the INDEX, SUBSTR and SCAN functions to achieve this functionality.
INDEX - takes two arguments and returns the position at which a given substring appears in a string. You might use:
INDEX(str,'sun')
SUBSTR - simply returns a substring of the provided string, taking a second numeric argument referring to the starting position of the substring. Combine this with your INDEX function:
SUBSTR(str,INDEX(str,'sun'))
This returns the substring of str from the point where the word 'sun' first appears.
SCAN - returns the 'words' from a string, taking the string as the first argument, followed by a number referring to the 'word'. There is also a third argument that specifies the delimiter, but this defaults to space, so you wouldn't need it in your example.
To pick out the word after 'sun' you might do this:
SCAN(SUBSTR(str,INDEX(str,'sun')),2)
Now all that's left to do is build a new string containing the words of interest. That can be achieved with concatenation operators. To see how to concatenate two strings, run this illustrative example:
data _NULL_;
a = 'Hello';
b = 'World';
c = a||' - '||b;
put c;
run;
The log should contain this line:
Hello - World
This is the result of displaying the value of the c variable with the put statement. There are a number of functions that can be used to concatenate strings; look in the documentation at CAT, CATX and CATS for some examples.
Hopefully there is enough here to help you.
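The find-the-word, take-the-next-tokens, then count logic can also be sketched in Python to make the steps concrete. This is an illustration of the logic only, not SAS code: the function name fragment and the fourth sample row are hypothetical, added so that one fragment occurs twice.

```python
from collections import Counter


def fragment(sentence, word, n_words=2):
    """Return word plus the following tokens (n_words in total), uppercased."""
    tokens = sentence.split()                 # split on whitespace, like scan()
    lowered = [t.lower() for t in tokens]
    if word.lower() not in lowered:
        return None                           # search term not present
    start = lowered.index(word.lower())       # word number, like findw(..., 'e')
    return ' '.join(tokens[start:start + n_words]).upper()


rows = ["the sun is nice",
        "the sun looks great",
        "the sun left me",
        "the sun looks fine"]                 # hypothetical duplicate case

fragments = [fragment(r, "sun") for r in rows]
counts = Counter(f for f in fragments if f is not None)
for text, n in counts.items():
    print(text, n)
```

In SAS the counting step would typically be a separate PROC FREQ or PROC SQL GROUP BY on the fragment variable.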

Replace multiple substrings using strrep in Matlab

I have a big string (around 25M characters) where I need to replace multiple substrings of a specific pattern in it.
Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
The substring I need to remove is the 'Frame #' and it occurs around 7670 times. I can give multiple search strings in strrep, using a cell array
strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')
However that returns a cell array, where in each cell, I have the original string with the corresponding substring of one of my input cell changed.
Is there a way to replace multiple substrings from a string, other than using regexprep? I noticed that it is considerably slower than strrep, that's why I am trying to avoid it.
With regexprep it would be:
regexprep(text,'Frame \d*',';')
and for a string of 25MB it takes around 47 seconds to replace all the instances.
EDIT 1: added the equivalent regexprep command
EDIT 2: added size of the string for reference, number of occurrences for the substring and timing of execution for the regexprep
Ok, in the end I found a way to go around the problem. Instead of using regexprep to change the substring, I remove the 'Frame ' substring (including whitespace, but not the number)
rawData = strrep(text,'Frame ','');
This results in something like this:
1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Then, I change all the commas (,) and newline characters (\n) into a semicolon (;), using again strrep, and I create a big vector with all the numbers
rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');
then I remove the unnecessary numbers (1,2,...,7670), since they are located at a specific point in the array (each frame contains a specific amount of numbers).
rawData{1}(firstInstance:spacing:lastInstance)=[];
And then I go on with my manipulations. It seems that the additional strrep calls and the removal of the values from the array are much faster than the equivalent regexprep. With a string of 25M chars, the whole operation takes about 47 seconds with regexprep, while with this workaround it takes only 5 seconds!
Hope this helps somehow.
I think that this can be done using only textscan, which is known to be very fast. By specifying a 'CommentStyle', the 'Frame #' lines are stripped out. This may only work because these 'Frame #' lines are on their own lines. This code returns the raw data as one big vector:
s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:}
You may want to know how many elements are in each frame or even reshape the data into a matrix. You can use textscan again (or before the above) to get just the data for the first frame:
f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = f1{:}
In fact, if you just want the elements from the first line, you can use this:
l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}
However, the other nice thing about textscan is that you can use it to read in the file directly (it looks like you may be using some other means currently) using just fopen to get an FID. Thus the string data text doesn't have to be in memory.
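The underlying idea (treat every line that starts with 'Frame' as a comment and parse the rest as comma-separated numbers) is not specific to MATLAB. A minimal Python sketch of it follows, for illustration only: the name parse_frames is hypothetical, and it assumes the dotted lines in the question are elisions rather than literal file content.

```python
def parse_frames(text):
    """Collect all numbers from text, skipping the 'Frame N' header lines."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Frame"):
            continue                          # header or blank line: ignore
        values.extend(float(v) for v in line.split(",") if v)
    return values


sample = "Frame 1\n0,0,1\n2,34,0\nFrame 2\n5,6,7\n"
print(parse_frames(sample))
```

No timing claim is made for this sketch; it only shows that no substring replacement is needed when the headers sit on their own lines.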
Using regular expressions:
result = regexprep(text,'Frame [0-9]+','');
It's possible to avoid regular expressions as follows. I use strrep with suitable replacement strings that act as masks. The obtained strings are equal-length and are assured to be aligned, and can thus be combined into the final result using the masks. I've also included the ; you want. I don't know if it will be faster than regexprep or not, but it's definitely more fun :-)
% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; %//example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; %//strings to be replaced.
%//May be of different lengths
% Computations
rep_dest = cellfun(@(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
%//series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux)); %//keep characters according to mask
ind_semicolon = diff(ind_keep)==1; %//where to insert ';'
ind_keep = ind_keep | [ind_semicolon 0]; %// semicolons will also be kept
result = aux(1,:); %//for now
result(ind_semicolon) = ';'; %//include `;`
result = result(ind_keep); %//remove unwanted characters
With these example data:
>> text
text =
Hello Frame 1 test string Frame 22 end of Frame 2 this
>> result
result =
Hello ; test string ; end of ; this

SQLite Query for a character with a prefix and a suffix

Okay actually I'm writing a program for parsing Japanese/Chinese text, but I try to map it to an english example. No, I don't want to use it to create password lists :).
Suppose there is a text without spaces (space is not used in most east asian languages) like :
helloiamwritingproperenglish!
Given is a specific character position in the text like the r in proper:
helloiamwritingproperenglish!
^
so the text can be decomposed in prefix + 'r' + suffix.
Additionally there is a dictionary stored in SQLite containing character combinations (words) like:
sqllite>SELECT writingKey from dic_writings;
writingKey
----------
A, Aa, ...
I want to find all regular words in the dictionary that are containing the selected character 'r' and a (maybe empty) substring of prefix and suffix like:
sqllite>FindCandidates('helloiamwritingp','r','operenglish!');
R, Pro, Rope, Prop, Proper
A Query to find all words in the input text could be:
SELECT * FROM dic_writings WHERE (text LIKE ('%'||writingKey||'%'));
but this approach is not very fast and I need to filter the words containing the selected 'r' (checking for 'r' is not enough actually). Anybody has an idea? Thank you for your time!
