Parse string for matching substrings and replace with data - string

Let's say I have some text in rails:
text = "A bunch of data goes in here: %#user.name%#, %#user.email%#, %#company.name%#, %#company.state%# and then some other information as well"
I am looking for the best way to parse through that text looking for all substrings between %# and another %# in order to replace it with actual data. The text should not anticipate that data will be in any particular order and it should ideally be able to turn the substrings into references to local variables that match the substring.

Use the String#scan method.
For your case in particular, the regex I used to match was: /(\%\#.*?\%\#)/
text = "A bunch of data goes in here: %#user.name%#, %#user.email%#, %#company.name%#, %#company.state%# and then some other information as well"
regex = /(\%\#.*?\%\#)/
#here's the one line version
text.scan(regex).each {|match| text.sub!(match[0], eval(match[0].gsub(/[\%\#]/, '')))}
#Here's the more organized version
text.scan(regex).each do |match|
current_match = match[0]
replacement_var = current_match.gsub(/[\%\#]/, '')
text.sub!(current_match, eval(replacement_var))
end
puts text

text = "A bunch of data goes in here: %#user.name%#, %#user.email%#, %#company.name%#, %#company.state%# and then some other information as well"
placeholder = text[/\%\#.*?\%\#/]
while placeholder
case placeholder
when "%#user.name%#"
text.sub!(/\%\#.*?\%\#/,"Steve")
when "%#user.email%#"
text.sub!(/\%\#.*?\%\#/,"steve@example.com")
when "%#company.name%#"
text.sub!(/\%\#.*?\%\#/,"Wayne Industries")
when "%#company.state%#"
text.sub!(/\%\#.*?\%\#/,"Gotham")
else
text.sub!(/\%\#.*?\%\#/,"unknown")
end
placeholder = text[/\%\#.*?\%\#/]
end
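Both approaches above either eval the placeholder text or hard-code every case. A minimal sketch of a lookup-based alternative follows, written in Python rather than Ruby purely for illustration. The fill_placeholders helper, the context dictionary and its sample values are assumptions made for this example, not part of the original question:

```python
import re

# Non-greedy match of everything between a pair of %# markers.
PLACEHOLDER = re.compile(r"%#(.*?)%#")

def fill_placeholders(text, context, default="unknown"):
    """Replace each %#object.attribute%# with a value looked up in context."""
    def lookup(match):
        value = context
        # Walk the dotted path, e.g. "user.name" -> context["user"]["name"].
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                return default
            value = value[part]
        return str(value)
    return PLACEHOLDER.sub(lookup, text)

# Hypothetical sample data standing in for the real user/company objects.
context = {
    "user": {"name": "Steve", "email": "steve@example.com"},
    "company": {"name": "Wayne Industries", "state": "Gotham"},
}
text = "Data: %#user.name%#, %#company.state%#, %#user.phone%#"
result = fill_placeholders(text, context)
print(result)  # Data: Steve, Gotham, unknown
```

Because only keys present in the dictionary are ever read, placeholders can appear in any order and unrecognised ones fall back to the default instead of executing arbitrary code.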


Regular expression match / break

I am doing text analysis on SEC filings (e.g., 10-K), and the documents I have are the complete submission. The complete filing submission includes the 10-K, plus several other documents. Each document resides within the tags ‘<DOCUMENT>’ and ‘</DOCUMENT>’.
What I want: To count the number of words in the 10-K only before the first instance of ‘</DOCUMENT>’
How I want to accomplish it: I want to use a for loop, with a regex (regex_end10k) to indicate where to stop the for loop.
What is happening: No matter where I put my regex match break, the program counts all of the words in the entire document. I have no error, however I cannot get the desired results.
How I know this: I have manually trimmed one filing, while retaining the full document (results below). When I manually remove the undesired documents after the first instance of ‘</DOCUMENT>’, I yield about 750,000 fewer words.
Current output
Note: Apparently I don't have enough SO reputation to embed a screenshot in my post; it defaults to a link.
What I have tried: several variations of where to put the regex match break. No matter what, it almost always counts the entire document. I believe that the two functions may be performed over the entire document. I have tried putting the break statement within get_text_from_html() so that count_words() only performs on the 10-K, but I have had no luck.
The code below is a snippet from a larger function. Its purpose is to (1) strip HTML tags and (2) count the number of words in the text. If I can provide any additional information, please let me know and I'll update my post.
The remaining code (not shown) extracts firm and report identifiers, (e.g., ‘file’ or ‘cik’) from the header section between tags ‘<SEC-HEADER>’ and ‘</SEC-HEADER>’. Using the same logic, when extracting header information, I use a regex match break logic and it works perfectly. I need help trying to understand why this same logic isn’t working when I try to count the number of words and how to correct my code. Any help is appreciated.
regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)
for line in f:
    def get_text_from_html(html:str):
        doc = lxml.html.fromstring(html)
        for table in doc.xpath('.//table'): # optional: removes tables from HTML source code
            table.getparent().remove(table)
        for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
            for element in doc.findall(tag):
                if element.text:
                    element.text = element.text + "\n"
                else:
                    element.text = "\n"
        return doc.text_content()
    to_clean = f.read()
    clean = get_text_from_html(to_clean)
    #print(clean[:20000])
    def count_words(clean):
        words = re.findall(r"\b[a-zA-Z\'\-]+\b",clean)
        word_count = len(words)
        return word_count
    header_vars["words"] = count_words(clean)
    match = regex_end10k.search(line) # This should do it, but it doesn't.
    if match:
        break
You don't need a regex; just split your original string on the tag and count the words in the part before it. A simple example is below:
text = 'Text before <DOCUMENT> text after'
splited_text = text.split('<DOCUMENT>')
splited_text_before = splited_text[0]
count_words = len(splited_text_before.split())
print(splited_text_before)
print(count_words)
output
Text before
2
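Applied to the question's setup, the same idea can be sketched as follows: cut the raw text at the first closing tag, then count words only in what comes before it. The function name and the sample string are assumptions for illustration; tag stripping (the asker's get_text_from_html) is left out and would be applied to the trimmed part before counting.

```python
import re

# Same closing-tag pattern and word pattern as in the question.
END_OF_DOCUMENT = re.compile(r"</DOCUMENT>", re.IGNORECASE)
WORD = re.compile(r"\b[a-zA-Z'\-]+\b")

def count_words_in_first_document(raw):
    """Count words that appear before the first closing </DOCUMENT> tag."""
    # maxsplit=1 stops at the first tag; element 0 is everything before it.
    first_part = END_OF_DOCUMENT.split(raw, maxsplit=1)[0]
    return len(WORD.findall(first_part))

# Hypothetical miniature filing: one 10-K body followed by an exhibit.
sample = "Annual report text</document>Exhibit text here</DOCUMENT>"
word_count = count_words_in_first_document(sample)
print(word_count)  # 3
```

Reading the whole file once and trimming it this way avoids the situation in the question, where f.read() inside the loop consumes the entire submission before the break is ever reached.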

replace 2 or 3 words in one sentence in one cell from another words in another cell on Excel

I was looking for a solution and I found it here
replacing many words every one with alternative word
But now I'm using an alternative code that I got from the link below that post, which is case sensitive.
Function SubstituteMultipleCS(text As String, old_text As Range, new_text As Range)
Dim i As Single
For i = 1 To old_text.Cells.Count
Result = Replace(text, old_text.Cells(i), new_text.Cells(i))
text = Result
Next i
SubstituteMultipleCS = Result
End Function
I'm using it to make German Anki cards so I need to replace some words with ___. It's working with one single word or a bunch of words if they are together, but...
The problem is the following:
Some verbs conjugation have a sentence structure when I must place the main verb after the noun and the particle, which belongs to the verb, at the end. Something like this
As you can see in the picture, the verb "schaute an" is not replaced by the new word because "schaute" is separated from "an" in the original sentence.
Is there any way to fix this?
thank you.
Here is a formula you may use (which works for your current sample data):
Formula in C2:
=IFERROR(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(" "&SUBSTITUTE(B2,"."," ")&" "," "&FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s[position() = 1]")&" ",D2,1),IFERROR(" "&FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s[position() = 2]")&" ",""),D2,1),IFERROR(" "&FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s[position() = 3]")&" ",""),D2,1))&".","")
The advantage of nested substitutes is that we can tell the function to replace only the first occurrence if you had a sentence where multiple could occur. Not sure if it's watertight.

MATLAB: Only pick filenames coinciding with some input string

Say I have a directory full of filenames such as:
1242349_blabla.wav
fdp23424_asdf.wav
o2349_0.wav
and I have an input text file listing unique IDs on each newline coinciding with numbers within these filenames (e.g. '23424' for the second filename above).
I'd like to construct a struct of filenames only containing those filenames in that directory that coincide with some ID in the input text file:
fid = fopen('input.txt');
input = textscan(fid, '%s', 'Delimiter', '\n');
filenames = dir(fullfile('/somedir/', '*.wav'));
for i = 1:length(filenames)
for j = 1:length(input)
if (strfind(input{1}(j), filenames(i).name)) ~= [])
% create new struct with chosen filenames
end
end
end
However, I get the error "undefined function 'ne' for input arguments of type 'cell'". I've tried loads of options to no avail. Also, the input evaluates to a 38x1 cell, but which has length 1, so the inner loop will only go once... Any ideas?
Regular expressions are definitely the most flexible and powerful solution. But, if your needs are simpler...you can get away with something simpler, like using wildcards in your dir command. Try something like this:
%get your file IDs from the input file
fid = fopen('input.txt');
input = textscan(fid, '%s', 'Delimiter', '\n');
IDs = input{1};
%loop over each string
myfilenames = {};
for idx = 1:length(IDs)
%get all files build off the given ID
fnames = dir(['somedir/*' IDs{idx} '*.wav']); %wildcards!
%gather the new filenames that match
for Ifname=1:length(fnames)
myfilenames{end+1}=fnames(Ifname).name;
end
end
I would use regular expressions to search for occurrences of the ID in your cell array. Regular expressions are designed to search for patterns in a particular string for you. Because you want to search for specific numbers in a set of strings, I would certainly recommend you use them. Specifically, use the regexp function, where the pattern is the ID that you are searching for.
How regexp works is that you can provide a cell array of strings, and the output will be another cell array where each element is a numeric array that determines the starting index of where the particular pattern you're looking for starts for a particular string in the cell array. Should the array be empty, this means that we didn't find any pattern that matched what you're looking for. If it isn't empty, then it will contain the starting index of where the ID is located in the string. This doesn't really matter - you want to determine whether the ID exists in a particular string, and so checking to see whether each array is empty is what will be useful.
As such, given your filenames that you read through dir, we can create a cell array that stores just the file names themselves, run regexp, then filter out those file names that don't contain the ID you want. Something like this:
f = dir(fullfile('/somedir/', '*.wav'));
filenames = {f.name};
ID = 23424;
check = regexp(filenames, num2str(ID));
filtered_ind = cellfun(@isempty, check);
final_files = f(~filtered_ind);
The first line of code reads the files from your desired directory. The second line of code extracts the names from each name field of the structure as a cell array. The third line is the ID you want to check for. The fourth line does a regexp call on the file names and searches for those file names that contain your desired number. Note that we need to convert the number to a string, as the pattern is expected to be a string. The next line after that finds those filenames that do not have the ID you are looking for, and the last line simply finds those files that do have the ID you're looking for.
You can then go ahead and start your processing. Specifically, you can loop over this cell array and go ahead and create your structures per element in this cell:
for i = 1:length(final_files)
s = final_files(i); %// Get the dir structure for a file that passed the ID check
%// Create your structure now...
%// ...
end
However, you have a series of IDs that you want to check. We can simply take the code above and apply a loop to it. In other words, you'd do something like:
fid = fopen('input.txt');
input = textscan(fid, '%s', 'Delimiter', '\n');
IDs = input{1};
f = dir(fullfile('/somedir/', '*.wav'));
filenames = {f.name};
for idx = 1 : length(IDs)
%// Get an ID
ID = IDs{idx};
%// Do our checking and filter out those files that don't contain our ID
check = regexp(filenames,ID);
filtered_ind = cellfun(@isempty, check);
final_files = f(~filtered_ind);
%// Do your final processing
for i = 1:length(final_files)
s = final_files(i); %// Get the dir structure for a file that passed the ID check
%// Create your structure now...
%// ...
end
end
With the above code, we open the text file, then parse each string that's in the text file and place it into a cell array called IDs. Note here that the IDs are now all strings, so there's no need to do any conversions. After, for each ID we have, we search our filenames to see which files have this ID we're looking for. We filter out those filenames that don't have this ID, then we loop over each one of these files and create our structures. We do this for each ID that we have.
Just to demonstrate that this regexp stuff is working, as a small example, let's use the three filenames you have provided with your post. I've placed these names in a cell array, then I'll run lines 3 to 5 in the code I wrote, then I will filter out those filenames that don't contain the ID we're looking for:
filenames = {'1242349_blabla.wav'; 'fdp23424_asdf.wav'; 'o2349_0.wav'};
ID = 23424;
check = regexp(filenames, num2str(ID));
filtered_ind = cellfun(@isempty, check);
final_filenames = filenames(~filtered_ind);
final_filenames is a cell array of the filenames that contain our ID. We thus get:
final_filenames =
'fdp23424_asdf.wav'
Good luck!
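For readers who want to check the filtering logic outside MATLAB, here is a rough Python sketch of the same idea: keep a filename if a regex search for any ID finds a match in it. The variable names are assumptions for illustration, and re.escape is used so that each ID is treated as literal text.

```python
import re

# The three filenames from the question and one ID from the input file.
filenames = ["1242349_blabla.wav", "fdp23424_asdf.wav", "o2349_0.wav"]
ids = ["23424"]

# Keep a filename when at least one ID occurs somewhere inside it.
matching = [
    name for name in filenames
    if any(re.search(re.escape(file_id), name) for file_id in ids)
]
print(matching)  # ['fdp23424_asdf.wav']
```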

Reading from a string using sscanf in Matlab

I'm trying to read a string in a specific format
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>
this is one example of string and what I want to extract is the name of the team.
I've tried something like this,
houseteam = sscanf(str, '%s');
but it does not work, why?
You can do this with regexprep, as you did in your post above. Although the question asks about sscanf, your comments indicate that you'd like to see a regexprep solution. It takes two nested regexprep calls to retrieve the team name (i.e. RealSociedad), given that str is in the format that you have provided:
str = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>';
houseteam = regexprep(regexprep(str, '^<a(.*)">', ''), '</a>$', '')
This looks very intimidating, but let's break this up. First, look at this statement:
regexprep(str, '^<a(.*)">', '')
How regexprep works is you specify the string you want to analyze, the pattern you are searching for, then what you want to replace this pattern with. The pattern we are looking for is:
^<a(.*)">
This says you are looking for patterns where the string starts with <a. After this, (.*)"> performs a greedy match: we want the longest sequence of characters up to the last occurrence of the characters ">. As such, what the regular expression will match is the following string:
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">
We then replace this with a blank string. As such, the output of the first regexprep call will be this:
RealSociedad</a>
We want to get rid of the </a> string, and so we would make another regexprep call where we look for the </a> at the end of the string, then replace this with the blank string yet again. The pattern you are looking for is thus:
</a>$
The dollar sign ($) symbolizes that this pattern should appear at the end of the string. If we find such a pattern, we will replace it with the blank string. Therefore, what we get in the end is:
RealSociedad
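The two substitutions can also be tried outside MATLAB. Below is a small Python sketch using the same two patterns, assuming the input is the opening tag shown above followed by RealSociedad</a>; the variable names are made up for this example.

```python
import re

# Assumed input: the matched opening tag followed by the team name and </a>.
s = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>'

# Step 1: greedily strip everything from the leading <a up to the last ">.
without_open = re.sub(r'^<a(.*)">', '', s)

# Step 2: strip the closing tag anchored at the end of the string.
team = re.sub(r'</a>$', '', without_open)
print(team)  # RealSociedad
```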
Found a solution. So, %s stops when it finds a space.
str = regexprep(str, '<', ' <');
str = regexprep(str, '>', '> ');
houseteam = sscanf(str, '%*s %s %*s');
This puts a space on either side of my desired string, so sscanf can skip the two tags and read only the team name.

How to store string matrix and write to a file?

I don't know if Matlab can do this, but I want to store some strings in a 4×3 matrix, each element in the matrix is a string.
test_string_01 test_string_02 test_string_03
test_string_04 test_string_05 test_string_06
test_string_07 test_string_08 test_string_09
test_string_10 test_string_11 test_string_12
Then, I want to write this matrix into a plain text file, either comma or space delimited.
test_string_01,test_string_02,test_string_03
test_string_04,test_string_05,test_string_06
test_string_07,test_string_08,test_string_09
test_string_10,test_string_11,test_string_12
It seems the matrix data type is not capable of storing strings. I looked at cell arrays. I tried to use dlmwrite() or csvwrite(), but both of them only accept matrices. I also tried cell2mat() first, but that way all the letters in the strings are comma separated, like
t,e,s,t,_,s,t,r,i,n,g,_,0,1,t,e,s,t,_,s,t,r,i,n,g,_,0,2,t,e,s,t,_,s,t,r,i,n,g,_,0,3
So is there any way to achieve this?
It is possible to shorten yuk's solution a bit.
strings = {
'test_string_01','test_string_02','test_string_03'
'test_string_04','test_string_05','test_string_06'
'test_string_07','test_string_08','test_string_09'
'test_string_10','test_string_11','test_string_12'};
fid = fopen('output.txt','w');
fmtString = [repmat('%s\t',1,size(strings,2)-1),'%s\n'];
strings = strings.'; % strings{:} expands column by column, so transpose to write row by row
fprintf(fid,fmtString,strings{:});
fclose(fid);
Cell array is the way to store strings.
I agree it's a pain to save strings into a text file, but you can do it with this code:
strings = {
'test_string_01','test_string_02','test_string_03'
'test_string_04','test_string_05','test_string_06'
'test_string_07','test_string_08','test_string_09'
'test_string_10','test_string_11','test_string_12'};
fid = fopen('output.txt','w');
for row = 1:size(strings,1)
fprintf(fid, repmat('%s\t',1,size(strings,2)-1), strings{row,1:end-1});
fprintf(fid, '%s\n', strings{row,end});
end
fclose(fid);
Substitute \t with , to get csv file.
You can also store cell array of strings into Excel file with XLSWRITE (requires COM interface, so it's on Windows only):
xlswrite('output.xls',strings)
In most cases you can pass an empty delimiter ('') to dlmwrite to get Matlab to save a string to a file.
For example,
output=('my_first_String');
dlmwrite('myfile.txt',output,'delimiter','')
will save a file named myfile.txt containing my_first_String.
