How to split/tokenize a string by given requirements?

How to split/tokenize a string by given requirements? - string

I have this string here I need to split up in tokens.
thorak={name="Thorak"}
The result should look something like this:
["thorak", "=", "{", "name", "=", "Thorak", "}"]
My thought
I thought about having different RegEx rules running over it but am a bit unsure how to do this properly.
Consider this rule array:
["^(\w+)", "^=", "^{", "^(\w+)", "^=", "^(\w+)", "^}"]
Given a RegEx rule that only matches Strings ^(\w+) I would apply it to the string.
It should match with the thorak string and there I have my first token.
To have this work in a loop I might do the following:
Match a RegEx rule
Save the matching string in an array
Remove the matching string from the iterated string (to have RegEx rules running over the next parts)
Repeat until the String is empty OR no rule was able to be applied
This is my first time doing a bit more labor work on string so I wonder what nifty tricks there exist to make what I want easier.

Related

String contains substring and substring not part of longer word (exact match)

I have captured the full text of a PDF-file in a string called pdfText.
Next I am looping through an array containing substrings to be found/searched for in the pdfText-string.
One of the substrings is Invoice.
Both pdfText and the substrings I am searching for are converted to lower case.
If at least one of the substrings are found in the pdfText, a boolean is set to true.
Now, I have an example where the pdtText contains '...Net amount to be invoiced...'. This is the only variant of 'invoice' in the text.
This of course returns true if I use
substring = "Invoice" ... pdfText.contains(substring.ToLower).
But in this case I need it to return false. I need to find only exact matches.
Another example, if the pdfText contains '...This is an invoice. Please pay....Net amount to be invoiced...' the boolean should be set to true because of the first invoice-match, but not the second invoiced-(non)match.
So what I am looking for is to find a substring Invoice in a string pdfText and make sure, that the substring is not part of a longer word invoiced, invoice-process etc.. Note, that invoice. should return True.
I believe this should be possible, but cannot wrap my head around it currently.
I might need to use regex?

This one uses the RegEx, with a slight change, proposed by #Mederic at https://stackoverflow.com/a/45587916/2326360
Use the build in UiPath activity Is Match, found under Programming->String.
Use it inside your loop, with the current settings.
The RegEx is: substring+"[^a-zA-Z]"
I have declared the following variables:

RegEx would be a good approach.
I only started RegEx not long ago but I think this would work fine.
RegEx:
(invoice)[^a-zA-Z]
Explanation:
() Creates a Capture Group
invoice looks for the match for invoice
[^a-zA-Z] Checks there are no characters from a-z or A-Z after
Example:
Sample: This was invoiced
Result: No Result
Sample: This is an invoice.
Result: Match on invoice. Capture group 1 = invoice
Implementation:
Dim m As Match = Regex.Match(pdfText.ToLower,"(invoice)[^a-zA-Z]")
' If successful, write the group.
If (m.Success) Then
Dim key As String = m.Groups(1).Value
Console.WriteLine(key)
End If

Display the specific part of the string in PostgreSQL 9.3

I have a string to modify as per the requirements.
For example:
The given string is:
str1 varchar = '123,456,789';
I want to show the string as:
'456,789'
Note: The first part (delimited) with comma, I want to remove from string and show the rest of string.
In SQL Server I used STUFF() function.
SELECT STUFF('123,456,789',1,4,'');
Result:
456,789
Question: Is there any string function in PostgreSQL 9.3 version to do the same job?

you can use regular expressions:
select substring('123,456,789' from ',(.*)$');
The comma matches the first comma found in the string. The part inside the brackets (.*) is returned from the function. The symbol $ means the end of the string.
A alternative solution without regular expressions:
select str, substring(str from position(',' in str)+1 for length(str)) from
(select '123,456,789'::text as str) as foo;

You could first turn the string to array and return second and third cell:
select array_to_string((regexp_split_to_array('123,456,789', ','))[2:3], ',')
Or you could use substring-function with regular expressions (pattern matching):
SELECT substring('123,456,789' from '[0-9]+,([0-9]+,[0-9]+)')
[0-9]+ means one or more digits
parentheses tell to return that part from the string
Both solutions work on your specific string.

Your The SQL Server example indicates you just want to remove the first 4 characters, which makes the rest of your question seem misleading because it completely ignores what's in the string. Only the positions matters.
Be that as it may, the simple and cheap way to cut off leading characters is with right():
SELECT right('123,456,789', -4);
SQL Fiddle.

Reading from a string using sscanf in Matlab

I'm trying to read a string in a specific format
RealSociedad
this is one example of string and what I want to extract is the name of the team.
I've tried something like this,
houseteam = sscanf(str, '%s');
but it does not work, why?

You can use regexprep like you did in your post above to do this for you. Even though your post says to use sscanf and from the comments in your post, you'd like to see this done using regexprep. You would have to do this using two nested regexprep calls, and you can retrieve the team name (i.e. RealSociedad) like so, given that str is in the format that you have provided:
str = 'RealSociedad';
houseteam = regexprep(regexprep(str, '^<a(.*)">', ''), '</a>$', '')
This looks very intimidating, but let's break this up. First, look at this statement:
regexprep(str, '^<a(.*)">', '')
How regexprep works is you specify the string you want to analyze, the pattern you are searching for, then what you want to replace this pattern with. The pattern we are looking for is:
^<a(.*)">
This says you are looking for patterns where the beginning of the string starts with a a<. After this, the (.*)"> is performing a greedy evaluation. This is saying that we want to find the longest sequence of characters until we reach the characters of ">. As such, what the regular expression will match is the following string:
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">
We then replace this with a blank string. As such, the output of the first regexprep call will be this:
RealSociedad</a>
We want to get rid of the </a> string, and so we would make another regexprep call where we look for the </a> at the end of the string, then replace this with the blank string yet again. The pattern you are looking for is thus:
</a>$
The dollar sign ($) symbolizes that this pattern should appear at the end of the string. If we find such a pattern, we will replace it with the blank string. Therefore, what we get in the end is:
RealSociedad

Found a solution. So, %s stops when it finds a space.
str = regexprep(str, '<', ' <');
str = regexprep(str, '>', '> ');
houseteam = sscanf(str, '%*s %s %*s');
This will create a space between my desired string.

Lua: Search a specific string

Hi all tried all the string pattrens and library arguments but still stuck.
i want to get the name of the director from the following string i have tried the string.matcH but it matches the from the first character it finD from the string
the string is...
fixstrdirector = {id:39254,cast:[{id:15250,name:Hope Davis,character:Aunt Debra,order:5,cast_id:10,profile_path:/aIHF11Ss8P0A8JUfiWf8OHPVhOs.jpg},{id:53650,name:Anthony Mackie,character:Finn,order:3,cast_id:11,profile_path:/5VGGJ0Co8SC94iiedWb2o3C36T.jpg},{id:19034,name:Evangeline Lilly,character:Bailey Tallet,order:2,cast_id:12,profile_path:/oAOpJKgKEdW49jXrjvUcPcEQJb3.jpg},{id:6968,name:Hugh Jackman,character:Charlie Kenton,order:0,cast_id:13,profile_path:/wnl7esRbP3paALKn4bCr0k8qaFu.jpg},{id:79072,name:Kevin Durand,character:Ricky,order:4,cast_id:14,profile_path:/c95tTUjx5T0D0ROqTcINojpH6nB.jpg},{id:234479,name:Dakota Goyo,character:Max Kenton,order:1,cast_id:15,profile_path:/7PU6n4fhDuFwuwcYVyRNVEZE7ct.jpg},{id:8986,name:James Rebhorn,character:Marvin,order:6,cast_id:16,profile_path:/ezETMv0YM0Rg6YhKpu4vHuIY37D.jpg},{id:930729,name:Marco Ruggeri,character:Cliff,order:7,cast_id:17,profile_path:/1Ox63ukTd2yfOf1LVJOMXwmeQjO.jpg},{id:19860,name:Karl Yune,character:Tak Mashido,order:8,cast_id:18,profile_path:/qK315vPObCNdywdRN66971FtFez.jpg},{id:111206,name:Olga Fonda,character:Farra Lemkova,order:9,cast_id:19,profile_path:/j1qabOHf3Pf82f1lFpUmdF5XvSp.jpg},{id:53176,name:John Gatins,character:Kingpin,order:10,cast_id:41,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:1126350,name:Sophie Levy,character:Big Sister,order:11,cast_id:42,profile_path:null},{id:1126351,name:Tess Levy,character:Little Sister,order:12,cast_id:43,profile_path:null},{id:1126352,name:Charlie Levy,character:Littlest Sister,order:13,cast_id:44,profile_path:null},{id:187983,name:Gregory Sims,character:Bill Panner,order:14,cast_id:45,profile_path:null}],crew:[{id:58726,name:Leslie Bohem,department:Writing,job:Screenplay,profile_path:null},{id:53176,name:John Gatins,department:Writing,job:Screenplay,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:17825,name:Shawn Levy,department:Directing,job:Director,profile_path:/7f2f8EXdlWsPYN0HPGcIlG21xU.jpg},{id:12415,name:Richard Matheson,department:Writing,job:Story,profile_path:null},{id:57113,name:Dan Gilroy,department:Writing,job:Story,profile_path:null},{id:25210,name:Jeremy Leven,department:Writing,job:Story,profile_path:null},{id:17825,name:Shawn Levy,department:Production,job:Producer,profile_path:/7f2f8EXdlWsPYN0HPGcIlG21xU.jpg},{id:34970,name:Susan Montford,department:Production,job:Producer,profile_path:/1XJt51Y9ciPhkHrAYE0j6Jsmgji.jpg},{id:3183,name:Don Murphy,department:Production,job:Producer,profile_path:null},{id:34967,name:Rick Benattar,department:Production,job:Producer,profile_path:null},{id:1126348,name:Eric Hedayat,department:Production,job:Producer,profile_path:null},{id:186721,name:Ron Ames,department:Production,job:Producer,profile_path:null},{id:10956,name:Josh McLaglen,department:Production,job:Executive Producer,profile_path:null},{id:57634,name:Mary McLaglen,department:Production,job:Executive Producer,profile_path:null},{id:23779,name:Jack Rapke,department:Production,job:Executive Producer,profile_path:null},{id:488,name:Steven Spielberg,department:Production,job:Executive Producer,profile_path:/cuIYdFbEe89PHpoiOS9tmo84ED2.jpg},{id:30,name:Steve Starkey,department:Production,job:Executive Producer,profile_path:null},{id:24,name:Robert Zemeckis,department:Production,job:Executive Producer,profile_path:/isCuZ9PWIOyXzdf3ihodXzjIumL.jpg},{id:531,name:Danny Elfman,department:Sound,job:Original Music Composer,profile_path:/pWacZpYPos8io22nEiim7d3wp2j.jpg},{id:18265,name:Mauro Fiore,department:Crew,job:Cinematography,profile_path:null},{id:54271,name:Dean Zimmerman,department:Editing,job:Editor,profile_path:null},{id:25365,name:Richard Hicks,department:Production,job:Casting,profile_path:null},{id:5490,name:David Rubin,department:Production,job:Casting,profile_path:null},{id:52088,name:Tom Meyer,department:Art,job:Production Design,profile_path:null}]}
i have tried string.match(fixstrdirector,"name:(.+),department:Directing")
but it gives me the from the first occurace it find the name to the end of thr string
output:
Hope Davis,character:Aunt Debra,order:5,cast_id:10,profile_path:/aIHF11Ss8P0A8JUfiWf8OHPVhOs.jpg},{id:53650,name:Anthony Mackie,character:Finn,order:3,cast_id:11,profile_path:/5VGGJ0Co8SC94iiedWb2o3C36T.jpg},{id:19034,name:Evangeline Lilly,character:Bailey Tallet,order:2,cast_id:12,profile_path:/oAOpJKgKEdW49jXrjvUcPcEQJb3.jpg},{id:6968,name:Hugh Jackman,character:Charlie Kenton,order:0,cast_id:13,profile_path:/wnl7esRbP3paALKn4bCr0k8qaFu.jpg},{id:79072,name:Kevin Durand,character:Ricky,order:4,cast_id:14,profile_path:/c95tTUjx5T0D0ROqTcINojpH6nB.jpg},{id:234479,name:Dakota Goyo,character:Max Kenton,order:1,cast_id:15,profile_path:/7PU6n4fhDuFwuwcYVyRNVEZE7ct.jpg},{id:8986,name:James Rebhorn,character:Marvin,order:6,cast_id:16,profile_path:/ezETMv0YM0Rg6YhKpu4vHuIY37D.jpg},{id:930729,name:Marco Ruggeri,character:Cliff,order:7,cast_id:17,profile_path:/1Ox63ukTd2yfOf1LVJOMXwmeQjO.jpg},{id:19860,name:Karl Yune,character:Tak Mashido,order:8,cast_id:18,profile_path:/qK315vPObCNdywdRN66971FtFez.jpg},{id:111206,name:Olga Fonda,character:Farra Lemkova,order:9,cast_id:19,profile_path:/j1qabOHf3Pf82f1lFpUmdF5XvSp.jpg},{id:53176,name:John Gatins,character:Kingpin,order:10,cast_id:41,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:1126350,name:Sophie Levy,character:Big Sister,order:11,cast_id:42,profile_path:null},{id:1126351,name:Tess Levy,character:Little Sister,order:12,cast_id:43,profile_path:null},{id:1126352,name:Charlie Levy,character:Littlest Sister,order:13,cast_id:44,profile_path:null},{id:187983,name:Gregory Sims,character:Bill Panner,order:14,cast_id:45,profile_path:null}],crew:[{id:58726,name:Leslie Bohem,department:Writing,job:Screenplay,profile_path:null},{id:53176,name:John Gatins,department:Writing,job:Screenplay,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:17825,name:Shawn Levy

You're searching from the first occurrence of "name:" until the "department:Directing" with everything in between.
Instead, you need to restrict what can be between the two strings. Here for example I'm saying that the characters that make up the name can only be alphanumeric or a space:
string.match(fixstrdirector,"name:([%w ]+),department:Directing")
Alternatively, given that there's a comma separating the parameters, a better approach would be to search for "name:" followed by any characters other than a comma, followed by "department:Directing":
string.match(fixstrdirector,"name:([^,]+),department:Directing")
Of course that wouldn't work if the name had a comma it in!

Lua patterns provides - modifier for tasks as you have above. As stated on PiL - Section 20.2:
The + modifier matches one or more characters of the original class.
It will always get the longest sequence that matches the pattern.
Like *, the modifier - also matches zero or more occurrences of
characters of the original class. However, instead of matching the
longest sequence, it matches the shortest one.
Next, when you are using . to match, it'll find any and all characters satisfying the pattern. Therefore, you'll get the result from first occurence of name until the ,department:Directing is found. Since you know that it is a JSON data, you can try to match for [^,]; that is, non-comma characters.
So, for your case try:
local tAllNames = {}
for sName in fixstrdirector:gmatch( "name:([^,]-),department:Directing" ) do
tAllNames[ #tAllNames + 1 ] = sName
end
and all your required names will be stored in the table tAllNames. An example of the above can be seen at codepad.

C# 4.0 function to check for first four characters in the string

I need to validate for valid code name.
So, my string can have values like below:
String test = "C000. ", "C010. ", "C020. ", "C030. ", "CA00. ","C0B0. ","C00C. "
So my function needs to validate below conditions:
It should start with C
After that next 3 characters should be numeric before .
Rest it can be anything.
So in above string values, only ["C000.", "C010.", "C020.", "C030."] are valid ones.
EDIT:
Below is the code I tried:
if (nameObject.Title.StartsWith(String.Format("^[C][0-9]{3}$",nameObject.Title)))

I'd suggest a regex, for example (written off the top of my head, may need work):
string s = "C030.";
Regex reg = new Regex("C[0-9]{3,3}\\.");
bool isMatch = reg.IsMatch(s);

This regex should do the trick:
Regex.IsMatch(input, #"C[0-9]{3}\..*")

Check out http://www.techotopia.com/index.php/Working_with_Strings_in_C_Sharp
for a quick tutorial on (among other things) individual access of string elements, so you can test each element for your criteria.
If you think your criteria may change, using regular expressions gives you maximum flexibility (but is more runtime intensive than regular string-element evaluation). In your case, it may be overkill, IMHO.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to split/tokenize a string by given requirements? - string

Related

String contains substring and substring not part of longer word (exact match)

Display the specific part of the string in PostgreSQL 9.3

Reading from a string using sscanf in Matlab

Lua: Search a specific string

C# 4.0 function to check for first four characters in the string

Categories

Resources