Extract Number from string into a list in Scala - string

I have the following string :
var myStr = "abc12ef4567gh90ijkl789"
The size of the list is not fixed and it contains number in between. I want to extract the numbers and store them in the form of a list in this manner:
List(12,4567,90,789)
I tried the solution mentioned here but cannot extend it to my case. I just want to know if there is any faster or efficient solution instead of just traversing the string and extracting the numbers one by one using brute force ? Also, the string can be arbitrary length.

It seems you may just collect the numbers using
("""\d+""".r findAllIn myStr).toList
See the Scala demo. \d+ matches one or more digits, findAllIn searches for multiple occurrences of the pattern inside a string (and also un-anchors the pattern so that partial matches could be found).
If you prefer a splitting approach, you might use
myStr.split("\\D+").filter(_.nonEmpty).toList
See another demo. Here, \D+ matches one or more non-digit chars, and these chunks are used to split on (texts between these chunks land in the result). .filter(_.nonEmpty) will remove empty items that usually appear due to matches at the start/end of the string.

Related

How to Automatically add thousand separators for every number in a string?

How can i create a thousand separator for every number which is in my string?
So for example this string:
string = "123456,78+1234"
should be displayed as:
TextView = "123.456,78+1.234"
And the string should be editable, so the thousand separator should adapt when i remove or add a digit.
I have already read all the posts I could find about it, but I could never find an up-to-date working answer. So I would be really grateful for your help!
Your question contains two sub-questions:
A. You want to add thousand separators to a string which contains a group of numbers.
B. You want it to change.
And the answers are:
A: In your example there's , as a delimiter, so you need to split the string using this delimiter to an array of strings.
Then iterate over them and have your dots added to every 3nth index of their characters; you can also use String.format("%,d", substr.toLong()).
Lastly, append all of the strings back together with , as the separator.
B: This one can be done in different ways. You may store the original string somewhere and observe it, so when it changes it goes to the function which does A, and use the function result the way you like (which I suppose is to be set in a TextView).

Excel: Find words of certain length in string?

I have this file where I want to make a conditional check for any cell that contains the letter combination "_SOL", or where the string is followed by any numeric character like "_SOL1524", and stop looking after that. So I don't want matches for "_SOLUTION" or "_SOLothercharactersthannumeric".
So when I use the following formula, I also get results for words like "_SOLUTION":
=IF(ISNUMBER(FIND("_SOL",A1))=TRUE,"Yay","")
How can I avoid this, and only get matches if the match is "_SOL" or "_SOLnumericvalue" (one numeric character)
Clarification: The whole strings may be "Blabla_SOL_BLABLA", "Blabla_SOLUTION_BLABLA" or "Blabla_SOL1524_BLABLA"
Maybe this, which will check if the character after "_SOL" is numeric.
=IF(ISNUMBER(VALUE(MID(A1,FIND("_SOL",A1)+4,1))),"Yay","")
Or, as per OP's request and suggestion, to include the possibility of an underscore after "SOL"
=IF(OR(ISNUMBER(VALUE(MID(A1,FIND("_SOL",A1)+4,1))),ISNUMBER(FIND("_SOL_",A1))),"Yay","")
Here is an alternative way to check if your string contains SOL followed by either nothing or any numeric value up to any characters after SOL:
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","1</s><s>")&"</s></t>","//s[substring-after(.,'SOL')*0=0]")>0),"Yey","Nay")
Just to use in an unfortunate event where you would encounter SOL1TEXT for example. Or, maybe saver (in case you have text like AEROSOL):
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","</s><s>")&"</s></t>","//s[translate(.,'1234567890','')='SOL']")>0),"Yey","Nay")
And to prevent that you have text like 123SOL123 you could even do:
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","1</s><s>")&"</s></t>","//s[starts-with(., 'SOL') and substring(., 4)*0=0]")>0),"Yey","Nay")

Permutations of a string of non-unique characters

While there are a lot of solutions for how to find all the (unique) permutations of a string of unique characters, I haven't found solutions that work when the characters are non-unique. I have listed out my idea below and would appreciate feedback, but also feel free to provide your own ideas.
My idea:
To illustrate my algorithm, I'm using the example of the string ABBC, which I want to find all permutations of. Since there are two B's I will be labelling them B1 and B2.
Create a new string by removing all duplicate characters from the original string (e.g. turn AB1B2C into AB1C).
Find all possible permutations of the new string (e.g. AB1C, ACB1, B1AC, etc.). There are many algorithms to do this, since the string's characters are all unique.
Choose one duplicate character. For each permutation, insert the chosen duplicate characters at every "position" of the permutation, except when the character just before the duplicate character has the same value as the duplicate character (e.g. For the permutation AB1C, since the duplicate character is B2, insert it to get B2AB1C, AB2B1C, AB1CB2. Exception: Don't do AB1B2C, since that's just a duplicate of AB2B1C).
Continue to do step 3 but now choose a different duplicate character. (Do this until all duplicate characters have been chosen exactly once.)
Prior research: The answer by Prakhar on this SO question claims to work for duplicates: Generate list of all possible permutations of a string. It might, but I suspect there's a bug in the code.
How about this: suppose that the string with duplicates is of length N. Now consider the sequence 0,1,...N-1. Find all its permutations using one of the known algorithms. For each permutation in this list, generate a corresponding string by using the number in the permutation as an index into the original string. For example, if the string is ABBC, then the sequences will be 0,1,2,3; 0,1,3,2; etc. The sequence 3,0,1,2, as an example, is one of the permutations, and it yields the string CABB

Finding similar strings in large datasets

I'm using levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.
I'm trying to make this more efficient and one technique I came up with was to calculate the levenshtein distance only on strings that are of similar length. I though about also filtering on the initial character i.e. if the string to search starts with b then I'll run the calculation only on the strings that start with b. But I'm not sure if I could assume this to work all the time.
I was wondering if you all have a better way of getting this done?
Thanks
One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
As an alternative to building a database of substrings, you could build a http://en.wikipedia.org/wiki/Suffix_array and LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for strings in the suffix array starting ABCDEF, BCDEF, CDEF, and DEF.

How will you sort strings in the following example?

so i have a list of string
{test,testertest,testing,tester,testingtest}
I want to sort it in descending order .. how do u sort strings in general ? Is it based on the length or is it character by character ??
how would it be in the example above ?? I want to sort them in a descending way.
No matter what language you’re in, there’s a built-in sort function that performs a lexicographical order, which returns
['test','tester','testertest','testing','testingtest']
for your example. If I wanted this reversed, I would just say reversed(sorted(myList)) in Python and be done with it. If you look to your right you can see plenty of related questions that require a more specialized ordering method (for numbers, dates, etc.), but lexicographic order works on strings containing any kind of data.
Here’s how it works:
compare(string A, string B):
if A and B are both non-empty:
if A[0] == B[0]:
// First letters are the same; compare by the rest
return compare(A[1:], B[1:])
else:
// Compare the first letters by Unicode code point
return compare(A[0], B[0])
else:
// They were equal till now; the shorter one shall be sorted first
return compare(length of A, length of B)
I would sort it like this:
testingtest
testing
testertest
tester
test
Assuming C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
Array.Reverse(myStrings);
foreach(string s in myStrings)
{
Console.WriteLine(s);
}
Not always an ideal way to do it - you could implement a custom comparer instead - but for the trivial example you asked about this is probably the most logical approach.
In computer science strings are usually sorted character by character, with the preferred sort order being (for a standard english character set):
Null characters first
Followed by whitepsace
Followed by symbols
Followed by numeric characters in obvious numerical order
Followed by alphabetic characters in obvious alphabetical order
When sorting characters generally lowercase characters come before uppercase characters.
So for example if we were to sort / compare:
test i ng
test e r
Then "tester" would come before "testing" - the first different character in the string is the 5th one, and "e" comes before "i".
Similarily if we were to compare:
test
testing
Then in this case "test" would come first - once again the strings are identical until the 5th character, where the string "test" ends (i.e. no character) which becomes before any alphanumerical character.
Note that this can produce some counter-intutive results when dealing with numbers - for example try sorting the strings "50" and "100" - you will find that "100" comes before "50". Why? because the strings differ at character 1 and "5" comes after "1".
In nearly all languages there is a function which will do all of the above for you!
You should use that function instead of trying to sort strings yourself! For example:
// C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
in Java you can use natural ordering with
java.util.Collections.sort(list);
the make it descending
java.util.Collections.reverse(list);
or create your own Comparator to do the reverse sorting.
When comparing two strings to see which sorts first, the comparison is typically done on a character by character basis. If the characters in the first position (e.g., t in your example) are identical, you move to the next character. When two characters differ, that "may" define which string is considered "greater".
However, depending on the locale used and a number of other factors, it is possible for later characters in the two strings being compared to override a difference in an earlier character. For example, in some collations, the diacritics on letters are considered to be of secondary weight. So a primary difference in a later character can override the secondary difference.
When two strings are otherwise identical but one is longer, the longer one is typically considered to be "greater". When sorting in descending order, the "greater" of two strings is sorted first.
Do you want to know if test should appear after tester in a descending order? Or are you particularly interested in sorting strings with similar prefixes?
If it's the later, I'd suggest a Trie if the input tends to grow big time.

Resources