String parsing in optimal way

Suppose I have a string such as onehhhtwominusthreehhkkseveneightjnine.
I want to parse this string to extract the number words from it. For example, this string should return the array [one,two,minusthree,seven,eight,nine].
The order of the numbers must be maintained.
Can anyone suggest an optimal way to do this parsing? Thanks.

(You haven't mentioned a programming language.)
I would probably search for "minus" and check the number(s) that follow it. Then search for "one", then "two", and so on for each remaining word, noting the index of every match. The indexes provide enough information to map the matches and output them in the order you need.
Another option is to look at each character in order, comparing each to the 10 choices. I couldn't tell you which is the most efficient; I think it depends on the possible total string length. I'd probably write both and profile them.
If the string to search is not of inordinate length, I suspect that the second approach might be more efficient. This is because, as soon as you have a match, you can skip the following (known) number of characters.
That is, if you have "abceightd", once you discover the "e" and its "eight" you can skip the remaining four characters. You can also skip the a, b, and c anyway, as none of them is the first character of any of the 10 choices.
I am assuming your choices are:
one, two, three, four, five, six, seven, eight, nine, minus
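The character-by-character approach described above could be sketched in Python roughly as follows. This is only an illustration under the assumption that the ten choices listed are the complete vocabulary; the names WORDS and extract_numbers are made up for the example.

```python
# Hypothetical vocabulary, taken from the list of choices above.
WORDS = ("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")

def extract_numbers(text):
    """Scan left to right, collecting number words with an optional 'minus' prefix."""
    found = []
    i = 0
    while i < len(text):
        start = i
        # A leading "minus" only counts if a number word follows it directly.
        if text.startswith("minus", i):
            i += len("minus")
        word = next((w for w in WORDS if text.startswith(w, i)), None)
        if word is None:
            # Nothing matched here: advance one character from where we started.
            i = start + 1
            continue
        found.append(text[start:i + len(word)])
        i += len(word)  # skip the characters that were just matched
    return found

print(extract_numbers("onehhhtwominusthreehhkkseveneightjnine"))
# ['one', 'two', 'minusthree', 'seven', 'eight', 'nine']
```

Because each match advances the index past the matched word, no character inside a recognized word is examined again.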

Assuming that a) you have access to regular expressions in your choice of programming language and b) your possible choices are as Andy G has assumed... then this regular expression can pick out the numbers grouped with their associated minus, if present:
/((?:minus)*(?:one|two|three|four|five|six|seven|eight|nine))/g
Applied repeatedly to your example string using JavaScript's RegExp.prototype.exec(), for example, this extracts:
one
two
minusthree
seven
eight
nine
You could easily place a space after any minus matched if required. Does this help at all?
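For comparison, the same idea can be tried in Python with the standard re module. This is a sketch rather than a translation of the answer: it assumes at most one "minus" per number (so it uses ? where the pattern above uses *), and the names NUMBER_WORD and find_number_words are illustrative.

```python
import re

# Optional "minus" followed by one of the nine number words.
NUMBER_WORD = re.compile(r"(?:minus)?(?:one|two|three|four|five|six|seven|eight|nine)")

def find_number_words(text):
    # With no capturing groups, findall returns the full text of each match, in order.
    return NUMBER_WORD.findall(text)

print(find_number_words("onehhhtwominusthreehhkkseveneightjnine"))
# ['one', 'two', 'minusthree', 'seven', 'eight', 'nine']
```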

Related

How to make an excel (365) function that recognizes different words in the same cell and changes them individually

What I'm working with:
I have a list of product names that are unfortunately written entirely in uppercase. I now want to make only the first letter uppercase and the rest lowercase, but I also want all words of 3 or fewer characters to stay uppercase.
I'm trying IF functions, but nothing is really working.
I use the German version of Excel, but I would be happy if someone has any idea how to do it. I have been trying different functions for hours, but nothing works.
=IF(LENGTH(C6)<=3,UPPER(C6),UPPER(LEFT(C6,1))&LOWER(RIGHT(C6,LENGTH(C6)-1)))
But this gives a #NAME? error; Excel does not recognize the first and the last bracket.

This is hard! Let me explain:
I believe there are German words in the mix that are shorter than 4 characters and should be excluded from uppercasing. My German isn't great, but there are probably a great many such words;
There seem to be substrings that are 3+ characters in length but should probably stay uppercase, e.g. '550E/ER';
There seem to be quite a few characters that could be used as delimiters to split the input into 'words'. It's hard to catch all of them without a full list;
Possible other reasons.
With the above in mind, I think it's safe to say that we can only try to get as close as possible to what you want. Therefore I'd suggest:
To split on multiple characters;
Exclude certain words from being uppercase when length < 4;
Include certain words to be uppercase when length > 3 and digits are present;
Assume the 1st character can be made uppercase in any input.
For example:
Formula in B1:
=MAP(A1:A5,LAMBDA(v,LET(x,TEXTSPLIT(v,{"-","/"," ","."},,1),y,TEXTSPLIT(v,x,,1),z,TEXTJOIN(y,,MAP(x,LAMBDA(w,IF(SUM(--(w={"zu","ein","für","aus"})),LOWER(w),IF((LEN(w)<4)+SUM(IFERROR(FIND(SEQUENCE(10,,0),w),)),UPPER(w),LOWER(w)))))),UPPER(LEFT(z))&MID(z,2,LEN(v)))))
You can see how difficult it is to capture each and every possibility:
The minute you exclude a few words, another will pop up (the 'x' between numbers, for example, which should stay upper- or lowercase depending on the context it is found in);
The second you include words containing digits, you notice that some should be excluded ('00SICHERUNGS....');
If the 1st character is a digit, the above solution will not change the first alphabetic character to uppercase;
Maybe some characters shouldn't be used as delimiters based on context? Think about hyphenated words;
Possible other reasons.
The point is, this is not just hard, it's extremely hard if not impossible to do on the type of data you are currently working with! Even if one is proficient at writing regular expressions (throw in all the tokens, quantifiers and methods that are not available in Excel if you like), I doubt all edge cases could be covered.
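To make the rules in this answer easier to follow outside Excel, here is a rough Python rendering of the same decisions (not of the formula's mechanics). The stop-word list and delimiters are copied from the formula; the function names and the sample input are hypothetical.

```python
import re

KEEP_LOWER = {"zu", "ein", "für", "aus"}  # words that must not be uppercased

def recase_word(word):
    if word.lower() in KEEP_LOWER:
        return word.lower()
    # Short words and words containing a digit stay uppercase.
    if len(word) < 4 or any(ch.isdigit() for ch in word):
        return word.upper()
    return word.lower()

def recase(text):
    # The capturing group makes re.split keep the delimiters in the result:
    # even indices hold words, odd indices hold the delimiter between them.
    parts = re.split(r"([-/ .])", text)
    rebuilt = "".join(p if i % 2 else recase_word(p) for i, p in enumerate(parts))
    # Finally, uppercase the very first character of the whole string.
    return rebuilt[:1].upper() + rebuilt[1:]

print(recase("SCHRAUBE FÜR MOTOR 550E/ER"))
# Schraube für motor 550E/ER
```

The sketch inherits every limitation listed above: it knows nothing about context, so any word not covered by the rules is simply lowercased.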

Because you are dealing with any number of words in a cell, you'll need to get crafty with this one. Thankfully, TEXTSPLIT() and TEXTJOIN() make short work of splitting the text into words, so that we can test each word's length, change the capitalization, and then join the words back together, all in one formula:
=TEXTJOIN(" ", TRUE, IF(LEN(TEXTSPLIT(C6," "))<=3,UPPER(TEXTSPLIT(C6," ")),PROPER(TEXTSPLIT(C6," "))))
I also used PROPER(), which capitalizes the first letter of each word and lowercases the rest.
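The split, test, and rejoin steps of this formula can be illustrated in Python as below. It is a sketch only: str.capitalize() is used as a stand-in for PROPER() and differs from it for words containing non-letters, and the function name and sample input are made up.

```python
def recase_simple(text):
    # Split on spaces and drop empty pieces, like TEXTJOIN's ignore-empty flag.
    words = [w for w in text.split(" ") if w]
    # Words of 3 or fewer characters stay uppercase; longer ones get an initial capital.
    return " ".join(w.upper() if len(w) <= 3 else w.capitalize() for w in words)

print(recase_simple("ABC MOTOR SCHUTZ XL"))
# ABC Motor Schutz XL
```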

How do I concatenate combinations of letters and numbers in APL?

I'm in Dyalog 17 and would like to generate unique names to be used with its graphics object library. So, for example, I have the letter 'l' and want to take the number 1, convert it to a character and then concatenate the two together to form 'l1'. This is such trivial stuff in other languages but I can't find the documentation explaining how to do this in APL. Thanks for your help!

To concatenate the letter 'l' to the number 1 to form the characters 'l1' you do this:
'l',⍕1

The system function ⎕FMT can be of use here. For example:
'P<I>ZI7' ⎕FMT ⍳10
I000000
I000001
I000002
I000003
I000004
I000005
I000006
I000007
I000008
I000009
The format string specifies that the numbers are formatted as integers in a field of width 7, zero-filled, with the letter 'I' as a positive left decoration.

I'm on APL2 on the mainframe, so my answer might not be exactly what you're after, but here's how I would do it:
∊⍕¨'L',1
So first catenate the letter and the numeric digits. Then FORMAT EACH to produce a vector of character vectors. Finally, ENLIST to produce a simple vector.
This is a slight generalization of SteveH's reply. It is more general in the sense that it handles input strings (rather than scalars) and works equally well regardless of whether the digit or the letter comes first.

What is lexicographical order?

What is the exact meaning of lexicographical order? How it is different from alphabetical order?

Lexicographical order is alphabetical order. The other type is numerical ordering. Consider the following values:
1, 10, 2
Those values are in lexicographical order. 10 comes after 2 in numerical order, but 10 comes before 2 in "alphabetical" order.
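A quick way to see the difference is to sort the same values once as strings and once as numbers. The snippet below is a small Python illustration; the variable names are arbitrary.

```python
values = ["1", "10", "2"]

lexicographic = sorted(values)      # compares character by character
numeric = sorted(values, key=int)   # compares by numeric value

print(lexicographic)  # ['1', '10', '2']
print(numeric)        # ['1', '2', '10']
```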

Alphabetical order is a specific kind of lexicographical ordering. The term lexicographical often refers to the mathematical rules of sorting. These include, for example, proving logically that sorting is possible. Read more about lexicographical order on Wikipedia.
Alphabetical ordering includes variants that differ in how to handle spaces, uppercase characters, numerals, and punctuation. Purists believe that allowing characters other than a-z makes the sort not "alphabetic", and therefore it must fall into the larger class of "lexicographic". Again, Wikipedia has additional details.
In computer programming, a related question is dictionary order versus ASCII code order. In dictionary order, the uppercase "A" sorts adjacent to lowercase "a". However, in many computer languages, the default string comparison uses ASCII codes. With ASCII, all uppercase letters come before any lowercase letters, which means that "Z" will sort before "a". This is sometimes called ASCIIbetical order.
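Python's default string comparison works on code points, so it shows the ASCIIbetical behaviour directly; sorting with a case-folded key gives something closer to dictionary order. A minimal illustration with an arbitrary word list:

```python
words = ["a", "Z", "A", "z"]

code_point_order = sorted(words)                    # uppercase before lowercase
case_insensitive = sorted(words, key=str.casefold)  # 'a'/'A' adjacent, 'Z'/'z' adjacent

print(code_point_order)  # ['A', 'Z', 'a', 'z']
print(case_insensitive)  # ['a', 'A', 'Z', 'z']
```

The second sort is stable, so words that fold to the same key keep their original relative order.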

This simply means "dictionary order", i.e., the way in which words are ordered in a dictionary. If you were to determine which one of two words would come before the other in a dictionary, you would compare the words letter by letter starting from the first position. For example, the word "children" will appear before (and can be considered smaller than) the word "chill" because the first four letters of the two words are the same but the letter at the fifth position in "children" (i.e. d) comes before (or is smaller than) the letter at the fifth position in "chill" (i.e. l). Observe that lengthwise, the word "children" is bigger than "chill", but length is not the criterion here. For the same reason, an array containing 12345 will appear before an array containing 1235. (Deshmukh, OCP Java SE 11 Programmer I 1Z0815 Study guide 2019)
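The quoted examples can be checked in any language whose sequences compare lexicographically. The lines below use Python (not the Java the study guide is about) purely as an illustration.

```python
# Strings: the first difference is at the fifth letter, 'd' < 'l'.
words_ok = "children" < "chill"

# Lists of digits: the first difference is at the fourth element, 4 < 5.
arrays_ok = [1, 2, 3, 4, 5] < [1, 2, 3, 5]

# Length only matters when one sequence is a prefix of the other.
prefix_ok = "chill" < "chilly"

print(words_ok, arrays_ok, prefix_ok)  # True True True
```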

Lexicographical ordering means dictionary order.
For example, in a dictionary 'ado' comes after 'adieu' because 'o' comes after 'i' in the English alphabet.
This ordering is not based on the length of the strings, but on the letters at the first position where the strings differ.

I want to add an answer that is more related to the programming side of the term than to the mathematical side of it.
Lexicographical order is not always equivalent to "dictionary order"; at least, that definition is not complete in the realm of programming. Rather, it refers to "an ordering based on multiple criteria".
For example, almost all popular programming languages have standard tools for sorting collections of objects. Now what if you want to sort a collection based on more than one thing? For instance, let's say you want to sort some items based on their prices first AND then based on their popularity. This is an example of lexicographical order.
For example, in Java (8+), you could do something like this:
// Sorts items from the cheapest to the most expensive;
// items with equal prices go from the most popular to the least popular.
// (Calling .reversed() at the end of the chain would reverse the whole ordering.)
Collections.sort(items,
    Comparator.comparing(Item::price)
        .thenComparing(Item::popularity, Comparator.reverseOrder())
);
And the Java documentation uses this term too, to refer to this type of ordering when explaining the "thenComparing()" method:
Returns a lexicographic-order comparator with another comparator.
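The same multi-criteria idea can be sketched in Python by sorting on a tuple, since tuples themselves compare lexicographically. The Item class and the sample data below are hypothetical, introduced only for the illustration.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    price: float
    popularity: int

items = [Item("a", 5.0, 10), Item("b", 3.0, 1), Item("c", 3.0, 7)]

# Cheapest first; for equal prices, the most popular first (hence the negation).
ordered = sorted(items, key=lambda it: (it.price, -it.popularity))
names = [it.name for it in ordered]

print(names)  # ['c', 'b', 'a']
```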

Lexicographical order is nothing but dictionary order, that is, the order in which words appear in the dictionary. For example, let's take three strings: "short", "shorthand" and "small". In the dictionary, "short" comes before "shorthand" and "shorthand" comes before "small". This is lexicographical order.

Finding similar strings in large datasets

I'm using Levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.
I'm trying to make this more efficient, and one technique I came up with was to calculate the Levenshtein distance only on strings that are of similar length. I also thought about filtering on the initial character, i.e. if the string to search starts with b, then I'll run the calculation only on the strings that start with b. But I'm not sure if I can assume this will work all the time.
I was wondering if you all have a better way of getting this done?
Thanks

One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
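A minimal Python sketch of this idea follows, under stated assumptions: substrings of length 3, an in-memory dictionary as the "database", and a maximum distance chosen by the caller. It also applies the length filter from the question, which is safe because the edit distance is never smaller than the difference in lengths. All function names are illustrative.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Dynamic-programming edit distance, keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(strings):
    index = defaultdict(set)
    for s in strings:
        for gram in trigrams(s):
            index[gram].add(s)
    return index

def search(query, index, max_distance=2):
    # Candidates are all stored strings sharing at least one trigram with the query.
    candidates = set()
    for gram in trigrams(query):
        candidates |= index.get(gram, set())
    results = []
    for cand in candidates:
        if abs(len(cand) - len(query)) > max_distance:
            continue  # the distance cannot be within the limit
        d = levenshtein(query, cand)
        if d <= max_distance:
            results.append((d, cand))
    return sorted(results)

index = build_index(["banana", "bandana", "cabana", "apple"])
print(search("banana", index))
# [(0, 'banana'), (1, 'bandana'), (2, 'cabana')]
```

As the answer warns, this can miss close matches that share no trigram with the query, and strings shorter than three characters are never indexed.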
As an alternative to building a database of substrings, you could build a suffix array (http://en.wikipedia.org/wiki/Suffix_array) and LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for entries in the suffix array starting with ABCDEF, BCDEF, CDEF, and DEF.

How will you sort strings in the following example?

So I have a list of strings:
{test,testertest,testing,tester,testingtest}
I want to sort it in descending order. How do you sort strings in general? Is it based on the length, or is it character by character?
How would it be in the example above? I want to sort them in descending order.

Nearly every language has a built-in sort function that orders strings lexicographically, which returns
['test','tester','testertest','testing','testingtest']
for your example. If I wanted this reversed, I would just write sorted(myList, reverse=True) in Python and be done with it. Related questions cover more specialized ordering methods (for numbers, dates, etc.), but lexicographic order works on strings containing any kind of data.
Here’s how it works:
compare(string A, string B):
    if A and B are both non-empty:
        if A[0] == B[0]:
            // First letters are the same; compare by the rest
            return compare(A[1:], B[1:])
        else:
            // Compare the first letters by Unicode code point
            return sign of (code point of A[0] - code point of B[0])
    else:
        // They were equal till now; the shorter one shall be sorted first
        return sign of (length of A - length of B)
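For readers who want to run it, the pseudocode above might be written in Python as follows. This is an iterative sketch rather than a literal translation, and in practice the built-in string comparison already behaves this way; the variable names are arbitrary.

```python
from functools import cmp_to_key

def compare(a, b):
    """Return -1, 0 or 1, comparing code point by code point."""
    for ca, cb in zip(a, b):
        if ca != cb:
            return -1 if ord(ca) < ord(cb) else 1
    # One string is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))

words = ["test", "testertest", "testing", "tester", "testingtest"]
ascending = sorted(words, key=cmp_to_key(compare))
descending = sorted(words, key=cmp_to_key(compare), reverse=True)

print(ascending)   # ['test', 'tester', 'testertest', 'testing', 'testingtest']
print(descending)  # ['testingtest', 'testing', 'testertest', 'tester', 'test']
```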

I would sort it like this:
testingtest
testing
testertest
tester
test

Assuming C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
Array.Reverse(myStrings);
foreach (string s in myStrings)
{
    Console.WriteLine(s);
}
Not always an ideal way to do it - you could implement a custom comparer instead - but for the trivial example you asked about this is probably the most logical approach.

In computer science, strings are usually sorted character by character, with the preferred sort order being (for a standard English character set):
Null characters first
Followed by whitespace
Followed by symbols
Followed by numeric characters in obvious numerical order
Followed by alphabetic characters in obvious alphabetical order
Whether lowercase characters come before uppercase ones depends on the comparison used: in plain ASCII or Unicode code point order uppercase letters come first, while many culture-aware comparisons place lowercase first.
So for example if we were to sort / compare:
test i ng
test e r
Then "tester" would come before "testing": the first different character in the strings is the 5th one, and "e" comes before "i".
Similarly, if we were to compare:
test
testing
Then in this case "test" would come first: once again the strings are identical until the 5th character, where the string "test" ends (i.e. no character), which comes before any alphanumeric character.
Note that this can produce some counter-intuitive results when dealing with numbers. For example, try sorting the strings "50" and "100": you will find that "100" comes before "50". Why? Because the strings differ at character 1 and "5" comes after "1".
In nearly all languages there is a function which will do all of the above for you!
You should use that function instead of trying to sort strings yourself! For example:
// C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);

In Java you can use natural ordering with
java.util.Collections.sort(list);
then, to make it descending,
java.util.Collections.reverse(list);
or create your own Comparator to do the reverse sorting.

When comparing two strings to see which sorts first, the comparison is typically done on a character by character basis. If the characters in the first position (e.g., t in your example) are identical, you move to the next character. When two characters differ, that "may" define which string is considered "greater".
However, depending on the locale used and a number of other factors, it is possible for later characters in the two strings being compared to override a difference in an earlier character. For example, in some collations, the diacritics on letters are considered to be of secondary weight. So a primary difference in a later character can override the secondary difference.
When two strings are otherwise identical but one is longer, the longer one is typically considered to be "greater". When sorting in descending order, the "greater" of two strings is sorted first.

Do you want to know if test should appear after tester in descending order? Or are you particularly interested in sorting strings with similar prefixes?
If it's the latter, I'd suggest a trie if the input tends to grow large.
