Compare strings binary (& not alphanumeric) - string

How do you compare strings binary (not alphanumeric) ??
Torrent spec:
Keys must be strings and appear in sorted order (sorted as raw
strings, not alphanumerics). The strings should be compared using a
binary comparison, not a culture-specific "natural" comparison.
So i need to sort a dict by key... but i dont get this spec..
Explanations ..anyone?
Update: acordingly to: http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch5lingsort.htm
Using Binary Sorts
One way to sort character data is based on the numeric values of the
characters defined by the character encoding scheme. This is called a
binary sort. Binary sorts are the fastest type of sort. They produce
reasonable results for the English alphabet because the ASCII and
EBCDIC standards define the letters A to Z in ascending numeric value.
Note: In the ASCII standard, all uppercase letters appear before any
lowercase letters. In the EBCDIC standard, the opposite is true: all
lowercase letters appear before any uppercase letters.
When characters used in other languages are present, a binary sort
usually does not produce reasonable results. For example, an ascending
ORDER BY query returns the character strings ABC, ABZ, BCD, ÄBC, when
Ä has a higher numeric value than B in the character encoding scheme.
A binary sort is not usually linguistically meaningful for Asian
languages that use ideographic characters.
So basically it the same result for english as alfabetically sorting..
Nice..

Any standard sort routine should work, as long as you ensure the characters are treated as bytes.

Related

How do I concatenate combinations of letters and numbers in APL?

I'm in Dyalog 17 and would like to generate unique names to be used with its graphics object library. So, for example, I have the letter 'l' and want to take the number 1, convert it to a character and then concatenate the two together to form 'l1'. This is such trivial stuff in other languages but I can't find the documentation explaining how to do this in APL. Thanks for your help!
To concatenate the letter 'l' to the number 1 to form the characters 'l1' you do this:
'l',⍕1
The system function ⎕FMT can be of use here. For example:
'P<I>ZI7' ⎕FMT ⍳10
I000000
I000001
I000002
I000003
I000004
I000005
I000006
I000007
I000008
I000009
The format string specifies to format the numbers as integers, in width of 7, zero filling, with a positive left decoration of the letter'I'.
I'm on APL2 in the Mainframe, so my answer might not be exactly what you're after, but here's how I would do it:
∊⍕¨'L',1
So first catanate the letter and the numeric digits. Then FORMAT EACH to produce a vector of character scalars. Finally, ENLIST to produce a simple vector.
This is a slight generalization of SteveH's reply. More general in the sense that it handles input strings (rather than scalars) and works equally well regardless if the digit or letter comes first.

What is lexicographical order?

What is the exact meaning of lexicographical order? How it is different from alphabetical order?
lexicographical order is alphabetical order. The other type is numerical ordering. Consider the following values,
1, 10, 2
Those values are in lexicographical order. 10 comes after 2 in numerical order, but 10 comes before 2 in "alphabetical" order.
Alphabetical order is a specific kind of lexicographical ordering. The term lexicographical often refers to the mathematical rules or sorting. These include, for example, proving logically that sorting is possible. Read more about lexicographical order on wikipedia
Alphabetical ordering includes variants that differ in how to handle spaces, uppercase characters, numerals, and punctuation. Purists believe that allowing characters other than a-z makes the sort not "alphabetic" and therefore it must fall in to the larger class of "lexicographic". Again, wikipedia has additional details.
In computer programming, a related question is dictionary order or ascii code order. In dictionary order, the uppercase "A" sorts adjacent to lowercase "a". However, in many computer languages, the default string compare will use ascii codes. With ascii, all uppercase letters come before any lowercase letters, which means that that "Z" will sort before "a". This is sometimes called ASCIIbetical order.
This simply means "dictionary order", i.e., the way in which words are ordered in a dictionary. If you were to determine which one of the two words would come before the other in a dictionary, you would compare the words letter by the letter starting from the first position. For example, the word "children" will appear before (and can be considered smaller) than the word "chill" because the first four letters of the two words are the same but the letter at the fifth position in "children" (i.e. d ) comes before (or is smaller than) the letter at the fifth position in "chill" (i.e. l ). Observe that lengthwise, the word "children" is bigger than "chill" but length is not the criteria here. For the same reason, an array containing 12345 will appear
before an array containing 1235. (Deshmukh, OCP Java SE 11 Programmer I 1Z0815 Study guide 2019)
Lexicographical ordering means dictionary order.
For ex: In dictionary 'ado' comes after 'adieu' because 'o' comes after 'i' in English alphabetic system.
This ordering is not based on length of the string, but on the occurrence of the smallest letter first.
I want to add an answer that is more related to the programming side of the term rather than the mathematical side of it.
Lexicographical order is not always an equivalent of "dictionary order", at least this definition is not complete in the realm of programming, rather, it refers to "an ordering based on multiple criteria".
For example, almost in all famous programming languages, there are standard tools for sorting collections of objects, now what if you want to sort a collection based on more than one thing? For instance, let's say you want to sort some items based on their prices first AND then based on their popularity. This is an example of Lexicographical Order.
For example in Java (8+), you could do something like this:
// sorts items from the cheapest AND the most popular ones
// towards the most expensive AND the least popular ones.
Collections.sort(items,
Comparator.comparing(Item::price)
.thenComparing(Item::popularity)
.reversed()
);
And the Java documentation uses this term too, to refer to such type of ordering when explaining the "thenComapring()" method:
Returns a lexicographic-order comparator with another comparator.
Lexicographical order is nothing but the dictionary order or preferably the order in which words appear in the dictonary. For example, let's take three strings, "short", "shorthand" and "small". In the dictionary, "short" comes before "shorthand" and "shorthand" comes before "small". This is lexicographical order.

Convert string into fixed length numbers and convert it back

I have more than 100 cpp files. I need to assign unique ID to each of them. I aslo need to know which file it is based on their ID. I found the maximum length of file's name contains 64 characters and the ID can only be at most 8 bytes long. Is there any algorithm can help to assign unique ID to source file in VS2013 in C++ and can also let user know which file it is based on the ID ?
Just store a mapping between filename and an integer.
-----Yes, this way is very simple. But every time when people create new course files, the mapping need to be re-coded. So I won't use this way.
HERE IS THE ORIGINAL QUESTION SO THAT THE COMMENTS BELOW MAKE SENSE
Now I have a bunch of strings, like "AAA", or "ABBCCHH". The maximum of string contains 64 characters. Now I need an algorithm which can convert string into numbers( not must be integer, double float is also acceptable). But the length of numbers must be fixed. For example, if "A" is convert into 12312, 5 digits, "ABBHGGH" should also have 5 digits after converted. And these numbers can also be converted back to original strings. Is there any algorithms can do that ? The converted number cannot over 8 bytes. That's why I cannot just use ASCII etc simple algorithm. I don't know which algorithm can do that.
To generate unique IDs of an arbitrary set of filenames (the actual question here), you could use a cryptographic hash (SHA-1, -256, -384, -512). This will result in a unique, fixed-length hexadecimal output. If you can't allow the characters a-f in the output, you can convert the hexadecimal value to decimal.
This process is not reversible, but you can maintain a map (lookup table) of the input values to the IDs.
If you want a simpler solution, just hexadecimal encode the filenames. This is reversible. (You can add the hex -> decimal conversion here if necessary as well).

How to set Locale on GSON (for decimal number separators)

I know you can set date formatting, but how do you set other locale specific stuff like decimal number formatting? (I mean comma vs. dot)
Short answer: you don't.
You seem to be somewhat confusing the textual representation of a number and an actual numeric value.
While JSON is a text-based data interchange format, a number (vs. a string) in JSON is a numeric value and is not affected by locale. Section 2.4 of the JSON specification provides the specific definition (emphasis mine):
2.4. Numbers
The representation of numbers is similar to that used in most
programming languages. A number contains an integer component that
may be prefixed with an optional minus sign, which may be followed by
a fraction part and/or an exponent part.
Octal and hex forms are not allowed. Leading zeros are not
allowed.
A fraction part is a decimal point followed by one or more digits.
An exponent part begins with the letter E in upper or lowercase,
which may be followed by a plus or minus sign. The E and optional
sign are followed by one or more digits.
Numeric values that cannot be represented as sequences of digits
(such as Infinity and NaN) are not permitted.
Given the above, something such as {"my_double":3,2} is not valid JSON. {"my_double":3.2} is.
When parsing JSON, a parser is going to store numbers to a primitive data type (int or double). Your locale will then display those properly using the normal methods for converting them to strings; Integer.toString(myInt),String.valueOf(myDouble), etc.

How will you sort strings in the following example?

so i have a list of string
{test,testertest,testing,tester,testingtest}
I want to sort it in descending order .. how do u sort strings in general ? Is it based on the length or is it character by character ??
how would it be in the example above ?? I want to sort them in a descending way.
No matter what language you’re in, there’s a built-in sort function that performs a lexicographical order, which returns
['test','tester','testertest','testing','testingtest']
for your example. If I wanted this reversed, I would just say reversed(sorted(myList)) in Python and be done with it. If you look to your right you can see plenty of related questions that require a more specialized ordering method (for numbers, dates, etc.), but lexicographic order works on strings containing any kind of data.
Here’s how it works:
compare(string A, string B):
if A and B are both non-empty:
if A[0] == B[0]:
// First letters are the same; compare by the rest
return compare(A[1:], B[1:])
else:
// Compare the first letters by Unicode code point
return compare(A[0], B[0])
else:
// They were equal till now; the shorter one shall be sorted first
return compare(length of A, length of B)
I would sort it like this:
testingtest
testing
testertest
tester
test
Assuming C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
Array.Reverse(myStrings);
foreach(string s in myStrings)
{
Console.WriteLine(s);
}
Not always an ideal way to do it - you could implement a custom comparer instead - but for the trivial example you asked about this is probably the most logical approach.
In computer science strings are usually sorted character by character, with the preferred sort order being (for a standard english character set):
Null characters first
Followed by whitepsace
Followed by symbols
Followed by numeric characters in obvious numerical order
Followed by alphabetic characters in obvious alphabetical order
When sorting characters generally lowercase characters come before uppercase characters.
So for example if we were to sort / compare:
test i ng
test e r
Then "tester" would come before "testing" - the first different character in the string is the 5th one, and "e" comes before "i".
Similarily if we were to compare:
test
testing
Then in this case "test" would come first - once again the strings are identical until the 5th character, where the string "test" ends (i.e. no character) which becomes before any alphanumerical character.
Note that this can produce some counter-intutive results when dealing with numbers - for example try sorting the strings "50" and "100" - you will find that "100" comes before "50". Why? because the strings differ at character 1 and "5" comes after "1".
In nearly all languages there is a function which will do all of the above for you!
You should use that function instead of trying to sort strings yourself! For example:
// C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
in Java you can use natural ordering with
java.util.Collections.sort(list);
the make it descending
java.util.Collections.reverse(list);
or create your own Comparator to do the reverse sorting.
When comparing two strings to see which sorts first, the comparison is typically done on a character by character basis. If the characters in the first position (e.g., t in your example) are identical, you move to the next character. When two characters differ, that "may" define which string is considered "greater".
However, depending on the locale used and a number of other factors, it is possible for later characters in the two strings being compared to override a difference in an earlier character. For example, in some collations, the diacritics on letters are considered to be of secondary weight. So a primary difference in a later character can override the secondary difference.
When two strings are otherwise identical but one is longer, the longer one is typically considered to be "greater". When sorting in descending order, the "greater" of two strings is sorted first.
Do you want to know if test should appear after tester in a descending order? Or are you particularly interested in sorting strings with similar prefixes?
If it's the later, I'd suggest a Trie if the input tends to grow big time.

Resources