Convert string into fixed length numbers and convert it back - string

I have more than 100 cpp files. I need to assign a unique ID to each of them, and I also need to know which file it is based on its ID. The longest file name is 64 characters, and the ID can be at most 8 bytes long. Is there any algorithm that can assign a unique ID to each source file (C++, VS2013) and also let the user know which file it is based on the ID?
Just store a mapping between filename and an integer.
-----Yes, this way is very simple. But every time someone creates a new source file, the mapping needs to be re-coded, so I won't use this approach.
HERE IS THE ORIGINAL QUESTION SO THAT THE COMMENTS BELOW MAKE SENSE
Now I have a bunch of strings, like "AAA" or "ABBCCHH". The longest string contains 64 characters. I need an algorithm that converts a string into a number (it doesn't have to be an integer; a double is also acceptable), but the length of the number must be fixed. For example, if "A" is converted into 12312 (5 digits), "ABBHGGH" should also produce 5 digits. These numbers must also be convertible back to the original strings. Is there any algorithm that can do that? The converted number cannot exceed 8 bytes, which is why I cannot simply encode the ASCII values. I don't know which algorithm can do that.

To generate unique IDs for an arbitrary set of filenames (the actual question here), you could use a cryptographic hash (SHA-1, -256, -384, -512). This will produce a fixed-length hexadecimal output that is unique for all practical purposes. If you can't allow the characters a-f in the output, you can convert the hexadecimal value to decimal.
This process is not reversible, but you can maintain a map (lookup table) of the input values to the IDs.
If you want a simpler solution, just hexadecimal encode the filenames. This is reversible. (You can add the hex -> decimal conversion here if necessary as well).
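A minimal sketch of the hash-plus-lookup-table idea, using a 64-bit FNV-1a hash instead of the cryptographic hashes named above so the ID fits the 8-byte limit directly (the helper names here are illustrative, not from any particular library):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

// 64-bit FNV-1a hash: fixed-size output that fits the 8-byte ID limit.
// Not a cryptographic hash; with ~100 short filenames collisions are
// very unlikely, but a real build step should still check for them.
static std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ull;                  // FNV prime
    }
    return h;
}

// Lookup table so an ID can be mapped back to its filename.
static std::unordered_map<std::uint64_t, std::string> id_to_file;

static std::uint64_t assign_id(const std::string& filename) {
    const std::uint64_t id = fnv1a64(filename);
    id_to_file[id] = filename;   // in practice, flag a collision here
    return id;
}

int main() {
    const std::uint64_t id = assign_id("Widget.cpp");
    std::printf("%llu -> %s\n",
                static_cast<unsigned long long>(id), id_to_file[id].c_str());
}
```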

Related

Fixed vs variable length STRING field definitions in ECL

I have a question regarding STRING field definitions.
Am I better off to fully qualify my STRING fields or allow them to be variable length?
For example I am working with a data file which contains multiple string data elements which can be up to 1000 characters in length.
When I define the ECL fields as STRING1000 the strings are padded and difficult to view in ECL Watch.
If I define the ECL fields simply as STRING, the string fields are adjusted to the length of the field value and much easier to read in ECL Watch.
With regards to my question, does either option affect the size of my dataset in memory or on disk?
What is the best practice I should follow?
The standard answer to this question is:
IF you know the string is always going to contain n number of characters (like a US state code or zipcode field) OR the string will always contain 1 to n characters where n is a small number and the average length of the actual data approaches the max (like most street address fields) THEN you should define that field as a STRINGn. ELSE IF n is a large number and the average length of the data is small compared to the maximum THEN variable-length STRING would be best.
Both options affect the storage and memory size:
Fixed-length fields are always stored at their defined length.
Variable-length STRING fields are stored with a leading 4-byte integer indicating the actual number of characters following that instance (like a Pascal string).
Therefore, if you define a string field that always contains 2 characters as a STRING2 it occupies two bytes of storage, but define it as a STRING and it will occupy six.
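A rough way to picture those two layouts, sketched here as C++ structs purely for illustration (this is not how the ECL platform literally declares its fields):

```cpp
#include <cstdint>
#include <cstdio>

// STRING2: a fixed-length field is always stored at its defined length,
// so a 2-character value occupies exactly 2 bytes (padded if shorter).
struct FixedString2 {
    char data[2];
};

// STRING (variable length): a leading 4-byte count followed by that many
// characters, like a Pascal string. Storing "NY" this way costs
// 4 (count) + 2 (characters) = 6 bytes.
struct VarStringPrefix {
    std::uint32_t length;   // the character data follows this count
};

int main() {
    std::printf("STRING2 value: %u bytes\n",
                (unsigned)sizeof(FixedString2));
    std::printf("STRING value \"NY\": %u bytes\n",
                (unsigned)(sizeof(VarStringPrefix) + 2));
}
```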

Most efficient way to store a string in bytes?

Say I have a simple bytecode-like file format for saving data.
If I want to store a string, should I do it like in source files, where the string is everything that appears between a certain delimiter byte, or should I first store the length of the string and then the string bytes?
Or are both solutions horrible, and if so, which one should I use?
It depends on whether you want to store:
a single string
a number of strings
different length strings
all the same length
For all of the above, it may also matter if your strings contain:
any characters
only certain characters
formatting
In general, you should use Unicode.
For a single string, you can simply use an entire file to contain the string; the end of the file is also the end of the string. No need to store the length of the string.
If the strings aren't all (around) the same length you can use an inline separator to separate the strings. Often the newline character is useful for this (especially since a lot of programming languages support this way of reading in a file line-by-line), but other markers such as tab are common.
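For example, a minimal C++ sketch of reading newline-separated strings back (the file name is a placeholder):

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("strings.txt");     // hypothetical input file
    std::vector<std::string> strings;
    std::string line;
    while (std::getline(in, line))       // the newline is the separator
        strings.push_back(line);
    std::cout << "read " << strings.size() << " strings\n";
}
```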
CSV text files often use double quotes to enclose strings that contain commas (or another column separator, which would otherwise indicate that the next column value was starting) or line breaks (which would otherwise indicate the next row).
Of course, now you have the problem of how to store a double quote in your string.
If you want to store formatting, you can use a markup language (HTML), or it may be enough to allow line breaks and/or some markdown.
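For the length-first option the question mentions, here is a minimal C++ sketch of a length-prefixed layout (a 4-byte count followed by the raw bytes; the file name and the use of native byte order are assumptions for the example):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Write a 4-byte length followed by the raw bytes of the string.
// (Native byte order is assumed; a portable format would fix one.)
void write_string(std::ofstream& out, const std::string& s) {
    std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(s.data(), len);
}

// Read it back: the length first, then exactly that many bytes.
std::string read_string(std::ifstream& in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    std::string s(len, '\0');
    in.read(&s[0], len);
    return s;
}

int main() {
    {
        std::ofstream out("data.bin", std::ios::binary);
        write_string(out, "hello, \"quoted\" text\nwith a newline");
    }
    std::ifstream in("data.bin", std::ios::binary);
    std::string s = read_string(in);   // no escaping or delimiters needed
}
```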

Why strings cannot be indexed by integer values

I learned that Swift strings cannot be indexed by integer values. I remember it and I follow the rule, but I've never fully understood the mechanics behind it.
The explanation from the official documentation is as follows:
"Different characters can require different amounts of memory to store, so in order to determine which Character is at a particular position, you must iterate over each Unicode scalar from the start or end of that String. For this reason, Swift strings cannot be indexed by integer values"
I've read it several times, but I still don't quite get the point. Can someone explain to me a bit more why Swift strings cannot be indexed by integer values?
Many Thanks
A string is stored in memory as an array of bytes.
A given character can require 1 to 4 bytes for the basic code point, plus any number of combining diacritical marks.
For example, é requires 2 bytes.
Now, if you have the strings efgh and éfgh and you want to access the second character (f): in the first string that character is at index 1 in the byte array, but in the second string it is at index 2.
In order to know that, you need to inspect the first character. To access any character by its index, you have to go through all of the previous characters to know how many bytes each one takes.
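A small C++ sketch of the same UTF-8 layout (not Swift, but the byte arithmetic is identical), showing why f sits at byte 1 in efgh but at byte 2 in éfgh:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Bytes occupied by the UTF-8 code point whose first byte is `lead`.
static std::size_t utf8_len(unsigned char lead) {
    if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    return 4;                            // 11110xxx
}

int main() {
    // "éfgh": é is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9.
    std::vector<std::string> samples = { "efgh", "\xC3\xA9" "fgh" };
    for (const std::string& s : samples) {
        std::cout << s.size() << " bytes, characters start at byte offsets:";
        // To reach the n-th character you must walk the previous code
        // points, because each one may occupy anywhere from 1 to 4 bytes.
        for (std::size_t i = 0; i < s.size(); i += utf8_len((unsigned char)s[i]))
            std::cout << ' ' << i;
        std::cout << '\n';   // efgh -> 0 1 2 3, éfgh -> 0 2 3 4
    }
}
```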

Parsing apart a decimal into two integers in either Stata or Excel

I'm working with a dataset that has really terrible ID numbers that are an integer followed by a 13 digit decimal. However, the first 6-7 decimal places are zeroes. For example:
10.0000000960554
This is making my life difficult. So I want to parse the IDs apart at the decimal point into two integers, drop the leading zeros, and put them back together as one giant integer. However, everything I find on how to do this in Excel keeps the digits after the decimal point as a decimal fraction. In Stata, I've tried to convert the numeric variable into a string so I can then parse it, but Stata won't let me because it's numeric:
encode ScrambledID, generate(StringID)
Here's the error:
not possible with numeric variable
r(107);
An added issue, I can't just split the decimal in Excel and then multiply by 1e+12 because it messes with the values (long story with how they were derived).
Like I said, I'm fine with doing this in either Stata or Excel. Either way this is driving me nuts.
In Excel:
In one column put:
=int(A1)
In the next put:
=--MID(A1,FIND(".",A1)+1,999)
As #Grade'Eh'Bacon stated, I have used a few shortcuts in the above formula. The -- at the beginning changes text that looks like a number into a number; it replaces the VALUE() function.
The 999 is a deliberately oversized number; it assumes the string being split is no longer than 999 characters. It can be replaced with the LEN() function, which returns the actual length of the string.
So putting the two together:
=VALUE(MID(A1,FIND(".",A1)+1,LEN(A1)))
Where A1 is the location of the number
Your story is truly shocking.
I'd advise extreme caution in any software. For a start, numbers with decimal parts will be rendered differently depending on whether they are imported as 4-byte or 8-byte reals, in Stata terms as floats or doubles. The underlying problem is that many decimal numbers have no exact binary representation.
In Stata terms, encode is indeed out of the question for a numeric variable (and your example would also fail for other reasons). But ideally you should import the identifiers as strings in the first place. Otherwise you should try a conversion such as generate stringID = string(numid, "%16.13f").
. di %21s string(10.0000000960554, "%16.13f")
10.0000000960554
. di %21s string(10.00000009605539, "%16.13f")
10.0000000960554
. di %21s string(10.00000009605544, "%16.13f")
10.0000000960554
. di %21s string(10.00000009605535, "%16.13f")
10.0000000960554

Compare strings binary (& not alphanumeric)

How do you compare strings in a binary way (not alphanumerically)?
Torrent spec:
Keys must be strings and appear in sorted order (sorted as raw
strings, not alphanumerics). The strings should be compared using a
binary comparison, not a culture-specific "natural" comparison.
So I need to sort a dict by key... but I don't get this spec.
Explanations, anyone?
Update: according to http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch5lingsort.htm
Using Binary Sorts
One way to sort character data is based on the numeric values of the
characters defined by the character encoding scheme. This is called a
binary sort. Binary sorts are the fastest type of sort. They produce
reasonable results for the English alphabet because the ASCII and
EBCDIC standards define the letters A to Z in ascending numeric value.
Note: In the ASCII standard, all uppercase letters appear before any
lowercase letters. In the EBCDIC standard, the opposite is true: all
lowercase letters appear before any uppercase letters.
When characters used in other languages are present, a binary sort
usually does not produce reasonable results. For example, an ascending
ORDER BY query returns the character strings ABC, ABZ, BCD, ÄBC, when
Ä has a higher numeric value than B in the character encoding scheme.
A binary sort is not usually linguistically meaningful for Asian
languages that use ideographic characters.
So basically it's the same result for English as alphabetical sorting.
Nice.
Any standard sort routine should work, as long as you ensure the characters are treated as bytes.
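For instance, in C++ the default std::string ordering is already such a binary comparison (the standard defines char_traits<char> comparisons in terms of unsigned char), so a std::map keyed on std::string iterates in exactly that order; a small sketch:

```cpp
#include <iostream>
#include <map>
#include <string>

int main() {
    // std::string's default ordering goes through char_traits<char>, which
    // compares characters as unsigned char values: a raw byte-wise
    // ("binary") comparison with no locale or culture involved.
    std::map<std::string, std::string> dict = {
        { "spam", "value 1" },
        { "Spam", "value 2" },          // 'S' (0x53) sorts before 's' (0x73)
        { "\xC3\xA9gal", "value 3" }    // bytes >= 0x80 sort after ASCII keys
    };
    // Iterating the map visits the keys in that binary order,
    // which is the ordering the bencode/torrent spec asks for.
    for (const auto& kv : dict)
        std::cout << kv.first << '\n';  // Spam, spam, égal
}
```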
