Most efficient way to store a string in bytes?

Say I have a simple bytecode-like file format for saving data.
If I want to store a string, should I do it as in source files, where all characters between certain delimiter bytes are the string,
or should I first store the length of the string and then the string bytes?
Or are both solutions horrible, and if so, what should I use instead?

It depends on whether you want to store:
a single string
a number of strings
different length strings
all the same length
For all of the above, it may also matter if your strings contain:
any characters
only certain characters
formatting
In general, you should use a Unicode encoding such as UTF-8.
For a single string, you can simply use the entire file to hold the string; the end of the file then marks the end of the string, so there is no need to store the length.
If the strings aren't all (roughly) the same length, you can use an inline separator between them. The newline character is often useful for this (especially since many programming languages support reading a file line by line), but other markers such as the tab character are also common.
CSV text files often use double quotes to enclose strings that contain commas (or whatever the column separator is), which would otherwise signal the start of the next column value, or line breaks, which would otherwise signal the next row.
Of course, now you have the problem of how to store a double quote in your string.
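As a rough illustration of how CSV handles that problem, here is a minimal Python sketch using the standard csv module, which by default doubles any quote character inside a quoted field. The helper names write_row and read_row are made up for this example.

```python
import csv
import io


def write_row(fields):
    """Serialize one row; fields containing commas, quotes or newlines get quoted."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


def read_row(line):
    """Parse one serialized row back into a list of strings."""
    return next(csv.reader(io.StringIO(line, newline="")))


row = ['say "hi"', "a,b", "two\nlines", "plain"]
encoded = write_row(row)
# An embedded double quote is written as two double quotes inside a quoted field.
decoded = read_row(encoded)
```

Reading the encoded row back yields the original fields, so the separator, the quote character and the line break all survive the round trip.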
If you want to store formatting, you can use a markup language (such as HTML), or it may be enough to allow line breaks and/or some Markdown.
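For comparison, the length-prefixed option raised in the question could look something like the Python sketch below. The 4-byte little-endian length field and the UTF-8 encoding are assumptions chosen for the example, not requirements, and the function names are hypothetical.

```python
import struct

LENGTH_FORMAT = "<I"  # assumed: 4-byte unsigned little-endian length prefix
LENGTH_SIZE = struct.calcsize(LENGTH_FORMAT)


def encode_string(text):
    """Return the byte length of the UTF-8 data followed by the data itself."""
    data = text.encode("utf-8")
    return struct.pack(LENGTH_FORMAT, len(data)) + data


def decode_strings(blob):
    """Read back every length-prefixed string stored in blob."""
    strings = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(LENGTH_FORMAT, blob, offset)
        offset += LENGTH_SIZE
        strings.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return strings


blob = encode_string("héllo") + encode_string("") + encode_string("a\nb")
restored = decode_strings(blob)
```

Because the reader knows how many bytes to take, the strings may contain any character, including newlines and quotes, without escaping. The price is a few extra bytes per string and a format that is no longer readable as plain text.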

Related

How to Automatically add thousand separators for every number in a string?

How can I create a thousand separator for every number in my string?
So for example this string:
string = "123456,78+1234"
should be displayed as:
TextView = "123.456,78+1.234"
And the string should be editable, so the thousand separators should adapt when I remove or add a digit.
I have already read all the posts I could find about it, but I could never find an up-to-date working answer. So I would be really grateful for your help!
Your question contains two sub-questions:
A. You want to add thousand separators to a string which contains a group of numbers.
B. You want it to change.
And the answers are:
A: In your example the comma (,) acts as a delimiter, so you need to split the string on this delimiter into an array of strings.
Then iterate over them and add a dot at every third position of their digits, counting from the right; you can also use String.format("%,d", substr.toLong()), which inserts the grouping separator of the locale in use.
Lastly, join all of the strings back together with , as the separator.
B: This can be done in different ways. You could store the original string somewhere and observe it, so that whenever it changes it is passed to the function that does A, and then use the result however you like (which I suppose means setting it on a TextView).
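A rough sketch of part A, written in Python rather than the Kotlin/Java of the question. Note that it swaps the split-on-comma approach for a regular expression that finds each number, so that operators such as + are left alone. It assumes that the input contains no thousand separators yet, that "," is the decimal separator and that "." is the thousand separator; the function names are hypothetical.

```python
import re


def group_digits(digits):
    """Insert '.' between groups of three digits, counting from the right."""
    head = len(digits) % 3 or 3
    parts = [digits[:head]]
    parts += [digits[i:i + 3] for i in range(head, len(digits), 3)]
    return ".".join(parts)


def add_thousand_separators(text):
    """Group the integer part of every number; decimals after ',' stay untouched."""
    def replace(match):
        integer_part = match.group(1)
        decimals = match.group(2) or ""
        return group_digits(integer_part) + decimals

    return re.sub(r"(\d+)(,\d+)?", replace, text)


formatted = add_thousand_separators("123456,78+1234")
```

For part B, one option is to keep the unformatted string as the source of truth and call the function again on every change, so the displayed text is always derived from the raw input.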

Extract Number from string into a list in Scala

I have the following string :
var myStr = "abc12ef4567gh90ijkl789"
The size of the list is not fixed and the string contains numbers in between. I want to extract the numbers and store them in the form of a list in this manner:
List(12,4567,90,789)
I tried the solution mentioned here but cannot extend it to my case. I just want to know if there is a faster or more efficient solution than traversing the string and extracting the numbers one by one using brute force. Also, the string can be of arbitrary length.
It seems you may just collect the numbers using
("""\d+""".r findAllIn myStr).toList
See the Scala demo. \d+ matches one or more digits, findAllIn searches for multiple occurrences of the pattern inside a string (and also un-anchors the pattern so that partial matches could be found).
If you prefer a splitting approach, you might use
myStr.split("\\D+").filter(_.nonEmpty).toList
See another demo. Here, \D+ matches one or more non-digit chars, and these chunks are used to split on (texts between these chunks land in the result). .filter(_.nonEmpty) will remove empty items that usually appear due to matches at the start/end of the string.

Convert string into fixed length numbers and convert it back

I have more than 100 cpp files. I need to assign a unique ID to each of them. I also need to be able to tell which file it is based on its ID. The longest file name contains 64 characters, and the ID can be at most 8 bytes long. Is there any algorithm that can assign a unique ID to each source file in VS2013 in C++ and also let the user know which file it is based on the ID?
Just store a mapping between filename and an integer.
-----Yes, this way is very simple. But every time people create new source files, the mapping needs to be re-coded. So I won't use this approach.
HERE IS THE ORIGINAL QUESTION SO THAT THE COMMENTS BELOW MAKE SENSE
Now I have a bunch of strings, like "AAA" or "ABBCCHH". The longest string contains 64 characters. I need an algorithm that can convert a string into a number (it need not be an integer; a double or float is also acceptable), but the length of the number must be fixed. For example, if "A" is converted into 12312 (5 digits), then "ABBHGGH" should also have 5 digits after conversion. These numbers must also be convertible back to the original strings. Is there an algorithm that can do that? The converted number cannot exceed 8 bytes, which is why I cannot just use a simple scheme such as ASCII codes. I don't know which algorithm can do that.
To generate unique IDs for an arbitrary set of filenames (the actual question here), you could use a cryptographic hash (SHA-1, SHA-256, SHA-384, SHA-512). This produces a fixed-length output, usually written in hexadecimal, that is unique for all practical purposes (collisions are possible in principle but extremely unlikely for inputs like filenames). If you can't allow the characters a-f in the output, you can convert the hexadecimal value to decimal.
This process is not reversible, but you can maintain a map (lookup table) from the IDs back to the input values.
If you want a simpler solution, just hexadecimal-encode the filenames. This is reversible, although the output length then grows with the filename. (You can add the hex -> decimal conversion here as well if necessary.)
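Both ideas can be sketched in a few lines of Python (used here for brevity, although the question is about C++). Truncating the SHA-256 digest to its first 8 bytes is an assumption made to fit the 8-byte limit from the question; it is not part of the answer above, and it makes collisions more likely than with the full digest, although still very unlikely for a few hundred files. All function names are hypothetical.

```python
import hashlib


def file_id(filename):
    """Derive a fixed-size 64-bit ID from a filename (assumed truncation to 8 bytes)."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_lookup(filenames):
    """Map each ID back to its filename, since the hash itself is not reversible."""
    table = {}
    for name in filenames:
        fid = file_id(name)
        if fid in table and table[fid] != name:
            raise ValueError(f"ID collision between {table[fid]!r} and {name!r}")
        table[fid] = name
    return table


def hex_encode(filename):
    """Reversible alternative: two hex digits per byte, so the length is not fixed."""
    return filename.encode("utf-8").hex()


def hex_decode(encoded):
    return bytes.fromhex(encoded).decode("utf-8")


lookup = build_lookup(["main.cpp", "parser.cpp", "lexer.cpp"])
```

The lookup table can be rebuilt automatically from the list of source files, so nothing has to be re-coded by hand when a new file is added.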

String Comparison ignoring special characters C#

I am creating a generic List of unique strings. My strings look like GBP/101-P506, but sometimes they could be GBP-101-P-506. Both of these strings have to be considered the SAME. How can I compare such strings?
The most straightforward way would be to replace the special characters with empty strings and compare the results...
Use temporary variables if you don't want to modify your originals.
Use a regular expression to normalize the input before entering it into your data structure. If you don't want to change the original strings, you will have to consider all possible valid variants any time you need to perform operations on the strings.
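The normalize-then-compare idea from both answers can be sketched as follows, in Python rather than the C# of the question. Treating everything except ASCII letters and digits as a "special character" is an assumption, as are the function names; adjust the pattern to whatever counts as special in your data.

```python
import re

SPECIAL_CHARACTERS = re.compile(r"[^A-Za-z0-9]")  # assumed definition of "special"


def normalize(text):
    """Strip special characters so differently punctuated variants compare equal."""
    return SPECIAL_CHARACTERS.sub("", text)


def unique_by_normalized_form(strings):
    """Keep the first original spelling of each distinct normalized string."""
    seen = set()
    result = []
    for text in strings:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            result.append(text)
    return result


unique = unique_by_normalized_form(["GBP/101-P506", "GBP-101-P-506", "USD/200-X1"])
```

Here the normalized form is only used as a comparison key, so the original strings are kept unmodified in the resulting list.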

Comma separators in Fortran

I have come across the following issue with Fortran: when reading a character array (or, in fact, any list) from a data file with fmt=*, both blanks outside quotes AND commas are natively treated as delimiters between the elements of the array/list. The fact that commas act as delimiters is a big problem for me.
So the question is: do you know of any option or compiler directive in Fortran that makes commas in input files be treated as characters and not as delimiters,
with blanks being the only delimiters? As a specific example, I would like reading a record like:
x,y,z
with:
read (7,*) adummy
to result in adummy (a scalar character variable) getting the value x,y,z, not x.
Any help would be most welcome.
The solution is to specify a format that matches your data record, i.e. use the character (A) edit descriptor when specifying the format:
read(7,fmt='(A)')adummy
will result in adummy having value x,y,z, assuming it is a variable of sufficient length.
However, this method will not treat blanks as delimiters either. If you want commas to be read as ordinary characters while blanks still act as delimiters, the common approach is to read the whole record into a character variable and then split it into separate variables afterwards.
