The best way to resolve ambiguity for hexadecimal versus decimal strings

Let's say I want to accept user input as a string -- and it can either be a decimal or hexadecimal string -- and then I want to parse it into an integer.
The problem is, for some strings this is ambiguous: "12345", "00001", and other short strings with no "letter" digits.
So, I'd like to allow some way for the users to disambiguate those strings. Obviously they can prefix with "0x" if the string is actually supposed to be a hexadecimal integer, but if it's supposed to be decimal what should they do?
This seems like such a common problem, it must've been solved before.
Is there some sort of standard that's been adopted?
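There is no single universal standard, but one widely adopted convention (used, for example, by C's strtol with base 0 and Python's int(s, 0)) is that an explicit "0x" prefix selects hexadecimal while bare digits default to decimal; be aware that in those two a bare leading 0 selects octal instead. A minimal sketch of the prefix convention in Python, with no octal special case (the helper name is just for illustration):
def parse_int(s: str) -> int:
    """Hypothetical helper: "0x"-prefixed input is hexadecimal,
    everything else is decimal (no octal special case)."""
    s = s.strip().lower()
    if s.startswith(("0x", "+0x", "-0x")):
        return int(s, 16)  # int() accepts the 0x prefix at base 16
    return int(s, 10)      # bare digits are decimal; leading zeros OK

assert parse_int("00001") == 1        # unambiguous: decimal
assert parse_int("0x00001") == 1      # explicit: hexadecimal
assert parse_int("0x12345") == 74565  # 0x12345 == 74565 decimal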

Related

Why is octal 0128 considered invalid to convert to decimal?

I'm practicing for an exam and working through literals. A question asked me to convert octal 0128 to decimal. The solution I have says it has too many bits to be considered octal, so it can't be converted to decimal, but the motivation behind that is not described.
Do you know why? I'm trying to figure it out, but I haven't found any answer yet.
The digit 8 is not a valid octal digit (octal digits run from 0 to 7), so one answer is "invalid input". A different answer might be to treat the input as "012", with the first non-octal character terminating the octal number; the result would then be 10 decimal.
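To make the two interpretations concrete, here is a small sketch (Python is just an illustration; the question itself is language-agnostic):
# Strict parsing: 8 is outside the octal digit range 0-7,
# so the whole literal is rejected.
try:
    int("0128", 8)
except ValueError:
    print("invalid octal literal")

# Permissive parsing (what C's strtol does): convert the longest
# valid octal prefix, "012", and stop at the '8'.
import re
prefix = re.match(r"[0-7]+", "0128").group()
print(int(prefix, 8))  # prints 10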

How do I parse a float from a string that might contain left-over characters without doing manual parsing?

How do I parse a float from a string that may contain left-over characters, and figure out where it ends, without doing manual parsing (e.g. parseFloat() in JS)?
An example would be the string "2e-1!". I want to evaluate the next float from the string such that I can split this string into "2e-1" and "!" (for example, if I wanted to implement a calculator.) How would I do this without writing a parser to see where the float ends, taking that as an &str, and then using .parse()?
I am aware that this means I may have to parse nonstandard things, such as splitting "3e+1+e3" into 3e+1 (30) and "+e3". This is intended; I do not know all of the ways to format floating-point numbers, especially the ones valid for Rust's parse::<f64>(), but I want to handle them regardless.
How can I do this, preferably without external libraries?
As mentioned in the comments, you need to either implement your own floating-point parser or use an external library. The parser in the standard library always errors out when it encounters additional junk in the input string – it doesn't even allow leading or trailing whitespace.
A good external crate to use is nom. It comes with an integrated parser for floating-point numbers that meets your requirements. Examples:
use nom::number::complete::double;

// `double` parses an f64 from the front of the input and returns the
// unconsumed remainder together with the parsed value.
let parser = double::<_, ()>;
assert_eq!(parser("2.0 abc"), Ok((" abc", 2.0)));

// nom also recognizes special values such as "NaN" (case-insensitively),
// so it consumes the leading "Nan" here and leaves the rest untouched.
let result = parser("Nanananananana hey, Jude!").unwrap();
assert_eq!(result.0, "ananananana hey, Jude!");
assert!(result.1.is_nan());
The parser expects the floating-point number at the very beginning of the string. If you want to allow leading whitespace, you can remove it first using trim_start().

Map strings to unique IDs

Is there any Python library that converts strings to unique IDs?
For example, I have this kind of data written in a txt file:
"name", "john doe"
"age", "twenty two"
"school","xxxx"
"name", "sam x"
"age", "twenty two"
"school","yyyy"
and I want the output to look like this:
1,55
2,44
3,77
1,56
2,44
3,78
I don't care about the range of the numbers, but they must be positive.
And is there any way to retrieve the original string from any given ID?
Thank you,
I'm not a Python dev, but it sounds like you want a kind of reversible hash.
Since you want integers, one thing you might do is loop through each character of the string, get its ASCII value, and concatenate those. The result would still have to be kept as a string, I think, because of the leading zeros.
You could also try simply assigning each letter of the alphabet a two-digit value, then use that. So "age" might be 11, 36, and 15, giving 113615. Then you can just decode it in two-character substrings. The range 10-99 is enough for lower case, upper case, and even a few symbols. You could have a single function for encoding and decoding, with a quick check of the type of the argument passed (integer or string).
Be aware that Python may have a much more elegant way to do this; I'm just thinking about an overall strategy that would work in almost any language.
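For what it's worth, in Python a plain dict already gives you this: hand out sequential IDs the first time a string is seen, and keep a reverse list for lookups. A minimal sketch (the class and method names here are my own invention, not a library API):
class StringInterner:
    """Maps each distinct string to a positive integer ID, reversibly."""

    def __init__(self):
        self._ids = {}      # string -> ID
        self._strings = []  # ID - 1 -> string

    def intern(self, s):
        if s not in self._ids:
            self._strings.append(s)
            self._ids[s] = len(self._strings)  # IDs start at 1
        return self._ids[s]

    def lookup(self, i):
        return self._strings[i - 1]

interner = StringInterner()
assert interner.intern("twenty two") == interner.intern("twenty two")
assert interner.lookup(interner.intern("name")) == "name"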

Why does the W3C XML Schema specification allow integers to have leading zeros?

Recently I had an example where integer fields in an XML message contained leading zeros. Unfortunately these zeros had relevance. One could argue about why integer was chosen in the schema definition, but that is not my question. I was a little surprised leading zeros were allowed at all. So I looked up the spec, which of course told me the supertype is decimal. But, as expected, specifications don't really tell you why certain choices were made. So my question is really: what is the rationale for allowing leading zeros at all? I mean, numbers generally don't have leading zeros.
On a side note, I guess the only way to add a restriction on leading zeros is with a pattern facet (something like 0|[1-9][0-9]*).
My recollection is that the XML Schema working group allowed leading zeroes in XSD decimals because they are allowed in normal decimal notation: 1, 01, 001, 0001, etc. all denote the same number in normal numerical notation. (But I don't actually remember that it was discussed at any length, so perhaps this is just my reason for believing it was the right thing to do and other WG members had other reasons for being satisfied with it.)
You are correct to suggest that the root of the problem is the use of xsd:integer as a type for a notation using strings of digits in which leading zeroes are significant (as for example in U.S. zip codes); I think you may be over-generous to say that one could argue about that decision. What possible arguments could one bring forward in favor of such an obviously erroneous choice?
Although numbers often don't have leading zeroes, number parsers almost always allow them.
You don't want to disallow leading zeroes for numbers completely, because you want the option to write a number like 0.12 and not only like .12. As you want to allow at least one leading zero for floating point numbers, it would feel a bit restrictive to only allow one leading zero, and only for floating point numbers.
Sometimes numbers do have leading zeroes, for example the components of a date in ISO 8601 format: 2014-05-02. If you want to parse a component, it's convenient if the leading zero is allowed, so that you don't have to write extra code to remove it before parsing.
The XML Schema specification just uses the same set of rules for parsing numbers that is generally used in most formats and most programming languages.
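For instance, typical parsers accept leading zeroes without complaint, which is exactly what makes fixed-width fields such as ISO 8601 components convenient to handle (Python shown here purely as an illustration):
# Leading zeroes are accepted by the ordinary parsing routines.
assert int("0001") == 1
assert float("0.12") == 0.12

# Convenient for fixed-width fields such as ISO 8601 dates.
year, month, day = (int(part) for part in "2014-05-02".split("-"))
assert (year, month, day) == (2014, 5, 2)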

What is the difference between binary safe strings and binary unsafe strings?

I was reading the Redis manifesto [1], and it seems Redis accepts only binary-safe strings as keys, but I don't know the difference between the two. Can anyone explain with an example?
[1] http://oldblog.antirez.com/post/redis-manifesto.html
According to the Redis documentation, simple Redis strings have the syntax "+redis_response\r\n", whereas bulk Redis strings have the syntax "$str_len\r\nbinary_safe_string\r\n".
In other words, a binary-safe string in Redis can contain any data, from something as simple as "foo" to any binary data up to 512 MB, say a JPEG image. A binary-safe string has its length encoded alongside it and does not terminate at any particular character, unlike a C string, which ends at the NUL character '\0'.
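For example, the length-prefixed framing described above could be produced like this (a hypothetical helper sketched in Python, not part of any Redis client):
def resp_bulk_string(payload):
    """Frame a payload as a RESP bulk string: $<len>\r\n<payload>\r\n."""
    return b"$%d\r\n%s\r\n" % (len(payload), payload)

assert resp_bulk_string(b"foo") == b"$3\r\nfoo\r\n"
# An embedded NUL is no problem: the length prefix, not a terminator
# character, delimits the payload.
assert resp_bulk_string(b"fo\x00o") == b"$4\r\nfo\x00o\r\n"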
HTH,
Swanand
I'm not familiar with the system in question, but the term "binary safe string" might be used either to describe certain string-storage types or to describe particular string instances. In a binary-safe string type, a string of length N may be used to encapsulate any sequence of N values in the range either 0-255 or 0-65535 (for 8- or 16-bit types, respectively). A binary-safe string instance might be one whose representation may be subdivided into uniformly-sized pieces, with each piece representing one character, as distinct from a string instance in which different characters require different amounts of storage space.
Some string types (which are not binary safe) will use variable-length representations for certain characters, and will behave oddly if asked to act upon e.g. a string which contains the code for "first half of a multi-part character" followed by something other than a "second half of a multi-part character". Further, some code which works with strings will assume that the Nth character will be stored in either the Nth byte or the Nth pair of bytes, and will malfunction if given a string in which, e.g., the 8th character is stored in the 12th and 13th pairs of bytes.
Looking only briefly at the link provided, I would guess it's saying that Redis does not expect to work only with strings that use different numbers of bytes to hold different characters, though I'm not quite clear whether it's assuming that a string type will be able to handle any possible sequence of bytes, or whether it's assuming that any string instance which it's given may be safely regarded as a sequence of bytes. I think the fundamental concepts of interest, though, are (1) some string types use variable-length encodings and others do not; (2) even in types that use variable-length encodings, a useful subset of string instances will consist only of fixed-length characters.
Binary-safe means that a string can contain any byte value, while a binary-unsafe string cannot. C strings are the classic example: '\0' marks the end of a string, so the characters before an embedded '\0' and the characters after it will be treated as two different strings.
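A quick illustration of the difference (sketched in Python, where bytes objects are length-prefixed and therefore binary-safe):
data = b"foo\x00bar"  # 7 bytes with an embedded NUL

# A binary-safe (length-prefixed) string keeps all 7 bytes.
assert len(data) == 7

# A NUL-terminated C string would stop at the first '\0',
# so "bar" would be silently lost.
assert data.split(b"\x00", 1)[0] == b"foo"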
