xsd pattern acceptable for decimals?

We have a request to implement our web service response so that xsd:decimal fraction digits are zero-padded when the value is not long enough and a pattern indicates they should be. I am wondering whether this is a reasonable request and whether xsd:decimal is supposed to be used with patterns like these.
Here is the relevant part of the xsd according to their specs:
<xsd:simpleType>
  <xsd:restriction base="xsd:decimal">
    <xsd:totalDigits value="14"/>
    <xsd:fractionDigits value="2"/>
    <xsd:pattern value="[\-+]?[0-9]{1,12}[.][0-9]{2}"/>
  </xsd:restriction>
</xsd:simpleType>
So fractionDigits is set to 2, which means the precision can be at most 2 digits. According to http://zvon.org/xxl/XMLSchemaTutorial/Output/ser_types_st2.html it is also fine to have fewer fraction digits (for example a number like 5.1).
But according to the {2} quantifier in the pattern, there should always be exactly 2 fraction digits.
We're developing a generic application development platform (Mendix), and there's no telling in advance what a decimal will be used for (currency, pH values, distances, etc.). This case comes from a specific project where our platform is being used, but normally we won't know what kind of data is being transferred. We could decide to just follow the WSDL in this regard, which states it should have 2 fraction digits, but our implementation must be very generic.
There is nothing stating what exactly these fraction digits should be padded with, or even that we should pad instead of leaving the decimal out altogether. In theory we could decide to pad with 5's until the value matches the pattern. As far as I know patterns are rarely used, and when they are, it's for things like passwords. The XSD specification is vague here, so it would be appreciated if someone could shed some light on whether this is a valid use of an XSD and whether it makes sense for us to pad with 0's.
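To make the mismatch concrete, here is a minimal check (using Python's re module, purely illustrative) of how the pattern treats values that fractionDigits alone would accept:

import re

# The pattern from the schema fragment above.
PATTERN = r"[\-+]?[0-9]{1,12}[.][0-9]{2}"

# fractionDigits="2" allows *up to* two fraction digits,
# but the {2} quantifier demands *exactly* two.
print(bool(re.fullmatch(PATTERN, "5.1")))   # False: only one fraction digit
print(bool(re.fullmatch(PATTERN, "5.10")))  # True
print(bool(re.fullmatch(PATTERN, "5")))     # False: no decimal point at all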

To be practical, my litmus test would be to look at this from the perspective of the technology stack directly involved or, better yet, one that is mainstream.
I can think of JAXB on Java, or xsd.exe/svcutil.exe on .NET; a quick test of these pretty common tools against this schema fragment, using a value of 1, would fail to produce valid XML. This would send developers scrambling for all sorts of customizations to make serialization match the XSD pattern. Painful, with high development and maintenance costs...
The same would apply to XSLT; there would be a need to manually format the output... Bottom line, XSD patterns are not machine-usable for "automatic" formatting; I have yet to see such a thing...
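To give an idea of the workaround, the manual formatting step would amount to something like this (a Python sketch of the padding logic only; in practice it would live in a JAXB XmlAdapter, a .NET equivalent, or an XSLT format-number() call):

from decimal import Decimal, ROUND_HALF_UP

def to_wire_format(value):
    # Quantize to exactly two fraction digits so the serialized
    # string matches [\-+]?[0-9]{1,12}[.][0-9]{2}.
    return str(Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))

print(to_wire_format("1"))    # "1.00" -- valid against the pattern
print(to_wire_format("5.1"))  # "5.10"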
I also believe that a requirement such as this is unreasonable, and I personally feel it should be considered an antipattern when it comes to describing data being exchanged. Since there are no absolutes, it is conceivable that an exception exists; I can't think of any, but one should explore the reason why you were presented with such a requirement; I would then try to find a solution that doesn't involve this pattern...

It actually depends on what that particular decimal value represents.
XML is among the most popular formats for transporting and storing data, and it can carry all kinds of values. Take two examples:
If the data is currency, my advice is to enforce [0-9]*[.][0-9]{2}; to support this, our client's data restoration software is designed to pad with 0s.
If the data is the pH value of a chemical, one digit after the decimal point is mandatory: [0-1][0-9][.][0-9]
So it all depends on the object we are referring to. Unless it's really necessary, it wouldn't be fair to force the pattern :)

Related

Hypothesis search tree

I have an object with many fields, and each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? And what does the search tree Hypothesis creates look like? I don't need all the combinations, but I want to make sure I get a fair number of them, testing many different values for each field. In particular, I want to be sure Hypothesis is not doing a DFS until it hits the maximum number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, and strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
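For illustration, a naive strategy for a many-field object might look something like this (field names and ranges are made up; the point is that st.builds draws each field independently and lets the engine choose the combinations):

from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class Record:
    age: int
    score: float
    name: str

# One strategy per field; Hypothesis decides which combinations to try.
records = st.builds(
    Record,
    age=st.integers(min_value=0, max_value=120),
    score=st.floats(min_value=0.0, max_value=1.0),
    name=st.text(min_size=1, max_size=20),
)

@given(records)
def test_invariants_hold(record):
    # Stand-in for a real property of your object.
    assert 0 <= record.age <= 120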

Randomized (as in QuickCheck) versus deterministic (as in SmallCheck) property checking

I know of two approaches to property checking:
The randomized approach (as in QuickCheck) makes you define a generator of random values for your types, then verifies your invariants for each of a large number of randomly generated cases. For example, in the case of a vector space ℤⁿ (defined, say, as [Word]), it would generate vectors of any length and direction.
The deterministic approach (as in SmallCheck) works the same way, except that it generates values deterministically, from simple and small to more complex and extensive, covering (as I understand it) a small part of the domain tightly. For example, in the same case as above, it would generate the zero vector, then all vectors of length <= 1, then all vectors of length <= 2, and so on, eventually covering all interesting values (such as unit vectors), for which there may conceivably be unforeseen corner cases that break an invariant.
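To make the contrast concrete, here is a rough, library-free sketch of the two generation orders for the [Word] example (Python used purely for illustration; the ordering idea is language-agnostic):

import itertools
import random

def quickcheck_style(num_cases, max_len=10, max_word=2**16):
    # Random sampling: each case has an arbitrary length and direction.
    for _ in range(num_cases):
        n = random.randint(0, max_len)
        yield [random.randint(0, max_word - 1) for _ in range(n)]

def smallcheck_style(depth):
    # Deterministic enumeration from small to large: the zero-length
    # vector first, then every vector of length <= depth whose
    # components are <= depth.
    for n in range(depth + 1):
        for v in itertools.product(range(depth + 1), repeat=n):
            yield list(v)

print(list(smallcheck_style(1)))  # [[], [0], [1]]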
Did I get it correct? What are the benefits and downsides of either approach? What are the preferred use cases for either? Maybe, for best results, both approaches should be combined, say, covering some moderately sized "base" with a deterministic checker and then a random sample with a randomized one?
I'm looking for some solid, motivated best practice to use daily and advance from now on.
P.S. This question is not intended to provoke discussion. I am asking specifically what benefits either approach to property checking has matter-of-factly, not which is nicer, easier to use or better by any other measure that can be considered a matter of taste. An answer would provide either a critique of the problem posed, a mindful, mathematical observation of the nature of either method of property checking, or a case / experience report that justifies one of the approaches as preferable for a certain class of situations. In the end, I intend this question to generate a set of criteria that help determine which approach to testing will catch more programming mistakes. Put this way, I don't see how this question could be considered subjective in any way other than good, as outlined in the rules. There is a companion reddit thread that offers a channel for more liberal communication.

Using "seed" based math to recreate application instances

Okay, so I was thinking today about Minecraft, a game so many of you are familiar with, I'm sure. While my question isn't directly related to the game, I find it much simpler to describe my question using the game as an example.
My question is: is there any way a type of "seed" or string of characters can be used to recreate an instance of a program (not in the literal programming sense), by storing a code which, when re-entered into the program as a string at run-time, could recreate the data it once held, in fields, text boxes, canvases, and so on, exactly as it was?
As I understand it, Minecraft takes the string of ASCII characters you enter (all of which are really numbers) and performs a series of operations on it that evaluate to some kind of hash, a finite number... this number (again, as I understand it) is the representation of the string you entered. A string parsed by this algorithm will always evaluate to the same hash: 1 + 1 always equals 2, so a seed's value must always come out the same in the end. In doing so you have the ability to replicate worlds exactly, by entering this sort of key, which is evaluated the same on every machine.
Now, if we can replicate worlds exactly like this, is it possible to take the idea to a more abstract level, like the following?
Say you have an application like Microsoft Word. Word saves the data you have entered as a file on your hard drive: it holds formatting data, the strings you've entered, the format of the file... all of that in a physical file. Now imagine that when you finished your essay in Word, instead of saving it and bringing your laptop to school, you clicked "parse", and instead of creating a file you were given a hash code. You go to school knowing you have to print it, so you log onto a computer and open Word. Instead of "open" there is now an option called "evaluate": you click it, enter the hash your other computer produced, and it recreates the exact essay you wrote.
Is this possible? If so, are there obvious implementations of this I'm simply not thinking of, or that are so much a part of everyday life that I don't even recognize them? And finally, if it is possible, what methods and algorithms would go into such a thing?
[EDIT]
I had to do some research on the anatomy of a seed and I think this explains it well
The limit is 32 characters or, for a numeric seed, 19 digits plus the minus sign. Numeric seeds can range from -9223372036854775808 to 9223372036854775807, which is a total of 18446744073709551616. Text strings entered will be "hashed" to one of the numeric seeds in the above range. The "Seed for the World Generator" window only allows 32 characters to be entered and will not show or use any more than that.
BUT, looking back on it, lossless compression IS EXACTLY what I was describing. After re-reading the wiki page I remembered that (you are very correct) the seed only partakes in the generation; the final data is stored as a "physical" file on the HDD, which (again, you are correct) is raw uncompressed data in a file.
So in retrospect, I believe I was describing lossless compression, trying in my mind to figure out how the seed was able to replicate the exact same world, forgetting that the seed is only responsible for generating the world, not for saving or compressing it.
So thank you for your help guys! It's really appreciated I believe we can call this one solved!
There are several possibilities to achieve this "string" that recovers your data. However, they're not all applicable in every context.
An actual seed, which initializes, for example, a pseudo-random number generator and then allows you to recreate the same sequence of pseudo-random numbers (see this question, and the sketch after this list).
This is possibly similar to what Minecraft relies on, because the whole process of how to create a world based on some choices (possibly pseudo-random choices) is known in advance. Even if we pretend that we have random numbers, computers are actually deterministic, which makes this possible.
If your document were generated randomly then this would be applicable: with the same seed, the same gibberish comes out.
Some key-value dictionary, or hash map. Then the values have to be accessible by both sides and the string is the key that allows to retrieve the value.
Think for example of storing your word file on an online server, then your key is the URL linking to your file.
Compressing all the information that is in your data into the string. This is much harder, and there are strong limits due to the entropy of the data. See Shannon's source coding theorem for example.
You would be better off (as in, it would be easier) just compressing your file with a usual algorithm (zip or 7z or something else) rather than reimplementing this yourself, especially once your document starts having fancy things (different styles, tables, pictures, unusual characters...).
With the simple hypothesis of 27 possible characters (26 letters and the space), Shannon himself shows in Prediction and Entropy of Printed English (Bell System Technical Journal, 30:1, January 1951, pp. 50-64, online version) that there are about 2.14 bits of entropy per letter in English. That's about 550 characters encoded with your 32-character string.
While this is significantly better than the 8 bits we use for each ASCII character, it also shows it is very likely impossible to encode an English document in less than about a quarter of its size. And you'd still have to add punctuation and all the rest.
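For the first option, a minimal sketch (in Python; the SHA-256 step stands in for whatever hash Minecraft actually applies to text seeds):

import hashlib
import random

def world_from_seed(seed_text, size=4):
    # Hash the text seed to a stable 64-bit number, then use it to
    # initialize a PRNG. Same seed text -> same sequence -> same "world".
    numeric_seed = int.from_bytes(
        hashlib.sha256(seed_text.encode()).digest()[:8], "big")
    rng = random.Random(numeric_seed)
    # The "world": a grid of pseudo-random terrain heights.
    return [[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]

# The same seed reproduces the same world on any machine:
assert world_from_seed("glacier") == world_from_seed("glacier")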

What's the real difference between 3GLs and 4GLs?

I can't find anything that is useful to determine if a language is third generation or fourth. All I find is open statements like "higher level" and "closer to English" and some sources say that they are domain specific languages like SQL and others say that they can be general purpose. I'm really confused.
If 2GLs are the Assembly languages and 5GLs are the inference languages like Prolog, how do you determine if a programming language is a 3GL or a 4GL?
Most use of the terms was pure marketing -- "Oh, you're still using a third generation language? That's so last week!"
Behind that, there was a tiny bit of technical meaning, though (at least for a little while, though many "4GLs" ignored it). The basic difference was supposed to be that third generation languages allowed you to manipulate only individual data items, whereas fourth generation languages allowed you to manipulate groups of items as a group rather than individually.
Two obvious examples of this are SQL and APL. In SQL, you mostly work with sets. The result of a query is a set (not exactly a mathematical set, but at least somewhat similar). You can use and manipulate that set as a whole, merge it with other sets, etc. Until or unless you're exposing it to the outside world (e.g., with a cursor) you don't have to deal with the individual records/rows/tuples that make up that set.
In APL you get somewhat the same idea, except you're working with arrays instead of sets. To get an idea of what this means, let's assume you wanted to "rotate" an array so the currently-first element was moved to the end, and each other element was shifted ahead a spot. To do that in a typical 3GL (Fortran, Pascal, C, etc.) you'd write a loop that worked with the individual elements in the array. In APL, however, you have a single operator that will do that to the array as a whole, all in one operation. Even operations that work with individual items are generally trivial to apply to an entire array at once with the / operator, so (for example) the sum of all the elements in an array named a could be computed with +/a (or maybe that was /+a -- it's been a long time since I wrote any APL).
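To carry the rotate example out of APL, here is a rough Python sketch of the same contrast (Python is itself a 3GL, but slicing lets it mimic the whole-array style):

# 3GL style: an explicit loop over individual elements.
def rotate_3gl(a):
    if not a:
        return a
    first = a[0]
    for i in range(len(a) - 1):
        a[i] = a[i + 1]
    a[-1] = first
    return a

# APL/4GL style: one operation on the array as a whole.
def rotate_whole(a):
    return a[1:] + a[:1]

print(rotate_3gl([1, 2, 3, 4]))    # [2, 3, 4, 1]
print(rotate_whole([1, 2, 3, 4]))  # [2, 3, 4, 1]
print(sum([1, 2, 3, 4]))           # the +/a analogue: 10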
There are some pretty serious problems with the general idea of the distinction, though. One is that it placed a heavy emphasis on syntax -- obviously the actions involved required things like loops internally, so the distinction amounted to a syntax for an implicit loop. Another problem was that in quite a few cases you ended up with a blend of the two generations -- e.g., most BASICs could treat a string as a single thing but required loops for similar operations on arrays/matrices. Finally, there was a bit of a problem with relevance: in a few special cases (like SQL) being able to work with a group/set/array/whatever of data as a whole really made a big difference, but for the most part it did little to let people think and work with higher-level abstractions (which was at least apparently the original intent).
That combined with a move toward languages that blurred the distinction between what was built in, and what was part of a library (or whatever). In C++, most ML-family languages, etc., it's trivial to write a function with arbitrary actions (including but not limited to loops) and attach it to an operator that's essentially indistinguishable from one that's built into the language.
It was a catchy phrase with a meaning most couldn't explain and even fewer cared about -- a prime candidate for being turned into pure marketspeak, usually translated roughly as: "you should pay me a lot for my slow, ugly, buggy CRUD application generator."
"Language generations" were a hot buzzword in the 1980s and early 1990s. They were always ill-defined, and little used in actual academic discourse.
The terms had little meaning at the time, and none now.

Evaluating the "Value" Attribute

I'm attempting to use the OpenAmplify API to evaluate the content of a URI. The point is to draw out the topics that are truly relevant to the article. Unfortunately, the topical analysis I'm getting back is:
Huge, and
Varied
Neither quality is terribly useful for what I'm trying to do, because the signal-to-noise ratio is heavily skewed towards noise. I'm analyzing web content, so a certain amount (perhaps a large amount) of irrelevant content (ads, etc.) is involved. I get that.
Nonetheless, many of the topics being returned are either useless (utterly nonsensical, not even words), irrelevant (as in, where did that come from?), or too granular to provide any meaning or insight. I can probably filter out most of this noise using the value, um, value that is returned for each domain, subdomain, topic, et al., but I don't really know what it means.
Certainly I understand that the value is a measure of "the prominence of the word in the text," but the number itself appears entirely arbitrary, in a way that prevents me from saying something like "ignore any terms with a value less than 50" and having it carry any real meaning.
Are there any range criteria that I can use to help me understand how to use a topic's value score as a filtering threshold? Alternatively, is there another field that I should be using for this sort of filtration?
Thanks for your help.
From other channels, I've learned that the value attribute can't be evaluated the way I was hoping. It means different things for different signals, and none are defined in such a way as to be meaningful for this kind of requirement.
