Compile-time constraints for strings in F#, similar to Units of Measure - is it possible?

I'm developing a Web application using F#. Thinking of protecting user input strings from SQL, XSS, and other vulnerabilities.
In short, I need some compile-time constraints that would allow me to distinguish plain strings from those representing SQL, URLs, XSS payloads, XHTML, etc.
Many languages have something along these lines, e.g. Ruby's native string-interpolation feature, #{...}.
With F#, it seems that Units of Measure do very well, but they are only available for numeric types.
There are several solutions employing runtime UoM (link); however, I think that is too much overhead for my goal.
I've looked into FSharpPowerPack, and it seems quite possible to come up with something similar for strings:
[<MeasureAnnotatedAbbreviation>] type string<[<Measure>] 'u> = string
// Similarly to Core.LanguagePrimitives.IntrinsicFunctions.retype
[<NoDynamicInvocation>]
let inline retype (x:'T) : 'U = (# "" x : 'U #)
let StringWithMeasure (s: string) : string<'u> = retype s
[<Measure>] type plain
let fromPlain (s: string<plain>) : string =
    // of course, this one should be implemented properly,
    // by validating/escaping special characters and then assigning a proper UoM
    retype s
// Supposedly populated from user input
let userName:string<plain> = StringWithMeasure "John'); DROP TABLE Users; --"
// the following line does not compile
let sql1 = sprintf "SELECT * FROM Users WHERE name='%s';" userName
// the following line compiles fine
let sql2 = sprintf "SELECT * FROM Users WHERE name='%s';" (fromPlain userName)
Note: It's just a sample; don't suggest using SqlParameter. :-)
My questions are: Is there a decent library that does this? Is there any way to add syntactic sugar?
Thanks.
Update 1: I need compile-time constraints, thanks Daniel.
Update 2: I'm trying to avoid any runtime overhead (tuples, structures, discriminated unions, etc).

A bit late (I'm sure there's a date format in which February 23rd and November 30th differ by only one bit), but I believe these one-liners are compatible with your goal:
type string<[<Measure>] 'm> = string * int<'m>
type string<[<Measure>] 'm> = { Value : string }
type string<[<Measure>] 'm>(Value : string) = struct end
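For instance, with the record version the measure is purely a phantom parameter: the payload stays a plain field and only the type carries the tag. A minimal sketch (the measure names and the tag helper are my own illustration, not part of the answer):
[<Measure>] type plain
[<Measure>] type sql

type string<[<Measure>] 'm> = { Value : string }

// tag a raw string with an arbitrary measure (illustration only;
// a real version would validate or escape before tagging)
let tag (s: string) : string<'m> = { Value = s }

let userName : string<plain> = tag "John"
// let q : string<sql> = userName   // does not compile: 'plain' and 'sql' differ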

In theory it's possible to use 'units' to provide various kinds of compile-time checks on strings (is this string 'tainted' user input, or sanitized? is this filename relative or absolute? ...)
In practice, I've personally not found it to be all that practical, as so many existing APIs just use string that you have to exercise a ton of care, and do a lot of manual conversion, when plumbing data from here to there.
I do think that strings are a huge source of errors, and that type systems dealing with taintedness/canonicalization/etc. of strings will be one of the next leaps in static typing for reducing errors, but I think that's on roughly a 15-year horizon. I'd be interested to see people try an approach with F# UoM and report whether they get any benefit, though!

The simplest workaround for not being able to write
"hello"<unsafe_user_input>
is a wrapper type that pins the measure to a dummy numeric value, like
type mystring<[<Measure>] 't>(s: string) =
    // non-zero constants cannot have generic units, so use the helper
    let dummyint : int<'t> = LanguagePrimitives.Int32WithMeasure 1
    member __.Value = s
Then you have a compile-time check on your strings.
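For example, a quick sketch of the compile-time check (the measure names here are assumed for illustration):
[<Measure>] type unsafe_user_input
[<Measure>] type sanitized

let raw = mystring<unsafe_user_input>("hello")
// let ok : mystring<sanitized> = raw   // does not compile: the measures differ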

It's hard to tell what you're trying to do. You said you "need some runtime constraints" but you're hoping to solve this with units of measure, which are strictly compile-time. I think the easy solution is to create SafeXXXString classes (where XXX is Sql, Xml, etc.) that validate their input.
type SafeSqlString(sql: string) =
    do
        // check `sql` for injection, etc.;
        // raise an exception if validation fails
        ()
    member __.Sql = sql
It gives you run-time, not compile-time, safety. But it's simple, self-documenting, and doesn't require reading the F# compiler source to make it work.
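Usage is then just construction; whatever checks you put in the constructor run before any consumer sees the value. A sketch:
let query = SafeSqlString "SELECT * FROM Users WHERE id = 42"  // validated here
printfn "%s" query.Sql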
But, to answer your question, I don't see any way to do this with units of measure. As far as syntactic sugar goes, you might be able to encapsulate it in a monad, but I think it will make it more clunky, not less.

You can use discriminated unions:
type ValidatedString = ValidatedString of string
type SmellyString = SmellyString of string
let validate (SmellyString s) =
    if (* ... *) then Some(ValidatedString s) else None
You get a compile-time check, and adding two validated strings won't generate a validated string (which units of measure would allow).
If the added overhead of the reference types is too big, you can use structs instead.
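For instance, in F# 4.1 and later a single-case union can be marked as a struct, which avoids the heap allocation while keeping the same pattern-matching syntax (a sketch, not part of the original answer):
[<Struct>] type ValidatedString = ValidatedString of string
[<Struct>] type SmellyString = SmellyString of string

// matching works exactly as with the reference-type versions
let shout (ValidatedString s) = s.ToUpper()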

Related

Why are new programming languages shifting types to the other side?

If you look at Rust, Go, Swift, TypeScript and a few others, and compare them to C/C++, the first thing that I noticed was how the types have moved positions.
int one = 1;
In comparison to:
let one:int = 1;
My question: Why?
To me, personally, it is weird reading type specifiers that far into the line, since I am very used to them being on the left. So it interests me why the type specifiers are being moved - and why this is the case not just in one, but in many modern/new languages that are on the table.
To me, personally, it is weird reading type specifiers that far into the line, since I am very used to them being on the left
And English is the best language because it is the only language where the words are spoken in the same order I think them. One wonders why anyone speaks French at all, with the words all in the wrong order!
So it interests me why the type specifiers are being moved - and why this is the case not just in one, but in many modern/new languages that are on the table.
I note that you ignore the existence of the many older languages which use this pattern. Visual Basic (mid 1990s) immediately comes to mind.
Function F(x As String) As Object
Pascal, 1970s:
var
  Set1 : set of 1..10;
Simply-typed lambda calculus, a programming language invented before computers, in the 1940s:
λx:S.λy:T.x : S-->T-->S
The whole ML family. I could go on. There are plenty of very old languages that use the types on the right convention.
But we can get far older than the 1940s. When you say in mathematics f : Q --> R, you are putting the name of the function on the left and the type -- a map from Q to R -- on the right. When you say x∈R to indicate that x is a real, you're putting the type on the right. "Type on the right" predates type on the left in C by literally centuries. This is not anything new!
In fact the "types on the left" syntax is the weird one! It just seems natural to you because you happen to have used a language that uses this convention in your formative years.
The types on the right syntax is much superior, for numerous reasons. Just a few:
var x : int = 1;
function y(z : int) : string { ... }
emphasizes that x is a variable and y is a function. If the type comes to the left and you see int y then you don't know whether it is a function or a variable until later. This makes programs harder for humans to read, which is bad enough. As a compiler developer, let me tell you it is quite inconvenient that the type comes on the left in C#. (I could point out numerous inconsistencies in how C# syntax deals with the positions of types.)
Another reason: In the "type on the right" syntax you can make types optional. If you have
var x : int = 1;
then you can easily say "well, we can infer the int, and so eliminate it"
var x = 1;
but if the int is on the left, then what do you do?
Inverting this: you mention TypeScript. TypeScript is a gradually-typed JavaScript. The convention in JavaScript is already
var x = 1;
function f(y) { }
Given that, plainly it is easier to modify both existing code, and the language as a whole, to introduce optional type elements on the right than it would be to make the "var" and "function" keywords replaced by a type.
Consider also the positioning. When you say:
int x = 1;
then the two things that must be consistent -- the type and the initializer -- are as far apart as they possibly can be. With var x : int = 1; they are side by side. And in
int f() {
...
...
return 123;
}
what have we got? The return is logically as far to the right as possible, so why does the function declaration move the type of the return as far to the left as possible? With the type-on-the-right syntax we have this nice flow:
function f(x : string) : int
{ ... ... ... return 123; }
What happens in a function call? The flow of the declaration is now the same as the flow of control: the things on the left -- initialization of formal parameters -- happen first, and the things on the right -- production of a return value -- happen last.
I could go on at some additional length pointing out how the C style gets it completely backwards, but it is late. Summing up: first, type on the right is superior in almost every possible way, and second, it is very, very old. New languages which use this convention are the ones that are being consistent with traditional practice.
If you do a web search, it is not hard to find the developers of newer languages answering this question in their own words. For example, the Go developers have a FAQ entry on this, as well as an entire article on their language blog. Many programmers are so used to C-like languages that any alternative seems weird, so this question tends to come up a lot...
However, you could argue that the C type declaration syntax itself is odd at best. The pattern-like features for pointers and function types become awkward and unintuitive very quickly, and were never developed as part of, or into, any kind of more general pattern-matching facility. For the sake of familiarity, they were adopted to a greater or lesser degree by many successive C-like languages, but the feature itself sticks out as more of a failed experiment that we have to live with for the sake of backwards compatibility.
One advantage of extricating yourself from C type syntax is that it makes it easier to use types in more places than just declarations. If you can place types conveniently wherever they make sense, you can use your types as annotation, as described in the Swift documentation.

Is there a way to use the namelist I/O feature to read in a derived type with allocatable components?

The only thing I've been able to find about it is https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/269585 which ended on a fairly unhelpful note.
Edit:
I have user-defined derived types that need to be filled with information from an input file, so I'm trying to find a convenient way of doing that. Namelist seems like a good route because it is so succinct: basically two lines, one to create the namelist and one for the namelist read. Namelist also seems like a good choice because the text file forces you to show very clearly where each value goes, which I find highly preferable to a bare list of values whose exact order only the compiler knows. A bare list makes it much more work if I or anyone else needs to know which value corresponds to which variable, and much more work to keep clean when, inevitably, a new value is needed.
I'm trying to do something of the basic form:
!where myType_T is a type that has at least one allocatable array in it
type(myType_T) :: thing
namelist /nmlThing/ thing
open(1, file="input.txt")
read(1, nml=nmlThing)
I may be misunderstanding user-defined I/O procedures, but they don't seem to be a very generic solution: it seems I would need to write a new one each time I need this, and they don't seem to natively support the
&nmlThing
thing%name = "thing1"
thing%siblings(1) = "thing2"
thing%siblings(2) = "thing3"
thing%siblings(3) = "thing4"
!siblings is an allocatable array
/
syntax that I find desirable.
There are a few solutions I've found to this problem, but none seem very succinct or elegant. Currently, I have a dummy user-defined type whose arrays are oversized rather than allocatable, and a function that copies the information from the dummy, namelist-friendly type into the type with allocatable components. It works just fine, but it is ugly, and I'm up to about four places in the code where I need this same kind of operation.
Hence the search for a good solution.
If you want to use allocatable components, then you need an accessible generic interface for a user-defined derived-type input/output procedure (typically via the type having a generic binding for such a procedure). You link to a thread with an example of such a procedure.
Once invoked, that user defined derived type input/output procedure is then responsible for reading and writing the data. That can include invoking namelist input/output on the components of the derived type.
Fortran 2003 also offers derived types with length parameters. These may offer a solution without the need for a user defined derived type input/output procedure. However, use of derived types with length parameters, in combination with namelist, will put you firmly in the "highly experimental" category with respect to the current compiler implementation.

Advantage to a certain string comparison order

Looking at some pieces of code around the internet, I've noticed some authors tend to write string comparisons like
if("String"==$variable)
in PHP, or
if("String".equals(variable))
Whereas my preference is:
if(variable.equals("String"))
I realize these are effectively equal: they compare two strings for equality. But I was curious if there was an advantage to one over the other in terms of performance or something else.
Thank you for the help!
One argument for using an equality function, or for writing if( constant == variable ) rather than if( variable == constant ), is that it prevents you from accidentally writing an assignment instead of a comparison. For instance:
if( s = "test" )
will assign "test" to s, resulting in undesired behaviour that may cause a hard-to-find bug. However:
if( "test" = s )
will in most languages (that I'm aware of) result in some form of warning or compiler error, helping you avoid the bug.
With a simple int example, this prevents accidental writes of
if (a=5)
which would be a compile error if written as
if (5=a)
I sure don't know about all languages, but decent C compilers warn you about if (a=b). Perhaps whatever language your question is written in doesn't have such a feature, so to be able to get an error in such cases, people have reversed the order of the comparison arguments.
Yoda conditions call these some.
The kind of syntax a language uses has nothing to do with efficiency. It is all about how the comparison algorithm works.
In the examples you mentioned, this:
if("String".equals(variable))
and this:
if(variable.equals("String"))
would be exactly the same, because the literal "String" is treated as a String object.
Languages that provide a comparison method for strings will use the fastest method available, so you shouldn't worry about it unless you want to implement the method yourself ;)

Why do a lot of programming languages put the type *after* the variable name?

I just came across this question in the Go FAQ, and it reminded me of something that's been bugging me for a while. Unfortunately, I don't really see what the answer is getting at.
It seems like almost every non C-like language puts the type after the variable name, like so:
var : int
Just out of sheer curiosity, why is this? Are there advantages to choosing one or the other?
There is a parsing issue, as Keith Randall says, but it isn't what he describes. The "not knowing whether it is a declaration or an expression" simply doesn't matter - you don't care whether it's an expression or a declaration until you've parsed the whole thing anyway, at which point the ambiguity is resolved.
Using a context-free parser, it doesn't matter in the slightest whether the type comes before or after the variable name. What matters is that you don't need to look up user-defined type names to understand the type specification - you don't need to have understood everything that came before in order to understand the current token.
Pascal syntax is context-free - if not completely, at least WRT this issue. The fact that the variable name comes first is less important than details such as the colon separator and the syntax of type descriptions.
C syntax is context-sensitive. In order for the parser to determine where a type description ends and which token is the variable name, it needs to have already interpreted everything that came before, so that it can determine whether a given identifier token is the variable name or just another token contributing to the type description. For example, a * b; is a multiplication if a names a variable, but a declaration of a pointer b if a names a type.
Because C syntax is context-sensitive, it is very difficult (if not impossible) to parse using traditional parser-generator tools such as yacc/bison, whereas Pascal syntax is easy to parse using the same tools. That said, there are parser generators now that can cope with C and even C++ syntax. Although it's not properly documented or in a 1.? release etc., my personal favorite is Kelbt, which uses backtracking LR and supports semantic "undo" - basically undoing additions to the symbol table when speculative parses turn out to be wrong.
In practice, C and C++ parsers are usually hand-written, mixing recursive descent and precedence parsing. I assume the same applies to Java and C#.
Incidentally, similar issues with context sensitivity in C++ parsing have created a lot of nasties. The "Alternative Function Syntax" for C++0x is working around a similar issue by moving a type specification to the end and placing it after a separator - very much like the Pascal colon for function return types. It doesn't get rid of the context sensitivity, but adopting that Pascal-like convention does make it a bit more manageable.
The 'most other' languages you speak of are the more declarative ones. They aim to let you program more along the lines you think in (assuming you aren't boxed into imperative thinking).
Type-last reads as 'create a variable called NAME of type TYPE'.
This is of course the opposite of saying 'create a TYPE called NAME', but when you think about it, what the value is for is more important than its type; the type is merely a programmatic constraint on the data.
If the name of the variable starts at column 0, it's easier to find the name of the variable.
Compare
QHash<QString, QPair<int, QString> > hash;
and
hash : QHash<QString, QPair<int, QString> >;
Now imagine how much more readable your typical C++ header could be.
In formal language theory and type theory, it's almost always written as var: type. For instance, in the typed lambda calculus you'll see proofs containing statements such as:
x : A ⊢ y : B
-------------
λx.y : A -> B
I don't think it really matters, but I think there are two justifications: one is that "x : A" is read "x is of type A"; the other is that a type is like a set (e.g. int is the set of integers), and the notation is related to "x ∈ A".
Some of this stuff pre-dates the modern languages you're thinking of.
An increasing trend is to not state the type at all, or to make stating it optional. This could be a dynamically typed language where there really is no type on the variable, or it could be a statically typed language which infers the type from the context.
If the type is sometimes given and sometimes inferred, then it's easier to read if the optional bit comes afterwards.
There are also trends related to whether a language regards itself as coming from the C school or the functional school or whatever, but these are a waste of time. The languages which improve on their predecessors and are worth learning are the ones that are willing to accept input from all different schools based on merit, not be picky about a feature's heritage.
"Those who cannot remember the past are condemned to repeat it."
Putting the type before the variable started innocuously enough with Fortran and Algol, but it got really ugly in C, where some type modifiers are applied before the variable, others after. That's why in C you have such beauties as
int (*p)[10];
or
void (*signal(int x, void (*f)(int)))(int)
together with a utility (cdecl) whose purpose is to decrypt such gibberish.
In Pascal, the type comes after the variable, so the first examples becomes
p: pointer to array[10] of int
Contrast with
q: array[10] of pointer to int
which, in C, is
int *q[10]
In C, you need parentheses to distinguish this from int (*p)[10]. Parentheses are not required in Pascal, where only the order matters.
The signal function would be
signal: function(x: int, f: function(int) to void) to (function(int) to void)
Still a mouthful, but at least within the realm of human comprehension.
In fairness, the problem isn't that C put the types before the name, but that it perversely insists on putting bits and pieces before, and others after, the name.
But if you try to put everything before the name, the order is still unintuitive:
int [10] a // an int, ahem, ten of them, called a
int [10]* a // an int, no wait, ten, actually a pointer thereto, called a
So, the answer is: A sensibly designed programming language puts the variables before the types because the result is more readable for humans.
I'm not sure, but I think it's got to do with the "name vs. noun" concept.
Essentially, if you put the type first (such as "int varname"), you're declaring an "integer named 'varname'"; that is, you're giving an instance of a type a name. However, if you put the name first, and then the type (such as "varname : int"), you're saying "this is 'varname'; it's an integer". In the first case, you're giving an instance of something a name; in the second, you're defining a noun and stating that it's an instance of something.
It's a bit like if you were defining a table as a piece of furniture; saying "this is furniture and I call it 'table'" (type first) is different from saying "a table is a kind of furniture" (type last).
It's just how the language was designed. Visual Basic has always been this way.
Most (if not all) curly brace languages put the type first. This is more intuitive to me, as the same position also specifies the return type of a method. So the inputs go into the parenthesis, and the output goes out the back of the method name.
I always thought the way C does it was slightly peculiar: instead of constructing types, the user has to declare them implicitly. It's not just before/after the variable name; in general, you may need to embed the variable name among the type attributes (or, in some usage, to embed an empty space where the name would be if you were actually declaring one).
As a weak form of pattern-matching, it is intelligible to some extent, but it doesn't seem to provide any particular advantages either. And trying to write (or read) a function pointer type can easily take you beyond the point of ready intelligibility. So overall this aspect of C is a disadvantage, and I'm happy to see that Go has left it behind.
Putting the type first helps in parsing. For instance, in C, if you declared variables like
x int;
When you parse just the x, then you don't know whether x is a declaration or an expression. In contrast, with
int x;
When you parse the int, you know you're in a declaration (types always start a declaration of some sort).
Given progress in parsing languages, this slight help isn't terribly useful nowadays.
Fortran puts the type first:
REAL*4 I,J,K
INTEGER*4 A,B,C
And yes, there's a (very feeble) joke there for those familiar with Fortran.
There is room to argue that this is easier than C, which puts the type information around the name when the type is complex enough (pointers to functions, for example).
What about dynamically (cheers #wcoenen) typed languages? You just use the variable.

What is a strictly typed language? [duplicate]

What is a strictly typed language?
Strictly typed languages enforce typing on all data being interacted with.
For example
int i = 3
string s = "4"
From here on out, whenever you use i, you can interact with it only as an integer type; that means you are restricted to methods that work with integers.
As for the string s, you can interact with it only as a string type: you can concatenate it with other strings, print it out, etc. However, even though it contains the character "4", you cannot add it to an integer without using some function to convert the string to an integer type.
In a dynamically typed language, you have a lot more flexibility:
i = 3
s = "4"
Types are inferred, meaning they are determined from the data a variable is set to: i is ostensibly a number type and s is a string type, based on how they were set. However, when you write i + s, implicit conversion kicks in, and depending on your environment you may get i + s = 7, since s was implicitly converted to an int by the programming environment. The same operation could also produce the string "34", if the environment decides that int + string should mean concatenation rather than addition.
This flexibility has made loosely typed languages very popular. However, because these implicit conversions can produce unexpected results, they can also lead to more bugs in your code if you're not careful. In a strictly typed language, if I perform i + s, the compiler forces me to convert s into an int first, so I know that adding i to s gives 7, because I was forced to make the conversion explicit. In a dynamic language, the conversion is attempted for you implicitly, but the result may not be what you expect, since anything can be in i or s: a string, a number, or even an object. You don't know until you run your code and see what happens.
I tried to look up "strict typing" and wasn't able to find a definitive definition for the term. Perhaps it refers to a strongly typed language?
Strong typing refers to a type system in which there are restrictions on the operations that can be performed between variables of different types. For example, in a very strongly typed language, trying to add a string and a number may lead to an error.
string s;
number n;
s + n; <-- Type error.
The error may occur at compile time for statically typed languages, or at runtime for dynamically typed languages. It should be noted that although static/dynamic and strong/weak may sound like similar concepts, they are quite different.
A less strongly typed language may allow explicit casting of variables so that operations between values of different types become possible:
s + (string)n; <-- Allowed, as (number) has been explicitly
cast to (string), so the variable types match.
In a weakly typed language, variables of differing types may become automatically casted to compatible types.
s + n; <-- Allowed, where the language will cast
the (number) to (string)
Perhaps "strictly typed language" refers to a very strongly typed language, in which there are stricter restrictions on how operations can be performed on variables of different types.
There are dissenting opinions about how strong or weak various type systems are, but I've generally heard "strictly typed programming language" used to mean a very strongly typed programming language. This often describes the static type systems found in several functional languages.
Languages where variables must be declared to contain a specific type of data.
If your variable declarations look like:
String myString = "Fred";
then your language is strictly typed: the variable myString is explicitly declared to contain only string data.
If the following works:
x = 10;
x = "Fred";
then it's loosely typed (two different types of data in the same variable and scope).
Languages where '1' + 3 would be illegal, because it adds a string to an integer.
