Case sensitive/insensitive comparisons for Data.Text? - haskell

I often need to do comparisons of Data.Text values with differing requirements for case sensitivity - this comes up frequently when I'm using chatter for NLP tasks.
For example, when searching tokens for Information Extraction tasks, I frequently need to search based on equality relationships that are less restrictive than standard string equality. Case sensitivity is the most common of those changes, but it's often a function of the specific token. A term like "activate" might usually be lower case, but if it's the first word in a sentence it'll start with a capital, and in title text it may be in all caps or capitalized mid-sentence, so comparisons that ignore case make sense. Conversely, an acronym (e.g., "US") has different semantics depending on the capitalization.
That's all to say that I can't easily create a typeclass wrapper for each equality class, since it's a value-driven aspect (so the case-insensitive package doesn't look like it would work).
So far, I'm using toLower to make a canonical representation and comparing those representations, creating custom versions of the Text comparison functions that take a sensitivity flag, e.g.:
matches :: CaseSensitive -> Text -> Text -> Bool
matches Sensitive x y = x == y
matches Insensitive x y = (T.toLower x) == (T.toLower y)
However, I'm concerned that this takes extra passes over the input text. I could imagine it fusing in some cases, but probably not all (e.g., T.isSuffixOf, T.isInfixOf).
Is there a better way to do this?

If the style of the comparison is driven by the semantics of the thing being compared, does it make sense for those semantics to be passed around with the actual text? You can also then normalise where appropriate to avoid the repeated passes later:
data Token = Token CaseSensitive Text -- Text is all lower-case if Insensitive
  deriving Eq
and perhaps define a smart constructor:
token :: CaseSensitive -> Text -> Token
token Sensitive t = Token Sensitive t
token Insensitive t = Token Insensitive (T.toLower t)
This implies that the acronym "US" won't ever compare equal to the word "us", but that seems logical anyway.
You might also tag the values with something more detailed like acronym/word/... rather than just Sensitive/Insensitive.
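For instance, here is a minimal sketch of that richer tagging (the TokenKind name and its normalisation rules are my assumptions, not part of the answer above):

import Data.Text (Text)
import qualified Data.Text as T

data TokenKind = Acronym | Word deriving (Eq, Show)

data Token = Token TokenKind Text deriving (Eq, Show)

-- Normalise once at construction: words compare case-insensitively,
-- acronyms keep their exact spelling.
token :: TokenKind -> Text -> Token
token Acronym t = Token Acronym t
token Word t = Token Word (T.toLower t)

With the derived Eq, token Word "Us" == token Word "us", while token Acronym "US" compares equal to neither of them, since the tags differ.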

Related

How to prove the correctness of a given grammar?

I am wondering how programming language developers validate and prove that their grammar is correct. Suppose that I created a new grammar for a new language. I can test my grammar with a unit-test tool by providing different kinds of test programs. However, I will never be 100% sure that my grammar is correct. How do language developers ensure that their grammar is correct in the real world?
Let's say I created a grammar for a new language using pencil and paper. However, I made a mistake and my grammar accepts expressions that end with a +, like 2+2+. If I don't find the mistake, I will implement my language using this incorrect grammar. After implementation and unit testing, I can find the error. Is it possible to find it before starting any implementation?
Certainly, I can try my grammar on some sample inputs using pencil and paper (derivations etc.), but I may miss some corner cases. Is there a better approach, and how do real-world language developers test their grammars?
A proof is a logical argument that demonstrates the truth of a claim. There are as many ways to prove something as there are ways of thinking about a problem. A common way to prove things about discrete structures (like grammars) is using mathematical induction. Basically, you show that something is true in base cases - the simplest cases possible - and then show that if it's true for all cases under a certain size, it must therefore be true for cases of the next size.
In our case: suppose we wanted only to prove your grammar didn't generate + at the end of a word. We could do induction on the number of productions used in constructing a string in the language. We would identify all relevant base cases, show the property holds for these strings, and then show that longer strings in the language are constructed in such a way that it is impossible to get a + at the end. Here's an example.
S := S + S | (S) | x
Base case: the shortest string in the language is x, generated as S -> x. It does not end with a +.
Induction hypothesis: assume all strings produced using up to and including k productions do not end with +.
Induction step: we must show that strings produced using k + 1 productions do not end with +. If we apply the rule (S) to any string generated from S, we do not add + at the end, so the property holds. If we apply S + S to strings generated from S, the last symbol of S + S is the last symbol of a shorter string (at least two symbols shorter) generated from S. By the induction hypothesis, that string did not end in +, so neither does this one. There are no other productions, so no string in the language ends in +. QED
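Short of a full proof, you can also get machine help before implementing anything: encode the grammar as a random generator and test the property. A minimal QuickCheck sketch in Haskell (genS and the property name are mine, and the depth bound is arbitrary):

import Test.QuickCheck

-- Strings of S := S + S | (S) | x, with a depth bound so
-- generation always terminates.
genS :: Int -> Gen String
genS 0 = pure "x"
genS n = oneof
  [ pure "x"
  , (\a b -> a ++ "+" ++ b) <$> genS (n - 1) <*> genS (n - 1)
  , (\s -> "(" ++ s ++ ")") <$> genS (n - 1)
  ]

-- The property proved by induction above: no string ends in '+'.
prop_noTrailingPlus :: Property
prop_noTrailingPlus = forAll (genS 5) (\s -> last s /= '+')

main :: IO ()
main = quickCheck prop_noTrailingPlus

Random testing like this proves nothing, but it catches gross mistakes (like a grammar accepting 2+2+) cheaply, before any real implementation exists.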

US state resolution from unstructured text

I have a database with a "location" field that contains unconstrained user input in the form of a string. I would like to map each entry to either a US state or NULL.
For example:
'Southeastern Massachusetts' -> MA
'Brookhaven, NY' -> NY
'Manitowoc' -> WI
'Blue Springs, MO' -> MO
'A Damp & Cold Corner Of The World.' -> NULL
'Baltimore, Maryland' -> MD
'Indiana' -> IN
I can tolerate some errors but fewer would obviously be better. What is the best way to go about this?
You may use Geonames, which provides very large lists of location names with information about them, and is free. String matching (or approximate string matching) against those lists would then probably not be too hard to implement for the simplest cases.
One difficulty you'll probably encounter is names which are ambiguous, i.e. have multiple referents (e.g. Washington: is it the state or the city?). If multiple indicators are present, you can check their coherence. Otherwise, you may check other words in the input, but this is probably risky.
IMO, this is very close to Entity Linking, followed by a search for the closest state among the entities that have been linked.
For posterity: I just threw a bunch of regexps at it, which worked 'pretty alright'.
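In the same spirit, the easy cases in the examples (a full state name or a bare two-letter code somewhere in the string) fall out of a plain lookup table. A toy sketch in Haskell, with the table truncated to the states mentioned above:

import Control.Applicative ((<|>))
import Data.Char (isAlpha, toUpper)
import Data.List (find, isInfixOf)
import qualified Data.Map as M

-- Illustrative; a real table would list all 50 states.
states :: M.Map String String
states = M.fromList
  [ ("MASSACHUSETTS", "MA"), ("NEW YORK", "NY"), ("MARYLAND", "MD")
  , ("MISSOURI", "MO"), ("INDIANA", "IN"), ("WISCONSIN", "WI") ]

-- Match a full state name anywhere in the string,
-- otherwise look for a bare two-letter code among the tokens.
resolve :: String -> Maybe String
resolve input = byName <|> byCode
  where
    up     = map toUpper input
    tokens = words (map (\c -> if isAlpha c then c else ' ') up)
    byName = snd <$> find (\(name, _) -> name `isInfixOf` up) (M.toList states)
    byCode = find (`elem` M.elems states) tokens

A case like 'Manitowoc' still needs a place-name gazetteer such as Geonames, and bare codes are a false-positive hazard ('IN' and 'OR' are common English words).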

How to generate a random alphanumeric string with Erlang?

I'm trying to generate a random alphanumeric ID with Erlang.
I naively tried crypto:strong_rand_bytes(Bytes) to generate a random binary and then used that binary like it was created with <<"my_unique_random_id">> - which didn't work because random bits are not necessarily a valid UTF-8 string, right?
Well, I looked for other options in the erlang docs and elsewhere, but I didn't find anything. Could someone point me to a solution?
It might depend on the randomness you need. Erlang's crypto module produces stronger random data than the random module (see also [erlang-questions] Yaws security alert - Yaws 1.93 and this question). If you want to use strong_rand_bytes to generate an ID, taking the base64 of it might be enough:
> base64:encode(crypto:strong_rand_bytes(Bytes)).
You could turn this into a list if needed.
According to Generating random strings in Erlang it only takes a few lines of Erlang to generate a string of a specified length from a certain set of characters.
get_random_string(Length, AllowedChars) ->
    %% Note: the random module is deprecated; modern Erlang uses rand:uniform/1.
    lists:foldl(fun(_, Acc) ->
                    [lists:nth(random:uniform(length(AllowedChars)),
                               AllowedChars)] ++ Acc
                end, [], lists:seq(1, Length)).
The blog post has a line-by-line explanation of the code. Look to the comments for a couple of optimization tips.
I have prepared a small module to do this.
It also uses crypto:rand_uniform/2 rather than the obsolete random:uniform.
-module(cloud_rnd).
-export([rnd_chars/1, rnd_numbers/1, rnd_chars_numbers/1]).
rnd_chars(L) -> get_rnd(L, chars).
rnd_numbers(L) -> get_rnd(L, numbers).
rnd_chars_numbers(L) -> get_rnd(L, chars_numbers).
get_rnd(L, chars) -> gen_rnd(L, "abcdefghijklmnopqrstuvwxyz");
get_rnd(L, numbers) -> gen_rnd(L, "1234567890");
get_rnd(L, chars_numbers) -> gen_rnd(L, "abcdefghijklmnopqrstuvwxyz1234567890").
gen_rnd(Length, AllowedChars) ->
    MaxLength = length(AllowedChars),
    %% crypto:rand_uniform(Lo, Hi) returns N with Lo =< N < Hi, so the upper
    %% bound must be MaxLength + 1 or the last character is never chosen.
    lists:foldl(
        fun(_, Acc) -> [lists:nth(crypto:rand_uniform(1, MaxLength + 1), AllowedChars)] ++ Acc end,
        [], lists:seq(1, Length)
    ).
The problem with responses to the various "I need random strings" questions (in whatever language) is that almost every solution uses a flawed specification, namely, string length. The questions themselves rarely reveal why the random strings are needed, but I will boldly assume they are to be used as identifiers which need to be unique.
There are two leading ways to get strictly unique strings: deterministically (which is not random) and store/compare (which is onerous). What to do? Give up the ghost. Go with probabilistic uniqueness instead. That is, accept that there is some (however small) risk that your strings won't be unique. This is where understanding collision probability and entropy are helpful.
So I'll rephrase my bold assumption: you need some number of identifiers with a small risk of repeat. As a concrete example, let's say you need 5 million IDs with a less than 1 in a trillion risk of repeat. So what length of string do you need? Well, that question is underspecified, as it depends on the characters used. But more importantly, it's misguided. What you need is a specification of the entropy of the strings, not their length.
This is where EntropyString can help.
Bits = entropy_string:bits(5.0e6, 1.0e12).
83.37013046707142
entropy_string:random_string(Bits).
<<"QDrjGQFGgGjJ4t9r2">>
There are other predefined character sets, and you can specify your own characters as well (though for efficiency reasons only character sets whose size is a power of 2 are supported). And best of all, the risk of repeat in the specified number of strings is explicit. No more guessing with string length.
%% Builds a list of N random lowercase letters (96 + 1..26 gives $a..$z),
%% so this is letters only, and the random module is deprecated (use rand).
randchar(N) ->
    randchar(N, []).

randchar(0, Acc) ->
    Acc;
randchar(N, Acc) ->
    randchar(N - 1, [random:uniform(26) + 96 | Acc]).
You may use function uef_bin:random_latin_binary/2 from here:
https://github.com/DOBRO/uef-lib#uef_binrandom_latin_binary2
Bin = uef_bin:random_latin_binary(Length, any)
And then, if you need a string() type:
String = erlang:binary_to_list(Bin)

Achieving the right abstractions with Haskell's type system

I'm having trouble using Haskell's type system elegantly. I'm sure my problem is a common one, but I don't know how to describe it except in terms specific to my program.
The concepts I'm trying to represent are:
datapoints, each of which takes one of several forms, e.g. (id, number of cases, number of controls), (id, number of cases, population)
sets of datapoints and aggregate information: (set of id's, total cases, total controls), with functions for adding / removing points (so for each variety of point, there's a corresponding variety of set)
I could have a class of point types and define each variety of point as its own type. Alternatively, I could have one point type and a different data constructor for each variety. Similarly for the sets of points.
I have at least one concern with each approach:
With type classes: Avoiding function name collision will be annoying. For example, both types of points could use a function to extract "number of cases", but the type class can't require this function because some other point type might not have cases.
Without type classes: I'd rather not export the data constructors from, say, the Point module (providing other, safer functions to create a new value). Without the data constructors, I won't be able to determine which variety a given Point value is.
What design might help minimize these (and other) problems?
To expand a bit on sclv's answer, there is an extended family of closely-related concepts that amount to providing some means of deconstructing a value: Catamorphisms, which are generalized folds; Church-encoding, which represents data by its operations, and is often equivalent to partially applying a catamorphism to the value it deconstructs; CPS transforms, where a Church encoding resembles a reified pattern match that takes separate continuations for each case; representing data as a collection of operations that use it, usually known as object-oriented programming; and so on.
In your case, what you seem to want is an abstract type, i.e. one that doesn't export its internal representation, but not a completely sealed one, i.e. one that leaves the representation open to functions in the module that defines it. This is the same pattern followed by things like Data.Map.Map. You probably don't want to go the type class route, since it sounds like you need to work with a variety of data points, rather than with an arbitrary choice of a single type of data point.
Most likely, some combination of "smart constructors" to create values, and a variety of deconstruction functions (as described above) exported from the module is the best starting point. Going from there, I expect most of the remaining details should have an obvious approach to take next.
With the latter solution (no type classes), you can export a catamorphism on the type rather than the constructors:
data MyData = PointData Double Double | ControlData Double Double Double | SomeOtherData String Double
foldMyData pf cf sf d = case d of
    (PointData x y) -> pf x y
    (ControlData x y z) -> cf x y z
    (SomeOtherData s x) -> sf s x
That way you have a way to pull your data apart into whatever you want (including just ignoring the values and passing functions that return what type of constructor you used) without providing a general way to construct your data.
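A throwaway example of using that fold (assuming the MyData and foldMyData definitions above):

-- Recover which constructor was used, without exporting constructors:
constructorName :: MyData -> String
constructorName = foldMyData
    (\_ _ -> "PointData")
    (\_ _ _ -> "ControlData")
    (\_ _ -> "SomeOtherData")

-- Or compute with the payloads directly:
total :: MyData -> Double
total = foldMyData (+) (\x y z -> x + y + z) (\_ x -> x)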
I find the type-classes-based approach better as long as you are not going to mix different data points in a single data structure.
The name collision problem you mentioned can be solved by creating a separate type class for each distinct field, like this:
class WithCases p where
    cases :: p -> NumberOfCases
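A minimal sketch of how that plays out with two point varieties (the record types and the NumberOfCases alias here are illustrative, not from the question):

type NumberOfCases = Int

data CaseControl = CaseControl { ccId :: Int, ccCases :: NumberOfCases, ccControls :: Int }

data CasePop = CasePop { cpId :: Int, cpCases :: NumberOfCases, cpPopulation :: Int }

-- Only varieties that actually have a "number of cases" field opt in:
instance WithCases CaseControl where
    cases = ccCases

instance WithCases CasePop where
    cases = cpCases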

Why do programming languages use commas to separate function parameters?

It seems like all programming languages use commas (,) to separate function parameters.
Why don't they use just spaces instead?
Absolutely not. What about this function call:
function(a, b - c);
How would that look with a space instead of the comma?
function(a b - c);
Does that mean function(a, b - c); or function(a, b, -c);? The use of the comma presumably comes from mathematics, where commas have been used to separate function parameters for centuries.
First of all, your premise is false. There are languages that use space as a separator (Lisp, ML, Haskell, possibly others).
The reason that most languages don't is probably that a) f(x,y) is the notation most people are used to from mathematics and b) using spaces leads to lots of nested parentheses (also called "the lisp effect").
Lisp-like languages use: (f arg1 arg2 arg3) which is essentially what you're asking for.
ML-like languages use concatenation to apply curried arguments, so you would write f arg1 arg2 arg3.
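In such languages a compound argument just gets its own parentheses, which is how the b - c ambiguity from the first answer gets resolved. A throwaway Haskell example:

-- Application is juxtaposition and binds tighter than any operator.
f :: Int -> Int -> Int
f a b = a * b

ok :: Int
ok = f 3 (7 - 2)   -- passes the arguments 3 and (7 - 2); evaluates to 15

-- Without the parentheses, f 3 7 - 2 parses as (f 3 7) - 2 instead.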
Tcl uses space as a separator between words passed to commands. Where it has a composite argument, that has to be bracketed or otherwise quoted. Mind you, even there you will find the use of commas as separators – in expression syntax only – but that's because the notation is in common use outside of programming. Mathematics has written n-ary function applications that way for a very long time; computing (notably Fortran) just borrowed.
You don't have to look further than most of our natural languages to see that the comma is used for separating items in lists. So, using anything other than a comma for enumerating parameters would be unexpected for anyone learning a programming language for the first time.
There are a number of historical reasons already pointed out.
Also, in most languages where , serves as the separator, whitespace sequences are largely ignored, or to be more exact, although they may separate tokens, they do not act as tokens themselves. This is more or less true for all languages deriving their syntax from C. A sequence of whitespace is much like the empty word, and having the empty word delimit anything is probably not the best of ideas.
Also, I think the comma is clearer and easier to read. Why have whitespace, an invisible character that essentially serves no purpose but formatting, as a really meaningful delimiter? It only introduces ambiguity. One example is that provided by Carl.
A second would be f(a (b + c)). Now is that f(a(b+c)) or f(a, b+c)?
The creators of JavaScript had an idea very similar to yours, which yields just the same problems. The idea was that ENTER could also serve as ;, if the statement was complete. Observe:
function a() {
return "some really long string or expression or whatsoever";
}
function b() {
return
"some really long string or expression or whatsoever";
}
alert(a());//"some really long string or expression or whatsoever"
alert(b());//"undefined" or "null" or whatever, because 'return;' is a valid statement
As a matter of fact, I sometimes tend to use the latter notation in languages that do not have this 'feature'. JavaScript forces a way of formatting my code upon me, because someone had the cool idea of using ENTER instead of ;.
I think there are a number of good reasons why most languages are the way they are. Especially in dynamic languages (such as PHP), there is no compile-time check where the compiler could warn you that the way it resolved an ambiguity like the one above doesn't match the signature of the call you want to make. You'd have a lot of weird runtime errors and a really hard life.
There are languages which allow this, but there are a number of reasons why they do so. First and foremost, because a bunch of very clever people sat down, spent quite some time designing a language, and then discovered that its syntax makes the , obsolete most of the time, and thus decided to eliminate it.
This may sound a bit glib, but I gather it's for the same reason most earth-planet languages use it (English, French, and those few others ;-). Also, it is intuitive to most.
Haskell doesn't use commas.
Example
multList :: [Int] -> Int -> [Int]
multList (x : xs) y = (x * y) : (multList xs y)
multList [] _ = []
The reason for using commas in C/C++ is that reading a long argument list can be difficult without a separator.
Try reading this
void foo(void * ptr point & * big list<pointers<point> > * t)
Commas are useful like spaces are. In Latin, nothing was written with spaces, periods, or lower-case letters.
Try reading this
IAMTHEVERYMODELOFAWHATDOYOUWANTNOTHATSMYBUCKET
It's primarily to help you read things.
This is not true. Some languages don't use commas. Functions were maths concepts before they were programming constructs, so some languages keep the old notation. And most of the newer languages have been inspired by C (JavaScript, Java, C#, and PHP too); they share some formal rules, like the comma.
While some languages do use spaces, using a comma avoids ambiguous situations without the need for parentheses. A more interesting question might be why C uses the same character as an argument separator as it uses for the "a then b" comma operator, especially given that the C character set has at least three other characters that do not appear in any context (dollar sign, commercial-at, and grave), and I know at least one of those (the dollar sign) dates back to the 40-character punchcard set.
It seems like all programming languages use commas (,) to separate function parameters.
In natural languages that include the comma in their script, that character is used to separate things. For instance, if you were to enumerate fruits, you'd write "lemon, orange, strawberry, grape", that is, using the comma.
Hence, using the comma to separate parameters in a function is more natural than using another character (| for instance).
Consider:
someFunction( name, age, location )
vs.
someFunction( name|age|location )
Why don't they use just spaces instead?
That's possible. Lisp does it.
The main reason is that space is already used to separate tokens, and it's easier not to assign it extra functionality.
I have programmed in quite a few languages, and while the comma does not rule supreme, it is certainly out in front. The comma is good because it is a visible character, so a script can be compressed by removing spaces without breaking things. If you allow spaces then you can have tabs, and that can be a pain in the ... There are issues with newlines and spaces at the end of a line. Give me a comma any day: you can see it and you know what it does. Spaces are for readability (generally) and commas are part of the syntax. Mind you, there are plenty of exceptions where a space is required or de rigueur. I also like curly brackets.
It is probably tradition. If they used space, they could not pass an expression as a parameter, e.g.:
f(a-b c)
would be very different from
f(a -b c)
Some languages, like Boo, allow you to specify the type of parameters or leave it out, like so:
def MyFunction(obj1, obj2, title as String, count as Int):
    ...do stuff...
Meaning: obj1 and obj2 can be of any type (inherited from object), whereas title and count must be of type String and Int respectively. This would be hard to do using spaces as separators.
