US state resolution from unstructured text - nlp

I have a database with a "location" field that contains unconstrained user input in the form of a string. I would like to map each entry to either a US state or NULL.
For example:
'Southeastern Massachusetts' -> MA
'Brookhaven, NY' -> NY
'Manitowoc' -> WI
'Blue Springs, MO' -> MO
'A Damp & Cold Corner Of The World.' -> NULL
'Baltimore, Maryland' -> MD
'Indiana' -> IN
I can tolerate some errors, but fewer would obviously be better. What is the best way to go about this?

You may use Geonames, which provides very large lists of location names with information about them, and is free. String matching (or approximate string matching) would then probably not be too hard to implement in the simplest cases.
One difficulty you'll probably encounter is names that are ambiguous, i.e. have multiple referents (e.g. Washington: is it the state or the city?). If multiple indicators are present, you may check their coherence. Otherwise, you may check other words in the input, but this is probably risky.
IMO, this is very close to Entity Linking, followed by a search for the closest state among the entities that have been linked.
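A minimal sketch of the string-matching approach in Haskell, assuming a tiny hand-built gazetteer (a real one would be loaded from Geonames); all names here are illustrative:

```haskell
import qualified Data.Map.Strict as M
import Data.Char (isAlpha, toLower)
import Data.Maybe (listToMaybe, mapMaybe)

-- Tiny illustrative gazetteer; a real one would be built from Geonames.
-- Ambiguous entries (e.g. the abbreviation "in") are deliberately left out.
gazetteer :: M.Map String String
gazetteer = M.fromList
  [ ("massachusetts", "MA"), ("maryland", "MD"), ("md", "MD")
  , ("ny", "NY"), ("new york", "NY")
  , ("manitowoc", "WI")          -- city entries resolve bare city names
  , ("mo", "MO"), ("missouri", "MO")
  , ("indiana", "IN")
  ]

-- Lower-case the input, split on non-letters, and look up
-- bigrams first (so "new york" wins over "new"), then single tokens.
resolveState :: String -> Maybe String
resolveState s = listToMaybe (mapMaybe (`M.lookup` gazetteer) candidates)
  where
    ws         = words (map (\c -> if isAlpha c then toLower c else ' ') s)
    bigrams    = zipWith (\a b -> a ++ " " ++ b) ws (drop 1 ws)
    candidates = bigrams ++ ws
```

Strings with no match return Nothing, which corresponds to the NULL case in the question.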

For posterity: I just threw a bunch of regexps at it, which worked 'pretty alright'.

Related

Find any values from a set/list within a string

Long time lurker, first time poster. I'm hoping to get some advice from the brilliant minds in this community. In the project I'm working in, the goal is to look at a user-provided string and determine if the content of that string contains any (one or many) matches to a list of match criteria. For example:
User-provided string: "I like thing a and thing b"
Match List:
| Match Criteria | Match Type               | Category |
| Foo            | Exact (Case Insensitive)    | Bar    |
| Thing a        | Contains (Case Insensitive) | Things |
| Thing b        | Contains (Case Insensitive) | Stuff  |
In this case, it would return the following matches:
Thing a > Things
Thing b > Stuff
As of now, my approach is to iterate through the match criteria list and check each list item against the user-supplied string using the Match Type specified (Exact, Contains, Regular Expression), returning a list of the matches and then doing some stuff with that list. This approach works, even when matching ~100 rules and handling a 200-record batch, but it seems obvious that the performance will be pretty terrible if a large number of rules is introduced.
Is there a better way to do this that would be supported in Apex called by a trigger? I would love to learn a more sophisticated approach if there is one.
Thanks in advance!
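For reference, the iterate-through-the-rules approach described above can be sketched language-agnostically (shown here in Haskell, the language used elsewhere on this page; `Rule`, `MatchType`, and the field names are made up for illustration):

```haskell
import Data.Char (toLower)
import Data.List (isInfixOf)

data MatchType = Exact | Contains   -- Regular Expression omitted for brevity

data Rule = Rule
  { criteria  :: String
  , matchType :: MatchType
  , category  :: String
  }

-- Does one rule fire on the input? Comparisons are case-insensitive.
matchesRule :: String -> Rule -> Bool
matchesRule input r = case matchType r of
    Exact    -> lc input == lc (criteria r)
    Contains -> lc (criteria r) `isInfixOf` lc input
  where lc = map toLower

-- O(rules x input length) per record: fine at ~100 rules, which is
-- exactly the scaling concern raised in the question.
applyRules :: [Rule] -> String -> [String]
applyRules rules input = [ category r | r <- rules, matchesRule input r ]
```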
What do you need it for? Is it a pure Apex exercise, or is it "close" to certain standard sObjects? There are lots of built-in features around "fuzzy matching".
In no specific order...
Have you looked into "all things Einstein", from categorising leads to predicting how likely an opportunity is to close? Might not be the direction you expected to take, but who knows.
Obviously SOSL comes to mind, like what powers the global search. It automatically does some substitutions for you like Mike -> Michael
Matching rules, duplicate rules. You'd have a limit of, say, 5 active rules, but you could hook them up from Apex, including creative abuse of the system. "Dear Salesforce, let's pretend I'm making such and such Opportunity, can you find me similar Opportunities?" (plot twist: you're not making an Oppty at all, you're creating an account for some venture capitalist looking for investments that match his preferences). Give matching rules a go, if not for everything then at least for more creative fuzzy matching. You really don't want to implement Soundex, Levenshtein etc. manually...
Tags? There's a somewhat forgotten feature from SF Classic; it creates a bunch of tables (AccountTag, ContactTag). This plus SOSL could be close to what you need.
Additionally if you need this for anything close to Knowledge Base:
Data Categories come to mind
KB supports synonyms, letting you define your (not very intuitive) "thing b => stuff" mapping.
and it should survive translations

Case sensitive/insensitive comparisons for Data.Text?

I often need to do comparisons of Data.Text values with differing requirements for case sensitivity - this comes up frequently when I'm using chatter for NLP tasks.
For example, when searching tokens for Information Extraction tasks, I frequently need to search based on equality relationships that are less restrictive than standard string equality. Case sensitivity is the most common of those changes, but it's often a function of the specific token. A term like "activate" might usually be lower case, but if it's the first word in a sentence, it'll start with a leading capital, or if used in a title text may be in all caps or capitalized mid-sentence, so comparisons that ignore case make sense. Conversely, an acronym (e.g., "US") has different semantics depending on the capitalization.
That's all to say that I can't easily create a typeclass wrapper for each equality class, since it's a value-driven aspect (so the case-insensitive package doesn't look like it would work).
So far, I'm using toLower to make a canonical representation, and comparing those representations so I can create custom versions of Text comparison functions that take a sensitivity flag, e.g.:
matches :: CaseSensitive -> Text -> Text -> Bool
matches Sensitive x y = x == y
matches Insensitive x y = (T.toLower x) == (T.toLower y)
However, I'm concerned that this takes extra passes over the input text. I could imagine it fusing in some cases, but probably not all (e.g. T.isSuffixOf, T.isInfixOf).
Is there a better way to do this?
If the style of the comparison is driven by the semantics of the thing being compared, does it make sense for those semantics to be passed around with the actual text? You can then also normalise where appropriate, to avoid the repeated passes later:
data Token = Token CaseSensitive Text -- Text is all lower-case if Insensitive
  deriving Eq
and perhaps define a smart constructor:
token Sensitive t = Token Sensitive t
token Insensitive t = Token Insensitive (T.toLower t)
This implies that the acronym "US" won't ever compare equal to the word "us", but that seems logical anyway.
You might also tag the values with something more detailed like acronym/word/... rather than just Sensitive/Insensitive.
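Putting the answer's pieces together, a self-contained sketch (the `CaseSensitive` type is assumed to look like this; note that `Data.Text` also provides `toCaseFold`, which is generally preferable to `toLower` for caseless matching):

```haskell
import qualified Data.Text as T
import Data.Text (Text)

data CaseSensitive = Sensitive | Insensitive
  deriving (Eq, Show)

-- Invariant: the Text is already lower-cased when Insensitive.
data Token = Token CaseSensitive Text
  deriving (Eq, Show)

-- Smart constructor: normalise once at construction time, so every
-- later comparison is a single equality pass rather than two toLowers.
token :: CaseSensitive -> Text -> Token
token Sensitive   t = Token Sensitive t
token Insensitive t = Token Insensitive (T.toLower t)
```

With this, `token Insensitive "Activate"` equals `token Insensitive "ACTIVATE"`, while `token Sensitive "US"` never equals `token Sensitive "us"`, matching the acronym behaviour discussed above.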

Maintaining complex state in Haskell

Suppose you're building a fairly large simulation in Haskell. There are many different types of entities whose attributes update as the simulation progresses. Let's say, for the sake of example, that your entities are called Monkeys, Elephants, Bears, etc..
What is your preferred method for maintaining these entities' states?
The first and most obvious approach I thought of was this:
mainLoop :: [Monkey] -> [Elephant] -> [Bear] -> String
mainLoop monkeys elephants bears =
  let monkeys'   = updateMonkeys monkeys
      elephants' = updateElephants elephants
      bears'     = updateBears bears
  in if shouldExit monkeys elephants bears
       then "Done"
       else mainLoop monkeys' elephants' bears'
It's already ugly having each type of entity explicitly mentioned in the mainLoop function signature. You can imagine how it would get absolutely awful if you had, say, 20 types of entities. (20 is not unreasonable for complex simulations.) So I think this is an unacceptable approach. But its saving grace is that functions like updateMonkeys are very explicit in what they do: They take a list of Monkeys and return a new one.
So then the next thought would be to roll everything into one big data structure that holds all state, thus cleaning up the signature of mainLoop:
mainLoop :: GameState -> String
mainLoop gs0 =
  let gs1 = updateMonkeys gs0
      gs2 = updateElephants gs1
      gs3 = updateBears gs2
  in if shouldExit gs0
       then "Done"
       else mainLoop gs3
Some would suggest that we wrap GameState up in a State Monad and call updateMonkeys etc. in a do. That's fine. Some would rather suggest we clean it up with function composition. Also fine, I think. (BTW, I'm a novice with Haskell, so maybe I'm wrong about some of this.)
But then the problem is, functions like updateMonkeys don't give you useful information from their type signature. You can't really be sure what they do. Sure, updateMonkeys is a descriptive name, but that's little consolation. When I pass in a god object and say "please update my global state," I feel like we're back in the imperative world. It feels like global variables by another name: You have a function that does something to the global state, you call it, and you hope for the best. (I suppose you still avoid some concurrency problems that would be present with global variables in an imperative program. But meh, concurrency isn't nearly the only thing wrong with global variables.)
A further problem is this: Suppose the objects need to interact. For example, we have a function like this:
stomp :: Elephant -> Monkey -> (Elephant, Monkey)
stomp elephant monkey =
(elongateEvilGrin elephant, decrementHealth monkey)
Say this gets called in updateElephants, because that's where we check to see if any of the elephants are in stomping range of any monkeys. How do you elegantly propagate the changes to both the monkeys and elephants in this scenario? In our second example, updateElephants takes and returns a god object, so it could effect both changes. But this just muddies the waters further and reinforces my point: With the god object, you're effectively just mutating global variables. And if you're not using the god object, I'm not sure how you'd propagate those types of changes.
What to do? Surely many programs need to manage complex state, so I'm guessing there are some well-known approaches to this problem.
Just for the sake of comparison, here's how I might solve the problem in the OOP world. There would be Monkey, Elephant, etc. objects. I'd probably have class methods to do lookups in the set of all live animals. Maybe you could look up by location, by ID, whatever. Thanks to the data structures underlying the lookup functions, they'd stay allocated on the heap. (I'm assuming GC or reference counting.) Their member variables would get mutated all the time. Any method of any class would be able to mutate any live animal of any other class. E.g. an Elephant could have a stomp method that would decrement the health of a passed-in Monkey object, and there would be no need to pass that change back up the call stack.
Likewise, in an Erlang or other actor-oriented design, you could solve these problems fairly elegantly: Each actor maintains its own loop and thus its own state, so you never need a god object. And message passing allows one object's activities to trigger changes in other objects without passing a bunch of stuff all the way back up the call stack. Yet I have heard it said that actors in Haskell are frowned upon.
The answer is functional reactive programming (FRP). It is a hybrid of two coding styles: component state management and time-dependent values. Since FRP is actually a whole family of design patterns, I want to be more specific: I recommend Netwire.
The underlying idea is very simple: You write many small, self-contained components each with their own local state. This is practically equivalent to time-dependent values, because each time you query such a component you may get a different answer and cause a local state update. Then you combine those components to form your actual program.
While this sounds complicated and inefficient it's actually just a very thin layer around regular functions. The design pattern implemented by Netwire is inspired by AFRP (Arrowized Functional Reactive Programming). It's probably different enough to deserve its own name (WFRP?). You may want to read the tutorial.
In any case a small demo follows. Your building blocks are wires:
myWire :: WireP A B
Think of this as a component. It is a time-varying value of type B that depends on a time-varying value of type A, for example a particle in a simulator:
particle :: WireP [Particle] Particle
It depends on a list of particles (for example all currently existing particles) and is itself a particle. Let's use a predefined wire (with a simplified type):
time :: WireP a Time
This is a time-varying value of type Time (= Double). Well, it's time itself (starting at 0 counted from whenever the wire network was started). Since it doesn't depend on another time-varying value you can feed it whatever you want, hence the polymorphic input type. There are also constant wires (time-varying values that don't change over time):
pure 15 :: Wire a Integer
-- or even:
15 :: Wire a Integer
To connect two wires you simply use categorical composition:
integral_ 3 . 15
This gives you a clock at 15x real time speed (the integral of 15 over time) starting at 3 (the integration constant). Thanks to various class instances wires are very handy to combine. You can use your regular operators as well as applicative style or arrow style. Want a clock that starts at 10 and is twice the real time speed?
10 + 2*time
Want a particle that starts at (0, 0) with (0, 0) velocity and accelerates with (2, 1) per second per second?
integral_ (0, 0) . integral_ (0, 0) . pure (2, 1)
Want to display statistics while the user presses the spacebar?
stats . keyDown Spacebar <|> "stats currently disabled"
This is just a small fraction of what Netwire can do for you.
I know this is an old topic, but I am facing the same problem right now while trying to implement the Rail Fence cipher exercise from exercism.io. It is quite disappointing to see such a common problem get such poor attention in Haskell. I don't accept that to do something as simple as maintaining state I need to learn FRP. So I continued googling and found a solution that looks more straightforward - the State monad: https://en.wikibooks.org/wiki/Haskell/Understanding_monads/State
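A minimal sketch of the State-monad approach from that comment, with a deliberately simplified, made-up GameState (real entity lists would replace the counters); this uses Control.Monad.State from the mtl package:

```haskell
import Control.Monad.State

-- A made-up, simplified GameState: just population counts.
data GameState = GameState { monkeys :: Int, elephants :: Int }
  deriving (Eq, Show)

-- Each update is an action on the shared state; the State type at
-- least records that state is threaded, unlike a bare god object.
updateMonkeys :: State GameState ()
updateMonkeys = modify (\gs -> gs { monkeys = monkeys gs + 1 })

updateElephants :: State GameState ()
updateElephants = modify (\gs -> gs { elephants = elephants gs - 1 })

-- One simulation step, sequencing the updates in a do block.
step :: State GameState ()
step = do
  updateMonkeys
  updateElephants
```

Running one step with `execState step (GameState 1 2)` threads the state through both updates without any explicit plumbing in `step` itself.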

Achieving the right abstractions with Haskell's type system

I'm having trouble using Haskell's type system elegantly. I'm sure my problem is a common one, but I don't know how to describe it except in terms specific to my program.
The concepts I'm trying to represent are:
datapoints, each of which takes one of several forms, e.g. (id, number of cases, number of controls), (id, number of cases, population)
sets of datapoints and aggregate information: (set of id's, total cases, total controls), with functions for adding / removing points (so for each variety of point, there's a corresponding variety of set)
I could have a class of point types and define each variety of point as its own type. Alternatively, I could have one point type and a different data constructor for each variety. Similarly for the sets of points.
I have at least one concern with each approach:
With type classes: Avoiding function name collision will be annoying. For example, both types of points could use a function to extract "number of cases", but the type class can't require this function because some other point type might not have cases.
Without type classes: I'd rather not export the data constructors from, say, the Point module (providing other, safer functions to create a new value). Without the data constructors, I won't be able to determine of which variety a given Point value is.
What design might help minimize these (and other) problems?
To expand a bit on sclv's answer, there is an extended family of closely-related concepts that amount to providing some means of deconstructing a value: Catamorphisms, which are generalized folds; Church-encoding, which represents data by its operations, and is often equivalent to partially applying a catamorphism to the value it deconstructs; CPS transforms, where a Church encoding resembles a reified pattern match that takes separate continuations for each case; representing data as a collection of operations that use it, usually known as object-oriented programming; and so on.
In your case, what you seem to want is an abstract type, i.e. one that doesn't export its internal representation, but not a completely sealed one, i.e. one that leaves the representation open to functions in the module that defines it. This is the same pattern followed by things like Data.Map.Map. You probably don't want to go the type class route, since it sounds like you need to work with a variety of data points, rather than on an arbitrary choice of a single type of data point.
Most likely, some combination of "smart constructors" to create values, and a variety of deconstruction functions (as described above) exported from the module is the best starting point. Going from there, I expect most of the remaining details should have an obvious approach to take next.
With the latter solution (no type classes), you can export a catamorphism on the type rather than the constructors:
data MyData = PointData Double Double
            | ControlData Double Double Double
            | SomeOtherData String Double

foldMyData pf cf sf d = case d of
  PointData x y     -> pf x y
  ControlData x y z -> cf x y z
  SomeOtherData s x -> sf s x
That way you have a way to pull your data apart into whatever you want (including just ignoring the values and passing functions that return what type of constructor you used) without providing a general way to construct your data.
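For instance, a consumer outside the module can compute over a `MyData` without ever seeing the constructors (the type and catamorphism from the answer are restated here so the sketch is self-contained):

```haskell
data MyData = PointData Double Double
            | ControlData Double Double Double
            | SomeOtherData String Double

-- The catamorphism: one function argument per constructor.
foldMyData :: (Double -> Double -> r)
           -> (Double -> Double -> Double -> r)
           -> (String -> Double -> r)
           -> MyData -> r
foldMyData pf cf sf d = case d of
  PointData x y     -> pf x y
  ControlData x y z -> cf x y z
  SomeOtherData s x -> sf s x

-- Sum the numeric fields of any variety - no constructors needed.
total :: MyData -> Double
total = foldMyData (+) (\x y z -> x + y + z) (\_ x -> x)
```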
I find the type-classes-based approach better as long as you are not going to mix different data points in a single data structure.
The name collision problem you mentioned can be solved by creating a separate type class for each distinct field, like this:
class WithCases p where
  cases :: p -> NumberOfCases

What's the most idiomatic approach to multi-index collections in Haskell?

In C++ and other languages, add-on libraries implement a multi-index container, e.g. Boost.Multiindex. That is, a collection that stores one type of value but maintains multiple different indices over those values. These indices provide for different access methods and sorting behaviors, e.g. map, multimap, set, multiset, array, etc. Run-time complexity of the multi-index container is generally the sum of the individual indices' complexities.
Is there an equivalent for Haskell or do people compose their own? Specifically, what is the most idiomatic way to implement a collection of type T with both a set-type of index (T is an instance of Ord) as well as a map-type of index (assume that a key value of type K could be provided for each T, either explicitly or via a function T -> K)?
I just uploaded IxSet to hackage this morning,
http://hackage.haskell.org/package/ixset
ixset provides sets which have multiple indexes.
ixset has been around for a long time as happstack-ixset. This version removes the dependencies on anything happstack specific, and is the new official version of IxSet.
Another option would be kdtree:
darcs get http://darcs.monoid.at/kdtree
kdtree aims to improve on IxSet by offering greater type-safety and better time and space usage. The current version seems to do well on all three of those aspects -- but it is not yet ready for prime time. Additional contributors would be highly welcomed.
In the trivial case where every element has a unique key that's always available, you can just use a Map and extract the key to look up an element. In the slightly less trivial case where each value merely has a key available, a simple solution would be something like Map K (Set T). Looking up an element directly would then involve first extracting the key, indexing the Map to find the set of elements that share that key, then looking up the one you want.
For the most part, if something can be done straightforwardly in the above fashion (simple transformation and nesting), it probably makes sense to do it that way. However, none of this generalizes well to, e.g., multiple independent keys or keys that may not be available, for obvious reasons.
Beyond that, I'm not aware of any widely-used standard implementations. Some examples do exist, for example IxSet from happstack seems to roughly fit the bill. I suspect one-size-kinda-fits-most solutions here are liable to have a poor benefit/complexity ratio, so people tend to just roll their own to suit specific needs.
Intuitively, this seems like a problem that might work better not as a single implementation, but rather a collection of primitives that could be composed more flexibly than Data.Map allows, to create ad-hoc specialized structures. But that's not really helpful for short-term needs.
For this specific question, you can use a Bimap. In general, though, I'm not aware of any common class for multimaps or multiply-indexed containers.
I believe that the simplest way to do this is simply with Data.Map. Although it is designed to use single indices, when you insert the same element multiple times, most compilers (certainly GHC) will make the values point to the same place. A separate implementation of a multimap wouldn't be that efficient, since you want to find elements based on their index, so you cannot naively associate each element with multiple indices - say [([key], value)] - as this would be very inefficient.
However, I have not looked at the Boost implementations of Multimaps to see, definitively, if there is an optimized way of doing so.
Have I got the problem straight? Both T and K have an order. There is a function key :: T -> K but it is not order-preserving. It is desired to manage a collection of Ts, indexed (for rapid access) both by the T order and the K order. More generally, one might want a collection of T elements indexed by a bunch of orders key1 :: T -> K1, .. keyn :: T -> Kn, and it so happens that here key1 = id. Is that the picture?
I think I agree with gereeter's suggestion that the basis for a solution is just to maintain in sync a bunch of (Map K1 T, .. Map Kn T). Inserting a key-value pair in a map duplicates neither the key nor the value, allocating only the extra heap required to make a new entry in the right place in the index. Inserting the same value, suitably keyed, in multiple indices should not break sharing (even if one of the keys is the value). It is worth wrapping the structure in an API which ensures that any subsequent modifications to the value are computed once and shared, rather than recomputed for each entry in an index.
Bottom line: it should be possible to maintain multiple maps, ensuring that the values are shared, even though the key-orders are separate.
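A sketch of that multi-index idea behind a small API, for the two-index case from the question (all names are made up; Data.Map and Data.Set entries share the stored values rather than copying them):

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- A collection of t indexed both by t's own Ord instance (the set)
-- and by a derived key of type k (the map).
data Indexed k t = Indexed
  { byKey  :: M.Map k t
  , bySelf :: S.Set t
  } deriving (Eq, Show)

emptyIx :: Indexed k t
emptyIx = Indexed M.empty S.empty

-- Insert into both indices; both entries refer to the same t on the heap.
insertIx :: (Ord k, Ord t) => (t -> k) -> t -> Indexed k t -> Indexed k t
insertIx key t (Indexed bk bs) = Indexed (M.insert (key t) t bk) (S.insert t bs)

-- Map-style access via the derived key.
lookupKey :: Ord k => k -> Indexed k t -> Maybe t
lookupKey k = M.lookup k . byKey

-- Set-style membership via t's own order.
memberIx :: Ord t => t -> Indexed k t -> Bool
memberIx t = S.member t . bySelf
```

The API keeps the two indices in sync by construction; generalising to n key functions means one more Map field (and one more M.insert) per index.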
