I am interested in adding a search for a city, but I am not 100% sure what part of the geo capsule to use. It seems like SearchTerm is for distinct points, like a specific address (at least that is what NamedPoint seems to be), while SearchRegion seems to be more like a region or city.
The SearchTerm section mentions a city, Mountain View (which I think is not a single point in space but a two-dimensional area; maybe it means the center point of Mountain View?):
Utterances such as "SFO", "60 South Market Street", "Mountain View"
and "Golden Gate Bridge" can all be trained with SearchTerm. Your capsule does not have to handle the search action, but instead just needs one or more action that takes a NamedPoint as an input.
Adding to my confusion, I might not understand the difference between concepts you train on and inputs. The SearchRegion section says that's the one to use if you train on named points:
If you have trained on named points or divisions, you should provide
actions that take SearchRegion concepts as inputs.
I thought that training was done on inputs, but is there a difference between "training on a named point" and "a NamedPoint input"? NamedPoint inputs seem to go with SearchTerm, while named-point trainings go with SearchRegion.
Does anyone have an understanding on when to use one over the other?
Both SearchRegion and SearchTerm can help you search for a city.
As mentioned in the docs, you want to use SearchRegion when you expect your users to use LocalityName or NeighborhoodName etc.
You may want to refer to this for more clarity on what each of the divisions means.
I'm wondering if there is a way (a specific package, process, etc.) to group items under an overall category. For example, I'm looking at empty search results and want to see which category customers are most interested in.
Let's say I have a list of searched terms: skittles, laundry, snickers and detergent. I would want to group these items based on a broader category (i.e., skittles and snickers are "candy" and laundry and detergent would be "cleaners").
I've done some research on this and have seen similar (but not exact) approaches (e.g., grouping by common keywords using NLP), but I'm not sure whether something like this exists when the terms don't necessarily share any surface commonality. Any help or direction would be greatly appreciated.
Update here: The best way to handle this scenario is to use pretrained word embeddings from something like Google's BERT model as a first pass, and then layer on another ML model that is specific to the use case.
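For example, here is a minimal sketch of that first pass, assuming the sentence-transformers and scikit-learn packages are available (the model name and cluster count are placeholder choices, not recommendations):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

terms = ["skittles", "laundry", "snickers", "detergent"]

# Embed each search term into a dense vector with a pretrained model.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
embeddings = model.encode(terms)

# Cluster the vectors; semantically similar terms should land together.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

for term, label in zip(terms, labels):
    print(label, term)  # ideally skittles/snickers share one cluster
```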
I'm learning about natural language processing and I'm wondering if someone could point me in the right direction.
Say I have a bunch of contracts, and they have something like:
Joe's Farm, hereafter known as the seller,
and Bob's supermarket, hereafter known as the buyer, blah blah..
I'd like to be able to identify which party is the buyer and seller in this sentence. From what I have read, it should be theoretically possible to:
1. Give the AI a lot of sample sentences and tell it "this is the buyer/seller".
2. After training, it should be able to analyze a new sentence.
I have tried some entity extraction (tokenizing the sentence and identifying the party names) but I don't know how to tell it "this party is the buyer".
One workaround is to identify segments of the sentence and search if that has the word "buyer" in it... which probably works in most cases, but I want to try to do this in an "AI" way.
Can anyone point me in the right direction on what to research?
Looks like a coreference resolution problem to me.
Stanford CoreNLP may be a good starting point. It comes with a deterministic, a statistical, and a neural system, as well as pre-trained models.
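For instance, here is a minimal sketch using the pycorenlp wrapper, assuming a CoreNLP server is already running locally on port 9000:

```python
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")  # assumes a running server

text = ("Joe's Farm, hereafter known as the seller, and Bob's Supermarket, "
        "hereafter known as the buyer, agree to the following.")

# Request coreference chains from the server.
output = nlp.annotate(text, properties={
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
    "outputFormat": "json",
})

# Each chain groups mentions of one entity, e.g. "Joe's Farm" and
# "the seller" ideally end up in the same chain.
for chain in output["corefs"].values():
    print([mention["text"] for mention in chain])
```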
How you solve that problem will depend on several factors, most importantly:
Do all contracts have the same format?
After specifying the seller and the buyer as you showed in your example, do their real names appear again in the text? Or are they just referred to as the seller and the buyer?
With that in mind, let's assume that the contracts have various formats and that the proper/real names of the seller and the buyer appear only once in the text somewhere in the introduction of the contract. This second assumption simplifies the problem (and is more likely to be the case in real-world contracts).
I would tackle the problem in three steps:
1. Teach the program how to identify the introduction of the contract (i.e. the paragraph in which the contract says who is who; kinda like your example sentences).
2. Split the introduction into two parts: the part/sentence(s) where the seller is defined and the part/sentence(s) where the buyer is defined.
3. Finally, look into the seller part to find who the seller is and the buyer part to find who the buyer is.
To solve the 1st step, a small training dataset would be necessary. If not available, you could manually identify introductions of several contracts and use them as your training dataset. From here, Naive Bayes would probably be the simplest way to identify if a section of the contract is an introduction or not (you can randomly divide the contract into multiple chunks). Naive Bayes relies only on the frequency of tokens (not the ordering). You can read more here.
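A minimal sketch of that classifier with scikit-learn, where the hand-labelled chunks are obviously placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data: contract chunks labelled 1 if the chunk is
# an introduction (i.e. it defines the parties), 0 otherwise.
chunks = [
    "Joe's Farm, hereafter known as the seller, and Bob's Supermarket, "
    "hereafter known as the buyer, agree as follows.",
    "Payment shall be made within thirty days of delivery.",
    "ACME Ltd, hereinafter referred to as the vendor, and John Doe, "
    "hereinafter referred to as the purchaser, enter into this agreement.",
    "This agreement shall be governed by the laws of the state.",
]
labels = [1, 0, 1, 0]

# Bag-of-words token frequencies (order is ignored) fed to Naive Bayes.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(chunks, labels)

new_chunk = "XYZ Corp, hereafter known as the seller, and John Smith..."
print(clf.predict([new_chunk]))  # [1] = looks like an introduction
```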
To solve the 2nd step, I would pretty much repeat what's done in the first step: use a dataset to "classify" sections of the introduction as the seller part and the buyer part. Though, this step will probably require more accuracy than the first one. So, I suggest using an n-gram language model, which looks at the frequency of tokens but also their ordering and succession. You can read more here. For the n-gram order, you want something in between: not too short (1-gram = Naive Bayes) and not too long (> ~6-gram) to avoid too much overlap between the seller and buyer sentences.
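A rough sketch of that comparison using NLTK's language-model module, with Laplace smoothing so unseen n-grams don't get zero probability (the training sentences are invented placeholders):

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams
from nltk.tokenize import word_tokenize

ORDER = 3  # trigrams: short enough to generalize, long enough to capture order

def train_lm(sentences):
    tokenized = [word_tokenize(s.lower()) for s in sentences]
    train, vocab = padded_everygram_pipeline(ORDER, tokenized)
    lm = Laplace(ORDER)
    lm.fit(train, vocab)
    return lm

# Placeholder seller-defining and buyer-defining sentences.
seller_lm = train_lm(["Joe's Farm, hereafter known as the seller",
                      "ACME Ltd, hereinafter referred to as the vendor"])
buyer_lm = train_lm(["Bob's Supermarket, hereafter known as the buyer",
                     "John Doe, hereinafter referred to as the purchaser"])

# Assign a new sentence to whichever model finds it less surprising.
sent = word_tokenize("XYZ Corp, hereafter known as the seller".lower())
ngrams = list(padded_everygrams(ORDER, sent))
side = "seller" if seller_lm.perplexity(ngrams) < buyer_lm.perplexity(ngrams) else "buyer"
print(side)
```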
For the 3rd and last step, I cannot think of a straightforward way, but I would remove the stop words (i.e. frequent English words) first. Then, I would try to find rare words in close proximity to the target terms (buyer and seller). Since we are assuming that the real names only appear once in the contract, that could be a rule that helps you identify them.
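A crude sketch of that last step: drop stop words, then collect capitalized tokens (as a cheap proxy for "rare, name-like" words) within a window around the target term; the window size is an arbitrary assumption:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def candidates_near(text, target, window=8):
    stops = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    names = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        # Inspect tokens within `window` positions of the target term.
        for t in tokens[max(0, i - window): i + window + 1]:
            # Capitalization crudely approximates "rare" here.
            if t.lower() not in stops and t[:1].isupper():
                names.append(t)
    return names

sent = "Joe's Farm, hereafter known as the seller, agrees to deliver."
print(candidates_near(sent, "seller"))  # ['Joe', 'Farm']
```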
There are probably many other things you can try depending on the size/availability of training datasets, but this should give you a start.
I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how I should be doing this, but I'm not confident yet that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string-similarity check to catch these.
Product names that are also common English words or other things would cause false matches, like mustang the horse versus Mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.).
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and scikit-learn and didn't really glean how to do this from them. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is using a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS; proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
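As an illustration, here is the same idea with NLTK's default tagger standing in for the Stanford one (the tweet is from the question; the downloads are one-time model fetches):

```python
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

tweet = "Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."

tagged = nltk.pos_tag(nltk.word_tokenize(tweet))

# Keep proper nouns (NNP/NNPS) as brand-name candidates.
candidates = [tok for tok, tag in tagged if tag in ("NNP", "NNPS")]
print(candidates)  # ideally includes 'Coke' and 'Pepsi'
```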
If you have time, I would build a dictionary that has every brand name and simply extract matches from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
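Once the dictionary exists, the extraction itself can be a simple word-boundary match; a minimal sketch with a hypothetical alias table:

```python
import re

# Hypothetical dictionary mapping surface forms (including aliases) to brands.
BRANDS = {"coke": "Coca-Cola", "pepsi": "Pepsi", "mustang": "Mustang",
          "snickers": "Snickers"}

pattern = re.compile(r"\b(" + "|".join(map(re.escape, BRANDS)) + r")\b",
                     re.IGNORECASE)

tweet = "starting tomorrow, we have 5 boxes of #hersheys snickers available"
mentions = [BRANDS[m.group(1).lower()] for m in pattern.finditer(tweet)]
print(mentions)  # ['Snickers']
```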
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A common reason for "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string-similarity check, but you may introduce a lot of noise into your results which you would have to address, because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead. (A combined sketch of the phonetic and string-similarity checks follows this list.)
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
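A combined sketch of the phonetic and string-similarity checks, assuming the jellyfish package (the canonical list and thresholds are placeholders):

```python
import jellyfish

CANONICAL = ["innovation", "detergent", "chick", "chic"]

def best_match(word, max_edits=2):
    # First try a phonetic match: identical Metaphone keys mean "sounds alike".
    for c in CANONICAL:
        if jellyfish.metaphone(c) == jellyfish.metaphone(word):
            return c
    # Fall back to edit distance, which is noisier (cf. chic vs. chick).
    dist, cand = min((jellyfish.levenshtein_distance(word, c), c)
                     for c in CANONICAL)
    return cand if dist <= max_edits else None

print(best_match("innovashun"))  # 'innovation' (phonetic keys match)
print(best_match("detergant"))   # 'detergent' (one edit away)
```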
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
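Here is a toy sketch of that idea using gensim's Word2Vec, which learns vectors from exactly those lexical contexts (a real setup needs vastly more text than these placeholder sentences, and note it yields one vector per word type, so disambiguating a specific occurrence still means comparing its surrounding words against each sense's typical context):

```python
from gensim.models import Word2Vec

# Placeholder corpus: each sentence supplies lexical context for its words.
sentences = [
    ["i", "drive", "my", "mustang", "to", "work", "every", "day"],
    ["the", "mustang", "needs", "new", "tires", "and", "an", "oil", "change"],
    ["a", "wild", "mustang", "galloped", "across", "the", "prairie"],
    ["the", "horse", "galloped", "across", "the", "field"],
    ["my", "car", "needs", "new", "tires"],
]

# Words that share contexts end up near each other in vector space.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

print(model.wv.similarity("mustang", "car"))
print(model.wv.similarity("mustang", "horse"))
```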
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
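For the simple cases, that mapping can literally be a lookup table applied before any heavier machinery; a tiny sketch with invented entries:

```python
# Hypothetical alias table mapping variant forms to one canonical form.
ALIASES = {
    "washroom": "bathroom",
    "restroom": "bathroom",
    "wc": "bathroom",
    "loo": "bathroom",
}

def canonicalize(token):
    # Fall back to the token itself when no alias is known.
    return ALIASES.get(token.lower(), token.lower())

print(canonicalize("Loo"))      # 'bathroom'
print(canonicalize("kitchen"))  # 'kitchen' (unchanged)
```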
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.
I have a collection of bills and invoices, so there is no context in the text (I mean, they don't tell a story).
I want to extract people's names from those bills.
I tried OpenNLP, but the quality of the trained model is not good because I don't have context.
So the first question is: can I train a model that contains only people's names, without context? And if that is possible, can you point me to a good article on how to build that new model (most of the articles I read didn't explain the steps I should take to build a new model)?
I have a name database with more than 100,000 person names (first name, last name), so if NER systems don't work in my case (because there is no context), what is the best way to search for those candidates (I mean, searching for every first name combined with every last name)?
Thanks.
Regarding "context", I guess you mean that you don't have entire sentences, i.e. no previous / next tokens, and in this case you face quite a non-standard NER. I am not aware of available software or training data for this particular problem, if you found none you'll have to build your own corpus for training and/or evaluation purposes.
Your database of names will probably help greatly, depending on what proportion of the names on bills are actually present in the database. You'll also probably have to rely on the character-level morphology of names as patterns (see for instance the patterns in [1]). Once you have a training set with features (presence in the database, morphology, other information from the bill) and solutions (the actual names of annotated bills), using a standard machine-learning method such as an SVM will be quite straightforward (if you are not familiar with this, just ask).
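A minimal sketch of that setup with scikit-learn, where the features and annotated tokens are placeholders (presence in your name database plus crude character-level morphology):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

NAME_DB = {"john", "smith", "maria", "garcia"}  # stand-in for the 100k-name DB

def features(token):
    return {
        "in_db": token.lower() in NAME_DB,             # database lookup
        "capitalized": token[:1].isupper(),            # names are usually capitalized
        "suffix": token[-3:].lower(),                  # crude morphology signal
        "has_digit": any(c.isdigit() for c in token),  # amounts are not names
    }

# Placeholder annotated tokens from bills: (token, is_person_name).
train = [("John", 1), ("Smith", 1), ("Maria", 1), ("Invoice", 0),
         ("Total", 0), ("42.50", 0), ("Garcia", 1), ("VAT", 0)]

clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit([features(t) for t, _ in train], [y for _, y in train])

print(clf.predict([features("Jones"), features("Subtotal")]))
```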
Some other suggestions:
You can most probably also use the bill's other information: company names, positions, tax mentions, etc.
You may also proceed in a selective manner: if every bill should mention (exactly?) one person name, you may exclude all other text (e.g. amounts, tax names, positions, etc.), or assume in a dedicated model that among all the text in a bill, only one span should be guessed as a name.
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)
I'd start with some regular expressions, then possibly augment that with a dictionary-based approach (i.e., big list of names).
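For instance, a first pass combining both (the pattern and the tiny name list are simplistic assumptions):

```python
import re

FIRST_NAMES = {"john", "maria", "ahmed"}  # drawn from a big list of names

# Crude pattern: two adjacent capitalized words.
pattern = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b")

text = "Invoice 1042 issued to John Smith, 12 Main Street. Total Due: 42.50"
for m in pattern.finditer(text):
    first, last = m.groups()
    # Require a known first name to cut false hits like "Main Street".
    if first.lower() in FIRST_NAMES:
        print(first, last)  # John Smith
```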
No matter what you do, it won't be perfect, so be sure to keep that in mind.
I'm trying to analyze a set of phrases, and I don't know exactly how natural language processing can help me, or whether someone can share their knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way to parse it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis which will get nouns (for example). But where does a street name begin and where does it end? I assume that I will need to compare that analysis with a streets database, but I don't know which is the optimal method.
Also, I would like to deduce the level of severity, for example, in car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word deceased appears, add 100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First, the extraction of the streets themselves. [...] But where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice, I assume that a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation for your NLP library for that.
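A minimal sketch of that with NLTK's default chunker plus a lookup against a (placeholder) street database; the default models have no street-specific label, so whatever entities come back still need validating:

```python
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)  # one-time model downloads

STREET_DB = {"main street", "baker street"}  # stand-in for a real street database

sentence = "Two cars collided on Baker Street near the station."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Pull out named-entity chunks and validate them against the database.
for subtree in tree.subtrees():
    if subtree.label() != "S":
        entity = " ".join(tok for tok, tag in subtree.leaves())
        if entity.lower() in STREET_DB:
            print("street:", entity)
```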
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
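A bare-bones sketch of such a classifier (a linear SVM over tf-idf word features; the labelled phrases are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training phrases with severity labels (0 = minor, 1 = severe).
phrases = [
    "minor fender bender, no injuries reported",
    "driver deceased at the scene of the crash",
    "slight damage to the bumper, all passengers fine",
    "two people seriously injured, one in critical condition",
]
severity = [0, 1, 0, 1]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(phrases, severity)

print(clf.predict(["driver seriously injured in the crash"]))  # likely [1]
```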
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).