Hi, I did a short course in AI where we designed a chatbot based on AIML and Python. I have a new task: to design some form of semantic search engine. I want people to be able to navigate data or ask questions, and I give them results. Initially it will be for specific topics, e.g. transportation and geography. Some sample input from a user:
How much will it cost for me to get from x to y?
Ans: It will cost you $26.
How far is x from z?
Ans: It is 25 miles.
A user can add favourite routes: they can simply type "Add favourite route", and they will then be asked to enter the route.
Ans: Are you asking to add an entry to your favourite routes?
User: Yes.
Ans: Please enter a favourite route.
Show my common routes.
Ans: Your common routes are x,y and z.
So the data being searched may be specific to a user, hence I may have to use a database. Some data is external; maybe I'd invoke Google Maps to enquire about distances. Some questions may simply require a response from a chatbot.
So what should I do with the user input?
Tokenize it, stem it, parse it?
I was hoping to use AIML somewhere, but an article I read (http://knytetrypper.proboards.com/index.cgi?board=gbot&action=print&thread=285) says AIML is only good for pattern matching. Someone please point me in the right direction. I downloaded NLTK, and it seems useful, but I don't know if it on its own can do what I require.
Any similar projects or articles?
This is a really hard problem. If you restrict the inputs to a very small space, though, it can be doable. At that point you are just using a fixed vocabulary and have basic commands for each possible query.
There are several ways to discriminate between types of queries:
1) full parse, and try to use all of that information
2) partial parse / POS tagging, to find the verbs
3) a machine learning / classification approach, using POS tags, distances, and words/constructions like 'to'/'from' as features
... and then you can try to pull out the query params once you've classified the query correctly.
I would avoid doing a full parse until you are very sure what kind of query it is. A classification approach is the best first step, and NLTK is very useful for experimenting with that.
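To make that concrete, here is a minimal sketch of the classification approach using NLTK's Naive Bayes classifier. The training examples, labels, and feature choices below are invented for illustration; a real system would need many more labelled queries.

```python
import nltk

# Invented labelled examples -- in practice, collect and label real user queries.
TRAINING_QUERIES = [
    ("how much will it cost to get from x to y", "cost"),
    ("what is the fare from a to b",             "cost"),
    ("how far is x from z",                      "distance"),
    ("what is the distance between a and b",     "distance"),
    ("add a favourite route",                    "add_favourite"),
    ("show my common routes",                    "show_routes"),
]

def features(query):
    """Bag-of-words features plus cue words like 'to'/'from'."""
    tokens = query.lower().split()
    feats = {"word(%s)" % t: True for t in tokens}
    feats["has_to_or_from"] = ("to" in tokens) or ("from" in tokens)
    return feats

train_set = [(features(q), label) for q, label in TRAINING_QUERIES]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("how far is london from leeds")))
# With enough real training data this should print 'distance'.
```

Once a query is classified, you can pull out the parameters (the x and y place names) with a pattern or NER step specific to that query type.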
I don't know if this is the right place to ask, but I am trying to build a bot in Python that will read incoming messages on a Slack channel where customers post their issues, such as 'unable to connect to VPN', 'can someone reply to my ticket', etc.
The bot will analyze the message, determine if the customer is angry or not, and then propose a solution until an agent is free to actually check the issue.
Now, I was experimenting with TextBlob for the sentiment analysis part, but I don't know which technologies to use to determine the issue based on specific keywords and provide a solution to the user. Can someone suggest some Python libraries/technologies that I could use to achieve this?
To be honest, your question is too generic to answer in one go.
Nonetheless, you first have to clearly define the scope of your project. In doing so, you might want to do a quick literature review (Google Scholar) to familiarize yourself with the state-of-the-art technologies and methods.
From my limited experience, a common (and fairly simple) lexicon-based technique for determining the sentiment of a word is to use a pre-compiled dictionary (you can also create your own) that contains word–sentiment mappings. For example:
word:tired, sentiment:negative, score:5
So each time the bot finds the keyword "tired" in a sentence, it assigns the corresponding negative value (polarity) to the sentence.
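As a minimal sketch of that lexicon idea (the words and scores below are invented; real lexicons, such as the ones TextBlob or VADER ship with, are much larger):

```python
# Toy lexicon: word -> polarity score (negative = negative sentiment).
# These entries are invented for illustration.
SENTIMENT_LEXICON = {
    "tired":  -5,
    "angry":  -8,
    "unable": -4,
    "thanks":  3,
    "great":   6,
}

def sentence_polarity(sentence):
    """Sum the scores of any lexicon words found in the sentence."""
    words = sentence.lower().split()
    return sum(SENTIMENT_LEXICON.get(word, 0) for word in words)

print(sentence_polarity("I am tired of being unable to connect to VPN"))  # -9
```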
You might also want to consider applying POS tagging to the input text, as nouns and verbs sometimes carry more significant meaning than, for example, adjectives.
Keep in mind, though, that negative comments can be written in the form of sarcasm, and sarcasm detection is a considerably more difficult task.
Alternatively, you could try using a pre-trained model such as bert-base-multilingual-uncased-sentiment, which can be found on Hugging Face.
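With the transformers library, using that model takes only a few lines. I believe the checkpoint is published under the nlptown namespace, but verify the exact model id on Hugging Face:

```python
# pip install transformers torch
from transformers import pipeline

# Model id assumed to be the nlptown release of the checkpoint named above.
sentiment = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

print(sentiment("I am unable to connect to the VPN and nobody replies to my ticket"))
# The model outputs 1-5 star labels, e.g. [{'label': '1 star', 'score': 0.6...}]
```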
For more information on the matter, have a look at this post.
Again, as I mentioned, you have to clearly define your goals; that will let you narrow down the libraries and methodology available to solve your problem. I hope my answer helps.
So, a little bit on my problem.
TL;DR
Can I use machine learning instead of Elasticsearch to find results depending on the user's text input? Is it a good idea?
I am working on a car spare parts project, and we have split the car into 300 parts that we store on the database, with some data for each part (weight, availability, etc).
When the customer inputs text describing their part, we need to be able to classify it and map it to a part in our database.
Currently, people on our team manually map the customer text to parts in our database; we want to automate that process.
We tried using MongoDB text search, but it was often inaccurate since parts have different names in different parts of the country.
So we wanted something that gives more accurate results and improves as we get more data. We immediately considered TensorFlow. After some research, and after taking part of Google's Machine Learning Crash Course, I got to the point where it stated:
Models can't learn from string values, so you'll have to perform some feature engineering to convert those values to something numeric
That would be feasible if we had a limited number of string features, but we don't know what the user will input as text.
So, my questions are:
1- Can we use machine learning to map text input by the user to documents in our database?
2- If we can, is it a good idea to favor it over other search tools like Elasticsearch?
3- Can Elasticsearch improve its results the more data we have? How?
4- How would you go about this problem?
Note: I'd be doing this in Node.js, and since TensorFlow.js is new, I'm inclined to go for other solutions; but if push comes to shove and the results are much better, I would definitely go that way.
TL;DR: Yes and yes.
TS;WM:
This is a perfectly suited problem for machine learning, especially if you have a database of past customer texts that have already been mapped to parts; ideally, hundreds of texts mapped to each part. If you have that, you can design and train a network. And models can learn from string values with some feature engineering; it's not that bad.
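As one concrete baseline (scikit-learn rather than a neural network; this is my substitution, not necessarily what you'd deploy), TF-IDF features plus a linear classifier already handle the string-to-numeric feature engineering. All the data below is invented:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented history of customer texts already mapped to part ids by your team.
texts = [
    "front left brake pad",
    "brake pads for the front wheels",
    "steering wheel cover",
    "the large round thingy that helps change direction",
]
part_ids = ["PART-017", "PART-017", "PART-102", "PART-102"]

# TF-IDF turns each text into a numeric vector; the classifier maps it to a part.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, part_ids)

print(model.predict(["round wheel for changing direction"]))
# With enough real examples per part, this should print ['PART-102'].
```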
I'm not sure Elasticsearch would improve much on the network. I don't know much about auto-parts trading, but as a wild guess, "the large round thingy that helps change direction" would never be mapped to "steering wheel" by ES, but it could easily be learned by a network, provided there are at least some examples of people using that text to mean a steering wheel.
You can, but don't necessarily have to, use TensorFlow.js for your network. The model could run on your server as a web service: you'd just send the customer's text over, and it would send back its recommendations of part SKUs and names.
I'm trying to build a chat assistant on my website that should answer queries like "Can you track my order?" or "How's the performance of XXX?". The majority of the work lies in understanding the user's query.
I'm using named entity recognizers and text parsers to process the queries. Before that, I pass each query through a spell checker to reduce errors like
Can you track my ordr?
to
Can you track my order?
It works in most cases but fails in cases like
Can you track my water?
In this case, the spell checker doesn't correct the word 'water', and NER is not able to identify the entity as 'order'.
The problem is that 'Can you track my water?' may be a correct sentence in some other context, but it's definitely a mistake in my context (domain), so I should be able to correct it.
I'm stuck here.
Is there any way I can correct these sentences using predefined queries and/or statistical data from user-entered queries?
I don't know of a way you can change "water" to "order".
But if you have a predefined set of questions, you could show the user suggestions to select from just before they submit the question.
NER can only recognize and classify entities; it shouldn't be used to replace parts of a sentence, because the user may have meant exactly what they said.
What you can do is suggest the most probable word based on your query set.
References:
What is the best way to find the most similar sentence?
Find semantically similar word
You could use n-gram models to find the most probable word and then substitute it. In your case, you would substitute the word 'ordr' with the word 'order'. If you want to go deeper, you could use a machine learning model to handle the issue.
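Here is a rough sketch of that substitution idea, combining bigram counts from a query log with string similarity. Every list and word set below is invented for illustration; you would build them from your own logged queries and domain vocabulary:

```python
from collections import Counter
from difflib import SequenceMatcher

# Invented query log and vocabularies -- replace with your own data.
QUERY_LOG = [
    "can you track my order",
    "track my order please",
    "where is my order",
    "can someone reply to my ticket",
]
DOMAIN_WORDS = {"track", "order", "ticket", "refund", "reply"}
COMMON_WORDS = {"can", "you", "my", "please", "where", "is", "someone", "to"}

# Bigram counts from the log: how often word b follows word a.
bigrams = Counter()
for query in QUERY_LOG:
    tokens = query.split()
    bigrams.update(zip(tokens, tokens[1:]))

def correct(query):
    """Replace out-of-domain words with the most plausible in-domain word."""
    tokens = query.lower().rstrip("?").split()
    fixed = []
    for i, token in enumerate(tokens):
        if token in DOMAIN_WORDS or token in COMMON_WORDS:
            fixed.append(token)
            continue
        prev = tokens[i - 1] if i > 0 else ""
        # Score candidates by bigram frequency plus spelling similarity.
        best = max(
            DOMAIN_WORDS,
            key=lambda w: bigrams[(prev, w)] + SequenceMatcher(None, token, w).ratio(),
        )
        fixed.append(best)
    return " ".join(fixed)

print(correct("Can you track my water?"))  # -> can you track my order
print(correct("Can you track my ordr?"))   # -> can you track my order
```

The bigram count handles the in-context case ('water' after 'my' strongly suggests 'order'), while the similarity score handles plain misspellings like 'ordr'.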
I'm trying to implement a typical autocompletion box, like the one you know from amazon.com.
You type in a letter and get a reasonable suggestion about what you might be trying to enter into the search box.
The box itself will be implemented with jQuery; the persistence layer and suggestion algorithm will be based on Apache Lucene/Solr and its suggester feature.
Additionally, I get weighted suggestions in the results, using Lucene's WFST-based suggester.
My problem is: what does e.g. Amazon do to obtain this kind of reasonable data?
I mean, where do they get all these keywords and scores, so that the suggestions make sense?
Is it purely hand-crafted information on each product? That, I think, would be really tough.
Or is it possible to gather the data using techniques like clustering or classification from machine learning? (Then I could use Mahout or Carrot2.)
Looking at Amazon's suggestions, I think the data contains:
name of the product
producer/manufacturer/author of the product/book
product-features (like color, type, size)
Does it contain more?
The next thing is that the suggestions themselves appear to be ranked. How do they derive the scores used to weight the suggestions?
Is it simple user-click-path tracking, where you look at what the user typed into the box and what they selected, or which product they looked at afterwards?
Is this kind of score computed for each query (maybe cached) using some logic? (Which logic? Maybe Bayes' theorem?)
They might use something as simple as building an n-gram model from user queries and/or product names and using that to predict the most likely auto-completions.
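The simplest version of that idea just counts whole logged queries and ranks completions for a prefix by frequency; a real system would back off to word n-grams and mix in click data. The log below is invented:

```python
from collections import Counter

# Invented log of past searches; in practice, mine your real query logs.
QUERY_LOG = [
    "harry potter box set",
    "harry potter box set",
    "harry potter and the philosopher's stone",
    "hard disk drive 2tb",
]
counts = Counter(QUERY_LOG)

def suggest(prefix, k=3):
    """Return the k most frequent logged queries starting with the prefix."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    return [q for q, n in sorted(matches, key=lambda m: -m[1])[:k]]

print(suggest("harry"))
# -> ['harry potter box set', "harry potter and the philosopher's stone"]
```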
We have a client who is looking for a means to import and categorize a large amount of textual data. This data has to be categorized, and it's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be to match the words against keywords held for each category and, if that was unsuccessful, to try some kind of synonym lookup instead. So, for example, if a particular record contained the word "automobile", a synonym lookup could match it to the word "car", which would be held against the category "vehicle".
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this, but from what I can make out, it doesn't offer what these guys are looking for.
Any other suggestions for getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into WordNet.
Do you know of any other textual-classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't find any real-world examples.
The first thing that comes to mind is WordNet, a human-generated database of words and related words, including synonyms. The Wikipedia WordNet entry lists several interfaces to WordNet; I believe some of them are web services.
You can also roll your own. Chapter 5 of Manning and Schütze (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? A tag cloud? See Clay Shirky's "Ontology is Overrated" for a critique of hierarchical categories. I believe synonyms matter less if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
You should look at using WordNet. You can visit their website, http://wordnet.princeton.edu/, for more information; there are libraries in lots of languages for integrating with it.
Try their online tool to see it in action: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word and then click on "S" next to each definition, you'll get a list of semantically related words for that definition.
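If you end up in Python, NLTK ships an interface to WordNet; a minimal synonym lookup looks like this (the one-time corpus download is the only setup):

```python
import nltk
# nltk.download("wordnet")  # one-time download of the WordNet data
from nltk.corpus import wordnet

def synonyms(word):
    """Collect lemma names from every WordNet synset that contains the word."""
    return {
        lemma.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
    }

print(synonyms("automobile"))
# e.g. -> {'car', 'auto', 'automobile', 'machine', 'motorcar'}
```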
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.