Can I use NLP to explore an inventory catalog?

Can I use NLP to do an intelligent search on an inventory catalog, like the following?
1. Show me the products under 5000.
2. Show me the jewelry in a particular product category.
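For example, I imagine parsing such a query into a structured filter before hitting the catalog. A rough sketch of the idea in Python (the category list and the patterns are placeholders, not from any particular library):

import re

def parse_query(query: str) -> dict:
    """Very rough intent/slot extraction for catalog search queries."""
    structured = {}
    # Price filter: "under 5000", "below 5000", "less than 5000"
    match = re.search(r"\b(?:under|below|less than)\s+(\d+)", query, re.I)
    if match:
        structured["max_price"] = int(match.group(1))
    # Category slot: naive lookup against known catalog categories (placeholder list)
    for category in ("jewelry", "electronics", "clothing"):
        if category in query.lower():
            structured["category"] = category
    return structured

print(parse_query("show me the products under 5000"))  # {'max_price': 5000}
print(parse_query("show me the jewelry under 2000"))   # {'max_price': 2000, 'category': 'jewelry'}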

Related

Azure Search/Solr index definition for supporting multiple markets

I am building a product catalog for an e-commerce website, and I need to build an Azure Search/Solr/Elasticsearch-based index. The problem is saving the market-specific attributes. The website supports 109 markets, and there is market-specific data like ratings, price, views, wish-listed, etc. that I need to save in the index. For example, Product1 will have 109 ratings (the rating is different in each market) and 109 prices (the price might be different in each market), corresponding to the 109 markets.
I will also have to use these attributes in a boosting function, so that products with higher views/ratings surface up when people search. How do I design the index definition to support this? Can I achieve it with one index document per product, or do I have to create one index document per market?
Some pointers would be very helpful. I have spent a couple of days on this and could not reach a conclusion that is optimized for this use case. Thank you!
My proposed index definition:
-id
-mktUSA
--mktId
--rating
--views
--price
...
-mktCanada
--mktId
--rating
--views
--price
...
-locales
--En
--Fr
--Zh
...
...other properties
The problem with this approach is configuring magnitude scoring functions inside a scoring profile to boost products based on the market.
For example: if the user is from Canada, only the Canada-based rating/views should be considered, not the other markets' ratings, while Cognitive Search calculates the search relevance score.
Is there any possible workaround for this? Elasticsearch has a neat solution, the function score query, which can be used to configure the scoring function dynamically.
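For reference, a function score query in Elasticsearch would look roughly like this (the field names follow my proposed definition above and are only illustrative):

# Illustrative Elasticsearch function_score query body, as a Python dict.
# "mktCanada.rating" follows the proposed index definition above.
query_body = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "diamond ring"}},
            "functions": [{
                "field_value_factor": {
                    "field": "mktCanada.rating",  # chosen per user market at query time
                    "modifier": "log1p",
                    "missing": 0,
                }
            }],
            "boost_mode": "multiply",
        }
    }
}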
From what I understand, your problem is that you want a single index with products that support 109 different markets, where many of the properties on your Product model are market-specific. Your concern is that the model gets too big, or that the design won't scale. It will scale: you can have 1000+ properties without a problem.
I have built a similar search solution for e-commerce for multiple markets.
For price, I specify one price per market. I have about 80 or so markets, so that's 80 prices. There is no way around it. I would probably do the same for ratings and views too. One per market.
In our application we use separate dimensions for market, language and country. A market can be Scandinavia, BeNeLux or Asia-Pacific. You need to clearly define what a market is in your case, and agree with the business which markets you have and how you handle changes. Countries can map directly to markets, but it may also differ. Finally, language is usually shared across markets/countries and you usually only have to support 20-25 languages.
Suggested data model
Id
TitleEnGb
TitleDeDe
TitleFrFr
...
PriceGb
PriceUs
PriceNo
PriceDe
...
RatingsGb
RatingsUs
RatingsNo
RatingsDe
...
DescriptionEnGb
DescriptionDeDe
DescriptionFrFr
...
This illustrates that the Title and Description are language-specific, while the price and ratings are market-specific.
For the 20-25 language-specific properties, you have to think about what analyzers to use. You want to use language-specific analyzers, and preferably the Microsoft analyzers since they have much better linguistics support with full lemmatization and so on.
When you develop your frontend application, you have to keep track of which market, country and language the user is in, and then refer to the matching properties. This is the easiest way to support boosting and so on.
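To make this concrete, here is a sketch of what the index and per-market boosting could look like with the azure-search-documents Python SDK. The field names follow the model above; the boost value and rating range are placeholders:

from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchFieldDataType,
    ScoringProfile, MagnitudeScoringFunction, MagnitudeScoringParameters,
)

fields = [
    SimpleField(name="Id", type=SearchFieldDataType.String, key=True),
    # Language-specific fields get language-specific Microsoft analyzers
    SearchableField(name="TitleEnGb", type=SearchFieldDataType.String,
                    analyzer_name="en.microsoft"),
    SearchableField(name="TitleDeDe", type=SearchFieldDataType.String,
                    analyzer_name="de.microsoft"),
    # Market-specific numeric fields
    SimpleField(name="PriceGb", type=SearchFieldDataType.Double, filterable=True),
    SimpleField(name="RatingsGb", type=SearchFieldDataType.Double, filterable=True),
    SimpleField(name="RatingsDe", type=SearchFieldDataType.Double, filterable=True),
]

# One scoring profile per market; the frontend picks the matching one per request.
profiles = [
    ScoringProfile(
        name="boost-gb",
        functions=[MagnitudeScoringFunction(
            field_name="RatingsGb",
            boost=2.0,                 # placeholder boost strength
            interpolation="linear",
            parameters=MagnitudeScoringParameters(
                boosting_range_start=0, boosting_range_end=5),
        )],
    ),
]

index = SearchIndex(name="products", fields=fields, scoring_profiles=profiles)
# A search from a GB user then passes scoring_profile="boost-gb" to the query.

With one scoring profile per market the boosting stays market-local, which is roughly Azure Cognitive Search's equivalent of Elasticsearch's function score query.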
Per-market index is not recommended
You could create one index per market. I have gone down this route before, and I would not recommend it: it means you have to update 109 indexes every time you add, change or delete an item. Besides, Azure Search supports at most 50 indexes per service anyway.

Does HLDA in Mallet return Word-Topic Distribution?

I am trying to generate a taxonomy of extracted terminology using topic models, so I had to use Hierarchical Latent Dirichlet Allocation (HLDA).
However, after getting the topic tree, I would like to annotate the topics, but I am unable to produce the word-topic distribution in Mallet.
I have checked the parameters; it seems the only output file I can get is the output state, and it does not show the needed information.
I am using the Mallet implementation from the command line, with the following command:
bin/mallet run cc.mallet.topics.tui.HierarchicalLDATUI --input my_corpus.mallet --output-state topic-statehlda.txt
I managed to get topic-statehlda.txt, which contains all the topic paths for the words, and I have also visualized it (an example of the topics tree is TopicsTree; terms were trimmed because they make the tree big and difficult to navigate). Some terms occur in multiple topics, which is why I am interested in the word-topic distribution: it would let me select the most representative terms.
Can you please advise? Is there a way to retrieve topic labels in a different way?
I am applying HLDA over documents from the same topic, and I am only using it to extract possible taxonomies over a list of automatically extracted terminology (noun phrases). Does this seem meaningful, or is it bad practice?
The corpus is a collection of OCR'ed insurance documents. An example of my automatically extracted terminology is:
motor insurance policy, motor policy schedule, motorcycle policy schedule, policy cover, cover use, cover note, theft cover, windscreen cover, comprehensive cover,
breakdown cover, commercial vehicle policy, commercial vehicle, motor vehicle,
vehicle policyholder, vehicle insurer, insured vehicle
and I am trying to build a taxonomy that suggests that the first three phrases, for example, fall under the same node (belong to the same level).
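In case it helps, I am also considering deriving the word-topic counts myself from the state file. A rough sketch (the column layout is an assumption I still need to verify against the actual topic-statehlda.txt):

from collections import Counter, defaultdict

# Assumed layout: whitespace-separated columns where one column holds the word
# token and another the leaf topic of its path. The indices below are
# placeholders; check them against the real file before trusting the output.
topic_word_counts = defaultdict(Counter)

with open("topic-statehlda.txt") as state_file:
    for line in state_file:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        word = cols[-2]        # placeholder column index for the word token
        leaf_topic = cols[-1]  # placeholder column index for the leaf topic
        topic_word_counts[leaf_topic][word] += 1

# Normalizing the per-topic counts approximates the word-topic distribution.
for topic, counts in topic_word_counts.items():
    total = sum(counts.values())
    print(topic, [(w, round(c / total, 3)) for w, c in counts.most_common(10)])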

Azure ML Recommendations

I want to use Azure ML to find related products using information from receipts from a store.
I have a file of receipts:
44366,136778
79619,88975
78861,78864
53395,78129,78786,79295,79353,79406,79408,79417,85829,136712
32340,33973
31897,32905
32476,32697,33202,33344,33879,34237,34422,48175,55486,55490,55498
17800
32476,32697,33202,33344,33879,34237,34422,48175,55490,55497,55498,55503
47098
136974
85832
Each row represents one receipt and each number is a product ID.
Given a product ID, I want to get a list of similar products, i.e. products that were bought together by other customers.
Can anyone point me in the right direction on how to do this?
This seems a good fit for Azure's frequently bought together service (https://datamarket.azure.com/dataset/amla/mba). You may have to preprocess the dataset to get it into the required format. The service has a web UI as well: https://marketbasket.cloudapp.net/
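If you want a quick baseline before wiring up the service, a plain co-occurrence count over the receipt file already gives "bought together" candidates. A minimal Python sketch (the filename is a placeholder):

from collections import Counter, defaultdict
from itertools import combinations

# Count how often each pair of products appears on the same receipt.
co_counts = defaultdict(Counter)

with open("receipts.txt") as receipts:  # placeholder filename
    for line in receipts:
        items = sorted(set(line.strip().split(",")))
        if len(items) < 2:
            continue                    # single-item receipts add no pairs
        for a, b in combinations(items, 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1

def similar_products(product_id, top_n=5):
    """Products most often bought together with the given product."""
    return co_counts[product_id].most_common(top_n)

print(similar_products("32476"))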
This is a typical problem for a recommender; you can use a model called the Matchbox recommender to cover it.
Recommenders typically use scores that users give to items, and then use some clever calculation to predict scores for items the users have not scored yet (here, a score would typically be 1 if the user bought the item and 0 if they did not).
If you need more details, let me know (you have access to a free tier of Azure ML where you can try all this).
Regards

How to perform website benchmarking?

I am trying to do a competitive analysis of online trends prevailing in the real estate domain at the state level in a country. I have to create a report that is not biased towards any particular company, but compares, or simply shows, how the companies are performing on a list of trends. I will use clickstream analysis parameters to show statistics on how the companies' websites perform. The trend-specific performance can, in my opinion, be depicted by sentiment analysis. If there is some other way to do this more effectively, I am open to any such approach.
Now, I am not able to find any trends that the companies have in common.
How can I find general trends that are common to all real estate companies?
I tried using Google Trends. It provides graphical and demographic information for a particular search term and lists terms related to the search, which I am clueless how to use. And as I drill down from country to state level, there is very little data.
Once I have the trends, I have to find out how people are reacting to them. Sentiment analysis will provide this information.
But even if I get the trends, how will I get trend-specific data from which I can calculate polarity?
Twitter and other social media sites can provide some data on which sentiment analysis can be performed. I used a site that gives the positive, negative and neutral behaviour related to a term on Twitter. I need something analogous to this, but the dataset on which the analysis is performed should not be limited to social media.
Are there any other entities I can add to this competitive analysis report?
The report will be generated on a monthly basis, and I want to automate the above tasks as much as possible. I am also thinking of using web scraping to collect data in a similar format. I would also like to know which data I should scrape and which I should extract manually.
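For the polarity step itself, I am considering something like NLTK's VADER analyzer over whatever trend-specific text I manage to collect. A rough sketch (the input texts are invented placeholders):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

sia = SentimentIntensityAnalyzer()

# Placeholder texts; in practice these would be scraped mentions of a trend.
texts = [
    "Virtual home tours are a great way to shortlist properties.",
    "Property prices in the state are rising too fast for first-time buyers.",
]

for text in texts:
    scores = sia.polarity_scores(text)  # neg/neu/pos plus a compound score
    print(round(scores["compound"], 3), text)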

Can I identify intranet page content using Named Entity Recognition?

I am new to Natural Language Processing and I want to learn more by creating a simple project. NLTK was suggested as a popular NLP toolkit, so I will use it in my project.
Here is what I would like to do:
I want to scan our company's intranet pages; approximately 3K pages
I would like to parse and categorize the content of these pages based on certain criteria such as: HR, Engineering, Corporate Pages, etc...
From what I have read so far, I can do this with Named Entity Recognition: I can describe entities for each category of pages, train the NLTK solution, and run each page through it to determine the category.
Is this the right approach? I appreciate any direction and ideas...
Thanks
It looks like you want to do text/document classification, which is not quite the same as Named Entity Recognition, where the goal is to recognize named entities (proper names, places, institutions, etc.) in text. However, proper names might be very good features for text classification in a limited domain; for example, a page containing the name of the head engineer is likely to belong to Engineering.
The NLTK book has a chapter on basic text classification.
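A minimal sketch of such a classifier with NLTK's Naive Bayes (the example pages and labels are invented placeholders):

import nltk

def features(text):
    """Bag-of-words features: which words the page contains."""
    return {"contains({})".format(w): True for w in text.lower().split()}

# Invented training pages; real ones would come from the intranet.
train_pages = [
    ("submit your timesheet to human resources", "HR"),
    ("annual leave and benefits policy", "HR"),
    ("deploy the build to the staging server", "Engineering"),
    ("code review checklist for the platform team", "Engineering"),
]

train_set = [(features(text), label) for text, label in train_pages]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("how to request vacation days")))  # likely HR
print(classifier.classify(features("server deployment pipeline")))    # likely Engineering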
