Wiki-distance: distance between Wiki topics and categories? - nlp

Is there a [directional?] notion or implementation of distance between Wikipedia categories/pages?
For example consider: A) "Saint Louis University" B) "university"
Clearly "A" is a type of "B". How can you extract this from Wiki?
If you extract all the categories connected to A, you'd see that it gives:
Category:1818 establishments in Missouri Territory
Category:Articles containing Latin-language text
Category:Association of Catholic Colleges and Universities
Category:Commons category with local link same as on Wikidata
Category:Coordinates on Wikidata
Category:Educational institutions established in 1818
Category:Instances of Infobox university using image size
Category:Jesuit universities and colleges in the United States
Category:Roman Catholic Archdiocese of St. Louis
Category:Roman Catholic universities and colleges in Missouri
and it does not contain anything that directly connects to B (https://en.wikipedia.org/wiki/University). But if you look further, you should be able to find a multi-hop path between A and B. What are the popular ways of accomplishing this?

Some ideas/resources I collected. Will update this if I find more.
-- Using DBpedia: a knowledge base curated from Wikipedia. They provide a SPARQL endpoint to query this KB, but one has to simulate the desired similarity/distance behavior via that SPARQL interface (see the sketch after this list). Some ideas are here and here, but they seem to be outdated.
-- Using UMBEL: http://umbel.org/, which is a knowledge graph of concepts. I think the size of this knowledge graph is relatively small, but I suspect its precision is probably high. That being said, I'm not sure how this relates to Wikipedia at all. They have an API for calculating the distance between any pair of their concepts (at the moment of writing this post, their similarity API is down, so it is not a feasible solution right now).
-- Using http://degreesofwikipedia.com/. I don't know the details of their algorithm, but they provide a distance between Wiki concepts, and it is directional. For example, this and this.
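To make the DBpedia idea concrete, here is a minimal sketch (untested against the live endpoint) using the SPARQLWrapper Python package: it asks whether the article's categories transitively reach a broader target category via skos:broader. The target category name is my assumption for illustration, and the public endpoint can time out on deep property paths.

```python
# Sketch: does "Saint Louis University" reach "Category:Universities_and_colleges"
# by following dct:subject once and skos:broader any number of times?
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dbc:  <http://dbpedia.org/resource/Category:>

ASK {
  dbr:Saint_Louis_University dct:subject/skos:broader* dbc:Universities_and_colleges .
}
""")
sparql.setReturnFormat(JSON)
print(sparql.query().convert()["boolean"])   # True if some category path exists
```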

If you have the entire Wikipedia category taxonomy, then you can compute the distance (shortest path length) between two categories. If one category is an ancestor of the other, it is straightforward.
Otherwise you can find the Least Common Subsumer which is defined as follows.
Least common subsumer of two concepts A and B is the most specific
concept which is an ancestor of both A and B.
Then compute the distance between them via LCS.
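Here is a minimal sketch of the LCS-based distance, assuming the category taxonomy has already been loaded into a networkx DiGraph with an edge from each subcategory to its parent; the toy graph and category names below are placeholders, not real dump output.

```python
# Distance between two categories via their least common subsumer (LCS):
# the minimum, over all shared ancestors, of (hops from A) + (hops from B).
import networkx as nx

def ancestor_depths(g, node):
    # shortest hop count from `node` to each of its ancestors (edges point child -> parent)
    return nx.single_source_shortest_path_length(g, node)

def lcs_distance(g, a, b):
    da, db = ancestor_depths(g, a), ancestor_depths(g, b)
    common = set(da) & set(db)
    if not common:
        return None, None                        # no shared ancestor in the taxonomy
    lcs = min(common, key=lambda c: da[c] + db[c])
    return da[lcs] + db[lcs], lcs

# toy taxonomy for illustration only
g = nx.DiGraph([
    ("Category:Jesuit universities and colleges in the United States",
     "Category:Universities and colleges in the United States"),
    ("Category:Universities and colleges in the United States",
     "Category:Universities and colleges"),
    ("Category:Types of university or college",
     "Category:Universities and colleges"),
])
print(lcs_distance(g,
                   "Category:Jesuit universities and colleges in the United States",
                   "Category:Types of university or college"))   # (3, 'Category:Universities and colleges')
```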
I encourage you to go through similarity measures, where you will find state-of-the-art techniques to compute semantic similarity between words.
Resource: My project on extracting Wikipedia category/concept might help you.
One very good related example: compute semantic similarity between words using WordNet. WordNet organizes English words in a hierarchical fashion. See this WordNet similarity for Java demo; it uses eight different state-of-the-art techniques to compute semantic similarity between words.
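For a Python counterpart to that Java demo, here is a minimal sketch with NLTK's WordNet interface; path similarity and Wu-Palmer similarity are two classic measures of this kind.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

a = wn.synset('university.n.01')
b = wn.synset('college.n.01')

print(a.path_similarity(b))            # based on shortest path length in the hypernym graph
print(a.wup_similarity(b))             # Wu-Palmer: uses the depth of the least common subsumer
print(a.lowest_common_hypernyms(b))    # WordNet's least common subsumer(s) of the two senses
```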

You might be looking for the "is a" relationship: Q734774 (the Wikidata item for Saint Louis University) is a university, a building and a private not-for-profit educational institution. You can use SPARQL to query it:
is Saint Louis University a university?
how far is Saint Louis University removed from the concept of "university"? (although I doubt this would produce anything meaningful)
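A minimal sketch of both questions against the public Wikidata Query Service via SPARQLWrapper; Q734774 is the item given above, while Q3918 ("university"), P31 ("instance of") and P279 ("subclass of") are my assumptions for the relevant IDs. The second query only counts intermediate classes, so it is a rough proxy for distance rather than a true shortest-path length.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="wiki-distance-sketch/0.1")
sparql.setReturnFormat(JSON)

# 1) Is Saint Louis University a university?
sparql.setQuery("ASK { wd:Q734774 wdt:P31/wdt:P279* wd:Q3918 . }")
print(sparql.query().convert()["boolean"])

# 2) Roughly, how far removed is it? Count the distinct classes that lie between them.
sparql.setQuery("""
SELECT (COUNT(DISTINCT ?mid) AS ?classes) WHERE {
  wd:Q734774 wdt:P31/wdt:P279* ?mid .
  ?mid wdt:P279* wd:Q3918 .
}
""")
print(sparql.query().convert()["results"]["bindings"][0]["classes"]["value"])
```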

Related

What kind of Multi-Criteria Decision Making methods do I need for my problem?

I'm making an application to find the best products to buy based on several criteria; it could be called a decision support system.
Some examples of the criteria I use are:
location: the closer the shipping location is to my city, the better. I have already determined the weights for location; I give my own city a weight of 100, and the farther the shipping city is from mine, the smaller the weight.
the number of reviews a product has: more is better
rating value: the higher the rating, the better
price: the cheaper, the better
I was recommended a method called AHP. I have read about AHP, and although I think it is a good method, in my opinion it cannot entirely fulfil what I want, because it does not take into account the nominal values of the rating and price; it only compares the importance of one criterion to another.
My questions are:
Given the requirements of the criteria, which MCDM method should I use?
Can AHP actually accommodate my needs? If yes, how? Is it by using fuzzy AHP? If so, I will start learning fuzzy logic and things related to it.
Thanks for the question. So, AHP*1 is a method used in decision making (DM) to methodically assign weights to the different criteria. In order to score, rank and select the most desirable alternative, you need to complement AHP with another MCDM method that fulfils those tasks.
There are several methods for that; TOPSIS and ELECTRE, for instance, are commonly used for this purpose*2-3. I leave you links to papers and tutorials on those methods so you can understand how they work -- SEE RESOURCES.
In regards to using fuzzy logic in AHP: while there are several proposals for FAHP*4, Saaty himself, the creator of AHP, states that this is redundant*5-7, since the scale on which criteria are assessed for weighting in AHP already operates with a fuzzy logic.
However, if your criteria are based on qualitative data, and you are therefore dealing with uncertainty and potentially incomplete information, you can use fuzzy numbers in TOPSIS for those variables. You can check the tutorials in the resources to understand how to apply those methods.
In recent years, some researchers have argued that fuzzy TOPSIS only considers the membership function (that is, how close an imprecise parameter is to reality) and ignores the non-membership and indeterminacy degrees*9-10, i.e., how false and how undeterminable that parameter is. The neutrosophic theory was mainly pioneered by Smarandache*11.
So, in response, neutrosophic TOPSIS is nowadays being used to deal with uncertainty. I recommend reading the papers below to understand the concept.
So, in summary, I would personally recommend applying AHP together with fuzzy or neutrosophic TOPSIS to address your problem.
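To make this concrete, here is a minimal numeric sketch of the recommended pipeline, with made-up pairwise judgements and product data: AHP weights from the principal eigenvector of a pairwise comparison matrix, followed by classic (crisp) TOPSIS for the ranking. It illustrates the mechanics only, not a validated model.

```python
import numpy as np

# --- AHP: criteria weights from a pairwise comparison matrix (judgements are made up) ---
# criteria order: [location, number of reviews, rating, price]
pairwise = np.array([
    [1.0, 3.0, 2.0, 0.5 ],
    [1/3, 1.0, 0.5, 0.25],
    [0.5, 2.0, 1.0, 1/3 ],
    [2.0, 4.0, 3.0, 1.0 ],
])
eigvals, eigvecs = np.linalg.eig(pairwise)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
weights = w / w.sum()                       # principal eigenvector, normalised to sum to 1

# --- TOPSIS: rank alternatives (rows) on the criteria (columns) ---
X = np.array([
    [100, 250, 4.5, 120.0],                 # product A (values are made up)
    [ 60, 900, 4.2,  95.0],                 # product B
    [ 80,  40, 4.8, 150.0],                 # product C
], dtype=float)
benefit = np.array([True, True, True, False])   # price is a cost criterion

R = X / np.linalg.norm(X, axis=0)           # vector-normalised decision matrix
V = R * weights                             # weighted normalised matrix
ideal      = np.where(benefit, V.max(axis=0), V.min(axis=0))
anti_ideal = np.where(benefit, V.min(axis=0), V.max(axis=0))
d_plus  = np.linalg.norm(V - ideal, axis=1)
d_minus = np.linalg.norm(V - anti_ideal, axis=1)
closeness = d_minus / (d_plus + d_minus)    # higher = closer to the ideal solution

print("criteria weights:", np.round(weights, 3))
print("ranking (best first):", np.argsort(-closeness))
```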
Resources:
Manoj Mathew. Tutorial Youtube FAHP. Fuzzy Analytic Hierarchy Process (FAHP) - Using Geometric Mean. Retrieved from: https://www.youtube.com/watch?v=5k3Wz1AfVWs
Manoj Mathew. Tutorial Youtube FTOPSIS. Fuzzy TOPSIS. Retrieved from: https://www.youtube.com/watch?v=z188EQuWOGU
Manoj Mathew. TOPSIS - Technique for Order Preference by Similarity to Ideal Solution Retrieved from: https://www.youtube.com/watch?v=kfcN7MuYVeI
MCDM in R: https://www.rdocumentation.org/packages/MCDA/versions/0.0.19
MCDM in JS: https://www.npmjs.com/package/electre-js
MCDM in Python: https://github.com/pyAHP/pyAHP
REFERENCES:
1 Saaty, R. W. (1987). The analytic hierarchy process—what it is and how it is used. Mathematical Modelling, 9(3-5), 167. doi:10.1016/0270-0255(87)90473-8
2 Hwang, C. L., & Yoon, K. (1981). Methods for multiple attribute decision making. In Multiple attribute decision making (pp. 58-191). Springer, Berlin, Heidelberg.
3 Figueira, J., Mousseau, V., & Roy, B. (2005). ELECTRE methods. In Multiple criteria decision analysis: State of the art surveys (pp. 133-153). Springer, New York, NY.
4 Mardani, A., Nilashi, M., Zavadskas, E. K., Awang, S. R., Zare, H., & Jamal, N. M. (2018). Decision Making Methods Based on Fuzzy Aggregation Operators: Three Decades Review from 1986 to 2017. International Journal of Information Technology & Decision Making, 17(02), 391–466. doi:10.1142/s021962201830001x
5 Saaty, T. L. (1986). Axiomatic Foundation of the Analytic Hierarchy Process. Management Science, 32(7), 841. doi:10.1287/mnsc.32.7.841
6 Saaty, R. W. (1987). The analytic hierarchy process—what it is and how it is used. Mathematical Modelling, 9(3-5), 167. doi:10.1016/0270-0255(87)90473-8
7 Aczél, J., & Saaty, T. L. (1983). Procedures for synthesizing ratio judgements. Journal of Mathematical Psychology, 27(1), 93–102. doi:10.1016/0022-2496(83)90028-7
8 Wang, Y. M., & Elhag, T. M. (2006). Fuzzy TOPSIS method based on alpha level sets with an application to bridge risk assessment. Expert Systems with Applications, 31(2), 309-319.
9 Zhang, Z., & Wu, C. (2014). A novel method for single-valued neutrosophic multi-criteria decision making with incomplete weight information. Neutrosophic Sets and Systems, 4, 35–49.
10 Biswas, P., Pramanik, S., & Giri, B. C. (2018). Neutrosophic TOPSIS with Group Decision Making. Studies in Fuzziness and Soft Computing, 543–585. doi:10.1007/978-3-030-00045-5_21
11 Smarandache, F. (1998). A Unifying Field in Logics. Neutrosophy: Neutrosophic Probability, Set and Logic. American Research Press, Rehoboth.

Alternatives to TF-IDF and Cosine Similarity (comparing documents with different formats)

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows:
1) Process the text of each job listing to extract skills that are mentioned in the listing
2) For each career (e.g. "Data Analyst"), combine the processed text of the job listings for that career into one document
3) Calculate the TF-IDF of each skill within the career documents
After this, I'm not sure which method I should use to rank careers based on a list of a user's skills. The most popular method that I've seen would be to treat the user's skills as a document as well, then to calculate the TF-IDF for the skill document, and use something like cosine similarity to calculate the similarity between the skill document and each career document.
This doesn't seem like the ideal solution to me, since cosine similarity is best used when comparing two documents of the same format. For that matter, TF-IDF doesn't seem like the appropriate metric to apply to the user's skill list at all. For instance, if a user adds additional skills to their list, the TF for each skill will drop. In reality, I don't care what the frequency of the skills are in the user's skills list -- I just care that they have those skills (and maybe how well they know those skills).
It seems like a better metric would be to do the following:
1) For each skill that the user has, calculate the TF-IDF of that skill in the career documents
2) For each career, sum the TF-IDF results for all of the user's skills
3) Rank careers based on the above sum
Am I thinking along the right lines here? If so, are there any algorithms that work along these lines, but are more sophisticated than a simple sum? Thanks for the help!
The second approach you explained will work. But there are better ways to solve this kind of problem.
First, you should learn a little bit about language models and move away from the vector space model.
Second, since your problem is similar to expert finding/profiling, you should study a baseline language-modeling framework to implement a solution.
You can implement "A language modeling framework for expert finding" with small changes, so that the formulas are adapted to your problem.
Also, reading "On the assessment of expertise profiles" will give you a better understanding of expert profiling with the framework above.
You can find some good ideas, resources and projects on expert finding/profiling on Balog's blog.
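This is not Balog's framework verbatim, just a minimal query-likelihood sketch with Dirichlet smoothing that captures its shape: each career document plays the role of a "candidate", the user's skill list plays the role of the "query", and careers are ranked by log P(skills | career).

```python
import math
from collections import Counter

def career_score(user_skills, career_tokens, coll_counts, coll_len, mu=2000):
    """log P(skills | career) under a Dirichlet-smoothed unigram language model."""
    doc = Counter(career_tokens)
    doc_len = sum(doc.values())
    score = 0.0
    for skill in user_skills:
        p_coll = (coll_counts[skill] + 1) / (coll_len + 1)        # crude backoff for unseen skills
        score += math.log((doc[skill] + mu * p_coll) / (doc_len + mu))
    return score

# toy career documents (in practice: the concatenated, skill-extracted job listings)
careers = {
    "data analyst":  "sql python excel statistics sql tableau reporting".split(),
    "web developer": "javascript html css react python git".split(),
}
collection = [t for toks in careers.values() for t in toks]
coll_counts, coll_len = Counter(collection), len(collection)

user = ["python", "sql", "statistics"]
ranking = sorted(careers,
                 key=lambda c: career_score(user, careers[c], coll_counts, coll_len),
                 reverse=True)
print(ranking)   # careers ordered by how likely they are to "generate" the user's skills
```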
I would take the SSRM [1] approach and expand the query (job documents) using WordNet (the extracted database [2]) as a semantic lexicon, so you are not constrained to direct word-vs-word matches. SSRM has its own similarity measure (I believe the paper is open access; if not, check this: http://blog.veles.rs/document-similarity-computation-models-literature-review/, where many similarity computation models are listed). Alternatively, if your corpus is big enough, you might try LSA/LSI [3,4] (also covered on that page) without using an external lexicon. But if your text is in English, WordNet's semantic graph is really rich in all directions (hyponyms, synonyms, hypernyms, concepts/synsets).
The bottom line: I would avoid simple SVM/TF-IDF for such a concrete domain. I measured a really serious margin for SSRM over TF-IDF/VSM (measured as macro-averaged F1 on a 5-class, single-label classification task in a narrow domain).
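If you want to try the LSA/LSI route quickly, here is a minimal scikit-learn sketch (the documents and query are placeholders): TF-IDF, truncated SVD, then cosine similarity in the latent space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "data analyst sql python statistics reporting dashboards",
    "frontend developer javascript react css html accessibility",
    "machine learning engineer python statistics models deployment",
]
query = ["python statistics modelling"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)   # k is tiny only because the corpus is tiny
X_lsa = svd.fit_transform(X)
q_lsa = svd.transform(tfidf.transform(query))

print(cosine_similarity(q_lsa, X_lsa))   # similarity of the query to each document in LSA space
```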
[1] A. Hliaoutakis, G. Varelas, E. Voutsakis, E.G.M. Petrakis, E. Milios, Information Retrieval by Semantic Similarity, Int. J. Semant. Web Inf. Syst. 2 (2006) 55–73. doi:10.4018/jswis.2006070104.
[2] J.E. Petralba, An extracted database content from WordNet for Natural Language Processing and Word Games, in: 2014 Int. Conf. Asian Lang. Process., 2014: pp. 199–202. doi:10.1109/IALP.2014.6973502.
[3] P.W. Foltz, Latent semantic analysis for text-based research, Behav. Res. Methods, Instruments, Comput. 28 (1996) 197–202. doi:10.3758/BF03204765.
[4] A. Kashyap, L. Han, R. Yus, J. Sleeman, T. Satyapanich, S. Gandhi, T. Finin, Robust semantic text similarity using LSA, machine learning, and linguistic resources, Springer Netherlands, 2016. doi:10.1007/s10579-015-9319-2.

Possible approach to sentiment analysis (I apologize, I'm very new to NLP)

So I have an idea for classifying sentiments of sentences talking about a given brand product (in this case, pepsi). Basically, let's say I wanted to figure out how people feel about the taste of pepsi. Given this problem, I want to construct abstract sentence templates, basically possible sentence structures that would indicate an opinion about the taste of pepsi. Here's one example for a three word sentence:
[Pepsi] [tastes] [good, bad, great, horrible, etc.]
I then look through my database of sentences, and try to find ones that match this particular structure. Once I have this, I can simply extract the third component and get a sentiment regarding this particular aspect (taste) of this particular entity (pepsi).
The application for this would be looking at tweets, so this might yield a few tweets from the past year or so, but it wouldn't be enough to get an accurate read on the general sentiment, so I would create other possible structures, like:
[I] [love, hate, dislike, like, etc.] [the taste of pepsi]
[I] [love, hate, dislike, like, etc.] [the way pepsi tastes]
[I] [love, hate, dislike, like, etc.] [how pepsi tastes]
And so on and so forth.
Of course, most tweets won't be this simple: there would be words that mean the same as Pepsi, words in between the major components, and other deviations that it would not be practical to account for.
What I'm looking for is just a general direction, or a subfield of sentiment analysis that discusses this particular problem. I have no problem coming up with a large list of possible structures, it's just the deviations from the structures that I'm worried about. I know this is something like a syntax tree, but most of what I've read about them has just been about generating text - in this case I'm trying to match a sentence to a structure, and pull out the entity, sentiment, and aspect components to get a basic three word answer.
This templates approach is the core idea behind my own sentiment mining work. You might find study of EBMT (example-based machine translation) interesting, as a similar (but under-studied) approach in the realm of machine translation.
Get familiar with Wordnet, for automatically generating rephrasings (there are hundreds of papers that build on WordNet, some of which will be useful to you). (The WordNet book is getting old now, but worth at least a skim read if you can find it in a library.)
I found Bing Liu's book a very useful overview of all the different aspects of and approaches to sentiment mining, and a good introduction to further reading. (The Amazon UK reviews are so negative I wondered if it was a different book! The Amazon US reviews are more positive, though.)
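As a small illustration of the WordNet suggestion above, here is a minimal NLTK sketch for expanding the sentiment slot of a template such as "[I] [love, hate, ...] [the taste of pepsi]" with synonyms (plus any antonyms found along the way); the seed word is an arbitrary example, not a curated lexicon.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def expand(word, pos=wn.VERB):
    """Collect WordNet synonyms and antonyms of `word` for a given part of speech."""
    synonyms, antonyms = set(), set()
    for syn in wn.synsets(word, pos=pos):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name().replace('_', ' '))
            for ant in lemma.antonyms():
                antonyms.add(ant.name().replace('_', ' '))
    return synonyms, antonyms

positive, negative = expand('love')
print(sorted(positive))   # candidate rephrasings for the positive slot, e.g. 'enjoy'
print(sorted(negative))   # antonyms WordNet knows about, e.g. 'hate'
```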

How to group / compare similar news articles

In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.
Thanks, in advance for the help!
This problem breaks down into a few subproblems from a machine learning standpoint.
First, you are going to want to figure out what properties of the news stories you want to group on. A common technique is to use 'word bags': just a list of the words that appear in the body of the story or in the title. You can do some additional processing, such as removing common English "stop words" that provide no meaning, such as "the" and "because". You can even do Porter stemming to remove redundancies from plural words and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may have to do some preprocessing to remove HTML markup.
Second, you have to define a similarity metric: similar stories score high in similarity. Going along with the bag of words approach, two stories are similar if they have similar words in them (I'm being vague here, because there are tons of things you can try, and you'll have to see which works best).
Finally, you can use a classic clustering algorithm, such as k-means clustering, which groups the stories together, based on the similarity metric.
In summary: convert news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.
Check out Google Scholar; there have probably been some papers on this specific topic in the recent literature. A lot of the things I just discussed are implemented in natural language processing and machine learning modules for most major languages.
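Here is a minimal scikit-learn sketch of that pipeline, with toy articles and a hand-picked cluster count: TF-IDF with stop-word removal, then k-means. Because TF-IDF rows are L2-normalised by default, Euclidean k-means behaves much like cosine similarity here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "CNN: company XYZ announces record quarterly earnings",
    "MSNBC: XYZ posts record earnings for the quarter",
    "Local team wins the championship after a dramatic final",
]

vectorizer = TfidfVectorizer(stop_words="english")   # stop-word removal; plug in stemming if needed
X = vectorizer.fit_transform(articles)               # sparse TF-IDF feature vectors

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # the two XYZ stories should land in the same cluster
```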
The problem can be broken down to:
How to represent articles (features, usually a bag of words with TF-IDF)
How to calculate similarity between two articles (cosine similarity is the most popular)
How to cluster articles together based on the above
There are two broad groups of clustering algorithms: batch and incremental. Batch is great if you've got all your articles ahead of time. Since you're clustering news, you've probably got your articles coming in incrementally, so you can't cluster them all at once. You'll need an incremental (aka sequential) algorithm, and these tend to be complicated.
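For the incremental case, here is a minimal single-pass ("leader") clustering sketch: each incoming article joins the most similar existing cluster if the similarity clears a threshold, otherwise it starts a new cluster. The threshold is arbitrary and would need tuning on real data.

```python
import numpy as np

def incremental_cluster(vectors, threshold=0.5):
    """Single-pass clustering over L2-normalised row vectors (e.g. dense TF-IDF rows)."""
    centroids, members = [], []
    for i, v in enumerate(vectors):
        sims = [float(v @ c) for c in centroids]      # cosine similarity, since rows are unit length
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))
            members[j].append(i)
            c = vectors[members[j]].mean(axis=0)      # recompute and re-normalise the centroid
            centroids[j] = c / np.linalg.norm(c)
        else:
            centroids.append(v)                       # start a new cluster with this article
            members.append([i])
    return members

# usage sketch: incremental_cluster(tfidf_matrix.toarray()) with rows arriving in time order
```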
You can also try http://www.similetrix.com, a quick Google search popped them up and they claim to offer this service via API.
One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.
You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.
This approach is heavily dependent upon human beings assigning appropriate tags, so that the right articles are returned from the search, but not too many articles. It isn't easy to do really well.

Finding related words (specifically physical objects) to a specific word

I am trying to find words (specifically physical objects) related to a single word. For example:
Tennis: tennis racket, tennis ball, tennis shoe
Snooker: snooker cue, snooker ball, chalk
Chess: chessboard, chess piece
Bookcase: book
I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:
Tennis: serve, volley, foot-fault, set point, return, advantage
Snooker: nothing
Chess: chess move, checkerboard (whose own meronym relationships shows ‘square’ & 'diagonal')
Bookcase: shelve
Weighting of terms will eventually be required, but that is not really a concern now.
Anyone have any suggestions on how to do this?
Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.
The quality of information retrieved from Wikipedia is excellent, specifically how (unsurprisingly) there is so much relevant information (in comparison to some corpora where terms such as 'blog' and 'ipod' do not exist).
The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):
golf: [ball, iron, tee, bag, club]
photography: [camera, film, photograph, art, image]
fishing: [fish, net, hook, trap, bait, lure, rod]
The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:
Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in Wordnet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc, so I can't speak to how complete it is or how easy it is to use.
n-gram frequency analysis. As mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
[...]
Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet, then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.
I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
In the first case, you probably are looking for n-grams where n = 2. You can get them from places like Google or create your own from all of Wikipedia.
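As a minimal NLTK sketch of the bigram idea (the corpus file and the target word 'tennis' are placeholders), this ranks word pairs containing the target by pointwise mutual information:

```python
# Requires: pip install nltk, then nltk.download('punkt')
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize(open("corpus.txt").read().lower())   # e.g. a Wikipedia text dump

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                                         # drop very rare pairs
finder.apply_ngram_filter(lambda w1, w2: "tennis" not in (w1, w2))  # keep only pairs with the target

for pair in finder.nbest(measures.pmi, 20):
    print(pair)   # hopefully things like ('tennis', 'racket'), ('tennis', 'ball'), ...
```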
For more information, check out this related Stack Overflow question.
