How to extract the thesaurus file from MS Word?

I am writing my graduate project on semantic text analysis for big data.
I need something like a synonym dictionary to do the job.
So I decided to use the thesaurus built into MS Word.
Does anyone know in which file MS Word stores its thesaurus, so that I can extract it, parse it, and use it in my program?
(I know there are many other dictionaries on the net, but I'm too lazy to retype them manually.)

Related

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native work as you can and only export to CSV if you want to do something like print tabular results or load them into Pandas DataFrames. Of course, there is always more than one way to do things, so you could manipulate the data in CSV, but RDF, by design, carries far more information than a CSV file can, so when you're manipulating RDF data you have more things to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. Presumably that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into; then you have the full graph, with links and literal values in it, at your disposal to manipulate with SPARQL queries.
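
A minimal sketch of that workflow with rdflib might look like the following. The file names, the namespace used to filter object URIs, and the query are placeholder assumptions, and it presumes the library's URIs dereference to RDF via content negotiation:

from rdflib import Graph, URIRef

g = Graph()

# 1. Load the Turtle files you already have into a single graph.
for path in ["people_1.ttl", "people_2.ttl"]:      # hypothetical file names
    g.parse(path, format="turtle")

# 2. Dereference object URIs and merge the retrieved RDF into the same graph
#    (assumes the library's URIs return RDF when requested).
for obj in set(g.objects()):
    if isinstance(obj, URIRef) and str(obj).startswith("http://example.org/"):  # placeholder namespace
        try:
            g.parse(obj)               # rdflib fetches and parses the remote RDF
        except Exception as exc:
            print(f"could not fetch {obj}: {exc}")

# 3. Work on the combined graph with SPARQL; export to CSV only at the very end if needed.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?s ?label
    WHERE {
        ?s ?p ?o .
        ?o rdfs:label ?label .
    }
""")
for s, label in results:
    print(s, label)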

How to collect RDF triples for a simple knowledge graph?

When building a knowledge graph, the first step (if I understand it correctly) is to collect structured data, mainly RDF triples written using some ontology, for example Schema.org. Now, what is the best way to collect these RDF triples?
It seems there are two things we can do.
Use a crawler to crawl web content, and for a specific page, search for RDF triples on that page. If we find them, collect them; if not, move on to the next page.
For the current page, instead of looking for existing RDF triples, use some NLP tools to understand the page content (such as NELL, see http://rtw.ml.cmu.edu/rtw/).
Now, is my understanding above (basically/almost) correct? If so, why do we use NLP? Why not just rely on the existing RDF triples? It seems NLP is not as good/reliable as we are hoping… I could be completely wrong.
Here is another attempt at asking the same question.
Let us say we want to create RDF triples using the 3rd method mentioned by #AKSW, i.e., extracting RDF triples from some web pages (text).
For example, this page. If you open it and use "view source", you can see quite a few semantic mark-ups there (using OGP and Schema.org). So my crawler can simply do this: ONLY crawl/parse these mark-ups, easily turn them into RDF triples, declare success, and move on to the next page.
So what the crawler does on this text page is very simple: it only collects the semantic markup and creates RDF triples from it. It is simple and efficient.
The other choice is to use NLP tools to automatically extract structured semantic data from the same text (maybe we are not satisfied with the existing markup). Once we extract the structured information, we then create RDF triples from it. This is obviously a much harder thing to do, and we are not sure about its accuracy either (?).
What is the best practice here, and what are the pros/cons? I would prefer the easy/simple way: simply collect the existing markup and turn it into RDF content, instead of using NLP tools.
I am not sure how many people would agree with this. Is this the best practice? Or is it simply a question of how far our requirements take us?
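
For the markup-only route preferred above, a minimal Python sketch might look like this. The page URL is a placeholder, the snippet only handles JSON-LD blocks (RDFa or Microdata would need an extra extraction step, e.g. with a library such as extruct), and it assumes rdflib ≥ 6, which ships a JSON-LD parser:

from html.parser import HTMLParser
from urllib.request import urlopen

from rdflib import Graph

class JsonLdCollector(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._buffer = None
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

url = "https://example.org/some-article"       # placeholder page
html = urlopen(url).read().decode("utf-8", errors="replace")

collector = JsonLdCollector()
collector.feed(html)

# Turn each embedded JSON-LD block into RDF triples in one graph.
g = Graph()
for block in collector.blocks:
    g.parse(data=block, format="json-ld", publicID=url)

print(g.serialize(format="turtle"))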
Your question is unclear, because you did not state your data source, and all the answers on this page assume it to be web markup. This is not necessarily the case: if you are interested in structured data published according to best practices (called Linked Data), you can use so-called SPARQL endpoints to query Linked Open Data (LOD) datasets and generate your knowledge graph via federated queries. If you want to collect structured data from website markup, you have to parse the markup to find and retrieve lightweight annotations written in RDFa, HTML5 Microdata, or JSON-LD. The availability of such annotations may be limited on a large share of websites, but for structured data expressed in RDF you should not use NLP at all, because RDF statements are machine-interpretable and easier to process than unstructured data such as textual website content. The best way to create the triples you referred to depends on what you are trying to achieve.
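
As an illustration of the SPARQL-endpoint route, here is a hedged sketch using the SPARQLWrapper package, with DBpedia only as a well-known example endpoint; the classes and properties queried are illustrative, not a recommendation of what you should extract:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")   # example public endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?name WHERE {
        ?person a dbo:Person ;
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    }
    LIMIT 10
""")

# Each binding row maps variable names to values; store these as triples
# or rows in whatever knowledge-graph backend you use.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], row["name"]["value"])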

elasticsearch, ngrams should cover entire query? (compound word query)

Suppose a user searches for "koreanpop"
when he really means "korean pop".
I don't think I can build a dictionary in order to recognize "korean" and "pop" as words.
I'm going to use nGram for the query analyzer. (Is this a horrible idea?)
I'd like to try out
"ko/reanpop"
"kor/eanpop"
"kore/anpop"
"korea/npop"
"korean/pop"
"koreanp/op"
and find documents that contain both "korean" and "pop"
(which will be edge-ngram, min=2)
Is this an ok strategy in practice?
(I know that Koreans do not use whitespace to separate words as they should, because Korean search engines support that.)
How do I accomplish this with elasticsearch?
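
One common way to approximate this in Elasticsearch is an ngram analyzer applied at both index and search time, so that "koreanpop" and "korean pop" share most of their character grams. The sketch below is an assumption-laden illustration, not a definitive answer: the index name, field name, and gram sizes are placeholders, and the 7.x-style body= call may emit deprecation warnings on newer clients.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "gram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3,   # default index.max_ngram_diff allows a spread of 1
                }
            },
            "analyzer": {
                "gram_analyzer": {
                    "type": "custom",
                    "tokenizer": "gram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "gram_analyzer",         # grams at index time
                "search_analyzer": "gram_analyzer",  # and at query time
            }
        }
    },
}

es.indices.create(index="songs", body=settings)

# "koreanpop" and "korean pop" now share most of their 2-3 character grams,
# so a match query with a minimum_should_match threshold can bridge the gap.
query = {
    "query": {
        "match": {
            "title": {
                "query": "koreanpop",
                "minimum_should_match": "75%",
            }
        }
    }
}
print(es.search(index="songs", body=query))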

Finding possibly matching strings in a large dataset

I'm in the middle of a project where I have to process text documents and enhance them with Wikipedia links. Preprocessing a document includes locating all the possible target articles, so I extract all ngrams and compare them against a database containing all the article names. The current algorithm is a simple caseless string comparison preceded by simple trimming. However, I'd like it to be more flexible and tolerant of errors or small text modifications like prefixes, etc. Besides, the database is pretty huge, and I have a feeling that string comparison in such a large database is not the best idea...
What I thought of is a hashing function, which would assign a unique (I'd rather avoid collisions) hash to any article or ngram so that I could compare hashes instead of the strings. The difference between two hashes would let me know whether the words are similar, so that I could gather all the possible target articles.
Theoretically, I could use cosine similarity to calculate the similarity between words, but this doesn't seem right to me, because comparing the characters multiple times sounds like a performance issue.
Is there any recommended way to do this? Is it a good idea at all? Maybe string comparison with proper indexing isn't that bad and hashing won't help me here?
I looked around at hashing functions and text similarity algorithms, but I haven't found a solution yet...
Consider using the Apache Lucene API. It provides functionality for searching, stemming, tokenization, indexing, and document similarity scoring. It's an open-source implementation of basic best practices in Information Retrieval.
The Lucene functionality that seems most useful to you is its MoreLikeThis feature, which selects the most distinctive (TF-IDF-weighted) terms of a document to locate similar documents.
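
Separately from the Lucene suggestion, the similarity idea in the question can be illustrated without any hashing by comparing character n-gram sets (Jaccard overlap). This is only a concept sketch with placeholder titles, not a scalable solution; a real system would index the n-grams so that only candidates sharing at least one gram are ever compared:

def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a string."""
    text = text.lower().strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Candidate article titles (placeholder data) compared against one extracted ngram.
titles = ["Barack Obama", "Barack Obama Sr.", "Michelle Obama", "Ohm's law"]
ngram = "barack obama"
for title in titles:
    print(f"{title}: {jaccard(ngram, title):.2f}")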

Quick Filter List

Everyone is familiar with this functionality. If you open up the Outlook address book and start typing a name, the list below the search box instantly filters to contain only items that match your query. .NET Reflector has a similar feature when you're browsing types... you start typing, and regardless of how large the underlying assembly you're browsing is, it's near-instantaneous.
I've always kind of wondered what the secret sauce is here. How is it so fast? I imagine there are also different algorithms depending on whether the data is present in memory or needs to be fetched from some external source (i.e. a DB, searching some file, etc.).
I'm not sure if this is relevant, but if there are resources out there, I'm particularly interested in how one might do this with WinForms... but if you know of general resources, I'm interested in those as well :-)
What is the most common use of the trie data structure?
A trie is basically a tree structure for storing a large list of similar strings; it provides fast lookup of strings (like a hashtable) and allows you to iterate over them in alphabetical order.
Image from: http://en.wikipedia.org/wiki/Trie:
In this case, the Trie stores the strings:
i
in
inn
to
tea
ten
For any prefix that you enter (for example, 't' or 'te'), you can easily look up all of the words that start with that prefix. More importantly, lookups depend on the length of the string, not on how many strings are stored in the trie. Read the Wikipedia article I referenced to learn more.
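
For illustration, a minimal trie with prefix lookup over the example words above might look like this (a plain-Python sketch, not tied to any particular library):

class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.is_word = False    # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words_with_prefix(self, prefix):
        """Walk to the prefix node, then collect every word below it."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        words, stack = [], [(node, prefix)]
        while stack:
            node, so_far = stack.pop()
            if node.is_word:
                words.append(so_far)
            for ch, child in node.children.items():
                stack.append((child, so_far + ch))
        return sorted(words)

trie = Trie()
for w in ["i", "in", "inn", "to", "tea", "ten"]:
    trie.insert(w)

print(trie.words_with_prefix("te"))   # ['tea', 'ten']
print(trie.words_with_prefix("i"))    # ['i', 'in', 'inn']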
The process is called full text indexing/search.
If you want to play with the algorithms and data structures for this, I would recommend reading Programming Collective Intelligence for a good introduction to the field; if you just want the functionality, I would recommend Lucene.

Resources