Parse Tree for a properly structured sentence using OpenNLP

I have an NLP task where I need to make sure that a paragraph of multiple sentences includes at least one well-structured question. I'm using OpenNLP to generate the parse trees for the sentences in the paragraph. My questions are:
1. Is there a way to get a list of possible parse trees for a properly structured question?
2. How can I compare two parse trees?
Thanks

Well, you have essentially answered the question yourself. You just have to get a dataset containing different types of questions and experiment with it.
Gather different types of questions and the parse trees corresponding to them, and save all the output parse trees in a format you can reuse in the next step.
When it comes to comparing two parse trees, it is basically comparing text, which is a fairly simple task.
Obviously, doing this directly on raw text files will take more time and memory. To avoid that, convert and save the parse trees of your standard questions in a binary format; this will cost less time and memory when combined with the next step.
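As a minimal sketch of the text-comparison step (assuming the parses come out as Penn-Treebank-style bracketed strings, which is what OpenNLP prints; the example trees and the tag-skeleton regex are only illustrative):

```python
import re

def normalize_tree(bracketed: str) -> str:
    """Collapse whitespace so formatting differences between parses don't matter."""
    return re.sub(r"\s+", " ", bracketed).strip()

def same_structure(tree_a: str, tree_b: str) -> bool:
    """Compare two parses after stripping the leaf words, keeping only the tag skeleton."""
    def skeleton(t: str) -> str:
        return re.sub(r"\(([A-Z$.,:]+) [^()]+\)", r"(\1)", normalize_tree(t))
    return skeleton(tree_a) == skeleton(tree_b)

# Two illustrative questions with the same SBARQ skeleton but different words
q1 = "(TOP (SBARQ (WHNP (WP What)) (SQ (VBZ is) (NP (PRP$ your) (NN name))) (. ?)))"
q2 = "(TOP (SBARQ (WHNP (WP What)) (SQ (VBZ is) (NP (PRP$ your) (NN quest))) (. ?)))"
print(same_structure(q1, q2))  # True: identical structure, different leaves
```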
Hope this helps, all the best!

Related

Method to generate keywords out of a scientific text?

Which method of text analysis should I use if I need to get a number of multiword keywords, say (up to) 5 per text, analysing a scientific text of some length? In particular, the text could be
a title,
or an abstract.
Preferably a method already scripted in Python.
Thank you!
You could look into keyword extraction, collocation finding or text summarization. Depending on what you want to use it for, you could also look into general terminology extraction. These are just some methods; there are also other approaches like topic modeling etc.
Collocation finding/terminology extraction are more about finding domain-specific terminology and require a larger corpus, but they can help to unify the generated tags. Basically, you would first run this kind of analysis to find n-grams which are domain-specific and therefore, in scientific literature, indicative of the topic, and in a second step you would mark the occurrence of these extracted n-grams in the original texts.
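As a rough illustration of that two-step idea, here is a sketch with NLTK's collocation finder; the corpus list, the frequency cutoff, and the number of candidate terms are assumptions you would tune for your own data.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Placeholder corpus: a list of document strings from your own collection.
corpus = ["first scientific abstract ...", "second scientific abstract ..."]

# Step 1: find domain-specific bigrams across the whole corpus.
tokens = [t.lower() for doc in corpus for t in nltk.word_tokenize(doc)]
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                       # drop rare, accidental pairs
candidate_terms = set(finder.nbest(BigramAssocMeasures().pmi, 50))

# Step 2: mark which candidate terms occur in each original text.
for doc in corpus:
    doc_bigrams = set(nltk.bigrams(t.lower() for t in nltk.word_tokenize(doc)))
    print([" ".join(b) for b in candidate_terms & doc_bigrams][:5])
```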
Keyword extraction and text summarization lean more towards being applied to single texts, but obviously the resulting tags are going to be less unified.
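For the single-text side, one very simple baseline is to rank TF-IDF-weighted n-grams per document; here is a sketch with scikit-learn (1.0+, for get_feature_names_out), where the texts list stands in for your own titles or abstracts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder inputs: titles or abstracts from your own collection.
texts = ["first scientific abstract ...", "second scientific abstract ..."]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(texts)
terms = np.array(vectorizer.get_feature_names_out())

for i in range(len(texts)):
    weights = tfidf[i].toarray().ravel()
    print(terms[weights.argsort()[::-1][:5]])     # 5 highest-weighted n-grams
```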
It's difficult to say which method makes the most sense for you as this depends on the amount of data you have, the diversity of topics within the data you have, what you are planning to do with the keywords/tags and how much time you want to spend to optimize this extraction.

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines), and the text is usually badly written and contains many spelling mistakes.
So I must go through some text cleaning and vectorizing. To do so, I considered two approaches:
First one:
cleaning the text by replacing misspelled words using the Hunspell package, which is a spell checker and morphological analyzer
+
tokenization
+
convert sentences to vectors using tf-idf
The problem here is that sometimes Hunspell fails to provide the correct word and replaces the misspelled word with another word that doesn't have the same meaning. Furthermore, Hunspell does not recognize acronyms or abbreviations (which are very important in my case) and tends to replace them.
Second approach:
tokenization
+
using some embedding method (like word2vec) to convert words into vectors without cleaning the text
I need to know if there is some (theoretical or empirical) way to compare these two approaches :)
Please do not hesitate to respond if you have any ideas to share, I'd love to discuss them with you.
Thank you in advance
I post this here just to summarise the comments in a longer form and give you a bit more commentary. Not sure it will answer your question. If anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector space approach. Something that could work out, for example, is US vs. UK spelling or shorthands like w8 vs. full forms like wait.
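To make that concrete, here is a small gensim sketch (gensim 4.x; older versions use size instead of vector_size). Whether "w8" actually ends up close to "wait" depends entirely on how regular that shorthand is in your own data; the toy sentences below are just placeholders.

```python
from gensim.models import Word2Vec

# Placeholder, pre-tokenized corpus; in practice this would be your 8 million lines.
sentences = [
    ["i", "will", "w8", "for", "you", "at", "the", "station"],
    ["i", "will", "wait", "for", "you", "at", "the", "station"],
    # ... many more ...
]

# min_count=1 only so the toy example runs; raise it on real data.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# If "w8" and "wait" appear in similar contexts often enough,
# their vectors should end up close together.
print(model.wv.most_similar("wait", topn=5))
```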
Another point I want to make clear (or perhaps you should do that) is that you are not looking to build a machine learning model here. You could consider the word embeddings you would generate a sort of machine learning model, but they are not; they are just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using Hunspell introduces new mistakes. That will no doubt also be the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that. It is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task, as #lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly we use another task which is dependent on its output to draw conclusions about its success. In your case, it seems that you should pick a task that depends on individual words like document classification. Let's say that you have some sort of labels associated with your documents, say topics or types of news. Predicting these labels could be a legitimate way of evaluating the efficiency of your approaches. It is also a chance for you to see if they do more harm than good by comparing to the baseline of "dirty" data. Remember that it's about relative differences and the actual performance of the task is of no importance.
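A sketch of what such an external evaluation could look like, assuming you have the raw documents, the same documents after cleaning, and one label per document; the pipeline (tf-idf plus logistic regression) is just a convenient baseline, not the only choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data: raw texts, the Hunspell-cleaned versions, and a label per text.
raw_docs = ["..."]
cleaned_docs = ["..."]
labels = ["..."]

def score(docs, labels):
    """Cross-validated accuracy of a simple tf-idf + logistic regression classifier."""
    pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(pipeline, docs, labels, cv=5).mean()

# The gap between the two numbers matters more than the numbers themselves.
print("raw     :", score(raw_docs, labels))
print("cleaned :", score(cleaned_docs, labels))
```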

Extract recommendations/suggestions from text

My documents often include sentences like:
Had I known about this, I would have prevented this problem
or
If John was informed, this wouldn't happen
or
this wouldn't be a problem if Jason was smart
I'm interested in extracting this sort of information (not sure what it is called, linguistically). So I would like to extract either the whole sentence or, ideally, a summary like:
(inform John) (prevent)
Most, if not all, of the examples of relation extraction and information extraction that I've come across follow a fairly standard flow:
do NER, then relation extraction that looks for relations like "in" or "at", etc. (ch. 7 of the NLTK book, for example).
Do these type of sentences fall under a certain category in NLP? Are there any papers/tutorials on something like this?
When you are asking for a suggestion on a topic which is pretty open, give more examples. I mean to say that giving just one example and explaining what you are targeting doesn't provide enough information. For example, if you have sentences which follow specific patterns, then it becomes easier to extract information (in your desired format) from them. Otherwise, it becomes a broad and complex research problem!
From your example, it looks like you want to extract the head words of a sentence and the other words which modify those heads. You can use dependency parsing for this task. Look at the Stanford Neural Network Dependency Parser. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads. So, I believe it should help you with your desired task.
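For a quick look at those head/modifier relations you can use any dependency parser; this sketch uses spaCy rather than the Stanford parser purely because it is a few lines in Python, and the mark/advcl comment describes the typical pattern, not a guarantee.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
doc = nlp("If John was informed, this wouldn't happen")

for token in doc:
    # token.dep_ is the relation label, token.head is the word it modifies
    print(f"{token.text:10} --{token.dep_:>7}--> {token.head.text}")

# Conditional clauses like this one usually surface as a mark/advcl pattern
# ("If ... informed" attached to "happen"), which you can match on.
```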
If you want to make it more general, then this problem aligns well with Open Information Extraction. You may consider looking into the Stanford OpenIE API.
You may also consider the Stanford Relation Extractor API for your task, but I strongly believe that relation extraction through dependency parsing best suits your problem definition. You can read this paper to get some ideas and use them in your task.

Building your own text corpus

It may sound stupid, but do you know how to build a text corpus? I have searched everywhere and found already existing corpora, but I wonder how they were built. For example, if I want to build a corpus of positive and negative tweets, do I just have to make two files? But what about the contents of those files? I don't get it.
In this example, he stores positive and negative tweets in RedisDB.
But what about the contents of those files?
This depends mostly on what library you're using. XML (with a variety of tags) is common, as is one sentence per line. The tricky part is getting the data in the first place.
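As a minimal sketch of the one-sentence-per-line convention, with one file per class (the tweet lists are placeholders for data you would collect yourself):

```python
# Write one file per class, one tweet per line.
positive_tweets = ["I love this phone", "great game last night"]
negative_tweets = ["worst service ever", "my flight got cancelled again"]

for path, tweets in [("positive.txt", positive_tweets), ("negative.txt", negative_tweets)]:
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(tweet.replace("\n", " ") + "\n")

# Reading it back is just as simple; the file name doubles as the label.
with open("positive.txt", encoding="utf-8") as f:
    corpus = [(line.strip(), "pos") for line in f]
print(corpus)
```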
For example, if I want to build a corpus of positive and negative tweets
Does this mean that you want to know how to mark the tweets as positive and negative? If so, what you're looking for is called text classification or sentiment analysis.
If you want to find a bunch of tweets, I'd check one of these pages (just from a quick search of my own).
Clickonf5: http://clickonf5.org/5438/download-tweets-pdf-xml-format-local-machine-server/
Quora: http://quora.com/What-is-the-best-tool-to-download-and-archive-Twitter-data-of-certain-hashtags-and-mentions-for-academic-research
Google Groups: http://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/kfislDfxunI
For general learning about how to create a corpus, I would check out the Handbook of Natural Language Processing Wiki by Richard Xiao.

Finding words from a dictionary in a string of text

How would you go about parsing a string of free-form text to detect things like locations and names based on a dictionary of locations and names? In my particular application there will be tens of thousands, if not more, entries in my dictionaries, so I'm pretty sure just running through them all is out of the question. Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken, this falls within the field of natural language processing and more specifically named entity recognition (NER); however, my attempts to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with it, although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
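If you go that route, NLTK has shipped a thin wrapper around the Stanford tagger (newer NLTK versions point you to the CoreNLP server interface instead, and Java is required either way); the model and jar paths below are placeholders for wherever you unpack the download.

```python
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

st = StanfordNERTagger(
    "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",  # placeholder path
    "stanford-ner/stanford-ner.jar",                                   # placeholder path
)

tokens = word_tokenize("Alice moved from Boston to San Francisco in 2019.")
print(st.tag(tokens))
# Expected shape: [('Alice', 'PERSON'), ('moved', 'O'), ..., ('Boston', 'LOCATION'), ...]
```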
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.
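A rough sketch of that proper-noun idea combined with the fuzzy matching you asked about, using NLTK's POS tagger and difflib for the "within x edits" part; the gazetteer is made up, and for tens of thousands of entries you would want an index (e.g. a BK-tree) rather than this linear scan, but the shape of the approach is the same.

```python
import difflib
import nltk

# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # first run only
gazetteer = {"london", "paris", "berlin", "jonathan"}   # placeholder dictionary entries

text = "Jonathon flew from Lundon to Paris last week."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

for word, tag in tagged:
    if tag in ("NNP", "NNPS"):                          # proper nouns only
        # cutoff is a similarity threshold, roughly playing the role of "within x edits"
        match = difflib.get_close_matches(word.lower(), gazetteer, n=1, cutoff=0.8)
        if match:
            print(f"{word} -> {match[0]}")
```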
