Tool for detecting differences between text passages from two different groups - nlp

I have text data from two different groups. In total I have around 4,000 text passages of around 300 words each.
I am looking for a tool that allows me to analyze the differences between these two groups.
Ideally, this tool could analyze several dimensions, e.g. sentence length, use of superlatives, perspective of the narrator, use of the passive voice, and clear, objective writing vs. hedged and imprecise writing.

In Python, you can use the NLTK or spaCy packages to process the texts so that you can analyze them (using pandas, for example). But there is no ready-made software (as far as I know) that will do all of that for you. You're going to have to write your own code.
For example, you would create a pandas dataframe with a row for each text, with its group ('A' or 'B' or whatever) as one column and the raw text as another. Then you use NLTK to tokenize the text and do whatever other preprocessing you want, storing the clean, tokenized text in another column. Then you can add a column for, say, mean sentence length (which you can compute using NLTK). From there you'll be able to get the means of the two groups, standard deviations, statistical significance of the difference, etc.
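As a rough illustration of that workflow (the column names and example passages below are made up, and scipy is just one possible choice for the statistics):

import nltk
import pandas as pd
from scipy import stats

nltk.download("punkt", quiet=True)  # tokenizer models used by sent_tokenize/word_tokenize

df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "text": [
        "Short sentences here. Very short indeed.",
        "Another brief passage. It keeps things simple.",
        "This passage contains a single, considerably longer sentence overall.",
        "Here we have sentences that tend to run on for quite a while longer.",
    ],
})

def mean_sentence_length(text):
    """Average number of word tokens per sentence in a passage."""
    sentences = nltk.sent_tokenize(text)
    return sum(len(nltk.word_tokenize(s)) for s in sentences) / len(sentences)

df["mean_sent_len"] = df["text"].apply(mean_sentence_length)

# Per-group summary statistics and a simple (Welch's) t-test.
print(df.groupby("group")["mean_sent_len"].agg(["mean", "std"]))
a = df.loc[df["group"] == "A", "mean_sent_len"]
b = df.loc[df["group"] == "B", "mean_sent_len"]
print(stats.ttest_ind(a, b, equal_var=False))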
It's straightforward for something like sentence length, but the other features you mention are more difficult. What does it mean for a text to be clear and objective, or hedged and imprecise? That means nothing on its own: you have to decide what exactly you mean by that, and what features characterize it. For example, you could make a list of hedgers ('I think', 'may', 'might', 'I'm not sure but', etc.) and then count their frequency in each text.
Something like "perspective of the narrator" might need to be annotated manually, depending on what you mean by it. If you just mean 1st person vs. 3rd person, that could be easy to identify (compare the 'I's vs. the 'he/she's), but anything more subtle than that, I'm not sure how you'd do it.
Good luck with your project!

Related

Which correlation to choose?

I have a data set of article titles and their number of views. I want to study how the words used in a title relate to the number of views. I have reduced the words to their stems (removed the endings), and using Python I want to calculate the average number of views for each word. This will let me judge which topics people are more and less interested in. I wanted to ask what such methods are called in statistics, so I can read about them; there are many different correlation measures, and I would like to approach the task as correctly as possible. P.S. I then want to compute the same correlation for bundles of two and three words (bigrams and trigrams).
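As a minimal sketch of the per-word averaging described here (the column names and data are illustrative, and this only computes the averages, not a correlation measure):

import pandas as pd

df = pd.DataFrame({
    "title": ["python pandas tutorial", "cooking pasta fast", "python tips"],
    "views": [1200, 300, 900],
})

# One row per (word, views) pair, then the average views per word.
exploded = df.assign(word=df["title"].str.split()).explode("word")
avg_views = exploded.groupby("word")["views"].mean().sort_values(ascending=False)
print(avg_views)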

Method to generate keywords out of a scientific text?

Which method of text analysis should I use if I need to get a number of multiword keywords, say (up to) 5 per text, analysing a scientific text of some length? In particular, the text could be
a title,
or an abstract.
Preferably a method already scripted in Python.
Thank you!
You could look into keyword extraction, collocation finding, or text summarization. Depending on what you want to use it for, you could also look into general terminology extraction. These are just some methods; there are also other approaches like topic modeling, etc.
Collocation finding and terminology extraction are more about finding domain-specific terminology and require a larger corpus, but they can help to unify the generated tags. Basically, you would first run this kind of analysis to find n-grams which are domain-specific (and therefore, in scientific literature, indicative of the topic), and in a second step you would mark the occurrence of these extracted n-grams in the original texts.
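As a small example of the collocation-finding step, NLTK's collocation finder can surface recurring bigrams; the toy corpus below is made up, and in practice you would run this over a much larger collection:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)

corpus = ("Keyword extraction and terminology extraction are applied to "
          "scientific text. Terminology extraction finds domain-specific "
          "terminology across a corpus of scientific text.")

tokens = [t.lower() for t in nltk.word_tokenize(corpus) if t.isalpha()]
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only bigrams that occur at least twice
print(finder.nbest(BigramAssocMeasures.pmi, 5))  # top bigrams ranked by PMI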
Keyword extraction and text summarization lean more towards being applied to single texts, but obviously the resulting tags are going to be less unified.
It's difficult to say which method makes the most sense for you, as this depends on the amount of data you have, the diversity of topics within that data, what you are planning to do with the keywords/tags, and how much time you want to spend optimizing the extraction.

How do i retain numbers while preprocessing data using gensim in python?

I have used gensim.utils.simple_preprocess(str(sentence)) to create a dictionary of words that I want to use for topic modelling. However, this also filters out important numbers (house resolutions, bill numbers, etc.) that I really need. How do I overcome this? Possibly by replacing digits with their word form. How do I go about that, though?
You don't have to use simple_preprocess() - it doesn't do much, it's not that configurable or sophisticated, and typically the other Gensim algorithms just need lists of tokens.
So choose your own tokenization - which in some cases, depending on your source data, could be as simple as a .split() on whitespace.
If you want to look at what simple_preprocess() does, as a model, you can view its Python source at:
https://github.com/RaRe-Technologies/gensim/blob/351456b4f7d597e5a4522e71acedf785b2128ca1/gensim/utils.py#L288
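As an illustration of choosing your own tokenization, a small regex-based tokenizer that keeps digits might look like this (the function name and example text are made up):

import re
from gensim.corpora import Dictionary

def tokenize_keep_numbers(text):
    """Lowercase and split into alphanumeric tokens, preserving digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

docs = ["House Resolution 1044 was introduced in 2019."]
tokenized = [tokenize_keep_numbers(d) for d in docs]
print(tokenized)  # [['house', 'resolution', '1044', 'was', 'introduced', 'in', '2019']]

# The lists of tokens can then be fed to Gensim as usual, e.g.:
dictionary = Dictionary(tokenized)
print(dictionary.token2id)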

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
Also, I can think of having a lookup hash table with the names of countries and cities, and then comparing every token extracted from the text against that table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from the text of tweets, so the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
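For example, a quick sketch using spaCy's pretrained NER (an alternative to the Stanford tool; requires pip install spacy and python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Heavy rain is expected in New York and across the People's Republic of China.")

# GPE covers countries/cities/states; LOC covers other locations.
locations = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(locations)  # e.g. ['New York', "the People's Republic of China"]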
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitively, make sure the case of your list is already normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
How fast are the tweets coming in? As in, is it the full Twitter firehose or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do very well with Twitter because of all of the leetspeak. The NLP can be tuned for precision or recall, depending on your needs, to limit the number of lookups performed in the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.

Given a list of dozens of words, how do I find the best matching sections from a corpus of hundreds of texts?

Let’s say I have a list of 250 words, which may consist of unique entries throughout, or a bunch of words in all their grammatical forms, or all sorts of words in a particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections, perhaps 150 words each (maybe I would like to determine these sections dynamically in the future, but I shall leave it for now).
My question is this: What is a useful way to get those sections out of the corpus that contain most of my 250 words?
I have looked at a few full text search engines like Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.
The use case of such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am just curious to know how to do this.
Effectively what I am looking for is document comparison. I have found a way to rank texts by similarity to a given document, in PostgreSQL.
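The core overlap idea can also be sketched outside the database in a few lines of Python (the word list and sections below are illustrative):

import re

word_list = {"walk", "walked", "walking", "run", "ran"}

sections = [
    "She walked to the station and then ran to catch the train.",
    "The report discusses quarterly revenue and operating costs.",
]

def score(section):
    """Number of distinct target words that appear in the section."""
    tokens = set(re.findall(r"[a-z']+", section.lower()))
    return len(word_list & tokens)

# Rank the sections by how many of the target words they contain.
for s in sorted(sections, key=score, reverse=True):
    print(score(s), s)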
