How to compare complexities of corpora?

I would like to compare how complex (varied or predictable) my three corpora are. They cover different topics, so some vocabulary is different and some is shared. Looking at one of the data sets, it's clear that its syntax is more difficult than in the other two, its sentences are longer, etc. I built word N-gram language models using the SRILM toolkit (I'm new to language modelling) with the idea that I could then compare these models. One measure mentioned in relation to language models is perplexity. I'm confused about the following question: can I just use the perplexities of the three LMs directly as a measure of how varied the corpora are? The vocabulary and the sizes of the corpora are different, so now I think that this won't be a good comparison. I also built LMs from POS-Tags but the quality of the POS-Tagging result is not good because the language is from fora, has spelling mistakes, ungrammatical sentences and so on. What measures could be used to compare the complexity of corpora from different domains? I'd appreciate your advice.
"I also built LMs from POS-Tags but the quality of the POS-Tagging result is not good because the language is from fora, has spelling mistakes, ungrammatical sentences and so on."
Aside from it being noisy, like you pointed out, you should think carefully about whether particular linguistic features are useful in your analysis. Does one corpus having proportionally more nouns move you toward what you want to learn about the corpora? Maybe in something like authorship attribution, but I can't really think of anywhere else that's effective.
If data sparsity is an issue, LSI can help by collapsing related terms together. This could also help with the spelling issues, collapsing poorly spelt words with their correct counterparts if they appear in similar contexts.
"The vocabulary and the sizes of the corpora are different, so now I think that this won't be a good comparison."
It's not the end of the world. Having more data is always better, but you can work with what you have.
If you haven't chosen a language model yet, there are a few decisions you have to make:
Are you going to smooth the data? How?
Are you going to use an advanced technique to better exploit the data, such as Latent Semantic Indexing (LSI)?
You mention that you have a language model; I'm assuming your language model is a probability distribution of the form P(N-gram|topic). If this is correct, you've already normalized the data, so the probability distributions should be readily comparable. Having more data would get you a more reliable result, but if your corpora are "big enough" to sample each topic reliably, you can move right into comparison.
As for comparison, try the KL-Divergence. KL-Divergence is "a measure of the information lost when Q is used to approximate P." Less loss means that the corpora are more similar. If you want a symmetric comparison, a "cheap" way to do it is to add D(P||Q) + D(Q||P). Note, though:
The KL divergence is only defined if Q(i)=0 ⇒ P(i)=0, for all i (absolute continuity).
So you'll have to smooth, in some manner.
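To make the smoothing requirement concrete, here is a minimal sketch of the symmetric comparison described above. It assumes plain add-alpha smoothing over the union of n-grams seen in either corpus; the function names and the toy corpora are illustrative only, and a toolkit such as SRILM would normally supply better-smoothed estimates.

```python
import math
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_distribution(counts, support, alpha=1.0):
    """Add-alpha smoothing over a shared support, so no event gets probability 0."""
    total = sum(counts.values()) + alpha * len(support)
    return {event: (counts.get(event, 0) + alpha) / total for event in support}


def kl_divergence(p, q):
    """D(P||Q) in bits; p and q must be defined over the same support."""
    return sum(p[e] * math.log2(p[e] / q[e]) for e in p if p[e] > 0)


def symmetric_kl(tokens_a, tokens_b, n=2, alpha=1.0):
    """The 'cheap' symmetric variant: D(P||Q) + D(Q||P)."""
    counts_a, counts_b = ngram_counts(tokens_a, n), ngram_counts(tokens_b, n)
    support = set(counts_a) | set(counts_b)
    p = smoothed_distribution(counts_a, support, alpha)
    q = smoothed_distribution(counts_b, support, alpha)
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy data (assumption): real corpora would be tokenized files, not one-liners.
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the log".split()
print(symmetric_kl(corpus_a, corpus_a))  # identical corpora
print(symmetric_kl(corpus_a, corpus_b))  # differing corpora
```

Because both distributions are smoothed over the same support, Q(i) is never zero and the divergence is always defined. The value of alpha does influence the result, so it is worth reporting alongside the numbers.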


Relationship between vocab size and complexity

I have 2 corpora. If one has a larger vocabulary size than the other, does it mean its language is more complex?
Apart from complexity of the language, what else can affect the size of the vocabulary in a corpus?
No. Language consists of a lot more than just vocabulary. If the grammatical structures are convoluted, then even a smaller vocabulary can lead to very complex sentences.
In order to answer the second part properly, you'd first need to define what exactly you mean by 'complexity'. Unlike, e.g., sentence length, this is not a measure that can easily be quantified.
Most reading comprehension measures combine the length of words and sentences, on the assumption that longer words and longer sentences are harder to understand; however, shorter words tend to have more different meanings, and are arguably harder to understand if their meaning is not clear from the context.
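As an illustration of such a measure, here is a rough sketch of the Automated Readability Index, which combines characters per word with words per sentence. The sentence and word splitting below are crude regular-expression assumptions (they will mis-handle abbreviations and contractions), so treat the scores as relative rather than absolute.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Crude segmentation (assumption): split sentences on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(w) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


simple = "The cat sat. The dog ran."
dense = ("Notwithstanding considerable methodological heterogeneity, "
         "longitudinal investigations consistently demonstrate comparable outcomes.")
print(automated_readability_index(simple))
print(automated_readability_index(dense))
```

A score like this only captures word and sentence length, which is exactly the limitation noted above: short but ambiguous words lower the score without making the text easier.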
Update after clarification: The size of the vocabulary depends on various factors, such as:
1) active vocabulary of the author: if I write a text in my native language (where my vocab is large), the number of different words I use in it will be bigger. If I write in a foreign language where I don't know that many words, it will of course be smaller.
2) the language itself: a bit of an anomaly, but English has a much larger vocabulary than some other languages, due to its history. There are many near-synonyms, so it's easier to use more different words. Other languages are more limited.
3) topic: this is probably the biggest factor, as a very limited, technical topic will result in a more limited vocab. Wikipedia in general uses a broad range of words, but if you only take the articles on animals, the vocab will be more restricted.
4) style: similar to (1), I have an influence on the vocab size by how I write. By limiting my vocab, I can make a text more 'plain' (and leave more to the reader's imagination).
Apart from what Oliver has mentioned, from my professional experience the size of the vocabulary in a corpus often depends on the following:
How exactly do you tokenize and count vocabulary in your corpora?
For example, if you count compounds as a number of separate tokens you will have slightly different numbers compared to if you counted each compound noun as one token.
(elaborating on the issue of "topic" mentioned by Oliver above): each particular topic has its own set of terminology (knitting vs airspace engineering) but the total term density will depend on the author's vocabulary.
Inclusion of loanwords
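The tokenization point is easy to check for yourself. The sketch below is a toy, assuming lowercase ASCII text and hyphenated compounds only; the `tokenize` helper and the sample sentence are made up for illustration.

```python
import re


def tokenize(text, split_compounds):
    """Tokenize lowercase words; optionally keep hyphenated compounds whole."""
    pattern = r"[a-z]+" if split_compounds else r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text.lower())


sample = "A well-known state-of-the-art parser and a well-known tagger"
for split_compounds in (True, False):
    tokens = tokenize(sample, split_compounds)
    # Report token count and vocabulary size (distinct types) for each scheme.
    print(split_compounds, len(tokens), len(set(tokens)))
```

The same sentence yields different token counts and different vocabulary sizes under the two schemes, so vocabulary sizes are only comparable across corpora when the tokenization is identical.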
As to your first question of language complexity, every language's complexity is relative to the issue at hand. If we are developing an English-Japanese translator, the Japanese language is VERY complex; if a Chinese person is learning Japanese, it is MODERATELY complex. If we are comparing inflectional morphology, Russian and German are more complex than English. Basically, there are many ways of looking at the issue of language complexity, depending on the participants' perspectives.

Are transformer-based language models overfitting on the paraphrase identification task? What tools overcome this?

I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set, negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few-word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising, as these models claim an F1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is, are there datasets, techniques, or models that deal well when given small edits? I know that this is an extremely generic question, much more than is typically asked on StackOverflow, but my concern is in finding practical tools. If there is a theoretical technique, then it might not be suitable as I'm in the category of "available tools define your approach" rather than vice-versa. So I hope that the community would have a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is CheckList. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList", is very readable and goes into depth on this type of issue. They have a very relevant table, but it needs some terms defined first:
We prompt users to evaluate each capability with three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a behavior within a capability. MFTs are similar to creating small and focused testing datasets, and are particularly useful for detecting when models use shortcuts to handle complex inputs without actually mastering the capability.
...An Invariance test (INV) is when we apply label-preserving perturbations to inputs and expect the model prediction to remain the same.
A Directional Expectation test (DIR) is similar, except that the label is expected to change in a certain way. For example, we expect that sentiment will not become more positive if we add “You are lame.” to the end of tweets directed at an airline (Figure 1C).
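To show what INV and DIR tests look like for paraphrase identification, here is a minimal hand-rolled sketch. It does not use the CheckList library's API; the helper names, thresholds, and the word-overlap "model" are all assumptions standing in for a real classifier such as the MRPC models mentioned in the question.

```python
def invariance_failures(predict, pairs, perturb, tolerance=0.1):
    """INV: a label-preserving edit should leave the score roughly unchanged."""
    failures = []
    for a, b in pairs:
        before, after = predict(a, b), predict(a, perturb(b))
        if abs(before - after) > tolerance:
            failures.append((a, b, before, after))
    return failures


def directional_failures(predict, pairs, perturb, min_drop=0.3):
    """DIR: a meaning-changing edit (e.g. negation) should lower the score."""
    failures = []
    for a, b in pairs:
        before, after = predict(a, b), predict(a, perturb(b))
        if before - after < min_drop:
            failures.append((a, b, before, after))
    return failures


def word_overlap_model(a, b):
    """Stand-in predictor (Jaccard word overlap), mimicking a surface-level model."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def negate(sentence):
    """Toy negation: turn the first ' is ' into ' is not '."""
    return sentence.replace(" is ", " is not ", 1)


pairs = [("The film is good", "The film is good")]
print(invariance_failures(word_overlap_model, pairs, str.upper))
print(directional_failures(word_overlap_model, pairs, negate))
```

With the overlap-based stand-in, the negated pair still scores 0.8, so the directional test flags it. That is the same kind of failure the question describes. Swapping `word_overlap_model` for a wrapper around a real model would let you run the same checks against it.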
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. To start with, in my corner of Europe, the general sentiment is that peer review requirements for any kind of NLG project are higher by several orders of magnitude compared to other sciences, and likely not without reason or tensor thereof.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure how big a deal this site is in the niche, but Ehud Reiter's Blog is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and seniority level/control you have over the project will eventually compose your vector decision for you. It's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase because of those precise reasons you've mentioned. I'd recommend the same in terms of potential open source involvement, but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science are as lackadaisical to approach as NLG-adjacent fields seem to be right now. In that regard, it's not broad enough and will need to be promulgated far and wide before community-driven tooling will give you serious options on a micro level.
Blasphemy, sure, but the performance is already stacked against you. As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in an "oh, I was waiting for you to start doing something about it" phase.
P.S. I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, blasphemous as this might sound to a 2021 data scientist.


Natural Language Generation - how to test if it sounds natural

I have a set of sentences, which I have generated based on painting analysis. However, I need to test how natural they sound. Is there any API or application which does this?
I am using the Stanford Parser to give me a breakdown, but this doesn't exactly do the job I want!
Also, can one test how similar sentences are? I am randomly generating parts of sentences and want to check the variety of the sentences produced.
A lot of NLP stuff works using things called 'Language Models'.
A language model is something that can take in some text and return a probability. This probability should typically be indicative of how "likely" the given text is.
You typically build a language model by taking a large chunk of text (which we call the "training corpus") and computing some statistics out of it (which represent your "model"), and then using those statistics to take in new, previously unseen sentences and returning probabilities for them.
You should probably google for "language models", "unigram models", "n-gram models" and click on some of the results to find some article or presentation which helps you understand the previous sentence. (It's hard for me to recommend an appropriate tutorial for you because I don't know what your existing background is.)
Anyway, one way to think about language models is that they are systems that take in new text and tell you how similar the new text is to the training corpus the language model was built from. So if you build 2 language models, one out of all the plays written by Shakespeare and another out of a large number of legal documents, then the second one should give a much higher probability to sentences from some new legal document that just got released (as compared to the first model), while the first model should give a much higher probability to some other old English play (written by some other author), because that play is probably more similar to Shakespeare (in terms of the kind of words used, sentence lengths, grammar, etc.) than it is to modern legal language.
All the things you see the Stanford parser give you back for a sentence you give it are generated using language models. One way to think about how those features are built is to pretend that the computer tried every possible combination of tags and every possible parse tree for the sentence you gave it, used some clever language model to identify which is the most probable sequence of tags and the most probable parse tree out there, and returned those back to you.
Getting back to your problem, you need to build a language model out of what you consider natural sounding text and then use that language model to evaluate the sentences you want to measure the naturalness of. To do this, you will have to identify a good training corpus and decide on what type of language model you want to build.
If you can't think of anything better, a collection of Wikipedia articles might serve as a good training corpus representing what natural-sounding English looks like.
As for model type, an "n-gram model" would probably be good enough for your task. More complicated models like "Hidden Markov Models" and "PCFGs" (the stuff that is powering the Stanford page you linked to) would definitely make things even better, but n-grams are definitely the simplest thing you could start with.
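To give a feel for the n-gram approach, here is a small sketch of a bigram model with add-one smoothing that scores sentences by their average log-probability. The class name, the three training sentences, and the whitespace tokenization are all assumptions for illustration; a usable model would be trained on a large corpus (e.g. Wikipedia text) with a proper toolkit.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 slot for unseen words

    def log_prob(self, sentence):
        """Average log2-probability per bigram, so sentence lengths are comparable."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log2(numerator / denominator)
        return total / (len(tokens) - 1)


# Toy training corpus (assumption); replace with real natural-sounding text.
training = [
    "the painting shows a red house",
    "the painting shows a blue river",
    "a red house stands near the river",
]
model = BigramModel(training)
print(model.log_prob("the painting shows a red river"))  # plausible word order
print(model.log_prob("river a shows red painting the"))  # scrambled word order
```

The first sentence should score higher (closer to zero) than the scrambled one. For your generated sentences you could rank them by this score, or pick a threshold by looking at the scores of held-out sentences you consider natural.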

What are the most challenging issues in Sentiment Analysis(opinion mining)?

Opinion Mining/Sentiment Analysis is a somewhat recent subtask of Natural Language Processing. Some compare it to text classification, some take a deeper stance towards it. What do you think are the most challenging issues in Sentiment Analysis (opinion mining)? Can you name a few?
The key challenges for sentiment analysis are:
1) Named Entity Recognition - What is the person actually talking about, e.g. is 300 Spartans a group of Greeks or a movie?
2) Anaphora Resolution - the problem of resolving what a pronoun, or a noun phrase refers to. "We watched the movie and went to dinner; it was awful." What does "It" refer to?
3) Parsing - What is the subject and object of the sentence, which one does the verb and/or adjective actually refer to?
4) Sarcasm - If you don't know the author you have no idea whether 'bad' means bad or good.
5) Twitter - abbreviations, lack of capitals, poor spelling, poor punctuation, poor grammar, ...
I agree with Hightechrider that those are areas where Sentiment Analysis accuracy can see improvement. I would also add that sentiment analysis tends to be done on closed-domain text for the most part. Attempts to do it on open-domain text usually wind up having very bad accuracy/F1 measure/what have you, or else it is pseudo-open-domain because it only looks at certain grammatical constructions. So I would say topic-sensitive sentiment analysis that can identify context and make decisions based on that is an exciting area for research (and industry products).
I'd also expand his 5th point from Twitter to other social media sites (e.g. Facebook, Youtube), where short, ungrammatical utterances are commonplace.
I think the answer is language complexity and mistakes in grammar and spelling. There is a vast number of ways people express their opinions; e.g., sarcasm could be wrongly interpreted as extremely positive sentiment.
The question may be too generic, because there are several types of sentiment analysis (document level, sentence level, comparative sentiment analysis, etc.) and each type has some specific problems.
Generally speaking, I agree with the answer by @Ian Mercer, and I would add 3 other issues:
How to detect a more in-depth sentiment/emotion. Positive and negative is a very simple analysis; one of the challenges is how to extract emotions, such as how much hate there is inside the opinion, how much happiness, how much sadness, etc.
How to detect the object that the opinion is positive for and the object that the opinion is negative for. For example, if you say "She won him!", this means a positive sentiment for her and a negative sentiment for him, at the same time.
How to analyze very subjective sentences or paragraphs. Sometimes even for humans it is very hard to agree on the sentiment of such highly subjective texts. Imagine for a computer...
Although this is a bit of an old question, let me add a note related to Arabic sentiment analysis specifically. The Arabic language has morphological complexities and dialectal varieties which require advanced preprocessing and lexicon building processes that surpass what is needed for the English language.
Please, refer to

Finding related words (specifically physical objects) to a specific word

I am trying to find words (specifically physical objects) related to a single word. For example:
Tennis: tennis racket, tennis ball, tennis shoe
Snooker: snooker cue, snooker ball, chalk
Chess: chessboard, chess piece
Bookcase: book
I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:
Tennis: serve, volley, foot-fault, set point, return, advantage
Snooker: nothing
Chess: chess move, checkerboard (whose own meronym relationships show ‘square’ & 'diagonal')
Bookcase: shelve
Weighting of terms will eventually be required, but that is not really a concern now.
Anyone have any suggestions on how to do this?
Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.
The quality of information retrieved from Wikipedia is excellent, specifically how (unsurprisingly) there is so much relevant information (in comparison to some corpora where terms such as 'blog' and 'ipod' do not exist).
The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):
golf: [ball, iron, tee, bag, club]
photography: [camera, film, photograph, art, image]
fishing: [fish, net, hook, trap, bait, lure, rod]
The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:
Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in WordNet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc so I can't speak to how complete it is, or how easy it is to use.
n-gram frequency analysis, as mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.
I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
In the first case, you probably are looking for n-grams where n = 2. You can get them from places like Google or create your own from all of Wikipedia.
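A minimal sketch of the n = 2 idea, assuming you build your own counts from raw text: collect the words that immediately follow the target word and rank them by frequency. The helper name and the four-sentence corpus are made up for illustration; on real data you would want a corpus the size of Wikipedia and probably a noun filter.

```python
from collections import Counter


def right_neighbours(sentences, target, top_k=3):
    """Count words that immediately follow `target` (i.e. bigrams, n = 2)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            if first == target:
                counts[second] += 1
    return counts.most_common(top_k)


# Toy corpus (assumption); real input would be a large text collection.
toy_corpus = [
    "she bought a tennis racket and a tennis ball",
    "the tennis ball bounced over the net",
    "his tennis racket needs new strings",
    "a tennis shoe was left on the court",
]
print(right_neighbours(toy_corpus, "tennis"))
```

This only finds tightly bound terms such as "tennis ball". Words that merely co-occur in the same sentence (like "net" or "court") need a wider window or a document-level method.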
For more information, check out this related Stack Overflow question.
