Azure Cognitive Text Analytics

I'm working with the new Text Analytics library (v3.0-preview) for Sentiment Analysis. I passed text containing multiple sentences as a single document to get the sentiment of the whole text.
I have received the following warning in the response.
"warnings": ["Sentence was truncated because it exceeded the maximum token count."]

First result on your favorite search engine: https://learn.microsoft.com/en-us/azure/cognitive-services/text-analytics/overview#data-limits
Maximum size of a single document: 5,120 characters as measured by
StringInfo.LengthInTextElements.

This is by design: a document is split internally into sentences, but there is a maximum limit on the number of words in a single sentence.
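Since the limits apply per document and per sentence, one workaround is to split long text at sentence boundaries and submit each chunk as its own document. A minimal sketch (the naive regex sentence splitter is an assumption, not part of the service; swap in a real sentence splitter for production text):

```python
import re

MAX_CHARS = 5120  # per-document character limit from the data-limits page

def chunk_text(text, limit=MAX_CHARS):
    """Greedily pack sentences into chunks of at most `limit` characters.
    A single sentence longer than `limit` still becomes its own
    (oversized) chunk."""
    # Naive split: break on whitespace that follows terminal punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > limit:
            chunks.append(current)
            current = sent
        else:
            current = (current + ' ' + sent).strip()
    return chunks + [current] if current else chunks
```

Each chunk can then be sent as a separate document, and the per-document sentiments aggregated afterwards.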

Related

Sentiment analysis: How do I preprocess a file in Weka and move it to Tableau?

I'm very new to data analytics and English is not my first language so please bear with me.
I'm currently working in Weka on a sentiment (positive / negative) analysis of tweets. I have applied the StringToWordVector filter (including stemming and removing stop words) and afterwards ranked and removed all unnecessary attributes. I am left with 773 words / attributes.
My goal is to create a word cloud in either Tableau or Wordart that highlights how frequent each word is and whether the word has a predominantly positive or negative connotation.
I am struggling to find a way to process my Weka file so that the 773 words, their frequency, and their positive or negative sentiment are saved in a form that is easily read into Tableau or Wordart.
Can anyone help with how to accomplish this?
I have attached a picture of the Weka preprocess window where I'm currently stuck.
Thanks!
My process so far:
1. RemoveWithValues to remove unwanted sentiment categories, so I'm left with 'positive' and 'negative'
2. StringToWordVector, followed by ranking and removal of some attributes / words
3. Stuck... Loading the data into Tableau in its current state just creates a big mess.
The goal is to visualize the data in a meaningful way. Additions to a word cloud are also welcome.
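One possible route (a sketch, not a Weka feature: the `(text, label)` row layout and column names are assumptions) is to export the labeled tweets, compute each word's frequency and the share of its occurrences in positive tweets, and write a small word-level CSV that Tableau or Wordart can read directly:

```python
import csv
from collections import Counter

def word_stats(rows):
    """rows: iterable of (text, label) pairs, label 'positive' or 'negative'.
    Returns {word: (frequency, positive_share)}."""
    freq, pos = Counter(), Counter()
    for text, label in rows:
        for word in text.lower().split():
            freq[word] += 1
            if label == 'positive':
                pos[word] += 1
    return {w: (n, pos[w] / n) for w, n in freq.items()}

def write_for_tableau(rows, path):
    """Write word, frequency, positive_share columns for a word cloud."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['word', 'frequency', 'positive_share'])
        for word, (n, share) in sorted(word_stats(rows).items()):
            writer.writerow([word, n, f'{share:.2f}'])
```

In Tableau, `frequency` can then drive word size and `positive_share` the color scale.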

Microsoft Translate Adding Extra Words To Translation

I am trying to translate from English to Welsh. I have a data set of 3,032 sentences, which I am aware is below the recommended 10,000, but the issue is random words being added within sentences or appended at the end of the translation.
With the dataset I have, I am getting a BLEU score of 94.25.
Image of Translation Differences
I have attached four examples where extra words are being added throughout the form. At no point in the dataset is there any duplication of words matching these formats, and there is no trailing whitespace in the translations, which would otherwise explain why "yn" in particular appears as a new sentence.
Is there any way of removing these erroneous extra words or increasing the accuracy of the translation? Increasing the dataset to more than 10,000 sentences would be a very large task, and not something to undertake if the system would still have a high chance of returning random words.
I also raised this as a support request with Microsoft. They said the issue was down to using a dictionary that included verbs as part of the translation.
I have since tried using English UK as the basis for the translation - an option that previously failed to build - and with the same dataset the BLEU score is 93.24 but the extra words have disappeared.
My issue has been resolved and it's now down to training out the incorrect translations. It appears the English to Welsh translation has a bug.

Batch running spaCy nlp() pipeline for large documents

I am trying to run the nlp() pipeline on a series of transcripts totaling 20,211,676 characters, on a machine with 8 GB of RAM. I'm very new to both Python and spaCy, but the corpus-comparison tools and sentence-chunking features are perfect for the paper I'm working on now.
What I've tried
I've begun by importing the English pipeline and disabling 'ner' for faster speeds:
import spacy

nlp = spacy.load('en_core_web_lg', disable=['ner'])
Then I break the corpus into pieces of 800,000 characters, since spaCy recommends roughly 100,000 characters per GB of RAM:
split_text = [text[i:i+800000] for i in range(0, len(text), 800000)]
Then I loop the pieces through the pipeline and create a list of Doc objects:
nlp_text = []
for piece in split_text:
    piece = nlp(piece)
    nlp_text.append(piece)
This works after a long wait. Note: I have tried raising the threshold via nlp.max_length, but anything above 1,200,000 crashes my Python session.
Now that I have everything piped through, I need to concatenate everything back together, since I will eventually need to compare the whole document to another (of roughly equal size). I would also be interested in finding the most frequent noun phrases in the document as a whole, not just within artificial 800,000-character pieces.
nlp_text = ''.join(nlp_text)
However I get the error message:
TypeError: sequence item 0: expected str instance, spacy.tokens.doc.Doc found
I realize that I could convert to strings and concatenate, but that would defeat the purpose of having "token" objects to work with.
What I need
Is there anything I can do (apart from paying for expensive AWS compute time) to split my documents, run the nlp() pipeline, and then join the tokens to reconstruct my complete document as an object of study? Am I running the pipeline wrong for a big document? Am I doomed to getting 64 GB of RAM somewhere?
Edit 1: Response to Ongenz
(1) Here is the error message I receive
ValueError: [E088] Text of length 1071747 exceeds maximum of 1000000.
The v2.x parser and NER models require roughly 1GB of temporary memory
per 100,000 characters in the input. This means long texts may cause
memory allocation errors. If you're not using the parser or NER, it's
probably safe to increase the nlp.max_length limit. The limit is in
number of characters, so you can check whether your inputs are too
long by checking len(text).
I could not find a part of the documentation that refers to this directly.
(2) My goal is to do a series of measures, including (but not limited to, if the need arises): word frequency, tf-idf counts, sentence counts, counting the most frequent noun chunks, and comparing two corpora using w2v or d2v strategies.
My understanding is that I need every part of the spaCy pipeline apart from NER for this.
(3) You are completely right about cutting the document; in a perfect world I would cut on a line break instead. But as you mentioned, I cannot use join to regroup my broken-apart corpus, so it might not be relevant anyway.
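For what it's worth, cutting on line breaks rather than at arbitrary character offsets only takes a few lines. A sketch (it assumes the transcripts contain newlines; the function name is made up):

```python
def split_on_newlines(text, limit=800_000):
    """Greedily pack whole lines into chunks of at most `limit` characters,
    so no sentence is cut mid-line. A single line longer than `limit`
    still becomes its own oversized chunk."""
    chunks, current = [], ''
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > limit:
            chunks.append(current)
            current = line
        else:
            current += line
    if current:
        chunks.append(current)
    return chunks
```

Because `keepends=True` preserves the newlines, concatenating the chunks reproduces the original text exactly.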
You need to join the resulting Docs using the Doc.from_docs method:
from spacy.tokens import Doc

docs = []
for piece in split_text:
    doc = nlp(piece)
    docs.append(doc)

merged = Doc.from_docs(docs)
See the documentation here for more details.
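With the merged Doc you can then count noun chunks over the whole document rather than per piece. The helper below is a stdlib-only sketch (`top_phrases` is a hypothetical name); the spaCy call in the comment assumes the parser stayed enabled, since noun_chunks requires it:

```python
from collections import Counter

def top_phrases(phrases, n=10):
    """Return the n most common phrases, case-insensitive."""
    return Counter(p.lower() for p in phrases).most_common(n)

# With the merged Doc from Doc.from_docs (parser enabled):
#   top_phrases(chunk.text for chunk in merged.noun_chunks)
```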

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated from random users. I am trying to detect products mentioned in the text while taking into account spelling variation. For example, the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- I max-tokenized the product names and the users' text (I split words by punctuation and digits, so s8 is tokenized into 's' and '8'). Then I checked each token in the user's text to see if it is in my vocabulary with a Damerau-Levenshtein distance <= 1, to allow for variation in spelling. Once I detected a sequence of tokens that exist in the vocabulary, I searched for the product matching the query, checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of tokens that exists in the vocabulary does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and as a result dates are detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with the labels 'O' and 'Product'. I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. Results were okay, but not great.
None of the above approaches achieved reliable performance. I tried regular expressions, but with so many different combinations to consider it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results if I trained on more data, but supposing there isn't enough training data, what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions.
Consider splitting your problem into two parts.
1) Conduct a spelling check using a dictionary of known product names (this is not an NLP task, and there are guides on how to implement spell checking).
2) Once you have done this pre-processing (spell checking), apply your NER algorithm.
It should improve your accuracy.
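To illustrate the spell-check step, here is a sketch that snaps each token to its closest vocabulary entry before the NER pass. It uses stdlib difflib similarity rather than the Damerau-Levenshtein distance mentioned in the question, and the vocabulary is a made-up example:

```python
import difflib

# Hypothetical vocabulary built from the product-name list.
VOCAB = ['samsung', 'galaxy', 's8', 'iphone', 'note']

def correct_token(token, vocab=VOCAB, cutoff=0.75):
    """Replace a token with its closest vocabulary word, if any is
    similar enough; otherwise return the token unchanged."""
    match = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return match[0] if match else token

def normalize(text):
    return ' '.join(correct_token(t) for t in text.split())
```

For instance, `normalize("i am interested in galxy s8")` maps "galxy" to "galaxy" while leaving ordinary words untouched. The cutoff trades recall against false corrections and would need tuning on real data.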

Resource that provides the number of documents a term appears in

I am looking for resources that provide the number of documents a term appears in. For example, there are about 25 billion documents in the indexed internet that contain the term "the".
I don't know of any document frequency lists for large corpora such as the web, but there are some term frequency lists available. For example, there are the frequency lists from the web corpora compiled by the Web-As-Corpus Kool Yinitiative, which include the 2-billion-word ukWaC English web corpus. Alternatively, there are the n-grams from the Google Books Corpus.
It has been shown that such term frequency counts can be used to reliably approximate document frequency counts.
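The distinction between the two counts can be shown in a few lines (toy corpus, stdlib only):

```python
from collections import Counter

docs = [
    'the cat sat on the mat',
    'the dog barked',
    'a cat and a dog',
]

# Term frequency: total occurrences across the whole corpus.
tf = Counter(word for doc in docs for word in doc.split())

# Document frequency: number of documents containing the word at least once.
df = Counter(word for doc in docs for word in set(doc.split()))

print(tf['the'], df['the'])  # 3 2
```

A frequent word like "the" appears in nearly every document, which is why its term frequency tracks its document frequency closely at web scale.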
Here are some more tractable frequency lists.
Also take a look at this site - it contains a lot of info about existing corpora and word/n-gram lists. Unfortunately, most resources are paid, except the n-grams (for n > 1), so if you're going to process multiword terms, it can help.