Iterate a spark pipeline - apache-spark

I'm currently working on a sentiment analysis project using Spark. I'm trying to implement a pipeline like this:
Raw text ---(tokenize)---> Tokenized words ---(join with sentiment dictionary)---> Words with sentiment value ---(distribute words back to their sentences)---> Sentences with sentiment value ---(average the sentiment values of the words in each sentence)---> New sentiment dictionary
Now I want to repeat this process until the difference between the sentiment dictionaries produced by two consecutive iterations falls below a defined threshold. However, I'm not sure how to do this. I wrote custom transformers for this pipeline (since most of the transformers I need are not available in the ml library), but for the iteration step I'm not sure what the best approach is. Should I just put a while loop around the whole thing and repeat everything, or is there a better mechanism?
Thank you for your time.
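A minimal driver-side sketch of such a convergence loop, assuming PySpark DataFrames; build_dictionary is a hypothetical stand-in for the custom tokenize/join/aggregate transformers described above, and the word/sentiment column names are illustrative:

```python
# Sketch of a driver-side convergence loop (PySpark).
# `build_dictionary` stands in for the custom transformers above;
# the `word` and `sentiment` column names are assumptions.
from pyspark.sql import DataFrame, functions as F

def build_dictionary(sentences: DataFrame, dictionary: DataFrame) -> DataFrame:
    """One pass of the pipeline: tokenize, join with `dictionary`,
    average sentiment back over sentences, return the new dictionary."""
    ...

def dictionary_delta(old: DataFrame, new: DataFrame) -> float:
    """Mean absolute change in sentiment value between two dictionaries."""
    joined = old.alias("o").join(new.alias("n"), on="word")
    row = joined.agg(
        F.avg(F.abs(F.col("o.sentiment") - F.col("n.sentiment"))).alias("delta")
    ).first()
    return row["delta"] or 0.0

def iterate(sentences: DataFrame, dictionary: DataFrame,
            tol: float = 1e-3, max_iter: int = 20) -> DataFrame:
    for _ in range(max_iter):
        new_dictionary = build_dictionary(sentences, dictionary).cache()
        if dictionary_delta(dictionary, new_dictionary) < tol:
            return new_dictionary
        dictionary.unpersist()
        dictionary = new_dictionary
    return dictionary
```

Spark's ML Pipeline API has no built-in fixed-point or convergence construct, so a plain loop on the driver like this is the usual approach; caching (or periodically checkpointing) the intermediate dictionary keeps the lineage from growing across iterations.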

Related

How to construct the pipeline for ner & achieve better results using spacy?

I'm currently trying to do named entity recognition for tweets using spaCy. For that purpose I created word vectors with Gensim, which I'm using to train a new blank NER model. At the moment I am a bit confused about how to set up the pipeline for my purpose. Regarding this, I have the following questions:
The pipeline which I'm currently using consists of only one ner component. Does anyone have recommendations for constructing the pipeline (e.g. using tok2vec before ner)?
I also wonder whether my approach of training a new model with previously created word vectors is the right one, and how I could further improve my prediction accuracy.
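A minimal sketch of one way to wire this up in spaCy v3, assuming the Gensim vectors have already been converted into a spaCy pipeline directory (e.g. with `spacy init vectors`); the paths, labels, and training data here are purely illustrative:

```python
# Sketch: pipeline carrying pre-trained vectors, with tok2vec and ner components.
# Assumes the Gensim vectors were exported beforehand, e.g.:
#   python -m spacy init vectors en gensim_vectors.txt ./vectors_model
import spacy
from spacy.training import Example

nlp = spacy.load("./vectors_model")   # pipeline that holds the static vectors
tok2vec = nlp.add_pipe("tok2vec")     # token-to-vector layer
ner = nlp.add_pipe("ner")             # entity recognizer

# Tiny placeholder training set: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("Tweeting from London today", {"entities": [(14, 20, "GPE")]}),
]
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(10):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```

Note that with default component settings the ner component builds its own embedding layer; making it actually listen to a shared tok2vec is easier to control through a training config (`spacy init config`) than in code, so the config-driven workflow is usually the safer route.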

Does the pattern in sentence edits affect the performance of a sentence correction seq2seq model?

I am trying to train a seq2seq model using the T5 transformer for a sentence correction task. I am using a StackOverflow dataset for training and evaluation. The dataset contains original and edited sentences extracted from StackOverflow posts.
Below are some samples:
Original: is it possible to print all reudctions in Haskell - using WinHugs
Edited: Is it possible to print all reductions in Haskell - using WinHugs

Original: How do I pass a String into a fucntion in an NVelocty Template
Edited: How do I pass a String into a function in an NVelocity Template

Original: Caconical term for something that can only occur once
Edited: Canonical term for something that can only occur once
When trained on samples that have high similarity (using the longest common subsequence to determine this) and whose edits are spelling corrections, verb changes, and preposition changes, the model predicts good suggestions. But when I use samples that do not have high similarity, the model's predictions are much less accurate. Below are some samples:
Original: For what do API providers use API keys, such as the UPS API Key
Edited: Why do some API providers require an API key

Original: NET - Programmatic Cell Edit
Edited: NET - working with GridView Programmatically

Original: How to use http api (pseudo REST) in C#
Edited: How to fire a GET request over a pseudo REST service in C#
I am using simpletransformers to train the T5 model, based on t5-base.
Can anyone confirm whether it is a limitation of seq2seq models that they cannot learn much when the input and target sequences do not follow a consistent pattern?
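For reference, a minimal training sketch with simpletransformers along the lines described above; the prefix/input_text/target_text column names follow the library's T5 convention, and the hyperparameters and example rows are placeholders:

```python
# Sketch: fine-tuning t5-base on (original, edited) pairs with simpletransformers.
# Hyperparameters are placeholders, not tuned values.
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args

train_df = pd.DataFrame(
    [
        ("correct",
         "is it possible to print all reudctions in Haskell - using WinHugs",
         "Is it possible to print all reductions in Haskell - using WinHugs"),
        ("correct",
         "Caconical term for something that can only occur once",
         "Canonical term for something that can only occur once"),
    ],
    columns=["prefix", "input_text", "target_text"],
)

args = T5Args(num_train_epochs=3, train_batch_size=8, overwrite_output_dir=True)
model = T5Model("t5", "t5-base", args=args)
model.train_model(train_df)

# Inference expects the same task prefix used during training.
print(model.predict(["correct: How do I pass a String into a fucntion"]))
```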

Is there any way to classify text based on some given keywords using Python?

I've been trying to learn a bit of machine learning for a project that I'm working on. So far I've managed to classify text using an SVM with sklearn and spaCy, with some good results, but I don't want to classify the text with the SVM alone; I also want it to be classified based on a list of keywords that I have. For example: if a sentence contains the word "fast" or "seconds", I would like it to be classified as "performance".
I'm really new to machine learning and I would really appreciate any advice.
I assume that you are already taking a portion of your data, classifying it manually and then using the result as your training data for the SVM algorithm.
If yes, then you could just append your list of keywords (features) and desired classifications (labels) to your training data. If you are not doing it already, I'd recommend using the SnowballStemmer on your training data features.
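A minimal sketch of that suggestion, assuming an sklearn TF-IDF + LinearSVC setup and NLTK's SnowballStemmer; the example sentences, keywords, and labels are made up for illustration:

```python
# Sketch: stem the text, append keyword -> label pairs as extra training
# examples, and train a linear SVM on top (sklearn + NLTK).
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

stemmer = SnowballStemmer("english")

def stem_text(text: str) -> str:
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

# Manually labelled sentences (placeholder data).
texts = ["the app feels fast", "the login page crashes on submit"]
labels = ["performance", "stability"]

# Keyword list appended as additional (feature, label) examples, as suggested above.
keyword_examples = {"fast": "performance", "seconds": "performance"}
texts += list(keyword_examples.keys())
labels += list(keyword_examples.values())

clf = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=stem_text)),
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)
print(clf.predict(["it loads the results in two seconds"]))
```

If the keyword rule must always win, you can also check the keyword list first and only fall back to the SVM prediction when no keyword matches.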

NLP Structure Question (best way for doing feature extraction)

I am building an NLP pipeline and I am trying to get my head around the optimal structure. My understanding at the moment is the following:
Step 1 - Text pre-processing [a. lowercasing, b. stopword removal, c. stemming, d. lemmatisation]
Step 2 - Feature extraction
Step 3 - Classification, using the different types of classifiers (LinearSVC etc.)
From what I read online there are several approaches in regard to feature extraction but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction?
I read online that you can do [a. vectorising using scikit-learn, b. TF-IDF],
but I also read that you can use part-of-speech tags, word2vec or other embeddings, and named entity recognition.
b. What is the optimal process/structure of using these?
c. For the text pre-processing I am doing the processing on a text column of a df, and the last modified version of it is what I use as the input to my classifier. If you do feature extraction, do you do that in the same column, or do you create a new one and send only the features from that column to the classifier?
Thanks so much in advance
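On question (c), a common pattern is to keep the cleaned text in its own column and let the vectorizer turn it into a feature matrix at fit time, rather than storing features back into the DataFrame. A minimal sketch (column names and cleaning steps are illustrative):

```python
# Sketch: pre-process into a separate column, then vectorize that column.
# The extracted features live in a sparse matrix, not in the DataFrame.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

df = pd.DataFrame({
    "text": ["The SERVICE was great!!", "Terrible support, very slow replies"],
    "label": ["positive", "negative"],
})

# Step 1: pre-processing kept in a new column so the raw text is preserved.
df["clean_text"] = df["text"].str.lower().str.replace(r"[^a-z\s]", "", regex=True)

# Step 2: feature extraction produces a matrix X, one row per document.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["clean_text"])

# Step 3: classification on the feature matrix.
clf = LinearSVC().fit(X, df["label"])
```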
The preprocessing pipeline depends mainly on the problem you are trying to solve. TF-IDF, word embeddings, etc. each have their own restrictions and advantages.
You need to understand the problem and also the data associated with it. In order to make the best use of the data, we need to implement the proper pipeline.
Specifically for text-related problems, you will find word embeddings to be very useful. TF-IDF is useful when the problem calls for emphasising less frequent words. Word embeddings, on the other hand, convert the text to an N-dimensional vector which may show similarity to other vectors. This can bring a sense of association into your data, and the model can learn the best features possible.
In simple cases, we can use a bag of words representation to tokenize the texts.
So you need to discover the best approach for your problem. If you are solving a problem which closely resembles a well-known NLP problem, like IMDB review classification or sentiment analysis on Twitter data, you can find a number of approaches on the internet.
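To make the bag-of-words vs TF-IDF distinction concrete, here is a small sketch with scikit-learn (the corpus is made up); words that appear in every document, like "the" and "was", end up with relatively lower TF-IDF weights:

```python
# Sketch: the same tiny corpus vectorized as raw counts (bag of words)
# and as TF-IDF, which down-weights words shared by every document.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]

bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())      # integer counts per word
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())    # "the"/"was" weighted lower
print(tfidf.get_feature_names_out())
```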

Sentiment analysis with NLTK python for sentences using sample data or webservice?

I am embarking upon an NLP project for sentiment analysis.
I have successfully installed NLTK for Python (it seems like a great piece of software for this). However, I am having trouble understanding how it can be used to accomplish my task.
Here is my task:
I start with one long piece of data (let's say several hundred tweets on the subject of the UK election, from their web service)
I would like to break this up into sentences (or chunks no longer than 100 or so characters) (I guess I can just do this in Python??)
Then search through all the sentences for specific mentions within each sentence, e.g. "David Cameron"
Then I would like to check for positive/negative sentiment in each sentence and count them accordingly
NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm.
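The splitting and filtering steps are straightforward with NLTK on their own; a minimal sketch (the text is a placeholder for the tweet data):

```python
# Sketch: split raw text into sentences and keep the ones that
# mention a target phrase, using NLTK's sentence tokenizer.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # sentence tokenizer models, one-off download

raw_text = "David Cameron spoke today. The polls look tight. Turnout was high."
sentences = sent_tokenize(raw_text)
mentions = [s for s in sentences if "david cameron" in s.lower()]
print(mentions)
```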
Here are the troubles I am having:
All the data sets I can find, e.g. the corpus of movie review data that comes with NLTK, aren't in web-service format. It looks like this has had some processing done already. As far as I can see, the processing (by Stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative, e.g. the polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?)
I am not sure I understand why WEKA and NLTK would be used together. It seems like they do much the same thing. If I'm processing the data with WEKA first to find sentiment, why would I need NLTK? Is it possible to explain why this might be necessary?
I have found a few scripts that get somewhat near this task, but all of them use the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences, rather than using the data samples given in the link?
Any help is much appreciated and will save me much hair!
Cheers Ke
The movie review data has already been marked by humans as being positive or negative (the person who made the review gave the movie a rating which is used to determine polarity). These gold standard labels allow you to train a classifier, which you could then use for other movie reviews. You could train a classifier in NLTK with that data, but applying the results to election tweets might be less accurate than randomly guessing positive or negative. Alternatively, you can go through and label a few thousand tweets yourself as positive or negative and use this as your training set.
For a description of using Naive Bayes for sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Then in that code, instead of using the movie corpus, use your own data to calculate word counts (in the word_feats method).
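The core of that tutorial boils down to something like the following sketch: simple presence-of-word features (word_feats) plus NLTK's Naive Bayes classifier, trained here on the movie review corpus; swapping in your own labelled tweets means replacing the movie_reviews documents with your own (words, label) pairs:

```python
# Sketch: Naive Bayes sentiment classifier on NLTK's movie review corpus,
# using presence-of-word features as in the linked tutorial.
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import movie_reviews

nltk.download("movie_reviews")

def word_feats(words):
    """Bag-of-words presence features."""
    return {word: True for word in words}

neg_ids = movie_reviews.fileids("neg")
pos_ids = movie_reviews.fileids("pos")
neg_feats = [(word_feats(movie_reviews.words(fileids=[f])), "neg") for f in neg_ids]
pos_feats = [(word_feats(movie_reviews.words(fileids=[f])), "pos") for f in pos_ids]

# 750 documents per class for training, the rest for evaluation.
classifier = NaiveBayesClassifier.train(neg_feats[:750] + pos_feats[:750])
print(accuracy(classifier, neg_feats[750:] + pos_feats[750:]))
print(classifier.classify(word_feats("what a dreadful waste of time".split())))
```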
Why don't you use WSD? Use a word sense disambiguation tool to find senses, and map polarity to the senses instead of to the words. In this case you will get somewhat more accurate results compared to word-level polarity.
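One way to read that suggestion (not necessarily the answerer's exact setup) is NLTK's Lesk implementation combined with the SentiWordNet corpus; a minimal sketch:

```python
# Sketch: disambiguate each word with Lesk, then score the polarity of the
# chosen sense via SentiWordNet instead of scoring the raw word.
import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk
from nltk.corpus import sentiwordnet as swn

for resource in ("punkt", "wordnet", "sentiwordnet"):
    nltk.download(resource)

sentence = "David Cameron gave a surprisingly strong speech"
tokens = word_tokenize(sentence)

score = 0.0
for word in tokens:
    sense = lesk(tokens, word)            # pick a WordNet sense from context
    if sense is None:
        continue
    senti = swn.senti_synset(sense.name())
    score += senti.pos_score() - senti.neg_score()

print(score)  # > 0 leans positive, < 0 leans negative
```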
