I am attempting to classify dump piles in Google Earth Engine. Using Sentinel-2 data, classifying the first 5 principal components led to the best visual result. Here is the PCA code used in the script with Sentinel-2.
Is there a way of iteratively testing SVM parameters in Google Earth Engine and selecting the best fit based on ROC AUC?
How can I limit overfitting (other than by visual inspection)?
Related
To improve the recommender system for Buyer Material Groups (BMG), our company wants to train a model using customers' historical spend data. The model should be trained on historical "Short text descriptions" to predict the appropriate BMG. The dataset has more than 500,000 rows and the text descriptions are multilingual (up to 40 characters).
Question 1: Can I use supervised learning given that the descriptions are in multiple languages? If yes, are classic approaches like multinomial Naive Bayes or SVM suitable?
Question 2: If the first model does not perform well and I want to improve it by using unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
If you have other ideas or approaches, please feel free to share them :). (It is a simple text classification problem.)
Can I use supervised learning given that the descriptions are in multiple languages?
Yes, this is not a problem, except that it makes your data more sparse. If you really only have 40 characters (is that not 40 words?) per item, you may not have enough data. Also, the main challenge for supervised learning will be whether you have labels for the data.
If yes, are classic approaches like multinomial Naive Bayes or SVM suitable?
They will work as well as they always have, though these days building a vector representation is probably a better choice.
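As a rough baseline for the classic route, here is a minimal sketch using scikit-learn. The descriptions and the label names (`FURNITURE`, `IT`) are made up for illustration, and the choice of character n-grams is an assumption: with very short, multilingual strings they avoid language-specific tokenization, but whether they work well on the real data would have to be tested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up short descriptions in several languages with hypothetical BMG labels.
texts = [
    "office chair black", "Bürostuhl schwarz", "chaise de bureau",
    "laptop 15 inch", "Laptop 15 Zoll", "ordinateur portable",
    "office desk oak", "Schreibtisch Eiche", "notebook 14 inch",
]
labels = [
    "FURNITURE", "FURNITURE", "FURNITURE",
    "IT", "IT", "IT",
    "FURNITURE", "FURNITURE", "IT",
]

# Character n-grams inside word boundaries, fed to a linear SVM.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LinearSVC(C=1.0),
)
model.fit(texts, labels)

# Predict labels for unseen descriptions.
predictions = model.predict(["Laptop 13 Zoll", "chaise noire"])
print(list(predictions))
```

Swapping `LinearSVC` for `MultinomialNB` in the same pipeline gives the Naive Bayes variant for comparison.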
If the first model does not perform well and I want to improve it by using unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
Assuming the numerical labels are labels on the original data, you can add them as tokens like LABEL001 and the model can learn representations of them if you want to make an unsupervised recommender.
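The label-token idea can be sketched in a few lines. The helper name `add_label_token` and the sample rows are hypothetical; the resulting token lists are the kind of input a word2vec or fastText trainer expects, but the training call itself is left out here.

```python
def add_label_token(description: str, label_id: int) -> str:
    """Append the numerical label as a pseudo-word such as LABEL001."""
    return f"{description} LABEL{label_id:03d}"


# Made-up (description, numerical label) rows.
rows = [("office chair black", 1), ("Laptop 15 Zoll", 2)]

# One token list per row, with the label token at the end.
corpus = [add_label_token(text, label).split() for text, label in rows]
print(corpus)
```

After training embeddings on such a corpus, the vector learned for each `LABEL…` token can be compared with the vectors of description words.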
Honestly these days I wouldn't start with Naive Bayes or classical models, I'd go straight to word vectors as a first test for clustering. Using fasttext or word2vec is pretty straightforward. The main problem is that if you really only have 40 characters per item, that just might not be enough data to cluster usefully.
I want to evaluate different classifiers in performing the link-prediction task by using node embedding algorithms. More specifically, I want to evaluate if node embedding can improve the accuracy of different classifiers predicting new links between nodes.
My idea is the following:
I create a dataset containing both positive and negative samples (real links and non-existing links).
I split the dataset into a Development Set (DS) and an Evaluation Set (ES).
I use the DS to perform a grid search with cross-validation (CV) to find the best model.
I train the best model on the entire DS, and then I evaluate its performance on the ES.
The problem is the following: I cannot use node embedding algorithms on the entire dataset because, in this case, ES will contain information related to the original graph topology. Therefore, I need to extract node embeddings from the training and test sets generated during the Grid Search CV, but how can I do it by using the sklearn.model_selection.GridSearchCV class?
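One possible way to do this, sketched under assumptions: put the embedding step inside a scikit-learn `Pipeline` as a custom transformer, so that `GridSearchCV` refits it on the training part of every fold. The class `FoldEdgeEmbedder` is hypothetical, the toy graph is made up, and the SVD of the adjacency matrix is only a stand-in for a real node embedding algorithm such as node2vec.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline


class FoldEdgeEmbedder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: embeds nodes using only the positive
    edges of the data it is fitted on (the training part of a fold)."""

    def __init__(self, n_nodes=20, dim=4):
        self.n_nodes = n_nodes
        self.dim = dim

    def fit(self, X, y):
        X = np.asarray(X, dtype=int)
        y = np.asarray(y)
        adjacency = np.zeros((self.n_nodes, self.n_nodes))
        for u, v in X[y == 1]:
            adjacency[u, v] = adjacency[v, u] = 1.0
        # Stand-in embedding: scaled top singular vectors of the adjacency matrix.
        left, singular, _ = np.linalg.svd(adjacency)
        self.embeddings_ = left[:, : self.dim] * np.sqrt(singular[: self.dim])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=int)
        # Edge feature: element-wise product of the two node embeddings.
        return self.embeddings_[X[:, 0]] * self.embeddings_[X[:, 1]]


# Toy graph (an assumption): nodes link only within their own half.
n_nodes = 20
pairs = np.array([(u, v) for u in range(n_nodes) for v in range(u + 1, n_nodes)])
y = ((pairs[:, 0] < 10) == (pairs[:, 1] < 10)).astype(int)

# Development set (DS) and evaluation set (ES).
X_ds, X_es, y_ds, y_es = train_test_split(
    pairs, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("embed", FoldEdgeEmbedder(n_nodes=n_nodes)),
    ("clf", LogisticRegression()),
])
search = GridSearchCV(
    pipeline,
    param_grid={"embed__dim": [2, 4], "clf__C": [0.1, 1.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_ds, y_ds)              # embeddings are refit inside every fold
es_auc = search.score(X_es, y_es)   # best model, refit on the whole DS
print(search.best_params_, es_auc)
```

Because the pipeline is cloned and refitted for each fold, the embeddings never see the validation edges, and the final refit uses the DS only, so the ES edges stay out of the graph as well.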
I have been trying to learn a bit of machine learning for a project that I'm working on. So far I have managed to classify text using SVM with sklearn and spaCy, with some good results, but I don't only want to classify the text with SVM; I also want it to be classified based on a list of keywords that I have. For example: if the sentence contains the word "fast" or "seconds", I would like it to be classified as "performance".
I'm really new to machine learning and would appreciate any advice.
I assume that you are already taking a portion of your data, classifying it manually and then using the result as your training data for the SVM algorithm.
If yes, then you could just append your list of keywords (features) and desired classifications (labels) to your training data. If you are not doing it already, I'd recommend using the SnowballStemmer on your training data features.
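A minimal sketch of that suggestion, assuming a TF-IDF plus linear SVM pipeline in scikit-learn. The sentences, the second class name `usability`, and the variable `keyword_labels` are made up for illustration, and stemming is left out to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, manually labelled training sentences.
train_texts = [
    "the page takes too long to load",
    "the app crashes under heavy load",
    "the menu is hard to find",
    "the buttons are confusing to use",
]
train_labels = ["performance", "performance", "usability", "usability"]

# Keyword list with the desired class for each keyword.
keyword_labels = {"fast": "performance", "seconds": "performance"}

# Append every keyword as an extra one-word training example.
train_texts += list(keyword_labels)
train_labels += list(keyword_labels.values())

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

prediction = model.predict(["the export takes 30 seconds"])[0]
print(prediction)
```

How strongly the appended keywords influence the result depends on the rest of the training data, so it is worth checking predictions on held-out sentences that contain them.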
I have tried different approaches such as MultinomialNB, SVM, MLPClassifier, CNN, and an LSTM network to train on a dataset that consists of tweets and labels (the Big Five classes: openness, conscientiousness, extraversion, agreeableness, neuroticism). But the accuracy is around 60% even after using word2vec, NRC features, and MRC features. Is there something that I can do to improve the accuracy?
Would you please add a few more details about the dataset you are using?
For example I would add:
Dataset size (number of samples)
Classes distribution (are they balanced or not)
Do you do any preprocessing?
Without the above information I can only guess, but if I were you I would try the following:
clean the tweets of noise, e.g. usernames, garbage symbols, etc.
If the dataset is small
try random search on models (Naive Bayes, SVM, logistic regression) using various vectorization strategies, e.g. bag of words and tf-idf, and do a hyper-parameter search
try applying transfer learning from a model trained on tweets, for example for sentiment analysis.
If the dataset is large enough
try neural network approach
Embedding (GloVe, word2vec, fastText) + RNN (LSTM, GRU) + Attention
try training own embedding
use pretrained ones such as those
Embedding + CNN + RNN
Bag of words + FNN
If classes are not balanced
use weighted loss
try to balance them
try stacking multiple models (ensemble)
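The cleaning, random-search, and class-weighting suggestions above can be combined into one small sketch, assuming scikit-learn. The tweets, the two labels, the helper `clean_tweet`, and the parameter ranges are made up for illustration; real ranges would need to be chosen for the actual dataset.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline


def clean_tweet(text):
    """Remove URLs, usernames, and stray symbols, then lowercase."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop usernames
    text = re.sub(r"[^\w\s#']", " ", text)      # drop stray symbols
    return re.sub(r"\s+", " ", text).strip().lower()


# Made-up, slightly imbalanced toy data (5 vs. 3 examples).
tweets = [
    "@anna I love trying new art classes!! http://t.co/abc",
    "Curious about everything, reading three books at once",
    "Exploring a new city with no map :)",
    "Museums and strange music all weekend",
    "Trying a new recipe from a country I have never visited",
    "Planning my week down to the hour #organized",
    "@bob finished every task on my list today",
    "Checked the report twice before sending",
]
labels = ["openness"] * 5 + ["conscientiousness"] * 3

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_tweet)),
    # class_weight="balanced" is one way to apply a weighted loss.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],   # False is close to plain bag of words
    "clf__C": [0.1, 1.0, 10.0],
}
search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,
    scoring="f1_macro",
    cv=2,
    random_state=0,
)
search.fit(tweets, labels)
print(search.best_params_)
```

Macro-averaged F1 is used for scoring here because plain accuracy can look good on imbalanced classes even when the minority class is never predicted.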
Hope it helps!
Is the main premise of your project to do personality detection? If not, I would recommend using the Google Sentiment API to calculate sentiment of Twitter data.
We are using Azure Machine Learning Studio to build a trained model, and for that we have used the Two-Class Bayes Point Machine algorithm.
For sample data, we imported a .CSV file that contains the columns Tweets and Label.
After deploying the web service, we got incorrect output.
We want our algorithm to predict the Label as 0 or 1 on the basis of the different types of tweets that are already stored in the dataset.
When testing it with tweets that are in the dataset, it gives the correct result, but the problem occurs when testing it with other tweets (ones that are not in the dataset).
You can view our experiment over here:
Experiment
Are you planning to do binary classification based on the textual data in the tweets? If so, you should try feature hashing before doing the classification.
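To illustrate what feature hashing does, here is a minimal sketch in Python with scikit-learn rather than in Azure Machine Learning Studio. The tweets and labels are made up, and logistic regression stands in for the classifier; it is not the Bayes Point Machine used in the experiment.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up tweets with binary labels.
tweets = [
    "great service, very happy",
    "love the new update",
    "really happy with the support",
    "terrible experience, very slow",
    "the update broke everything",
    "support never answered, awful",
]
labels = [1, 1, 1, 0, 0, 0]

# Each word or word pair is hashed into one of 4096 columns, so unseen
# tweets map onto the same fixed-size feature space as the training data.
model = make_pipeline(
    HashingVectorizer(n_features=2**12, ngram_range=(1, 2), alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
model.fit(tweets, labels)

# Tweets that are not in the training data still get a 0/1 prediction.
predictions = model.predict(["very happy with the update", "awful and slow"])
print(list(predictions))
```

The point of the hashed representation is that the classifier learns from the words in a tweet rather than from the whole tweet string, which is what allows it to generalize to tweets outside the dataset.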