My question is whether there is a tool for spelling correction. I've seen bigram analysis, the Jaccard coefficient, and dictionaries built from training documents (Python). Their results are quite accurate (80-90%), but they can't correct run-together words, for example "welcometo" -> "welcome to".
Thanks in advance!
Please try experimenting with Hunspell. It is the default spell checker for Mozilla and macOS. If you are using Python, try pyhunspell.
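Hunspell handles per-word checking, but the run-together case from the question ("welcometo") needs word segmentation rather than spell checking. Here is a minimal dynamic-programming sketch of that idea; the tiny dictionary and the function names are purely illustrative, and a real system would score splits with word frequencies from a corpus:

```python
# Toy word segmentation: split run-together words ("welcometo" -> "welcome to")
# by dynamic programming over a set of known words.
WORDS = {"welcome", "to", "we", "come", "a", "i"}

def segment(text):
    """Return a split of `text` into the fewest dictionary words, or None."""
    # best[i] holds the best segmentation of text[:i], or None if impossible.
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in WORDS:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return " ".join(best[-1]) if best[-1] is not None else None

print(segment("welcometo"))  # welcome to
```

Preferring the fewest words avoids spurious splits like "we l come to" when a longer dictionary word covers the same span.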
I'm using GPT-3 for some experiments where I prompt the language model with tests from cognitive science. The tests have the form of short text snippets. Now I'd like to check whether GPT-3 has already encountered these text snippets during training. Hence my question: Is there any way to sift through GPT-3's training text corpora? Can one find out whether a certain string is part of these text corpora?
Thanks for your help!
I don't think that's possible, unfortunately. GPT-3's training corpus is private.
But if that were possible, it would be great for detecting plagiarism. Maybe ask whether it knows where a certain line of text came from?
I'm looking for examples of pytorch being used to classify non-MNIST digits. After hours of searching, it appears the algorithms are against me. Does anyone have a good example? Thanks.
I am posting this as an answer since I do not have the rep to comment.
Please look at the Google Street View House Numbers dataset (SVHN). It is like MNIST, but there is much more noise in the data. Another option could be to use GANs to generate additional images that practically wouldn't have existed before. You could also try your hand at non-English MNIST-style datasets (though that moves away from your original goal).
Link to SVHN with pytorch: https://github.com/potterhsu/SVHNClassifier-PyTorch
Link to SVHN in torchvision: https://pytorch.org/docs/stable/torchvision/datasets.html#svhn
P.S. You could also try making a dataset on your own! This is quite fun to do.
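To make this concrete, here is a minimal sketch of a small CNN for 32x32 RGB digit images like SVHN. The architecture is illustrative, not tuned; in practice you would load the real data with `torchvision.datasets.SVHN(root="data", split="train", download=True)` and train with a standard cross-entropy loop:

```python
import torch
import torch.nn as nn

class DigitNet(nn.Module):
    """Small CNN for 32x32 RGB digit images (10 classes)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = DigitNet()
logits = model(torch.randn(4, 3, 32, 32))  # a fake batch of 4 images
print(logits.shape)  # torch.Size([4, 10])
```

Since SVHN images are the same size as CIFAR-10 (32x32x3), any small CIFAR-style network is a reasonable starting point.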
I have a project where I need to analyze a text to determine whether the user who posted it needs help with something. I tried sentiment analysis, but it didn't work as expected. My idea was to take the negative posts, extract the main words from each post, and suggest some articles about that subject to the user. If there is another way that can help me, please post it below. Thanks.
As for the dataset: I used a sentiment-analysis dataset, but I've now found that it doesn't work for this, and I need a dataset suited to this task.
Please apply standard NLP preprocessing before the sentiment analysis. Use TF-IDF or Word2Vec to create vectors from the given dataset, and then try the sentiment analysis. You may also need GloVe vectors for the analysis.
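As a concrete sketch of the TF-IDF step with scikit-learn (the example posts are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy posts; in the real project these would come from the dataset.
posts = [
    "I need help setting up my database",
    "Everything works great, thanks",
    "Can someone help me fix this error",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)  # sparse matrix: one row per post

print(X.shape)  # (3, <vocabulary size>)
print(sorted(vectorizer.vocabulary_)[:5])  # a few of the learned terms
```

The resulting matrix `X` can be fed directly into any scikit-learn classifier, so the same vectors serve both the sentiment step and a later "needs help" classifier.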
On this topic, I found that this area of machine learning is called "Natural Language Questions": models are trained to detect questions in text and suggest answers for them based on the dataset you are working with. Check this article for more detail.
I want to categorize comments as positive or negative based on the content.
This is an NLP (Natural Language Processing) problem, and I am finding it difficult to implement.
Check out this blog post. The author describes how to build a Twitter Sentiment Classifier with Python and NLTK. Looks like a good start, as sentiment analysis is no easy task with lots of active research going on in the field.
Also search SO for Sentiment Analysis; I believe there are already many useful answers about this topic on the site.
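In the spirit of the NLTK tutorial linked above, here is a tiny Naive Bayes sentiment classifier. The training examples are toy data; a real classifier needs a labeled corpus such as the movie_reviews corpus that ships with NLTK:

```python
import nltk

def features(text):
    # Bag-of-words presence features, as in the NLTK book examples.
    return {word: True for word in text.lower().split()}

train = [
    (features("this movie was great and fun"), "pos"),
    (features("absolutely loved it"), "pos"),
    (features("what a terrible boring film"), "neg"),
    (features("i hated this awful movie"), "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("great fun")))  # pos
```

With a real corpus, `classifier.show_most_informative_features()` is useful for seeing which words drive the positive/negative decision.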
Here is a combination of a semi-supervised co-occurrence-based classifier and an unsupervised WSD-based classifier. It's in Python, though, and you need NLTK, WordNet, SentiWordNet, and the movie review corpus that ships with NLTK.
https://github.com/kevincobain2000/sentiment_classifier
The problem is quite complex; anyway, I love Pattern: http://www.clips.ua.ac.be/pages/pattern-examples-elections
If you are not categorizing a lot of comments, you may wish to try the Chatterbox API.
Otherwise you can use LingPipe, but you will have to train your own models.
Please suggest a good machine learning classifier for truecasing a dataset.
Also, is it possible to specify our own rules/features for truecasing in such a classifier? Thanks for all your suggestions.
I implemented a version of a truecaser in Python. It can be trained for any language when you provide enough data (i.e. correctly cased sentences).
For English, it achieves an accuracy of 98.38% on sample sentences from Wikipedia. A pre-trained model for English is provided.
You can find it here:
https://github.com/nreimers/truecaser
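To illustrate the basic idea behind a trainable truecaser (this is a sketch, not the linked implementation): count the cased forms of each token in correctly cased training sentences, then emit the most frequent variant at prediction time. The training sentences here are toy data:

```python
from collections import Counter

# Toy "correctly cased" training data; a real truecaser needs far more.
train_sentences = [
    "Alice lives in Paris .",
    "In Paris , Alice works at NASA .",
]

# Count every cased surface form seen in training.
casing = Counter()
for sentence in train_sentences:
    for token in sentence.split():
        casing[token] += 1

def truecase(tokens):
    out = []
    for token in tokens:
        # All cased variants of this token observed in training.
        variants = [(count, form) for form, count in casing.items()
                    if form.lower() == token.lower()]
        out.append(max(variants)[1] if variants else token)
    return out

print(truecase("alice met NASA staff in paris".split()))
# ['Alice', 'met', 'NASA', 'staff', 'in', 'Paris']
```

The linked project goes well beyond this frequency baseline (e.g. backoff strategies and sentence-initial handling), which is where the reported 98.38% accuracy comes from.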
Please take a look at this whitepaper.
http://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
They report 98% accuracy.