I have a dataset where lot of names are written like man1sh instead of manish, vikas as v1kas.
How can one correct these names in nlp?
Any help is appreciated.
Try the Deep Neural Network based spell correction https://medium.com/#majortal/deep-spelling-9ffef96a24f6 this method is the state of the art method at the moment. Here is the code https://github.com/MajorTal/DeepSpell and some one already made an improvement over it https://hackernoon.com/improving-deepspell-code-bdaab1c5fb7e.I am not able to find the paper but there is also a paper published that does character level deep neural network for edit distance with good results and a public dataset.
For the above methods, like for all Machine Learning solutions, you need data for training. If you don't have data for your case then the old simple edit distance methods http://norvig.com/spell-correct.html are the only way.
Related
I have a project where I need to analyze a text to extract some information if the user who post this text need help in something or not, I tried to use sentiment analysis but it didn't work as expected, my idea was to get the negative post and extract the main words in the post and suggest to him some articles about that subject, if there is another way that can help me please post it below and thanks.
for the dataset i useed, it was a dataset for sentiment analyze, but now I found that it's not working and I need a dataset use for this subject.
Please use the NLP methods before processing the sentiment analysis. Use the TFIDF, Word2Vector to create vectors on the given dataset. And them try the sentiment analysis. You may also need glove vector for the conducting analysis.
For this topic I found that this field in machine learning is called "Natural Language Questions" it's a field where machine learning models trained to detect questions in text and suggesting answer for them based on data set you are working with, check this article for more detail.
I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.
I am new to computer vision, and now I am do some research on object detection. I have read papers about faster RCNN and RFCN, also read YOLO. It seems the biggest problem is the speed? And all of them use image data data only. Are there any models that combines text and image data? Which means we can use the information from text to help detection when the training data is small. For example, when the training data is small, the model cannot tell dogs and cats clearly, but the model could tell there is a bone near that object, and the model gets some information from text that the object near a bone is most likely a dog, thus the model now could tell what the object is. Does this kind of algorithm exist? I haven't found them, hope you could help me. Thanks a lot.
It seems you have mostly referred to research on Deep Networks for Object Detection. Prior to the success of deep networks, researchers were looking to to the possibility of using text with image features to implement ideas similar to yours. You might want to refer to papers from ACM Multimedia and IEEE TMM, especially those before 2014.
The problem was that those approaches could not perform as well as the simplest of the deep networks that use only images. There is some work on combining both images and text, such as this paper. I am sure at least some researchers are already working on this.
I am trying to build a classifier to detect subjectivity. I have text files tagged with subjective and objective . I am little lost with the concept of features creation from this data. I have found the lexicon of the subjective and objective tag. One thing that I can do is to create a feature of having words present in respective dictionary. Maybe the count of words present in subjective and objective dictionary. After that I intend to use naive bayes or SVM to develop the model
My problem is as follow
Is my approach correct ?
Can I create more features ? If possible suggest some or point me to some paper or link
Can I do some test like chi -sq etc to identify effective words from the dictionary ?
You are basically on the right track. I would try and apply classifier with features you already have and see how well it will work, before doing anything else.
Actually best way to improve your work is to google for subjectivity classification papers and read them (there are a quite a number of them). For example this one lists typical features for this task.
And yes Chi-squared can be used to construct dictionaries for text classification (other commonly used methods are TD*IDF, pointwise mutal information and LDA)
Also, recently new neural network-based methods for text classification such as paragraph vector and dynamic convolutional neural networks with k-max pooling demonstrated state-of-the-art results on sentiment analysis, thus they should probably be good for subjectivity classification as well.
Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are Web APIs that do this (eg, alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put my two cents for anyone who happens to come back to this question.
Having worked in NLP i would suggest you look into the following approach to solve the problem.
Don't look for a single package solution. There are great packages out there, no doubt for lots of things. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be atleast 3 or 4 iterations behind whats there is academia.
Coming to the core problem. What you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest yes, but also with very very (see the double stress) reasonable classification accuracy, runtime performance and ease of use.
The thing which you would need to work on would be the feature set used to represent your news article/text/tag. You could use a bag of words model. add named entities as additional features. You can use article location/time as features. (though for a simple category classification this might not give you much improvement).
The bottom line is. SVM works great. they have multiple implementations. and during runtime you don't really need much ML machinery.
Feature engineering on the other hand is very task specific. But given some basic set of features and a good labelled data you can train a very decent classifier.
here are some resources for you.
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this but from the code its a binary classifier SVM. which means if you have a known set of tags of size N you want to classify the text into, you will have to train N binary SVM classifiers. One each for the N category tags.
Hope this helps.