I have five decision trees for five datasets. I want to combine them all into one decision tree.
I believe it is similar to the bagging technique. It would be great if experts could post a few helpful links. I am not looking to retrain on the datasets, just to combine the trees. Help is appreciated. TIA :)
Related
I understand it’s usually difficult to evaluate unsupervised models. However, since I’m new to NLP, I figured I’d ask whether there’s a way to compare two topic models. I have an STM model and an LDA model, but if someone asks how I choose one over the other, I have no answer. Given the data I have, is there some metric I can compare between the models so that I can state with some degree of confidence whether LDA is better or worse than STM?
Gensim, for Python, has a tutorial showing how to compare LDA models. Since (I believe) the outputs of STM and LDA should be more or less the same, you should be able to do the same analysis.
For what it's worth, I don't know that there is a universally-accepted way to compare models. I've dabbled with it in R doing something similar to the Gensim tutorial. It can be a bit of a headache and (at least in my case) the results weren't super clear.
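One metric that is sometimes used for this kind of comparison is perplexity on held-out documents. The sketch below uses scikit-learn's LatentDirichletAllocation rather than Gensim or R's stm, purely to stay self-contained, and the toy documents are made up for illustration. Lower perplexity is usually read as a better fit, but it does not always agree with how interpretable the topics look to a human, so treat it as one signal rather than a verdict.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
train_docs = [
    "the team won the football match after a late goal",
    "the striker scored a goal in the final match",
    "the bank raised interest rates to fight inflation",
    "inflation and interest rates worry the stock market",
    "the coach praised the team after the football final",
    "the market fell as the bank warned about inflation",
]
heldout_docs = [
    "a late goal decided the football match",
    "the stock market reacted to interest rates",
]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_heldout = vectorizer.transform(heldout_docs)  # reuse the training vocabulary

# Fit two candidate models and score both on the same held-out documents.
perplexities = {}
for n_topics in (2, 4):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X_train)
    perplexities[n_topics] = lda.perplexity(X_heldout)

for n_topics, value in perplexities.items():
    print(f"{n_topics} topics: held-out perplexity = {value:.1f}")
```

The same idea carries over to STM versus LDA only if both models are scored on the same held-out documents with the same vocabulary.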
I am new to decision trees. I am planning to build a large decision tree that I would like to update later with additional data. What is the best approach to this? Can any decision tree be later updated?
Decision trees are most often trained on all available data. That is, when you have new data, you retrain the entire tree. Since this process is very fast it is usually not problematic. If data is too big to fit in memory, you can often get around it by subsampling (row sampling) the training set, since tree-based models don't need that much data to give good results.
Note that decision trees are quite vulnerable to overfitting, so you should consider Random Forest or another ensemble method. With bagging, it is possible to train different trees on different subsets of the data.
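As a rough illustration of the bagging idea, here is a minimal sketch using scikit-learn's BaggingClassifier. The synthetic dataset and the parameter values (25 trees, each trained on half of the rows) are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 25 trees is fitted on a random sample of 50% of the training rows.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=0.5,
    random_state=0,
)
bagged_trees.fit(X_train, y_train)

# The ensemble prediction aggregates the predictions of the individual trees.
accuracy = bagged_trees.score(X_test, y_test)
print(f"{len(bagged_trees.estimators_)} trees, test accuracy = {accuracy:.2f}")
```

Note that the result is an ensemble of trees that vote, not a single merged tree.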
There are also incremental and online learning methods for decision trees. The VFDT learner and incremental variants of CART and ID3 are some examples.
See gaenari.
It is a C++ incremental decision tree.
It continuously inserts new chunks of data and updates the tree.
A rebuild can update the model when accuracy decreases (concept drift).
I'm working with decision trees in Python's scikit-learn. Unlike many use cases, I'm not so much interested in the accuracy of the classifier at this point as in extracting the specific path a data point takes through the tree when I call .predict() on it. Has anyone done this before?
I know this can be done in R using rpart, but I am struggling to do the same in Python. Any pointers would be helpful.
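One way to get at this in scikit-learn is the decision_path method of a fitted tree, combined with the arrays on its tree_ attribute. The sketch below is a minimal example on the iris dataset; the helper name explain_path and the output format are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)


def explain_path(clf, sample):
    """Return a description of each node a single sample visits (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float).reshape(1, -1)
    # decision_path returns a sparse indicator matrix of shape (n_samples, n_nodes);
    # for a single row, .indices holds the ids of the visited nodes.
    node_ids = clf.decision_path(sample).indices
    leaf_id = clf.apply(sample)[0]  # id of the leaf the sample ends up in
    tree = clf.tree_
    steps = []
    for node_id in node_ids:
        if node_id == leaf_id:
            steps.append(f"leaf {node_id}")
            continue
        feature = tree.feature[node_id]
        threshold = tree.threshold[node_id]
        value = sample[0, feature]
        op = "<=" if value <= threshold else ">"
        steps.append(f"node {node_id}: x[{feature}] = {value:.2f} {op} {threshold:.2f}")
    return steps


for step in explain_path(clf, iris.data[0]):
    print(step)
```

The same loop works for a batch of samples if you slice the indicator matrix row by row using its indptr array.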
I have used three different ways to calculate the match between a resume and a job description. Can anyone tell me which method is the best and why?
1. NLTK for keyword extraction, then RAKE for keyword/keyphrase scoring, then cosine similarity.
2. Scikit-learn for keyword extraction, tf-idf and cosine similarity calculation.
3. Gensim library with an LSA/LSI model to extract keywords and calculate cosine similarity between documents and query.
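For reference, the tf-idf plus cosine similarity approach can be sketched in a few lines of scikit-learn. The job description and resume texts below are invented for illustration, and this is only one possible way to set it up, not necessarily how the asker implemented it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example texts.
job_description = "Looking for a Python developer with machine learning and SQL experience."
resumes = [
    "Python developer with three years of machine learning and SQL experience.",
    "Pastry chef specialising in French desserts and wedding cakes.",
]

# Fit one vocabulary over all documents so the vectors are comparable.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([job_description] + resumes)

# Row 0 is the job description; the remaining rows are the resumes.
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

for resume, score in zip(resumes, scores):
    print(f"{score:.2f}  {resume}")
```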
Nobody here can give you the answer. The only way to decide which method works better is to have one or more humans independently match lots and lots of resumes and job descriptions, and compare what they do to what your algorithms do. Ideally you'd have a dataset of already matched resumes and job descriptions (companies must do this kind of thing when people apply), because it takes a lot of work to create a sufficiently large dataset.
Next time you take on this kind of project, start by considering how you are going to evaluate the performance of the solution you'll put together.
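To make the evaluation idea concrete, here is a minimal sketch that scores each method by how often its top match agrees with a human-assigned match. The resume and job identifiers, the method names, and the metric (top-1 agreement) are all hypothetical choices for illustration.

```python
def top1_accuracy(human_matches, predicted_matches):
    """Fraction of resumes whose predicted job equals the human-chosen job."""
    hits = sum(
        1
        for resume_id, job_id in human_matches.items()
        if predicted_matches.get(resume_id) == job_id
    )
    return hits / len(human_matches)


# Hypothetical human judgments: resume id -> best matching job id.
human_matches = {"r1": "jobA", "r2": "jobB", "r3": "jobC", "r4": "jobA"}

# Hypothetical top matches produced by two different methods.
predictions = {
    "method_x": {"r1": "jobA", "r2": "jobB", "r3": "jobA", "r4": "jobA"},
    "method_y": {"r1": "jobB", "r2": "jobB", "r3": "jobC", "r4": "jobC"},
}

results = {name: top1_accuracy(human_matches, pred) for name, pred in predictions.items()}
for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

With a real labelled set, the same comparison can be run on all three methods from the question, ideally on many more pairs than shown here.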
As already mentioned in other answers, try to use Doc2Vec.
Using Doc2Vec from Gensim on both corpora (CVs and job descriptions) separately and then computing the cosine similarity between the two vectors seems to be the easiest flow to get working. It works better than the others on documents that differ in form and wording but are similar in context and semantics, so keywords alone would not help much here.
Then you can try to train a CNN on a corpus of matched CV and JD pairs with yes/no labels, if available, and use it to qualify CVs/resumes against job descriptions.
Basically, I'm going to try these approaches on a very similar task of my own; please see https://datascience.stackexchange.com/questions/22421/is-there-an-algorithm-or-nn-to-match-two-documents-basically-not-closely-simila
Since it's highly likely that the job description and the resume content will differ, you should think from a semantic point of view. One thing you can do is use some domain knowledge, but it's pretty difficult to gain domain knowledge for a variety of job types. Researchers sometimes use a dictionary to augment the similarity matching between documents.
Researchers are using deep neural networks to capture both the syntactic and the semantic structure of documents. You can use Doc2Vec to compare two documents, and Gensim can produce the Doc2Vec representation for you. I believe that will give better results than keyword extraction and similarity computation. You can also build your own neural network model and train it on job descriptions and resumes. I guess neural networks will be effective for your work.
Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are Web APIs that do this (e.g., Alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put in my two cents for anyone who happens to come back to this question.
Having worked in NLP, I would suggest you look into the following approach to solve the problem.
Don't look for a single-package solution. There are great packages out there for lots of things, no doubt. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be at least 3 or 4 iterations behind what is there in academia.
Coming to the core problem. What you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest, yes, but also with very, very (note the double stress) reasonable classification accuracy, runtime performance and ease of use.
What you would need to work on is the feature set used to represent your news article/text/tags. You could use a bag-of-words model and add named entities as additional features. You could also use article location/time as features (though for a simple category classification this might not give you much improvement).
The bottom line is that SVMs work great, they have multiple implementations, and at runtime you don't really need much ML machinery.
Feature engineering, on the other hand, is very task specific. But given a basic set of features and good labelled data, you can train a very decent classifier.
Here are some resources for you.
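To give a feel for how little machinery this needs, here is a minimal sketch of a bag-of-words (tf-idf) plus linear SVM classifier. It is written in Python with scikit-learn rather than node.js, purely as an illustration of the approach; the training texts and category names are invented, and a real classifier would need far more labelled articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented training set: (article text, category).
train_texts = [
    "new smartphone chip released by the phone maker",
    "software update fixes laptop security bug",
    "designer unveils spring dress collection on the runway",
    "new sneaker and jacket trends from the fashion week",
    "chef shares pasta recipe with fresh tomato sauce",
    "restaurant menu adds vegan burger and dessert",
]
train_labels = ["Technology", "Technology", "Fashion", "Fashion", "Food", "Food"]

# Bag-of-words features weighted by tf-idf, fed into a linear multiclass SVM.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)

prediction = classifier.predict(["chef opens restaurant with new pasta menu"])[0]
print(prediction)
```

Extra features such as named entities or tags could be appended to the text or added as separate feature columns before the SVM step.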
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this, but from the code it's a binary SVM classifier. This means that if you have a known set of N tags you want to classify the text into, you will have to train N binary SVM classifiers, one for each of the N category tags.
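The "N binary classifiers" scheme described above is usually called one-vs-rest. A minimal sketch in Python with scikit-learn (again an illustration, not the node.js module in question; the texts and tags are invented) looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Invented examples for three category tags.
texts = [
    "phone software update",
    "laptop security bug",
    "runway dress collection",
    "jacket and sneaker trends",
    "pasta recipe with tomato",
    "vegan burger on the menu",
]
tags = ["Technology", "Technology", "Fashion", "Fashion", "Food", "Food"]

features = CountVectorizer().fit_transform(texts)

# One binary SVM is trained per tag: "this tag" versus "everything else".
one_vs_rest = OneVsRestClassifier(SVC(kernel="linear"))
one_vs_rest.fit(features, tags)

print(len(one_vs_rest.estimators_), "binary classifiers for", list(one_vs_rest.classes_))
```

At prediction time, the tag whose binary classifier gives the highest score wins.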
Hope this helps.