Using Gensim to perform LDA, I was able to do initial text preprocessing and cleanup using:
gensim.utils.simple_preprocess(str(sentence),deacc=True)
It was very efficient and handles almost all of the text cleanup I need in one command. Now I am trying to learn LDA with scikit-learn, and I was wondering whether there is a similar way to get the same preprocessing in scikit-learn, instead of having to load both libraries.
I don't think scikit-learn provides a similar utility function, but the whole logic of what Gensim's simple_preprocess() is doing is only about 20 lines of source code, spread across 5 functions:
simple_preprocess(), which relies on...
tokenize(), which relies on...
to_unicode(), deaccent(), & simple_tokenize()
So if you wanted the same behavior, without installing Gensim, you have the option to just copy & paste (or otherwise lightly adapt) that source code.
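As a rough illustration of what such an adaptation could look like, here is a minimal standard-library sketch. It is an approximation written for this answer, not a copy of Gensim's source: the name simple_preprocess_sketch is made up, and the defaults (min_len=2, max_len=15) are assumed to mirror Gensim's.

```python
import re
import unicodedata

# Runs of word characters that are not digits; an approximation of the
# alphabetic-token pattern Gensim uses internally.
_ALPHABETIC = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)


def deaccent(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def simple_preprocess_sketch(doc, deacc=False, min_len=2, max_len=15):
    """Lowercase, optionally de-accent, tokenize, and drop very short or long tokens.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    text = str(doc).lower()
    if deacc:
        text = deaccent(text)
    return [
        tok
        for tok in _ALPHABETIC.findall(text)
        if min_len <= len(tok) <= max_len and not tok.startswith("_")
    ]


tokens = simple_preprocess_sketch("Héllo, World! It's 2024 a", deacc=True)
print(tokens)
```

The resulting token lists can then be joined back into strings, or passed to a scikit-learn vectorizer through a custom tokenizer or analyzer, before fitting LDA.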
!pip install transformers
from transformers import InputExample, InputFeatures
What are InputExample and InputFeatures here?
thanks.
Check out the documentation.
Processors
This library includes processors for several traditional tasks. These
processors can be used to process a dataset into examples that can be
fed to a model.
And
class transformers.InputExample
A single training/test example for simple sequence classification.
As well as
class transformers.InputFeatures
A single set of features of data. Property names are the same names as
the corresponding inputs to a model.
So basically, InputExample is just a raw input, and InputFeatures is the (numerical) feature representation of that input that the model uses.
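To make the distinction concrete, here is a toy sketch of the raw-example versus numerical-features idea. The classes ToyExample and ToyFeatures, the whitespace tokenizer, and the tiny vocabulary are all made up for illustration; they are not the transformers classes and do not reproduce a real BERT tokenizer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyExample:
    """Hypothetical stand-in for a raw example: text plus a string label."""
    guid: str
    text_a: str
    label: Optional[str] = None


@dataclass
class ToyFeatures:
    """Hypothetical stand-in for model-ready features: only numbers."""
    input_ids: List[int]
    attention_mask: List[int]
    label: Optional[int] = None


def to_features(example: ToyExample, vocab: Dict[str, int],
                label_map: Dict[str, int], max_len: int = 6) -> ToyFeatures:
    # Whitespace tokenization is an assumption; real models use subword tokenizers.
    tokens = example.text_a.lower().split()
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens][:max_len]
    mask = [1] * len(ids)
    # Pad to a fixed length; the mask marks real tokens (1) versus padding (0).
    padding = max_len - len(ids)
    ids += [vocab["[PAD]"]] * padding
    mask += [0] * padding
    return ToyFeatures(ids, mask, label_map.get(example.label))


vocab = {"[PAD]": 0, "[UNK]": 1, "great": 2, "movie": 3}
example = ToyExample(guid="1", text_a="Great movie indeed", label="pos")
features = to_features(example, vocab, {"neg": 0, "pos": 1}, max_len=5)
print(features)
```

The example object still holds human-readable text and a label name, while the features object holds only the integer ids, mask, and label index that a model would consume.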
I couldn't find any tutorial explicitly explaining this but you can check out Chapter 4 (From text to features) in this tutorial where it is nicely explained on an example.
From my experience, the transformers library has an enormous number of classes and structures, so it is easy to get lost if you go too deep into the technical implementation. For starters, I would recommend getting an idea of the broader picture by getting some example projects to work, as well as checking out their 🤗 Course.
So this is a specific question involving two TensorFlow text classification tutorials on tensorflow.org. Sorry if this is the wrong place to ask.
Basically, there are two tutorials, one is "Classify Text with BERT" https://www.tensorflow.org/text/tutorials/classify_text_with_bert
And the other is "Fine-tuning a BERT model"
https://www.tensorflow.org/text/tutorials/fine_tune_bert
Both tutorials describe preprocessing the data. In "Classify Text with BERT", they use a preprocessing model provided by TensorFlow Hub, but in "Fine-tuning a BERT model", they implement Python code that tokenizes and encodes the data, among other steps. The latter method seems a lot more complicated than the former.
My question is, why does one tutorial use a provided preprocessing model, while the other implements the preprocessing in Python code? Is there a difference between the two tutorials that requires each to use its specific preprocessing method?
Thank you!
I'm interested in NLP and I came across TensorFlow and BERT. Both seem to be from Google and both seem to be the best thing for sentiment analysis as of today, but I don't understand what exactly they are and what the difference between them is. Can someone explain?
TensorFlow is an open-source library for machine learning that lets you build a deep learning model/architecture, whereas BERT is one such architecture itself. You can build many models using TensorFlow, including RNNs, LSTMs, and even BERT. Transformers like BERT are a good choice if you just want to deploy a model on your data and you don't care about the deep learning field itself. For this purpose, I recommend the HuggingFace library, which provides a straightforward way to use a transformer model in just a few lines of code. But if you want to take a deeper look at these models, I suggest you learn about the well-known deep learning architectures for text data, such as RNN, LSTM, and CNN, and try to implement them using an ML library like TensorFlow or PyTorch.
BERT and TensorFlow are not two alternatives to choose between. There are many implementations of BERT, not only two, and most are basically equivalent.
The implementations that you mentioned are:
The original code by Google, in TensorFlow. https://github.com/google-research/bert
The implementation by HuggingFace, in PyTorch and TensorFlow, which reproduces the same results as the original implementation and uses the same checkpoints as the original BERT article. https://github.com/huggingface/transformers
These are the differences regarding different aspects:
In terms of results, there is no difference in using one or the other, as they both use the same checkpoints (same weights) and their results have been checked to be equal.
In terms of reusability, the HuggingFace library is probably more reusable, as it is designed specifically for that. It also gives you the freedom to choose TensorFlow or PyTorch as the deep learning framework.
In terms of performance, they should be the same.
In terms of community support (e.g. asking questions on GitHub or Stack Overflow), the HuggingFace library is better suited, as there are a lot of people using it.
Apart from BERT, the transformers library by HuggingFace has implementations for lots of models: OpenAI GPT-2, RoBERTa, ELECTRA, ...
I want to build a Random Forest Regressor to model count data (Poisson distribution). The default 'mse' loss function is not suited to this problem. Is there a way to define a custom loss function and pass it to the random forest regressor in Python (sklearn, etc.)?
Is there any implementation to fit count data in Python in any packages?
In sklearn this is currently not supported. See the discussion in the corresponding issue here, or this one for another class, where the reasons are discussed in a bit more detail (mainly the large computational overhead of calling a Python function).
So it could be done as discussed within the issues, by forking sklearn, implementing the cost function in Cython and then adding it to the list of available 'criterion'.
If the problem is that the counts c_i arise from different exposure times t_i, then indeed one cannot fit the counts, but one can still fit the rates r_i = c_i/t_i using MSE loss function, where one should, however, use weights proportional to the exposures, w_i = t_i.
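A small numeric check of why exposure weights make sense here, in plain Python with made-up counts and exposures: within a single leaf, the constant that minimizes exposure-weighted squared error on the rates is the weighted mean of the rates, which equals total counts divided by total exposure. In scikit-learn, the corresponding step would be passing the exposures as sample_weight when calling fit on the rates.

```python
def weighted_leaf_prediction(counts, exposures):
    """Exposure-weighted mean of the rates r_i = c_i / t_i with weights w_i = t_i.

    This is the constant that minimizes sum_i w_i * (r_i - p)**2 over p.
    """
    rates = [c / t for c, t in zip(counts, exposures)]
    total_weight = sum(exposures)
    return sum(t * r for t, r in zip(exposures, rates)) / total_weight


# Illustrative, made-up data: three observations with different exposure times.
counts = [2, 9, 1]
exposures = [1.0, 4.0, 0.5]

weighted = weighted_leaf_prediction(counts, exposures)
pooled = sum(counts) / sum(exposures)  # total events per unit of exposure
unweighted = sum(c / t for c, t in zip(counts, exposures)) / len(counts)

print(weighted, pooled, unweighted)
```

The weighted prediction matches the pooled rate, while the unweighted mean of the rates gives short observations as much influence as long ones.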
For a true Random Forest Poisson regression, I've seen that in R there is the rpart library for building a single CART tree, which has a Poisson regression option. I wish this kind of algorithm were ported to scikit-learn.
In R, writing a custom objective function is fairly simple.
The randomForestSRC package in R lets you write your own custom split rule. The custom split rule, however, has to be written in pure C.
All you have to do is write your own custom split rule, register the split rule, and then compile and install the package.
The custom split rule has to be defined in the file called splitCustom.c in randomForestSRC source code.
You can find more info here.
The file in which you define the split rule is this.
I know I am not supposed to ask for a tool, resource, etc. on Stack Overflow, but I think this is an important question and people will benefit from it. Here comes the question: I have found word2vec but failed to find a doc2vec implementation in the tensorflow package, and I would be surprised if it is not supported in TensorFlow.
I guess that would be very slow. TensorFlow does not support so-called “inline” matrix operations, but forces you to copy a matrix in order to perform an operation on it. Copying very large matrices is costly in every sense. TF takes 4x as long as the state-of-the-art deep learning tools. Google says it’s working on the problem. Source
You can go ahead and implement it on your own, which is not hard, as there are many types of word2vec implementations. But the question remains: is it useful and fast?