I know I am not suppose to ask for a tool, resource, etc on stackoverflow: But I think this is an important question and people will benefit from it. Here comes the question: I have found word2vec but failed to find doc2vec implementation in the tensorflow package, and will be surprised if it is not supported in tensorflow.
I guess that will be very slow, TensorFlow does not support so-called “inline” matrix operations, but forces you to copy a matrix in order to perform an operation on it. Copying very large matrices is costly in every sense. TF takes 4x as long as the state of the art deep learning tools. Google says it’s working on the problem. Source
you can go ahead and implement it on your own which is not hard as there are many types of word2vec implementations but the question remains, is it useful and fast?
Related
I trained a BERTopic model and analysed the resulting topics. About half of them are good, but the others I don't need. Can I remove them from the model, to get faster predictions?
I had the same question and I asked in github discussions for the package. If you ask there the package author answers very quickly.
Here is his answer to our question:
"Deleting topics is unlikely to help with speeding up the model in .transform as it is not possible to do that easily in the underlying models. Instead, I would either advise using the slower model and use .merge_topics to merge all unwanted topics into a single topic, so that it is easier to identify those. Or you can adjust the min_topic_size a bit lower to get a balance between helpful topics and speed of the transform function.
Do note that the transform function can be speed up by a number of different ways. For example, if you have an older GPU then embedding the documents can be much slower. In practice, it is helpful to identify which steps of the algorithm are relatively slow for you. By setting verbose=True, you have some indication of the time spent at each of those steps. If UMAP is too slow for you, then you can consider using PCA instead." -c Maarten Grootendorst
Also note that you could improve the speed of .transform by enabling the gpu acceleration for the latter two stages (by default only the first stage is gpu accelerated). You will find the info on that here https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#gpu-acceleration
I'm interested in NLP and I come up with Tensorflow and Bert, both seem to be from Google and both seem to be the best thing for Sentiment Analysis as of today but I don't understand what are they exactly and what is the difference between them... Can someone explain?
Tensorflow is an open-source library for machine learning that will let you build a deep learning model/architecture. But the BERT is one of the architectures itself. You can build many models using TensorFlow including RNN, LSTM, and even the BERT. The transformers like the BERT are a good choice if you just want to deploy a model on your data and you don't care about the deep learning field itself. For this purpose, I recommended the HuggingFace library that provides a straightforward way to employ a transformer model in just a few lines of code. But if you want to take a deeper look at these models, I will suggest you to learns about the well-known deep learning architectures for text data like RNN, LSTM, CNN, etc., and try to implement them using an ML library like Tensorflow or PyTorch.
Bert and Tensorflow is not different thing , There are not only 2, but many implementations of BERT. Most are basically equivalent.
The implementations that you mentioned are:
The original code by Google, in Tensorflow. https://github.com/google-research/bert
Implementation by Huggingface, in Pytorch and Tensorflow, that reproduces the same results as the original implementation and uses the same checkpoints as the original BERT article. https://github.com/huggingface/transformers
These are the differences regarding different aspects:
In terms of results, there is no difference in using one or the other, as they both use the same checkpoints (same weights) and their results have been checked to be equal.
In terms of reusability, HuggingFace library is probably more reusable, as it is designed specifically for that. Also, it gives you the freedom of choosing TensorFlow or Pytorch as deep learning framework.
In terms of performance, they should be the same.
In terms of community support (e.g. asking questions in github or stackoverflow about them), HuggingFace library is better suited, as there are a lot of people using it.
Apart from BERT, the transformers library by HuggingFace has implementations for lots of models: OpenAI GPT-2, RoBERTa, ELECTRA, ...
For my Bachelorthesis I need to train different word embedding algorithms on the same corpus to benchmark them.
I am looking to find preprocessing steps but am not sure which ones to use and which ones might be less useful.
I already looked for some studies but also wanted to ask if someone has experience with this.
My objective is to train Word2Vec, FastText and GloVe Embeddings on the same corpus. Not too sure which one now, but I think of Wikipedia or something similar.
In my opinion:
POS-Tagging
remove non-alphabetic characters with regex or similar
Stopword removal
Lemmatization
catching Phrases
are the logical options.
But I heard that stopword removal can be kind of tricky, because there is a chance that some embeddings still contain stopwords due to the fact that automatic stopword removal might not fit to any model/corpus.
Also I have not decided if I want to choose spacy or nltk as library, spacy is mightier but nltk is mainly used at the chair I am writing.
Preprocessing is like hyperparameter optimization or neural architecture search. There isn't a theoretical answer to "which one should I use". The applied section of this field (NLP) is far ahead of the theory. You just run different combinations until you find the one that works best (according to your choice of metric).
Yes Wikipedia is great, and almost everyone uses it (plus other datasets). I've tried spacy and it's powerful, but I think I made a mistake with it and I ended up writing my own tokenizer which worked better. YMMV. Again, you just have to jump in and try almost everything. Check with your advisor that you have enough time and computing resources.
I have explored about techniques to build Image Recommendation system with Deep Learning models, which it has to search in 100k images to find the top similar ones for recommendation on the given input image, I need the simple, best and reliable implementation references.
I tried with VGG-19 model didn't get expected results, not aware of other techniques.
There is a simpler method , which is similar to word embeddings. If we find an expressive vector representation, or embedding for images, we can then calculate their similarity by looking at how close their vectors are to each other.
In addition, if we calculate these vectors for all images in our database ahead of time, this approach is both fast (one forward pass, and an efficient similarity search), and scalable.
Annoy is one of the libraries that implement fast solutions.
I want to customise the last layer of VGG 19 architecture for a classification problem. which will be more useful keras or pytorch?
It heavily depends on what you want to do with it.
While Keras offers different backends, such as TensorFlow or Theano (which in turn can offer you a little more flexibility), and transfers better to production systems,
PyTorch is definitely also easy to implement. Additionally, it offers great scaling on (multi-)GPU systems, since it is trivial to outsource your computations in a PyTorch model. I do not know how easy that is in Keras (never done it, so I genuinely cannot judge).
If you just want to play around with one of the frameworks, it usually boils down to personal preference. I personally prefer PyTorch, due to its more "python-esque" approach to things, but I know many people that prefer Keras because of its clear and simple layout and documentation.
Providing a little more information, or your context, can also potentially increase the quality of the answers you receive.