Most of the publicly available embeddings I know of are trained on news articles, which use different language and vocabulary from user/customer reviews. Although such embeddings can be used for NLP tasks involving reviews and user-generated content, I think the difference in language plays an important role, so I would rather use embeddings trained on user-generated content, such as product reviews.
I'm looking for a corpus of reviews or comments in English (German or Dutch would also be useful) to generate embeddings from, or alternatively embeddings already trained on such a corpus.
I found two datasets/corpora in English:
https://www.yelp.com/dataset_challenge
https://snap.stanford.edu/data/web-Amazon.html
and one in German:
http://www.uni-weimar.de/en/media/chairs/webis/corpora/corpus-webis-cls-10/
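If it helps anyone going the train-it-yourself route, here is a minimal sketch of training word2vec over such a review corpus with gensim (4.x parameter names); it assumes a JSON-lines file with a "text" field per review, which is the shape of the Yelp review dump, but the file name is illustrative:
```python
# Minimal sketch: train word2vec on review text with gensim (4.x API).
# Assumes a JSON-lines file where each line carries a "text" field;
# the file name is illustrative, adjust it for your corpus.
import json
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def review_sentences(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield simple_preprocess(json.loads(line)["text"])

sentences = list(review_sentences("yelp_academic_dataset_review.json"))
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format("review_embeddings.vec")
print(model.wv.most_similar("tasty"))
```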
Related
I have a dataset that includes English, Spanish, and German documents. I want to represent them with document-embedding techniques to compute their similarities. However, since the documents are in different languages and each one is paragraph-sized, it is difficult to find a pre-trained model (and I do not have enough data for training).
I found some interesting models like Sent2Vec and LASER that also work in a multilingual context. However, both of them were implemented for sentence representation. The question is twofold:
Is there any model that could be used to represent multilingual paragraphs?
Is it possible to employ Sent2Vec (or LASER) to represent paragraphs (i.e., to represent each paragraph with a single embedding vector)?
Any help would be appreciated.
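On the second point, the laserembeddings wrapper accepts arbitrary strings, so a paragraph can be passed as a single "sentence"; whether the quality holds up at paragraph length is an open question. A minimal sketch (the example paragraphs are made up, and the models must be downloaded first):
```python
# A paragraph passed as one "sentence" to LASER; quality at this length is
# untested here. First run: python -m laserembeddings download-models
import numpy as np
from laserembeddings import Laser

laser = Laser()
par_en = "The company reported strong quarterly results and raised its outlook."
par_de = "Das Unternehmen meldete starke Quartalszahlen und hob seine Prognose an."
vecs = np.vstack([
    laser.embed_sentences([par_en], lang="en"),
    laser.embed_sentences([par_de], lang="de"),
])   # shape (2, 1024), shared multilingual space
cos = vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(f"cross-lingual cosine similarity: {cos:.3f}")
```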
Are there any recent pre-trained multilingual word embeddings (multiple languages jointly mapped to the same vector space)?
I have looked at the following but they don't fit my needs:
FastText / MUSE (https://fasttext.cc/docs/en/aligned-vectors.html): this one seems too old, and the word vectors do not use subword/wordpiece information.
LASER (https://github.com/yannvgn/laserembeddings): I'm using this one now. It uses subword information (via BPE); however, it is suggested not to use it for word embeddings because it is designed to embed sentences (https://github.com/facebookresearch/LASER/issues/69).
BERT multilingual (bert-base-multilingual-uncased in https://huggingface.co/transformers/pretrained_models.html): it produces contextualised embeddings that can be used to embed sentences, but it seems not to be good at embedding words without context.
Here is the problem I'm trying to solve:
I have a list of company names, which can be in any language (mainly English), and a list of keywords in English, and I want to measure how close a given company name is to the keywords. I currently have a simple keyword-matching solution, but I want to improve it using pretrained embeddings. As you can see in the following examples, there are several challenges:
the keyword and the brand name are not separated by a space (I'm currently using the "wordsegment" package to split names into words; see the sketch after these examples), so an embedding with subword information should help a lot
the keyword list is not extensive, and a company name can be in a different language (that's why I want to use embeddings: "soccer" is close to "football")
Examples of company names: "cheapfootball ltd.", "wholesalefootball ltd.", "footballer ltd.", "soccershop ltd."
Examples of keywords: "football"
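For reference, the segmentation step I mentioned, run on the example names (wordsegment's documented load()/segment() API):
```python
# Splitting concatenated brand names into words with the "wordsegment" package.
from wordsegment import load, segment

load()   # load the unigram/bigram frequency data once
for name in ["cheapfootball", "wholesalefootball", "footballer", "soccershop"]:
    print(name, "->", segment(name))
# e.g. cheapfootball -> ['cheap', 'football'], soccershop -> ['soccer', 'shop']
```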
Check if this would do:
Multilingual BPE-based embeddings
Aligned multilingual sub-word vectors
If you're okay with whole word embeddings:
(Both of these are somewhat old, but putting them here in case it helps someone)
Multilingual FastText
ConceptNet NumberBatch
If you're okay with contextual embeddings:
Multilingual ELMo
XLM-RoBERTa
You can even try using the (sentencepiece-tokenized) non-contextual input word embeddings of multilingual transformer implementations like XLM-R or mBERT, instead of their output contextual embeddings. (Not sure how well it will perform.)
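A hedged sketch of that last idea, pulling the non-contextual input embedding matrix out of XLM-R and averaging a word's sentencepiece vectors (how well this works as a static word embedding is exactly what I'm unsure about):
```python
# Static word vectors from XLM-R's input embedding matrix: average the
# vectors of a word's sentencepiece tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
emb = model.get_input_embeddings()   # nn.Embedding over the sentencepiece vocab

def static_word_vector(word):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    with torch.no_grad():
        return emb(torch.tensor(ids)).mean(dim=0)   # average subword vectors

v1, v2 = static_word_vector("football"), static_word_vector("soccer")
print(torch.cosine_similarity(v1, v2, dim=0).item())
```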
I think it might be a little misleading to build a model using embeddings for this application (speaking from experience): if there are two companies, football ltd and soccer ltd, the model might say both are a match, which might not be right.
One approach is to remove redundant words, e.g. "corporation" from "Facebook corporation" or "ltd" from "Facebook ltd", and then try matching.
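A minimal sketch of this first approach; the suffix list is illustrative, not exhaustive:
```python
# Strip common legal suffixes from company names before matching.
import re

LEGAL_SUFFIXES = r"\b(ltd|llc|inc|corp|corporation|gmbh|co)\.?$"

def normalize(name):
    name = name.lower().strip()
    return re.sub(LEGAL_SUFFIXES, "", name).strip(" .,")

print(normalize("Facebook corporation"))   # -> "facebook"
print(normalize("cheapfootball ltd."))     # -> "cheapfootball"
```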
Another approach is to use deepmatcher, which does deep-learning fuzzy matching based on word context (Link).
If sentence similarity is the primary approach you want to follow, the STSBenchmark algorithms might be worth exploring: Link
Sent2vec (link) and InferSent (Link) use FastText but seem to have good results on STSBenchmark.
If anyone is still looking for some sort of multilingual word embedding, this might help:
https://github.com/babylonhealth/fastText_multilingual
As far as I know, it doesn't support sentences, and my attempts to make it do so weren't really successful.
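For what it's worth, the idea behind that repository is to multiply source-language vectors by a learned (roughly orthogonal) alignment matrix so they land in the target language's space. A generic numpy sketch with hypothetical file and word names, including the kind of naive averaging you would have to hand-roll for sentences:
```python
# Generic sketch of aligned multilingual word vectors; every file name and
# lookup here is hypothetical; substitute vectors from the repository above.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

W = np.load("alignment_de_to_en.npy")       # hypothetical alignment matrix
de_vecs = dict(np.load("de_vectors.npz"))   # hypothetical: German word -> vector
en_vecs = dict(np.load("en_vectors.npz"))   # hypothetical: English word -> vector

# single word: map the German vector into the English space, then compare
print(cosine(W @ de_vecs["fussball"], en_vecs["football"]))

# naive sentence embedding: average the aligned word vectors; roughly what
# you would have to hand-roll, and it often works poorly in practice
def naive_sentence_vector(words):
    return np.mean([W @ de_vecs[w] for w in words if w in de_vecs], axis=0)
```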
We have a news website where we have to match news to a particular user.
For the matching we can only use the user's textual information, such as the user's interests or a brief description about them.
I was thinking of treating both the user's textual information and the news text as documents and finding document similarity.
That way, I hope, if my profile contains sentences like "I loved the speech of the president in Chicago last year" and a news story says "Trump is going to speak in Illinois", I can get a match (the example is purely casual).
I first tried embedding my documents with TF-IDF and then ran k-means to see whether anything meaningful emerged, but I didn't like the results much.
I think the problem stems from the poor representation that TF-IDF gives me.
So I was thinking of using BERT embeddings to represent my documents and then using cosine similarity to check the similarity of two documents (one for the user profile and one for a news story).
Does this approach make sense? BERT can be used to embed sentences, but is there a way to embed an entire document?
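Here is a minimal sketch of what I have in mind (mean-pooling BERT token vectors into one document vector; the model name and pooling choice are just my first guess, and long texts get truncated):
```python
# Sketch: mean-pool BERT token embeddings into one vector per document, then
# compare with cosine similarity. Texts longer than 512 tokens get truncated.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # average over tokens

profile = embed("I loved the speech of the president in Chicago last year")
article = embed("Trump is going to speak in Illinois")
print(torch.cosine_similarity(profile, article, dim=0).item())
```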
What would you advise?
Thank you
BERT is trained on pairs of sentences, so it is unlikely to generalize to much longer texts. Also, BERT's memory use grows quadratically with the length of the text, so very long texts can run into memory issues; most implementations do not accept sequences longer than 512 subwords.
Making pre-trained Transformers work efficiently for long texts is an active research area; you can have a look at a paper called DocBERT to get an idea of what people are trying. But it will take some time until there is a nicely packaged, working solution.
There are also other methods for document embedding, for instance Gensim implements doc2vec. However, I would still stick with TF-IDF.
TF-IDF is typically very sensitive to data pre-processing. You certainly need to remove stop words, and in many languages lemmatization also pays off. Given the specific domain of your texts, you can also try expanding the standard stop-word list with words that appear frequently in news stories. You can get further improvements by detecting named entities and keeping them together.
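As a concrete starting point, a short sketch of the TF-IDF route with stop-word removal and a domain-extended stop list (the extra stop words and example texts are illustrative):
```python
# TF-IDF document similarity with an extended stop-word list (illustrative).
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stop_words = list(ENGLISH_STOP_WORDS.union({"said", "reuters"}))  # domain extras
docs = [
    "the president gave a speech in Chicago",   # user profile text
    "the president will speak in Illinois",     # news story
]
tfidf = TfidfVectorizer(stop_words=stop_words, lowercase=True).fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])  # overlap on "president" only
```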
I want to use NER (the CRF classifier) to identify author names in a query. I trained the NER model following the method given on the nlp.stanford.edu site, using the training file training-data.col, and tested it using the file testing-data.tsv.
The NER tags every input as Author, even data that is tagged as non-Author in the training data. Can anyone tell me why it tags non-Authors in the training data as Authors, and how to train it to identify Authors (I have a list of author names to train with)?
Any suggestions for reference material on NER other than the nlp.stanford.edu site would be helpful.
That's a very small piece of training data, so I'm not surprised that it made the wrong inferences. Since the only example it has seen of "Atal" is as Author, it's tagging "Atal" as such.
More to the point, if you want to discriminate between people listed at the beginning as Author and people mentioned in the text as 0, Stanford NER is not going to do that. Stanford NER is intended to make long-distance inferences about the named-entity tags of tokens in natural-language text; in other words, it's doing the opposite of what you're trying to do.
You could probably do this with some simple pattern recognition: if your documents are formatted in a similar way, with the authors listed together, I would start by exploiting that. You could use the NER to tag the authors as PERSON, and then use that tag as a feature in your own tagging.
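For example, a rough sketch of that layout-based idea, assuming the authors sit on the first line separated by commas or "and" (a purely illustrative heuristic; swap in PERSON spans from the NER for anything less regular):
```python
import re

def extract_authors(document):
    # exploit the layout: assume the first line lists the authors,
    # separated by commas or "and" (illustrative heuristic only)
    header = document.splitlines()[0]
    return [name.strip() for name in re.split(r",|\band\b", header) if name.strip()]

doc = "B. S. Atal and L. R. Rabiner\nA query about speech processing ..."
print(extract_authors(doc))   # ['B. S. Atal', 'L. R. Rabiner']
```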
I'm kind of a newbie and not a native English speaker, so I have some trouble understanding Gensim's word2vec and doc2vec.
I think both give me the words most similar to a query word, via most_similar() (after training).
How can I tell in which case I should use word2vec versus doc2vec?
Could someone explain the difference in a few words, please?
Thanks.
In word2vec, you train to find word vectors and then run similarity queries between words. In doc2vec, you tag your text and you also get tag vectors. For instance, suppose you have different documents from different authors and use the authors as tags on the documents. Then, after doc2vec training, you can use the same vector arithmetic to run similarity queries on author tags, e.g. who are the most similar authors to AUTHOR_X? If two authors generally use the same words, their vectors will be closer. AUTHOR_X is not a real word that is part of your corpus, just a tag you choose, so you don't need to have it in, or manually insert it into, your text. Gensim allows you to train doc2vec with or without word vectors (i.e. if you only care about tag-to-tag similarities).
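A minimal gensim sketch of that tagging idea (gensim 4.x API, where tag vectors live under model.dv; the toy corpus is made up, so similarities will be noisy):
```python
# Tag each document with its author, then query author-to-author similarity.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument("the market rallied on strong earnings".split(), ["AUTHOR_X"]),
    TaggedDocument("stocks climbed after the earnings report".split(), ["AUTHOR_Y"]),
    TaggedDocument("the recipe needs two cups of flour".split(), ["AUTHOR_Z"]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)
print(model.dv.most_similar("AUTHOR_X"))   # who writes most like AUTHOR_X?
```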
Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).
If you tell me what problem you are trying to solve, maybe I can suggest which method would be more appropriate.