How do I combine all Bert embeddings to form a feature? - nlp

Thank you in advance for any help offered. I am working on a product classification task. I embedded the customer reviews of every product one by one with BERT. I want to form a new feature called "customer review" (a vector representation of the reviews) for the products I want to classify. Is it feasible to form this feature by combining all the BERT embeddings of one specific product? If so, what should I do? Any suggestion is appreciated.
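One common way to combine a variable number of review embeddings into a single fixed-size vector is mean pooling. The thread does not confirm this as the intended method, so treat the following as a minimal sketch under that assumption. The function name `pool_review_embeddings`, the product ids, and the tiny 4-dimensional vectors are all hypothetical; real BERT base embeddings would have 768 dimensions.

```python
import numpy as np

def pool_review_embeddings(review_embeddings):
    """Average a product's per-review embeddings into one feature vector."""
    # Stack to shape (num_reviews, hidden_size); all vectors must share a size.
    matrix = np.asarray(review_embeddings, dtype=np.float32)
    if matrix.ndim != 2 or matrix.shape[0] == 0:
        raise ValueError("expected a non-empty list of equal-length vectors")
    # The mean over the review axis gives one fixed-size vector per product.
    return matrix.mean(axis=0)

# Hypothetical data: product id -> list of review embeddings.
embeddings_by_product = {
    "product_a": [[1.0, 0.0, 2.0, 4.0], [3.0, 2.0, 0.0, 0.0]],
    "product_b": [[0.5, 0.5, 0.5, 0.5]],
}

customer_review_feature = {
    product_id: pool_review_embeddings(vectors)
    for product_id, vectors in embeddings_by_product.items()
}
```

The pooled vector has the same length no matter how many reviews a product has, so it can be used directly as a feature or concatenated with other product features. Max pooling or a weighted average are possible alternatives if the plain mean washes out too much signal.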

Related

Label Dutch reviews on specific customer categories for language classification

I am looking for a classification module that can classify reviews into custom categories. It needs to work specifically for Dutch reviews.
Does anyone have an idea what package would be most suitable for such a kind of project?
Thank you in advance.
Kind regards
I am trying to find a package that is able to classify reviews on custom made categories.

Should the dataset be domain specific when it comes to Named Entity Recognition?

For my final year undergraduate project, I intend to use named entity recognition to classify a fiction summary based on LOCATION, PERSON, and so on. When I was looking into datasets I couldn't find any labelled dataset of fiction summaries.
My question is whether the training dataset for NER should be specific to the domain, in my case fiction. If not, can I use a dataset like 'conll2003', which covers the news domain, even though I'm developing a model for fiction?
I would love replies as I'm stuck with this now without being able to proceed in my project.
Thanks in advance :)
I tried labelling an unlabelled fiction summary dataset manually, but it would take far more time than I can afford. That's why I want to know whether I can use labelled datasets that are not specific to the domain.

NLP Aspect Mining approach

I'm trying to implement an aspect miner based on consumer reviews on Amazon for durables (washing machines, refrigerators). The idea is to output sentiment polarity for aspects instead of for the entire sentence. For example, the review 'Food was good but service was bad' must output food as positive and service as negative. I read through Richard Socher's paper on the RNTN model for fine-grained sentiment classification, but I guess I'll need to manually tag sentiment for phrases in a different domain and create my own treebank for better accuracy.
Here's an alternative approach I'd thought of. Could someone please validate it or guide me with feedback?
Break the approach into 2 sub-tasks: 1) Identify aspects 2) Identify sentiments
1) Identify aspects
Use a POS tagger to identify all nouns. This should shortlist potentially all aspects in the reviews.
Use word2vec on these nouns to find similar nouns and reduce the dataset size.
2) Identify sentiments
Train a CNN or dense net model on reviews with rating 1, 2, 4, 5 (ignore 3, as we need data that has polarity).
Break down the test set reviews into phrases (e.g. 'Food was good') and then score them using the above model.
Find the aspects identified in the 1st sub-task and tag them to their respective phrases.
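The last two steps (split a review into phrases, score each phrase, attach aspects to phrases) can be sketched in plain Python. This is only an illustration of the data flow: the aspect set `ASPECTS` is hypothetical and would really come from the POS-tagging and word2vec steps, and the tiny word lexicon in `score_phrase` is a stand-in for the trained CNN or dense model, not a substitute for it.

```python
import re

# Hypothetical aspect list; in the approach above it would come from the
# POS-tagging and word2vec steps.
ASPECTS = {"food", "service"}

# Toy polarity lexicon standing in for the trained CNN or dense model.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def split_into_phrases(review):
    """Split a review on commas, periods, semicolons, and contrastive 'but'."""
    parts = re.split(r"[,.;]|\bbut\b", review.lower())
    return [part.strip() for part in parts if part.strip()]

def score_phrase(phrase):
    """Placeholder scorer: positive word count minus negative word count."""
    words = phrase.split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def aspect_sentiments(review):
    """Map each known aspect to the polarity of the phrase mentioning it."""
    result = {}
    for phrase in split_into_phrases(review):
        score = score_phrase(phrase)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        for word in phrase.split():
            if word in ASPECTS:
                result[word] = label
    return result

print(aspect_sentiments("Food was good but service was bad"))
# prints {'food': 'positive', 'service': 'negative'}
```

Splitting on punctuation and "but" is a crude heuristic; phrases that mention two aspects, or none, would need a better segmentation step.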
I don't know how to answer this question, but I have a few suggestions:
Take a look at multitask learning in the neural network literature and try an end-to-end neural network for multiple tasks.
Use pretrained word vectors like word2vec or GloVe as inputs.
Don't rely on POS taggers when you use internet data.
Find a way to represent your named entities and out-of-vocabulary (OOV) words in your design.
Don't ignore 3!!
You should annotate some data periodically.

Multiclass text classification: new class if input does not match to a class

I am trying to classify pieces of text into categories. I have 9 categories, but the sentences I am given can belong to more categories than that. My objective is to take a piece of text and find the industry of each sentence. One common problem I have is that my training set does not have a "Porn" category, so sentences with porn material are classified as "Financial".
I want my classifier to check whether the sentence can be assigned to a class and, if not, just print that it can't classify that text.
I am using a tf-idf vectorizer to transform the sentences and then I feed the data to a LinearSVC.
Can anyone help me with this issue?
Or can anyone provide me with any useful material?
Firstly, the problem you have with the “Porn” documents being classified as “Financial” doesn’t seem to be entirely related to the other question here. I’ll address the main question for now.
The setting is that you have data for 9 categories, but the actual document universe is bigger. The problem is to determine that you haven’t seen the likes of a particular data point before. This seems to be more like outlier or anomaly detection, than classification.
You'll have to do some background reading to proceed further, but here are some points to get you started. One strategy to use is to determine if the new document is “similar” to other documents that you have in your collection. The idea being that an outlier is not likely to be similar to “normal” documents. To do this, you would need a robust measure of document similarity.
Outline of a potential method you can use:
Find a good representation of the documents, say tf-idf vectors, or better.
Benchmark the documents within your collection. For each document, the “goodness” score is its highest similarity score against any other document in the collection. (Alternatively, you can use the k’th highest similarity, for some fault tolerance.)
Given the new document, measure its goodness score in a similar way.
How does the new document compare to other documents in terms of the goodness score? A very low goodness score is a sign of an outlier.
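The outline above can be sketched with a small hand-rolled tf-idf and cosine similarity. This is a minimal illustration, not a tuned method: the function names, the toy collection, the smoothed idf formula, and the choice of the lowest in-collection goodness score as the threshold are all assumptions made for the example.

```python
import math
from collections import Counter

def vectorize(tokens, idf):
    """Vectorise one document; terms missing from the idf table are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def tfidf_vectors(token_lists):
    """Build L2-normalised tf-idf vectors (as dicts) plus the idf table."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed idf so a term seen in every document keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [vectorize(tokens, idf) for tokens in token_lists], idf

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def goodness(vec, others, k=1):
    """k-th highest similarity between vec and the other vectors."""
    sims = sorted((cosine(vec, o) for o in others), reverse=True)
    return sims[min(k, len(sims)) - 1] if sims else 0.0

collection = [
    "bank loan interest rate".split(),
    "loan interest payment bank".split(),
    "stock market bank shares".split(),
    "market shares stock price".split(),
]
vectors, idf = tfidf_vectors(collection)

# Benchmark: each document's goodness against the rest of the collection.
baseline = [
    goodness(v, vectors[:i] + vectors[i + 1:]) for i, v in enumerate(vectors)
]
threshold = min(baseline)  # assumed cut-off; a low percentile may work better

new_doc = vectorize("football match goal score".split(), idf)
new_score = goodness(new_doc, vectors)
is_outlier = new_score < threshold
```

Here the new document shares no vocabulary with the collection, so its goodness score is zero and it falls below the threshold. On real data the gap will be less clear-cut, which is where using the k'th highest similarity and a better representation help.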
Further reading:
Survey of Anomaly Detection
LSA, which is a technique for text representation and similarity computation.

Customizing my Own model in Stanford NER

Could I ask about Stanford NER? I'm trying to train my own model to use later for learning. According to the documentation, I have to add my own features in SeqClassifierFlags and add code for each feature in NERFeatureFactory.
My question is this: I have my tokens with all features already extracted, and the last column represents the label. Is there any way in Stanford NER to give it my tab-delimited file, which contains 30 columns (1 is the word, 28 are features, and 1 is the label), to train my own model without spending time on extracting features?
Of course, in the testing phase, I will give it a file like the aforementioned one, without the label, to predict the label.
Is this possible or not?
Many thanks in advance
As explained in the FAQ page, the only way to customize the NER model is by inserting the data and specifying the features that you want to extract.
But wait ... you have the data, and you have managed to extract the features, so I think you don't need the NER model; you need a classifier. I know this answer is pretty late, but maybe this classifier will be a good place to start.
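If you go the plain-classifier route, the first step is just loading the tab-delimited rows into features and labels. Here is a rough sketch using only the standard library; the function name `read_feature_rows`, the `col0`, `col1`, ... keys, and the 4-column sample (instead of your 30 columns) are made up for illustration.

```python
import csv
import io

def read_feature_rows(handle, labelled=True):
    """Read tab-delimited token rows into feature dicts (and labels if present)."""
    features, labels = [], []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:  # skip blank lines, often used as sentence boundaries
            continue
        values = row[:-1] if labelled else row
        # Column 0 is the word; the remaining columns are precomputed features.
        features.append({f"col{i}": value for i, value in enumerate(values)})
        if labelled:
            labels.append(row[-1])
    return features, labels

# Hypothetical sample: word, two features, label.
sample = io.StringIO("Paris\tNNP\tcap\tLOCATION\nis\tVBZ\tlower\tO\n")
X, y = read_feature_rows(sample)
```

For the test file without labels, call it with `labelled=False`. The resulting list of dicts is a shape that many classifier toolkits can consume after one-hot encoding, for example through scikit-learn's `DictVectorizer`.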
