Number of training samples for text classification task - nlp

Suppose you have a set of transcribed customer service calls between customers and human agents, where each call is about 7 minutes long on average. Customers mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts, you want to train a text classifier that predicts a label for each call on each of the three axes. But labeling recordings takes time and costs money. On the other hand, you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. OK, ideally somebody would have worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.

This would be tricky to answer but I will try my best based on my experience.
In the past, I have performed text classification on three datasets; the number in brackets indicates the dataset size: restaurant reviews (50k sentences), Reddit comments (250k sentences), and developer comments from issue tracking systems (10k sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10k sentences, I achieved an F1 score of more than 80%. I stress this dataset specifically because some people told me its size was too small.
So, in your case, assuming you have at least 1,000 instances (calls that include the conversation between customer and agent) of 7 minutes on average, this should be a decent start. If the results are not satisfactory, you have the following options:
1) Use different models (Multinomial Naive Bayes, Random Forest, Decision Tree, and so on, in addition to whatever you are using); see the sketch after this list.
2) If point 1 gives more or less similar results, check the ratio of instances across all the classes you have (on each of the three axes you describe). If the classes are heavily imbalanced, get more data, or try different balancing techniques if you cannot get more data.
3) Another option is to classify at the sentence level rather than at the message or conversation level, which generates more data and gives individual labels to sentences rather than to the message or conversation itself.
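To make options 1 and 2 concrete, here is a minimal scikit-learn sketch, assuming a hypothetical load_labeled_calls helper that returns the transcripts and the labels for one axis. It prints the class ratio and compares Multinomial Naive Bayes, Random Forest, and Decision Tree on the same TF-IDF features:

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical loader: transcripts plus the manual labels for one axis.
calls, labels = load_labeled_calls(axis=1)

# Option 2: check the ratio of instances per class first.
print(Counter(labels))

# Option 1: try several standard models on identical TF-IDF features.
for model in (MultinomialNB(), RandomForestClassifier(), DecisionTreeClassifier()):
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, calls, labels, cv=5, scoring="f1_macro")
    print(type(model).__name__, scores.mean())
```

Macro-averaged F1 is used here so that rare classes count as much as frequent ones; if all three models score similarly, move on to the balance check of option 2.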

Related

How is it possible to map several samples (time series) to one label as input to a neural network?

I currently have a project where the goal is to create text from time series data. The features of the time series data are the values of sensors in a pencil. The idea is to accomplish this with a Seq2Seq LSTM network, like a classical LSTM translator, but not between two languages: between sensor data and text. Unfortunately, I don't know how to label the data correctly and feed it to the network.
The best way, in my opinion, is to map one label to one recording (let's say 100 samples), so that the network "sees" 100 samples (a time series, one after another) and gets one label (the text written during these 100 samples, tokenized and embedded).
But how do I achieve that? In all the examples I could find, each sample in the time series had its own label, so 100 samples meant 100 labels. So, sorry that I had to ask here.
My first thought was to just repeat the label 100 times, but I think the network would mix it up. I have not tried anything else, to be honest.
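To make the shape I am after concrete, here is a rough sketch of what I mean (hypothetical sizes, Keras-style encoder-decoder with RepeatVector), where each recording of 100 samples gets exactly one tokenized text as its label:

```python
import numpy as np
from tensorflow.keras import layers, models

n_recordings, n_steps, n_sensors = 32, 100, 8  # hypothetical sizes
vocab_size, max_text_len = 500, 20

x = np.random.rand(n_recordings, n_steps, n_sensors)                   # sensor sequences
y = np.random.randint(vocab_size, size=(n_recordings, max_text_len))  # one token sequence each

model = models.Sequential([
    layers.LSTM(64, input_shape=(n_steps, n_sensors)),  # encode 100 steps into one vector
    layers.RepeatVector(max_text_len),                  # feed that vector to every decoder step
    layers.LSTM(64, return_sequences=True),
    layers.TimeDistributed(layers.Dense(vocab_size, activation="softmax")),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(x, y, epochs=1)  # one label sequence per recording, not one label per time step
```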
Thanks in advance!
Best,
Jan

How can I detect eating with the mouth closed using Google ML Kit

Currently I calculate the distance between UPPER_LIP_BOTTOM and LOWER_LIP_TOP, with a threshold value of 23 (derived from the minimum distance between UPPER_LIP_BOTTOM and LOWER_LIP_TOP). If the current distance goes above the threshold, the app shows "Eating", but this method does not work when I am eating with my mouth closed.
You can experiment with a couple of things:
Take all the points of the mouth as input and build a second ML classifier model (a single-layer fully connected model might work).
In addition to the above, take input from multiple frames. There may be additional complications if the frames are not captured at regular intervals.
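A minimal sketch of such a classifier, assuming the mouth landmarks from ML Kit have already been flattened into one feature vector per example (all names and data here are hypothetical placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

n_examples, n_frames, n_mouth_points = 200, 5, 40  # hypothetical sizes

# (x, y) coordinates of all mouth landmarks over several consecutive frames,
# flattened into a single feature vector per example.
features = np.random.rand(n_examples, n_frames * n_mouth_points * 2)
is_eating = np.random.randint(2, size=n_examples)  # 1 = eating, 0 = not eating

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(features, is_eating)
print(clf.predict(features[:3]))
```

Using several frames lets the model pick up chewing motion over time rather than a single mouth-open snapshot, which is exactly what the fixed threshold misses.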
I am interested in the use case; can you tell us more?

Handling optional data in Logistic regression

I am working with data that contains marks and other features of students, and I am trying to predict whether they will get a high salary or not using scikit-learn in Python. I ran into a problem: since a student does not take every subject, the score for a subject is -1 if the student has not taken it (a student can take multiple subjects).
(Snapshot of the data file omitted.)
I am trying to find a way to interpret the -1 in a way that doesn't alter the data much.
My approaches so far:
Take the percentile marks for each student and then average all the percentiles, giving a single number per student, which is a lot easier to work with; but this method may lose some information about the distribution of marks.
Fill each -1 value with the average mark of all students in that subject; but this will not work if the data is biased towards one subject.
Is there any better way the deal with this kind of data?
Your "-1"'s amount to missing data, so you are asking how to approach a classification task with missing data. See here and here and here, among many others, for discussions on this topic.
A couple of important considerations come to mind:
One option is to "impute" the missing values, which is what you're describing with using "average marks." This approach often requires the assumption that the data is "missing at random" which in your case is unlikely to be true: for example, a bad student is more likely to not take a difficult subject, so missing values tell you something.
Regression models (like logistic regression) generally require some type of imputation. But there are other models, like decision trees or random forests, that can handle missing data without imputation; see the sketch below.
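A sketch of both routes in scikit-learn (the column names and labels are hypothetical): mark -1 as missing, impute for logistic regression, or use HistGradientBoostingClassifier, which accepts NaN natively:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({"math": [80, -1, 65], "physics": [-1, 70, 90]})
X = df.replace(-1, np.nan)  # -1 really means "subject not taken"
y = np.array([1, 0, 1])     # hypothetical high-salary labels

# Route 1: impute the per-subject mean, then fit logistic regression.
logreg = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
logreg.fit(X, y)

# Route 2: gradient-boosted trees handle NaN without any imputation.
trees = HistGradientBoostingClassifier().fit(X, y)
```

Keep the "missing at random" caveat in mind: mean imputation erases the signal that a subject was avoided, so adding an explicit "subject taken" indicator column is often worth trying alongside either route.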

How to handle a highly unbalanced dataset

I was checking the CERT V4.1 dataset, which was synthesized to simulate insider threats. I realized that it contains about 850K samples, of which only about 200 are considered malicious. Is this normal? Am I missing something here? If this is the case, how can I handle such data if I want to use deep learning?
If you have unbalanced data, you have many options (see the link below).
In addition to these, there is a really interesting approach that works like this:
1: Randomly split your 850K negative samples into blocks of 200.
2: Build one classifier for each block by combining all positive samples with that block of negative samples.
3: Use all classifiers in parallel and let them vote; find a good threshold for how many positive votes you need to be "sure enough" to classify a test sample as positive.
Given that your data is 200 vs. 850K (meaning around 4,250 classifiers), you might consider combining this approach with one of the others, like the duplication mentioned by @Prune or one of the approaches explained in the link below.
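A sketch of the block-splitting idea (hypothetical synthetic data, and only 10K negatives instead of 850K to keep it brief):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(200, 10))     # all malicious samples
X_neg = rng.normal(0.0, 1.0, size=(10_000, 10))  # normal samples (850K in the real data)

# Steps 1 and 2: one classifier per block, all 200 positives shared.
classifiers = []
for block in np.array_split(rng.permutation(X_neg), len(X_neg) // 200):
    X = np.vstack([X_pos, block])
    y = np.array([1] * len(X_pos) + [0] * len(block))
    classifiers.append(LogisticRegression().fit(X, y))

def positive_votes(x, clfs):
    """Step 3: count 'malicious' votes; compare against a tuned threshold."""
    return sum(int(clf.predict(x.reshape(1, -1))[0]) for clf in clfs)
```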
Here are some approaches for dealing with imbalanced data:
http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Yes, this is normal in many paradigms: a large majority of the traffic is "normal". You handle this simply by being careful to distribute the rare malicious samples proportionately across your train, test, and validation sets. For instance, if your desired proportions are 50-30-20, make sure that you have about 100 malicious samples in the training set, 60 in testing, and 40 in validation.
If the training fails in this paradigm, you can also try adding multiple instances of each malicious sample to each of the sets: duplicate those 100 training records several times; for instance, add 10 copies of each sample to each of the data sets (but still do not cross from one set to another: you would now have 1,000 malicious samples in the training set, not 10 copies each of the original 200). See the sketch below.
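A sketch of that split with scikit-learn (hypothetical feature array; the 50-30-20 proportions and the training-set-only duplication follow the description above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(850_200, 10)          # hypothetical features
y = np.array([1] * 200 + [0] * 850_000)  # 200 malicious, 850K normal

# Stratified 50-30-20 split: ~100/60/40 malicious samples per set.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, train_size=0.6, stratify=y_rest, random_state=0)

# Duplication inside the training set only, so copies never leak across sets:
# 9 extra copies + the original = 10 copies each, ~1,000 malicious training rows.
extra = np.repeat(X_train[y_train == 1], 9, axis=0)
X_train = np.vstack([X_train, extra])
y_train = np.concatenate([y_train, np.ones(len(extra), dtype=int)])
```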

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1 = relevant, 0 = irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA to get document-topic and topic-word distributions (say, 20 topics, depending on how well your dataset covers different topics). Label the top r% of documents with the highest weight on the relevant topic as relevant and the bottom nr% as non-relevant, then train a classifier over those labelled documents (see the sketch after this list).
Just use bag-of-words and retrieve the top r% nearest neighbours to your query (your conceptual space) as relevant and the bottom nr% as not relevant, and train a classifier over them.
If you had the citations, you could run label propagation over the citation graph by labelling very few papers.
Don't forget to distinguish title words from abstract words by prefixing them (e.g. changing a title word to title_word1) so that a classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If the number of relevant documents is far smaller than the number of non-relevant ones, the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval as implemented in Lucene). Then you can manually go down the ranked results until you feel the documents are no longer relevant.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, on which you can find more literature.
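As an illustration of the first idea, a rough scikit-learn sketch (load_abstracts is a hypothetical loader; the topic index and the r/nr cutoffs are placeholders you would choose by inspection):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = load_abstracts()  # hypothetical: list of 800k+ title+abstract strings

counts = CountVectorizer(max_features=50_000, stop_words="english").fit_transform(docs)
doc_topics = LatentDirichletAllocation(n_components=20).fit_transform(counts)

relevant_topic = 3                    # chosen by reading the topic's top words
score = doc_topics[:, relevant_topic]
order = np.argsort(score)
pos = order[-int(0.01 * len(docs)):]  # r = top 1% -> weakly labelled relevant
neg = order[:int(0.05 * len(docs))]   # nr = bottom 5% -> weakly labelled irrelevant

X = TfidfVectorizer().fit_transform([docs[i] for i in np.concatenate([pos, neg])])
y = np.array([1] * len(pos) + [0] * len(neg))
clf = LogisticRegression().fit(X, y)  # then apply to the full corpus
```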
