How is it possible to map several samples (time series) to one label as input to a neural network? - keras

I currently have a project where the goal is to create text from time series data. The features of the time series are the values of sensors in a pencil. The idea is to accomplish this with a Seq2Seq LSTM network, like a classical LSTM translator, but not between two languages, rather between sensor data and text. Unfortunately, I don't know how to label the data correctly and feed it to the network.
The best way, in my opinion, is to map one label to one recording (let's say 100 samples), so that the network "sees" 100 samples (time series, so one after another) and gets one label: the text written during those 100 samples, tokenized and embedded.
But how do I achieve that? In all the examples I could find, each sample in the time series had its own label, so 100 samples, 100 labels. So, sorry that I had to ask here.
My first thought was to just repeat the label 100 times, but I think the network would mix it up. I have not tried anything else, to be honest.
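To make my idea concrete, here is a minimal sketch of the shapes and the kind of model I have in mind; all sizes and the simple RepeatVector decoder are placeholders, not my real setup:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder sizes -- not my real data, just to show the shapes
num_recordings = 500   # number of recordings
timesteps = 100        # samples per recording
num_sensors = 6        # sensor channels of the pencil
vocab_size = 1000      # size of the text vocabulary
max_text_len = 20      # tokens per target text

# One label (token sequence) per recording, not per sample:
X = np.random.rand(num_recordings, timesteps, num_sensors)            # (500, 100, 6)
y = np.random.randint(0, vocab_size, (num_recordings, max_text_len))  # (500, 20)

# Simple encoder-decoder sketch: encode the whole recording into one vector,
# then repeat that vector for every output token position.
model = keras.Sequential([
    layers.LSTM(128, input_shape=(timesteps, num_sensors)),  # reads all 100 samples
    layers.RepeatVector(max_text_len),                       # one state -> 20 decoder steps
    layers.LSTM(128, return_sequences=True),
    layers.TimeDistributed(layers.Dense(vocab_size, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=3, batch_size=32)
```

Is batching the recordings like this, one target token sequence per recording, the right direction?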
Thanks in advance!
Best,
Jan

Related

Custom entities extraction from texts

What is the right approach for multi-label text information extraction/classification
I have texts that describe a caregiver/patient visit (made-up example):
Mr *** visits the clinic on 02/2/2018 complaining about pain in the lower back for several days. No pathological findings in the x-ray or in the blood tests. I suggest Mr *** 5 resting days.
Now, that text can be a whole paragraph where the only information I care about is lower back pain and resting days. I have 300-400 different labels, but the number of labeled samples is only around 1000-1500 in total. When I label the text I also mark the relevant words that create the "label"; here it would be ['pain', 'lower', 'back'].
When I just look up those words (or those of the other 300-400 labels) in other texts, I manage to label a larger number of texts. But if the words are written in a different pattern, such as "ache in the lower back" or "lowerback pain", and I have never added that pattern to the look-up table for "lower back pain", I won't find it.
Because a paragraph can be long while the only information I need is just 3-4 words, DL/ML models do not manage to learn from that amount of data with such a high number of labels. I am wondering if there is a way to use the lookup table as a feature in the training phase, or whether I should try other approaches.
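To illustrate, here is a rough sketch of the kind of fuzzy lookup I imagine using as a feature generator; the table, threshold and example text are made up:

```python
import difflib

# Hypothetical lookup table: label -> known surface forms
LOOKUP = {
    "lower back pain": ["pain in the lower back", "lower back pain", "lowerback pain"],
}

def lookup_features(text, threshold=0.8):
    """Return label -> 1/0 flags based on fuzzy matches against the lookup table."""
    text = text.lower()
    features = {}
    for label, phrases in LOOKUP.items():
        hit = 0
        for phrase in phrases:
            n = len(phrase)
            # Slide a window the length of the phrase over the text and fuzzy-compare
            for start in range(max(1, len(text) - n + 1)):
                window = text[start:start + n]
                if difflib.SequenceMatcher(None, phrase, window).ratio() >= threshold:
                    hit = 1
                    break
            if hit:
                break
        features[label] = hit
    return features

print(lookup_features("Complaining about lowerback pain for several days."))
# -> flags "lower back pain" as present despite the different surface form
```

The idea would be to feed these flags to the model alongside the usual text features.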

Modeling and identifying curves

I have made measurements with sensors (light, humidity, etc.) and the result is statistical curves/graphs. When I repeat the same experiment, I get a curve that looks like the previous one in general, though of course not identical. What I want is to model the curve as an equation, so that when I run the experiment again and obtain a similar curve (graph), I can say "this is the light sensor" or "this is the humidity sensor", etc. The problem is that I do not know whether this is feasible, or where to start. Do I need machine learning? Something else? Thanks...
You can use a simple neural network that learns to determine the type of sensor given a measurement. To train the neural net you need data, which means you would have to gather several dozen or hundreds of measurements and label them (the more data, the more accurate the network's predictions).
However, if the measurements for a given sensor are very similar and fall within a specific range, you don't really need machine learning. You just need to compute which type of sensor your new measurement is most similar to.
One possible approach (sketched in code below) would be:
Take a few measurements for each class of sensor.
For each class, create a fixed-length vector containing the averaged measurement values. For example, if your light sensor measurements from 3 experiments look like this:
[1,4,5,3,8]
[1,3,4,3,7]
[1,3,5,3,6]
Then you average them into a single vector:
[1, 3.33, 4.67, 3, 7]
When you take a new measurement and want to determine its class, you compute the Mean Absolute Error of the new measurement against the averaged vector of each class. The class with the lowest error is the sensor the measurement was taken with.
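A minimal Python sketch of this averaging-plus-MAE approach (the light sensor numbers are the ones above; the humidity numbers are invented for illustration):

```python
import numpy as np

# Example measurements per sensor class; "humidity" values are made up
measurements = {
    "light":    [[1, 4, 5, 3, 8], [1, 3, 4, 3, 7], [1, 3, 5, 3, 6]],
    "humidity": [[9, 9, 8, 7, 9], [8, 9, 9, 7, 8]],
}

# Average each class into a single prototype vector
prototypes = {name: np.mean(vals, axis=0) for name, vals in measurements.items()}
# prototypes["light"] -> [1., 3.33, 4.67, 3., 7.]

def classify(new_measurement):
    """Return the class whose prototype has the lowest Mean Absolute Error."""
    new_measurement = np.asarray(new_measurement, dtype=float)
    errors = {name: np.mean(np.abs(proto - new_measurement))
              for name, proto in prototypes.items()}
    return min(errors, key=errors.get)

print(classify([1, 3, 5, 3, 7]))  # -> "light"
```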

Number of training samples for text classification task

Suppose you have a set of transcribed customer service calls between customers and human agents, where on average each call's length is 7 minutes. Customers will mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts, you want to train a text classifier that predicts a label for each call on each of the three axes. But labeling recordings takes time and costs money. On the other hand, you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. Ok, ideally, somebody worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.
This would be tricky to answer but I will try my best based on my experience.
In the past, I have performed text classification on 3 datasets; the number in brackets indicates how big each dataset was: restaurant reviews (50K sentences), reddit comments (250K sentences) and developer comments from issue tracking systems (10K sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10K sentences, I achieved an F1 score of more than 80%. I am stressing this dataset specifically because some people told me its size was too small.
So, in your case, assuming you have at least 1000 instances (calls that include a conversation between customer and agent) of roughly 7 minutes on average, this should be a decent start. If the results are not satisfying, you have the following options:
1) Use different models (Multinomial Naive Bayes, Random Forest, Decision Tree, and so on, in addition to whatever you are using); see the small sketch after this list.
2) If point 1 gives more or less similar results, check the ratio of instances of all the classes you have (for each of the three axes you are talking about here). If they are not reasonably balanced, get more data, or try out different balancing techniques if you cannot get more data.
3) Another way would be to classify at the sentence level rather than the message or conversation level, to generate more data and individual labels for sentences rather than for the message or the conversation itself.
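To make points 1 and 2 concrete, here is a minimal scikit-learn sketch for a single axis; the transcripts and labels are made-up placeholders:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up call transcripts with one label per call for a single axis
calls = [
    "customer cannot log in to the portal",
    "password reset link never arrives",
    "question about a duplicate charge on the invoice",
    "customer disputes last month's bill",
    "the mobile app crashes right after startup",
    "app freezes when opening the settings screen",
]
labels = ["login", "login", "billing", "billing", "crash", "crash"]

# Point 2: check how balanced the classes are before labeling more data
print(Counter(labels))

# Point 1: a cheap Multinomial Naive Bayes baseline on bag-of-words features
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(calls, labels)
print(model.predict(["the app keeps crashing on startup"]))  # likely -> ['crash']
```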

When using word alignment tools like fast_align, do more sentences mean better accuracy?

I am using fast_align https://github.com/clab/fast_align to get word alignments between 1000 German sentences and 1000 English translations of those sentences. So far the quality is not so good.
Would throwing more sentences into the process help fast_align to be more accurate? Say I take some OPUS data with 100k aligned sentence pairs, add my 1000 sentences at the end of it, and feed it to fast_align. Will that help? I can't seem to find any info on whether this would make sense.
[Disclaimer: I know next to nothing about alignment and have not used fast_align.]
Yes.
You can prove this to yourself, and also plot the accuracy/scale curve, by removing data from your dataset to try it at an even lower scale.
That said, 1000 is already absurdly low; for these purposes 1000 ≈ 0, and I would not expect it to work.
More ideal would be to try 10K, 100K and 1M. More comparable to others' results would be some standard corpus, e.g. Wikipedia or data from the research workshops.
Adding data very different from the data that is important to you can have mixed results, but in this case more data can hardly hurt. We could be more helpful with suggestions if you mention a specific domain, dataset or goal.
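If it helps, here is a small Python sketch of the workflow described in the question; the file names and the tiny placeholder pairs are invented, and the fast_align invocation in the comment is only indicative:

```python
import os

# Idea from the question: append your 1000 pairs to a big OPUS corpus,
# align the combined file, then keep only the alignments for your own pairs.
# fast_align expects one "source ||| target" pair per line.

opus_pairs = [("Das ist ein Haus .", "This is a house .")]  # placeholder for ~100k OPUS pairs
my_pairs = [("Der Stift schreibt .", "The pen writes .")]   # placeholder for your 1000 pairs

with open("combined.de-en", "w", encoding="utf-8") as f:
    for de, en in opus_pairs + my_pairs:
        f.write(f"{de} ||| {en}\n")

# Run fast_align on the combined file outside of Python, e.g.:
#   ./fast_align -i combined.de-en -d -o -v > forward.align
# The output has one alignment line per input line, so your pairs are the last ones:
if os.path.exists("forward.align"):
    with open("forward.align", encoding="utf-8") as f:
        my_alignments = f.readlines()[-len(my_pairs):]
    print(my_alignments)
```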

Search different audio files for equal short samples

Consider multiple (at least two) different audio files, like several different mixes or remixes. Naively, I would say it must be possible to detect samples, especially the vocals, that are almost identical in two or more of the files; of course only if the vocal samples aren't modified, stretched, pitch-shifted, reverbed too much, etc.
So what kind of algorithm or technique could this be done with? Let's say the user tries to set time markers in all files as precisely as possible, describing the data windows to compare, which contain the presumably equal sounds, vocals, etc.
I know that no direct approach of comparing the raw WAV data in any way is useful. But even if I have the frequency-domain data (e.g. from an FFT), I would have to use a comparison algorithm that shifts the comparison windows along the time axis, since I cannot assume that the samples I want to find are time-synchronized across all files.
Thanks in advance for any suggestions.
Hi, this is possible!
You can use a technique called LSH (locality-sensitive hashing), which is very robust.
Another way to do this is a spectrogram analysis of your audio files (a small sketch follows the steps below):
Construct the song database
1. Record your full song.
2. Transform the sound into a spectrogram.
3. Slice your spectrogram into chunks and get the three or four highest frequencies per chunk.
4. Store all the points.
Match the song
1. Record one short sample.
2. Transform the sound into another spectrogram.
3. Slice your spectrogram into chunks and get the three or four highest frequencies per chunk.
4. Compare the collected frequencies with your song database.
5. Your match is the song with the highest number of hits!
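A minimal Python sketch of this spectrogram-peak matching, using synthetic sine tones in place of real songs:

```python
import numpy as np
from scipy import signal

def fingerprint(samples, sample_rate, peaks_per_chunk=3):
    """Spectrogram-peak fingerprint: for each time slice, keep the few strongest
    frequency bins (steps 2-3 of the recipe above)."""
    freqs, times, spec = signal.spectrogram(samples, fs=sample_rate, nperseg=1024)
    prints = set()
    for t in range(spec.shape[1]):
        top = np.argsort(spec[:, t])[-peaks_per_chunk:]    # strongest bins in this chunk
        prints.add(tuple(sorted(freqs[top].round(0))))     # store as a hashable "point"
    return prints

def match(sample_prints, database):
    """Steps 4-5: the stored song whose fingerprint shares the most points wins."""
    return max(database, key=lambda name: len(database[name] & sample_prints))

# Usage with synthetic audio (placeholders for real recordings):
rate = 8000
t = np.arange(0, 2.0, 1 / rate)
song_a = np.sin(2 * np.pi * 440 * t)   # pretend "song A"
song_b = np.sin(2 * np.pi * 880 * t)   # pretend "song B"
database = {"song A": fingerprint(song_a, rate), "song B": fingerprint(song_b, rate)}

snippet = song_a[: rate // 2]          # short sample cut from song A
print(match(fingerprint(snippet, rate), database))  # -> "song A"
```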
You can see how to do it here:
http://translate.google.com/translate?hl=EN&sl=pt&u=http://ederwander.wordpress.com/2011/05/09/audio-fingerprint-em-python/
ederwander
