I am trying to analyze customer reviews. I have split each review into a list of tokens. How can I label which tokens are positive and which are negative? Is there a library for this?
I want to build a word cloud of the positive words and negative words.
There are several things you can try here (note, though, that people usually classify the review as a whole, not the individual words):
Try Brown clustering to cluster your words; then, if you have labels, you can better assess the quality of the word clustering.
Label each word with the label of the review it appears in (positive or negative). However, this may not be accurate, because negativity is sometimes a composition of words (e.g. "not like").
You can also use your labels to derive negative and positive words from their frequencies in negative and positive documents.
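The frequency-based idea can be sketched like this; the toy reviews and the log-odds scoring with add-one smoothing are illustrative choices, not a fixed recipe:

```python
import math
from collections import Counter

# Toy labeled reviews; in practice these come from your own corpus.
reviews = [
    ("great product works well", "pos"),
    ("terrible quality broke fast", "neg"),
    ("great value works great", "pos"),
    ("broke after a week terrible", "neg"),
]

pos_counts, neg_counts = Counter(), Counter()
for text, label in reviews:
    target = pos_counts if label == "pos" else neg_counts
    target.update(text.split())

def polarity(word, smoothing=1.0):
    """Log-odds style score: > 0 leans positive, < 0 leans negative."""
    p = pos_counts[word] + smoothing
    n = neg_counts[word] + smoothing
    return math.log(p / n)

print(polarity("great"))     # positive score
print(polarity("terrible"))  # negative score
```

Words with strongly positive scores would go into the positive word cloud, strongly negative ones into the negative cloud.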
There are plenty of libraries for sentiment classification: scikit-learn, TensorFlow, etc.
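For labeling individual tokens, a simple lexicon lookup is one option. The mini-lexicon below is a made-up stand-in; in practice you would load a published one, such as NLTK's opinion_lexicon or the VADER lexicon:

```python
# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "terrible", "broken", "hate", "slow"}

def label_tokens(tokens):
    """Tag each token as positive, negative, or neutral via lexicon lookup."""
    labels = []
    for tok in tokens:
        word = tok.lower()
        if word in POSITIVE:
            labels.append((tok, "positive"))
        elif word in NEGATIVE:
            labels.append((tok, "negative"))
        else:
            labels.append((tok, "neutral"))
    return labels

tokens = ["Great", "phone", "but", "terrible", "battery"]
print(label_tokens(tokens))
```

The positive and negative subsets can then be fed separately into a word-cloud library.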
If I pass a sentence containing 5 words to the Doc2Vec model and the size is 100, there are 100 vectors. I don't understand what those vectors are. If I increase the size to 200, there are 200 vectors for just a simple sentence. Please explain how those vectors are calculated.
When using a size=100, there are not "100 vectors" per text example – there is one vector, which includes 100 scalar dimensions (each a floating-point value, like 0.513 or -1.301).
Note that the values represent points in 100-dimensional space, and the individual dimensions/axes don't have easily-interpretable meanings. Rather, it is only the relative distances and relative directions between individual vectors that have useful meaning for text-based applications, such as assisting in information-retrieval or automatic classification.
The method for computing the vectors is described in the paper 'Distributed Representations of Sentences and Documents' by Le & Mikolov. It is closely related to the 'word2vec' algorithm, so understanding that first may help, for example via its first and second papers. If papers aren't your style, queries like [word2vec tutorial], [how does word2vec work], or [doc2vec intro] should find more casual introductory descriptions.
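To make the distinction concrete, here is a minimal numpy sketch: one document maps to one vector with `size` dimensions. The vector here is random for illustration; gensim's Doc2Vec would learn its values during training:

```python
import numpy as np

# With size=100, EACH document gets a single 100-dimensional vector,
# no matter how many words the document contains.
doc_vector = np.random.default_rng(0).normal(size=100)
print(doc_vector.shape)  # (100,) -- one vector, 100 dimensions

# Individual dimensions are not interpretable; only relative geometry
# between vectors carries meaning, e.g. via cosine similarity:
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

other_doc_vector = np.random.default_rng(1).normal(size=100)
print(cosine(doc_vector, other_doc_vector))  # a value in [-1, 1]
```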
I am trying to develop a method to classify audio using MFCCs in Weka. The MFCCs I have are generated with a buffer size of 1024, so there is a series of MFCC coefficients for each audio recording. I want to convert these coefficients into the ARFF data format for Weka, but I'm not sure how to approach this problem.
I also asked a question about merging the data as well because I feel like this may affect the data conversion to ARFF format.
I know that in an ARFF file the data needs to be listed through attributes. Should each coefficient of the MFCC be a separate attribute, or should the array of coefficients be a single attribute? Should each data row represent a single MFCC frame, a window of time, or the entire file or sound? Below, I wrote out what I think it would look like if it only took one MFCC frame into account, which I don't think would be able to classify an entire sound.
@relation audio
@attribute mfcc1 real
@attribute mfcc2 real
@attribute mfcc3 real
@attribute mfcc4 real
@attribute mfcc5 real
@attribute mfcc6 real
@attribute mfcc7 real
@attribute mfcc8 real
@attribute mfcc9 real
@attribute mfcc10 real
@attribute mfcc11 real
@attribute mfcc12 real
@attribute mfcc13 real
@attribute class {bark, honk, talking, wind}
@data
126.347275, -9.709645, 4.2038302, -11.606304, -2.4174862, -3.703139, 12.748064, -5.297932, -1.3114156, 2.1852574, -2.1628475, -3.622149, 5.851326, bark
Any help will be greatly appreciated.
Edit:
I have generated some ARFF files with openSMILE, following a method from this website, but I am not sure how this data could be used to classify the audio, because each row of data is 10 milliseconds of audio from the same file. The name attribute of each row is "unknown," which I assume is the attribute the classifier would try to predict. How can I classify an overall sound (rather than 10 milliseconds of it) and compare it to several other overall sounds?
Edit #2: Success!
After more thoroughly reading the website I found, I saw the Accumulate script and the Test and Train data files. The Accumulate script gathers the MFCC data generated from separate audio files into one ARFF file. Their file was composed of about 200 attributes holding statistics for 12 MFCCs. Although I wasn't able to retrieve these statistics using openSMILE, I computed them with Python libraries. The statistics were max, min, kurtosis, range, standard deviation, and so on. I then accurately classified my audio files using BayesNet and Multilayer Perceptron in Weka, both of which yielded 100% accuracy for me.
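The summary-statistics step (collapsing a variable number of MFCC frames into one fixed-length feature row per file) might look roughly like this; the stand-in `frames` array and the particular set of statistics are assumptions for illustration:

```python
import numpy as np

# Stand-in for real per-window MFCC output, e.g. from openSMILE:
# shape (n_frames, 13) -- 13 coefficients per 10 ms window.
rng = np.random.default_rng(42)
frames = rng.normal(size=(200, 13))

def summarize(frames):
    """Collapse frames over time into one fixed-length feature vector.

    Each statistic is computed per coefficient, so the output length is
    independent of the audio duration. Kurtosis (mentioned above) could
    be added via scipy.stats.kurtosis.
    """
    stats = [
        frames.min(axis=0),
        frames.max(axis=0),
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.max(axis=0) - frames.min(axis=0),  # range
    ]
    return np.concatenate(stats)

features = summarize(frames)
print(features.shape)  # (65,) = 5 statistics x 13 coefficients
```

Each audio file then contributes exactly one row of 65 values (plus the class label) to the ARFF file.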
I don't know much about MFCCs, but if you are trying to classify audio files, then each line under @data must represent one audio file. If each line under @data were a window of time or a single MFCC frame, the Weka classifiers would be trying to classify windows of time or single frames, which is not what you want. Just in case you are unfamiliar with the format (I mention this because I saw you put the features of an audio file on the line after @data), here is an example where each line represents an Iris plant:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
In terms of your question on which attributes to use for your audio files, it sounds (no pun intended) like using the MFCC coefficients could work, assuming every audio file has the same number of MFCCs, because every data row/audio file must have the same number of attributes. I would try it out and see how it goes.
EDIT:
If the audio files are not the same size you could:
Truncate audio files that are longer than the shortest one. Basically, you'd be throwing away the data at the end of the longer audio files.
Make the number of attributes large enough to fit the longest audio file, and for audio files shorter than the longest one, fill the unused attributes with whatever MFCC coefficients represent silence.
If MFCC values are always within a certain range (e.g. -10 to 10 or something like that) then maybe use a "bag of words" model. Your attributes would represent the number of times an MFCC coefficient falls within a certain range for an audio file. So the first attribute might represent the number of MFCC coefficients which fall between -10 and -9.95, the second attribute, -9.95 to -9.90. So if you had a very short audio file with two MFCCs (not likely, just for example purposes) and one coefficient was 10 and the other was -9.93 then your last attribute would have a value of 1, your second attribute would have a value of 1, but all other attributes would have a value of 0. The downside to this method is that the order of the MFCC coefficients is not taken into account. However, this method works well for text classification even though word order is ignored so who knows, maybe it would work for audio.
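The binning idea can be sketched with `numpy.histogram`; the [-10, 10] range and 0.05 bin width are taken from the example above and would need tuning to the actual range of your coefficients:

```python
import numpy as np

def mfcc_histogram(coeffs, lo=-10.0, hi=10.0, width=0.05):
    """Count how many MFCC coefficients fall into each fixed-width bin.

    The output length depends only on the bin layout, so every audio
    file yields the same number of attributes regardless of duration.
    """
    n_bins = int(round((hi - lo) / width))  # 400 bins here
    edges = np.linspace(lo, hi, n_bins + 1)
    counts, _ = np.histogram(coeffs, bins=edges)
    return counts

# The two-coefficient example from the text: one at 10, one at -9.93.
counts = mfcc_histogram(np.array([10.0, -9.93]))
print(counts.sum())  # 2 -- both coefficients landed in a bin
```

As described, the last bin (ending at 10) and the second bin (-9.95 to -9.90) each get a count of 1, and every other bin stays 0.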
Other than that I would see if you get any good answers on your merging question.
I am using Weka's clustering methods to group similar string patterns. I first applied Weka's StringToWordVector filter and then used some clustering methods directly, but I can't get correct results. Could someone suggest suitable methods to group this kind of data? This is a small part of my data:
@relation ponds
@attribute LCC string
@data
acegiadfgiacehiacehiacfhjacehjadfhjacfgiadfhjadfhjadfhjacfhjadf
acehiadfhjacehiadfhjadfhjadfhjadfhjacfhfhjacehj
acehiadfhjacehiadfhjadfhjadfhjadfhjacfhjadfhjadfhjadfhjadfhjadfhjacehj
acehiadfhjacehiadfhjadfhjacfhjaacehjadfhjadfhjadfhjacfhj
acehiadfhjacehikkkkkkkkkkk
In fact, every line of this data represents a frequent pattern extracted by a data mining algorithm, and each letter (a, c, e, ...) represents an attribute. But the patterns (lines) don't all have the same number of attributes, so how can I use clustering methods to group similar patterns? Thank you very much! Looking forward to your response :)
David
Every string is different, so StringToWordVector will give them different vectors. Read up on the "bag of words" model for details.
You could try clustering with Levenshtein distance, but I would rather try designing some good features for your problem.
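A plain dynamic-programming Levenshtein distance is easy to write; libraries such as rapidfuzz offer much faster implementations. A pairwise distance matrix over these values could then be fed to a distance-based clusterer:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Two patterns from the data above: smaller distance = more similar.
p1 = "acehiadfhjacehikkkkkkkkkkk"
p2 = "acehiadfhjacehiadfhjadfhjadfhjadfhjacfhfhjacehj"
print(levenshtein(p1, p2))
```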
If I use a tf-idf feature representation (or just document-length normalization), are Euclidean distance and (1 - cosine similarity) basically the same? All the textbooks and forum discussions I have read say cosine similarity works better for text...
I wrote some basic code to test this and found that they are indeed comparable: not exactly the same floating-point values, but one looks like a scaled version of the other. Below are the results of both similarities on simple demo text data. Text no. 2 is a long line of about 50 words; the rest are short 10-word lines.
Cosine similarity:
0.0, 0.2967, 0.203, 0.2058
Euclidean distance:
0.0, 0.285, 0.2407, 0.2421
Note: If this question is more suitable to Cross Validation or Data Science, please let me know.
If your data is normalized to unit length, then it is very easy to prove that
Euclidean(A,B)^2 = 2 - 2*Cos(A,B)
This does hold if ||A||=||B||=1. It does not hold in the general case, and it depends on the exact order in which you perform your normalization steps. I.e. if you first normalize your document to unit length, next perform IDF weighting, then it will not hold...
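For unit-length vectors, the identity ||A - B||^2 = 2 - 2*cos(A,B) follows from expanding ||A - B||^2 = ||A||^2 + ||B||^2 - 2*(A.B), and can be checked numerically:

```python
import numpy as np

# Two random vectors, normalized to unit length.
rng = np.random.default_rng(0)
A = rng.normal(size=50)
A /= np.linalg.norm(A)
B = rng.normal(size=50)
B /= np.linalg.norm(B)

euclid_sq = np.sum((A - B) ** 2)
cos = A @ B  # for unit vectors, the dot product IS the cosine

print(np.isclose(euclid_sq, 2 - 2 * cos))  # True
```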
Unfortunately, people use all kinds of variants, including quite different versions of IDF normalization.