I'm using the WEKA tool for clustering data analysis; however, some of my attributes have many values in their domain. Specifically, I need to represent some information about proteins, and the information that I need to include is the terms associated with their functions.
For example, these values are all included in the same attribute, "Function":
"RNA-Binding protein", "RNA bindingstructural constituent of ribosomerRNA binding", "translation", "intracellularribosomeribonucleoprotein complex".
And these terms vary hugely.
Can someone help me?
A common approach is to split categorical variables with n different categories into n binary dummy variables.
For example:
gender = {male, female} can be rewritten with 2 dummy variables as:
male = [0, 1]
female = [1, 0]
In your case, it seems a function can contain several distinct values (e.g. one protein with several functions). This is easy to mold into dummy variables too: give each distinct function term its own 0/1 column and, for each protein, set the columns of the terms that apply (a short sketch follows).
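If you do the preprocessing in Python before loading the data into WEKA, a minimal sketch of this multi-label dummy encoding with scikit-learn's MultiLabelBinarizer (the function terms below are placeholders, and splitting your raw strings into individual terms is assumed to happen beforehand):
from sklearn.preprocessing import MultiLabelBinarizer

# each protein's "Function" attribute as a set of term strings (placeholder examples)
functions = [
    {"RNA-binding protein", "translation"},
    {"rRNA binding", "ribosome", "translation"},
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(functions)   # one 0/1 dummy column per distinct term
print(mlb.classes_)                # column order
print(X)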
I want to build an error component logit using R's mlogit library.
I have treated my dataset as a panel dataset (i.e. each row indicates an alternative) and want to build an error component logit model.
I understand that in order to build a mixed-logit model, I need to pass the list of covariates to the rpar argument. However, I do not want to estimate random parameters for the covariates but for the intercept terms.
In a multinomial logit model you can estimate J-1 alternative specific constants (intercepts). The easiest way to make them random is to create alternative specific indicators.
For example, let's say that you have three alternatives: 1, 2 and 3, and that these are stored in the variable alt. Now you can create two new variables called alt_1 and alt_2, which are equal to 1 for alternative 1 and 2, respectively, and 0 otherwise.
data$alt_1 <- ifelse(data$alt == 1, 1, 0)
data$alt_2 <- ifelse(data$alt == 2, 1, 0)
Then reshape the data as usual with the mlogit.data() function before estimating the model.
In your model you would then specify alt_1 and alt_2 to be random parameters following some distribution. Now you have random alternative specific constants and you estimate the mean and standard deviation. If you want them to be simple error components with zero mean and unit standard deviation, you can fix the mean and sd parameters for the intercepts to 0 and 1 respectively.
I understand the meaning and methods of word embedding (skip-gram, CBOW) completely. And I know that Google has a word2vec API that, given a word, produces its vector.
But my problem is this: we have a clause that includes a subject, object, verb, etc., where each word has already been embedded by the Google API. Now, how can we combine these vectors to create a single vector that represents the clause?
Example:
Clause: V= "dog bites man"
After word embedding by Google, we have V1, V2, V3, which map to "dog", "bites", "man" respectively, and we know that:
V = V1 + V2 + V3
How can we compute V?
I would appreciate it if you could explain this with an example using real vectors.
A vector is basically just a list of numbers. You add vectors by adding together the numbers in the same position in each list. Here's an example:
a = [1, 2, 3]
b = [4, 5, 6]
c = a + b # vector addition
c is [(1+4), (2+5), (3+6)], or [5, 7, 9]
A simple way to do this in Python (note that + on plain Python lists concatenates them rather than adding element-wise, so you need an explicit element-wise sum) is:
list(map(sum, zip(a, b)))   # [5, 7, 9]
Vector addition is part of linear algebra. If you don't understand operations on vectors and matrices the math around word vectors will be very hard to understand, so you may want to look into learning more about linear algebra in general.
Adding word vectors together is often a reasonable way to approximate a sentence vector. However, your example of Dog bites man and Man bites dog shows the weakness of adding vectors: addition ignores word order, so the results for those two sentences would be the same, even though their meanings are very different.
For methods of getting sentence vectors that are affected by word order, look into doc2vec or the just-released InferSent.
Two solutions:
Use vector addition of the constituent words of a phrase - this typically works well because addition is a good estimation of semantic composition.
Use paragraph vectors, which can encode an arbitrary-length sequence of words as a single vector (see the Doc2Vec sketch below).
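For the paragraph-vector route, here is a minimal sketch using gensim's Doc2Vec; it assumes gensim 4.x (older versions use model.docvecs instead of model.dv), and the two-sentence corpus is only there to show the API, far too small to learn meaningful vectors:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: in practice train on many sentences/documents
docs = [
    TaggedDocument(words=["dog", "bites", "man"], tags=["s0"]),
    TaggedDocument(words=["man", "bites", "dog"], tags=["s1"]),
]

model = Doc2Vec(documents=docs, vector_size=50, min_count=1, epochs=100)

v0 = model.dv["s0"]                                   # vector for "dog bites man"
v_new = model.infer_vector(["dog", "bites", "man"])   # vector for an unseen clause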
So, in this paper: https://arxiv.org/pdf/2004.07464.pdf
They have combined image embedding and text embedding by concatenating them.
X = concat(TE, IE)
Here X is the fused embedding, with TE and IE the text and image embeddings respectively (the fusion is concatenation, not element-wise addition).
If your TE and IE each have a dimension of, say, 2048, your X will be of length 2*2048 = 4096. You can then use this directly if possible, or if you want to reduce the dimension you can use t-SNE/PCA or https://arxiv.org/abs/1708.03629 (implemented here: https://github.com/vyraun/Half-Size).
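A minimal numpy sketch of that fusion step (the dimensions and random values are placeholders):
import numpy as np

te = np.random.rand(2048)       # text embedding (placeholder values)
ie = np.random.rand(2048)       # image embedding (placeholder values)

x = np.concatenate([te, ie])    # fused embedding
print(x.shape)                  # (4096,)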
I have a big dataset with 2bil+ samples and approximately 100+ features per sample. Of these features, about 10% are numerical/continuous variables and the rest are categorical (position, language, url, etc.).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...here it would make sense to use a reduction like SVD, because the distance between SUD and NORTH is greater than between SUD and CENTRE; moreover, it is possible to encode this variable (e.g. with OneHotEncoder, StringIndexer) because of the small cardinality of its value set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlib, 90% of the models work only with numerical values (apart from the Frequent Itemset and Decision Tree techniques).
2) Feature transformers/reducers/extractors such as PCA or SVD are not well suited to this kind of data, and there is no implementation of, e.g., MCA.
a) What would your approach be to handle this kind of data in Spark, or using MLlib?
b) Do you have any suggestions for coping with so many categorical values?
c) After reading a lot of the literature and surveying the models implemented in Spark, my idea is that, to make inference on one of those features using the other (categorical) ones, the models at point 1 could be the best choice. What do you think about it?
(To standardize this as a classical use case, you can imagine the problem of inferring the gender of a person from visited urls and other categorical features.)
Given that I am a newbie in regards to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say Stack Overflow works the other way around: you should be the one providing a working example of the problem you are facing, and we help you out using that example.
Anyway, I was intrigued by the use of categorical values like the one you show as position. If this is a categorical variable, as you mention, with 3 levels SUD, CENTRE, NORTH, there is no distance between them if they are truly categorical. In that case I would create dummy variables like:
SUD_Cat CENTRE_Cat NORTH_Cat
SUD 1 0 0
CENTRE 0 1 0
NORTH 0 0 1
This is a truly dummy representation of a categorical variable.
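In Spark itself, this dummy (one-hot) representation is what StringIndexer followed by OneHotEncoder produces, with one slot in the output vector per level. A minimal PySpark sketch, assuming Spark 3.x (in Spark 2.3/2.4 the multi-column estimator is called OneHotEncoderEstimator):
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("SUD",), ("CENTRE",), ("NORTH",), ("SUD",)], ["position"])

indexer = StringIndexer(inputCol="position", outputCol="position_idx")
encoder = OneHotEncoder(inputCols=["position_idx"], outputCols=["position_vec"],
                        dropLast=False)   # keep all three dummy slots

indexed = indexer.fit(df).transform(df)
encoded = encoder.fit(indexed).transform(indexed)
encoded.show(truncate=False)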
On the other hand if you want to take that distance into account then you have to create another feature which takes this distance into account explicitly, but that is not a dummy representation.
If the problem you are facing is that, after you have written your categorical features as dummy variables (note that now all of them are numerical), you have very many features and want to reduce your feature space, then that is a different problem.
As a rule of thumb, I try to use the entire feature space first; a plus of Spark is that its computing power lets you run modelling tasks on big datasets. If it is still too big, I would then go for dimensionality reduction techniques such as PCA.
I have 4 sets of manually tagged data for 0 and 1, by 4 different people. I have to get the final labelled data in terms of 0 and 1 using the 4 sets of manually tagged data. I have calculated the degree of agreement between the users as
A-B : 0.3276,
A-C : 0.3263,
A-D : 0.4917,
B-C : 0.2896,
B-D : 0.4052,
C-D : 0.3540.
I do not know how to use this to calculate the final data as a single set.
Please help.
The Kappa coefficient works only for a pair of annotators. For more than two, you need to employ an extension of it. One popular option is the extension proposed by Richard Light in 1971; another is to use the average expected agreement over all annotator pairs, proposed by Davies and Fleiss in 1982. I am not aware of any readily available calculator that will compute these for you, so you may have to implement the code yourself.
There is this Wikipedia page on Fleiss' kappa, however, which you might find helpful.
These techniques can only be used for nominal variables. If your data is not on the nominal scale, use a different measure like the intraclass correlation coefficient.
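If you end up implementing it yourself and Python is an option, here is a minimal sketch of Fleiss' kappa computed directly from its definition, assuming your four annotators' 0/1 labels are aligned row by row (the toy ratings below are placeholders):
import numpy as np

def fleiss_kappa(labels):
    """Fleiss' kappa for an (n_items x n_raters) array of nominal labels."""
    labels = np.asarray(labels)
    n_items, n_raters = labels.shape
    cats = np.unique(labels)
    # counts[i, k] = number of raters assigning category cats[k] to item i
    counts = np.stack([(labels == c).sum(axis=1) for c in cats], axis=1)
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                               # mean observed agreement
    p_j = counts.sum(axis=0) / (n_items * n_raters)  # category proportions
    p_e = (p_j ** 2).sum()                           # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# rows = items, columns = the four annotators A, B, C, D (toy placeholder labels)
ratings = np.array([[1, 1, 0, 1],
                    [0, 0, 0, 1],
                    [1, 1, 1, 1],
                    [0, 1, 0, 0]])
print(fleiss_kappa(ratings))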
I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from -50 to 50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables, as phrased in the OP). In particular, a decision tree:
gives you a visual representation of the entire classification/regression analysis (in the form of a binary tree), which distinguishes it from any other analytical/statistical technique that I am aware of;
requires very little pre-processing of your data: no normalization, no rescaling, no conversion of discrete variables into integers (e.g., Male/Female => 0/1); it can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, is itself a visual summary of feature importance ranking (i.e., significant variables): the most significant variable is the root node, which is more significant than its two child nodes, which in turn are more significant than their four combined children. "Significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable', the thing you are trying to predict). One proviso: from a visual inspection of a decision tree you cannot distinguish variable significance among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and continuous data usually without problem; however, in my experience, decision tree algorithms tend to perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case I would consider discretizing it (unless doing so renders the entire analysis meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) that are meaningful with respect to your problem domain--e.g., if your response variable consists of 'continuous' values from 1 to 100, you might sensibly bin them into 5 bins: 0-20, 21-40, 41-60, and so on.
For instance, from your question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable at the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity is higher than for the subsets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and there seem to be quite a few tutorials available on the Web (e.g., on YouTube). (Note: I have not used the decision tree module in RapidMiner, nor have I used RapidMiner at all.)
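If Python is also available to you, here is a minimal scikit-learn sketch of the same workflow (the file name, column names, and bin edges are placeholders for your data): bin the continuous profit/loss response, fit a shallow tree on the market variables, and read off the importance ranking. Note that scikit-learn's trees require numeric inputs, so a buy/sell flag would first need to be encoded as 0/1.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hypothetical file and column names: ~10 market variables plus a profit column
df = pd.read_csv("trades.csv")
features = [c for c in df.columns if c != "profit"]

# discretize the continuous profit/loss response into three classes
y = pd.cut(df["profit"],
           bins=[-float("inf"), -1e-6, 1e-6, float("inf")],
           labels=["loss", "flat", "gain"])

X_train, X_test, y_train, y_test = train_test_split(df[features], y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)  # shallow tree stays readable
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
# variables closer to the root get larger importance scores
print(sorted(zip(features, tree.feature_importances_), key=lambda t: -t[1]))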
The other set of techniques I would consider is usually grouped under the rubric of Dimension Reduction. Feature Extraction and Feature Selection are perhaps the two most common terms after D/R. The most widely used technique is PCA, or principal component analysis, which is based on an eigenvector decomposition of the covariance matrix (derived from your data matrix).
One direct result of this eigenvector decomposition is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data.
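If you do this step in Python rather than R or RapidMiner, a minimal scikit-learn sketch of reading off those explained-variance fractions (the data here is a random placeholder; in practice you would standardize your features first):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 10)      # placeholder: 500 trades x 10 market variables

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cumvar, 0.95)) + 1   # components needed for ~95% of variance
print(cumvar)
print(n_95)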
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course lets you access R inside RapidMiner. R has plenty of PCA libraries (packages). The ones I mention below are all available on CRAN, which means any of the PCA packages there satisfy the minimum package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, I can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is actually a tutorial on Independent Component Analysis (ICA) rather than PCA, but I mention it here because it's an excellent tutorial and the two techniques are used for similar purposes.