Linear model subset selection goodness-of-fit with k-fold cross validation

I am studying 'An Introduction to Statistical Learning' by James et al. (2015). In the experiment section, there is a script that calculates the goodness-of-fit of different subsets using k-fold cross-validation.
When I run it and try to plot the cross-validation errors, I get the error:
Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "regsubsets"
The script makes too little sense to me to know what I'm doing wrong. Can anyone help me interpret it?
library(leaps)
library(ISLR)
k=10
set.seed(1)
folds=sample(1:k,nrow(Hitters),replace=TRUE)
cv.errors=matrix(NA,k,19, dimnames=list(NULL, paste(1:19)))
for(j in 1:k){
  best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],nvmax=19)
  for(i in 1:19){
    pred=predict(best.fit,Hitters[folds==j,],id=i)
    cv.errors[j,i]=mean( (Hitters$Salary[folds==j]-pred)^2)
  }
}
mean.cv.errors=apply(cv.errors,2,mean)
mean.cv.errors
par(mfrow=c(1,1))
plot(mean.cv.errors,type='b')
reg.best=regsubsets(Salary~.,data=Hitters, nvmax=19)
coef(reg.best,11)

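I ran into this problem too. In case you have not found the answer yet, here it is.
You have probably already created the function below. It must be defined in your session before the loop runs, because the error means R cannot find a predict method for regsubsets objects.
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])  # recover the model formula from the original call
  mat <- model.matrix(form, newdata)    # design matrix for the new data
  coefi <- coef(object, id = id)        # coefficients of the best model with id variables
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                # predictions for newdata
}
Then change pred=predict(best.fit,Hitters[folds==j,],id=i) to pred <- predict.regsubsets(best.fit, Hitters[folds == j, ], id = i). Note that R is case-sensitive, so the data set must be written Hitters, not hitters.
Hope it helps.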

Related

OpenModelica - Connector generates additional equations than expected

I'm trying to implement a simple model in OpenModelica, but I get more equations than expected because of the connector.
The model is composed of three items:
The first is the connector, in which only a flow variable is defined:
```
connector ConnectorMassFlow
  flow Real G "Rate [kg/s]";
end ConnectorMassFlow;
```
The second is a compressor model, in which 1 connector is defined:
```
model ModelCompressor
  ConnectorMassFlow MassFlowIn;
equation
end ModelCompressor;
```
And then the circuit model, in which the compressor is defined, and another connector as well:
```
model ModelConditioner
  ConnectorMassFlow MassFlowIn;
  ModelCompressor Compressor;
equation
  MassFlowIn.G = 1.0;
  connect(Compressor.MassFlowIn, MassFlowIn);
end ModelConditioner;
```
The problem is that, although there should be only 2 equations, i.e.:
Compressor.MassFlowIn.G - MassFlowIn.G = 0.0;
MassFlowIn.G = 1.0;
OpenModelica adds a third equation:
MassFlowIn.G = 0.0;
In addition, the first equation should have a + between the two terms, not a minus.
Can someone help me, please?

What is the output of predict.coxph() using type = "survival"?

I am trying to learn what the various outputs of predict.coxph() mean. I am currently attempting to fit a Cox model on a training set and then use the resulting coefficients to make predictions on a test set (new data).
I see from the predict.coxph() help page that I could use type = "survival" to extract an individual's survival probability, which is equal to exp(-expected).
Here is a code block of what I have attempted so far, using the ISLR2 BrainCancer data.
library(survival) # coxph(), Surv()
library(ISLR2)    # BrainCancer data
set.seed(123)
n.training = round(nrow(BrainCancer) * 0.70) # 70:30 split
idx = sample(1:nrow(BrainCancer), size = n.training)
d.training = BrainCancer[idx, ]
d.test = BrainCancer[-idx, ]
# fit a model using the training set
fit = coxph(Surv(time, status) ~ sex + diagnosis + loc + ki + gtv + stereo, data = d.training)
# get predicted survival probabilities for the test set
pred = predict(fit, type = "survival", newdata = d.test)
The predictions generated:
predict(fit, type = "survival", newdata = d.test)
[1] 0.9828659 0.8381164 0.9564982 0.2271862 0.2883800 0.9883625 0.9480138 0.9917512 1.0000000 0.9974775 0.7703657 0.9252100 0.9975044 0.9326234 0.8718161 0.9850815 0.9545622 0.4381646 0.8236644
[20] 0.2455676 0.7289031 0.9063336 0.9126897 0.9988625 0.4399697 0.9360874
Are these survival probabilities associated with a specific time point? From the help page, it sounds like these are survival probabilities at the follow-up times in the newdata argument. Is this correct?
Additional questions:
How is the baseline hazard estimated in predict.coxph? Is it using the Breslow estimator?
If type = "expected" is used, are these values the cumulative hazard? If yes, what are the relevant time points for these?
Thank you!

is it possible to get exactly the same results from tensorflow mfcc and librosa mfcc?

I'm trying to make the TensorFlow MFCC give me the same results as Python's librosa MFCC.
I have tried to match all the default parameters used by librosa in my TensorFlow code, but I get a different result.
This is the TensorFlow code that I used:
import functools
import tensorflow as tf  # TensorFlow 1.x (uses tf.contrib)
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

sample_rate = 16000  # defined before decode_wav, which uses it
# audio_binary holds the raw contents of the WAV file (loaded elsewhere)
waveform = contrib_audio.decode_wav(
    audio_binary,
    desired_channels=1,
    desired_samples=sample_rate,
    name='decoded_sample_data')
# waveform[0] is the audio with shape (samples, channels); stft frames the last axis
transwav = tf.transpose(waveform[0])
stfts = tf.contrib.signal.stft(
    transwav,
    frame_length=2048,
    frame_step=512,
    fft_length=2048,
    window_fn=functools.partial(tf.contrib.signal.hann_window,
                                periodic=False),
    pad_end=True)
spectrograms = tf.abs(stfts)
num_spectrogram_bins = stfts.shape[-1].value
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 0.0, 8000.0, 128
linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
    upper_edge_hertz)
mel_spectrograms = tf.tensordot(
    spectrograms,
    linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
    linear_to_mel_weight_matrix.shape[-1:]))
log_mel_spectrograms = tf.log(mel_spectrograms + 1e-6)
mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
    log_mel_spectrograms)[..., :20]
The equivalent in librosa:
libr_mfcc = librosa.feature.mfcc(y=wav, sr=16000)
the following are the graphs of the results:

I'm the author of tf.signal. Sorry for not seeing this post sooner, but you can get librosa and tf.signal.stft to match if you center-pad the signal before passing it to tf.signal.stft. See this GitHub issue for more details.
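A minimal sketch of what center-padding means here, using only NumPy so that it runs without TensorFlow or librosa. It assumes librosa's `center=True` behaviour of padding `n_fft // 2` samples on each side of the signal. Reflect padding was the default in older librosa releases, while newer ones may default to zero padding, so use whichever mode your librosa version applies. The helper names `frame_signal` and `stft_mag` are illustrative and are not part of either library.

```python
import numpy as np

def frame_signal(x, frame_length, frame_step):
    """Split a 1-D signal into overlapping frames (no end padding)."""
    n_frames = 1 + (len(x) - frame_length) // frame_step
    idx = np.arange(frame_length)[None, :] + frame_step * np.arange(n_frames)[:, None]
    return x[idx]

def stft_mag(x, n_fft=2048, hop=512, center=False, pad_mode="reflect"):
    """Magnitude STFT with a periodic Hann window and optional center padding."""
    if center:
        # Center padding: n_fft // 2 samples on each side of the signal.
        x = np.pad(x, n_fft // 2, mode=pad_mode)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    frames = frame_signal(x, n_fft, hop) * window
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)  # one second of noise at 16 kHz

centered = stft_mag(signal, center=True)
uncentered = stft_mag(signal, center=False)
# Padding by hand and then running an uncentered STFT gives the centered result.
manually_padded = stft_mag(np.pad(signal, 1024, mode="reflect"), center=False)

print(centered.shape, uncentered.shape)
print(np.allclose(centered, manually_padded))
```

With tf.signal.stft the same idea would be to pad the waveform yourself (for example with tf.pad in REFLECT mode) before the call. Note also that the question's code passes periodic=False to the Hann window, whereas librosa's default Hann window is periodic, so that setting probably needs to match as well.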

I spent a whole day trying to make them match. Even rryan's solution did not work for me (with center=False in librosa), but I finally found that the TF and librosa STFTs match only when win_length == n_fft in librosa and frame_length == fft_length in TF. That is why rryan's Colab example works; if you set frame_length != fft_length, the amplitudes are very different (although the patterns look similar when plotted). A typical example: you choose some win_length/frame_length and then set n_fft/fft_length to the smallest power of 2 greater than it; the results will differ. So you need to stick with the inefficient FFT size given by your window size. I don't know why this is so, but hopefully it will be helpful for someone.
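One possible explanation, offered as an assumption rather than a confirmed diagnosis: when the window is shorter than the FFT size, librosa zero-pads the window symmetrically up to n_fft, so each frame spans n_fft samples with the window in the middle, whereas tf.signal.stft cuts frames of frame_length samples and zero-pads them at the end before the FFT. The two would then look at different samples for the same frame index. The NumPy-only sketch below mimics both conventions for a single frame; `librosa_style` and `tf_style` are hypothetical helper names, not library functions.

```python
import numpy as np

win_length, n_fft = 400, 512
offset = (n_fft - win_length) // 2  # left padding when the window is centered

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
window = np.hanning(win_length)

def librosa_style(x, start):
    """Frame of n_fft samples; window zero-padded on both sides to n_fft."""
    padded_window = np.zeros(n_fft)
    padded_window[offset:offset + win_length] = window
    return np.abs(np.fft.rfft(x[start:start + n_fft] * padded_window))

def tf_style(x, start):
    """Frame of win_length samples; the FFT zero-pads it at the end to n_fft."""
    return np.abs(np.fft.rfft(x[start:start + win_length] * window, n=n_fft))

# Same frame start: the windows cover different samples, so magnitudes differ.
same_start = np.allclose(librosa_style(x, 0), tf_style(x, 0))
# Shifting the second frame by `offset` aligns the samples; magnitudes then agree.
shifted_start = np.allclose(librosa_style(x, 0), tf_style(x, offset))
print(same_start, shifted_start)
```

If this is what is happening, the magnitudes agree only once the frames are aligned by that offset, which would also be consistent with the spectrograms looking similar when plotted while differing numerically.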

The output of contrib_audio.decode_wav should be a DecodeWav tuple with { audio, sample_rate }, where audio has shape (sample_rate, 1). So what is the purpose of taking the first item of waveform and transposing it?
transwav = tf.transpose(waveform[0])

There is no straightforward way, since librosa's stft uses center=True, which does not match TF's stft.
Had it been center=False, the TF and librosa STFTs would give near enough results; see the Colab snippet.
Even so, trying to port the librosa code to TF is a big headache. Here is what I started and then gave up on. Near, but not near enough.
import numpy as np
import librosa
import tensorflow as tf  # TensorFlow 1.x (uses tf.contrib)

def pow2db_tf(X):
    amin = 1e-10
    top_db = 80.0
    ref_value = 1.0
    ln10 = 2.302585092994046  # natural log of 10: (10 / ln10) * ln(x) == 10 * log10(x)
    log_spec = (10.0 / ln10) * tf.log(tf.maximum(amin, X))
    log_spec -= (10.0 / ln10) * tf.log(tf.maximum(amin, ref_value))
    pow2db = tf.maximum(log_spec, tf.reduce_max(log_spec) - top_db)
    return pow2db

def librosa_feature_like_tf(x, sr=16000, n_fft=2048, hop_length=512, n_mfcc=20):
    # hop_length was undefined in the original; 512 is librosa's default
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft).astype(np.float32)
    # librosa returns (n_mels, 1 + n_fft/2); transpose before adding the batch axis
    mel_basis = mel_basis.T.reshape(1, int(n_fft / 2 + 1), -1)
    tf_stft = tf.contrib.signal.stft(x, frame_length=n_fft, frame_step=hop_length, fft_length=n_fft)
    print("tf_stft", tf_stft.shape)
    # note: librosa's mel spectrogram uses the power spectrogram by default; this uses the magnitude
    tf_S = tf.matmul(tf.abs(tf_stft), mel_basis)
    print("tf_S", tf_S.shape)
    tfdct = tf.spectral.dct(pow2db_tf(tf_S), norm='ortho')
    print("tfdct before cut", tfdct.shape)
    tfdct = tfdct[:, :, :n_mfcc]
    print("tfdct after cut", tfdct.shape)
    # tfdct = tf.transpose(tfdct, [0, 2, 1]); print("tfdct after transpose", tfdct.shape)
    return tfdct

x = tf.placeholder(tf.float32, shape=[None, 16000], name='x')
tf_feature = librosa_feature_like_tf(x)
print("tf_feature", tf_feature.shape)
# wav and sr are the audio samples and sample rate, loaded elsewhere
mfcc_rosa = librosa.feature.mfcc(y=wav, sr=sr).T
print("mfcc_rosa", mfcc_rosa.shape)

For anyone still looking for this: I had a similar problem some time ago: Matching librosa's mel filterbanks/mel spectrogram to a tensorflow implementation. The solution was to use a different windowing approach for the spectrogram and librosa's mel matrix as constant tensor. See here and here.

using as.ppp on data frame to create marked process

I am using a data frame to create a marked point process using as.ppp function. I get an error Error: is.numeric(x) is not TRUE. The data I am using is as follows:
dput(head(pointDataUTM[,1:2]))
structure(list(POINT_X = c(439845.0069, 450018.3603, 451873.2925,
446836.5498, 445040.8974, 442060.0477), POINT_Y = c(4624464.56,
4629024.646, 4624579.758, 4636291.222, 4614853.993, 4651264.579
)), .Names = c("POINT_X", "POINT_Y"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I can see that the first two columns are numeric, so I do not know why it is a problem.
> str(pointDataUTM)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5028 obs. of 31 variables:
$ POINT_X : num 439845 450018 451873 446837 445041 ...
$ POINT_Y : num 4624465 4629025 4624580 4636291 4614854 ...
Then I also checked for NA, which shows no NA
> sum(is.na(pointDataUTM$POINT_X))
[1] 0
> sum(is.na(pointDataUTM$POINT_Y))
[1] 0
When I tried even only the first two columns of the data.frame, the error I get on using as.ppp is this:
Error: is.numeric(x) is not TRUE
5.stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), ch), call. = FALSE, domain = NA)
4.stopifnot(is.numeric(x))
3.ppp(X[, 1], X[, 2], window = win, marks = marx, check = check)
2.as.ppp.data.frame(pointDataUTM[, 1:2], W = studyWindow)
1.as.ppp(pointDataUTM[, 1:2], W = studyWindow)
Could someone tell me what is the mistake here and why I get the not numeric error?
Thank you.

The critical check is whether pointDataUTM[,1] is numeric, rather than pointDataUTM$POINT_X.
Since pointDataUTM is a tbl object, and tbl is a function from the dplyr package, what is probably happening is that the subset operator for the tbl class returns a data frame, not a numeric vector, when a single column is extracted, whereas the $ operator returns a numeric vector.
I suggest you convert your data to data.frame using as.data.frame() before calling as.ppp.
In the next version of spatstat we will make our code more robust against this kind of problem.

I'm on my phone, so I can't check, but I think it happens because you have a tibble and not a data.frame. Please try converting it to a data.frame with as.data.frame first.

Error in (function (classes, fdef, mtable)

I ran this topic modeling script successfully two months ago, but it now gives me an error message (shown in the last three lines).
post <- posterior(TM1, newdata = dtm[-c(1:20),]) #this line gives me an error message.
perplex <- perplexity(TM1, newdata = dtm[-c(1:20),]) #this line does not give me an error message.
Can anybody help me understand what is going on here?
=====================
library("tm")
library("slam")
library("topicmodels")
library("SnowballC")
corpus <- Corpus(DirSource(directory="/Users/loni/Documents/TextMining/test", encoding="UTF-8"))
dtm <- DocumentTermMatrix(corpus, control=list(stemming=TRUE, stopwords=TRUE, removePunctuation=FALSE))
term_tfidf <- tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm>0))
dim(dtm)
[1] 26 919
dtm <- dtm[, term_tfidf >= .06] # petition corpus
dtm <- dtm[row_sums(dtm) > 0,]
dim(dtm)
[1] 26 499
k<-5
SEED <- 2
TM <- list(VEM = LDA(dtm, k = k, control = list(seed = SEED)))
TM1 <- list(VEM = LDA(dtm[c(1:20),], k = k, control = list(seed = SEED))) #validation
Topic <- topics(TM[["VEM"]],1)
Terms <- terms(TM[["VEM"]], 8)
Terms[, 1:5]
post <- posterior(TM1, newdata = dtm[-c(1:20),])
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘posterior’ for signature ‘"list", "DocumentTermMatrix"’

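It could be because of how the list is indexed: the error signature shows that posterior() received a "list", not a fitted model. Try extracting the model from TM1 with [[ ]], e.g. posterior(TM1[["VEM"]], newdata = dtm[-c(1:20),]).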

I had the same error today and found that the issue was because I had other packages loaded that conflicted. The easiest fix was to create a new session with a clear workspace, and rerun the script.
This answer to a similar question clued me in:
Unable to find an inherited method for function ‘select’ for signature ‘"data.frame"’
