Pachinko Modeling in Mallet - topic-modeling

I am experimenting with the Pachinko topic model in Mallet, and am having trouble getting it working. When it prints out the topics at each update, they are all the same. This occurs when I use both the default alpha and beta values, and when I use my own. I haven't been able to find that much written about Mallet's Pachinko model, and their documentation is pretty sparse. Here is my code with my parameters:
PAM4L model = new PAM4L(superTopics: 25, subTopics: 5);
String output = new String("output");
Randoms rand = new Randoms(4);
model.estimate(InstanceList: instances, iterations: 500,
optimizeInterval: 5,showTopicsInterval: 50,
outputModelInterval: 50, outputFile: output, r: rand);
Thank you

Related

What is the inverse of torch.fft.fft() for pytorch1.10?

I am using 1.10.2+cu113.
In pytorch 1.7, the code is
ffted = torch.rfft(input, 1, onesided=False)
iffted = torch.irfft(ffted, 1, onesided=False)
I want to migrate it in new version of pytorch.
Here is my code.
ffted = torch.fft.fft(input)
iffted = torch.fft.ifft(ffted)
But iffted is complex. That is not what I want. I did not found the inverse function of torch.fft.fft.
So what is the correct function I should call?
BTW: input.shape=torch.Size([1, 1, 4000, 10])

What is the output of predict.coxph() using type = "survival"?

I am trying to learn what the various outputs of predict.coxph() mean. I am currently attempting to fit a cox model on a training set then use the resulting coefficients from the training set to make predictions in a test set (new set of data).
I see from the predict.coxph() help page that I could use type = "survival" to extract and individual's survival probability-- which is equal to exp(-expected).
Here is a code block of what I have attempted so far, using the ISLR2 BrainCancer data.
set.seed(123)
n.training = round(nrow(BrainCancer) * 0.70) # 70:30 split
idx = sample(1:nrow(BrainCancer), size = n.training)
d.training = BrainCancer[idx, ]
d.test = BrainCancer[-idx, ]
# fit a model using the training set
fit = coxph(Surv(time, status) ~ sex + diagnosis + loc + ki + gtv + stereo, data = d.training)
# get predicted survival probabilities for the test set
pred = predict(fit, type = "survival", newdata = d.test)
The predictions generated:
predict(fit, type = "survival", newdata = d.test)
[1] 0.9828659 0.8381164 0.9564982 0.2271862 0.2883800 0.9883625 0.9480138 0.9917512 1.0000000 0.9974775 0.7703657 0.9252100 0.9975044 0.9326234 0.8718161 0.9850815 0.9545622 0.4381646 0.8236644
[20] 0.2455676 0.7289031 0.9063336 0.9126897 0.9988625 0.4399697 0.9360874
Are these survival probabilities associated with a specific time point? From the help page, it sounds like these are survival probabilities at the follow-up times in the newdata argument. Is this correct?
Additional questions:
How is the baseline hazard estimated in predict.coxph? Is it using the Breslow estimator?
If type = "expected" is used, are these values the cumulative hazard? If yes, what are the relevant time points for these?
Thank you!

Running multiple inferences in parallel with PyTorch

I'm trying to implement Double DQN (not to be confused with DQN with a slightly delayed Q-target network) in PyTorch to train an agent to play an Atari OpenAI Gym game. Here I discuss the implementation of the following formula:
Update of Q-network, formula taken from Sutton & Barto.
My first implementation is:
Q_pred = self.Q_1.forward(s_now)[T.arange(batch_size), actions.long()]
Q_next_all = self.Q_1.forward(s_next)
maxA_id = T.argmax(Q_next_all, dim=1)
Q_pred2 = self.Q_2.forward(s_next)[T.arange(batch_size), maxA_id]
Q_target = (rewards + (~dones) * self.GAMMA * Q_pred2).detach()
self.Q_1.optimizer.zero_grad()
self.Q_1.loss(Q_target, Q_pred).backward()
self.Q_1.optimizer.step()
(Q_1 and Q_2 are nn.Module classes, and all of the variables involved here are already torch tensors lying in the GPU.)
I noticed that my program ran much slower than a previous implementation which used plain DQN.
I realized that I can combine the batches entering Q_1, so there will be one combined batch being forwarded in the neural network, instead of two batches in sequence. The code becomes:
s_combined = T.cat((s_now, s_next))
Q_combined = self.Q_1.forward(s_combined)
Q_pred = Q_combined[T.arange(batch_size), actions.long()]
Q_next_all = Q_combined[batch_size:]
Q_pred2_all = self.Q_2.forward(s_next)
maxA_id = T.argmax(Q_next_all, dim=1)
Q_pred2 = Q_pred2_all[T.arange(batch_size), maxA_id]
Q_target = (rewards + (~dones) * self.GAMMA * Q_pred2).detach()
self.Q_1.optimizer.zero_grad()
self.Q_1.loss(Q_target, Q_pred).backward()
self.Q_1.optimizer.step()
(This proves that I understand how to do batch training in PyTorch, so don't mark this as a duplicate of this question.)
Furthermore, I realized that Q_1 and Q_2 can process their batches in parallel. So I looked up how to do multiprocessing in PyTorch. Unfortunately, I couldn't find a good example. I tried to adapt a code that looks similar to my scenario, and my code becomes:
def spawned():
s_combined = T.cat((s_now, s_next))
Q_combined = self.Q_1.forward(s_combined)
Q_pred = Q_combined[T.arange(batch_size), actions.long()]
Q_next_all = Q_combined[batch_size:]
mp.set_start_method('spawn', force=True)
p = mp.Process(target=spawned)
p.start()
Q_pred2_all = self.Q_2.forward(s_next)
p.join()
maxA_id = T.argmax(Q_next_all, dim=1)
Q_pred2 = Q_pred2_all[T.arange(batch_size), maxA_id]
Q_target = (rewards + (~dones) * self.GAMMA * Q_pred2).detach()
self.Q_1.optimizer.zero_grad()
self.Q_1.loss(Q_target, Q_pred).backward()
self.Q_1.optimizer.step()
This crashes with the error message:
AttributeError: Can't pickle local object 'Agent.learn.<locals>.spawned'
So how do I make this work?
(Achieving this in CUDA programming is trivial. One simply launches two device kernels using a sequential host code, and the two kernels are automatically computed in parallel in the GPU.)

RRelief Feature Selection, using multisurf from skrebate

I am new to using the MultiSURF algorithm for feature selection.
I am using MultiSurf from skrebate.
I have ~6500 feature dataset.
The code took about 3 days to create a ditance array. It has been stuck on "Feature Scoring unedr way..." for the past 5 days.
What am i doing wrong ?
Following is the code:
fs = MultiSURF(n_features_to_select=100, verbose=True)
fs.fit(X, y)
print("Printing for FS")
print(fs.feature_importances_)
print(fs.top_features_)
print("Done Printing for FS")
dfDashboard = pd.DataFrame()
for feature_name, feature_score in zip(df.drop(responseCol, axis=1).columns,
fs.feature_importances_):
print(feature_name, '\t', feature_score)
new_record = pd.DataFrame([[feature_name, feature_score]],columns=['FeatureName','Score'])
dfDashboard = pd.concat([dfDashboard,new_record])
Can someone please help understand ?
Does MultiSurf not work well for datasets with 1000's features?
Thanks

is it possible to get exactly the same results from tensorflow mfcc and librosa mfcc?

I'm trying to make tensorflow mfcc give me the same results as python lybrosa mfcc
i have tried to match all the default parameters that are used by librosa
in my tensorflow code and got a different result
this is the tensorflow code that i have used :
waveform = contrib_audio.decode_wav(
audio_binary,
desired_channels=1,
desired_samples=sample_rate,
name='decoded_sample_data')
sample_rate = 16000
transwav = tf.transpose(waveform[0])
stfts = tf.contrib.signal.stft(transwav,
frame_length=2048,
frame_step=512,
fft_length=2048,
window_fn=functools.partial(tf.contrib.signal.hann_window,
periodic=False),
pad_end=True)
spectrograms = tf.abs(stfts)
num_spectrogram_bins = stfts.shape[-1].value
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 0.0,8000.0, 128
linear_to_mel_weight_matrix =
tf.contrib.signal.linear_to_mel_weight_matrix(
num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
upper_edge_hertz)
mel_spectrograms = tf.tensordot(
spectrograms,
linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
linear_to_mel_weight_matrix.shape[-1:]))
log_mel_spectrograms = tf.log(mel_spectrograms + 1e-6)
mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
log_mel_spectrograms)[..., :20]
the equivalent in librosa:
libr_mfcc = librosa.feature.mfcc(wav, 16000)
the following are the graphs of the results:
I'm the author of tf.signal. Sorry for not seeing this post sooner, but you can get librosa and tf.signal.stft to match if you center-pad the signal before passing it to tf.signal.stft. See this GitHub issue for more details.
I spent a whole 1 day trying to make them match. Even the rryan's solution didn't work for me (center=False in librosa), but I finally found out, that TF and librosa STFT's match only for the case win_length==n_fft in librosa and frame_length==fft_length in TF. That's why rryan's colab example is working, but you can try that if you set frame_length!=fft_length, the amplitudes are very different (although visually, after plotting, the patterns look similar). Typical example - if you choose some win_length/frame_length and then you want to set n_fft/fft_length to the smallest power of 2 greater than win_length/frame_length, then the results will be different. So you need to stick with the inefficient FFT given by your window size... I don't know why it is so, but that's how it is, hopefully it will be helpful for someone.
The output of contrib_audio.decode_wav should be DecodeWav with { audio, sample_rate } and audio shape is (sample_rate, 1), so what is the purpose for getting first item of waveform and do transpose?
transwav = tf.transpose(waveform[0])
No straight forward way, since librosa stft uses center=True which does not comply with tf stft.
Had it been center=False, stft tf/librosa would give near enough results. see colab sniff
But even though, trying to import the librosa code into tf is a big headache. Here is what I started and gave up. Near but not near enough.
def pow2db_tf(X):
amin=1e-10
top_db=80.0
ref_value = 1.0
log10 = 2.302585092994046
log_spec = (10.0/log10) * tf.log(tf.maximum(amin, X))
log_spec -= (10.0/log10) * tf.log(tf.maximum(amin, ref_value))
pow2db = tf.maximum(log_spec, tf.reduce_max(log_spec) - top_db)
return pow2db
def librosa_feature_like_tf(x, sr=16000, n_fft=2048, n_mfcc=20):
mel_basis = librosa.filters.mel(sr, n_fft).astype(np.float32)
mel_basis = mel_basis.reshape(1, int(n_fft/2+1), -1)
tf_stft = tf.contrib.signal.stft(x, frame_length=n_fft, frame_step=hop_length, fft_length=n_fft)
print ("tf_stft", tf_stft.shape)
tf_S = tf.matmul(tf.abs(tf_stft), mel_basis);
print ("tf_S", tf_S.shape)
tfdct = tf.spectral.dct(pow2db_tf(tf_S), norm='ortho'); print ("tfdct", tfdct.shape)
print ("tfdct before cut", tfdct.shape)
tfdct = tfdct[:,:,:n_mfcc];
print ("tfdct afer cut", tfdct.shape)
#tfdct = tf.transpose(tfdct,[0,2,1]);print ("tfdct afer traspose", tfdct.shape)
return tfdct
x = tf.placeholder(tf.float32, shape=[None, 16000], name ='x')
tf_feature = librosa_feature_like_tf(x)
print("tf_feature", tf_feature.shape)
mfcc_rosa = librosa.feature.mfcc(wav, sr).T
print("mfcc_rosa", mfcc_rosa.shape)
For anyone still looking for this: I had a similar problem some time ago: Matching librosa's mel filterbanks/mel spectrogram to a tensorflow implementation. The solution was to use a different windowing approach for the spectrogram and librosa's mel matrix as constant tensor. See here and here.

Resources