How to extract the aggregated gradient from tensorflow_federated? - python-3.x

I have a tensorflow model like this
def input_spec():
return(
tf.TensorSpec([None, 122], tf.float64),
tf.TensorSpec([None, 5],tf.uint8))
def model_fn():
model=tf.keras.models.Sequential([
tf.keras.layers.Dense(64,input_shape=(122,)),
tf.keras.layers.Dense(32,activation='relu'),
tf.keras.layers.Dropout(.15),
tf.keras.layers.Dense(32,activation='relu'),
tf.keras.layers.Dropout(.15),
tf.keras.layers.Dense(32,activation='relu'),
tf.keras.layers.Dropout(.15),
tf.keras.layers.Dense(5,activation='softmax')])
return tff.learning.from_keras_model(
model,
input_spec=input_spec(),
loss=tf.keras.losses.CategoricalCrossentropy(),
metrics=[tf.keras.metrics.CategoricalAccuracy()])
I set the iterative_process in the following
iterative_process=tff.learning.algorithms.build_weighted_fed_avg(
model_fn,
client_optimizer_fn=lambda: tf.keras.optimizers.Adam(),
server_optimizer_fn=lambda: tf.keras.optimizers.Adam())
I have learnt that we can obtain the aggregated weight by model_weights=iterative_process.get_model_weights(state), but I still need to know how to obtain the aggregated gradients.

While running the training procedure, the aggregated (pseudo) gradients can in some cases be computed by subtracting the state at the beginning of the round from that at the end. In the code snippet above, this will not quite literally be true since the server optimizer is Adam (which performs some rescaling of the pseudo-gradients, as well as the addition of a momentum accumulator, if I recall correctly).
If you are simply using gradient-descent with a learning rate of 1 on the server (traditionally the default setting for FedAvg), code like the following should give you this aggregated pseudogradient:
pseudo_grad = tf.nest.map_structure(
lambda x, y: x - y, previous_state.global_model_weights.trainable,
state.global_model_weights.trainable)
Some helpful measurements for debugging can alternatively be accessed by wrapping the aggregator parameter to your build_weighted_fed_avg call in an aggregator that adds these debug measurements, if this is the underlying goal here. You would additionally be able to read these values directly if you implemented a tff.templates.AggregationProcess which output the averaged pseudogradient in the measurements field of its result; these should be passed through directly by the rest of the FedAvg implementation.

Related

Proper way to log things when using Pytorch Lightning DDP

I was wondering what is the proper way of logging metrics when using DDP. I noticed that if I want to print something inside validation_epoch_end it will be printed twice when using 2 GPUs. I was expecting validation_epoch_end to be called only on rank 0 and to receive the outputs from all GPUs, but I am not sure this is correct anymore. Therefore I have several questions:
validation_epoch_end(self, outputs) - When using DDP does every subprocess receive the data processed from the current GPU or data processed from all GPUs, i.e. does the input parameter outputs contains the outputs of the entire validation set, from all GPUs?
If outputs is GPU/process specific what is the proper way to calculate any metric on the entire validation set in validation_epoch_end when using DDP?
I understand that I can solve the printing by checking self.global_rank == 0 and printing/logging only in that case, however I am trying to get a deeper understanding of what I am printing/logging in this case.
Here is a code snippet from my use case. I would like to be able to report f1, precision and recall on the entire validation dataset and I am wondering what is the correct way of doing it when using DDP.
def _process_epoch_outputs(self,
outputs: List[Dict[str, Any]]
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Creates and returns tensors containing all labels and predictions
Goes over the outputs accumulated from every batch, detaches the
necessary tensors and stacks them together.
Args:
outputs (List[Dict])
"""
all_labels = []
all_predictions = []
for output in outputs:
for labels in output['labels'].detach():
all_labels.append(labels)
for predictions in output['predictions'].detach():
all_predictions.append(predictions)
all_labels = torch.stack(all_labels).long().cpu()
all_predictions = torch.stack(all_predictions).cpu()
return all_predictions, all_labels
def validation_epoch_end(self, outputs: List[Dict[str, Any]]) -> None:
"""Logs f1, precision and recall on the validation set."""
if self.global_rank == 0:
print(f'Validation Epoch: {self.current_epoch}')
predictions, labels = self._process_epoch_outputs(outputs)
for i, name in enumerate(self.label_columns):
f1, prec, recall, t = metrics.get_f1_prec_recall(predictions[:, i],
labels[:, i],
threshold=None)
self.logger.experiment.add_scalar(f'{name}_f1/Val',
f1,
self.current_epoch)
self.logger.experiment.add_scalar(f'{name}_Precision/Val',
prec,
self.current_epoch)
self.logger.experiment.add_scalar(f'{name}_Recall/Val',
recall,
self.current_epoch)
if self.global_rank == 0:
print((f'F1: {f1}, Precision: {prec}, '
f'Recall: {recall}, Threshold {t}'))
Questions
validation_epoch_end(self, outputs) - When using DDP does every
subprocess receive the data processed from the current GPU or data
processed from all GPUs, i.e. does the input parameter outputs
contains the outputs of the entire validation set, from all GPUs?
Data processed from the current GPU only, outputs are not synchronized, there is only backward synchronization (gradients are synchronized during training and distributed to replicas of models residing on each GPU).
Imagine that all of the outputs were passed from 1000 GPUs to this poor master, it could give it an OOM very easily
If outputs is GPU/process specific what is the proper way to calculate
any metric on the entire validation set in validation_epoch_end when
using DDP?
According to documentation (emphasis mine):
When validating using a accelerator that splits data from each batch
across GPUs, sometimes you might need to aggregate them on the master
GPU for processing (dp, or ddp2).
And here is accompanying code (validation_epoch_end would receive accumulated data across multiple GPUs from single step in this case, also see the comments):
# Done per-process (GPU)
def validation_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
pred = ...
return {'loss': loss, 'pred': pred}
# Gathered data from all processes (per single step)
# Allows for accumulation so the whole data at the end of epoch
# takes less memory
def validation_step_end(self, batch_parts):
gpu_0_prediction = batch_parts.pred[0]['pred']
gpu_1_prediction = batch_parts.pred[1]['pred']
# do something with both outputs
return (batch_parts[0]['loss'] + batch_parts[1]['loss']) / 2
def validation_epoch_end(self, validation_step_outputs):
for out in validation_step_outputs:
# do something with preds
Tips
Focus on per-device calculations and as small number of between-GPU transfers as possible
Inside validation_step (or training_step if that's what you want, this is general) calculate f1, precision, recall and whatever else on a per-batch basis
Returns those values (say, as a dict). Now you will return 3 numbers from each device instead of (batch, outputs) (which could be significantly larger)
Inside validation_step_end get those 3 values (actually (2, 3) if you have 2 GPUs) and sum/take mean of them and return 3 values
Now validation_epoch_end will get (steps, 3) values that you can use to accumulate
It would be even better if instead of operating on list of values during validation_epoch_end you could accumulate them in another 3 values (say you have a lot of validation steps, the list could grow too large), but this should be enough.
AFAIK PyTorch-Lightning doesn't do this (e.g. instead of adding to list, apply some accumulator directly), but I might be mistaken, so any correction would be great.

GridSearchCV gives different results than LassoCV for optimal alpha

I am aware of the standard process of finding the optimal value of alpha/lambda using Cross Validation technique through GridSearchCV class in sklearn.model_selection library.Here's my code to find that .
alphas=np.arange(0.0001,0.01,0.0005)
cv=RepeatedKFold(n_splits=10,n_repeats=3, random_state=100)
hyper_param = {'alpha':alphas}
model = Lasso()
model_cv = GridSearchCV(estimator = model,
param_grid=hyper_param,
scoring='r2',
cv=cv,
verbose=1,
return_train_score=True
)
model_cv.fit(X_train,y_train)
#checking the bestscore
model_cv.best_params_
This gives me alpha=0.01
Now, looking on LassoCV , as per my understanding , this library creates model by selecting best optimal alpha by the passed alphas list, and please note , I have used the same cross validation scheme for both of them. But when trying sklearn.linear_model.LassoCV with RepeatedKFold cross validation scheme.
alphas=np.arange(0.0001,0.01,0.0005)
cv=RepeatedKFold(n_splits=10,n_repeats=3,random_state=100)
ls_cv_m=LassoCV(alphas,cv=cv,n_jobs=1,verbose=True,random_state=100)
ls_cv_m.fit(X_train_reduced,y_train)
print('Alpha Value %d'%ls_cv_m.alpha_)
print('The coefficients are {}',ls_cv_m.coef_)
I get alpha=0 for the same data and this alpha value in not present in the list of decimal values passed in alphas argument for this.
This has confused me about the actual implementation of LassoCV.
and my doubts are ..
Why do I get optimal alpha as 0 in LassoCV when the list passed to the argument does not has zero in it.
What is the difference between LassoCV and Lasso then, if I have to anyways find most suitable alpha from GridSearchCV only?
First you should pass your alphas as keywords parameters rather then positional parameters since the first positional parameter for LassoCV is eps.
ls_cv_m=LassoCV(alphas=alphas,cv=cv,n_jobs=1,verbose=True,random_state=100)
Then, the model is returning as optimal parameter one of the alphas that you previously defined, however you are simply printing it as an integer number casting the float to int. Replace %d with %f to print it in the float format:
print('Alpha Value %f'%ls_cv_m.alpha_)
Have a look here for more details about Python printing formats and styles.
As for your second question, Lasso is the linear model while LassoCV is an iterative process that allows you to find the optimal parameters for a Lasso model using Cross-validation.

Seq2seq for non-sentence, float data; stuck configuring the decoder

I am trying to apply sequence-to-sequence modelling to EEG data. The encoding works just fine, but getting the decoding to work is proving problematic. The input-data has the shape None-by-3000-by-31, where the second dimension is the sequence-length.
The encoder looks like this:
initial_state = lstm_sequence_encoder.zero_state(batchsize, dtype=self.model_precision)
encoder_output, state = dynamic_rnn(
cell=LSTMCell(32),
inputs=lstm_input, # shape=(None,3000,32)
initial_state=initial_state, # zeroes
dtype=lstm_input.dtype # tf.float32
)
I use the final state of the RNN as the initial state of the decoder. For training, I use the TrainingHelper:
training_helper = TrainingHelper(target_input, [self.sequence_length])
training_decoder = BasicDecoder(
cell=lstm_sequence_decoder,
helper=training_helper,
initial_state=thought_vector
)
output, _, _ = dynamic_decode(
decoder=training_decoder,
maximum_iterations=3000
)
My troubles start when I try to implement inference. Since I am using non-sentence data, I do not need to tokenize or embed, because the data is essentially embedded already. The InferenceHelper class seemed the best way to achieve my goal. So this is what I use. I'll give my code then explain my problem.
def _sample_fn(decoder_outputs):
return decoder_outputs
def _end_fn(_):
return tf.tile([False], [self.lstm_layersize]) # Batch-size is sequence-length because of time major
inference_helper = InferenceHelper(
sample_fn=_sample_fn,
sample_shape=[32],
sample_dtype=target_input.dtype,
start_inputs=tf.zeros(batchsize_placeholder, 32), # the batchsize varies
end_fn=_end_fn
)
inference_decoder = BasicDecoder(
cell=lstm_sequence_decoder,
helper=inference_helper,
initial_state=thought_vector
)
output, _, _ = dynamic_decode(
decoder=inference_decoder,
maximum_iterations=3000
)
The Problem
I don't know what the shape of the inputs should be. I know the start-inputs should be zero because it is the first time-step. But this throws errors; it expects the input to be (1,32).
I also thought I should pass the output of each time-step unchanged to the next. However, this raises problems at run-time: the batch-size varies, so the shape is partial. The library throws an exception at this as it tries to convert the start_input to a tensor:
...
self._start_inputs = ops.convert_to_tensor(
start_inputs, name='start_inputs')
Any ideas?
This is a lesson in poor documentation.
I fixed my problem, but failed to address the variable batch-size problem.
The _end_fn was causing problems I was unaware of. I also managed to work out what the appropriate fields are for the InferenceHelper. I've given the fields names in case anyone needs guidance in future
def _end_fn(_):
return tf.tile([False], [batchsize])
inference_helper = InferenceHelper(
sample_fn=_sample_fn,
sample_shape=[lstm_number_of_units], # In my case, 32
sample_dtype=tf.float32, # Depends on the data
start_inputs=tf.zeros((batchsize, lstm_number_of_units)),
end_fn=_end_fn
)
As for the batch-size problem, there are two things I'm considering:
Changing the internal state of my model object. My TensorFlow computation graph is built inside a class. A class-field records the batch-size. Changing this during training may work. Or:
Pad the batches so that they are 200 sequences long. This will waste time.
Preferably I'd like a way to dynamically manage the batch-sizes.
EDIT: I found a way. It involves simply substituting square-brackets for parentheses:
inference_helper = InferenceHelper(
sample_fn=_sample_fn,
sample_shape=[self.lstm_layersize],
sample_dtype=target_input.dtype,
start_inputs=tf.zeros([batchsize, self.lstm_layersize]),
end_fn=_end_fn
)

Tensorflow: Summaries defined in function not accessible in tensorboard

I have a graph and a set of custom functions that define multilayer RNNs according to an input list which will specify the number of units in each layer. For instance:
def BuildLayers(....):
# takes inputs, list of layer sizes, mask information, etc
#
# invokes BuildLayer(...) several times
#
# returns RNN output and states of last layer
BuildLayer loops through a more detailed function which builds and returns individual layers:
def BuildLayer(....):
# Takes individual layer size, output of previous layer, etc
#
# handles bookkeeping of RNNCells, wrappers, reshaping, etc
# **Important! Defines scope for each layer**
#
# returns RNN output and states of last layer
And ultimately this would be called in a function that defines a graph and runs it in a session:
def Experiment(parameters):
tf.reset_default_graph()
graph = tf.Graph()
with graph.as_default():
#
# Placeholders
# BuildLayers(...)
# Loss function definitions
# optimizer definitions
with tf.Session(graph=graph) as session:
#
# Loop through epochs:
# etc
I.e., if the layer size parameter is [16, 32, 16], we end up with an RNN that has a cell of 16 units in layer1, scoped as layer1, 32 units in layer 2, scoped appropriately, and 16 units in layer 3, scoped, etc.
This seems to work fine, a casual inspection of the graph in tensorboard looks correct, nodes look correct, the thing trains, etc.
Problem: How can I add histogram summaries, e.g., of kernel weights and biases, to that function definition? I've done so naively, as such:
def buildLayer(numUnits, numLayer, input, lengths):
name = 'layer' "{0:0=2d}".format(numLayer)
with tf.variable_scope(name):
cellfw = tf.contrib.rnn.GRUCell(numUnits, activation = tf.nn.tanh)
cellbw = tf.contrib.rnn.GRUCell(numUnits, activation = tf.nn.tanh)
outputs, state = tf.nn.bidirectional_dynamic_rnn(cell_fw = cellfw, cell_bw = cellbw, inputs = input, dtype=tf.float32, sequence_length = lengths)
outputs = tf.concat([outputs[0], outputs[1]], axis=2)
FwKernel = tf.get_default_graph().get_tensor_by_name(name + '/bidirectional_rnn/fw/gru_cell/gates/kernel:0')
FwKernel_sum = tf.summary.histogram("FwKernel", FwKernel, 'rnn')
return outputs, state
And then, at the end of the graph definition, assumed these summaries would be caught up in the
merged = tf.summary.merge_all()
statement. It isn't. I'm confused by this behavior. I can see the histogram summary definitions on a visual inspection of the graph in tensorboard-- they're there. But they don't seem to be getting to the merge and so are never accessible in tensorboard as histograms per se.
How do I get summaries, which are defined in a function, to show up in tensorboard, preferably through a merge and without passing them around through function calls like excess baggage?
The least painful way I have found to avoid this is to pass a single list (i.e., "summaries") through each function, and within the BuildLayers function, to append or extend that list with all desired histogram summaries.
Then, in the main graph definition, rather than a merge_all
merged = tf.summary.merge_all()
instead use a merge and pass the list in as the argument
merged = tf.summary.merge(summaries)
This has the disadvantage of not actually being a merge_all, meaning that if you had defined other summaries (typically scalar summaries for loss functions, at least) you're going to have to manually append them to the summaries list or carry around two merge objects or something similar, which misses the self-advertised point of the merge_all.
I leave this here as an answer to my own question because it might help someone, but will pointedly not accept it because I am hoping to be shown a better way.
Most likely the problem is that summaries are created in the with graph.as_default(): context. The summary operations are then added to this graph's _collections["SUMMARIES"] list. But, when you call merge_all() you are no longer in that context (that set graph to be the default). So, merge_all() looks for summaries in the default graph that was created when you imported tensorflow, which is probably empty.
To fix the issue, simply call merge_all() within the same with graph.as_default(): context.
Here are some relevant code links:
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/summary/summary.py#L293
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/framework/ops.py#L5649

ARCH effect in GARCH model

After fitting GARCH model in R and obtain the output, how do I know whether there is any evidence of ARCH effect?
I am not toosure whether I have to check in optimal parameters, Information criteria, Q-statistics on standardized residuals, ARCM LM Tests, Nyblom stability test, Sign Bias Test or Adjusted Pearson Goodness-of-fit test?
I assume I have to check under ARCH LM Tests, and if the p-value is rather high, there is an ARCH effect, am I right?
Thank you
You need to start by looking for second order persistence in the return series itself before going on to fit a GARCH model. Lets work through a quick example of how this will work
Start by getting the return series. Here I will use the quantmod library to load in the data for SPDR S&P 500 ETF or SPY
library(quantmod)
library(PerformanceAnalytics)
rtn<-getSymbols(c('SPY'),return.class='ts')
Next, calculate the return series either yourself or using the Return.calculate function as provided by the PerformanceAnalytics library
Rtn <- diff(log(SPY[,"SPY.Close"])) * 100
#OR
Rtn <- Return.calculate(SPY[,"SPY.Close"], method = c("compound","simple")[2]) * 100
Now, lets have a look at the persistence of the first and second order moments of the series. For second order moments, lets use the squared return series as a proxy.
Plotdata<-cbind(Rtn, Rtn^2)
plot.zoo(Plotdata)
There remains strong first persistence in returns and there is clearly periods of strong second order persistence as seen in the squared returns.
We can now formally start testing for ARCH-effects. A formal test for ARCH effects is LBQ stats on squared returns:
Box.test(coredata(Rtn^2), type = "Ljung-Box", lag = 12)
Box-Ljung test
data: coredata(Rtn^2)
X-squared = 2001.2, df = 12, p-value < 2.2e-16
We can clearly reject the null hypothesis of independence in a given time series. (ARCH-effects)
Fin.Ts also provides the ARCH-LM test for conditional heteroskedasticity in the returns:
library(FinTS)
ArchTest(Rtn)
ARCH LM-test; Null hypothesis: no ARCH effects
data: Rtn
Chi-squared = 722.19, df = 12, p-value < 2.2e-16
This supports the conclusion of the LBQ test that ARCH-effects are present.

Resources