Doubts regarding `Understanding Keras LSTMs`

I am new to LSTMs, and while going through Understanding Keras LSTMs I had some silly doubts related to a beautiful answer by Daniel Moller.
Here are some of my doubts:
There are 2 ways specified under the Achieving one to many section where it’s written that we can use stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features).
In the One to many with repeat vector diagram, the repeated vector is fed as input at every time step, whereas in the One to many with stateful=True diagram, the output of each step is fed as input to the next time step. So, aren't we changing the way the layers work by using stateful=True?
Which of the above 2 approaches (using the repeat vector OR feeding the previous time-step output as the next input) should be followed when building an RNN?
Under the One to many with stateful=True section, used to change the behaviour to one to many, in the code with the manual loop for prediction, how will we know the value of the steps_to_predict variable, given that we don't know the output sequence length in advance?
I also did not understand the way the entire model uses the last_step output to generate the next_step output. It has confused me about the working of the model.predict() function. I mean, doesn't model.predict() predict the entire output sequence at once, rather than looping through the number of output steps (whose value I still don't know) and calling model.predict() to predict a specific time-step output in each iteration?
I couldn't understand the Many to many case at all. Any other link would be helpful.
I understand that we use model.reset_states() to make sure that a new batch is independent of the previous batch. But do we manually create batches of a sequence such that one batch follows another, or does Keras in stateful=True mode automatically divide the sequence into such batches?
If it's done manually, then why would anyone divide the dataset into batches in which part of a sequence is in one batch and the rest in the next batch?
At last, what are the practical implementations or examples/use cases where stateful=True would be used (because this seems to be something unusual)? I am learning LSTMs, and this is the first time I've been introduced to stateful in Keras.
Can anyone help me by explaining my silly questions so that I can be clear about the LSTM implementation in Keras?
EDIT: Asking some of these for clarification of the current answer, and some about the remaining doubts.
A. So, basically stateful lets us keep OR reset the inner state after every batch. Then, how would the model learn if we keep on resetting the inner state again and again after each batch is trained? Does resetting truly mean resetting the parameters (used in computing the hidden state)?
B. In the line If stateful=False: automatically resets inner state, resets last output step, what did you mean by resetting the last output step? I mean, if every time step produces its own output, then what does resetting the last output step mean, and why only the last one?
C. In response to Question 2 and the 2nd point of Question 4, I still didn't get your "manipulate the batches between each iteration" or the need for stateful (last line of Question 2), which only resets the states. I got to the point that we don't know the input for every output generated at a time step.
So, you break the sequences into sequences of only one step and then use new_step = model.predict(last_step), but then how do you know how long you need to keep doing this (there must be a stopping point for the loop)? Also, do explain the stateful part (in the last line of Question 2).
D. In the code under One to many with stateful=True, it seems that the for loop (manual loop) for predicting the next word is used just at test time. Does the model incorporate that itself at train time, or do we manually need to use this loop at train time as well?
E. Suppose we are doing a machine translation job. I think the breaking of sequences will occur after the entire input (the language to translate) has been fed to the input time steps, and then the generation of outputs (the translated language) at each time step will take place via the manual loop, because now we have run out of inputs and start producing the output at each time step using the iteration. Did I get it right?
F. As the default working of LSTMs requires the 3 things mentioned in the answer, in the case of breaking of sequences, are current_input and previous_output fed the same vectors, because their values are the same when no current input is available?
G. Under the many to many with stateful=True section, under Predicting:, the code reads:
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:]
Since the manual loop for finding the very next word in the current sequence hasn't been used up to this point, how do I know the count of the time steps that have been predicted by model.predict(totalSequences), so that the last step from predicted (predicted[:,-1:]) can later be used for generating the rest of the sequences? I mean, how do I know the number of sequences that have been produced in predicted = model.predict(totalSequences) before the manual for loop (used later)?
EDIT 2:
I. In answer D, I still didn't get how I will train my model. I understand that using the manual loop (during training) can be quite painful, but if I don't use it, how will the model get trained in the circumstances where we want the 10 future steps and cannot output them at once because we don't have the necessary 10 input steps? Will simply using model.fit() solve my problem?
II. In answer D's last paragraph, you wrote: You could train step by step using train_on_batch only in the case you have the expected outputs of each step. But otherwise I think it's very complicated or impossible to train.
Can you explain this in more detail?
What does "step by step" mean? If I don't have (or do have) the outputs for the later sequences, how will that affect my training? Do I still need the manual loop during training? If not, will the model.fit() function work as desired?
III. I interpreted the "repeat" option as using the repeat vector. Wouldn't using the repeat vector be good just for the one to many case, and unsuitable for the many to many case, because the latter will have many input vectors to choose from (to be used as the single repeated vector)? How would you use the repeat vector for the many to many case?

Question 3
Understanding the question 3 is sort of a key to understand the others, so, let's try it first.
All recurrent layers in Keras perform hidden loops. These loops are totally invisible to us, but we can see the results of each iteration at the end.
The number of invisible iterations is equal to the time_steps dimension. So, the recurrent calculations of an LSTM happen step by step.
If we pass an input with X steps, there will be X invisible iterations.
Each iteration in an LSTM will take 3 inputs:
The respective slice of the input data for this step
The inner state of the layer
The output of the last iteration
So, take the following example image, where our input has 5 steps:
What will Keras do in a single prediction?
Step 0:
Take the first step of the inputs: input_data[:,0,:], a slice shaped as (batch, 2)
Take the inner state (which is zero at this point)
Take the last output step (which doesn't exist for the first step)
Pass through the calculations to:
Update the inner state
Create one output step (output 0)
Step 1:
Take the next step of the inputs: input_data[:,1,:]
Take the updated inner state
Take the output generated in the last step (output 0)
Pass through the same calculation to:
Update the inner state again
Create one more output step (output 1)
Step 2:
Take input_data[:,2,:]
Take the updated inner state
Take output 1
Pass through:
Update the inner state
Create output 2
And so on until step 4.
Finally:
If stateful=False: automatically resets inner state, resets last output step
If stateful=True: keeps inner state, keeps last output step
You will not see any of these steps. It will look like just a single pass.
But you can choose between:
return_sequences = True: every output step is returned, shape (batch, steps, units)
This is exactly many to many. You get the same number of steps in the output as you had in the input
return_sequences = False: only the last output step is returned, shape (batch, units)
This is many to one. You generate a single result for the entire input sequence.
Now, this answers the second part of your question 2: Yes, predict will compute everything without you noticing. But:
The number of output steps will be equal to the number of input steps
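For illustration, here is a minimal sketch (my own, with made-up sizes, not code from the original answer). One predict() call silently runs all the hidden step iterations; return_sequences only decides which of the produced output steps you get back:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

input_data = np.random.random((8, 5, 2))  # (batch, steps, features): 5 hidden iterations

many_to_many = Sequential([LSTM(3, return_sequences=True, input_shape=(5, 2))])
print(many_to_many.predict(input_data).shape)  # (8, 5, 3): one output step per input step

many_to_one = Sequential([LSTM(3, return_sequences=False, input_shape=(5, 2))])
print(many_to_one.predict(input_data).shape)   # (8, 3): only the last output step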
Question 4
Now, before going to the question 2, let's look at 4, which is actually the base of the answer.
Yes, the batch division should be done manually. Keras will not change your batches. So, why would I want to divide a sequence?
1. The sequence is too big, and one batch doesn't fit in the computer's or the GPU's memory.
2. You want to do what is happening in question 2: manipulate the batches between each step iteration.
Question 2
In question 2, we are "predicting the future". So, what is the number of output steps? Well, it's the number you want to predict. Suppose you're trying to predict the number of clients you will have based on the past. You can decide to predict for one month in the future, or for 10 months. Your choice.
Now, you're right to think that predict will calculate the entire thing at once, but remember question 3 above where I said:
The number of output steps is equal to the number of input steps
Also remember that the first output step is the result of the first input step, the second output step is the result of the second input step, and so on.
But we want the future, not something that matches the previous steps one by one. We want the result step that follows the "last" step.
So, we face a limitation: how to define a fixed number of output steps if we don't have their respective inputs? (The inputs for the distant future are also future, so, they don't exist)
That's why we break our sequence into sequences of only one step. So predict will also output only one step.
When we do this, we have the ability to manipulate the batches between each iteration. And we have the ability to take output data (which we didn't have before) as input data.
And stateful is necessary because we want each of these steps to be connected as a single sequence (don't discard the states).
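A minimal sketch of this idea (my own; known_sequence and steps_to_predict are hypothetical names, and it assumes a trained stateful model whose output step has the same shape as its input step, i.e. output_features == input_features):
steps_to_predict = 10                      # your choice: how far into the future

model.reset_states()                       # start a fresh sequence
predicted = model.predict(known_sequence)  # builds up the states along the way
new_step = predicted[:, -1:]               # the last output step becomes the next input

future_steps = []
for _ in range(steps_to_predict):
    new_step = model.predict(new_step)     # the states carry over between calls
    future_steps.append(new_step)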
Question 5
The best practical application of stateful=True that I know is the answer of question 2. We want to manipulate the data between steps.
This might be a dummy example, but another application is, for instance, receiving data from a user on the internet: each day the user visits your website, you give one more step of data to your model (and you want to continue this user's previous history in the same sequence).
Question 1
Then, finally question 1.
I'd say: always avoid stateful=True, unless you need it.
You don't need it to build a one to many network, so, better not use it.
Notice that the stateful=True example for this is the same as the "predict the future" example, but you start from a single step. It's hard to implement and will be slower because of the manual loops. But you can control the number of output steps, and this might be something you want in some cases.
There will be a difference in calculations too. And in this case I really can't answer if one is better than the other. But I don't believe there will be a big difference. But networks are some kind of "art", and testing might bring funny surprises.
Answers for EDIT:
A
We should not confuse "states" with "weights". They're two different variables.
Weights: the learnable parameters, they're never reset. (If you reset the weights, you lose everything the model learned)
States: the current memory of a batch of sequences (relates to which step of the sequence I am on now and what I have learned "from the specific sequences in this batch" up to this step).
Imagine you are watching a movie (a sequence). Every second makes you build memories like the name of the characters, what they did, what their relationship is.
Now imagine you get a movie you never saw before and start watching the last second of the movie. You will not understand the end of the movie because you need the previous story of this movie. (The states)
Now imagine you finished watching an entire movie. Now you will start watching a new movie (a new sequence). You don't need to remember what happened in the last movie you saw. If you try to "join the movies", you will get confused.
In this example:
Weights: your ability to understand and interpret movies, your ability to memorize important names and actions
States: on a paused movie, states are the memory of what happened from the beginning up to now.
So, states are "not learned". States are "calculated", built step by step regarding each individual sequence in the batch. That's why:
resetting states means starting new sequences from step 0 (starting a new movie)
keeping states means continuing the same sequences from the last step (continuing a movie that was paused, or watching part 2 of that story)
States are exactly what make recurrent networks work as if they had "memory from the past steps".
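A tiny sketch (my own, not from the original answer) showing that reset_states() clears the states while leaving the learned weights untouched:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

model = Sequential([LSTM(3, stateful=True, batch_input_shape=(1, 1, 2))])
model.predict(np.random.random((1, 1, 2)))  # this call updates the states

weights_before = model.get_weights()
model.reset_states()                        # states go back to zero: a "new movie" begins
weights_after = model.get_weights()

# The weights are identical: nothing the model learned was lost.
assert all(np.array_equal(a, b) for a, b in zip(weights_before, weights_after))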
B
In an LSTM, the last output step is part of the "states".
An LSTM state contains:
a memory matrix updated every step by calculations
the output of the last step
So, yes: every step produces its own output, but every step uses the output of the last step as state. This is how an LSTM is built.
If you want to "continue" the same sequence, you want memory of the last step results
If you want to "start" a new sequence, you don't want memory of the last step results (these results will keep stored if you don't reset states)
C
You stop when you want. How many steps in the future do you want to predict? That's your stopping point.
Imagine I have a sequence with 20 steps. And I want to predict 10 steps in the future.
In a standard (non-stateful) network, we can use:
input 19 steps at once (from 0 to 18)
output 19 steps at once (from 1 to 19)
This is "predicting the next step" (notice the shift = 1 step). We can do this because we have all the input data available.
But when we want the 10 future steps, we cannot output them at once because we don't have the necessary 10 input steps (these input steps are future, we need the model to predict them first).
So we need to predict one future step from existing data, then use this step as input for the next future step.
But I want these steps to be all connected. If I use stateful=False, the model will see a lot of "sequences of length 1". No, we want one sequence of 30 steps.
D
This is a very good question and you got me ....
The stateful one to many was an idea I had when writing that answer, but I never used this. I prefer the "repeat" option.
You could train step by step using train_on_batch only in the case you have the expected outputs of each step. But otherwise I think it's very complicated or impossible to train.
E
That's one common approach.
Generate a condensed vector with a network (this vector can be a result, or the states generated, or both things)
Use this condensed vector as the initial input/state of another network, generate step by step manually, and stop when an "end of sentence" word or character is produced by the model.
There are also fixed-size models without the manual loop. You suppose your sentence has a maximum length of X words. Result sentences that are shorter than this are completed with "end of sentence" or "null" words/characters. A Masking layer is very useful in these models.
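As a hedged illustration of that fixed-size alternative (my own sketch with made-up sizes, not code from the original answer):
from tensorflow.keras.layers import Input, Masking, LSTM, Dense
from tensorflow.keras.models import Model

max_len, n_features, vocab_size = 30, 50, 1000  # assumed values

inputs = Input(shape=(max_len, n_features))
x = Masking(mask_value=0.0)(inputs)                   # padded "null" steps are skipped
x = LSTM(64, return_sequences=True)(x)
outputs = Dense(vocab_size, activation='softmax')(x)  # one word distribution per step
model = Model(inputs, outputs)                        # no manual loop needed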
F
You provide only the input. The other two things (last output and inner states) are already stored in the stateful layer.
I made the input = last output only because our specific model is predicting the next step. That's what we want it to do. For each input, the next step.
We taught this with the shifted sequence in training.
G
It doesn't matter. We want only the last step.
The number of sequences is kept by the first :.
And only the last step is considered by -1:.
But if you want to know, you can print predicted.shape. It is equal to totalSequences.shape in this model.
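A tiny sketch (my own, with made-up shapes) of that slicing:
import numpy as np

predicted = np.random.random((8, 20, 5))  # (sequences, steps, features)
firstNewStep = predicted[:, -1:]          # shape (8, 1, 5): every sequence, last step only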
Edit 2
I
First, we can't use "one to many" models to predict the future, because we don't have data for that. There is no possibility to understand a "sequence" if you don't have the data for the steps of the sequence.
So, this type of model should be used for other types of applications. As I said before, I don't really have a good answer for this question. It's better to have a "goal" first, then we decide which kind of model is better for that goal.
II
With "step by step" I mean the manual loop.
If you don't have the outputs of later steps, I think it's impossible to train. It's probably not a useful model at all. (But I'm not the one that knows everything)
If you have the outputs, yes, you can train the entire sequences with fit without worrying about manual loops.
III
And you're right about III. You won't use repeat vector in many to many because you have varying input data.
"One to many" and "many to many" are two different techniques, each one with their advantages and disadvantages. One will be good for certain applications, the other will be good for other applications.

Related

Understanding the time steps and samples in keras LSTM

I'm still confused about the time steps and samples in LSTM networks. If I have this CSV file,
I know that the features are the different variables that we pass to the network, but for the rows, I don't know whether they represent the time steps or the samples, and if they represent samples, for example, what represents the time series, and vice versa?
The rows, ordered by the Date field, form a sequence.
You can divide that sequence into multiple (or single) input/output pairs, and those are your samples.
For example, you can have two time steps as input and one time step as output.
Hence, each sample has a specific number of time steps. The output is a single step (you can have multi step output as well).
So, as we said, the inputs (two time steps each) would be:
[1999.02.05 07:26, 1999.02.05 07:28]
[1999.02.05 07:28, 1999.02.05 07:30]
[1999.02.05 07:30, 1999.02.05 07:32]
....
and so on
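A minimal sketch (my own) of slicing one long sequence into such samples:
import numpy as np

series = np.arange(10, dtype=float)  # the single long sequence (10 rows)
window = 2                           # two time steps as input

x = np.array([series[i:i + window] for i in range(len(series) - window)])
y = np.array([series[i + window] for i in range(len(series) - window)])
print(x.shape, y.shape)  # (8, 2) and (8,): 8 samples, 2 input steps, 1 output step
# For Keras, add a features axis: x.reshape(8, 2, 1)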

Scikit learn models gives weight to random variable? Should I remove features with less importance?

I do some feature selection by removing correlated variables and using backwards elimination. However, after all that is done, as a test I threw in a random variable and then trained logistic regression, random forest, and XGBoost. All 3 models give the random feature an importance greater than 0. First, how can that be? Second, all models rank it toward the bottom, but it's not the lowest feature. Is this a valid step for another round of feature selection, i.e. removing all those that score below the random feature?
The random feature is created with
import numpy as np
model_data['rand_feat'] = np.random.randint(100, size=model_data.shape[0])
This can happen. What you sample is a random number, but this random sampling can still generate a pattern by chance. I don't know whether you are doing classification or regression, but let's consider the simple example of binary classification. We have classes 1 and 0 and 1000 data points from each. When you sample a random number for each data point, it can happen that, for example, a majority of class 1 gets some value higher than 50, whereas a majority of class 0 gets a random number smaller than 50.
So, in effect, this might result in some pattern. I would guess that every time you run your code, the random feature's importance changes. It is always ranked low because it is very unlikely that a good pattern is generated (e.g. all 1s get values higher than 50 and all 0s get values lower than 50).
Finally, yes, you should consider dropping the features with low importance.
I agree with berkay's answer that a random variable can have patterns that are by chance associated with your outcome variable. Secondly, I would neither include a random variable in model building nor use it as my filtering threshold, because if the random variable has, by chance, a significant or nearly significant association with the outcome, it will suppress the expression of important features of the original data, and you will probably end up losing those important features.
In the early phase of model development, I always include two random variables.
For me it is like a 'sanity check' since these are in effect junk variables or junk features.
If any of my features score worse in importance than the junk features, then that is a warning sign that I need to look more carefully at the worth of those features or do some better feature engineering.
For example what does theory suggest about the inclusion of those features?
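A hedged sketch (my own, on synthetic data; all names and numbers are illustrative) of this junk-feature sanity check:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X = np.column_stack([X,
                     rng.integers(0, 100, size=1000),  # junk feature 1
                     rng.normal(size=1000)])           # junk feature 2

model = RandomForestClassifier(random_state=0).fit(X, y)
junk_level = model.feature_importances_[-2:].max()
suspect = np.where(model.feature_importances_[:-2] < junk_level)[0]
print("features scoring below the junk features:", suspect)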

doc2vec: Pull documents from inferred document

I am new to word/paragraph embeddings and trying to understand them via doc2vec in GENSIM. I would like to seek advice on whether my understanding is incorrect. My understanding is that doc2vec is potentially able to return documents that have semantically similar content. As a test, I tried the following and have the following questions.
Question 1: I noted that every run of training with the exact same parameters and examples results in a model that produces very different results from previous runs (e.g. different vectors and a different ranking of similar documents every time). Why is this so nondeterministic? As such, can this be reliably used for any practical work?
Question 2: Why am i not getting the tag ids of the top similar documents instead?
Results: [('day',0.477),('2016',0.386)....
Question 2 answer: The problem was due to model.most_similar; model.docvecs.most_similar should be used instead.
Please advise if I misunderstood anything.
Data prep
I created multiple documents with one sentence each. I deliberately made them distinctly different semantically.
A: It is a fine summer weather, with the birds singing and sun shining bright.
B: It is a lovely day indeed, if only i had a degree in appreciating.
C: 2016-2017 Degree in Earth Science Earthly University
D: 2009-2010 Dip in Life and Nature Life College
Query: Degree in Philosophy from Thinking University from 2009 to 2010
Training
I trained the documents (tokens as words, running index as tag)
tdlist=[]
docstring=['It is a fine summer weather, with the birds singing and sun shining bright.',
           'It is a lovely day indeed, if only i had a degree in appreciating.',
           '2016-2017 Degree in Earth Science Earthly University',
           '2009-2010 Dip in Life and Nature Life College']
counter=1
for para in docstring:
    tokens=tokenize(para)  # This will also strip punctuation
    td=TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(counter))
    tdlist.append(td)
    counter=counter+1
model=gensim.models.Doc2Vec(tdlist, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
    model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)
Inference
I then attempted to infer the query. Although there are many missing words in the vocabulary for the query, I would expect the closest document similarity results to be C and D. But the results only gave me a list of 'words' followed by a similarity score. I am unsure if my understanding is wrong. Below is my code extract.
mydocvector=model.infer_vector(['Degree' ,'in' ,'Philosophy' ,'from' ,'Thinking' ,'University', 'from', '2009', 'to', '2010'])
print(model.docvecs.most_similar(positive=[mydocvector]))
Doc2Vec doesn't work well on toy-sized datasets - few documents, few total words, few words per document. You'll absolutely want more documents than vector dimensions (size), and ideally tens-of-thousands of documents or more.
The second argument to TaggedDocument should be a list of tags. By supplying a single string-of-an-int, each of its elements (characters) will be seen as tags. (With just documents 1 to 4 this won't yet hurt, but as soon as you have document 10, Doc2Vec will see it as tags '1' and '0', unless you supply it as ['10'], a single-element list.)
Yes, to find most-similar documents you use model.docvecs.most_similar() rather than model.most_similar() (which only operates on learned words, if any).
You are using dm=0 mode, which is a pretty good starting idea – it's fast and often a top-performer. But note that this mode doesn't train word-vectors too. So anything you ask for from the top model, like model['summer'] or model.most_similar('sun'), will be nonsense results based on randomly-initialized but never-trained words. (If you need words trained too, either add dbow_words=1 to the dm=0 mode, or use a dm=1 mode. But for pure doc-vectors, dm=0 is a pretty good choice.)
There's no need to call train() in a loop - or indeed at all, given the line above it. The form you've used to instantiate Doc2Vec, with an actual corpus tdlist as the first argument, already triggers model setup and training, using the default number of iter passes (5) and the supplied alpha and min_alpha. Now, for Doc2Vec training you often want more passes (10 to 20 are common, though smaller datasets might benefit from even more). And for any training, for proper gradient descent, you want the effective learning rate alpha to gradually decline to a negligible value, such as the default 0.0001 (rather than a forced same-as-starting value).
The only situation where you'd usually call train() explicitly is if you instantiate the model without a corpus. In that case, you'd need to both call model.build_vocab(tdlist) (to let the model initialize with a discovered vocabulary), and then some form of train() - but you'd still need only one call to train, supplying the desired number of passes. (Allowing the default model.iter 5 passes, inside an outer loop of 200 iterations, means a total of 1000 passes over the data... and all at the same fixed alpha, which is not proper gradient-descent.)
When you have a beefier dataset, you may find results improve with a higher min_count. Usually words that appear only a few times can't contribute much meaning, and thus only serve as noise slowing training and interfering with other vectors becoming more expressive. (Don't assume "more words must equal better results".) Throwing out the singletons, or more, usually helps.
Regarding inference, almost none of the words in your inference text are in the training set. (I only see 'Degree', 'in', and 'University' repeated.) So in addition to all the issues above, inferring a good vector for the example text would be hard. With a richer training set, you'd likely get better results. It also often helps to increase the steps optional parameter to infer_vector() far above its default of 5.
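Folding the advice above into a corrected training sketch (mine, not the answerer's; it reuses docstring from the question, replaces its tokenize() helper with a simple lowercase split, and uses the parameter names of the gensim version discussed here; newer releases renamed size to vector_size, iter to epochs, and steps to epochs):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=para.lower().split(), tags=[str(i)])  # tags as a list
        for i, para in enumerate(docstring, start=1)]

# One constructor call builds the vocabulary and trains: no manual train() loop.
model = Doc2Vec(docs, dm=0, size=20, min_count=2, iter=20)

vector = model.infer_vector(['degree', 'in', 'philosophy'], steps=50)
print(model.docvecs.most_similar(positive=[vector]))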

How to interpret some syntax (n.adapt, update..) in jags?

I feel very confused by the following syntax in JAGS, for example:
n.iter=100,000
thin=100
n.adapt=100
update(model,1000,progress.bar = "none")
Currently I think
n.adapt=100 means you set the first 100 draws as burn-in,
n.iter=100,000 means the MCMC chain has 100,000 iterations including the burn-in,
I have checked the explanations for this question many times but am still not sure whether my interpretation of n.iter and n.adapt is correct, or how to understand update() and thinning.
Could anyone explain to me?
This answer is based on the package rjags, which takes an n.adapt argument. First I will discuss the meanings of adaptation, burn-in, and thinning, and then I will discuss the syntax (I sense that you are well aware of the meaning of burn-in and thinning, but not of adaptation; a full explanation may make this answer more useful to future readers).
Burn-in
As you probably understand from introductions to MCMC sampling, some number of iterations from the MCMC chain must be discarded as burn-in. This is because prior to fitting the model, you don't know whether you have initialized the MCMC chain within the characteristic set, the region of reasonable posterior probability. Chains initialized outside this region take a finite (sometimes large) number of iterations to find the region and begin exploring it. MCMC samples from this period of exploration are not random draws from the posterior distribution. Therefore, it is standard to discard the first portion of each MCMC chain as "burn-in". There are several post-hoc techniques to determine how much of the chain must be discarded.
Thinning
A separate problem arises because in all but the simplest models, MCMC sampling algorithms produce chains in which successive draws are substantially autocorrelated. Thus, summarizing the posterior based on all iterations of the MCMC chain (post burn-in) may be inadvisable, as the effective posterior sample size can be much smaller than the analyst realizes (note that STAN's implementation of Hamiltonian Monte-Carlo sampling dramatically reduces this problem in some situations). Therefore, it is standard to make inference on "thinned" chains where only a fraction of the MCMC iterations are used in inference (e.g. only every fifth, tenth, or hundredth iteration, depending on the severity of the autocorrelation).
Adaptation
The MCMC samplers that JAGS uses to sample the posterior are governed by tunable parameters that affect their precise behavior. Proper tuning of these parameters can produce gains in the speed or de-correlation of the sampling. JAGS contains machinery to tune these parameters automatically, and does so as it draws posterior samples. This process is called adaptation, but it is non-Markovian; the resulting samples do not constitute a Markov chain. Therefore, burn-in must be performed separately after adaptation. It is incorrect to substitute the adaptation period for the burn-in. However, sometimes only relatively short burn-in is necessary post-adaptation.
Syntax
Let's look at a highly specific example (the code in the OP doesn't actually show where parameters like n.adapt or thin get used). We'll ask rjags to fit the model in such a way that each step will be clear.
n.chains = 3
n.adapt = 1000
n.burn = 10000
n.iter = 20000
thin = 50
my.model <- jags.model("mymodel.txt", data=X, inits=Y, n.chains=n.chains, n.adapt=n.adapt) # X is a list pointing JAGS to where the data are, Y is a vector or function giving initial values
update(my.model, n.burn)
my.samples <- coda.samples(my.model, params, n.iter=n.iter, thin=thin) # params is a list of parameters for which to set trace monitors (i.e. we want posterior inference on these parameters)
jags.model() builds the directed acyclic graph and then performs the adaptation phase for a number of iterations given by n.adapt.
update() performs the burn-in on each chain by running the MCMC for n.burn iterations without saving any of the posterior samples (skip this step if you want to examine the full chains and discard a burn-in period post-hoc).
coda.samples() (from the coda package) runs each MCMC chain for the number of iterations specified by n.iter, but it does not save every iteration. Instead, it saves only every nth iteration, where n is given by thin. Again, if you want to determine your thinning interval post-hoc, there is no need to thin at this stage. One advantage of thinning at this stage is that the coda syntax makes it simple to do so; you don't have to understand the structure of the MCMC object returned by coda.samples() and thin it yourself. The bigger advantage to thinning at this stage is realized if n.iter is very large. For example, if autocorrelation is really bad, you might run 2 million iterations and save only every thousandth (thin=1000). If you didn't thin at this stage, you (and your RAM) would need to manipulate an object with three chains of two million numbers each. But by thinning as you go, the final object has only 2 thousand numbers in each chain.

SVM and cross validation

The problem is as follows. When I do support vector machine training, suppose I have already performed cross validation on 10000 training points with a Gaussian kernel and have obtained the best parameter C and \sigma. Now I have another 40000 new training points and since I don't want to waste time on cross validation, I stick to the original C and \sigma that I obtained from the first 10000 points, and train the entire 50000 points on these parameters. Is there any potentially major problem with this? It seems that for C and \sigma in some range, the final test error wouldn't be that bad, and thus the above process seems okay.
There is one major pitfall in such an approach. Both C and sigma are data-dependent. In particular, it can be shown that the optimal C strongly depends on the size of the training set. So once you make your training data 5 times bigger, even if it brings no "new" knowledge, you should still find a new C to get the exact same model as before. So you can follow such a procedure, but keep in mind that the best parameters for the smaller training set do not have to be the best for the bigger one (even though they sometimes still are).
To see the bigger picture: if this procedure were fully "ok", then why not fit C on even smaller data? 5 times smaller? 25 times smaller? Maybe on one single point per class? 10,000 may seem like "a lot", but it depends on the problem considered. In many real-life domains this is just a "regular" (biology) or even "very small" (finance) dataset, so you won't be sure your procedure is fine for this particular problem until you test it.
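If you do decide to retune, a hedged sketch (mine; X_all and y_all are hypothetical names for the full 50,000-point set, and gamma plays the role of sigma in scikit-learn):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs=-1)
search.fit(X_all, y_all)  # refit the search on all 50,000 points
print(search.best_params_)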
