TensorFlow LSTM State Switches from Tuple to Tensor - python-3.x

I'm moving my comments from https://github.com/tensorflow/tensorflow/issues/8833 to StackOverflow as SO seems more appropriate.
I'm attempting to implement a sequence to sequence model using tensorflow.contrib.seq2seq and tensorflow.contrib.rnn's BasicLSTMCell. Within rnn_cell_impl.py, the line c, h = state causes the following error:
TypeError: 'Tensor' object is not iterable.
When stepping through the code, I learned that the error is raised the third time c, h = state is evaluated. The first two times, state has type <class 'tensorflow.python.ops.rnn_cell_impl.LSTMStateTuple'>, but on the third time, state has type <class 'tensorflow.python.framework.ops.Tensor'>. Clearly, I want the third time to have type LSTMStateTuple, but I have no idea what might be causing the switch.
The problematic state tensor's name is define_model/define_decoder/decoder/while/Identity_3. I wrote the methods define_model() and define_decoder(), and the remaining information suggests that something is happening inside my decoder.
In case it's relevant, I'm using Python 3.6 and Tensorflow 1.2.

The answer can be found at the above linked Github issue page.
To briefly summarize, the problem was that my encoder used a bidirectional RNN, which produces a 2-tuple of LSTMStateTuples, i.e. one c state and one h state for each directional RNN. The decoder, however, accepts a single cell, which has a single LSTMStateTuple associated with it. To solve this problem, you need to separately concatenate the c states and the h states of the two directional RNNs, wrap the results in a new LSTMStateTuple, and pass that to the decoder as its state.
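The fix described above might be sketched as follows. This is a NumPy illustration only: the LSTMStateTuple namedtuple is a stand-in for tf.contrib.rnn.LSTMStateTuple, the arrays stand in for tensors, and the batch size and unit count are made up. In TensorFlow 1.x the concatenation would be done with tf.concat along the feature axis instead of np.concatenate.

```python
from collections import namedtuple

import numpy as np

# Stand-in for tf.contrib.rnn.LSTMStateTuple (assumption: same field names c and h).
LSTMStateTuple = namedtuple("LSTMStateTuple", ["c", "h"])

batch, units = 4, 8  # made-up sizes for illustration

# A bidirectional encoder yields one state tuple per direction.
fw_state = LSTMStateTuple(c=np.zeros((batch, units)), h=np.ones((batch, units)))
bw_state = LSTMStateTuple(c=np.zeros((batch, units)), h=np.ones((batch, units)))
encoder_state = (fw_state, bw_state)


def merge_bidirectional_state(states):
    """Concatenates the c states and the h states of both directions separately."""
    fw, bw = states
    return LSTMStateTuple(
        c=np.concatenate([fw.c, bw.c], axis=1),
        h=np.concatenate([fw.h, bw.h], axis=1),
    )


decoder_initial_state = merge_bidirectional_state(encoder_state)
print(decoder_initial_state.c.shape, decoder_initial_state.h.shape)
```

Note that the merged state is twice as wide as each encoder state, so the decoder cell would need 2 * units units for the shapes to match.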

I think a similar answer can be found here.
The code converts cudnn cell state to tensorflow internal state.
See this method
def cudnn_lstm_state_to_state_tuples(cudnn_lstm_state):

Related

Is there a way to supply a numerical function to JiTCODE’s function argument instead of symbolic one?

I am getting a function (a learned dynamical system) through a neural network and want to pass it to JiTCODE to calculate trajectories, Lyapunov exponents, etc. As per the JiTCODE documentation, the function f has to be a symbolic function. Is there any way to change this since ultimately JiTCODE is going to lambdify the symbolic function?
Basically, this is what I'm doing right now:
# learns derivatives from the neural net model
# returns an array of numbers [\dot{x},\dot{y}] for input [x,y]
learned_fn = lambda t, y0: NN_model(t, y0)
ODE = jitcode_lyap(learned_fn, n_lyap=2)
ODE.set_integrator("vode")
First beware that JiTCODE does not take regular functions like your learned_fn as an input. It takes either iterables of symbolic expressions or generator functions returning symbolic expressions. This is why your example code will likely produce an error.
What you are asking for
You can “inject” any derivative with the right signature into JiTCODE by changing the f property and telling it that it failed compiling the actual derivative. Here is a minimal example doing this:
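from jitcode import jitcode, y

# Placeholder symbolic derivative; it is overridden below.
ODE = jitcode([0])
# Inject a numerical derivative (here dy/dt = y) with the signature f(t, y).
ODE.f = lambda t,y: y[0]
# Tell JiTCODE that compiling the actual derivative already failed.
ODE.compile_attempt = False
ODE.set_integrator("dopri5")
ODE.set_initial_value([1],0.0)
for time in range(30):
    print(time,*ODE.integrate(time))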
Why you probably do not want to do this
Ignoring Lyapunov exponents for a second, the entire point of JiTCODE is to hard-code your derivative for you and pass it to SciPy’s ode or solve_ivp, which perform the actual integration. Thus the above example code is just an overly complicated way of passing a function to one of SciPy’s standard integrators (here ode), with no advantage. If your NN_model is very efficiently implemented in the first place, you may not even gain a speed boost from JiTCODE’s auto-compilation.
The main reason to use JiTCODE’s Lyapunov-exponent capabilities is that it automatically obtains the Jacobian and the ODE for the tangent-vector evolution (needed for the Benettin method) from the symbolic representation of the derivative. Without a symbolic input, it cannot possibly do this. You could theoretically inject a tangent-vector ODE as well, but then again you would leave little for JiTCODE to do and you would probably be better off using SciPy’s ode or solve_ivp directly.
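For comparison, handing a numerical derivative straight to SciPy might look like the sketch below. The function learned_fn is a hypothetical stand-in for NN_model (it simply implements dy/dt = y), and the tolerances are arbitrary choices.

```python
import numpy as np
from scipy.integrate import solve_ivp


def learned_fn(t, y):
    """Hypothetical stand-in for NN_model: returns the derivative dy/dt = y."""
    return np.asarray(y)


# Integrate from t = 0 to t = 1, starting at y(0) = 1.
solution = solve_ivp(learned_fn, t_span=(0.0, 1.0), y0=[1.0], rtol=1e-8, atol=1e-10)
final_value = solution.y[0, -1]
print(solution.t[-1], final_value)
```

No symbolic representation is involved here, which is also why nothing in this route can derive a Jacobian for you.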
What you probably need
If you want to use JiTCODE, you need to write a small piece of code that translates the output of your neural-network training to a symbolic representation of your ODE as needed by JiTCODE. This is probably much less scary than it sounds. You just need to obtain the trained coefficients and insert them into the equations of the general form of the neural network.
If you are lucky and your NN_model fully supports duck typing (and ), you may do something like this:
from jitcode import t,y
n = 10 # dimension of your ODE
NN_input = [y(i) for i in range(n)]
learned_fn = NN_model(t,NN_input)[1]
The idea is that you feed NN_model once with abstract symbolic input (t and NN_input). NN_model then once acts on this abstract input providing you an abstract result (here you need the duck-typing support). If I interpreted the output of your NN_model correctly, the second component of this result should be the abstract derivative as required by JiTCODE as an input.
Note that your NN_model appears to expect dimensions to be indices, but JiTCODE’s y expects dimensions to be function arguments. Thus you cannot just choose NN_input = y, but you have to transform it as above.
To quote directly from the linked documentation
JiTCODE takes an iterable (or generator function or dictionary) of symbolic expressions, which it translates to C code, compiles on the fly,
so there is no lambdification going on, the function is parsed, not just evaluated.
But in general that should be no problem: you just use the JiTCODE-provided symbolic vector y and symbol t instead of the function arguments t, y of the right-hand side of the ODE.

Random Index from a Tensor (Sampling with Replacement from a Tensor)

I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
[*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and using numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to map an index in the flattened tensor back to an index in its original shape. Further, flattening millions of weights would be a somewhat slow implementation.
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about reservoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
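For weight arrays with a regular shape (not the ragged case above), nothing has to be flattened in memory: a flat position can be drawn from the total number of weights and mapped back to a layer and a multi-dimensional index. The sketch below assumes the weights are described by a list of shapes; the function name and the example shapes are made up for illustration.

```python
import numpy as np


def sample_weight_index(shapes, rng):
    """Draws one weight uniformly over all layers; returns (layer, index tuple)."""
    sizes = [int(np.prod(shape)) for shape in shapes]
    offsets = np.cumsum(sizes)  # offsets[k] = number of weights in layers 0..k
    flat = int(rng.integers(offsets[-1]))  # uniform over every single weight
    layer = int(np.searchsorted(offsets, flat, side="right"))
    local = flat - (int(offsets[layer - 1]) if layer > 0 else 0)
    # np.unravel_index inverts the flattening for a regular shape.
    index = tuple(int(k) for k in np.unravel_index(local, shapes[layer]))
    return layer, index


shapes = [(2, 3), (3, 4)]  # hypothetical layer shapes
rng = np.random.default_rng(0)
layer, index = sample_weight_index(shapes, rng)
print(layer, index)
```

Because the draw is made over the total count, a layer with twice as many weights is hit twice as often, which is what makes the selection fair per weight.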
Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
Use a total ordering on scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this pseudocode (in Python-like syntax):
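from typing import Tuple

def count_scalars(tensor) -> int:
    """Counts the scalar elements in a (possibly ragged) nested list."""
    if not isinstance(tensor, (list, tuple)):
        return 1
    return sum(count_scalars(sub_list) for sub_list in tensor)

def sample_tensor(tensor, chosen_index: int) -> Tuple[int, ...]:
    """Maps a chosen random number to its index in the given tensor.
    Args:
        tensor: A ragged-array n-tensor (nested lists of scalars).
        chosen_index: An integer in [0, num_scalar_elements_in_tensor).
    Returns:
        The index that accesses this element in the tensor.
    NOTE: Untested sketch; recounting every sub-list makes it slow.
    """
    remaining = chosen_index
    for (i, sub_list) in enumerate(tensor):
        if not isinstance(sub_list, (list, tuple)):
            # tensor is a flat list of scalars: index it directly.
            return (remaining,)
        size = count_scalars(sub_list)
        if remaining >= size:
            # The element lies in a later sibling; skip this sub-list.
            remaining -= size
        else:
            return (i,) + sample_tensor(sub_list, remaining)
    raise IndexError("chosen_index exceeds the number of scalars")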
First of all, I'm aware this isn't a sound algorithm. The idea is to count down until you reach your element, with bookkeeping for indices.
We need to make crucial assumptions here. 1) All lists will eventually contain only scalars. 2) By direct consequence, if a list contains lists, assume that it also doesn't contain scalars at the same level. (Stop and convince yourself for (2).)
We also need to make a critical note here too: We are unable to measure the number of scalars in any given list, unless the list is homogeneously consisting of scalars. In order to avoid measuring this magnitude at every point, my algorithm above should be refactored to descend first, and subtract later.
This algorithm has some consequences:
It's the fastest in its entire style of approaching the problem. If you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of preceding scalar elements along the total ordering of the tensor. This is effectively bound at Theta(l) where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's slow. It's too slow compared to sampling nicer tensors that have a defined shape to them.
It begs the question: can we do better? See the next solution.
Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time what the distribution of scalars is already like, we can sample fairly at each level of descending the tensor. The hard problem here is building this distribution.
I won't write pseudocode for this, but lay out some objectives:
This can be called only once to build the data structure.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution respective to sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
This is also faster at O(n) where n is the rank of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
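The objectives listed for the second approach could be sketched roughly as below. This is one possible reading, not the solution the team actually implemented: the names build_counts and sample_index are made up, element counts are stored instead of normalized probabilities, and the standard library's random.choices is swapped in for numpy.random.choice to do the weighted draw.

```python
import random


def build_counts(tensor):
    """Builds the structure once: (total scalars, child structures or None)."""
    if not tensor or not isinstance(tensor[0], (list, tuple)):
        # A flat list of scalars: storing its length is enough.
        return len(tensor), None
    children = [build_counts(sub_list) for sub_list in tensor]
    return sum(child[0] for child in children), children


def sample_index(counts, rng=random):
    """Descends the structure, choosing each sibling in proportion to its size."""
    total, children = counts
    if children is None:
        return (rng.randrange(total),)
    weights = [child[0] for child in children]
    i = rng.choices(range(len(children)), weights=weights)[0]
    return (i,) + sample_index(children[i], rng)


tensor = [[1, 2, 3], [4, 5, 6, 7]]  # the ragged example from the question
counts = build_counts(tensor)
print(sample_index(counts))
```

Each draw makes one weighted choice per level, so its cost depends on the depth and on the number of siblings at each level rather than on the total number of scalars.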
I hope this helps.

Doubts regarding `Understanding Keras LSTMs`

I am new to LSTMs and going through the Understanding Keras LSTMs and had some silly doubts related to a beautiful answer by Daniel Moller.
Here are some of my doubts:
There are 2 ways specified under the Achieving one to many section where it’s written that we can use stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features).
In the One to many with repeat vector diagram, the repeated vector is fed as input in all the time-step, whereas in the One to many with stateful=True the output is fed as input in the next time step. So, aren't we changing the way the layers work by using the stateful=True?
Which of the above 2 approaches (using the repeat vector OR feeding the previous time-step output as the next input) should be followed when building an RNN?
Under the One to many with stateful=True section, to change the behaviour of one to many, in the code for the manual prediction loop, how will we know the value of the steps_to_predict variable, given that we don't know the output sequence length in advance?
I also did not understand how the entire model uses the last_step output to generate the next_step output. It has confused me about how the model.predict() function works. I mean, doesn't model.predict() predict the entire output sequences at once, rather than looping through the number of output sequences (whose value I still don't know) to be generated and calling model.predict() to predict one specific time-step output per iteration?
I couldn't understand the entire of Many to many case. Any other link would be helpful.
I understand that we use model.reset_states() to make sure that a new batch is independent of the previous batch. But do we manually create batches of sequences such that one batch follows another, or does Keras in stateful=True mode automatically divide the sequence into such batches?
If it's done manually then, why would anyone divide the dataset into such batches in which a part of a sequence is in one batch and the other in the next batch?
At last, what are the practical implementation or examples/use-cases where stateful=True would be used(because this seems to be something unusual)? I am learning LSTMs and this is the first time I've been introduced to stateful in Keras.
Can anyone help me in explaining my silly questions so that I can be clear on LSTM implementation in Keras?
EDIT: Asking some of these for clarification of the current answer and some for the remaining doubts
A. So, basically stateful lets us keep OR reset the inner state after every batch. Then how would the model learn if we keep resetting the inner state after each trained batch? Does resetting truly mean resetting the parameters (used in computing the hidden state)?
B. In the line If stateful=False: automatically resets inner state, resets last output step. What did you mean by resetting the last output step? I mean, if every time-step produces its own output then what does resetting of last output step mean and that too only the last one?
C. In response to Question 2 and 2nd point of Question 4, I still didn't get your manipulate the batches between each iteration and the need of stateful((last line of Question 2) which only resets the states). I got to the point that we don't know the input for every output generated in a time-step.
So, you break the sequences into sequences of only one-step and then use new_step = model.predict(last_step) but then how do you know about how long do you need to do this again and again(there must be a stopping point for the loop)? Also, do explain the stateful part( in the last line of Question 2).
D. In the code under One to many with stateful=True, it seems that the for loop (manual loop) for predicting the next word is used only at test time. Does the model incorporate this itself at train time, or do we need to use this loop manually at train time as well?
E. Suppose we are doing some machine translation job, I think the breaking of sequences will occur after the entire input(language to translate) has been fed to the input time-steps and then generation of outputs(translated language) at each time-step is going to take place via the manual loop because now we are ended up with the inputs and starting to produce output at each time-step using the iteration. Did I get it right?
F. As the default working of LSTMs requires 3 things mentioned in the answer, so in case of breaking of sequences, are current_input and previous_output fed with same vectors because their value in case of no current input being available is same?
G. Under the many to many with stateful=True under the Predicting: section, the code reads:
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:]
Since, the manual loop of finding the very next word in the current sequence hasn't been used up till now, how do I know the count of the time-steps that has been predicted by the model.predict(totalSequences) so that the last step from predicted(predicted[:,-1:]) will then later be used for generating the rest of the sequences? I mean, how do I know the number of sequences that have been produced in the predicted = model.predict(totalSequences) before the manual for loop (later used).
EDIT 2:
I. In D answer I still didn't get how will I train my model? I understand that using the manual loop(during training) can be quite painful but then if I don't use it how will the model get trained in the circumstances where we want the 10 future steps, we cannot output them at once because we don't have the necessary 10 input steps? Will simply using model.fit() solve my problem?
II. D answer's last para, You could train step by step using train_on_batch only in the case you have the expected outputs of each step. But otherwise I think it's very complicated or impossible to train..
Can you explain this in more detail?
What does step by step mean? If I don't have OR have the output for the later sequences , how will that affect my training? Do I still need the manual loop during training. If not, then will the model.fit() function work as desired?
III. I interpreted the "repeat" option as using the repeat vector. Wouldn't using the repeat vector be just good for the one to many case and not suitable for the many to many case because the latter will have many input vectors to choose from(to be used as a single repeated vector) ? How will you use the repeat vector for the many to many case?
Question 3
Understanding the question 3 is sort of a key to understand the others, so, let's try it first.
All recurrent layers in Keras perform hidden loops. These loops are totally invisible to us, but we can see the results of each iteration at the end.
The number of invisible iterations is equal to the time_steps dimension. So, the recurrent calculations of an LSTM happen regarding the steps.
If we pass an input with X steps, there will be X invisible iterations.
Each iteration in an LSTM will take 3 inputs:
The respective slice of the input data for this step
The inner state of the layer
The output of the last iteration
So, take the following example image, where our input has 5 steps:
What will Keras do in a single prediction?
Step 0:
Take the first step of the inputs, input_data[:,0,:] a slice shaped as (batch, 2)
Take the inner state (which is zero at this point)
Take the last output step (which doesn't exist for the first step)
Pass through the calculations to:
Update the inner state
Create one output step (output 0)
Step 1:
Take the next step of the inputs: input_data[:,1,:]
Take the updated inner state
Take the output generated in the last step (output 0)
Pass through the same calculation to:
Update the inner state again
Create one more output step (output 1)
Step 2:
Take input_data[:,2,:]
Take the updated inner state
Take output 1
Pass through:
Update the inner state
Create output 2
And so on until step 4.
Finally:
If stateful=False: automatically resets inner state, resets last output step
If stateful=True: keep inner state, keep last output step
You will not see any of these steps. It will look like just a single pass.
But you can choose between:
return_sequences = True: every output step is returned, shape (batch, steps, units)
This is exactly many to many. You get the same number of steps in the output as you had in the input
return_sequences = False: only the last output step is returned, shape (batch, units)
This is many to one. You generate a single result for the entire input sequence.
Now, this answers the second part of your question 2: Yes, predict will compute everything without you noticing. But:
The number of output steps will be equal to the number of input steps
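The hidden loop described in this section can be mimicked in plain NumPy. The sketch below is an assumption-laden illustration, not Keras's implementation: the weights are random, the gate layout is a common textbook one, and the sizes are made up. It only shows how each iteration consumes one input slice, the inner state and the last output, and how return_sequences changes the returned shape.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(inputs, W, U, b, return_sequences=False):
    """inputs: (batch, steps, features); W: (features, 4*units); U: (units, 4*units)."""
    batch, steps, _ = inputs.shape
    units = U.shape[0]
    c = np.zeros((batch, units))  # inner state, zero before step 0
    h = np.zeros((batch, units))  # last output, zero before step 0
    outputs = []
    for t in range(steps):  # one invisible iteration per time step
        z = inputs[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the inner state
        h = sigmoid(o) * np.tanh(c)  # create one output step
        outputs.append(h)
    return np.stack(outputs, axis=1) if return_sequences else h


rng = np.random.default_rng(0)
batch, steps, features, units = 3, 5, 2, 4  # made-up sizes
x = rng.normal(size=(batch, steps, features))
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

all_steps = lstm_forward(x, W, U, b, return_sequences=True)
last_step_only = lstm_forward(x, W, U, b, return_sequences=False)
print(all_steps.shape, last_step_only.shape)
```

With 5 input steps the loop runs 5 times, so the full output also has 5 steps, and the single returned step in the second call is the last of them.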
Question 4
Now, before going to the question 2, let's look at 4, which is actually the base of the answer.
Yes, the batch division should be done manually. Keras will not change your batches. So, why would I want to divide a sequence?
1, the sequence is too big, one batch doesn't fit the computer's or the GPU's memory
2, you want to do what is happening in question 2: manipulate the batches between each step iteration.
Question 2
In question 2, we are "predicting the future". So, what is the number of output steps? Well, it's the number you want to predict. Suppose you're trying to predict the number of clients you will have based on the past. You can decide to predict for one month in the future, or for 10 months. Your choice.
Now, you're right to think that predict will calculate the entire thing at once, but remember question 3 above where I said:
The number of output steps is equal to the number of input steps
Also remember that the first output step is result of the first input step, the second output step is result of the second input step, and so on.
But we want the future, not something that matches the previous steps one by one. We want the result step to follow the "last" step.
So, we face a limitation: how to define a fixed number of output steps if we don't have their respective inputs? (The inputs for the distant future are also future, so, they don't exist)
That's why we break our sequence into sequences of only one step. So predict will also output only one step.
When we do this, we have the ability to manipulate the batches between each iteration. And we have the ability to take output data (which we didn't have before) as input data.
And stateful is necessary because we want that each of these steps be connected as a single sequence (don't discard the states).
Question 5
The best practical application of stateful=True that I know is the answer of question 2. We want to manipulate the data between steps.
This might be a dummy example, but another application is if you're for instance receiving data from a user on the internet. Each day the user uses your website, you give one more step of data to your model (and you want to continue this user's previous history in the same sequence).
Question 1
Then, finally question 1.
I'd say: always avoid stateful=True, unless you need it.
You don't need it to build a one to many network, so, better not use it.
Notice that the stateful=True example for this is the same as the predict the future example, but you start from a single step. It's hard to implement, it will have worse speed because of manual loops. But you can control the number of output steps and this might be something you want in some cases.
There will be a difference in calculations too. And in this case I really can't answer if one is better than the other. But I don't believe there will be a big difference. But networks are some kind of "art", and testing might bring funny surprises.
Answers for EDIT:
A
We should not confuse "states" with "weights". They're two different variables.
Weights: the learnable parameters, they're never reset. (If you reset the weights, you lose everything the model learned)
States: current memory of a batch of sequences (relates to which step on the sequence I am now and what I have learned "from the specific sequences in this batch" up to this step).
Imagine you are watching a movie (a sequence). Every second makes you build memories like the name of the characters, what they did, what their relationship is.
Now imagine you get a movie you never saw before and start watching the last second of the movie. You will not understand the end of the movie because you need the previous story of this movie. (The states)
Now imagine you finished watching an entire movie. Now you will start watching a new movie (a new sequence). You don't need to remember what happened in the last movie you saw. If you try to "join the movies", you will get confused.
In this example:
Weights: your ability to understand and interpret movies, ability to memorize important names and actions
States: on a paused movie, states are the memory of what happened from the beginning up to now.
So, states are "not learned". States are "calculated", built step by step regarding each individual sequence in the batch. That's why:
resetting states means starting new sequences from step 0 (starting a new movie)
keeping states means continuing the same sequences from the last step (continuing a movie that was paused, or watching part 2 of that story)
States are exactly what make recurrent networks work as if they had "memory from the past steps".
B
In an LSTM, the last output step is part of the "states".
An LSTM state contains:
a memory matrix updated every step by calculations
the output of the last step
So, yes: every step produces its own output, but every step uses the output of the last step as state. This is how an LSTM is built.
If you want to "continue" the same sequence, you want memory of the last step results
If you want to "start" a new sequence, you don't want memory of the last step results (these results will stay stored if you don't reset states)
C
You stop when you want. How many steps in the future do you want to predict? That's your stopping point.
Imagine I have a sequence with 20 steps. And I want to predict 10 steps in the future.
In a standard (non stateful) network, we can use:
input 19 steps at once (from 0 to 18)
output 19 steps at once (from 1 to 19)
This is "predicting the next step" (notice the shift = 1 step). We can do this because we have all the input data available.
But when we want the 10 future steps, we cannot output them at once because we don't have the necessary 10 input steps (these input steps are future, we need the model to predict them first).
So we need to predict one future step from existing data, then use this step as input for the next future step.
But I want all of these steps to be connected. If I use stateful=False, the model will see a lot of "sequences of length 1". No, we want one sequence of length 30.
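The 20-known-steps, 10-future-steps example can be walked through with a toy stand-in. ToyStatefulModel below is hypothetical and is not a Keras model: it only extrapolates linearly from the previous input it remembers, which is enough to show the shifted training pairs, the warm-up on known data, and the one-step loop that feeds each output back in as the next input.

```python
class ToyStatefulModel:
    """Hypothetical stand-in for a stateful recurrent model (not Keras)."""

    def __init__(self):
        self.prev = None  # state kept between predict() calls

    def reset_states(self):
        self.prev = None

    def predict(self, steps):
        """Returns one output per input step, like a many-to-many layer."""
        outputs = []
        for x in steps:
            slope = 0.0 if self.prev is None else x - self.prev
            outputs.append(x + slope)  # guess for the step after x
            self.prev = x
        return outputs


sequence = [2.0 * k for k in range(20)]  # 20 known steps
# Shifted pairs for "predicting the next step": 19 inputs, 19 targets.
inputs, targets = sequence[:-1], sequence[1:]

model = ToyStatefulModel()
model.reset_states()  # start a new sequence
predicted = model.predict(sequence)  # warm up the state on all known steps
last_step = predicted[-1:]  # only the last output is a new, future step

future = list(last_step)
steps_to_predict = 10  # the stopping point is simply how far you want to go
for _ in range(steps_to_predict - 1):
    last_step = model.predict(last_step)  # output becomes the next input
    future.append(last_step[0])

print(len(sequence) + len(future), future)
```

If reset_states() were called inside the loop, the toy model would forget the previous input and treat every step as the start of a new sequence of length 1.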
D
This is a very good question and you got me ....
The stateful one to many was an idea I had when writing that answer, but I never used this. I prefer the "repeat" option.
You could train step by step using train_on_batch only in the case you have the expected outputs of each step. But otherwise I think it's very complicated or impossible to train.
E
That's one common approach.
Generate a condensed vector with a network (this vector can be a result, or the states generated, or both things)
Use this condensed vector as initial input/state of another network, generate step by step manually and stop when a "end of sentence" word or character is produced by the model.
There are also fixed size models without the manual loop. You suppose your sentence has a maximum length of X words. The result sentences that are shorter than this are completed with "end of sentence" or "null" words/characters. A Masking layer is very useful in these models.
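The fixed-size variant relies on every target sentence having the same length. A minimal sketch of that padding step, assuming sentences are lists of word strings and using a made-up "<eos>" token as the filler:

```python
def pad_sentences(sentences, max_len, pad_token="<eos>"):
    """Completes each sentence with pad_token up to max_len words."""
    padded = []
    for words in sentences:
        if len(words) > max_len:
            raise ValueError("sentence is longer than max_len")
        padded.append(list(words) + [pad_token] * (max_len - len(words)))
    return padded


sentences = [["hello"], ["how", "are", "you"]]
padded = pad_sentences(sentences, max_len=3)
print(padded)
```

The padded positions are the steps that a Masking layer would be configured to skip.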
F
You provide only the input. The other two things (last output and inner states) are already stored in the stateful layer.
I made the input = last output only because our specific model is predicting the next step. That's what we want it to do. For each input, the next step.
We taught this with the shifted sequence in training.
G
It doesn't matter. We want only the last step.
The number of sequences is kept by the first :.
And only the last step is considered by -1:.
But if you want to know, you can print predicted.shape. It is equal to totalSequences.shape in this model.
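The slicing can be checked on a dummy array. In the sketch below the zeros array merely stands in for the result of model.predict(totalSequences), and the sizes are made up.

```python
import numpy as np

# (sequences, steps, features); sizes are made up for illustration.
total_sequences = np.zeros((3, 20, 2))
# Stand-in for model.predict(total_sequences): same shape as the input here.
predicted = np.zeros_like(total_sequences)

first_new_step = predicted[:, -1:]  # keep every sequence, keep only the last step
print(predicted.shape, first_new_step.shape)
```

Writing -1: instead of -1 keeps the steps axis with length 1, so the result still has three dimensions and can be fed back to a model that expects (sequences, steps, features).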
Edit 2
I
First, we can't use "one to many" models to predict the future, because we don't have data for that. There is no possibility to understand a "sequence" if you don't have the data for the steps of the sequence.
So, this type of model should be used for other types of applications. As I said before, I don't really have a good answer for this question. It's better to have a "goal" first, then we decide which kind of model is better for that goal.
II
With "step by step" I mean the manual loop.
If you don't have the outputs of later steps, I think it's impossible to train. It's probably not a useful model at all. (But I'm not the one that knows everything)
If you have the outputs, yes, you can train the entire sequences with fit without worrying about manual loops.
III
And you're right about III. You won't use repeat vector in many to many because you have varying input data.
"One to many" and "many to many" are two different techniques, each one with their advantages and disadvantages. One will be good for certain applications, the other will be good for other applications.

Difference between apply node and op node in theano

Theano beginner here. I was just going through the graph structures section on deeplearning.net and I have a doubt.
It is stated in the tutorial that "Apply node represents the application of an op to some variables. It is important to draw the difference between the definition of a computation represented by an op and its application to some actual data which is represented by the apply node."
In theano, the application of a computation to data is performed by first creating a function and plugging the appropriate value in f(). Where does the op node come into the picture ?
An Op in Theano is a computation, such as add, dot, or convolution. We define all our work in terms of Ops and then compile it into a specific graph with theano.function(). Look here.
Theano represents symbolic mathematical computations as graphs. These
graphs are composed of interconnected Apply, Variable and Op nodes.
Apply node represents the application of an op to some variables.
I am not sure this answers your question. Let me know if it's still unclear.
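The distinction may be easier to see in a toy graph. The classes below are a simplified illustration written for this purpose and are not Theano's implementation: an Op only defines a computation, while every call on concrete variables creates a new Apply node that records which op was applied to which inputs.

```python
class Variable:
    def __init__(self, name, owner=None):
        self.name = name
        self.owner = owner  # the Apply node that produced it; None for inputs


class Apply:
    """One application of an op to specific input variables."""

    def __init__(self, op, inputs):
        self.op = op
        self.inputs = inputs
        self.output = Variable(op.name + "_out", owner=self)


class Op:
    """The definition of a computation, independent of any data."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __call__(self, *inputs):
        return Apply(self, list(inputs)).output


def evaluate(variable, values):
    """Walks the graph backwards and plugs in actual numbers."""
    if variable.owner is None:
        return values[variable.name]
    args = [evaluate(v, values) for v in variable.owner.inputs]
    return variable.owner.op.fn(*args)


add = Op("add", lambda a, b: a + b)  # a single op ...
x, y, z = Variable("x"), Variable("y"), Variable("z")
partial_sum = add(x, y)  # ... applied once here
total = add(partial_sum, z)  # ... and a second time here

print(evaluate(total, {"x": 1, "y": 2, "z": 3}))
```

The graph contains two Apply nodes but only one Op: both applications point to the same add object, each with its own inputs and output.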

Dummy Coding of Nominal Attributes - Effect of Using K Dummies, Effect of Attribute Selection

Summing up my understanding of the topic: 'dummy coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using K dummies would cause redundancy and would have a negative impact on, e.g., logistic regression, as far as I have learned. So far, everything is clear to me.
Yet, two issues are unclear to me:
1) Bearing in mind the issue stated above, I am confused that the 'Logistic' classifier in WEKA actually uses K dummies (see picture). Why would that be the case?
2) An issue arises as soon as I consider attribute selection. If all dummies are used in the model, the left-out attribute value is implicitly included as the case where all dummies are zero. If one dummy is missing (because it was not selected in attribute selection), the left-out value is no longer clearly included. The issue is much easier to understand with the sketch I uploaded. How can that issue be treated?
Secondly
Images
WEKA Output: The Logistic algorithm was run on the UCI dataset German Credit, where the possible values of the first attribute are A11,A12,A13,A14. All of them are included in the logistic regression model. http://abload.de/img/bildschirmfoto2013-089out9.png
Decision Tree Example: Sketch showing the issue when it comes to running decision trees on datasets with dummy-coded instances after attribute selection. http://abload.de/img/sketchziu5s.jpg
The output is generally easier to read, interpret and use when you use k dummies instead of k-1 dummies. I figure that is why everybody seems to actually use k dummies.
But yes, as the k values sum up to 1, there exists a correlation that may cause problems. But correlations in data sets are common, you will never completely get rid of them!
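Both codings, and the effect of dropping a dummy afterwards, can be shown on the four values of the first German Credit attribute. The helper dummy_encode is made up for this illustration and is unrelated to how WEKA encodes attributes internally.

```python
def dummy_encode(value, categories, drop_first=False):
    """Encodes value as K binary dummies, or K-1 if the first category is left out."""
    kept = categories[1:] if drop_first else categories
    return [1 if value == category else 0 for category in kept]


categories = ["A11", "A12", "A13", "A14"]

# K dummies: every row sums to 1, which is the redundancy discussed above.
k_dummies = {v: dummy_encode(v, categories) for v in categories}

# K-1 dummies: the left-out value A11 is the row where all dummies are zero.
k_minus_1 = {v: dummy_encode(v, categories, drop_first=True) for v in categories}

# Attribute selection that additionally drops the A12 dummy (first kept column).
after_selection = {v: row[1:] for v, row in k_minus_1.items()}

print(k_dummies["A12"], k_minus_1["A11"], after_selection["A12"])
```

After the selection step, A11 and A12 have identical rows, so a model can no longer tell them apart. This is the situation described in the question's second issue.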
I believe feature selection and dummy coding just don't fit together. Selecting among the dummies amounts to dropping some values from the attribute. Why do you insist on doing feature selection?
You really should be using weighting, or consider more advanced algorithms that can handle such data. In fact the dummy variables can cause just as much trouble, because they are binary, and oh so many algorithms (e.g. k-means) don't make much sense on binary variables.
As for the decision tree: don't perform feature selection on your output attribute...
Plus, as a decision tree already selects features, it does not make sense to do all this anyway... leave it to the decision tree to decide upon which attribute to use for splitting. This way, it can learn dependencies, too.
