I have a simple-seeming problem, but in practice it seems to be more involved. In python, for example, it seems like it would be much more straightforward. But I would really like to learn how to do this in Stata.
Say that I have a big dataset. I have several string variables, S1, S2, and S3. I get a subset of S1 based on some criteria. Let's say that this gets me (after sorting and only the data of interest are displayed):
S1
1 A
2 B
3 C
4 D
5 E
Based on different criteria, I get, for S2:
S2
1 B
2 B
3 C
4 F
For S3:
S3
1 B
2 Long string
What I am interested in doing is to get a list of all of the distinct values across S1, S2, and S3. One way I have thought about doing this is:
Save all desired values of S1 into a macro, M1. I didn't figure out how one is able to do this.
Save all desired values of S2 into a macro, M2.
Check if the values of M2 are in M1. Do not add the values of M2 to M1 that are already in M1, but do add the values of M2 to M1 that are not already there. It seems like this post is similar to how to do this step. (Why is there a : in front of list?)
Repeat step 3, except for S3/M3 instead of S2/M2.
This would produce the macro M1 with values:
A B C D E F Long String
Note that I do not need this to be in a macro. If it could be in a matrix or some other way, that would work as well. The important part is to get the information.
Several ways to do this.
Many assumptions made in this example (many things are not clear in your post):
clear
set more off
input ///
str15(s1 s2 s3)
a "b" "b"
b "b" "long string"
c "c" ""
d "f" ""
e "" ""
end
list
stack s*, into(news) clear
bysort news : keep if _n == 1
drop _stack
list
If you want to work your way through, using macros, then help macrolists and help levelsof can aid:
clear
set more off
input ///
str15(s1 s2 s3)
a "b" "b"
b "b" "long string"
c "c" ""
d "f" ""
e "" ""
end
list
local uvalues
foreach var of varlist _all {
levelsof `var', local(loc`var')
local uvalues : list uvalues | loc`var'
}
display `"`uvalues'"'
Saying more about how your variables are organized (e.g. one or several files), whether you care or not to destroy the original data set, the treatment of missings, etc. can probably get you an ad hoc answer.
Related
I am creating a Product Decoder for a project.
Lets say, Our product can have a code such as "ABCDE" OR a code like "BCDEF".
ABCDE has a table of data that I use to decode using a lookup. For example AB can decode into "Potato" and CDE can decode into "Chip". So any combination with AB can be Potato "Anything else".
BCDER, BC can decode into "Veggie" so DER can code into "Chip".
I also use the 1/search method to take placements for the decode. Example =IFERROR(LOOKUP(2,1/SEARCH($E$19:$E$23,N18),$E$19:$E$23), "")
I concatenate all the decodes using =S2&" "&T2&" "&U2&" "&V2
Question is...if we are getting a huge amount of product code coming that I want to decode into one single column... How do I tell excel to use this table of data for ABCDE if product starts with "A", if not, use table of data that correlates to BCDER when product starts with "B".
Edit 1:
Here is my table, right side is where i look up the Part Number column N"
As you can see on column "W" I concatenate the date is Look up from columns O~V.
Column O Function: =IFERROR(LOOKUP(2,1/SEARCH($C$1:$C$7,N2),$C$1:$C$7), "")
On column N, I have two parts. One that starts with M and one that starts with K which is pretty standard.
Image two is me trying to use the IF Left but, it doesn't really work
=IF(LEFT(AA4,10) = "M ", W2, W18)
So How can I tell my excel page to use Table A1:A12 if part starts with "M*" and vice versa?
Let me know if this is confusing, I will do my best to clear things up.
First, a possible correction
I think this function does not give you what you say it does:
= IFERROR(LOOKUP(2,1/SEARCH($E$19:$E$23,N18),$E$19:$E$23), "")
You might mean:
= IFERROR( LOOKUP( 2, 1/SEARCH( $E$19:$E$23, N18 ), $F$19:$F$23 ), "" )
Because you want to look up the value in column E and return the value in column F. If that's not true, then skip the rest of this answer.
Now the solution
What you're trying to do is change the lookup array if the part number starts with a different letter. So, the IF( LEFT( combo mentioned by #BigBen should be used to modify the lookup array. I think it would look like this:
= IFERROR( LOOKUP( 2
,1/SEARCH( IF( LEFT( AA4, 1 ) = "M"
,$C$2:$C$12
,$C$19:$C$23 )
,N2 )
,IF( LEFT( AA4, 1 ) = "M"
,$D$2:$D$12
,$D$19:$D$23 )
)
,"")
I have a very detailed excel model to calculate the profitability of a project, that we can call P.
The model has been simplified to compute from 3 unrelated variables. I would like to automatically create a table that shows how inputs A, B and C might vary in order to produce a pre-defined level of profitability, P. For instance, if A = 4 & B = 30, then C must = 2 in order for P to equal 20%. Likewise, if A = 5 & B = 25, then C must = 3 in order for P to equal 20%. A and B should be tested at sensible increments, perhaps 8 intervals each.
A laborious (not scalable) equivalent would be to manually define A and B, then goal-seek C to our pre-defined level of P - we'd then repeat for each combination of A and B at the given intervals and record in a two-way table.
I believe a conventional two-way data table would be pratical if the model sitting behind the inputs were greatly simplified, unfortunately this isn't possible.
Thanks to anyone that can lend a hand. Kind regards.
I think the best way to approach this will be with a VBA macro and the prebuilt GoalSeek Function something like this (p is in cell D1) :
Range(”D1”).GoalSeek Goal:=20 _
ChangingCell:=Range(“C1”)
I want to find the common elements in multiple (>=2) cell arrays of strings.
A related question is here, and the answer proposes to use the function intersect(), however it works for only 2 inputs.
In my case, I have more than two cells, and I want to obtain a single common subset. Here is an example of what I want to achieve:
c1 = {'a','b','c','d'}
c2 = {'b','c','d'}
c3 = {'c','d'}
c_common = my_fun({c1,c2,c3});
in the end, I want c_common={'c','d'}, since only these two strings occur in all the inputs.
How can I do this with MATLAB?
Thanks in advance,
P.S. I also need the indices from each input, but I can probably do that myself using the output c_common, so not necessary in the answer. But if anyone wants to tackle that too, my actual output will be like this:
[c_common, indices] = my_fun({c1,c2,c3});
where indices = {[3,4], [2,3], [1,2]} for this case.
Thanks,
Listed in this post is a vectorized approach to give us the common strings and indices using unique and accumarray. This would work even when the strings are not sorted within each cell array to give us indices corresponding to their positions within it, but they have to be unique. Please have a look at the sample input, output section* to see such a case run. Here's the implementation -
C = {c1,c2,c3}; % Add more cell arrays here
% Get unique strings and ID each of the strings based on their uniqueness
[unqC,~,unqID] = unique([C{:}]);
% Get count of each ID and the IDs that have counts equal to the number of
% cells arrays in C indicate that they are present in all cell arrays and
% thus are the ones to be finally selected
match_ID = find(accumarray(unqID(:),1)==numel(C));
common_str = unqC(match_ID)
% ------------ Additional work to get indices ----------------
N_str = numel(common_str);
% Store matches as a logical array to be used at later stages
matches = ismember(unqID,match_ID);
% Use ismember to find all those indices in unqID and subtract group
% lengths from them to give us the indices within each cell array
clens = [0 cumsum(cellfun('length',C(1:end-1)))];
match_index = reshape(find(matches),N_str,[]);
% Sort match_index along each column based on the respective unqID elements
[m,n] = size(match_index);
[~,sidx] = sort(reshape(unqID(matches),N_str,[]),1);
sorted_match_index = match_index(bsxfun(#plus,sidx,(0:n-1)*m));
% Subtract cumulative group lens to give us indices corres. to each cell array
common_idx = bsxfun(#minus,sorted_match_index,clens).'
Please note that at the step that calculates match_ID : accumarray(unqID(:),1) could be replaced by histc(unqID,1:max(unqID)). Also, histcounts be another alternative there.
*Sample input, output -
c1 =
'a' 'b' 'c' 'd'
c2 =
'b' 'c' 'a' 'd'
c3 =
'c' 'd' 'a'
common_str =
'a' 'c' 'd'
common_idx =
1 3 4
3 2 4
3 1 2
As noted in the comments to this question, there is a file in File Exchange called "MINTERSECT -- Multiple set intersection." at http://www.mathworks.com/matlabcentral/fileexchange/6144-mintersect-multiple-set-intersection that contains simple code to generalize intersect to multiple sets. In a nutshell, the code gets the output from performing intersect on the first pair of cells and then perform intersect on this output with the next cell. This process continues until all cells have been compared. Note that the author points out that the code is not particularly efficient but it may be sufficient for your use case.
I am attempting to translate my existing Matlab code into Numbers (basically Excel). In Matlab, I have the following code:
clear all; clc;
n = 30
x = 1:(n-1)
T = 295;
D = T./(n-x)
E = T/n
for i=1:(n-2)
C(i) = D(i+1) - D(i)
end
hold on
plot(x(1:end-1), C, 'rx')
plot(x, D, 'bx')
I believe everything has been solved by your formulas, there are parts of them that I don't understand otherwise I would try to figure the rest out myself. Attached is the result (Also you might like to know that the formulas you gave work and are recognized in Numbers). Im trying to figure out why (x) begins at 2 as I feel as though it should start at 1?
Also it is clear that the realistic results for the formulas only exist under certain conditions i.e. column E > 0. That being the case, what would be the easiest way to chart data with a filter so that only certain data is charted?
(Using Excel...)
Suppose you put your input values T & n in A1 & B1 respectively.
You could generate x, D & C In columns C,D & E with:
C1: =IF(ROW()<$A$1,ROW(),"")
D1: =IF(LEN(C1)>0,$A$2/($A$1-C1),"")
E1: =IF(LEN(D2)>0,D2-D1,"")
You could then pull all 3 columns down as far as you need to generate the full length of your vectors. If you then want to chart these, simply use these columns as inputs.
I am looking to match multiple string criteria and then subset the row in R, using grepl to find the match. I have found a nice solution from another post where some specific code is used (but you get the idea): subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
I am wondering if it is possible to grepl in two columns, instead of just RefSeq_ID in the example above. That is, in grepl via any other method. In other words, I would like to look for the options in l not just in one column, but in two (or however many). Is this possible?
eg.: 3 columns, a b and c. I would like to criteria such that T (rows 3 and 4) is selected, despite the format "T I" in (3,b). it should identify both (4,a) and (3,b), hence the link to the previous question. I want it to look in column a AND column b, not one or the other.
a b c
A A C P L
V V B W E E
W T I P J G
T W P J
Here's some demo data to show how this works:
set.seed(1234)
dat <- data.frame(A = sample(letters[1:3],10,TRUE),
B = sample(letters[1:3],10,TRUE))
Using [ to subset makes this a lot more clear in my opinion - we can use grepl to give a logical vector based on a match, and use | to combine two tests (on multiple columns). If you wanted a subset of all the rows that contained an 'a' in either column:
dat.a <- dat[with(dat, grepl("a", A)|grepl("a", B)),]
A B
1 b a
2 b a
3 a c
5 a a
9 a a