Keep only observations with data for all variables - subset

I have a dataset with several variables. I want to create a subsample which only includes the observations that have data for all variables, that is, no missing data in any of the variables.
I know about the dropmiss command in Stata, but that does not apply here because I do not want to drop variables; I want to drop observations.
I found a question similar to mine on Stack Overflow, but the statistical program used there is SAS and I am using Stata (SAS - Keeping only observations with all variables).
An example (a "." denotes missing data):
ID  year  pension  age  gender
1   2006  300      54   F
2   2007  250      40   M
3   2006  .        45   M
4   2005  .        .    F
So in this case I only want to keep IDs 1 and 2, and drop 3 and 4 from the sample, since they contain missing data for some of the variables.

The statement about dropmiss (downloadable from the Stata Journal website; find it with search dropmiss) is incorrect. dropmiss has an obs option geared to exactly this need.
. sysuse auto, clear
(1978 Automobile Data)
. dropmiss, obs
(0 observations deleted)
. dropmiss, obs any
(5 observations deleted)
However, dropmiss is considered by its author (that's me) to be superseded by missings (downloadable similarly from the Stata Journal website). missings doesn't support this directly, because handling missing values by multiple imputation is widely considered better statistical practice than dropping observations.
But if you insist, missings can help with this too:
. sysuse auto, clear
(1978 Automobile Data)
. missings tag, gen(anymiss)
Checking missings in all variables:
5 observations with missing values
. drop if anymiss
(5 observations deleted)
There's an egen function rowmiss() that behaves similarly.
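A minimal sketch of that route, assuming all variables in the dataset are numeric:
. egen nmiss = rowmiss(_all)
. drop if nmiss > 0
. drop nmiss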
What's key here is that you don't need to spell out the variable names concerned. However, watch out: these commands can be highly destructive.

The answer is pretty simple assuming you have a limited number of variables.
Just type:
keep if !missing(var1) & !missing(var2) & !missing(var3)
That command keeps only the rows that have non-missing values for all three of the variables mentioned above. Feel free to add more.
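Note that missing() accepts several arguments at once and returns 1 if any of them is missing, so the same condition can be written more compactly:
keep if !missing(var1, var2, var3)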

Transportation problem to minimize the cost using genetic algorithm

I am new to genetic algorithms, and here is a simple part of what I am working on:
There are factories (1, 2, 3) which can serve any of the following customers (A, B, C); the transportation costs are given in the table below. There are also fixed costs of 2, 4, and 1 for A, B, and C, respectively.
    A  B  C
1   5  2  3
2   2  4  6
3   8  5  5
How do I solve this transportation problem to minimize the cost using a genetic algorithm?
First of all, you should understand what a genetic algorithm is and why it is called that: we act like single-celled organisms, making crossovers and mutations to reach a better state.
So, you need to design your chromosome first. In your situation, pick one side, customers or factories. Let's take customers. Your solution will look like:
1 -> A
2 -> B
3 -> C
So, your example chromosome is "ABC". Then create another chromosome ("BCA", for example).
Now you need a fitness function which you wish to minimize/maximize.
This function determines each chromosome's breeding chance; in your situation, that'll be the total cost.
Write a function that calculates the cost for a given factory and a given customer.
Now, what you're going to do is,
Pick 2 chromosomes weighted randomly (the weights are calculated by the fitness function).
Pick an index into the 2 chromosomes and create new chromosomes by swapping their parts at that index.
If a new chromosome has invalid parts (such as "ABA" in your situation), make a fixing move (turn one of the "A"s into a "C", for example). We call this a "mutation".
Add your new chromosome to the chromosome set if it wasn't there before.
Go back to the first step and repeat.
You'll do this for some number of iterations; you may end up with thousands of chromosomes. When you think it's enough, stop the process and sort the chromosome set ascending/descending by fitness. The first chromosome will be your result.
I'm aware this makes the process dependent on time and on the chromosome set. I'm aware you may or may not find an optimum (the fittest, in biological terms) chromosome if you do not run it long enough. But that's what a genetic algorithm is. Even your first run and second run may not produce the same result, and that's fine.
Just for your situation, the possible chromosome set is very small, so I guarantee that you will find an optimum in a second or two, because the entire chromosome set is ["ABC", "BCA", "CAB", "BAC", "CBA", "ACB"] for you.
In summary, you need 3 pieces of information to apply a genetic algorithm (see the sketch after this list):
How should my chromosome be? (And initial chromosome set)
What is my fitness function?
How do I make crossovers between my chromosomes?
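Putting those three pieces together, here is a minimal Python sketch of the loop above. The population size, the inverse-cost parent weighting, and the repair step are my own assumptions, just one reasonable way to fill in the details:

import random

# transportation costs from the table, plus the fixed costs for A, B, C
COST = {(1, 'A'): 5, (1, 'B'): 2, (1, 'C'): 3,
        (2, 'A'): 2, (2, 'B'): 4, (2, 'C'): 6,
        (3, 'A'): 8, (3, 'B'): 5, (3, 'C'): 5}
FIXED = {'A': 2, 'B': 4, 'C': 1}
CUSTOMERS = 'ABC'

def cost(chrom):
    # fitness: total cost, where chrom[i] is the customer of factory i+1
    return sum(COST[(i + 1, c)] + FIXED[c] for i, c in enumerate(chrom))

def crossover(p1, p2):
    # one-point crossover; may produce invalid parts such as "ABA"
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def repair(chrom):
    # the "fixing move": replace duplicated customers by the missing ones
    missing = [c for c in CUSTOMERS if c not in chrom]
    seen, fixed = set(), []
    for c in chrom:
        fixed.append(missing.pop() if c in seen else c)
        seen.add(c)
    return ''.join(fixed)

def evolve(iterations=200):
    pool = {''.join(random.sample(CUSTOMERS, len(CUSTOMERS))) for _ in range(4)}
    for _ in range(iterations):
        chroms = list(pool)
        # breeding chance is weighted by fitness (cheaper = more likely)
        weights = [1.0 / cost(c) for c in chroms]
        p1, p2 = random.choices(chroms, weights=weights, k=2)
        pool.add(repair(crossover(p1, p2)))
    return min(pool, key=cost)

print(evolve())   # 'BAC' (total cost 16) for the table above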
There are some other things to watch out for in this problem:
Without mutation, a genetic algorithm can get stuck at a local optimum. It can still be used for optimization problems with constraints.
Even if a chromosome has a very low chance of being picked for crossover, you shouldn't sort and truncate the chromosome set until the iterations end. Otherwise you may get stuck at a local extremum or, worse, end up with an ordinary solution candidate instead of the global optimum.
To speed up the process, pick dissimilar initial chromosomes. Without a sufficient mutation rate, finding the global optimum can be a real pain.
As mentioned in nejdetckenobi's answer, the solution search space in this case is very small: only 6 feasible solutions, ["ABC", "BCA", "CAB", "BAC", "CBA", "ACB"]. I assume this is only a simplified version of your problem, and your actual problem contains more factories and customers (but with the numbers of factories and customers equal). In that case, you can use special mutation and crossover operators to avoid infeasible solutions with repeated customers, such as "ABA" or "CCB".
For mutation, I suggest a swap mutation: randomly pick two customers and swap their corresponding factories (positions):
ABC mutates to ACB
ABC mutates to CBA
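Sticking with the Python sketch from the other answer, a swap mutation that keeps every chromosome feasible could look like this (a hypothetical helper, not part of any library):

import random

def swap_mutate(chrom):
    # swap two random positions; a permutation stays a permutation
    i, j = random.sample(range(len(chrom)), 2)
    s = list(chrom)
    s[i], s[j] = s[j], s[i]
    return ''.join(s)

# swap_mutate("ABC") returns one of "BAC", "CBA", "ACB"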

SUM not working 'Invalid or missing field format'

I have an input file in this format (length 20: 10 characters and 10 numerics):
jname1 0000500006
bname1 0000100002
wname1 0000400007
yname1 0000000006
jname1 0000100001
mname1 0000500012
mname2 0000700013
In my jcl I have defined my sysin data as such:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,FD)
DATAEND
*
It works fine as long as I don't add the SUM fields, so I'm wondering if I'm using the wrong format for my numerics. Since I know they start at position 11 and have a length of 10, the format is the only thing that could be wrong.
As you might have already realised, the point of this JCL is just to list the values, grouped by the first letter of the name (so for the example data and JCL above, it would sum the numerics for mname1 and mname2 together but leave the other records untouched).
I'm kind of new at this, so I was wondering what format I need if my numerics look like that in the input file.
If you're new to DFSORT, get hold of the DFSORT Getting Started guide for your version of DFSORT (http://www-01.ibm.com/support/docview.wss?uid=isg3T7000080). It takes you through all the basic operations, with many examples.
The DFSORT Application Programming Guide describes everything you need to know, in detail, again with examples. Appendix C of that document lists all the data types available (note that when you tried FD, it failed because FD is not a valid data type, so probably a typo). There are tables throughout the document listing which data types are available where, if there is a particular limit.
For advanced techniques, consult the DFSORT Smart Tricks publication here: http://www-01.ibm.com/support/docview.wss?uid=isg3T7000094
You also need to understand a bit more about the way data is stored on a mainframe.
Decimals (which can be "packed decimal" or "zoned decimal") do not contain a decimal point; the decimal point is implied. In high-level languages you tell the compiler where the (fixed-position) decimal point is and the compiler does the alignment for you. In Assembler, you do everything yourself.
Decimals are 100% accurate, as there are machine-instructions which act directly on packed-decimal data giving packed-decimal results.
A field which actually contains a decimal-point, cannot be directly used in arithmetic.
An unsigned field is treated as positive when used in any arithmetic.
The SUM statement supports a limited number of numeric definitions, and you have chosen the correct one (ZD). It does not matter that your data is unsigned.
If the format of the output from SUM is not what you want, look at OPTION ZDPRINT (or NOZDPRINT).
If you want further formatting, you can use OUTREC or OUTFIL.
As an option to using SUM, you can use OUTFIL reporting functions (especially, although not limited to, if you want a report). You can use SECTIONS and TRAILER3 with TOT/TOTAL.
Something to watch for with SUM (which is not a problem with the reporting features) is any one (or more) of your summed fields exceeding the field size. To continue to use SUM if that happens, extend the field in INREC first and have SUM use the new, sufficient size.
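For example, a sketch of that extension for the layout in this question, assuming the 10-byte numbers are zoned decimal (ZD, see below) and widening them to 15 digits before summing:
INREC BUILD=(1,10,11,10,ZD,TO=ZD,LENGTH=15)
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,15,ZD)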
After some trial and error I finally found it: apparently the format I needed was ZD (zoned decimal, signed), so my SYSIN becomes this:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,ZD)
DATAEND
*
even though my records don't contain any decimals and they are unsigned. I don't really get it, so if someone knows why it's like that, please go ahead and explain it to me.
For now, the way I'm going to remember it is this: Z is the symbol for integers (so no decimals).
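For what it's worth, here is why ZD fits your data: zoned decimal stores one EBCDIC digit per byte, and the zone nibble of the last byte doubles as the sign, where a zone of F means unsigned, which is treated as positive (see the answer above). Your first record's numeric field, for example:
'0000500006'  ->  X'F0F0F0F0F5F0F0F0F0F6'
                  (one zone/digit byte per digit; the final zone, the F
                   in F6, is the sign nibble: F = unsigned = positive)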

Create (mathematical) function from set of predefined values

I want to create an Excel table that will help me estimate implementation times for the tasks I am given. To do so, I defined 4 categories, in each of which I rate a task from 1 to 10.
These are: complexity of the system (simple scripts or entire business systems), state of the requirements (well defined or very soft), knowledge about the system (how much I know about the system and the code base), and plan for implementation (do I know what to do, or do I have no plan at all of what to do or where to start?).
After rating a task in these categories, I want a resulting factor for how expensive the task will be and how long it will likely take, as a very rough estimate that I can give my bosses.
What I thought about doing
I thought I would create a function where I define the inputs and then get the result as a number, see:
| a | b | c | d | Result |
| 1 | 1 | 1 | 1 | 160 |
| 5 | 5 | 5 | 5 | 80 |
| 10 | 10 | 10 | 10 | 2 |
And I want to create a function that, given a, b, c, d, produces the results above for the extreme cases (max, min, avg) and of course any (float) values in between.
How can I go about doing this? I imagine this is some form of polynomial interpolation problem, but how can I actually create a function that produces these results?
I get tasks like this often, so it would be nice to have a sort of pattern to follow whenever I need to create such functions, for any number of parameters and results.
I tried using Wolfram Alpha's interpolating polynomial command for this, but the result is just a mess of extremely large fractions...
How can I create this function properly, with reasonable results?
While writing this edit, I realize this may be better suited over at programmers.SE - If no one answers here, I will move the question there.
You don't have enough data as it is. The simplest formula which takes all four of your explanatory variables into account would be linear:
x0 + x1*a + x2*b + x3*c + x4*d
If you formulate a set of equations for this, you have three equations but five unknowns, which means that you don't have a unique solution. On the other hand, the data points which you did provide prove that the relation between scores and time is not exactly linear, so you might have to look at some family of functions which is even more complex, and therefore has even more parameters to tune. While it would be easy to tune parameters to match the input, that choice would be pretty arbitrary, and therefore without predictive power.
So while your system of four distinct scores might be useful in the long run, I'd not use it at the moment. I'd suggest you collect some more data points, recording how long a given task actually took you, and only use that fine-grained a model once you have enough data points to fit all of its parameters.
In the meantime, aggregate all four numbers into a single one, e.g. by taking their average, and then decide on a formula, e.g. a quadratic one:
182 - 22.9*a + 0.49*a*a
That's a fair fit for your requirements, and not too complex or messy. But the choice of function, i.e. a polynomial one, is still pretty arbitrary, so revisit that choice once you have more data. Note that this polynomial is almost the one Wolfram Alpha found for your data:
1642/9 - 344/15*a + 22/45*a*a
I only converted these rational numbers to decimal notation, truncating them pretty early on, since all of this is very rough in any case.
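Since you want this in an Excel table, that fit can be a single formula; a sketch assuming a task's four scores sit in cells A2:D2:
=182 - 22.9*AVERAGE(A2:D2) + 0.49*AVERAGE(A2:D2)^2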
On the whole, this question seems more suited to Cross Validated than to Programmers SE, in my opinion. But don't bother them unless you have enough data to actually fit a model.

Detection of similar sequences in ordered event lists

I have logs from a bunch (millions) of small experiments.
Each log contains a list (tens to hundreds) of entries. Each entry is a timestamp and an event ID (there are several thousand event IDs, each of which may occur many times across logs):
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta
I know that one event may trigger other events later.
I am researching this dataset. I am looking for "stable" sequences of events that occur often enough in the experiments.
Is there a way to do this without writing too much code and without using proprietary software? The solution should be scalable enough, and work on large datasets.
I think that this task is similar to what bioinformatics does, finding sequences in DNA and such, except that my alphabet has many more than four letters... (Update, thanks to @JayInNyc: proteomics deals with larger alphabets than mine.)
(Note, BTW, that I do not know beforehand how stable and how similar I want my sequences to be, what the minimal sequence length is, etc. I'm exploring the dataset and will have to figure this out as I go.)
Anyway, any suggestions on the approaches/tools/libraries I could use?
Update: Some answers to the questions in comments:
Stable sequences: sequences found often enough across the experiments. (How often is often enough? I don't know yet. It looks like I need to compute a ranking of the chains and discard the rarest.)
Similar sequences: sequences that look similar. "Are the sequences 'A B C D E' and 'A B C E D' (minor difference in order) similar according to you? Are the sequences 'A B C D E' and 'A B C 1 D E' (order of occurrence of the selected events is the same) also similar according to you?" Yes to both questions. More drastic mutations are probably also OK. Again, I'd like to be able to compute a ranking and discard the most dissimilar...
Timing: I can discard the timing information for now (but not the order). Still, it would be cool to have it in a similarity-index formula.
Update 2: Expected output.
In the end I would like to have a ranking of the most popular, longest, stablest chains. A combination of all three factors should affect the rating score.
A chain in such a ranking is, obviously, really a cluster of sufficiently similar chains.
A synthetic example of a chain-cluster:
alpha
beta
gamma
[garbage]
[garbage]
delta
another:
alpha
beta
gamma|zeta|epsilon
delta
(or whatever variant has not come to my mind right now.)
So, the end output would be something like that (numbers are completely random in this example):
Chain cluster ID | Times found | Time stab. factor | Chain stab. factor | Length | Score
A                | 12345       | 123               | 3                  | 5      | 100000
B                | 54321       | 12                | 30                 | 3      | 700000
I have thought about this setup for the past day or so: how to do it in a sane, scalable way in bash, etc. The answer is really driven by the relational information you want to draw from the data and the apparent size of the dataset you currently have. The cleanest solution will be to load your datasets into a relational database (MariaDB would be my recommendation).
Since your data already exists in a fairly clean format, you have two options for getting it into a database. (1) If the files hold the data in a usable row-by-column layout, you can simply use LOAD DATA INFILE to bring the data into the database; or (2) parse the files with bash in a while read line; do scenario, transform the data into the table format you want, and use mysql batch mode to load the information into mysql directly in a single pass. The general form of the bash command would be mysql -uUser -hHost database -Bse "your insert command".
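For option (1), a minimal sketch in SQL (the table layout and file name are hypothetical; it assumes one space-separated "timestamp event" pair per line and one file per experiment):

-- one row per log entry, tagged with the experiment it came from
CREATE TABLE events (
    experiment_id INT,
    ts            INT,
    event_id      VARCHAR(64)
);

LOAD DATA INFILE '/path/to/experiment_1.log'
INTO TABLE events
FIELDS TERMINATED BY ' '
(ts, event_id)
SET experiment_id = 1;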
Once the data is in a relational database, you have the proper tool for the job: you can run flexible queries against your data in a sane manner, instead of continually writing/re-writing bash snippets to handle your data a different way each time. That is probably the scalable solution you are looking for. A little more work up front, but a much better setup going forward.
Wikipedia defines an algorithm as 'a precise list of precise steps'. 'I am looking for "stable" sequences of events that occur often enough in the experiments': "stable" and "often enough" without definitions make the task of giving you an algorithm impossible.
So I'll give you the trivial one, which calculates the frequencies of sequences of length 2. I will ignore the timestamps. Here is the awk code (pW stands for previous word, pcs stands for pair counters):
#!/usr/bin/awk -f
BEGIN { getline; pW=$2; }
{ pcs[pW, $2]++; pW=$2; }
END {
    for (i in pcs)
        print i, pcs[i];
}
I duplicated your sample (with a small variation) to show something meaningful looking:
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
Running the code above on it gives:
gammaalpha 1
alphabeta 4
gammabeta 3
deltaalpha 3
betagamma 4
alphadelta 1
betadelta 3
This can be interpreted as: alpha followed by beta, and beta followed by gamma, are the most frequent length-two sequences, each occurring 4 times in the sample. I guess that would be your definition of a stable sequence occurring often enough.
What's next?
(1) You can easily adapt the code above to sequences of length N (see the sketch after this list), and to find sequences occurring often enough you can sort the output on the second column (sort -k2nr).
(2) To put a limit on N, you can stipulate that no event triggers itself, which provides you with a cut-off point. Or you can place a limit on the timestamps, i.e. on the difference between consecutive events.
(3) So far those sequences were really strings, and I used exact matching between them (CLRS terminology). Nothing prevents you from using your favourite similarity measure instead:
{ pcs[CLIFY(pW, $2)]++; pW=$2; }
CLIFY would be a function which takes k consecutive events and puts them into a bin, i.e. maybe you want ABCDE and ABDCE to go into the same bin. CLIFY could, of course, take the set of bins so far as an additional argument.
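As an example of (1), here is a sketch of the length-3 variant (pW2 and pW1 hold the two previous events):
#!/usr/bin/awk -f
# count the frequencies of length-3 event sequences, ignoring timestamps
BEGIN { getline; pW2 = $2; getline; pW1 = $2; }
{ pcs[pW2, pW1, $2]++; pW2 = pW1; pW1 = $2; }
END {
    for (i in pcs)
        print i, pcs[i];
}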
The choice of awk is for convenience: it won't fly on millions of logs by itself, but you can easily run many instances of it in parallel.
It is unclear what you want to use this for, but a Google search for Markov chains and Mark V. Shaney would probably help.

How can I loop through variables in SPSS? I want to avoid code duplication

Is there a "native" SPSS way to loop through some variable names? All I want to do is take a list of variables (that I define) and run the same procedure for them:
Pseudo-code (not really a good example, but it gets the point across):
for i in varlist['a','b','c']
do
FREQUENCIES VARIABLES=varlist[i] / ORDER=ANALYSIS.
end
I've noticed that people seem to just use the R or Python SPSS plugins to achieve this basic array functionality, but I don't know how soon I can get those configured (if ever) on my installation of SPSS.
SPSS has to have some native way to do this...right?
There are two easy solutions for looping through variables (easier compared to using Python in SPSS).
1) DO REPEAT-END REPEAT
The drawback is that DO REPEAT-END REPEAT can mainly be used only for data transformations, for example COMPUTE, RECODE, etc. Frequencies are not allowed. For example:
DO REPEAT R=REGION1 TO REGION5.
COMPUTE R=0.
END REPEAT.
2) DEFINE-!ENDDEFINE (macro facility)
You can run FREQUENCIES in a loop over variables using a macro. For example:
DEFINE macdef (!POS !CHAREND('/'))
!DO !i !IN (!1)
frequencies variables = !i.
!DOEND
!ENDDEFINE.
macdef VAR1 VAR2 VAR3 /.
If I understand the question correctly, there may be no need to use a looping construct. SPSS commands with a VARIABLES subcommand like FREQUENCIES allow you to specify multiple variables.
The basic syntax for the FREQUENCIES command is:
FREQUENCIES
VARIABLES= varlist [varlist...]
where each varlist is a single variable name, multiple space-delimited variable names, a range of consecutive variables specified with the TO keyword, the keyword ALL, or a combination of the previous options.
For example:
FREQUENCIES VARIABLES=VARA
FREQUENCIES VARIABLES=VARA VARB VARC
FREQUENCIES VARIABLES=VARA TO VARC
FREQ VAR=ALL
FREQ VAR=VARA TO VARC VARM VARX TO VARZ
See SPSS Statistics 17.0 Command Syntax Reference available at http://support.spss.com/ProductsExt/SPSS/Documentation/SPSSforWindows/index.htm
Note that it's been years since I've actually used SPSS.
It's more efficient to do all these frequencies on one data pass, e.g.,
FREQUENCIES a to c.
but Python lets you do looping and lots of other control flow tricks.
begin program.
import spss
for v in ['a', 'b', 'c']:
    spss.Submit("FREQUENCIES " + v + ".")
end program.
Using Python requires installing the (free) Python plugin available from SPSS Developer Central, www.spss.com/devcentral.
You can, of course, use macros for this sort of things, but Python is a lot more powerful and easier once you get the hang of it.
Yes, SPSS can do this. It sounds like the guys at UCLA use Python because they know how to do it in Python and not in SPSS. :)
Let's call your variables VARA, VARB, VARC. They must be numeric (since you are doing frequencies) and they must be consecutive in your SPSS data file. Then you create a vector, saying in effect "here is the series of variables I want to loop through":
VECTOR VectorVar = VarA TO VarC.
LOOP #cnt = 1 TO 3 BY 1.
FREQUENCIES VARIABLES=VectorVar(#cnt) /ORDER=ANALYSIS.
END LOOP.
EXECUTE.
(The above has not been tested. Might be missing a period somewhere, etc.)
Here's a page from UCLA's Academic Technology Services that describes looping over lists of variables. Quote:
"Because we are looping through more than one variable, we will need to use Python."
In my experience, UCLA ATS is probably the site with the best coverage of all of the major statistical computing systems. If they say you need Python... you probably need Python.
Er... sorry for being that guy, but maybe it's time to switch to a different stats system.
I haven't used SPSS macros very much, but maybe they can get you where you need to be? Check out this site for some examples:
http://spsstools.net/Macros.htm
Also, the SPSS Data Management book may be helpful as well.
Lastly, if memory serves, I think this problem may even be the main example of how to leverage Python inside SPSS syntax. I have only used Python with SPSS a few times, but it is very handy to have that language accessible if need be.
HTH
How can I do this Stata syntax in SPSS?
foreach var of varlist pob_multi pob_multimod pob_multiex vul_car vul_ing nopob_nov espacio carencias carencias_3 ic_rezedu ic_asalud ic_ss ic_cv ic_sbv ic_ali pobex pob {
    tabstat `var' [w=factor] if pob_multi != ., stats(mean) save
    matrix define `var'_pp = (r(StatTotal))
    matrix rownames `var'_pp = `var'_pp
}
matrix tabla1 = (pob_multi_pp \ pob_multimod_pp \ pob_multiex_pp \ vul_car_pp \ vul_ing_pp \ nopob_nov_pp \ espacio_pp \ carencias_pp \ carencias_3_pp \ espacio_pp \ ic_rezedu_pp\ ic_asalud_pp \ ic_ss_pp \ ic_cv_pp \ ic_sbv_pp\ ic_ali_pp \ espacio_pp \ pobex_pp \ pob_pp )
matrix list tabla1
Thanks.
