I have some data with 15 variables, including some missing values. When I tried to look at the Frequency charts with the counts and percentages of missing values, they showed every variable as having no missing values. In the variable view, I changed the "Missing" column from "None" to a discrete missing value, "?", so that the missing values would be recognized. This showed the correct number of missing values for each variable.
I then went to Transform > Replace Missing Values to replace these missing values, but the only variables that appear as an option are the variables that are not missing any values. I tried going back to the variable view and changing all of the "Missing" column values back to "None" from "?", but that didn't help.
All of the variables that do appear in the Replace Missing Values box are also numeric. Is that the problem - that the variables I want to replace are strings? If so, how can I handle these missing string values in my data?
The dialog you are describing is for imputing missing data, for example with the mean or the median of the series or of nearby points (the underlying command is RMV), so it only applies to numeric variables.
One way to replace missing values in string variables is the RECODE command, as in the example below.
DATA LIST FREE / X (A5).
BEGIN DATA
A
B
?
C
?
END DATA.
MISSING VALUES X ('?').
RECODE X (MISSING = '!').
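For readers who think in a general-purpose language, the same idea (swap a declared missing marker for another string) can be sketched in Python. This is only an illustration under assumptions: the list `values` mirrors the five sample rows above, and the names `recode_missing`, `MISSING_MARKER` and `REPLACEMENT` are hypothetical, not part of SPSS.

```python
# Sample values mirroring the five rows of the SPSS example above.
values = ["A", "B", "?", "C", "?"]

MISSING_MARKER = "?"  # the value declared as user-missing
REPLACEMENT = "!"     # the value written in its place


def recode_missing(items, marker, replacement):
    """Return a new list with every occurrence of marker replaced."""
    return [replacement if item == marker else item for item in items]


recoded = recode_missing(values, MISSING_MARKER, REPLACEMENT)
print(recoded)
```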
I have a table that shows me a chemical concentration value based on temperature, pH and ammonia. The way I measure these variables, the ammonia level is always one of these six values (on top of the table), so it works as a categorical variable.
I need a way to interpolate on this table, based on these 3 variables. I tried using a combination of INDEX and MATCH, but I was not able to achieve what I wanted. Then I thought of "dividing" the table in intervals to "reduce" one variable and use an IF function to select which interval to interpolate based on the third variable (I was thinking pH or Ammonia), but I can't figure out a way to change intervals dynamically like this.
Can anyone think of an alternative to accomplish what I'm trying to do? If possible I would like to avoid using VBA, but if there is no other way I have no problem using it.
Thank you for the help!
I'm attaching an example of the table below.
Assuming that PH is in Column A:
=INDEX(A:H;MATCH(6,8;A:A;0)+MATCH(25;B:B;0)-2;MATCH(2;2:2;0))
Here the -2 needs to be changed to the number of rows BEFORE the first 22 in Temp.
This also assumes that the pattern of 22;25;28 in Temp is the same for every pH.
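The formula above is an exact-match lookup: it finds the row for a given pH and temperature and the column for a given ammonia level. A minimal Python sketch of that lookup follows, plus a linear interpolation along temperature, which is an extension and not part of the formula. Everything here is an assumption for illustration: the numbers in `TABLE` are made up, and `AMMONIA_LEVELS`, `lookup` and `interpolate_temp` are hypothetical names.

```python
# Hypothetical table: every number below is made up for illustration.
AMMONIA_LEVELS = [1, 2, 4]  # column headers, treated as categories

TABLE = {
    # (pH, temperature): concentrations, one per ammonia level
    (6.8, 22): [10.0, 20.0, 40.0],
    (6.8, 25): [13.0, 26.0, 52.0],
    (6.8, 28): [16.0, 32.0, 64.0],
    (7.0, 22): [11.0, 22.0, 44.0],
    (7.0, 25): [14.0, 28.0, 56.0],
    (7.0, 28): [17.0, 34.0, 68.0],
}


def lookup(ph, temp, ammonia):
    """Exact-match lookup, the counterpart of the INDEX/MATCH formula."""
    column = AMMONIA_LEVELS.index(ammonia)
    return TABLE[(ph, temp)][column]


def interpolate_temp(ph, temp, ammonia):
    """Linear interpolation along temperature for a tabulated pH and ammonia."""
    temps = sorted(t for (p, t) in TABLE if p == ph)
    if not temps or not temps[0] <= temp <= temps[-1]:
        raise ValueError("temperature outside the tabulated range")
    for low, high in zip(temps, temps[1:]):
        if low <= temp <= high:
            y_low = lookup(ph, low, ammonia)
            y_high = lookup(ph, high, ammonia)
            return y_low + (y_high - y_low) * (temp - low) / (high - low)
    # Only reached when a single temperature is tabulated for this pH.
    return lookup(ph, temps[0], ammonia)
```

Interpolating along pH as well would repeat the same step on the two neighbouring pH blocks and then blend the two results.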
I checked the missing data in SPSS. It reported more missing values for a variable than there actually are.
Screenshot:
For the first variable, it said there are 171784 missing when there are only 127014 missing (I checked using MS Excel). Moreover, there are actually 341272 cases in total but the sum of valid and missing cases in a variable is only 340296. Why are there lots of missing data? Maybe because of this, the mean values I calculated in SPSS are different from those in MS Excel.
Sounds to me like some values were defined as missing values. Unlike SPSS, Excel counts these values as valid and includes them when calculating the average, so the two programs give different results.
To check whether any values are defined as missing, take a look at the "Missing" column in the variable view of your dataset.
HTH
Have you checked whether there are a lot of empty rows in your SPSS data file that are being interpreted as missing data even though they are just empty cases? Importing data into SPSS can sometimes cause this problem.
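The first explanation above (values defined as missing) can be illustrated with a small Python sketch. It is not SPSS or Excel code: the data are made up, and 999 is only an assumed missing-value code. It shows how an average over the raw column differs from one that honours the missing definition.

```python
from statistics import mean

# Made-up data; 999 is an assumed code declared as user-missing.
values = [4, 5, 999, 6, 999, 5]
USER_MISSING = {999}

valid = [v for v in values if v not in USER_MISSING]

mean_all = mean(values)    # what averaging the raw column gives
mean_valid = mean(valid)   # what honouring the missing definition gives
n_missing = len(values) - len(valid)

print(n_missing, mean_all, mean_valid)
```

Comparing `n_missing` with the count of truly empty cells is one way to see whether declared missing values explain the gap.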
I have two string fields of unspecified length, let's call them One and Two. Now I would like to concat them, so that if One = "aaa" and Two = "bbb" the result becomes "aaabbb". Using the Concat fields step seems like a reasonable first guess for how to do this.
However, if I leave the "Length of target field" setting at the standard value of 0, I get no output. If I set it to something large, like 100, I always get extra spaces at the end. I want the resulting field to be as long as necessary to contain One + Two, not longer and not shorter. Is there any way to do this using this step or some other one?
I have tried using the trim setting, but it trims the input and not the output. Clicking the "Minimal width" button does absolutely nothing.
This seems like it should be a pretty simple standard task. Am I missing something here?
EDIT: My input here is just a few rows from a Data grid step, without anything between the grid and the concat. I tried replacing the grid with a Generate rows step, but I get the same result (both when using fixed length for the generated fields, and when leaving the length fields blank).
My version of Kettle is 5.4.0.1-130. I am running it on a Windows 7 x64 platform.
Do the configuration as shown in the figure. It correctly gives the result as the second figure.
Result:
Used Data Grid step to get data.
The configuration suggested by Marlon Abeykoon works. It also works with type "String" instead of "None".
My problem was not in the Concat fields step, but in the Text file output step I used to write the result to a file. It takes its metadata about the fields from the Concat fields step and inherits the zero length for the field, so it prints zero characters to the text file.
The solution is to go to the "Fields" tab of the output step and press "Get fields". That explicitly adds all the fields and their metadata to the list, so you can change the length of the output field from the concat step to be empty instead of 0.
I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata does not like the fact that the variable contains fractional numbers.
If I understand you correctly, the interpretation as string arises from one and possibly two facts:
The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
Commas are used as decimal separators. destring can easily cope with this: see the help for its dpcomma option, for example destring gdppercap, generate(gdppercap_n) dpcomma. Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that
destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.
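The two issues described above (a header echoed as data, and decimal commas) can be sketched in Python to show the logic. This is an illustration only, not Stata code: the list `raw` reuses the values from the question, and `parse_decimal_comma` is a hypothetical helper name.

```python
def parse_decimal_comma(text):
    """Convert a string like '19786,97' to a float; return None if it is not a number."""
    try:
        return float(text.strip().replace(",", "."))
    except ValueError:
        return None


# Values from the question, including the echoed variable name.
raw = ["gdppercap", "19786,97", "20713,737", "20793,163", "23070,398", "5639,175"]

parsed = [parse_decimal_comma(item) for item in raw]
print(parsed)
```

The echoed header becomes a missing value while the remaining entries become numbers, which is why a single text row is enough to make the whole column a string on import.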
You can write a loop to convert a comma to a period. I don't know your variables exactly, but imagine you have a string variable gdppercap with values like 1234,343 and you want them to be 1234.343 before you run destring.
For example:
forvalues x = 1(1)10 {
    replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .) if substr(gdppercap, `x', 1) == ","
}
I'm importing a very large dataset into SPSS. Many fields in the dataset contain a "999" value, indicating a missing value. I want to instruct SPSS to treat them as such. However, by default each variable in SPSS is set to have no missing values. In variable view, you have to define "999" as the "discrete missing value" for each variable. With hundreds of variables, this is a lot of work.
Therefore: is there a way to define "discrete missing value 999" as the default missing value for each variable on import? This would save me a lot of work, but I cannot find the answer online (I only get tutorials on how to define 999 as the missing value for each variable separately, as I am doing now).
Your help is greatly appreciated!
It is not possible to make 999 a user-missing value by default.
I advise you to use syntax. The MISSING VALUES command lets you define values as user-missing for several variables in one go. Try the following commands, for example:
MISSING VALUES all (999).
MISSING VALUES V1 to V99 (999).
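The effect of declaring one code as missing for every variable at once can be sketched in Python. This is an analogy under assumptions, not SPSS behaviour: SPSS only flags the value and keeps it in the data, whereas this sketch replaces it with `None`. The records are made up, and `apply_missing` and `MISSING_CODE` are hypothetical names.

```python
MISSING_CODE = 999  # the code used for missing values in the question

# Made-up records; keys stand in for variables V1, V2, V3.
rows = [
    {"V1": 3, "V2": 999, "V3": 7},
    {"V1": 999, "V2": 2, "V3": 999},
]


def apply_missing(records, code):
    """Return new records with every occurrence of code replaced by None."""
    return [
        {name: (None if value == code else value) for name, value in record.items()}
        for record in records
    ]


cleaned = apply_missing(rows, MISSING_CODE)
print(cleaned)
```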