How can I loop through variables in SPSS? I want to avoid code duplication

Is there a "native" SPSS way to loop through some variable names? All I want to do is take a list of variables (that I define) and run the same procedure for them:
pseudo-code - not really a good example, but gets the point across...
for i in varlist['a','b','c']
do
FREQUENCIES VARIABLES=varlist[i] / ORDER=ANALYSIS.
end
I've noticed that people seem to just use R or Python SPSS plugins to achieve this basic array functionality, but I don't know how soon I can get those configured (if ever) on my installation of SPSS.
SPSS has to have some native way to do this...right?

There are two easy native solutions for looping through variables (easier, at least, than using Python in SPSS).
1) DO REPEAT-END REPEAT
The drawback is that DO REPEAT-END REPEAT works only with data transformations such as COMPUTE and RECODE; procedures like FREQUENCIES are not allowed. For example, this sets five region variables to zero:
DO REPEAT R=REGION1 TO REGION5.
COMPUTE R=0.
END REPEAT.
2) DEFINE-!ENDDEFINE (macro facility)
You can run FREQUENCIES in a loop over variables using the macro facility. The macro below takes a positional argument terminated by a slash and loops over each variable name in it:
DEFINE macdef (!POS !CHAREND('/'))
!DO !i !IN (!1)
frequencies variables = !i.
!DOEND
!ENDDEFINE.
macdef VAR1 VAR2 VAR3 /.

If I understand the question correctly, there may be no need to use a looping construct. SPSS commands with a VARIABLES subcommand like FREQUENCIES allow you to specify multiple variables.
The basic syntax for FREQUENCIES is:
FREQUENCIES
VARIABLES= varlist [varlist...]
where each varlist is a single variable name, multiple space-delimited variable names, a range of consecutive variables specified with the TO keyword, the keyword ALL, or a combination of the previous options.
For example:
FREQUENCIES VARIABLES=VARA.
FREQUENCIES VARIABLES=VARA VARB VARC.
FREQUENCIES VARIABLES=VARA TO VARC.
FREQ VAR=ALL.
FREQ VAR=VARA TO VARC VARM VARX TO VARZ.
See SPSS Statistics 17.0 Command Syntax Reference available at http://support.spss.com/ProductsExt/SPSS/Documentation/SPSSforWindows/index.htm
Note that it's been years since I've actually used SPSS.

It's more efficient to do all these frequencies on one data pass, e.g.,
FREQUENCIES a to c.
but Python lets you do looping and lots of other control flow tricks.
begin program.
import spss
for v in ['a', 'b', 'c']:
    spss.Submit("FREQUENCIES " + v + ".")
end program.
Using Python requires installing the (free) Python plugin available from SPSS Developer Central, www.spss.com/devcentral.
You can, of course, use macros for this sort of thing, but Python is a lot more powerful and easier once you get the hang of it.

Yes, SPSS can do this. Sounds like the guys at UCLA use Python because they know how to do it in Python and not in SPSS. :)
Let's call your variables VARA, VARB, VARC. They must be numeric (since you are running frequencies) and they must be consecutive in your SPSS data file. Then you create a vector that says, in effect, "here is the series of variables I want to loop through".
VECTOR VectorVar = VarA TO VarC.
LOOP #cnt = 1 TO 3 BY 1.
FREQUENCIES VARIABLES=VectorVar(#cnt) /ORDER=ANALYSIS.
END LOOP.
EXECUTE.
(The above has not been tested. Might be missing a period somewhere, etc.)

Here's a page from UCLA's Academic Technology Services that describes looping over lists of variables. Quote:
"Because we are looping through more than one variable, we will need to use Python."
In my experience, UCLA ATS is probably the site with the best coverage of all of the major statistical computing systems. If they say you need Python... you probably need Python.
Er... sorry for being that guy, but maybe it's time to switch to a different stats system.

I haven't used SPSS macros very much, but maybe they can get you where you need to be? Check out this site for some examples:
http://spsstools.net/Macros.htm
Also, the SPSS Data Management book may be helpful as well.
Lastly, if memory serves, this very problem may even be the main example of how to leverage Python inside SPSS syntax. I have only used Python and SPSS together a few times, but it is very handy to have that language accessible if need be.
HTH

How can I translate this Stata syntax to SPSS?
foreach var of varlist pob_multi pob_multimod pob_multiex vul_car vul_ing nopob_nov espacio carencias carencias_3 ic_rezedu ic_asalud ic_ss ic_cv ic_sbv ic_ali pobex pob {
tabstat `var' [w=factor] if pob_multi!=., stats(mean) save
matrix define `var'_pp =(r(StatTotal))
matrix rownames `var'_pp = `var'_pp
}
matrix tabla1 = (pob_multi_pp \ pob_multimod_pp \ pob_multiex_pp \ vul_car_pp \ vul_ing_pp \ nopob_nov_pp \ espacio_pp \ carencias_pp \ carencias_3_pp \ espacio_pp \ ic_rezedu_pp\ ic_asalud_pp \ ic_ss_pp \ ic_cv_pp \ ic_sbv_pp\ ic_ali_pp \ espacio_pp \ pobex_pp \ pob_pp )
matrix list tabla1
thanks.

Related

Selecting arbitrary rows from a Neo matrix in Nim?

I am using the Neo library for linear algebra in Nim, and I would like to extract arbitrary rows from a matrix.
I can explicitly select a continuous sequence of rows as per the examples in the README, but can't select a disjoint subset of rows.
import neo
let x = randomMatrix(10, 4)
let some_rows = @[1, 3, 5]
echo x[2..4, All] # works fine
echo x[some_rows, All] ## error
The first echo works because you are creating a Slice object, for which neo has defined a proc. The second echo uses a sequence of integers, and that kind of access is not defined in the neo library. Unfortunately, Slices define contiguous closed ranges; you can't even specify a step to iterate in increments bigger than one, so there is no way to accomplish what you want directly.
Looking at the structure of a Matrix, it seems highly optimised to avoid copying data: matrix transformation operations reuse the data of the previous matrix and change only the access pattern and dimensions. As such, a transformation selecting arbitrary rows would not be possible; the indexes in your example access non-contiguous data, and this would need to be encoded somehow in the new structure. Plus, if you wrote @[1, 5, 3], that would defeat any kind of normal iterative looping.
An alternative, of course, is to write a proc which accepts a sequence instead of a slice and builds a new matrix by copying data from the old one. This implies a performance penalty, but if you think it would be a good addition to the library, please request it in the project's issue tracker. If it is not accepted, you will need to write such a proc yourself for use in your own programs.

Learning/Detecting Mutatable Parts of a URL in Logs

Say you have a webserver log (apache, nginx, whatever). From it you extract a large list of URLs:
/article/1/view
/article/2/view
/article/1/view
/article/1323/view
/article/1/edit
/help
/article/1/view
/contact
/contact/thank-you
/article/8/edit
...
or
/blog/2012/06/01/how-i-will-spend-my-summer-vacation
/blog/2012/08/30/how-i-wasted-my-summer-vacation
...
You explode these urls into their pieces such that you have ['article', '1323', 'view'] or ['blog', '2012', '08', '30', 'how-i-wasted-my-summer-vacation'].
How would one go about analyzing and comparing these urls to detect and call out "variables" in the url path. That is to say, you would want to recognize things like /article/XXX/view, /article/XXX/edit, and /blog/XXX/XXX/XXX/XXX such that you can summarize information about those lines in the logs.
I assume that there will need to be some statistical threshold for the number of differences that constitute a mutable piece vs a similar looking but different template. I am also unsure as to what data structure would make this analysis quick and easy.
I would like the output of the script to output what it thinks are all the url templates that are present on the server, possibly with some confidence value if appropriate.
A simple solution would be to count path occurrences and learn which values correspond to templates. Assume that the file input contains the URLs from your first snippet. Then compute the per-path visits:
awk -F '/' '{ for (i=2; i<=NF; ++i) { for (j=2; j<=i; ++j) printf "/%s", $j; printf "\n" }}' input \
| sort \
| uniq -c \
| sort -rn
This yields:
7 /article
4 /article/1
3 /article/1/view
2 /contact
1 /help
1 /contact/thank-you
1 /article/8/edit
1 /article/8
1 /article/2/view
1 /article/2
1 /article/1323/view
1 /article/1323
1 /article/1/edit
Now you have a weight for each path which you can feed into a score function f(x, y), where x represents the count and y the depth of the path. For example, the first line would result in the invocation f(7, 2) and may return a value in [0, 1], say 0.8, telling you that the given parametrization corresponds to a template with 80% confidence. Of course, all the magic happens in f, and you would have to come up with reasonable values based on the paths that you see being accessed. To develop a good f, you could run logistic regression on a small labelled data set and see how well it predicts the binary feature of being a template or not.
You can also take the mundane route: just drop the tail, e.g., all paths with counts <= 1.
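If you'd rather stay in Python, here is a minimal sketch of the same prefix-counting idea; the score function is just a placeholder assumption to show where f(x, y) would plug in:
from collections import Counter

def prefix_counts(urls):
    # Count every path prefix, mirroring the awk pipeline above.
    counts = Counter()
    for url in urls:
        parts = [p for p in url.split('/') if p]
        for depth in range(1, len(parts) + 1):
            counts['/' + '/'.join(parts[:depth])] += 1
    return counts

def score(count, depth):
    # Placeholder for f(x, y); tune this against real logs.
    return min(1.0, count * depth / 10.0)

urls = ['/article/1/view', '/article/2/view', '/article/1/view', '/help']
for path, n in prefix_counts(urls).most_common():
    print(n, path, round(score(n, path.count('/')), 2))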
How about using a DAWG? Except the nodes would store not letters, but the URI pieces: shared segments such as article and view become shared nodes, while the varying IDs fan out between them.
This is a very nice data structure: it has pretty minimal memory requirements, it's easy to traverse, and, being a DAG, there are plenty of easy and well-researched algorithms for it. It also happens to describe the state machine that accepts all URLs in the sample and rejects all others (so we might actually build a regular expression out of it, which is very neat, but I'm not clever enough to know how to go about it from there).
Anyhow, with a structure like this, your problem translates into that of finding the "bottlenecks". I'd guess there are proper algorithms for that, but with a large enough sample where variables vary wildly enough, it's basically this: the more nodes there are at a certain depth, the more likely it's a mutable part.
A probably naive approach would be this: keeping separate DAWGs for every starting part, I'd find the mean width of the DAWG (possibly weighted by depth). And if a level's width is above that mean, I'd consider it a variable, with the probability depending on how far it is from the mean. You may very well unleash the power of statistics at this point, modeling the distribution of the width.
This approach wouldn't fare well with independent patterns starting with the same part, like "shop/?/?" and "shop/admin/?/edit". This could perhaps be mitigated by examining the DAWGs in a more dynamic fashion, using a sliding window of sorts, always examining only a part of the DAWG at once, but I don't know how. Oh, and the whole thing fails horribly if the very first part is a variable, but that's thankfully rare.
You may also look out for certain little things like all nodes of the same level having numerical values (more likely to be a variable), and I'd certainly check for common date patterns in the sample before building the DAWGs, factoring them out would make handling the blog-like patterns easier.
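To make the level-width heuristic concrete, here is a rough Python sketch using a plain trie of path segments; note that a trie does not merge identical leaves the way a DAWG would, and the above-the-mean test is just the naive rule described above:
def build_trie(urls):
    # Nested dicts: one node per distinct path segment.
    trie = {}
    for url in urls:
        node = trie
        for part in url.strip('/').split('/'):
            node = node.setdefault(part, {})
    return trie

def wide_levels(trie):
    # Width of a level = number of distinct nodes at that depth.
    widths = []
    level = [trie]
    while level:
        width = sum(len(node) for node in level)
        if width:
            widths.append(width)
        level = [child for node in level for child in node.values()]
    mean = sum(widths) / len(widths)
    return [(depth + 1, w) for depth, w in enumerate(widths) if w > mean]

urls = ['/article/1/view', '/article/2/view', '/article/1323/view', '/article/1/edit']
print(wide_levels(build_trie(urls)))  # [(2, 3), (3, 4)]: the ID level, plus the unmerged leaves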
(Oh and, adding the "algorithm" tag would probably attract more attention to the question.)

eval in template strings

I'm considering porting a rather unwieldy bash script to Python, but I'm stuck on how to handle the following aspect: the point of the script is to generate a png image from dynamically fetched data. The bash script grabs the data and builds a very long invocation of the convert utility, with lots of options. Python's template strings seemed like a good solution (I would vastly prefer to stay within the standard library, since I'll be deploying to shared hosting), but I discovered that you can't evaluate expressions in them as you can in bash:
>>> from string import Template
>>> s = Template('The width times one is ${width}')
>>> s.substitute(width=45)
'The width times one is 45'
>>> t = Template('The width times two is ${width*2}')
>>> t.substitute(width=45)
# Raises ValueError
Since my bash script depends quite heavily on such arithmetic (otherwise the number of variables to keep track of would increase enormously), I'd like to know if there's a way to emulate this behavior in Python. I saw that this question, which asks roughly the same thing, has a comment reading:
"This would be very unPythonic, because it's counterintuitive -- strings are just strings, they shouldn't run code!"
If this is the case, what would be a more idiomatic way to approach this problem?
The proposed answer to the question linked above is to use string formatting with either the % syntax or the format() function, but I don't think that would work well with the number of variables in my string (around 50).
Why not use built-in string formatting?
width = 45
"Width times one is {width}".format(width=width)
"Width times two is {width}".format(width=2*width)
results in
Width times one is 45
Width times two is 90
The Pythonic solution to this problem is to forget about string formatting and pass a list of arguments to one of the subprocess functions, e.g.
# I have no idea about convert's command-line usage,
# so here's an example using echo.
import subprocess
subprocess.call(["echo", str(1 + 1), "bla"])
That way, there's no need to build a single string and no need to worry about quoting.
You probably need a better templating engine. Jinja2 supports this kind of thing and a lot more. I don't think the standard library has anything equally powerful, but from what I can tell the library is pure Python, so you can integrate it into your application by just copying it along.
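For instance, expression evaluation inside placeholders works out of the box in Jinja2:
from jinja2 import Template

t = Template('The width times two is {{ width * 2 }}')
print(t.render(width=45))  # The width times two is 90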
If Jinja doesn't fit you for some reason, have a look at the Python wiki, which has a section specifically for those kinds of libraries. Amongst them is the very lightweight Templite, which is only one class and seems to do exactly what you need.
The task is not that hard, so why not write some code for fun? Here is a function that almost does what you want:
import re

def TempEval(template, **kwargs):
    # Evaluate each ${...} placeholder with the keyword arguments
    # as the namespace. Note that eval runs arbitrary code, so only
    # use this with trusted templates.
    mark = re.compile(r'\$\{(.*?)\}')
    for item in mark.findall(template):
        template = template.replace('${%s}' % item, str(eval(item, {}, kwargs)))
    return template

print(TempEval('The width times one is ${width}', width=5))
# The width times one is 5
print(TempEval('The width times two is ${width*2}', width=5))
# The width times two is 10

How to solve a linear system in Linux shell?

Does anyone know of a Linux command that reads a linear system of equations from its standard input and writes the solution (if exists) in its standard output?
I want to do something like this:
generate_system | solve_system
You can probably write your own such command using this package.
This is an old question, but showed up in my searches for this problem, so I'm adding an answer here.
I used Maxima's solve function. Wrangling the input/output to/from Maxima is a bit of a challenge, but it can be done.
Prepare the system of equations as a comma-separated list -- for example, EQs="C[1]+C[2]=1,C[1]-C[2]=2". I wanted a solution for an unknown number of variables, so I used C[n], but you can use plain variable names.
Prepare a list of the variables you wish to solve for -- EQ_VARS="C[1],C[2]".
Maxima will echo all inputs, wrap long lines, and return a solution in the form [C[1]=...,C[2]=..]. The invocation below deals with all of these.
Taken together, this becomes
OUT_VALS=( \
$(maxima --very-quiet \
--batch-string="display2d:false\$linel:9999\$print(map(rhs,float(solve([$EQs],[$EQ_VARS]))[1]))\$" \
| tail -n 1 \
| tr -c '0-9-.e' ' ') )
which will place the solution values into the array $OUT_VALS.
Note that this only parses the Maxima output correctly if your problem is exactly constrained -- if you have zero solutions, or more than one, the output will not be parsed correctly.
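Alternatively, if Python with NumPy is available, the solve_system filter can be a few lines. A minimal sketch, assuming each input line holds one equation's coefficients followed by its right-hand side (so C[1]+C[2]=1, C[1]-C[2]=2 becomes the lines "1 1 1" and "1 -1 2"):
#!/usr/bin/env python3
# Hypothetical solve_system: read an augmented matrix from stdin,
# one equation per line, and print the solution vector.
import sys
import numpy as np

rows = [list(map(float, line.split())) for line in sys.stdin if line.strip()]
A = np.array([row[:-1] for row in rows])
b = np.array([row[-1] for row in rows])
print(*np.linalg.solve(A, b))
Used as generate_system | ./solve_system.py, it has the same caveat as the Maxima version: numpy.linalg.solve raises LinAlgError unless the system is square with exactly one solution.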

Identifying frequent formulas in a codebase

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of different variable names used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.
The string matching is just the low-hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in a different order. For example, suppose you have:
X+Y
Y+X
Your string matching approach won't realize that those are effectively the same. If you want to go a bit deeper, I think you need to parse the formulas into an AST and actually compare the ASTs. If you did that, you could see that the trees are actually the same, since the binary operator '+' is commutative.
You could also apply reduction rules to rewrite complex expressions into simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there. So the approach, sketched in code below, is:
1. Parse into an AST.
2. Reduce and optimize the functions.
3. Compare the resulting AST to other ASTs.
4. If you find a match, replace them with a call to a shared function.
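Here is a toy sketch of step 1 and the commutativity part of step 2, using Python's ast module as a stand-in parser; your DSL would need its own parser, so treat the details as assumptions:
import ast

def normalize(node):
    # Sort the operands of the commutative '+' recursively, so that
    # X+Y and Y+X normalize to the same tree.
    if isinstance(node, ast.BinOp):
        node.left = normalize(node.left)
        node.right = normalize(node.right)
        if isinstance(node.op, ast.Add) and ast.dump(node.left) > ast.dump(node.right):
            node.left, node.right = node.right, node.left
    return node

def canonical(formula):
    return ast.dump(normalize(ast.parse(formula, mode='eval').body))

print(canonical('X+Y') == canonical('Y+X'))  # True
print(canonical('X+Y') == canonical('X+Z'))  # False
Rewrites that use distributivity, like (X * A) + (X * B) → X * (A + B), would need genuine rewrite rules on top of this.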
I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.
You would then be able to run queries and see the most used formulas, which ones appear next to each other, etc.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter
You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo which probably won't work since it uses spaces as delimiters.
