SPSS compare string function - string

I have a file composed of many strings. For each string, I want to create substrings of length 4 and then compare each substring with a dictionary of words from another SPSS file. For example, if I have the string "transport" I want to create a list of 4-letter strings (e.g., 'tran', 'rans', 'ansp', etc.). For each of these 4-letter strings, I want to know if it exists in another file with a long list of words. Here is my syntax in SPSS:
*rawNonword is the name of the string in my first file.
compute chars = char.length(rawNonword).
string holder (A50).
loop #i = 1 to chars-4.
compute holder = char.substr(rawNonword, #i, 4).
*here I would like to compare holder with the strings in another file.
end loop.
execute.
I realize that the merge and match functions are normally used in SPSS, but it seems as if I can't use them inside a loop. I believe this problem is fairly easy in python, but I need to do this task in SPSS. Is there an easy function in SPSS that will return a value of 1 or true if the 4-letter string exists in another file?

Certainly easier to do using the Python plugin with the extendedTransforms.vlookup function, but in traditional syntax, you could create a variable holding all the four-letter fragments, sort both files, and use a TABLE match with MATCH FILES using that variable as the key.

Related

How to Automatically add thousand separators for every number in a string?

How can i create a thousand separator for every number which is in my string?
So for example this string:
string = "123456,78+1234"
should be displayed as:
TextView = "123.456,78+1.234"
And the string should be editable, so the thousand separator should adapt when i remove or add a digit.
I have already read all the posts I could find about it, but I could never find an up-to-date working answer. So I would be really grateful for your help!
Your question contains two sub-questions:
A. You want to add thousand separators to a string which contains a group of numbers.
B. You want it to change.
And the answers are:
A: In your example there's , as a delimiter, so you need to split the string using this delimiter to an array of strings.
Then iterate over them and have your dots added to every 3nth index of their characters; you can also use String.format("%,d", substr.toLong()).
Lastly, append all of the strings back together with , as the separator.
B: This one can be done in different ways. You may store the original string somewhere and observe it, so when it changes it goes to the function which does A, and use the function result the way you like (which I suppose is to be set in a TextView).

SPSS converting a string into a numeric variable issue

I have a string variable with lots of parentheses and other punctuation e.g. _LSC Debt licensed work. How can I easily convert it to a numeric variable when I already have a specified code list for it? i.e. I don't want it to automatically recode everything because it uses the wrong values against the labels.
Create a dataset with two variables: a string holding the current messy name and a numeric variable holding the new code. Then, with both the original dataset and the lookup one sorted by the string, do MATCH FILES specifying a table match (or use Data > Merge Files > Add Variables).
You can prepare a separate file which includes two variables:
- one contains each of the possible values in the original string variable to be recoded (make sure the name and width are the same as your original variable)
- the second contains the new values you want to recode to.
when you set this up, match the files like this:
get file="filepath\Your_Value_Table.sav".
sort cases by YourOriginalVarName.
dataset name ValTab.
get file="filepath\Your_Original_File.sav".
sort cases by YourOriginalVarName.
match files /file=* /table=ValTab /by YourOriginalVarName.
exe.
At this point your original file will contain a new variable that has the codes you wanted.
In general I agree with the solution provided by others. However, I would like to suggest an extra step, which could make your look-up file (see the answer of eli-k and JKP) a bit better.
The point is that your string variable with lots of parentheses and other punctuation probably also has different ways to write the same thing.
For example:
_LSC Debt licensed work
LSC Debt licensed work
_LSC Debt Licensed Work
etc.
You could create a lookup-table with three variables: the unique values of the original string variable, a cleaned-up version of that variable, and finally the numeric value you want to attach.
The advantage of the cleaned-up version is that you can identify more easily the same value although it is written differently.
You could clean up using several functions:
string CleanedUpVersion (A40).
compute CleanedUpVersion = REPLACE(RTIM(LTRIM(UPCASE(YourOriginalVarName))),'_','').
execute.
In this basic example we convert to capital letters, delete leading and trailing blanks and remove the underscore by replacing it by nothing.
Overall this could help to avoid giving different numbers to unique values in your original variable that mean the same thing, while you would like them to have the same number.

String Comparison ignoring special characters C#

I am creating a generic List if unique strings.My string formats are like GBP/101-P506 some time it could be GBP-101-P-506. Both of these strings have to be considered as SAME. how could I compare such strings?
Most straight forward way would be, to replace the special characters with empty strings and compare the results...
Use temporary variables if you don't want to modify your originals.
Regards
RegEx the input and normalize the data before entering it into your data structure. If you don't want to change the original strings, you will have to consider all possible valid values anytime you need to perform operations on the strings.

Loading dataset containing both strings and number

I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
Problem is it contains both integers and strings.
I found some information on how to get out the integers only.
But haven't been able to see if there's any way to get all the data.
My question is that possible ??
If that is not possible, is there then any way to find the numbers on each line and throw everything else away without having to choose the columns?
I need specifically since it seems I cannot use str2num on a whole line at a time.
Almost anything is possible, you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(#(x)regexp(x, '\d+', 'match'), C, 'Uniform', false);
The result here is a cell array of cell array of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(#(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
Good luck!

compare a string to a cell array of srings in matlab and find the most similar

I have a list of images stored in a directory. They are all named. My GUI reads all the images and saves their names in a cell array. Now I have added a editable box that the user can type in a name and the program will show that image. The problem is I want the program to take into account typos and misspellings by the user and find the most similar file name to the user typed word. Can you please help me?
Many Thanks,
Hamid
You should read this WP article: Approximate string matching and look at "Calculation of distance between strings" on FEx.
I think you should use the longest common subsequence algorithm to approximately compare strings.
Here is a matlab implementation:
http://www.mathworks.com/matlabcentral/fileexchange/24559-longest-common-subsequence
After, just do something like that:
[~,ind]=min(cellarray( #(x) LCS(lower(userInput),lower(x)), allFileNames));
chosenFile=allFileName{ind};
(the function LCS is the longest common subsequence algorithm, and the functionlower converts to lower case)
Not exactly what you are looking for, but you can compare the first few characters of the strings ignoring case to find a close match. See the command strncmpi:
strncmpi Compare first N characters of strings ignoring case.
TF = strncmpi(S,C,N) performs a case-insensitive comparison between the
first N characters of string S and the first N characters in each element
of cell array C. Input S is a character vector (or 1-by-1 cell array), and
input C is a cell array of strings. The function returns TF, a logical
array that is the same size as C and contains logical 1 (true) for those
elements of C that are a match, except for letter case, and logical 0
(false) for those elements that are not. The order of the two input
arguments is not important.

Resources