I am creating a generic List of unique strings. My strings are formatted like GBP/101-P506, but sometimes the same value is written as GBP-101-P-506. Both of these strings have to be considered the SAME. How can I compare such strings?
The most straightforward way would be to replace the special characters with empty strings and compare the results.
Use temporary variables if you don't want to modify your originals.
RegEx the input and normalize the data before entering it into your data structure. If you don't want to change the original strings, you will have to consider all possible valid values anytime you need to perform operations on the strings.
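The normalize-then-compare idea from both answers can be sketched as follows. This is only an illustration in Python, assuming that every non-alphanumeric character is insignificant; the helper names `normalize` and `unique_codes` are hypothetical, not part of any library.

```python
import re

def normalize(code: str) -> str:
    """Return the comparison key: the string with all non-alphanumerics removed."""
    return re.sub(r"[^0-9A-Za-z]", "", code)

def unique_codes(codes):
    """Keep the first spelling seen for each normalized key; originals stay unmodified."""
    seen = {}
    for code in codes:
        seen.setdefault(normalize(code), code)
    return list(seen.values())

print(unique_codes(["GBP/101-P506", "GBP-101-P-506", "USD/7"]))
# ['GBP/101-P506', 'USD/7']
```

Keying a dictionary on the normalized form keeps the original spelling available, which matches the suggestion to work on temporary values rather than changing the originals.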
I have this file where I want to make a conditional check for any cell that contains the letter combination "_SOL", or where the string is followed by any numeric character like "_SOL1524", and stop looking after that. So I don't want matches for "_SOLUTION" or "_SOLothercharactersthannumeric".
So when I use the following formula, I also get results for words like "_SOLUTION":
=IF(ISNUMBER(FIND("_SOL",A1))=TRUE,"Yay","")
How can I avoid this and only get a match for "_SOL" or "_SOLnumericvalue" (one numeric character)?
Clarification: The whole strings may be "Blabla_SOL_BLABLA", "Blabla_SOLUTION_BLABLA" or "Blabla_SOL1524_BLABLA"
Maybe this, which will check if the character after "_SOL" is numeric.
=IF(ISNUMBER(VALUE(MID(A1,FIND("_SOL",A1)+4,1))),"Yay","")
Or, as per OP's request and suggestion, to include the possibility of an underscore after "SOL"
=IF(OR(ISNUMBER(VALUE(MID(A1,FIND("_SOL",A1)+4,1))),ISNUMBER(FIND("_SOL_",A1))),"Yay","")
Here is an alternative way to check whether your string contains SOL followed either by nothing or by a numeric value of any length:
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","1</s><s>")&"</s></t>","//s[substring-after(.,'SOL')*0=0]")>0),"Yey","Nay")
This is useful in the unfortunate event that you encounter something like SOL1TEXT. Or, maybe safer (in case you have text like AEROSOL):
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","</s><s>")&"</s></t>","//s[translate(.,'1234567890','')='SOL']")>0),"Yey","Nay")
And to also rule out text like 123SOL123, you could even do:
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","1</s><s>")&"</s></t>","//s[starts-with(., 'SOL') and substring(., 4)*0=0]")>0),"Yey","Nay")
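For comparison outside Excel, the same rule can be written as a regular expression. The sketch below is in Python and assumes the rule is "_SOL followed by a digit, an underscore, or the end of the string"; the name `has_sol` is hypothetical.

```python
import re

# "_SOL" must be followed by a digit, an underscore, or the end of the string.
SOL_PATTERN = re.compile(r"_SOL(?=\d|_|$)")

def has_sol(text: str) -> bool:
    """True when the text contains a standalone or numbered _SOL marker."""
    return SOL_PATTERN.search(text) is not None

for s in ["Blabla_SOL_BLABLA", "Blabla_SOLUTION_BLABLA", "Blabla_SOL1524_BLABLA"]:
    print(s, has_sol(s))
```

The lookahead only inspects the character after "_SOL", so a value such as "_SOL1TEXT" would still match; tightening that would need a stricter pattern.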
Say I have a simple bytecode-like file format for saving data.
If I want to store a string, should I do it like in source files, where all characters between two occurrences of a certain delimiter byte form the string,
or should I first store the length of the string then the string bytes?
Or are both solutions horrible and if so which one can I use?
It depends on whether you want to store:
- a single string
- a number of strings
- different length strings
- all the same length
For all of the above, it may also matter if your strings contain:
- any characters
- only certain characters
- formatting
In general, you should use Unicode.
For a single string, you can simply use an entire file to contain the string; the end-of-file is then the end of the string, so there is no need to store its length.
If the strings aren't all (around) the same length you can use an inline separator to separate the strings. Often the newline character is useful for this (especially since a lot of programming languages support this way of reading in a file line-by-line), but other markers such as tab are common.
CSV text files often use double quotes to enclose strings that contain commas (or other column separator) (which would otherwise indicate the next column value was starting), or line-breaks (which would otherwise indicate the next row).
Of course, now you have the problem of how to store a double quote in your string.
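One common answer to that problem is to double the quote character inside a quoted field. As an illustration, Python's standard `csv` module does this by default; the sample rows below are made up for the example.

```python
import csv
import io

# Made-up rows containing a comma, double quotes, and a line break.
rows = [
    ["name", "remark"],
    ["Smith, John", 'He said "hi"'],
    ["multi", "line one\nline two"],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)  # the quote inside the field is written as ""

# Reading the text back recovers the original fields unchanged.
restored = list(csv.reader(io.StringIO(text)))
print(restored == rows)
```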
If you want to store formatting, you can use a markup language (html) or it may be enough to allow for line breaks and/or some markdown.
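The other option raised in the question, storing the length first and then the bytes, avoids the escaping problem entirely because the content is never scanned for a terminator. A minimal sketch in Python, assuming a 4-byte little-endian length prefix and UTF-8 encoding (both are choices made for this example, not requirements):

```python
import io
import struct

def write_string(stream, text: str) -> None:
    """Write a 4-byte little-endian byte count followed by the UTF-8 bytes."""
    data = text.encode("utf-8")
    stream.write(struct.pack("<I", len(data)))
    stream.write(data)

def read_string(stream) -> str:
    """Read the length prefix, then exactly that many bytes."""
    (length,) = struct.unpack("<I", stream.read(4))
    return stream.read(length).decode("utf-8")

buf = io.BytesIO()
write_string(buf, 'any "quotes", commas,\nand newlines')
write_string(buf, "héllo")
buf.seek(0)
print(repr(read_string(buf)), repr(read_string(buf)))
```

Note that the prefix counts encoded bytes, not characters, which matters as soon as the text is not pure ASCII.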
I have a string variable with lots of parentheses and other punctuation e.g. _LSC Debt licensed work. How can I easily convert it to a numeric variable when I already have a specified code list for it? i.e. I don't want it to automatically recode everything because it uses the wrong values against the labels.
Create a dataset with two variables: a string holding the current messy name and a numeric variable holding the new code. Then, with both the original dataset and the lookup one sorted by the string, do MATCH FILES specifying a table match (or use Data > Merge Files > Add Variables).
You can prepare a separate file which includes two variables:
- one contains each of the possible values in the original string variable to be recoded (make sure the name and width are the same as your original variable)
- the second contains the new values you want to recode to.
Once you have set this up, match the files like this:
get file="filepath\Your_Value_Table.sav".
sort cases by YourOriginalVarName.
dataset name ValTab.
get file="filepath\Your_Original_File.sav".
sort cases by YourOriginalVarName.
match files /file=* /table=ValTab /by YourOriginalVarName.
exe.
At this point your original file will contain a new variable that has the codes you wanted.
In general I agree with the solution provided by others. However, I would like to suggest an extra step, which could make your look-up file (see the answer of eli-k and JKP) a bit better.
The point is that your string variable with lots of parentheses and other punctuation probably also has different ways to write the same thing.
For example:
_LSC Debt licensed work
LSC Debt licensed work
_LSC Debt Licensed Work
etc.
You could create a lookup-table with three variables: the unique values of the original string variable, a cleaned-up version of that variable, and finally the numeric value you want to attach.
The advantage of the cleaned-up version is that you can identify more easily the same value although it is written differently.
You could clean up using several functions:
string CleanedUpVersion (A40).
compute CleanedUpVersion = REPLACE(RTRIM(LTRIM(UPCASE(YourOriginalVarName))),'_','').
execute.
In this basic example we convert to capital letters, delete leading and trailing blanks and remove the underscore by replacing it by nothing.
Overall, this helps you avoid assigning different numbers to values in your original variable that are spelled differently but mean the same thing and should therefore share one number.
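The same clean-up-then-look-up idea, sketched in Python for readers who want to try it outside SPSS. The code value 1 and the names `clean`, `CODES`, and `recode` are hypothetical placeholders for your own code list.

```python
def clean(label: str) -> str:
    """Upper-case, strip leading/trailing blanks, and remove underscores."""
    return label.upper().strip().replace("_", "")

# Hypothetical lookup table: cleaned-up label -> numeric code.
CODES = {"LSC DEBT LICENSED WORK": 1}

def recode(label: str):
    """Return the numeric code, or None when the label has no entry."""
    return CODES.get(clean(label))

for raw in ["_LSC Debt licensed work", " LSC Debt licensed work", "_LSC Debt Licensed Work"]:
    print(repr(raw), recode(raw))
```

Returning None for unknown labels makes it easy to list the values that still need an entry in the lookup table.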
I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
Problem is it contains both integers and strings.
I found some information on how to get out the integers only.
But haven't been able to see if there's any way to get all the data.
My question: is that possible?
If that is not possible, is there then any way to find the numbers on each line and throw everything else away without having to choose the columns?
I need this specifically because it seems I cannot use str2num on a whole line at a time.
Almost anything is possible, you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(@(x)regexp(x, '\d+', 'match'), C, 'UniformOutput', false);
The result here is a cell array of cell arrays of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'UniformOutput', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
Good luck!
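For comparison, here is the same approach sketched in Python, using two lines from the question as sample data. The helper names are hypothetical; `split_fields` shows one possible way to keep the text fields as well, assuming plain comma-separated values without quoting.

```python
import re

lines = [
    "Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green",
    "Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red",
]

def numbers_in(line: str):
    """Return every run of digits in the line as an int, discarding the rest."""
    return [int(tok) for tok in re.findall(r"\d+", line)]

def split_fields(line: str):
    """Keep all fields: digit-only fields become ints, everything else stays text."""
    return [int(f) if f.isdigit() else f for f in line.split(",")]

print(numbers_in(lines[1]))
print(split_fields(lines[1]))
```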
I have come across the following issue with Fortran: that in reading a character array, for example, or any list in actuality, from a data file with fmt=*, both non-interquote blanks AND commas are natively considered as delimiters for the elements in the array/list. The fact that commas act as delimiters is a big problem for me.
So the question is: do you know of any option or compiler directive in Fortran that makes commas in input files count as ordinary characters rather than delimiters,
with blanks being the only delimiters? As a specific example, I would like reading a record like:
x,y,z
with:
read (7,*) adummy
would result in adummy (a scalar character variable) getting the value x,y,z not x.
Any help would be most welcome.
The solution is to specify a format that matches your data record, i.e. use the character edit descriptor (A) in the format:
read(7,fmt='(A)')adummy
will result in adummy having value x,y,z, assuming it is a variable of sufficient length.
However, this method does not treat blanks as delimiters either. If you want commas read as ordinary characters while blanks still act as delimiters, the common approach is to read the whole record into a character variable and split it into separate variables afterwards.
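The read-then-split logic itself is language independent. As a minimal illustration in Python (not Fortran), with a made-up record, splitting on blanks only leaves the commas inside each field:

```python
# Made-up record: fields are separated by one or more blanks, commas are data.
record = "x,y,z  a,b   c"

# str.split() with no argument splits on runs of whitespace only.
fields = record.split()
print(fields)  # ['x,y,z', 'a,b', 'c']
```

In Fortran the equivalent step would be done by scanning the character variable for blanks after the formatted read.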