Replacing multiple string values in SAS - string

I have a dataset that I'm trying to clean up. One variable is gender where I have 'F','Female,'M','Male' and 'Unknown' as values. I want to change all the iterations of 'F' to show as 'Female' and all the 'M' values to show as 'Male'. I also have another variable called 'Ethnicity' which has values such as '1 - White' but I want it to show as 'White'.
I have tried to use tranwrd
gender=tranwrd(gender, "F", "Female");
But this replaces the 'Female' values as well to 'Femaleemale'
I have also attempted index:
IF index(lowcase(gender),"f") THEN gender="Female";
IF index(lowcase(gender),"m") THEN gender="male";
But the multiple If statements don't work.

As you discovered TRANWRD is the wrong function for the value transformation task at hand. Neither is INDEX because the true value in SAS is the state of non-zero and non-missing -- INDEX(source, excerpt) result will be a logical true for the case of finding the excerpt anywhere in source.
For specific value transformations use a direct literal value for comparison. For testing a specific single character you can do the lowercase as you show, or use an IN list.
if gender in ('M', 'm') then gender = 'Male'; else
if gender in ('F', 'f') then gender = 'Female';
For the case of extracting ethnicity from a value construct # - ethnicity you can , per #draycut, use the COMPRESS function with the keep alphabetic characters only option (ka).
Another way to transform patterned values is to use regular expression search and replace.
* replace leading # - before embedded ethnicity with no string (//);
ethnicity = prxchange ('/^\d+\s*-\s*//',1,ethnicity);

See if you can use this as a template
data have;
input gender $ 1-7 Ethnicity $ 9-18;
datalines;
F 1 - White
Female White
Male 2 - Black
Unknown Black
m 1 - White
f 1 - White
;
data want;
set have;
if upcase(char(gender, 1)) = "M" then gender = "Male";
else if upcase(char(gender, 1)) = "F" then gender = "Female";
else gender = "Unknown";
Ethnicity = compress(Ethnicity, , 'ka');
run;

Related

M language - extract 5 digit numbers only from string

I am trying to write a custom column in M to detect whether the field contains a 5 digit number, and then extract that 5 digit number into a new column. When employees incur an expense, they need to specify the job number if its for a job. If not, they usually type in text.
IF text.contains number like "#####" then have the new column have that 5 digit number, else null.
I am having incredible difficult on how to write in M. I tried doing this in
Assuming your data is in column [Column1] then try
Add column.. custom column ... with formula:
= try if Number.From([Column1])>0 and Text.Length(Text.From([Column1]))=5 then [Column1] else null otherwise null
that looks for Column1 being something that can evaluate to a number, and that has a length of 5 characters when viewed as a string, otherwise puts a null. This allows for leading zeroes like '01234 but does not attempt to account for mixed text/numbers such as A12345
if you need to remove alphas, try
= if Text.Length(Text.Select(Text.From([Column1]),{"0".."9"}))=5 then Text.Select(Text.From([Column1]),{"0".."9"}) else null
Given what you have written, that valid entries will be in the range of 10,000 to 99,999 and may or may not be preceded by a J, you can add a column to detect that.
//Ensure Column1 (or whatever it's real name is) is of type text and NOT any
#"Added Custom" = Table.AddColumn(#"Previous Step", "Job Number", each
let
x = Text.TrimStart(Text.Upper([Column1]),"J"),
n = try Number.From(x) otherwise 0
in
if n>=10000 and n<100000 then n else null)

How to find common elements in string cells?

I want to find the common elements in multiple (>=2) cell arrays of strings.
A related question is here, and the answer proposes to use the function intersect(), however it works for only 2 inputs.
In my case, I have more than two cells, and I want to obtain a single common subset. Here is an example of what I want to achieve:
c1 = {'a','b','c','d'}
c2 = {'b','c','d'}
c3 = {'c','d'}
c_common = my_fun({c1,c2,c3});
in the end, I want c_common={'c','d'}, since only these two strings occur in all the inputs.
How can I do this with MATLAB?
Thanks in advance,
P.S. I also need the indices from each input, but I can probably do that myself using the output c_common, so not necessary in the answer. But if anyone wants to tackle that too, my actual output will be like this:
[c_common, indices] = my_fun({c1,c2,c3});
where indices = {[3,4], [2,3], [1,2]} for this case.
Thanks,
Listed in this post is a vectorized approach to give us the common strings and indices using unique and accumarray. This would work even when the strings are not sorted within each cell array to give us indices corresponding to their positions within it, but they have to be unique. Please have a look at the sample input, output section* to see such a case run. Here's the implementation -
C = {c1,c2,c3}; % Add more cell arrays here
% Get unique strings and ID each of the strings based on their uniqueness
[unqC,~,unqID] = unique([C{:}]);
% Get count of each ID and the IDs that have counts equal to the number of
% cells arrays in C indicate that they are present in all cell arrays and
% thus are the ones to be finally selected
match_ID = find(accumarray(unqID(:),1)==numel(C));
common_str = unqC(match_ID)
% ------------ Additional work to get indices ----------------
N_str = numel(common_str);
% Store matches as a logical array to be used at later stages
matches = ismember(unqID,match_ID);
% Use ismember to find all those indices in unqID and subtract group
% lengths from them to give us the indices within each cell array
clens = [0 cumsum(cellfun('length',C(1:end-1)))];
match_index = reshape(find(matches),N_str,[]);
% Sort match_index along each column based on the respective unqID elements
[m,n] = size(match_index);
[~,sidx] = sort(reshape(unqID(matches),N_str,[]),1);
sorted_match_index = match_index(bsxfun(#plus,sidx,(0:n-1)*m));
% Subtract cumulative group lens to give us indices corres. to each cell array
common_idx = bsxfun(#minus,sorted_match_index,clens).'
Please note that at the step that calculates match_ID : accumarray(unqID(:),1) could be replaced by histc(unqID,1:max(unqID)). Also, histcounts be another alternative there.
*Sample input, output -
c1 =
'a' 'b' 'c' 'd'
c2 =
'b' 'c' 'a' 'd'
c3 =
'c' 'd' 'a'
common_str =
'a' 'c' 'd'
common_idx =
1 3 4
3 2 4
3 1 2
As noted in the comments to this question, there is a file in File Exchange called "MINTERSECT -- Multiple set intersection." at http://www.mathworks.com/matlabcentral/fileexchange/6144-mintersect-multiple-set-intersection that contains simple code to generalize intersect to multiple sets. In a nutshell, the code gets the output from performing intersect on the first pair of cells and then perform intersect on this output with the next cell. This process continues until all cells have been compared. Note that the author points out that the code is not particularly efficient but it may be sufficient for your use case.

String Handling in Talend

I have this kind of data,
12345 Lipa AVE, AKA 1234 LIpa AVE, Lipa City, LP, 12345
I want this to transform into this:
All the data that I'm going to process have 1 comma to separate the address and another case is the 2 comma above.
An example of the 1 comma is below,
12345 Lipa AVE, Lipa City, LP, 12345
The simplest solution is to unify the structure, and then make the mapping. In this case it means first convert the 4 column structure (1 comma case) into 5 columns (2 commas case) where the second field is empty.
The diagram is the following:
tFileInputFullRow -> tJavaRow -> tExtractDelimitedField -> tMap -> tFileOutputDelimited
So first read the full row, then detect the case and insert the extra column if necessary. The tJavaRow code is the following:
output_row.line = "";
String[] elements = input_row.line.split(",");
if(elements.length == 4)
elements[0] += ",";
for(String element:elements)
output_row.line += element + ",";
In tExtractDelimitedField set the separator to comma and finally in the tMap merge the two addresses field into one:
row3.address2 != null && !row3.address2.equals("") ? row3.address1 + "," + row3.address2 : row3.address1
The tExtractDelimitedField can be skipped in the tJavaRow by changing the output schema and then passing the array elements one by one.

Count number of occurences of a string and relabel

I have a n x 1 cell that contains something like this:
chair
chair
chair
chair
table
table
table
table
bike
bike
bike
bike
pen
pen
pen
pen
chair
chair
chair
chair
table
table
etc.
I would like to rename these elements so they will reflect the number of occurrences up to that point. The output should look like this:
chair_1
chair_2
chair_3
chair_4
table_1
table_2
table_3
table_4
bike_1
bike_2
bike_3
bike_4
pen_1
pen_2
pen_3
pen_4
chair_5
chair_6
chair_7
chair_8
table_5
table_6
etc.
Please note that the dash (_) is necessary Could anyone help? Thank you.
Interesting problem! This is the procedure that I would try:
Use unique - the third output parameter in particular to assign each string in your cell array to a unique ID.
Initialize an empty array, then create a for loop that goes through each unique string - given by the first output of unique - and creates a numerical sequence from 1 up to as many times as we have encountered this string. Place this numerical sequence in the corresponding positions where we have found each string.
Use strcat to attach each element in the array created in Step #2 to each cell array element in your problem.
Step #1
Assuming that your cell array is defined as a bunch of strings stored in A, we would call unique this way:
[names, ~, ids] = unique(A, 'stable');
The 'stable' is important as the IDs that get assigned to each unique string are done without re-ordering the elements in alphabetical order, which is important to get the job done. names will store the unique names found in your array A while ids would contain unique IDs for each string that is encountered. For your example, this is what names and ids would be:
names =
'chair'
'table'
'bike'
'pen'
ids =
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
1
1
1
1
2
2
names is actually not needed in this algorithm. However, I have shown it here so you can see how unique works. Also, ids is very useful because it assigns a unique ID for each string that is encountered. As such, chair gets assigned the ID 1, followed by table getting assigned the ID of 2, etc. These IDs will be important because we will use these IDs to find the exact locations of where each unique string is located so that we can assign those linear numerical ranges that you desire. These locations will get stored in an array computed in the next step.
Step #2
Let's pre-allocate this array for efficiency. Let's call it loc. Then, your code would look something like this:
loc = zeros(numel(A), 1);
for idx = 1 : numel(names)
id = find(ids == idx);
loc(id) = 1 : numel(id);
end
As such, for each unique name we find, we look for every location in the ids array that matches this particular name found. find will help us find those locations in ids that match a particular name. Once we find these locations, we simply assign an increasing linear sequence from 1 up to as many names as we have found to these locations in loc. The output of loc in your example would be:
loc =
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
5
6
7
8
5
6
Notice that this corresponds with the numerical sequence (the right most part of each string) of your desired output.
Step #3
Now all we have to do is piece loc together with each string in our cell array. We would thus do it like so:
out = strcat(A, '_', num2str(loc));
What this does is that it takes each element in A, concatenates a _ character and then attaches the corresponding numbers to the end of each element in A. Because we want to output strings, you need to convert the numbers stored in loc into strings. To do this, you must use num2str to convert each number in loc into their corresponding string equivalents. Once you find these, you would concatenate each number in loc with each element in A (with the _ character of course). The output is stored in out, and we thus get:
out =
'chair_1'
'chair_2'
'chair_3'
'chair_4'
'table_1'
'table_2'
'table_3'
'table_4'
'bike_1'
'bike_2'
'bike_3'
'bike_4'
'pen_1'
'pen_2'
'pen_3'
'pen_4'
'chair_5'
'chair_6'
'chair_7'
'chair_8'
'table_5'
'table_6'
For your copying and pasting pleasure, this is the full code. Be advised that I've nulled out the first output of unique as we don't need it for your desired output:
[~, ~, ids] = unique(A, 'stable');
loc = zeros(numel(A), 1);
for idx = 1 : numel(names)
id = find(ids == idx);
loc(id) = 1 : numel(id);
end
out = strcat(A, '_', num2str(loc));
If you want an alternative to unique, you can work with a hash table, which in Matlab would entail to using the containers.Map object. You can then store the occurrences of each individual label and create the new labels on the go, like in the code below.
data={'table','table','chair','bike','bike','bike'};
map=containers.Map(data,zeros(numel(data),1)); % labels=keys, counts=values (zeroed)
new_data=data; % initialize matrix that will have outputs
for ii=1:numel(data)
map(data{ii}) = map(data{ii})+1; % increment counts of current labels
new_data{ii} = sprintf('%s_%d',data{ii},map(data{ii})); % format outputs
end
This is similar to rayryeng's answer but replaces the for loop by bsxfun. After the strings have been reduced to unique labels (line 1 of code below), bsxfun is applied to create a matrix of pairwise comparisons between all (possibly repeated) labels. Keeping only the lower "half" of that matrix and summing along rows gives how many times each label has previously appeared (line 2). Finally, this is appended to each original string (line 3).
Let your cell array of strings be denoted as c.
[~, ~, labels] = unique(c); %// transform each string into a unique label
s = sum(tril(bsxfun(#eq, labels, labels.')), 2); %'// accumulated occurrence number
result = strcat(c, '_', num2str(x)); %// build result
Alternatively, the second line could be replaced by the more memory-efficient
n = numel(labels);
M = cumsum(full(sparse(1:n, labels, 1)));
s = M((1:n).' + (labels-1)*n);
I'll give you a psuedocode, try it yourself, post the code if it doesn't work
Initiate a counter to 1
Iterate over the cell
If counter > 1 check with previous value if the string is same
then increment counter
else
No- reset counter to 1
end
sprintf the string value + counter into a new array
Hope this helps!

How to change stringified numbers in data frame into pure numeric values in R

I have the following data.frame:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
# Note that the salary below is in stringified format.
# In reality there are more such stringified numerical columns.
salary <- as.character(c(21000, 23400, 26800))
df <- data.frame(employee,salary)
The output is:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
$ salary : Factor w/ 3 levels "21000","23400",..: 1 2 3
What I want to do is to convert the change the value from string into pure number
straight fro the df variable. At the same time preserve the string name for employee.
I tried this but won't work:
as.numeric(df)
At the end of the day I'd like to perform arithmetic on these numeric
values from df. Such as df2 <- log2(df), etc.
Ok, there's a couple of things going on here:
R has two different datatypes that look like strings: factor and character
You can't modify most R objects in place, you have to change them by assignment
The actual fix for your example is:
df$salary = as.numeric(as.character(df$salary))
If you try to call as.numeric on df$salary without converting it to character first, you'd get a somewhat strange result:
> as.numeric(df$salary)
[1] 1 2 3
When R creates a factor, it turns the unique elements of the vector into levels, and then represents those levels using integers, which is what you see when you try to convert to numeric.

Resources