String Handling in Talend

I have this kind of data,
12345 Lipa AVE, AKA 1234 LIpa AVE, Lipa City, LP, 12345
I want to transform this so that the address is a single field.
All the data I'm going to process has one comma separating the address parts; the other case is the two-comma form shown above.
An example of the one-comma case is below:
12345 Lipa AVE, Lipa City, LP, 12345

The simplest solution is to unify the structure first and then do the mapping. In this case that means converting the 4-column structure (one-comma case) into the 5-column structure (two-comma case), where the second field is empty.
The diagram is the following:
tFileInputFullRow -> tJavaRow -> tExtractDelimitedField -> tMap -> tFileOutputDelimited
So first read the full row, then detect the case and insert the extra column if necessary. The tJavaRow code is the following:
output_row.line = "";
String[] elements = input_row.line.split(",");
if (elements.length == 4) {
    elements[0] += ",";  // insert an empty second field after the first address part
}
for (String element : elements) {
    output_row.line += element + ",";
}
In tExtractDelimitedField set the separator to comma, and finally in the tMap merge the two address fields into one:
row3.address2 != null && !row3.address2.equals("") ? row3.address1 + "," + row3.address2 : row3.address1
The tExtractDelimitedField can be skipped by changing the tJavaRow output schema and passing the array elements one by one.
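Outside Talend, the same normalize-then-merge logic can be sketched in plain Python (a minimal sketch; the function names are mine, not Talend's):

```python
def normalize(line):
    """Unify a 4-field row into the 5-field structure by inserting an empty second field."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) == 4:
        fields.insert(1, "")  # empty address2, mirroring elements[0] += ","
    return fields

def merge_address(fields):
    """Merge address1/address2 back into one field, mirroring the tMap ternary."""
    address1, address2, city, state, zip_code = fields
    merged = address1 + "," + address2 if address2 else address1
    return [merged, city, state, zip_code]
```

Both row shapes now reduce to the same 4-column output.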


Transpose with default headers in Excel

I have simple data which is
Name Age
Venky 20
Anil 22
Output should be like :
Name : Venky
Age : 20
Name : Anil
Age : 22
Note: each record should include the header values.
I have tried multiple ways, apart from macros.
Can you please give me your inputs?
Without more specific context from the question, we can combine BYROW, MAP, TOCOL, TEXTJOIN, TEXTSPLIT and SUBSTITUTE to get the expected result (some of these are relatively new Excel functions):
=LET(y,SUBSTITUTE(TEXTJOIN(";",,
BYROW(A2:C4, LAMBDA(x, TEXTJOIN(";",,MAP(TOCOL(A1:C1), TOCOL(x),
LAMBDA(a,b, a & ":"& b & ";")))))), ";;",";"),
TEXTSPLIT(LEFT(y, LEN(y) - 1),,";"))
The main idea is to create on the fly, for each row of the input data (A1:B3, including the headers), a new array via MAP with the following structure (let's call it tempArray):
| Name:name1; |
|-------------|
| Age:age1; |
Note: it also works for more than two columns in the header; in that case one row per header item is created.
For example:
=MAP(TOCOL(A1:B1), TOCOL(A2:B2), LAMBDA(a,b, a & ":"& b & ";"))
For the first row of the input data it will return:
| Name:Venky; |
|-------------|
| Age:20; |
Then TEXTJOIN converts tempArray into text, so every element produced by BYROW (i.e. x) becomes a string, with records delimited by ;. For the first row it will be:
Name:Venky;;Age:20;
SUBSTITUTE is used to remove the extra ; generated by TEXTJOIN. For the first row the result would be:
Name:Venky;Age:20;
After executing SUBSTITUTE the intermediate result would be:
Name:Venky;Age:20;Name:Anil;Age:22;
We use LEFT to remove the extra ; at the end. Now we have a string that has each row delimited by ;, so the string is ready to be converted back to an array using TEXTSPLIT.
Sample adding an additional record and Last Name as a new header item:
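For reference, the header:value interleaving the formula produces can be sketched in Python (a hypothetical helper, not part of Excel):

```python
def transpose_with_headers(headers, rows):
    """Emit one 'Header : value' line per header item, for each record in turn."""
    return [f"{h} : {v}" for row in rows for h, v in zip(headers, row)]

print(transpose_with_headers(["Name", "Age"], [["Venky", 20], ["Anil", 22]]))
# ['Name : Venky', 'Age : 20', 'Name : Anil', 'Age : 22']
```

Adding a new header column (e.g. Last Name) just adds one more line per record, matching the note above.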

Remove words of 1-3 characters from a string in SQL

From a space-delimited string, I would like to remove all words that are 1 to 3 characters long.
For example: this string
LCCPIT A2 LCCMAD B JBPM_JIT CCC
should become
LCCPIT LCCMAD JBPM_JIT
So the words A2, B and CCC are removed (since they are 2, 1 and 3 characters long). Is there a way to do it? I think I could use REGEXP_REPLACE, but I didn't find the correct regular expression to get this result.
Split the string into words and aggregate back only those substrings whose length is greater than 3.
Sample data:
SQL> with test (col) as
  2    (select 'LCCPIT A2 LCCMAD B JBPM_JIT CCC' from dual)
  3  select listagg(val, ' ') within group (order by lvl) result
  4  from (select regexp_substr(col, '[^ ]+', 1, level) val,
  5               level lvl
  6        from test
  7        connect by level <= regexp_count(col, ' ') + 1
  8       )
  9  where length(val) > 3;
RESULT
--------------------------------------------------------------------------------
LCCPIT LCCMAD JBPM_JIT
SQL>
I prefer a regex replacement trick:
SELECT TRIM(REGEXP_REPLACE('LCCPIT A2 LCCMAD B JBPM_JIT CCC',
                           '(^|\s+)\w{1,3}(\s+|$)', ' ')) AS result
FROM dual;
-- output is 'LCCPIT LCCMAD JBPM_JIT'
The strategy above is to match any 1-, 2-, or 3-letter word, along with any surrounding whitespace, and replace it with a single space. The outer call to TRIM() is necessary to remove dangling spaces that can arise when the first or last word is removed.
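The same strategy translates to Python's re module (a sketch; drop_short_words is my name, and the lookahead variant is a small tweak so the trailing space stays unconsumed and adjacent short words still match):

```python
import re

def drop_short_words(s, max_len=3):
    """Remove words of 1..max_len characters. The (?=\\s|$) lookahead does not
    consume the following space, so consecutive short words are all matched."""
    pattern = rf'(^|\s+)\w{{1,{max_len}}}(?=\s|$)'
    return re.sub(pattern, '', s).strip()
```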

Replacing multiple string values in SAS

I have a dataset that I'm trying to clean up. One variable is gender, where I have 'F', 'Female', 'M', 'Male' and 'Unknown' as values. I want to change all the iterations of 'F' to show as 'Female' and all the 'M' values to show as 'Male'. I also have another variable called 'Ethnicity' which has values such as '1 - White', but I want it to show as 'White'.
I have tried to use tranwrd
gender=tranwrd(gender, "F", "Female");
But this replaces the 'Female' values as well to 'Femaleemale'
I have also attempted index:
IF index(lowcase(gender),"f") THEN gender="Female";
IF index(lowcase(gender),"m") THEN gender="male";
But the multiple If statements don't work.
As you discovered, TRANWRD is the wrong function for the value-transformation task at hand. INDEX is not right either, because in SAS any non-zero, non-missing value is true -- INDEX(source, excerpt) is therefore logically true whenever the excerpt is found anywhere in source.
For specific value transformations, compare against a literal value directly. For testing a specific single character you can lowercase it as you show, or use an IN list.
if gender in ('M', 'm') then gender = 'Male'; else
if gender in ('F', 'f') then gender = 'Female';
For extracting the ethnicity from a value of the form '# - ethnicity' you can, per @draycut, use the COMPRESS function with the keep-alphabetic-characters-only modifier (ka).
Another way to transform patterned values is to use regular expression search and replace.
* replace the leading '# - ' before the embedded ethnicity with the empty string;
ethnicity = prxchange('s/^\d+\s*-\s*//', 1, ethnicity);
See if you can use this as a template:
data have;
  input gender $ 1-7 Ethnicity $ 9-18;
  datalines;
F       1 - White
Female  White
Male    2 - Black
Unknown Black
m       1 - White
f       1 - White
;
data want;
  set have;
  if upcase(char(gender, 1)) = "M" then gender = "Male";
  else if upcase(char(gender, 1)) = "F" then gender = "Female";
  else gender = "Unknown";
  Ethnicity = compress(Ethnicity, , 'ka');
run;
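The same cleaning rules can be sketched in Python for comparison (the function names are mine):

```python
import re

def clean_gender(value):
    r"""First character decides: F/f -> Female, M/m -> Male, anything else -> Unknown."""
    return {"F": "Female", "M": "Male"}.get(value[:1].upper(), "Unknown")

def clean_ethnicity(value):
    r"""Strip a leading '<digits> - ' prefix, like the PRXCHANGE pattern s/^\d+\s*-\s*//."""
    return re.sub(r'^\d+\s*-\s*', '', value)
```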

Count number of occurrences of a string and relabel

I have a n x 1 cell that contains something like this:
chair
chair
chair
chair
table
table
table
table
bike
bike
bike
bike
pen
pen
pen
pen
chair
chair
chair
chair
table
table
etc.
I would like to rename these elements so they will reflect the number of occurrences up to that point. The output should look like this:
chair_1
chair_2
chair_3
chair_4
table_1
table_2
table_3
table_4
bike_1
bike_2
bike_3
bike_4
pen_1
pen_2
pen_3
pen_4
chair_5
chair_6
chair_7
chair_8
table_5
table_6
etc.
Please note that the underscore (_) is necessary. Could anyone help? Thank you.
Interesting problem! This is the procedure that I would try:
Use unique - the third output argument in particular - to assign each string in your cell array a unique ID.
Initialize an empty array, then create a for loop that goes through each unique string - given by the first output of unique - and creates a numerical sequence from 1 up to as many times as we have encountered this string. Place this numerical sequence in the corresponding positions where we have found each string.
Use strcat to attach each element in the array created in Step #2 to each cell array element in your problem.
Step #1
Assuming that your cell array is defined as a bunch of strings stored in A, we would call unique this way:
[names, ~, ids] = unique(A, 'stable');
The 'stable' option is important: the IDs are assigned to each unique string without re-ordering the elements alphabetically, which matters for getting the job done. names will store the unique names found in your array A, while ids contains a unique ID for each string that is encountered. For your example, this is what names and ids would be:
names =
'chair'
'table'
'bike'
'pen'
ids =
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
1
1
1
1
2
2
names is actually not needed in this algorithm. However, I have shown it here so you can see how unique works. Also, ids is very useful because it assigns a unique ID for each string that is encountered. As such, chair gets assigned the ID 1, followed by table getting assigned the ID of 2, etc. These IDs will be important because we will use these IDs to find the exact locations of where each unique string is located so that we can assign those linear numerical ranges that you desire. These locations will get stored in an array computed in the next step.
Step #2
Let's pre-allocate this array for efficiency. Let's call it loc. Then, your code would look something like this:
loc = zeros(numel(A), 1);
for idx = 1 : numel(names)
    id = find(ids == idx);
    loc(id) = 1 : numel(id);
end
As such, for each unique name we find, we look for every location in the ids array that matches this particular name found. find will help us find those locations in ids that match a particular name. Once we find these locations, we simply assign an increasing linear sequence from 1 up to as many names as we have found to these locations in loc. The output of loc in your example would be:
loc =
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
5
6
7
8
5
6
Notice that this corresponds with the numerical sequence (the right most part of each string) of your desired output.
Step #3
Now all we have to do is piece loc together with each string in our cell array. We would thus do it like so:
out = strcat(A, '_', num2str(loc));
What this does is that it takes each element in A, concatenates a _ character and then attaches the corresponding numbers to the end of each element in A. Because we want to output strings, you need to convert the numbers stored in loc into strings. To do this, you must use num2str to convert each number in loc into their corresponding string equivalents. Once you find these, you would concatenate each number in loc with each element in A (with the _ character of course). The output is stored in out, and we thus get:
out =
'chair_1'
'chair_2'
'chair_3'
'chair_4'
'table_1'
'table_2'
'table_3'
'table_4'
'bike_1'
'bike_2'
'bike_3'
'bike_4'
'pen_1'
'pen_2'
'pen_3'
'pen_4'
'chair_5'
'chair_6'
'chair_7'
'chair_8'
'table_5'
'table_6'
For your copying and pasting pleasure, this is the full code. Be advised that I've nulled out the first output of unique as we don't need it for your desired output:
[~, ~, ids] = unique(A, 'stable');
loc = zeros(numel(A), 1);
for idx = 1 : max(ids)
    id = find(ids == idx);
    loc(id) = 1 : numel(id);
end
out = strcat(A, '_', num2str(loc));
If you want an alternative to unique, you can work with a hash table, which in Matlab means using the containers.Map object. You can then keep a running count of each individual label and create the new labels on the go, as in the code below.
data = {'table','table','chair','bike','bike','bike'};
map = containers.Map(data, zeros(numel(data), 1)); % labels=keys, counts=values (zeroed)
new_data = data; % initialize cell array that will hold the outputs
for ii = 1:numel(data)
    map(data{ii}) = map(data{ii}) + 1;                        % increment count of current label
    new_data{ii} = sprintf('%s_%d', data{ii}, map(data{ii})); % format output
end
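The containers.Map approach maps one-to-one onto a dictionary; a minimal Python sketch of the same running-count idea (the function name is mine):

```python
from collections import defaultdict

def relabel(items):
    """Append _k to each item, where k is the running count of that item so far."""
    counts = defaultdict(int)
    out = []
    for item in items:
        counts[item] += 1
        out.append(f"{item}_{counts[item]}")
    return out

# relabel(['chair', 'chair', 'table', 'chair']) -> ['chair_1', 'chair_2', 'table_1', 'chair_3']
```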
This is similar to rayryeng's answer but replaces the for loop by bsxfun. After the strings have been reduced to unique labels (line 1 of code below), bsxfun is applied to create a matrix of pairwise comparisons between all (possibly repeated) labels. Keeping only the lower "half" of that matrix and summing along rows gives how many times each label has previously appeared (line 2). Finally, this is appended to each original string (line 3).
Let your cell array of strings be denoted as c.
[~, ~, labels] = unique(c);                       %// transform each string into a unique label
s = sum(tril(bsxfun(@eq, labels, labels.')), 2);  %// accumulated occurrence number
result = strcat(c, '_', num2str(s));              %// build result
Alternatively, the second line could be replaced by the more memory-efficient
n = numel(labels);
M = cumsum(full(sparse(1:n, labels, 1)));
s = M((1:n).' + (labels-1)*n);
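The lower-triangle counting trick (s(i) is how many positions up to and including i share label i) can be written out in plain Python as an O(n^2) sketch (my naming, not from the answer):

```python
def occurrence_counts(labels):
    """s[i] = number of positions j <= i with labels[j] == labels[i],
    i.e. the row sums of the lower triangle of the pairwise equality matrix."""
    return [sum(labels[j] == labels[i] for j in range(i + 1))
            for i in range(len(labels))]
```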
I'll give you pseudocode; try it yourself and post the code if it doesn't work:
Initialize a counter to 1
Iterate over the cell
    If counter > 1, check whether the string is the same as the previous value
        if so, increment the counter
    else
        reset the counter to 1
    end
    sprintf the string value + counter into a new array
Hope this helps!

Grouping common elements using awk

The following table illustrates a brief snapshot of the data that I wish to manipulate. I am looking for an awk script that will group similar elements into one group. For eg. if you look at the table below:
Numbers (1,2,3,4,6) should all belong to one group. So row1 row2 row4 row8 will be group "1"
Number 9 is unique and does not have any common elements. So it will reside alone in a separate group say group 2
Similarly numbers 5,7 will reside in one group say group 3 and so on...
The file:
heading1 heading2 numberlist group
name1 text 1,2,3 1
name2 text 2 1
name3 text 9 2
name4 text 1,4 1
name5 text 5,7 3
name6 text 7 3
name7 text 8 4
name8 text 6,2 1
I was searching for queries similar to mine and found this link. Grouping lists by common elements. But the solution is in C++ and not awk, which is my primary requirement.
Incidentally I also found this awk solution that is somewhat related to my query but it was devoid of handling of comma separated values.
awk script grouping with array
Numberlist i.e. $3 is my only consideration for grouping.
This problem seemed almost the same as one of my own problems, and I had used one column of your example to solve mine :) So...
[[bash_prompt$]]$ cat log ; echo "########"; \
> cat test.sh ;echo "########"; awk -f test.sh log
heading1 heading2 numberlist group
name1 text 1,2,3
name2 text 2
name3 text 9
name4 text 1,4
name5 text 5,7
name6 text 7
name7 text 8
name8 text 6,2
########
/^name/{
    i=0; j=0;
    split($3,a,",");
    for(var in a) {
        for(var1 in q) {
            split(q[var1],r,",");
            for(var2 in r) {
                if(r[var2] == a[var]) {
                    i=1;
                    j=((var1+1));
                }
            }
        }
    }
    if(i == 0) {
        q[length(q)] = $3;
        j=length(q);
    }
    print $1 "\t\t" $2 " \t\t" $3 "\t\t" j;
}
########
name1 text 1,2,3 1
name2 text 2 1
name3 text 9 2
name4 text 1,4 1
name5 text 5,7 3
name6 text 7 3
name7 text 8 4
name8 text 6,2 1
[[bash_prompt$]]$
Update:
split breaks its first argument at the delimiter given as the third argument and stores the pieces in the array named by the second argument. The main array here is q, which holds the members of each group; it is effectively an array of arrays, where the index of an element is the group id and the element is the comma-separated list of that group's members. So q[0]="1,2,3" means group 0 contains members 1, 2 and 3.
For each input line starting with name (/^name/), the third field (e.g. 1,2,3) is split into an array a. For each element of a, we walk every group stored in q (for(var1 in q)) and split that group into a temporary array r (split(q[var1],r,",")); e.g. "1,2,3" is split into r. Each element of r is then compared with the current element of a. If a match is found, the row's group is that group's index (array indices start from 0, group ids from 1, hence ((var1+1))). If no match is found, the field is added as a new group in q, and the new group id is the last index + 1, i.e. the length of the array.
Update:
/^name/{
    j=0;
    split($3,a,",");
    for(var in a) {
        if(q[a[var]] != 0) {
            j=q[a[var]]; i=1;
            break;
        }
    }
    j = (j == 0) ? ++k : j;
    for(var1 in a) {
        if(q[a[var1]] == 0) {
            q[a[var1]] = j;
        }
    }
    print $1 "\t\t" $2 " \t\t" $3 "\t\t" j;
}
Update:
The basis is that awk arrays are associative: each element is accessed by a string key. The earlier approach stored each group in an array keyed by group index, so for every column read we had to split each group into individual elements and compare each of them against each element of the column. Instead, we can store the elements themselves as keys, with the value at each key being the index of the group the element belongs to.
When we read a column, we split it into individual elements (split($3,a,",")) and check whether any element already has a group index (if(q[a[var]] != 0); in awk, looking up a missing key initializes it to 0, hence the != 0 test). If one is found, we take that element's group index as the column's index and break; otherwise j stays 0, and ++k then produces the next fresh group index. We then propagate that index to the column's elements that are not yet part of any group (there may be cases where elements of the same column belong to different groups; we take a first-come, first-served approach and do not overwrite the group index of elements already assigned elsewhere): for each element (for(var1 in a)), if it has no group yet (if(q[a[var1]] == 0)), set q[a[var1]] = j. All accesses are now direct lookups by element key, with no repeated splitting of groups for every element, hence the shorter running time. My first approach came from my own problem (mentioned in the first line), which involved more complex processing on a smaller data set; this one just needs simpler, more direct logic.
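The element-to-group-index idea from the second script can be sketched in Python (a sketch under the same first-come, first-served assumption; the names are mine):

```python
def assign_groups(numberlists):
    """Assign each row a group id based on shared elements in its number list.
    A row joins the group of the first of its elements seen before; otherwise
    it opens a new group. Unseen elements then inherit the row's group id."""
    element_group = {}  # element -> group id
    next_id = 0
    groups = []
    for numberlist in numberlists:
        elements = numberlist.split(",")
        group = next((element_group[e] for e in elements if e in element_group), None)
        if group is None:
            next_id += 1
            group = next_id
        for e in elements:
            element_group.setdefault(e, group)
        groups.append(group)
    return groups
```

On the sample numberlists this reproduces the group column from the question's table.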
