Grouping common elements using awk - linux

The following table is a brief snapshot of the data I wish to manipulate. I am looking for an awk script that will group rows with common elements into one group. For example, in the table below:
Numbers (1,2,3,4,6) should all belong to one group. So row1 row2 row4 row8 will be group "1"
Number 9 is unique and does not have any common elements. So it will reside alone in a separate group say group 2
Similarly numbers 5,7 will reside in one group say group 3 and so on...
The file:
heading1 heading2 numberlist group
name1 text 1,2,3 1
name2 text 2 1
name3 text 9 2
name4 text 1,4 1
name5 text 5,7 3
name6 text 7 3
name7 text 8 4
name8 text 6,2 1
I searched for questions similar to mine and found "Grouping lists by common elements", but that solution is in C++ rather than awk, which is my primary requirement.
I also found an awk solution that is somewhat related to my question, "awk script grouping with array", but it does not handle comma-separated values.
Numberlist i.e. $3 is my only consideration for grouping.

This problem seemed almost the same as one of my own, and I had used one column from your example to solve mine :) So...
[[bash_prompt$]]$ cat log ; echo "########"; \
> cat test.sh ;echo "########"; awk -f test.sh log
heading1 heading2 numberlist group
name1 text 1,2,3
name2 text 2
name3 text 9
name4 text 1,4
name5 text 5,7
name6 text 7
name7 text 8
name8 text 6,2
########
/^name/{
    i=0; j=0;
    split($3,a,",");
    for(var in a) {
        for(var1 in q) {
            split(q[var1],r,",");
            for(var2 in r) {
                if(r[var2] == a[var]) {
                    i=1;
                    j=((var1+1));
                }
            }
        }
    }
    if(i == 0) {
        q[length(q)] = $3;
        j=length(q);
    }
    print $1 "\t\t" $2 " \t\t" $3 "\t\t" j;
}
########
name1 text 1,2,3 1
name2 text 2 1
name3 text 9 2
name4 text 1,4 1
name5 text 5,7 3
name6 text 7 3
name7 text 8 4
name8 text 6,2 1
[[bash_prompt$]]$
Update:
split splits its first argument on the delimiter passed as the third argument and stores the pieces in the array named by the second argument. The main array here is q, which holds the members of each group. It is effectively an array of arrays: the index of an element is the group id, and the element is the collection of all members of that group. So q[0]="1,2,3" means that group 0 contains the members 1, 2 and 3.
awk first reads a line that starts with name (/^name/) and breaks the third field (1,2,3) into the array a. For each element of a, we go through each group stored in q (for(var1 in q)) and split that group into a temporary array r (split(q[var1],r,",")), i.e. "1,2,3" is split into the array r. Each element of r is then compared with the element of a. If a match is found, the group's index becomes the index for that row (array indexes start from 0 and group numbers from 1, hence ((var1+1))). If no match is found, the row's list is added to q as a new group, and the last index + 1, i.e. the length of the array, becomes the index for the row.
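For readers who find the awk hard to follow, the scan described above can be sketched in Python. This is only an illustration under assumptions: the function name assign_groups_by_scanning is made up for this example, and the groups are visited in insertion order, whereas awk's for (x in array) makes no ordering promise.

```python
def assign_groups_by_scanning(number_lists):
    """Mirror of the first awk script: each group keeps the list of the row that created it."""
    groups = []   # groups[i] holds the members stored for group i + 1
    result = []
    for numbers in number_lists:
        members = numbers.split(",")
        group_id = 0
        # Compare every member of the row with every stored group.
        for index, stored in enumerate(groups):
            if any(member in stored for member in members):
                group_id = index + 1   # groups are numbered from 1
        if group_id == 0:
            # No common element found: the row starts a new group.
            groups.append(set(members))
            group_id = len(groups)
        result.append(group_id)
    return result


rows = ["1,2,3", "2", "9", "1,4", "5,7", "7", "8", "6,2"]
print(assign_groups_by_scanning(rows))
```

As in the awk version, the cost grows with the number of groups, because every row is compared against every stored group.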
Update:
/^name/{
    j=0;
    split($3,a,",");
    for(var in a) {
        if(q[a[var]] != 0) {
            j=q[a[var]]; i=1;
            break;
        }
    }
    j = (j == 0) ? ++k : j;
    for(var1 in a) {
        if(q[a[var1]] == 0) {
            q[a[var1]] = j;
        }
    }
    print $1 "\t\t" $2 " \t\t" $3 "\t\t" j;
}
Update:
The basis is that awk has associative arrays, in which each element is accessed by a string key. The earlier approach stored each group in an array keyed by the group index, so for every column read we had to go through each group, split it into individual elements and compare each of them with each element of the column. Instead of storing groups, we can store the elements themselves: the key is the element and the value is the index of the group the element belongs to.
When we read a column, we split it into individual elements (split($3,a,",");) and check whether the array already holds a group index for any of them with if(q[a[var]] != 0). (In awk, referencing an element that does not exist creates it with an uninitialized value, which compares equal to 0, hence the check q[a[var]] != 0.) If such an element is found, we take its group index as the index of the column and break; otherwise j remains 0, and ++k gives the next new group index.
Now that we have the group index for the column, we need to carry it over to those elements that are not yet part of any group. There will be cases where elements in the same column belong to different groups; here we take a first come, first served approach and do not overwrite the group index of elements already belonging to another group. So for each element in the column (for(var1 in a)), if it does not belong to a group (if(q[a[var1]] == 0)), we give it the group index with q[a[var1]] = j;.
Every access is a direct lookup by key, so there is no breaking up of a group again and again for every element, and hence a shorter running time. My first approach was based on one of my own problems (the one I mentioned in the first line), which involved more complex processing but a smaller data set. This one required simpler, more straightforward logic.
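The element-to-group lookup described above can also be sketched in Python, with a dict playing the role of the awk associative array. The name assign_groups is hypothetical, and the members are visited in the order they appear in the field, which awk does not guarantee.

```python
def assign_groups(number_lists):
    """Mirror of the second awk script: map each element to the group it first joined."""
    group_of = {}      # element -> group index
    next_group = 0     # plays the role of k in the awk script
    result = []
    for numbers in number_lists:
        members = numbers.split(",")
        # Take the group of the first member that already belongs to one, else 0.
        group_id = next((group_of[m] for m in members if m in group_of), 0)
        if group_id == 0:
            next_group += 1
            group_id = next_group
        # First come, first served: never overwrite an existing assignment.
        for m in members:
            group_of.setdefault(m, group_id)
        result.append(group_id)
    return result


rows = ["1,2,3", "2", "9", "1,4", "5,7", "7", "8", "6,2"]
print(assign_groups(rows))
```

Like the awk script, this does not merge two groups that a later row turns out to connect; it simply keeps the earlier assignment.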

Related

How do I group rows based on a fixed sum of values in Excel?

I am trying to find a variation of the Excel formula that was already provided here:
How do I create groups based on the sum of values?
It is the same requirement, but the grouping criterion needs to be an exact value.
Here's the sample data:
Column A | Column B
Item A | 1
Item B | 2
Item C | 3
Item D | 4
Item E | 5
Item F | 1
Item G | 2
Item H | 3
Item I | 4
Item J | 5
I need to group the rows if their Column B sum = 5.
Expected result:
Group 1 = Item A, Item D (1 + 4) = 5
Group 2 = Item B, Item C (2 + 3) = 5
Group 3 = Item E = 5
Group 4 = Item F, Item I (1 + 4) = 5
Group 5 = Item G, Item H (2 + 3) = 5
Group 6 = Item J = 5
If a row's Column B exceeds 5 or does not have another matching row to equal 5 when added then it will have no Group value.
Groupings can be interchangeable, i.e. Group 1 = Item A, Item I is also acceptable since 1 + 4 = 5.
I assume this can be achieved using Excel formulas but I am struggling to find which formula(s) can be used. Any help is appreciated!
I believe I understood your question after we exchanged some comments. In any case, I would recommend updating it: the problem is interesting, but the question was difficult to follow.
Before looking for an Excel solution, I approached the problem as a state machine with transitions from one state to another. I considered the following states, which represent the position of the item in the group. A group is defined as consecutive items whose sum is equal to 5.
EMPTY: Just the initial situation
START: Start of the group
MIDDLE: A middle element of the group
END: The end of the group
START-END: A group with a single element
NA: Not applicable group
I follow the same idea as in "How do I create groups based on the sum of values?", but with slightly different helper columns:
Total (Column D): in this case it uses the following formula: IF(SUM(C3,D2)>5,C3,SUM(C3,D2))
Status or item position within Group (Column G): this is where the corresponding status of each element is calculated
Checks for Valid Groups (Column H): Evaluates if a group is valid. When there is no match to 5, the group is not valid. It is indicated at the row that represents the beginning of the group (START or START-END states). If TRUE it means a valid group, if FALSE it is not a valid group, and NA for an NA value from Status column. If empty represents any element of the group that is not the first one.
Group # (Column I): To identify the group the row (Item) belongs to. Notice that we start counting the group from 1 and I also consider the case a group can not be formed (NA).
Here is a screenshot with the solution and the formula on G3:
=LET(total, D3, prevS, G2, QTY, C3,
IF(C3="", "",
IF(OR(AND(total=5, QTY<5, prevS="START"), AND(total=5, prevS="MIDDLE")), "END",
IF(OR(AND(total>5, total=QTY, OR(prevS="START", prevS="MIDDLE")),AND(total>5, OR(prevS="", prevS="END", prevS="NA", prevS="START-END"))), "NA",
IF(OR(AND(total<5, total=QTY, OR(prevS="START", prevS="MIDDLE")),AND(total<5, OR(prevS="", prevS="END", prevS="NA", prevS="START-END"))), "START",
IF(AND(total<5, OR(prevS="START", prevS="MIDDLE")), "MIDDLE",
IF(OR(AND(total=5, total= QTY, OR(prevS="START", prevS="MIDDLE")),AND(total=5, OR(prevS="", prevS="END", prevS="NA", prevS="START-END"))), "START-END", "UNDEFINED")
)
)
)
)
)
)
Notes:
The LET Excel function is used to make the formula more readable.
The IF blocks should be ordered from the most specific case of total and QTY values to the most generic ones. For cases with the same total condition, make sure the second conditions on prevS are not repeated.
An UNDEFINED case is added as a last resort, to check whether any transition was not covered. If that happens, the formula has to be reviewed; so far, all cases in the sample data are covered.
Columns K-Q are for documentation purposes only, to identify all possible transitions. Columns K-M list all possible transitions organized by previous status. Columns O-Q list all possible transitions ordered by current status, which makes it easier to formulate each portion of the IF blocks.
The formula can perhaps be simplified. It is more complex than the solution provided for the similar question, but this question has more specific conditions. Some transitions may not be relevant for the final result, but it is preferable to consider all positions in the group to make sure all transitions are covered.
The following state machine diagram shows all possible transitions:
Notes:
As you can see, the solution also considers the cases where a group cannot be created or is not valid (NA values). The solution assumes that the Item column has only positive values; the question does not state any restriction, but the values in the example are all positive. To handle zero values, this solution needs to be adjusted.
The Checks for Valid Groups column is calculated as follows:
= IF(G3="", "",
IF(G3="START-END", TRUE,
IF(G3="NA", "NA",
IF(G3="START",
LET(endRow, IFNA(MATCH("START", LEFT(G4:$G$1000,5),0), MATCH("", LEFT(G4:$G$1000,5),0))+ ROW()-1,
value, VLOOKUP("END", G4:INDIRECT( "G" & endRow),1,0),
IF(ISNA(value), FALSE, TRUE)
), ""
)
)
)
)
It identifies the start and end of the group and then finds any NA values; if there are any, the group is not valid. If the end of the candidate group is not found (the first MATCH returns N/A), then it searches until a blank row.
The Group # column is calculated as follows:
=IF(C3="","", LET(value, MAX($I$2:I2), IF(G3="NA", "NA",
IF(H3=TRUE, value + 1, IF(H3=FALSE, "NA",
IF(I2="NA", "NA", value))))))
This way only valid transitions are considered, i.e. the following status transitions starting from START but not ending in END: START->NA, START->MIDDLE[one or more]...->NA and NA are not considered valid groups (NA).
I added more examples to the original sample file provided; more can be added to further test all possible scenarios, but I guess you get the idea of this approach. As you stated, "I assume this can be achieved using Excel formulas": yes, it is possible, but for more complex conditions I would suggest implementing a state machine algorithm in VBA. Even though it can be done with Excel functions, you have to deal with several nested IF blocks and helper columns, whereas the same thing can be achieved with a simple for-loop in VBA.
Here is a link to online Excel file I used.
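To illustrate the "simple for-loop" idea mentioned above, here is a minimal sketch written in Python rather than VBA. It assumes the definition used in this answer (a group is a run of consecutive items whose sum is exactly 5) and positive values only; the function name group_consecutive and its parameters are made up for this example.

```python
def group_consecutive(values, target=5):
    """Return a group number per value, or None where no valid group forms."""
    groups = [None] * len(values)
    pending = []       # indexes of the candidate group being built
    running = 0        # running total of the candidate group
    next_group = 1
    for i, value in enumerate(values):
        if running + value > target:
            # The candidate cannot hit the target exactly: drop it, restart here.
            pending, running = [], 0
        pending.append(i)
        running += value
        if running == target:
            # Valid group: number every item collected so far.
            for j in pending:
                groups[j] = next_group
            next_group += 1
            pending, running = [], 0
        elif running > target:
            # A single value above the target can never belong to a group.
            pending, running = [], 0
    return groups


print(group_consecutive([2, 3, 5, 1, 1, 3, 6, 4, 1]))
```

The running total plays the role of the Total helper column, and a dropped candidate corresponds to a group that started but never reached END.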

How to delete single-word rows from a DataFrame except for a few defined words like *when*, *what*

How can I delete rows that contain a single word from a DataFrame, except for a few defined words like when and what?
1. Hello
2. My name is khan
3. When
4. What
5. Opted bat
6. Learn
I want output like
2. My name is khan
3. When
4. What
5. Opted bat
Use boolean indexing: build a mask that keeps rows whose word count is not 1, and chain it by bitwise OR (|) with a mask that keeps the words defined in a list:
words = ['When','What']
df = df[(df['col'].str.split().str.len() != 1) | df['col'].isin(words)]
print (df)
col
1 My name is khan
2 When
3 What
4 Opted bat
If the words in the list are defined in lowercase:
words = ['when','what']
df = df[(df['col'].str.split().str.len() != 1) | df['col'].str.lower().isin(words)]

How to interpret a Tcl "list as string" as a list?

For some reason, the variable result in the following line of code
set result [[$sqlCmd execute] allrows -as lists]
gets a string that looks like a list: {2 3 4 5}
If I write puts "result $result => [llength $result]" it prints {2 3 4 5} => 1
If I write puts [list $result], it prints {{2 3 4 5}}, which is correct because list creates a list from one string.
Is there any way to convert this string to what it is expected to be, a list, without string processing steps such as deleting the braces and splitting the string with the split function? I suppose it needs some kind of interpretation, but I'm unable to find a nice solution.
The allrows method always returns a list, one per row (even when there's only a single row returned). When the -as lists option is passed in, each element of that list is itself a list representing the columns in that row.
Thus, to iterate over the columns of that row, you'd do:
set result [[$sqlCmd execute] allrows -as lists]
set rowresult [lindex $result 0]
foreach col $rowresult {
puts "I've got a '$col'"
}
You're usually recommended to use the default that represents rows as dictionaries indexed by column name, as that has a better representation of SQL NULLs (i.e., the column is absent then instead of being the driver-designated null value, which is often and ambiguously the empty string).
allrows is giving you a list of lists, each sublist representing a row. There is 1 row in this list, 2 3 4 5, so the length is 1. You can index or iterate over the list the usual ways to access its one element.
# If you're assuming there will only be one row
set only_row [lindex $result 0]
# Or if you want to iterate over all rows
foreach row $result {
puts "row: $row" ;# do whatever you need with $row
}
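The same "list of rows, each row a list of columns" shape appears in other database APIs as well. As an analogy only (this is Python's standard sqlite3 module, not Tcl's tdbc; the table t and its columns are invented for the example), a query returning a single row still yields a list of length 1:

```python
import sqlite3

# In-memory database with one row of four columns (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b, c, d)")
conn.execute("INSERT INTO t VALUES (2, 3, 4, 5)")

rows = conn.execute("SELECT a, b, c, d FROM t").fetchall()
first_row = rows[0]            # counterpart of [lindex $result 0]

print(len(rows))               # number of rows
print(len(first_row))          # number of columns in the first row
for col in first_row:
    print("I've got a", col)
conn.close()
```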

String Handling in Talend

I have this kind of data,
12345 Lipa AVE, AKA 1234 LIpa AVE, Lipa City, LP, 12345
I want to transform this into this:
All the data I'm going to process uses either 1 comma to separate the address, or 2 commas as in the example above.
An example of the 1-comma case is below:
12345 Lipa AVE, Lipa City, LP, 12345
The simplest solution is to unify the structure, and then make the mapping. In this case it means first convert the 4 column structure (1 comma case) into 5 columns (2 commas case) where the second field is empty.
The diagram is the following:
tFileInputFullRow -> tJavaRow -> tExtractDelimitedField -> tMap -> tFileOutputDelimited
So first read the full row, then detect the case and insert the extra column if necessary. The tJavaRow code is the following:
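// Split the raw line into its comma-separated fields
String[] elements = input_row.line.split(",");
// 4 fields means the second address is missing: append a comma to the first
// field so that an empty second field appears when the line is rebuilt
if (elements.length == 4) {
    elements[0] += ",";
}
// Rebuild the line; every field is followed by a comma
output_row.line = "";
for (String element : elements) {
    output_row.line += element + ",";
}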
In tExtractDelimitedField set the separator to comma, and finally in the tMap merge the two address fields into one:
row3.address2 != null && !row3.address2.equals("") ? row3.address1 + "," + row3.address2 : row3.address1
The tExtractDelimitedField can be skipped in the tJavaRow by changing the output schema and then passing the array elements one by one.
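Outside Talend, the same two steps (pad the 4-field case to 5 fields, then merge the two address fields) can be sketched in a few lines of Python. The function names unify_fields and merge_address are hypothetical, and, like the tJavaRow code, the sketch does not trim the spaces that follow each comma.

```python
def unify_fields(line):
    """Split a line on commas and pad the 4-field case to 5 fields."""
    fields = line.split(",")
    if len(fields) == 4:
        # No second address: insert an empty one after the first field.
        fields.insert(1, "")
    return fields


def merge_address(fields):
    """Merge address1 and address2 into one field, as the tMap expression does."""
    address1, address2, city, state, zip_code = fields
    address = address1 + "," + address2 if address2 else address1
    return [address, city, state, zip_code]


for raw in ("12345 Lipa AVE, AKA 1234 LIpa AVE, Lipa City, LP, 12345",
            "12345 Lipa AVE, Lipa City, LP, 12345"):
    print(merge_address(unify_fields(raw)))
```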

Count number of occurences of a string and relabel

I have a n x 1 cell that contains something like this:
chair
chair
chair
chair
table
table
table
table
bike
bike
bike
bike
pen
pen
pen
pen
chair
chair
chair
chair
table
table
etc.
I would like to rename these elements so they will reflect the number of occurrences up to that point. The output should look like this:
chair_1
chair_2
chair_3
chair_4
table_1
table_2
table_3
table_4
bike_1
bike_2
bike_3
bike_4
pen_1
pen_2
pen_3
pen_4
chair_5
chair_6
chair_7
chair_8
table_5
table_6
etc.
Please note that the underscore (_) is necessary. Could anyone help? Thank you.
Interesting problem! This is the procedure that I would try:
Use unique - the third output parameter in particular to assign each string in your cell array to a unique ID.
Initialize an empty array, then create a for loop that goes through each unique string - given by the first output of unique - and creates a numerical sequence from 1 up to as many times as we have encountered this string. Place this numerical sequence in the corresponding positions where we have found each string.
Use strcat to attach each element in the array created in Step #2 to each cell array element in your problem.
Step #1
Assuming that your cell array is defined as a bunch of strings stored in A, we would call unique this way:
[names, ~, ids] = unique(A, 'stable');
The 'stable' is important as the IDs that get assigned to each unique string are done without re-ordering the elements in alphabetical order, which is important to get the job done. names will store the unique names found in your array A while ids would contain unique IDs for each string that is encountered. For your example, this is what names and ids would be:
names =
'chair'
'table'
'bike'
'pen'
ids =
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
1
1
1
1
2
2
names is actually not needed in this algorithm. However, I have shown it here so you can see how unique works. Also, ids is very useful because it assigns a unique ID for each string that is encountered. As such, chair gets assigned the ID 1, followed by table getting assigned the ID of 2, etc. These IDs will be important because we will use these IDs to find the exact locations of where each unique string is located so that we can assign those linear numerical ranges that you desire. These locations will get stored in an array computed in the next step.
Step #2
Let's pre-allocate this array for efficiency. Let's call it loc. Then, your code would look something like this:
loc = zeros(numel(A), 1);
for idx = 1 : numel(names)
id = find(ids == idx);
loc(id) = 1 : numel(id);
end
As such, for each unique name we find, we look for every location in the ids array that matches this particular name found. find will help us find those locations in ids that match a particular name. Once we find these locations, we simply assign an increasing linear sequence from 1 up to as many names as we have found to these locations in loc. The output of loc in your example would be:
loc =
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
5
6
7
8
5
6
Notice that this corresponds with the numerical sequence (the right most part of each string) of your desired output.
Step #3
Now all we have to do is piece loc together with each string in our cell array. We would thus do it like so:
out = strcat(A, '_', num2str(loc));
What this does is that it takes each element in A, concatenates a _ character and then attaches the corresponding numbers to the end of each element in A. Because we want to output strings, you need to convert the numbers stored in loc into strings. To do this, you must use num2str to convert each number in loc into their corresponding string equivalents. Once you find these, you would concatenate each number in loc with each element in A (with the _ character of course). The output is stored in out, and we thus get:
out =
'chair_1'
'chair_2'
'chair_3'
'chair_4'
'table_1'
'table_2'
'table_3'
'table_4'
'bike_1'
'bike_2'
'bike_3'
'bike_4'
'pen_1'
'pen_2'
'pen_3'
'pen_4'
'chair_5'
'chair_6'
'chair_7'
'chair_8'
'table_5'
'table_6'
For your copying and pasting pleasure, this is the full code. Be advised that I've nulled out the first output of unique as we don't need it for your desired output:
[~, ~, ids] = unique(A, 'stable');
loc = zeros(numel(A), 1);
for idx = 1 : max(ids) % names was discarded above, so loop over the unique IDs instead
id = find(ids == idx);
loc(id) = 1 : numel(id);
end
out = strcat(A, '_', num2str(loc));
If you want an alternative to unique, you can work with a hash table, which in Matlab means using the containers.Map object. You can then store the number of occurrences of each individual label and create the new labels on the go, as in the code below.
data={'table','table','chair','bike','bike','bike'};
map=containers.Map(data,zeros(numel(data),1)); % labels=keys, counts=values (zeroed)
new_data=data; % initialize matrix that will have outputs
for ii=1:numel(data)
map(data{ii}) = map(data{ii})+1; % increment counts of current labels
new_data{ii} = sprintf('%s_%d',data{ii},map(data{ii})); % format outputs
end
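The same counting idea is easy to express in any language with a hash table. As a minimal sketch in Python (the function name label_occurrences is made up for this example), a dictionary of running counts replaces the containers.Map:

```python
from collections import defaultdict


def label_occurrences(items):
    """Append _<n> to each item, where n counts its occurrences so far."""
    counts = defaultdict(int)   # label -> number of times seen so far
    labelled = []
    for item in items:
        counts[item] += 1
        labelled.append(f"{item}_{counts[item]}")
    return labelled


data = ["table", "table", "chair", "bike", "bike", "bike", "table"]
print(label_occurrences(data))
```

Because the counts are kept per label, numbering continues correctly when a label reappears later in the list.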
This is similar to rayryeng's answer but replaces the for loop by bsxfun. After the strings have been reduced to unique labels (line 1 of code below), bsxfun is applied to create a matrix of pairwise comparisons between all (possibly repeated) labels. Keeping only the lower "half" of that matrix and summing along rows gives how many times each label has previously appeared (line 2). Finally, this is appended to each original string (line 3).
Let your cell array of strings be denoted as c.
[~, ~, labels] = unique(c); %// transform each string into a unique label
s = sum(tril(bsxfun(@eq, labels, labels.')), 2); %'// accumulated occurrence number
result = strcat(c, '_', num2str(s)); %// build result
Alternatively, the second line could be replaced by the more memory-efficient
n = numel(labels);
M = cumsum(full(sparse(1:n, labels, 1)));
s = M((1:n).' + (labels-1)*n);
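To make clear what the pairwise comparison computes, here is a plain Python sketch of the same quantity: for each position, the number of equal labels at or before it, which is what summing the rows of the lower triangle gives. The name occurrence_numbers is hypothetical, and this loop-based version is quadratic, like the comparison matrix it imitates.

```python
def occurrence_numbers(labels):
    """For each position i, count the labels equal to labels[i] at positions 0..i."""
    n = len(labels)
    return [
        sum(1 for j in range(i + 1) if labels[j] == labels[i])  # row i of the lower triangle
        for i in range(n)
    ]


c = ["chair", "chair", "table", "chair", "table"]
s = occurrence_numbers(c)
result = [f"{name}_{count}" for name, count in zip(c, s)]
print(result)
```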
I'll give you pseudocode; try it yourself and post the code if it doesn't work.
Keep one counter per distinct string, each starting at 0
Iterate over the cell
Increment the counter of the current string
sprintf the string value + "_" + that counter into a new array
end
Hope this helps!

Resources