How to randomly sort one key while the other is kept in its original sort order with GNU "sort" - linux

Given an input list like the following:
405:alice#level1
405:bob#level2
405:chuck#level1
405:don#level3
405:eric#level1
405:francis#level1
004:ac#jjj
004:la#jjj
004:za#zzz
101:amy#floor1
101:brian#floor3
101:christian#floor1
101:devon#floor1
101:eunuch#floor2
101:frank#floor3
005:artie#le2
005:bono#nuk1
005:bozo#nor2
(As you can see, the first field has been randomly sorted: the original input had the first field in numerical order, with 004 first, then 005, 101, 405, et al. The second field, however, is still in alphabetical order on its first character.)
What is desired is a randomized sort on the first field, as separated by a colon ':': all lines where the first field is the same stay grouped together, but the groups themselves are randomly distributed throughout the file, and within each group the second field is randomly sorted as well. The entries in the second field should not influence where a group ends up. I am unable to get this desired result, as I am not too familiar with sort keys and whatnot.
The desired output would look similar to this:
405:francis#level1
405:don#level3
405:eric#level1
405:bob#level2
405:alice#level1
405:chuck#level1
004:za#zzz
004:ac#jjj
004:la#jjj
101:christian#floor1
101:amy#floor1
101:frank#floor3
101:eunuch#floor2
101:brian#floor3
101:devon#floor1
005:bono#nuk1
005:artie#le2
005:bozo#nor2
Does anyone know how to achieve this type of sort?
Thank you!

You can do this with awk pretty easily.
As a one-liner:
awk -F: 'BEGIN{cmd="sort -R"} $1 != key {close(cmd)} {key=$1; print | cmd}' input.txt
Or, broken apart for easier explanation:
-F: - Set awk's field separator to colon.
BEGIN{cmd="sort -R"} - Before we start, set a variable holding the command that does the "randomized sort". This one works for me on FreeBSD; it should work with GNU sort as well.
$1 != key {close(cmd)} - If the current line has a different first field than the last one processed, close the output pipe...
{key=$1; print | cmd} - And finally, set the "key" var, and print the current line, piping output through the command stored in the cmd variable.
This usage takes advantage of a bit of awk awesomeness: when you pipe output through a command given as a string (whether stored in a variable or not), the pipe is created automatically on first use. You can close it at any time, and a subsequent use will reopen a new instance of the command.
The impact of this is that each time you close(cmd), you print the current set of randomly sorted lines. And awk closes cmd automatically once you come to the end of the file.
Of course, for this solution to work, it's vital that all lines with a shared first field are grouped together.
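The close-and-reopen behavior is easier to see with a deterministic sorter in place of sort -R; this sketch swaps in plain sort (and made-up two-group input) so each group's output is predictable:

```shell
# Same structure as the answer's one-liner, but with plain "sort" so
# each group's output is deterministic. Each close(cmd) flushes one
# group, sorted, before the next group starts piping.
printf 'a:3\na:1\nb:2\nb:0\n' |
awk -F: 'BEGIN{cmd="sort"} $1 != key {close(cmd)} {key=$1; print | cmd}'
# a:1
# a:3
# b:0
# b:2
```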

Not as elegant, but a different method:
$ awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' file |
sort -R -k2 |
sort -nk1,1 -s |
cut -f2-
Or this alternative, which doesn't assume initial grouping:
$ sort -R file |
awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' |
sort -nk1,1 -s |
cut -f2-
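Both variants decorate each line with a per-group numeric index, shuffle, then do a stable numeric sort on that index and strip it off. A small end-to-end run on a subset of the sample data (the file name `file` is assumed):

```shell
# Decorate / shuffle / stable-sort / undecorate.
printf '004:ac#jjj\n004:la#jjj\n005:artie#le2\n101:amy#floor1\n' > file
awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' file |
  sort -R -k2 |
  sort -nk1,1 -s |
  cut -f2-
```

The output order is random, but every line of the input appears exactly once and lines sharing a first field stay contiguous.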

Related

Retrieve different information from several files to bring them together in one. BASH

I have a problem with my bash script: I would like to retrieve information contained in several files and gather it in one.
I have a file in this form which contains about 15000 lines: (file1)
1;1;A0200101C
2;2;A0200101C
3;3;A1160101A
4;4;A1160101A
5;5;A1130304G
6;6;A1110110U
7;7;A1110110U
8;8;A1030002V
9;9;A1030002V
10;10;A2120100C
11;11;A2120100C
12;12;A3410071A
13;13;A3400001A
14;14;A3385000G1
15;15;A3365070G1
I need to retrieve the first field (the id) of each row.
My second file is this; I just need to retrieve the 3rd line: (file2)
count
-------
131
(1 row)
I would therefore like to be able to assemble the id of (file1) and the 3rd line of (file2) in order to achieve this result:
1;131
2;131
3;131
4;131
5;131
6;131
7;131
8;131
9;131
10;131
11;131
12;131
13;131
14;131
15;131
Thank you.
One possible way:
#!/usr/bin/env bash
count=$(awk 'NR == 3 { print $1 }' file2)
while IFS=';' read -r id _; do
  printf "%s;%s\n" "$id" "$count"
done < file1
First, read just the third line of file2 and save that in a variable.
Then read each line of file1 in a loop, extracting the first semicolon-separated field, and print it along with that saved value.
Using the same basic approach in a purely awk script instead of shell will be much faster and more efficient. Such a rewrite is left as an exercise for the reader (Hint: In awk, FNR == NR is true when reading the first file given, and false on any later ones. Alternatively, look up how to pass a shell variable to an awk script; there are Q&As here on SO about it.)
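A sketch of the all-awk rewrite the hint points at (file names as in the question; `count` is an awk variable capturing line 3 of file2):

```shell
# Read file2 first: NR==FNR is true only while reading it, so we can
# capture its 3rd line into "count" and skip the print action.
awk -F';' '
  NR == FNR { if (FNR == 3) count = $1; next }
  { print $1 ";" count }
' file2 file1
```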

Please explain this awk script for taking Fixed Width to CSV

I'm learning some awk. I found an example online of taking a fixed width file and converting it to a csv file. There is just one part I do not understand, even after going through many man pages and online tutorials:
1: awk -v FIELDWIDTHS='1 10 4 2 2' -v OFS=',' '
2: { $1=$1 ""; print }
3: ' data.txt
That is verbatim from the sample online (found here).
What I don't understand is line 2. I get that there is no condition, so the action (contained in braces) will always execute per record (line). I don't understand why it does the $1=$1 assignment, or the empty-string statement "";. However, removing these causes incorrect behavior.
$1=$1 assigns a value to $1 (it just happens to be the same value it already had). Assigning any value to a field causes awk to rebuild the current record using the OFS value between fields (effectively replacing all FS or FIELDWIDTHS separations with OFS).
$ echo 'a,b,c' | awk -F, -v OFS="-" '{print; $1=$1; print}'
a,b,c
a-b-c
The "" is because whoever wrote the script doesn't fully understand awk and thinks that's necessary to ensure numbers retain their precision by converting them to a string before the assignment.

pipe an awk object created with awk code

I would like to know if there is a method for creating awk objects inside an awk call. I need to build a key/value map and use it in an awk call. In more detail, I have a map linking labels to unique ids (e.g. "ID1002", "External compartment"). I would like to use this map to identify a set of unique ids from another table. Here is what I was thinking:
awk 'BEGIN{map=system(awk '{m[$1]=$2}' first.csv)}{print map[$1]}' second.csv
Obviously this doesn't work, and I was wondering how I can do something like that without building a separate awk script.
The common way this is done in awk is:
$ awk 'NR==FNR{m[$1]=$2;next}{print m[$1]}' first.csv second.csv
Explanation:
NR is a special variable that gets incremented on each record read
FNR is similar to NR however it is reset for each new file read
next instructs awk to stop executing for the current record and get the next record.
With the definitions set you can read the script as:
NR==FNR # Conditional that is only true when reading the first file
{m[$1]=$2;next} # Create a map and move on to the next line
{print m[$1]} # Using next in the first block means this only runs on the second file
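A quick end-to-end run with made-up sample data (awk's default whitespace field splitting is used here, so the values are single words):

```shell
# first.csv maps an id to a label; second.csv lists ids to look up.
printf 'ID1002 External\nID1003 Internal\n' > first.csv
printf 'ID1003\nID1002\n' > second.csv
awk 'NR==FNR{m[$1]=$2;next}{print m[$1]}' first.csv second.csv
# Internal
# External
```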

Linux join utility complains about input file not being sorted

I have two files:
file1 has the format:
field1;field2;field3;field4
(file1 is initially unsorted)
file2 has the format:
field1
(file2 is sorted)
I run the 2 following commands:
sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2
I get the following message:
join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order
Why is this happening ?
(I also tried to sort file1 on the entire line, not only the first field of the line, but with no success.)
sort -t\; -c file1 doesn't output anything (i.e., sort believes the file is sorted), yet around line 27497 the situation is indeed strange, which suggests sort isn't doing its job correctly:
XYZ113017;...
line 27497--> XYZ11301;...
XYZ11301;...
To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post researching a slightly different problem).
When using join, the input files must be sorted by the join field ONLY, otherwise you may see the warning reported by the OP.
There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:
If you do specify a field, it's easy to forget that you must also specify a stop field - even if you target only 1 field - because sort uses the remainder of the line if only a start field is specified; e.g.:
sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
sort -t, -k1,1 ... # Field 1 only
If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.
However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as just sorting by the 1st field:
sort ... # NOT always the same as 'sort -k1,1'! see below for example
Pitfall example:
#!/usr/bin/env bash
# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF
# NOTE: "WRONG" below refers to deviation from the expected outcome
# of sorting by field 1 only, based on mistaken assumptions.
# The commands do work correctly in a technical sense.
echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first
Explanation:
When NOT limiting sorting to the first field, it is the relative sort order of the characters ^ and A (column index 6) that matters in this example. In other words, the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A and therefore sorts after it, so the line starting with nameAA^ sorts BEFORE the one starting with nameA^.
Note: It is possible for problems to surface on one platform, but be masked on another, based on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:
sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before ,[1]
[1] I don't know why - if somebody knows, please tell me. Test with sort <<<$'-\n,'
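The platform difference above most likely comes down to locale collation rules (en_US.UTF-8 collation treats punctuation differently from raw byte order). Forcing the C locale makes sort compare raw byte values, which is usually what you want when preparing input for join; a sketch of the same test with LC_ALL=C:

```shell
# With the C locale, comparison is by raw byte value: ',' (ASCII 44)
# sorts before '-' (ASCII 45) regardless of the default locale.
printf -- '-\n,\n' | LC_ALL=C sort
# ,
# -
```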
sort -k1 uses all fields starting from field 1 as the key. You need to specify a stop field.
sort -t\; -k1,1
... or GNU sort is just as buggy as every other GNU command.
Try to sort Gi1/0/11 vs Gi1/0/1 and you'll never get an actual regular textual sort suitable for join input, because someone added extra intelligence to sort which will happily use numeric or human-numeric sorting automagically in such cases, without even bothering to add a flag to force the regular behavior.
What is suitable for humans is seldom suitable for scripting.

Split ordered file in Linux

I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using linux sort) according to first (numeric), second (numeric) and fourth column (string ordering since it is a timestamp value). The file is like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file into multiple text files with an awk, sed, grep or perl one-liner whenever the first or second column value changes. The final result for the example file should be 3 text files like this:

77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399

77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735

78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java of course, but I think it would be overkill if it can be done with a script. Also, is it possible to build the filenames from those two column values, something like 77_141.txt for the first file, 77_171.txt for the second and 78_115.txt for the third?
awk is very handy for this kind of problem. This can be an approach:
awk -F"|" '{print >> $1"_"$2".txt"}' file
Explanation
-F"|" sets field separator as |.
{print >> something} appends the lines to the file something.
$1"_"$2".txt" instead of something sets the output file name from $1 and $2, $1 being the first field based on the | separator (that is, 77, 78...) and likewise $2 (141, 171...).
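One caveat worth knowing: with very many distinct $1/$2 pairs, awk can hit the open-file-descriptor limit. A variant of the one-liner that avoids this by closing each file after writing (the close() call is an addition, not part of the answer above):

```shell
# Append each line to a file named after fields 1 and 2, then close
# the file so a large number of groups can't exhaust descriptors.
awk -F'|' '{f = $1 "_" $2 ".txt"; print >> f; close(f)}' file
```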

Resources