count character length from a large file - linux

I need to count how many lines of each character length there are in a file containing 140000 lines; the length of each line varies.
aaaaa
bbb
ccccc
ddddd
fff
Expecting output as below
strings char-length
2 3
3 5
(meaning 2 strings have a character length of 3, and 3 strings have a character length of 5). I have already tried a for loop that reads every line, but it is slow because my file has 140000 lines.

If you have awk available you can try the following command:
awk '{ print length($0) }' <your_file> | sort | uniq -c
(Took 27ms on my VM with sample test file of 7000 lines, each line around 10 chars long).
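A variation on the command above, as a minimal sketch: it counts the lengths inside awk in a single pass and only sorts the (much smaller) summary, numerically by length. The temporary sample file and the variable names are assumptions made for illustration.

```shell
# Hypothetical sample data matching the question's example.
sample=$(mktemp)
printf '%s\n' aaaaa bbb ccccc ddddd fff > "$sample"

# Tally how many lines have each length, then sort the summary by length.
length_counts=$(awk '{ count[length($0)]++ }
    END { for (len in count) print count[len], len }' "$sample" | sort -k2,2n)

printf '%s\n' "$length_counts"
rm -f "$sample"
```

Each output line holds the number of strings followed by the character length, as in the question.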

How to remove some words in specific field using awk?

I have several lines of text. I want to extract the number after specific word using awk.
I tried the following code but it does not work.
At first, create the test file by: vi test.text. There are 3 columns (the 3 fields are generated by some other pipeline commands using awk).
Index AllocTres CPUTotal
1 cpu=1,mem=256G 18
2 cpu=2,mem=1024M 16
3 4
4 cpu=12,gres/gpu=3 12
5 8
6 9
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21
Please note there are several empty fields in this file.
What I want to achieve is to keep only the number following the first gres/gpu part and remove all cpu= and mem= parts, using a pipeline like: cat test.text | awk '{some_commands}', to output 3 columns:
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
1st solution: With your shown samples, please try following GNU awk code. This takes care of spaces in between fields.
awk '
FNR==1{ print; next }
match($0,/[[:space:]]+/){
space=substr($0,RSTART,RLENGTH-1)
}
{
match($2,/gres\/gpu=([0-9]+)/,arr)
match($0,/^[^[:space:]]+[[:space:]]+[^[:space:]]+([[:space:]]+)/,arr1)
space1=sprintf("%"length($2)-length(arr[1])"s",OFS)
if(NF>2){ sub(OFS,"",arr1[1]);$2=space arr[1] space1 arr1[1] }
}
1
' Input_file
Output will be as follows for above code with shown samples:
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
2nd solution: If you don't care about preserving the spacing, then try the following awk code.
awk 'FNR==1{print;next} match($2,/gres\/gpu=([0-9]+)/,arr){$2=arr[1]} 1' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line then do following.
print ##Printing current line.
next ##next will skip all further statements from here.
}
match($2,/gres\/gpu=([0-9]+)/,arr){ ##using match function to match regex gres/gpu= digits and keeping digits in capturing group.
$2=arr[1] ##Assigning 1st value of array arr to 2nd field itself.
}
1 ##printing current edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
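The three-argument form of match() used above is a GNU awk extension. Where only a POSIX awk is available, the same extraction can be sketched with RSTART and RLENGTH. Because the match in the one-liner above is a pattern condition, lines without gres/gpu= are printed with field 2 unchanged; this sketch instead blanks the field in that case, which is what the requested output appears to want. The sample file and the variable names are assumptions.

```shell
# Hypothetical subset of the question's sample data.
demo=$(mktemp)
printf '%s\n' 'Index AllocTres CPUTotal' \
    '1 cpu=1,mem=256G 18' \
    '3 4' \
    '4 cpu=12,gres/gpu=3 12' \
    '7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20' > "$demo"

gpu_out=$(awk '
FNR == 1 { print; next }            # keep the header line as is
NF == 3 {
    if (match($2, /gres\/gpu=[0-9]+/))
        # skip the 9 characters of "gres/gpu=" and keep the digits
        $2 = substr($2, RSTART + 9, RLENGTH - 9)
    else
        $2 = ""                     # no GPU entry: blank the field
}
{ print }' "$demo")

printf '%s\n' "$gpu_out"
rm -f "$demo"
```

As in the 2nd solution, assigning to $2 rebuilds the line with single-space separators, so the original column alignment is not preserved.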
Using sed
$ sed 's~\( \+\)[^,]*,\(gres/gpu=\([0-9]\)\|[^ ]*\)[^ ]* \+~\1\3 \t\t\t\t ~' input_file
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
awk '
FNR>1 && NF==3 {
n = split($2, a, ",")
for (i=1; i<=n && a[i] !~ /gres\/gpu=[0-9]+,?/; ++i);
sub(/.*=/, "", a[i])
$2 = a[i]
}
NF==2 {$3=$2; $2=""}
{printf "%-7s%-11s%s\n",$1,$2,$3}' test.txt
Output:
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
You can adjust column widths as desired.
This assumes the first and last columns always have a value, so that NF (number of fields) can be used to tell whether field 2 is present. If field 2 is not empty, the script splits it on commas, scans the resulting array for the first gres/gpu= entry, strips everything up to and including the = sign, and prints the three fields. If field 2 is empty, the second-to-last line inserts an empty awk field so printf always works.
If assumption above is wrong, it's also possible to identify field 2 by its character index.
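A minimal sketch of the character-index idea, assuming a fixed-width layout in which field 1 occupies columns 1 to 7, field 2 columns 8 to 47, and field 3 starts at column 48. These widths, the sample rows, and the variable names are all assumptions; the real positions depend on how the upstream pipeline formats its output.

```shell
# Build a hypothetical fixed-width sample (widths 7 and 40 are assumed).
fixed=$(mktemp)
{
    printf '%-7s%-40s%s\n' Index AllocTres CPUTotal
    printf '%-7s%-40s%s\n' 3 '' 4
    printf '%-7s%-40s%s\n' 4 'cpu=12,gres/gpu=3' 12
} > "$fixed"

by_col=$(awk '
FNR == 1 { print; next }
{
    idx   = substr($0, 1, 7)      # field 1 by position
    tres  = substr($0, 8, 40)     # field 2 by position, may be all blanks
    total = substr($0, 48)        # field 3 is the rest of the line
    gpu = ""
    if (match(tres, /gres\/gpu=[0-9]+/))
        gpu = substr(tres, RSTART + 9, RLENGTH - 9)
    printf "%-7s%-40s%s\n", idx, gpu, total
}' "$fixed")

printf '%s\n' "$by_col"
rm -f "$fixed"
```

Because the fields are cut by position, this does not rely on NF, so it still works when the first or last column is empty.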
An awk-based solution that does not need
- array splitting,
- regex back-referencing,
- prior state tracking, or
- multiple passes over the input (multi-passing /dev/stdin would itself require state tracking)
{mng}awk '!_~NF || sub("[^ ]+$", sprintf("%*s&", length-length($!(NF=NF)),_))' \
FS='[ ][^ \\/]*gres[/]gpu[=]|[,: ][^= ]+[=][^,: ]+' OFS=
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
If you don't need nawk compatibility, there is an even simpler single-pass approach with only one all-encompassing call to sub() per line:
awk ' sub("[^ ]*$", sprintf("%*s&", length($_) - length($(\
gsub(" [^ /]*gres[/]gpu=|[,: ][^= ]+=[^,: ]+", _)*_)),_))'
or, even more condensed but with worse syntax styling:
awk 'sub("[^ ]*$",sprintf("%*s&",length^gsub(" [^ /]*gres\/gpu=|"\
"[,: ][^= ]+=[^,: ]+",_)^_ - length,_) )'
This might work for you (GNU sed):
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*\n(.*)\n.*/\1/
/gpu=/!{s/./ /g;G;s/(^.*)\n(.*)\n.*\n/\2\1/p;d}
s/gpu=([^,]*)/\n\1 \n/;s/(.*)\n(.*\n)/\2\1/;H
s/.*\n//;s/./ /g;H;g
s/\n.*\n(.*)\n(.*)\n.*\n(.*)/\2\3\1/' file
In essence, the solution above uses the hold space as a scratchpad to hold intermediate results. Those results are gathered by isolating the second field and then the gpu info within it. The step-by-step story follows:
If the line does not contain a second field, leave alone.
Surround the second field by newlines and make a copy.
Isolate the second field
If the second field contains no gpu info, replace the entire field by spaces and using the copy, format the line accordingly.
Otherwise, isolate the gpu info, move it to the front of the line and append that to the copy of the line in the hold space.
Meanwhile, remove the gpu info from the pattern space and replace each character in the pattern space by a space.
Append these spaces to the copy and then overwrite the pattern space with the copy.
Lastly, knowing each part of the line has been split by newlines, reassemble the parts into the desired format.
N.B. The solution depends on the spacing of columns being real spaces. If there are tabs in the file, then prepend the sed command s/\t/ /g (where in the example tabs are replaced by 8 spaces).
Alternative:
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*(\n.*)\n.*/\1/;s/(.)(.*gpu=)([^,]+)/\3\1\2/;H
s/.*\n//;s/./ /g;G
s/(.*)\n(.*)\n.*\n(.*)\n(.*)\n.*$/\2\4\1\3/' file
In this solution, rather than treat lines with a second field but no gpu info, as a separate case, I introduce a place holder for this missing info and follow the same solution as if gpu info was present.

linux/unix convert delimited file to fixed width

I have a requirement to convert a delimited file to fixed-width file, details as follows.
Input file sample:
AAA|BBB|C|1234|56
AA1|BB2|DD|12345|890
Output file sample:
AAA BBB C 1234 56
AA1 BB2 DD 12345 890
Details of field positions
Field 1 Start at position 1 and length should be 5
Field 2 start at position 6 and length should be 6
Field 3 Start at position 12 and length should be 4
Field 4 Start at position 16 and length should be 6
Field 5 Start at position 22 and length should be 3
Another awk solution:
echo -e "AAA|BBB|C|1234|56\nAA1|BB2|DD|12345|890" |
awk -F '|' '{printf "%-5s%-6s%-4s%-6s%-3s\n",$1,$2,$3,$4,$5}'
Note the - in each format specifier of the printf statement (for example %-3s), which left-aligns the fields, as required in the question. Output:
AAA BBB C 1234 56
AA1 BB2 DD 12345 890
With the following awk command you can achieve your goal:
awk 'BEGIN { RS=" "; FS="|" } { printf "%5s%6s%4s%6s%3s\n",$1,$2,$3,$4,$5 }' your_input_file
Your record separator (RS) is a space and your field separator (FS) is a pipe (|) character. In order to parse your data correctly we set them in the BEGIN statement (before any data is read). Then using printf and the desired format characters we output the data in the desired format.
Output:
AAA BBB C 1234 56
AA1 BB2 DD 12345890
Update:
I just saw your edits on the input file format (previously they seemed different). If your input data records are separated with a new line then simply remove the RS=" "; part from the above one-liner and apply the - modifiers for the format characters to left align your fields:
awk 'BEGIN { FS="|" } { printf "%-5s%-6s%-4s%-6s%-3s\n",$1,$2,$3,$4,$5 }' your_input_file
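If the field widths are likely to change, they can be kept in one place instead of being written into the format string. The following is a minimal sketch of that idea; the variable names widths and fixed_out are assumptions, and the widths are the ones listed in the question.

```shell
# Field widths from the question, kept in one shell variable.
widths='5 6 4 6 3'

fixed_out=$(printf '%s\n' 'AAA|BBB|C|1234|56' 'AA1|BB2|DD|12345|890' |
    awk -F '|' -v widths="$widths" '
    BEGIN { n = split(widths, w, " ") }      # w[1..n] holds the widths
    {
        line = ""
        for (i = 1; i <= n; i++)
            line = line sprintf("%-" w[i] "s", $i)   # left-align, pad to width
        print line
    }')

printf '%s\n' "$fixed_out"
```

Note that %-Ns only pads; a value longer than its width is not truncated and would shift the following columns.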

Move Last Four Lines To Second Row In Text File

I need to take the last 4 lines of a text file and move them to the second row of the file.
I'm assuming that tail and sed can be used for this, but I haven't had much luck so far.
Here is a head and tail solution. Let us start with the same sample file as Glenn Jackman:
$ seq 10 >file
Apply these commands:
$ head -n1 file ; tail -n4 file; tail -n+2 file | head -n-4
1
7
8
9
10
2
3
4
5
6
Explanation:
head -n1 file
Print first line
tail -n4 file
Print last four lines
tail -n+2 file | head -n-4
Print the lines starting with line 2 and ending before the fourth-to-last line.
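The commands above only print the reordered lines. To update the file itself, one option is to write to a temporary file and then replace the original, sketched below. The file names are assumptions, and head -n -4 with a negative count is a GNU coreutils feature.

```shell
# Hypothetical working file with the same sample content.
f=$(mktemp)
seq 10 > "$f"

# Write the reordered lines to a temporary file, then replace the original.
tmp=$(mktemp)
{
    head -n 1 "$f"                   # first line
    tail -n 4 "$f"                   # last four lines
    tail -n +2 "$f" | head -n -4     # everything in between
} > "$tmp" && mv "$tmp" "$f"

moved=$(tr '\n' ' ' < "$f")
printf '%s\n' "$moved"
rm -f "$f"
```

Redirecting straight back into the input file would truncate it before it is read, which is why the temporary file is used.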
If I'm assuming correctly, ed can handle your task:
seq 10 > file
ed file <<'COMMANDS'
$-3,$m1
w
q
COMMANDS
cat file
1
7
8
9
10
2
3
4
5
6
Lines 7, 8, 9 and 10 have been moved to follow line 1, so they now start at the 2nd line.
$-3,$m1 means: take the range of lines from "$-3" (3 lines before the last line) to "$" (the last line) and move them ("m") below the first line ("1").
Note that the heredoc delimiter is quoted so the shell does not try to interpret the strings $- and $m1 as variables.
If you don't want to actually modify the file, but instead print to stdout:
ed -s file <<'COMMANDS'
$-3,$m1
%p
Q
COMMANDS
Here is an awk solution:
seq 10 > file
awk '{a[NR]=$0} END {for (i=1;i<=NR-4;i++) if (i==2) {for (j=NR-3;j<=NR;j++) print a[j];print a[i]} else print a[i]}' file
1
7
8
9
10
2
3
4
5
6
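The one-liner above is correct for the 10-line sample but hard-codes the count and drops lines when the file has fewer than 6 lines. A slightly more general sketch follows, with the number of lines to move passed in as n (an assumed variable name) and short inputs printed unchanged.

```shell
awk_moved=$(seq 10 | awk -v n=4 '
{ line[NR] = $0 }                    # buffer the whole input
END {
    if (NR <= n + 1) {               # nothing to move: print unchanged
        for (i = 1; i <= NR; i++) print line[i]
        exit
    }
    print line[1]                                       # first line
    for (i = NR - n + 1; i <= NR; i++) print line[i]    # last n lines
    for (i = 2; i <= NR - n; i++) print line[i]         # the rest
}' | tr '\n' ' ')

printf '%s\n' "$awk_moved"
```

Like the original, this holds the entire file in memory, which is fine for small files but worth keeping in mind for very large ones.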

Change format of text file

I have a file with many lines of tab separated data in the following format:
1 1 2 2
3 3 4 4
5 5 6 6
...
and I would like to change the format to:
1 1
2 2
3 3
4 4
5 5
6 6
Is there a not too complicated way to do this? I don't have any experience with using awk, sed, etc.
Thanks
If you just want to group your file in blocks of X columns, you can make use of xargs -nX:
$ xargs -n2 < file
1 1
2 2
3 3
4 4
5 5
6 6
To have more control, and to print an empty line after every fourth field (that is, after each input line), you can also use this awk:
$ awk 'BEGIN{FS=OFS="\t"} {for (i=1;i<=NF;i++) printf "%s%s", $i, (i%2?OFS:RS); print ""}' file
1 1
2 2
3 3
4 4
5 5
6 6
# <-- note there is an empty line here
Explanation
After each odd-numbered field, it prints OFS.
After each even-numbered field, it prints RS.
Note FS stands for field separator, which defaults to space, and RS stands for record separator, which defaults to new line. As you have tab as field separator, we redefine it in the BEGIN block.
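The same loop can be sketched with the number of output columns as a parameter and without the extra empty line. The variable name cols and the inline sample input are assumptions.

```shell
reshaped=$(printf '1\t1\t2\t2\n3\t3\t4\t4\n' |
    awk -v cols=2 'BEGIN { FS = OFS = "\t" }
    {
        # After every cols-th field, and after the last field, end the row;
        # otherwise separate fields with a tab.
        for (i = 1; i <= NF; i++)
            printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS)
    }')

printf '%s\n' "$reshaped"
```

With cols=2 this turns each four-field input line into two tab-separated rows.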
This is probably the simplest way which allows for customisation
awk '{print $1,$2"\n"$3,$4}' file
For a line between
awk '{print $1,$2"\n"$3,$4"\n"}' file
although fedorqui's answer with xargs is probably the simplest if this isn't needed.
As Ed pointed out this wouldn't work if there were blanks in the fields, this could be resolved using
awk 'BEGIN{FS=OFS="\t"} {print $1,$2 ORS $3,$4 ORS}' file
Through perl,
perl -pe 's/\t(\d\t\d)$/\n$1\n/g' file
Feed the above command's output to sed to delete the last blank line.
perl -pe 's/\t(\d\t\d)$/\n$1\n/g' file | sed '$d'

how to sort a file according to another file?

Is there a unix oneliner or some other quick way on linux to sort a file according to a permutation set by the sorting of another file?
i.e.:
file1: (separated by CRLFs, not spaces)
2
3
7
4
file2:
a
b
c
d
sorted file1:
2
3
4
7
so the result of this one liner should be
sorted file2:
a
b
d
c
paste file1 file2 | sort | cut -f2
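Plain sort compares text, so a key such as 10 would sort before 2. Assuming file1 holds numbers, a numeric sort on the first column avoids that. The sketch below uses hypothetical sample files in a temporary directory.

```shell
# Hypothetical sample files; file1 includes a two-digit key on purpose.
d=$(mktemp -d)
printf '%s\n' 2 10 7 4 > "$d/file1"
printf '%s\n' a b c d > "$d/file2"

# Pair the lines, sort numerically on the key from file1, keep file2's column.
sorted2=$(paste "$d/file1" "$d/file2" | sort -n -k1,1 | cut -f2)

printf '%s\n' "$sorted2"
rm -rf "$d"
```

paste joins the two files with a tab, which is also the default delimiter for cut, so values in file2 may contain spaces but not tabs.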
Below is a perl one-liner that will print the contents of file2 based on the sorted input of file1.
perl -n -e 'BEGIN{our($x,$t,@a)=(0,1,)}if($t){$a[$.-1]=$_}else{$a[$.-1].=$_ unless($.>$x)};if(eof){$t=0;$x=$.;close ARGV};END{foreach(sort @a){($j,$l)=split(/\n/,$_,2);print qq($l)}}' file1 file2
Note: If the files are different lengths, the output will only print up to the shortest file length.
For example, if file-A has 5 lines and file-B has 8 lines then the output will only be 5 lines.
