linux/unix convert delimited file to fixed width

I have a requirement to convert a delimited file to a fixed-width file; details are as follows.
Input file sample:
AAA|BBB|C|1234|56
AA1|BB2|DD|12345|890
Output file sample:
AAA  BBB   C   1234  56
AA1  BB2   DD  12345 890
Details of field positions
Field 1 Start at position 1 and length should be 5
Field 2 start at position 6 and length should be 6
Field 3 Start at position 12 and length should be 4
Field 4 Start at position 16 and length should be 6
Field 5 Start at position 22 and length should be 3

Another awk solution:
echo -e "AAA|BBB|C|1234|56\nAA1|BB2|DD|12345|890" |
awk -F '|' '{printf "%-5s%-6s%-4s%-6s%-3s\n",$1,$2,$3,$4,$5}'
Note the - in each conversion specifier (e.g. %-3s) of the printf statement, which left-aligns the fields, as required in the question. Output:
AAA  BBB   C   1234  56
AA1  BB2   DD  12345 890
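One thing to keep in mind: %-5s pads short fields but does not truncate long ones. If a field could overflow its slot and clipping is acceptable, adding a precision to each specifier caps the width; a sketch of the same one-liner, assuming truncation is what you want:
awk -F '|' '{printf "%-5.5s%-6.6s%-4.4s%-6.6s%-3.3s\n",$1,$2,$3,$4,$5}' your_input_file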

With the following awk command you can achieve your goal:
awk 'BEGIN { RS=" "; FS="|" } { printf "%5s%6s%4s%6s%3s\n",$1,$2,$3,$4,$5 }' your_input_file
Your record separator (RS) is a space and your field separator (FS) is a pipe (|) character. In order to parse your data correctly, we set them in the BEGIN block (before any data is read). Then, using printf with the appropriate format specifiers, we output the data in the desired format.
Output:
  AAA   BBB   C  1234 56
  AA1   BB2  DD 12345890
Update:
I just saw your edits on the input file format (previously it seemed different). If your input data records are separated by newlines, then simply remove the RS=" "; part from the above one-liner and add the - modifier to the format specifiers to left-align your fields:
awk 'BEGIN { FS="|" } { printf "%-5s%-6s%-4s%-6s%-3s\n",$1,$2,$3,$4,$5 }' your_input_file
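If you would rather not hard-code the format string, the widths can be kept in one place and the specifiers built on the fly. A minimal sketch using only portable awk (the width list here just mirrors the positions given in the question):
awk -F '|' 'BEGIN { n = split("5 6 4 6 3", w, " ") }
{ line = ""; for (i = 1; i <= n; i++) line = line sprintf("%-" w[i] "s", $i); print line }' your_input_file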

Related

count character length from a large file

I need to find the character length of each line in a file containing 140000 lines; the string lengths vary.
aaaaa
bbb
ccccc
ddddd
fff
Expecting output as below
strings char-length
2 3
3 5
(meaning 2 strings have character length 3 and 3 strings have character length 5). I have already tried a for loop, which reads each and every line, but it takes time since my file has 140000 string lines.
If you have awk available you can try the following command:
awk '{ print length($0) }' <your_file> | sort | uniq -c
(Took 27ms on my VM with sample test file of 7000 lines, each line around 10 chars long).
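A single-pass alternative that counts each length in an array avoids sorting all 140000 lines first; a sketch (the order of the lengths in the END loop is unspecified, hence the trailing sort):
awk '{ count[length($0)]++ } END { for (len in count) print count[len], len }' <your_file> | sort -k2,2n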

linux command to delete the last column of csv

How can I write a Linux command to delete the last column of a tab-delimited csv?
Example input
aaa bbb ccc ddd
111 222 333 444
Expected output
aaa bbb ccc
111 222 333
It is easy to remove the first field instead of the last, so we reverse the content, remove the first field, and then reverse it again.
Here is an example for a "CSV"
rev file1 | cut -d "," -f 2- | rev
Replace the "file1" and the "," with your file name and the delimiter accordingly.
You can use cut for this. You specify a delimiter with option -d and then give the field numbers (option -f) you want to have in the output. Each line of the input gets treated individually:
cut -d$'\t' -f 1-6 < my.csv > new.csv
This is according to your words. Your example looks more like you want to strip a column in the middle:
cut -d$'\t' -f 1-3,5-7 < my.csv > new.csv
The $'\t' is a bash notation for the string containing the single tab character.
You can use the command below, which will delete the last column of a tab-delimited csv irrespective of the number of fields:
sed -r 's/(.*)\s+\S+$/\1/'
for example:
echo "aaa bbb ccc ddd 111 222 333 444" | sed -r 's/(.*)\s+[^\s]+$/\1/'

How to divide a column based on the corresponding value in another file?

I have multiple files (66) and want to divide column 3 of each file by its corresponding value in info.file, inserting the new value as column 4 of each file.
My manual code is:
awk '{print $4=$3/NUmber from info.file}1' file
But this takes me hours to do for each individual file. So I want to automate it for all files. Thanks
file1:
chrm name value
4 a 8
3 b 4
file2:
chrm name value
3 g 6
5 s 12
info.file:
file_name average
file1 8
file2 6
file3 10
output:
file1:
chrm name value new_value
4 a 8 1
3 b 4 0.5
file2:
chrm name value new_value
3 g 6 1
5 s 12 2
without error handling
$ awk 'NR==FNR {a[$1]=$2; next}
FNR==1 {out=FILENAME".new"; print $0, "new_value" > out; next}
{v=$NF/a[FILENAME]; $++NF=v; print > out}' info file1 file2
will generate updated files
$ head file{1,2}.new | column -t
==> file1.new <==
chrm name value new_value
4 a 8 1
3 b 4 0.5
==> file2.new <==
chrm name value new_value
3 g 6 1
5 s 12 2
Explanation
NR==FNR {a[$1]=$2; next} scan the first file and save the file/value pairs in the associative array
FNR==1 in the header line of each data file
out=FILENAME".new" set a output filename
print $0, "new_value" > out print existing header appended with the new column name
v=$NF/a[FILENAME] for every data line, scale the last field and assign to v
$++NF=v increment number of fields and assign the new computed value to the last field
print > out print the new line to the same file set before
info file1 file2 the list of files should be preceded by the info file
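If one of the data files might be missing from info.file, the division above would use an empty value; a guarded variant of the same program (just a sketch, same structure) skips such files instead:
$ awk 'NR==FNR {a[$1]=$2; next}
       !(FILENAME in a) {next}
       FNR==1 {out=FILENAME".new"; print $0, "new_value" > out; next}
       {v=$NF/a[FILENAME]; $++NF=v; print > out}' info file1 file2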
I have prepared the following double nested awk command for you:
awk 'NR>1{system("awk -v div="$2" -f div_column3.awk "$1" | column -t > new_"$1);}' info.file
with div_column3.awk being an awk script file with the following content:
$ cat div_column3.awk
NR==1{print $0" new_value"}NR>1{print $0" "$3/div}

add a column with different labels

I want to add a column that contains two different labels. Let's say I have this text
aa bb cc
dd ee ff
gg hh ii
ll mm nn
oo pp qq
and I want to add 1 as the first column of the first two lines and 2 as the first column of the remaining lines, so that eventually I will get this text:
1 aa bb cc
1 dd ee ff
2 gg hh ii
3 ll mm nn
4 oo pp qq
Do you know how to do it?
thanks
Assuming you are processing a text file in a Linux shell, you could use awk for this. Your problem description says you want two labels, 1 and 2; that would be
cat input.txt | awk '{print (NR<=2 ? "1 ":"2 ") $0}'
Your expected output says you want label 1 for the first two lines and to start counting from 2 beginning with the third line; that would be
cat input.txt | awk '{print (NR<=2 ? "1 ":NR-1" ") $0}'
I'm assuming that you want to do this using the shell. If your data is in a file called input.txt, you can use either cat -n or nl.
% tail -n+2 input.txt | cat -n
1 dd ee ff
2 gg hh ii
3 ll mm nn
4 oo pp qq
% tail -n+2 input.txt | nl
1 dd ee ff
2 gg hh ii
3 ll mm nn
4 oo pp qq
The first line can be added back manually.
The two commands will behave differently if you have empty lines in your input file.
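If blank lines should be numbered as well, nl can be told to do so; -ba selects the "number all body lines" style, which then matches cat -n:
% tail -n+2 input.txt | nl -ba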
Could you please try the following and let me know if this helps you.
Solution 1: Use a variable named count with an initial value of 1; if the line number is 1 or 2, simply prefix $1 with 1, otherwise increment count and prefix $1 with its value.
awk -v count=1 '{$1=NR==1||NR==2?1 FS $1:++count FS $1} 1' Input_file
Solution 2: If the line number is 1 or 2, simply prefix $1 with 1; otherwise, if the line is not empty, prefix $1 with NR-1 (the line number minus 1).
awk '{$1=NR==1||NR==2?1 FS $1:(NF?FNR-1 FS $1:"")} 1' Input_file

How can I adjust the length of a column field in bash using awk or sed?

I have an input.csv file in which columns 2 and 3 have variable length.
100,Short Column, 199
200,Meeedium Column,1254
300,Loooooooooooong Column,35
I'm trying to use the following command to achieve a clean tabulation, but I need to pad the 2nd column with a certain number of blank spaces in order to get a fixed-length column (let's say that a total length of 30 is enough).
awk -F, '{print $1 "\t" $2 "\t" $3;}' input.csv
My current output looks like this:
100 Short Column 199
200 Meeedium Column 1254
300 Loooooooooooong Column 35
And I would like to achieve the following output, by filling 2nd and 3rd column properly:
100   Short Column               199
200   Meeedium Column           1254
300   Loooooooooooong Column      35
Any good ideas out there about which awk or sed command should be used?
Thanks everybody.
Use printf in awk
$ awk -F, '{gsub(/ /, "", $3); printf "%-5s %-25s%5s\n", $1, $2, $3}' input.csv
100   Short Column               199
200   Meeedium Column           1254
300   Loooooooooooong Column      35
What I have done above is set the field separator (-F) to ,. Since the file has some white space in the 3rd column, which mangles how printf processes the strings, I remove it with gsub and format the output with a C-style printf.
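If exact widths don't matter and the goal is simply visually aligned columns, piping through column sidesteps choosing widths altogether; a sketch, assuming the util-linux (or BSD) column tool is available:
awk 'BEGIN { FS = " *, *"; OFS = "\t" } { $1 = $1; print }' input.csv | column -t -s$'\t'
Here $1 = $1 forces awk to rebuild the record with tab separators, and column -t aligns on those tabs without splitting the multi-word second column.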
Rather than picking some arbitrary number as the width of each field, do a 2-pass approach where the first pass calculates the max length of each field and the 2nd prints the fields at that width, with a couple of spaces between fields:
$ cat tst.awk
BEGIN { FS=" *, *"; OFS="  " }
# first pass: record the max width of each column and whether it contains non-digits
NR==FNR {
    for (i=1; i<=NF; i++) {
        w[i] = (length($i) > w[i] ? length($i) : w[i])
        if ($i ~ /[^0-9]/) {
            a[i] = "-"
        }
    }
    next
}
# second pass: print each field left-aligned (text) or right-aligned (digits) at its max width
{
    for (i=1; i<=NF; i++) {
        printf "%"a[i]w[i]"s%s", $i, (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk file file
100  Short Column             199
200  Meeedium Column         1254
300  Loooooooooooong Column    35
The above also uses left-alignment for non-digit fields, right alignment for all-digits fields. It'll work no matter how long the input fields are and no matter how many fields you have:
$ cat file1
100000,Short Column, 199,a
100,Now is the Winter of our discontent with fixed width fields,20000,b
100,Short Column, 199,c
200,Meeedium Column,1254,d
300,Loooooooooooong Column,35,e
$ awk -f tst.awk file1 file1
100000  Short Column                                                     199  a
   100  Now is the Winter of our discontent with fixed width fields    20000  b
   100  Short Column                                                     199  c
   200  Meeedium Column                                                 1254  d
   300  Loooooooooooong Column                                             35  e
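To use it on the original input.csv from the question, pass the same file twice (once per pass) and redirect; the output name below is just an example:
$ awk -f tst.awk input.csv input.csv > input_aligned.csv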
Solution using perl
$ perl -pe 's/([^,]+),([^,]+),([^,]+)/sprintf "%-6s%-30s%5s", $1,$2,$3/e' input.csv
100   Short Column                    199
200   Meeedium Column                1254
300   Loooooooooooong Column           35
