Linux awk text file processing

I have a file with a few thousand lines of data, each line is like: a:b:c:d
So for example:
0.0:2000.00:2000.04:2000.02
I want to get all a's in one file, b's in second file etc. How?

One way. Output files will be named fileX, with X for each column number.
Assuming infile with content:
0.0:2000.00:2000.04:2001.02
0.1:2002.00:2000.05:2003.02
0.2:2003.00:2002.04:2004.02
0.3:2001.00:2000.05:2000.03
0.3:2001.00:2000.04:2001.02
0.2:2001.00:2002.04:2000.02
Execute this awk command:
awk '
BEGIN {
FS = ":";
}
{
for ( i = 1; i <= NF; i++ ) {
print $i > ("file" i);
}
}
' infile
Check output files:
head file[1234]
With following result:
==> file1 <==
0.0
0.1
0.2
0.3
0.3
0.2
==> file2 <==
2000.00
2002.00
2003.00
2001.00
2001.00
2001.00
==> file3 <==
2000.04
2000.05
2002.04
2000.05
2000.04
2002.04
==> file4 <==
2001.02
2003.02
2004.02
2000.03
2001.02
2000.02

Look at the awk (or gawk) manual.
You should use the -F: flag to set the field separator to :.
You should use print with > file to get the outputs to the file you want.
awk -F: '{ for (i = 1; i <= NF; i++) { file = "file." i; print $i > file; } }' input
(awk on Mac OS X 10.7.4 does not permit an expression as the file name; gawk does. The solution shown will work on both.)

What about:
cut -d ':' -f1 filename > a.txt
Then use -f2 for the second field and write it to b.txt, and so on for the remaining fields.
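The same idea extends to all four columns with a loop. A minimal sketch, assuming exactly four colon-separated fields; the sample data, the temporary directory and the output names col1.txt through col4.txt are illustrative and not from the question:

```shell
# Create a small sample input (hypothetical data in the a:b:c:d format).
workdir=$(mktemp -d)
printf '%s\n' '0.0:2000.00:2000.04:2001.02' '0.1:2002.00:2000.05:2003.02' > "$workdir/infile"

# Extract each of the four fields into its own file, one cut call per column.
for n in 1 2 3 4; do
    cut -d ':' -f "$n" "$workdir/infile" > "$workdir/col$n.txt"
done

cat "$workdir/col2.txt"
```

Note that this reads the input once per column, whereas the awk answers write all output files in a single pass.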

Related

Filling empty spaces in a CSV file

I have a CSV file where some columns are empty such as
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
How do I replace all the empty columns with the word "empty"?
I have tried using awk (which is a command I am learning to use).
I want to have
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
I tried to replace just the 3rd column to see if I was on the right track
awk -F '[[:space:]]' '$2 && !$3{$3="empty"}1' file
this left me with
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
I have also tried
nawk -F, '{$3="\ "?"empty":$3;print}' OFS="," file
this resulted in
oski14,safe,empty,13,53,4
oski15,Unknow,empty,,,0
oski16,Unknow,empty,,,0
oski17,Unknow,empty,,,0
oski18,unsafe,empty,,1,2
oski19,unsafe,empty,4,,56
Lastly I tried
awk '{if (!$3) {print $1,$2,"empty"} else {print $1,$2,$3}}' file
this left me with
oski14,safe,empty,13,53,4 empty
oski15,Unknow,empty,,,0 empty
oski16,Unknow,empty,,,0 empty
oski17,Unknow,empty,,,0 empty
oski18,unsafe,empty,,1,2 empty
oski19,unsafe,empty,4,,56 empty

With a sed that supports EREs with a -E argument (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
You need to do the substitution twice because given contiguous commas like ,,, one regexp match would use up the first 2 ,s and so you'd be left with ,empty,,.
The above would change a completely empty line into empty, let us know if that's an issue.
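To see why the second substitution is needed, compare one pass with two on a line from the question. This is only a sketch and assumes a sed that accepts -E, as stated above; the variable names are illustrative:

```shell
line='oski15,Unknow,,,,0'

# One pass: the match for the first ",," consumes the second comma,
# so the empty field between the second and third commas is skipped.
once=$(printf '%s\n' "$line" | sed -E 's/(^|,)(,|$)/\1empty\2/g')

# Two passes: the second substitution fills the fields the first one skipped.
twice=$(printf '%s\n' "$line" | sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g')

printf '%s\n%s\n' "$once" "$twice"
```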

This is the awk command
awk 'BEGIN { FS=","; OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile
As suggested in the comments, you can shorten the BEGIN procedure to FS=OFS="," as awk allows chained assignment (which I did not know, thank you @EdMorton).
I've set FS="," in the BEGIN procedure instead of using the -F, option just for uniformity with setting OFS=",".
Clearly you can put the script in a more nice looking form:
#!/usr/bin/awk -f
BEGIN {
FS = ","
OFS = ","
}
{
for (i = 1; i <= NF; ++i)
if ($i == "")
$i = "empty"
print $0
}
and use it as a standalone program (you have to chmod +x it), even if this is known to have some drawbacks (consult the comments to this question as well as this answer):
./the_script_above your_file
or
down_the_pipe | ./the_script_above | further_processing
Clearly you are still able to feed the above script to awk this way:
awk -f the_script_above file1 file2
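A compact variant using the chained assignment mentioned above. This is a sketch only; the two sample lines are fed in through printf instead of being read from the asker's file, and the variable name filled is illustrative:

```shell
# FS = OFS = "," sets the input and output separators in one statement.
# Assigning to a field makes awk rebuild the record using OFS.
filled=$(printf '%s\n' 'oski15,Unknow,,,,0' 'oski18,unsafe,0.55,,1,2' |
    awk 'BEGIN { FS = OFS = "," }
         { for (i = 1; i <= NF; i++) if ($i == "") $i = "empty" }
         1')

printf '%s\n' "$filled"
```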

How to merge multiple columns in a file to a single column using bash commands? [duplicate]

I have a text file with three different columns. I want to create another file by merging all these columns into a single column.
My file looks like this:
mep_kylo_campaigns mep_primecastaccount mep_flightstatus
nqs tod_do gandhi_sub_data
kylo_register policy_record mep_kylo_jobs
mep_note msg_store mep_feature
nqs_aside tbl_employee mep_profile
I want my output to look like this:
mep_kylo_campaigns
nqs
kylo_register
mep_note
nqs_aside
mep_primecastaccount
mep_flightstatus
tod_do
policy_record
msg_store
tbl_employee
gandhi_sub_data
mep_kylo_jobs
mep_feature
mep_profile

This is one way but the order is not the same:
$ cat file | tr -s ' ' '\n'
mep_kylo_campaigns
mep_primecastaccount
mep_flightstatus
...
Update: As useless use of cat was suggested here is another form:
$ < file tr -s ' ' '\n'
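If the column order matters, one option is to call cut once per column. A minimal sketch, assuming exactly three columns separated by a single space; the sample data and the temporary file are made up for illustration:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'a1 b1 c1' 'a2 b2 c2' > "$workdir/file"

# Emit column 1 for every line, then column 2, then column 3.
merged=$(for n in 1 2 3; do
    cut -d ' ' -f "$n" "$workdir/file"
done)

printf '%s\n' "$merged"
```

With runs of several spaces between columns, cut would see empty fields, so the input would need squeezing first (for example with tr -s ' ').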

If you want to do it in awk, this is one way:
awk 'BEGIN{ ORS="" } { for ( i=1; i<= NF ; i++){ print $i"\n" } }' input.txt
Additionally, if you want to preserve the order of the columns, you can use this (the END loop runs over the column numbers explicitly, because the iteration order of for (key in dict) is unspecified in awk):
awk 'BEGIN{ ORS="" } { for ( i=1; i<= NF ; i++){ dict[i]=dict[i]$i"\n" } if (NF > maxNF) { maxNF = NF } } END { for ( i=1; i<= maxNF ; i++){ print dict[i] } }' input.txt
Hope it helps!

Here is a Perl Solution that maintains order
$ cat globe.txt
mep_kylo_campaigns mep_primecastaccount mep_flightstatus
nqs tod_do gandhi_sub_data
kylo_register policy_record mep_kylo_jobs
mep_note msg_store mep_feature
nqs_aside tbl_employee mep_profile
$ perl -F"/\s+/" -lane ' push(@F1,$F[0]);push(@F2,$F[1]);push(@F3,$F[2]); END { print join("\n",@F1,@F2,@F3) } ' globe.txt
mep_kylo_campaigns
nqs
kylo_register
mep_note
nqs_aside
mep_primecastaccount
tod_do
policy_record
msg_store
tbl_employee
mep_flightstatus
gandhi_sub_data
mep_kylo_jobs
mep_feature
mep_profile
$

Unix command to create new output file by combining 2 files based on condition

I have 2 files. Basically, I want to match the column names from File 1 with the column names listed in File 2. The resulting output file should have data for the columns that match File 2 and a Null value for the remaining column names in File 2.
Example:
file1
Name|Phone_Number|Location|Email
Jim|032131|xyz|xyz@qqq.com
Tim|037903|zzz|zzz@qqq.com
Pim|039141|xxz|xxz@qqq.com
File2
Location
Name
Age
Based on these 2 files, I want to create new file which has data in the below format:
Output:
Location|Name|Age
xyz|Jim|Null
zzz|Tim|Null
xxz|Pim|Null
Is there a way to get this result using join, awk or sed? I tried with join but couldn't get it working.

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR { names[++numNames] = $0; next }
FNR==1 {
for (nameNr=1;nameNr<=numNames;nameNr++) {
name = names[nameNr]
printf "%s%s", name, (nameNr<numNames?OFS:ORS)
}
for (i=1;i<=NF;i++) {
name2fldNr[$i] = i
}
next
}
{
for (nameNr=1;nameNr<=numNames;nameNr++) {
name = names[nameNr]
fldNr = name2fldNr[name]
printf "%s%s", (fldNr?$fldNr:"Null"), (nameNr<numNames?OFS:ORS)
}
}
$ awk -f tst.awk file2 file1
Location|Name|Age
xyz|Jim|Null
zzz|Tim|Null
xxz|Pim|Null
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

I'd suggest using csvcut, which is part of CSVKit (https://csvkit.readthedocs.org), along the lines of the following:
#!/bin/bash
HEADERS=File2
PSV=File1
headers=$(tr '\n' , < "$HEADERS" | sed 's/,$//' )
awk '-F|' '
BEGIN {OFS=FS}
NR==1 {print $0,"Age"; next}
{print $0, "Null"}' "$PSV" |\
csvcut "-d|" -c "$headers"
I realize this may not be entirely satisfactory, but csvcut doesn't currently have options to handle missing columns or translate missing data to a specified value.

Using `awk` to print number of lines in file in the BEGIN section

I am trying to write an awk script and before anything is done tell the user how many lines are in the file. I know how to do this in the END section but unable to do so in the BEGIN section. I have searched SE and Google but have only found a half dozen ways to do this in the END section or as part of a bash script, not how to do it before any processing has taken place at all. I was hoping for something like the following:
#!/usr/bin/awk -f
BEGIN{
print "There are a total of " **TOTAL LINES** " lines in this file.\n"
}
{
if($0==4587){print "Found record on line number "NR; exit 0;}
}
But have been unable to determine how to do this, if it is even possible. Thanks.

You can read the file twice:
awk 'NR!=1 && FNR==1 {print NR-1} <some more code here>' file{,}
In your example:
awk 'NR!=1 && FNR==1 {print "There are a total of "NR-1" lines in this file.\n"} $0==4587 {print "Found record on line number "NR; exit 0;}' file{,}
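You can write file file instead of file{,}; the brace expansion (a bash feature) simply expands to the file name twice.
The condition NR!=1 && FNR==1 is true only on the first line of the second pass over the file, when NR-1 equals the number of lines read in the first pass.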
To use an awk script containing:
#!/usr/bin/awk -f
NR!=1 && FNR==1 {
print "There are a total of "NR-1" lines in this file.\n"
}
$0==4587 {
print "Found record on line number "NR; exit 0
}
call:
awk -f myscript file{,}
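To see how the two counters behave when the same file is passed twice, here is a small trace. It is a sketch with a made-up three-line file; the names workdir, trace and count are illustrative:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'x' 'y' 'z' > "$workdir/file"

# Pass the same file twice and print both counters for every record:
# NR keeps counting across files, FNR restarts at 1 for each file.
trace=$(awk '{ print NR, FNR }' "$workdir/file" "$workdir/file")

# The only record with FNR==1 and NR!=1 is the first one of the second pass.
count=$(awk 'NR!=1 && FNR==1 { print NR-1 }' "$workdir/file" "$workdir/file")

printf '%s\n' "$trace"
printf 'lines: %s\n' "$count"
```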

To do this robustly and for multiple files you need something like:
$ cat tst.awk
BEGINFILE {
numLines = 0
while ( (getline line < FILENAME) > 0 ) {
numLines++
}
print "----\nThere are a total of", numLines, "lines in", FILENAME
}
$0==4587 { print "Found record on line number", FNR, "of", FILENAME; nextfile }
$
$ cat file1
a
4587
c
$
$ cat file2
$
$ cat file3
d
e
f
4587
$
$ awk -f tst.awk file1 file2 file3
----
There are a total of 3 lines in file1
Found record on line number 2 of file1
----
There are a total of 0 lines in file2
----
There are a total of 4 lines in file3
Found record on line number 4 of file3
The above uses GNU awk for BEGINFILE. Any other solution is difficult to implement such that it will handle empty files (you need an array to track files being parsed and print info in the FNR==1 and END sections after the empty file has been skipped).
Using getline has caveats and should not be used lightly, see http://awk.info/?tip/getline, but this is one of the appropriate and robust uses of it. You can also test for non-readable files in BEGINFILE by testing ERRNO and skipping the file (see the gawk manual) - that situation will cause other scripts to abort.
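The ERRNO test mentioned above could look roughly like this. It is a sketch that requires GNU awk (invoked as gawk); the shell function name tag_lines and the message format are made up for illustration:

```shell
# Requires GNU awk: BEGINFILE and ERRNO are gawk extensions.
tag_lines() {
    gawk '
    BEGINFILE {
        # A non-empty ERRNO here means the file could not be opened;
        # nextfile skips it instead of letting gawk abort.
        if (ERRNO != "") {
            print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
            nextfile
        }
    }
    # Ordinary per-line processing for the files that were opened.
    { print FILENAME ": " $0 }
    ' "$@"
}
```

Calling tag_lines missing_file good_file would then report the unreadable file on stderr and still process the readable one.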

BEGIN {
s="cat your_file.txt|wc -l";
s | getline file_size;
close(s);
print file_size
}
This puts the number of lines in the file named your_file.txt into the awk variable file_size and prints it.
If your file name is dynamic you can pass the filename on the commandline and change the script to use the variable.
E.g. my.awk
BEGIN {
s="cat "VAR"|wc -l";
s | getline file_size;
close(s);
print file_size
}
Then you can call it like this:
awk -v VAR="your_file.txt" -f my.awk

If you use GNU awk and need a robust, generic solution that accommodates multiple, possibly empty input files, use Ed Morton's solution.
This answer uses portable (POSIX-compliant) code. Within the constraints noted, it is robust, but Ed's GNU awk solution is both simpler and more robust.
Tip of the hat to Ed Morton for his help.
With a single input file, it is simpler to handle line counting with a shell command in the BEGIN block, which has the following advantages:
on invocation, the filename doesn't have to be specified twice, unlike in the accepted answer
Also note that the accepted answer doesn't work as intended (as of this writing); the correct form is (see the comments on the answer for an explanation):
awk 'NR==FNR {next} FNR==1 {print NR-1} $0==4587 {print "Found record on line number "NR; exit 0}' file{,}
the solution also works with an empty input file.
In terms of performance, this approach is either only slightly slower than reading the file twice in awk, or even a little faster, depending on the awk implementation used:
awk '
BEGIN {
# Execute a shell command to count the lines and read
# result into an awk variable via <cmd> | getline <varname>.
# If the file cannot be read, abort. (The shell has already printed an error msg.)
cmd="wc -l < \"" ARGV[1] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
printf "There are a total of %s lines in this file.\n\n", count
}
$0==4587 { print "Found record on line number " NR; exit 0 }
' file
Assumptions:
The filename is passed as the 1st operand (non-option argument) on the command line, accessed as ARGV[1].
The filename doesn't contain embedded " chars.
The following solutions deal with multiple files and make analogous assumptions:
All operands passed are filenames. That is, all arguments after the program must be filenames, and not variable assignments such as var=value.
No filename contains embedded " chars.
No processing is to take place if any of the input files do not exist or cannot be read.
It's not hard to generalize this to handling multiple files, but the following solution doesn't print the line count for empty files:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 { printf "There are a total of %s lines in file %s.\n\n", counts[FILENAME], FILENAME }
# $0==4587 { printf "%s: Found record on line number %d\n", FILENAME, NR; exit 0 }
' file1 file2 # ...
Things get a little trickier if you want the line count to be printed for empty files also:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
fileCount = ARGC - 1
fmtStringCount = "There are a total of %s lines in file %s.\n\n"
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 {
++fileIndex
# If there were intervening empty files, print their counts too.
while (ARGV[fileIndex] != FILENAME) {
printf fmtStringCount, 0, ARGV[fileIndex++]
}
printf fmtStringCount, counts[FILENAME], FILENAME
}
# Process input lines
$0==4587 { printf "%s: Found record on line number %d\n", FILENAME, NR; exit 0 }
# If there are any remaining empty files at the end, print their counts too.
END {
while (fileIndex < fileCount) { printf fmtStringCount, 0, ARGV[++fileIndex] }
}
' file1 file2 # ...

You can get the number of lines with wc, pass it to awk as a variable with the -v option, and then use that variable in the awk script.
awk -v FNC="$(wc -l < awk.txt | tr -d ' ')" \
'BEGIN { print "FNC: " FNC } { print $0 }' awk.txt

How to use AWK to print line with highest number?

I have a question. Suppose I dump a file and grep for foo, and the result looks like this:
Foo-bar-120:'foo name 1'
Foo-bar-130:'foo name 2'
Foo-bar-1222:'foo name 3'
Etc.
All I want is to extract the foo name with the largest number. For instance, in this case the largest number is 1222 and the result I expect is foo name 3.
Is there an easy way to achieve this using awk and sed, rather than pulling the number out line by line and looping through to find the largest one?

Code for awk:
awk -F[-:] '$3>a {a=$3; b=$4} END {print b}' file
$ cat file
Foo-bar-120:'foo name 1'
Foo-bar-130:'foo name 2'
Foo-bar-1222:'foo name 3'
$ awk -F[-:] '$3>a {a=$3; b=$4} END {print b}' file
'foo name 3'

Here's how I would do it. I just tested this in Cygwin. Hopefully it works under Linux as well. Put this into a file, such as mycommand:
#!/usr/bin/awk -f
BEGIN {
FS="-";
max = 0;
maxString = "";
}
{
num = $3 + 0; # convert string to int
if (num > max) {
max = num;
split($3, arr, "'");
maxString = arr[2];
}
}
END {
print maxString;
}
Then make the file executable (chmod 755 mycommand). Now you can pipe whatever you want through it by typing, for example, cat somefile | ./mycommand.

Assuming the line format is as shown with 2 hyphens before "the number":
cut -d- -f3- | sort -rn | sed '1{s/^[0-9]\+://; q}'

is this ok for you?
awk -F'[:-]' '{n=$(NF-1);if(n>m){v=$NF;m=n}}END{print v}'
with your data:
kent$ echo "Foo-bar-120:’foo name 1’
Foo-bar-130:’foo name 2’
Foo-bar-1222:’foo name 3’"|awk -F'[:-]' '{n=$(NF-1);if(n>m){v=$NF;m=n}}END{print v}'
’foo name 3’
P.S. I like the Field separator [:-]
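To make the effect of that field separator concrete, this sketch prints every field of one sample line (written here with plain ASCII quotes; the variable names are illustrative):

```shell
line="Foo-bar-1222:'foo name 3'"

# With FS set to the bracket expression [:-], the line is split at every
# colon or hyphen, so the number is field 3 and the quoted name is field 4.
fields=$(printf '%s\n' "$line" | awk -F'[:-]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')

printf '%s\n' "$fields"
```

This is why $(NF-1) is the number and $NF is the name in the command above. The approach assumes the name itself contains no colon or hyphen.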

$ awk '{gsub(/.*:.|.$/,"")} (NR==1)||($NF>max){max=$NF; val=$0} END{print val}' file
foo name 3

You don't need to use grep. you can use awk directly on your file as:
awk -F"[-:]" '/Foo/ && $3>prev{val=$NF;prev=$3}END{print val}' file
