Using awk to include file name with format in column - linux

I'm working on wrangling some data to ingest into Hive. The problem is, I have overwrites in my historical data so I need to include the file name in the text files so that I can dispose of the duplicated rows which have been updated in subsequent files.
The way I've chosen to go about this is to use awk to add the file name to each file, then after I ingest into Hive I can use HQL to filter out my deprecated rows.
Here is my sample data (tab-delimited):
animal legs eyes
hippo 4 2
spider 8 8
crab 8 2
mite 6 0
bird 2 2
I've named it long_name_20180901.txt
I've figured out how to add my new column from this post:
awk '{print FILENAME (NF?"\t":"") $0}' long_name_20180901.txt
which results in:
long_name_20180901.txt animal legs eyes
long_name_20180901.txt hippo 4 2
long_name_20180901.txt spider 8 8
long_name_20180901.txt crab 8 2
long_name_20180901.txt mite 6 0
long_name_20180901.txt bird 2 2
But, being a beginner, I don't know how to augment this command to:
make the column name (first line) something like "file_name"
implement regex in awk to extract just the part of the file name that I need and dispose of the rest. I really just want "long_name_(.{8,}).txt" (the stuff in the capturing group).
Target output is:
file animal legs eyes
20180901 hippo 4 2
20180901 spider 8 8
20180901 crab 8 2
20180901 mite 6 0
20180901 bird 2 2
Thanks for your time!! I'm a total newbie to awk.

This would handle one or multiple input files:
awk -v OFS='\t' '
NR==1 { print "file", $0 }
FNR==1 { n=split(FILENAME,t,/[_.]/); fname=t[n-1]; next }
{ print fname, $0 }
' *.txt
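To sanity-check the command above end to end, the sample file can be recreated inline (the file name follows the question's long_name_YYYYMMDD.txt pattern; only three rows are used here for brevity):

```shell
# Recreate a small tab-delimited sample matching the question's naming scheme
printf 'animal\tlegs\teyes\nhippo\t4\t2\nspider\t8\t8\n' > long_name_20180901.txt

awk -v OFS='\t' '
NR==1  { print "file", $0 }          # very first line overall: emit the new header
FNR==1 {                             # first line of each file: extract the date
    n = split(FILENAME, t, /[_.]/)   # -> long, name, 20180901, txt
    fname = t[n-1]                   # second-to-last piece: 20180901
    next
}
{ print fname, $0 }                  # prefix every data row with the date
' long_name_20180901.txt
```

Because FILENAME is split on both _ and ., the date is always the second-to-last piece, so the same command works unchanged for any file following that naming pattern.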

Alternatively, you can set the prefix to "file" in a BEGIN block, print it on the header line, and then reset it to the extracted date for the rest of the lines.
awk 'BEGIN{f="file\t"} NF{print f $0; if (f=="file\t") {l=split(FILENAME, a, /[_.]/); f=a[l-1]"\t"};}' long_name_20180901.txt


How to remove some words in specific field using awk?

I have several lines of text and I want to extract the number after a specific word using awk.
I tried the following code but it does not work.
First, create the test file with: vi test.text. There are 3 columns (the 3 fields are generated by other pipeline commands using awk).
Index  AllocTres                             CPUTotal
1      cpu=1,mem=256G                        18
2      cpu=2,mem=1024M                       16
3                                            4
4      cpu=12,gres/gpu=3                     12
5                                            8
6                                            9
7      cpu=13,gres/gpu=4,gres/gpu:ret6000=2  20
8      mem=12G,gres/gpu=3,gres/gpu:1080ti=1  21
Please note there are several empty fields in this file.
What I want to achieve: keep only the number following the first gres/gpu part and remove all cpu= and mem= parts, using a pipeline like cat test.text | awk '{some_commands}', to output 3 columns:
Index  AllocTres                             CPUTotal
1                                            18
2                                            16
3                                            4
4      3                                     12
5                                            8
6                                            9
7      4                                     20
8      3                                     21
1st solution: With your shown samples, please try following GNU awk code. This takes care of spaces in between fields.
awk '
FNR==1{ print; next }
match($0,/[[:space:]]+/){
  space=substr($0,RSTART,RLENGTH-1)
}
{
  match($2,/gres\/gpu=([0-9]+)/,arr)
  match($0,/^[^[:space:]]+[[:space:]]+[^[:space:]]+([[:space:]]+)/,arr1)
  space1=sprintf("%"length($2)-length(arr[1])"s",OFS)
  if(NF>2){ sub(OFS,"",arr1[1]);$2=space arr[1] space1 arr1[1] }
}
1
' Input_file
Output will be as follows for above code with shown samples:
Index  AllocTres                             CPUTotal
1                                            18
2                                            16
3                                            4
4      3                                     12
5                                            8
6                                            9
7      4                                     20
8      3                                     21
2nd solution: If you don't care about spacing, then try the following awk code.
awk 'FNR==1{print;next} match($2,/gres\/gpu=([0-9]+)/,arr){$2=arr[1]} 1' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line then do following.
print ##Printing current line.
next ##next will skip all further statements from here.
}
match($2,/gres\/gpu=([0-9]+)/,arr){ ##using match function to match regex gres/gpu= digits and keeping digits in capturing group.
$2=arr[1] ##Assigning 1st value of array arr to 2nd field itself.
}
1 ##printing current edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
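Note that the three-argument form of match() used above is a GNU awk extension. A portable sketch of the 2nd solution, using only POSIX match() plus substr() (same logic, no capture array; like the 2nd solution it leaves rows without gpu info untouched and does not preserve column alignment):

```shell
# Small sample input recreated inline so the snippet is self-contained
printf 'Index AllocTres CPUTotal\n4 cpu=12,gres/gpu=3 12\n3 4\n' > Input_file

awk 'FNR==1 { print; next }              # header: pass through
     match($2, /gres\/gpu=[0-9]+/) {     # RSTART/RLENGTH cover "gres/gpu=N"
         $2 = substr($2, RSTART + 9, RLENGTH - 9)   # drop the 9-char "gres/gpu=" prefix
     } 1' Input_file
```

This runs in any POSIX awk (mawk, BSD awk, busybox awk), not just gawk.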
Using sed
$ sed 's~\( \+\)[^,]*,\(gres/gpu=\([0-9]\)\|[^ ]*\)[^ ]* \+~\1\3 \t\t\t\t ~' input_file
Index  AllocTres  CPUTotal
1                 18
2                 16
3                 4
4      3          12
5                 8
6                 9
7      4          20
8      3          21
awk '
FNR>1 && NF==3 {
    n = split($2, a, ",")
    for (i=1; a[i] !~ /gres\/gpu=[0-9]+,?/ && i<=n; ++i);
    sub(/.*=/, "", a[i])
    $2 = a[i]
}
NF==2 {$3=$2; $2=""}
{printf "%-7s%-11s%s\n",$1,$2,$3}' test.txt
Output:
Index  AllocTres  CPUTotal
1                 18
2                 16
3                 4
4      3          12
5                 8
6                 9
7      4          20
8      3          21
You can adjust column widths as desired.
This assumes the first and last columns always have a value, so that NF (number of fields) can be used to identify field 2. Then if field 2 is not empty, split that field on commas, scan the resulting array for the first match of gres/gpu, remove this suffix, and print the three fields. If field 2 is empty, the second last line inserts an empty awk field so printf always works.
If assumption above is wrong, it's also possible to identify field 2 by its character index.
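A hypothetical sketch of that character-index idea: slice the AllocTres field out of the raw line by position instead of by field count. The start column (8) and width (38) below are assumptions for illustration; measure them in the actual file before use:

```shell
# Sample input recreated inline (real files will have wider, aligned columns)
printf 'Index AllocTres CPUTotal\n4 cpu=12,gres/gpu=3 12\n' > test.txt

awk 'NR==1 { print; next }
     {
         alloc = substr($0, 8, 38)      # AllocTres slice, by assumed position/width
         gpu = ""
         if (match(alloc, /gres\/gpu=[0-9]+/))
             gpu = substr(alloc, RSTART + 9, RLENGTH - 9)   # keep just the count
         printf "%-7s%-11s%s\n", $1, gpu, $NF   # same layout as the answer above
     }' test.txt
```

Slicing by position sidesteps awk's whitespace field splitting entirely, which is why it still works when AllocTres is empty.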
An awk-based solution that needs neither
- array splitting,
- regex back-referencing,
- prior state tracking, nor
- multiple passes over the input
(multi-passing /dev/stdin would require state tracking anyway):
{mng}awk '!_~NF || sub("[^ ]+$", sprintf("%*s&", length-length($!(NF=NF)),_))' \
FS='[ ][^ \\/]*gres[/]gpu[=]|[,: ][^= ]+[=][^,: ]+' OFS=
Index  AllocTres                             CPUTotal
1                                            18
2                                            16
3                                            4
4      3                                     12
5                                            8
6                                            9
7      4                                     20
8      3                                     21
If you don't need nawk compatibility, there's an even simpler single-pass approach with only one all-encompassing call to sub() per line:
awk ' sub("[^ ]*$", sprintf("%*s&", length($_) - length($(\
gsub(" [^ /]*gres[/]gpu=|[,: ][^= ]+=[^,: ]+", _)*_)),_))'
or, even more condensed but with worse syntax styling:
awk 'sub("[^ ]*$",sprintf("%*s&",length^gsub(" [^ /]*gres\/gpu=|"\
"[,: ][^= ]+=[^,: ]+",_)^_ - length,_) )'
This might work for you (GNU sed):
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*\n(.*)\n.*/\1/
/gpu=/!{s/./ /g;G;s/(^.*)\n(.*)\n.*\n/\2\1/p;d}
s/gpu=([^,]*)/\n\1 \n/;s/(.*)\n(.*\n)/\2\1/;H
s/.*\n//;s/./ /g;H;g
s/\n.*\n(.*)\n(.*)\n.*\n(.*)/\2\3\1/' file
In essence the solution above uses the hold space as a scratchpad to hold intermediate results. Those results are gathered by isolating the second field and then, again, the gpu info. The step-by-step story follows:
If the line does not contain a second field, leave it alone.
Surround the second field by newlines and make a copy.
Isolate the second field
If the second field contains no gpu info, replace the entire field by spaces and using the copy, format the line accordingly.
Otherwise, isolate the gpu info, move it to the front of the line and append that to the copy of the line in the hold space.
Meanwhile, remove the gpu info from the pattern space and replace each character in the pattern space by a space.
Append these spaces to the copy and then overwrite the pattern space with the copy.
Lastly, knowing each part of the line has been split by newlines, reassemble the parts into the desired format.
N.B. The solution depends on the columns being separated by real spaces. If there are tabs in the file, first expand them by prepending a substitution of the form s/\t/        /g (here each tab is replaced by 8 spaces).
Alternative:
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*(\n.*)\n.*/\1/;s/(.)(.*gpu=)([^,]+)/\3\1\2/;H
s/.*\n//;s/./ /g;G
s/(.*)\n(.*)\n.*\n(.*)\n(.*)\n.*$/\2\4\1\3/' file
In this solution, rather than treat lines with a second field but no gpu info, as a separate case, I introduce a place holder for this missing info and follow the same solution as if gpu info was present.

How to remove lines based on another file? [duplicate]

This question already has answers here:
How to delete rows from a csv file based on a list values from another file?
(3 answers)
Closed 2 years ago.
Now I have two files as follows:
$ cat file1.txt
john 12 65 0
Nico 3 5 1
king 9 5 2
lee 9 15 0
$ cat file2.txt
Nico
king
Now I would like to remove each line whose first column contains a name from the second file.
Ideal result:
john 12 65 0
lee 9 15 0
Could anyone tell me how to do that? I have tried the code like this:
for i in 'less file2.txt'; do sed "/$i/d" file1.txt; done
But it does not work properly.
You don't need to iterate; just use grep with the -v option to invert the match and -w to force the pattern to match only whole words:
grep -wvf file2.txt file1.txt
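A self-contained check with the two sample files from the question (note that -w matches the name anywhere on the line, not only in the first column; the awk answer below restricts the match to column 1):

```shell
# Recreate the question's two input files
printf 'john 12 65 0\nNico 3 5 1\nking 9 5 2\nlee 9 15 0\n' > file1.txt
printf 'Nico\nking\n' > file2.txt

# -f file2.txt  read the patterns from file2.txt, one per line
# -w            match whole words only
# -v            invert: keep lines matching none of the patterns
grep -wvf file2.txt file1.txt
```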
This job suits awk:
awk 'NR == FNR {a[$1]; next} !($1 in a)' file2.txt file1.txt
john 12 65 0
lee 9 15 0
Details:
NR == FNR { # While processing the first file
a[$1] # store the first field in an array a
next # move to next line
}
!($1 in a) # while processing the second file
# if first field doesn't exist in array a then print

Bash code to structure proteomics data

I need help restructuring my dataset so that I can perform the downstream analysis. I am presently dealing with proteomics data and want to perform a comparative analysis. The problem is the protein ids: in general one protein can have more than one id, separated by ";". I need to print the entire line once for each of a protein's ids. For example:
Input file :
tom dick harry jan
a;b;c 1 2 3 4
d;e 4 5 7 3
desirable output:
tom dick harry jan
a 1 2 3 4
b 1 2 3 4
c 1 2 3 4
d 4 5 7 3
e 4 5 7 3
many many thanks in advance
$ awk 'NR==1{$0="key "$0} {split($1,a,/;/); for (i=1; i in a; i++) { $1=a[i]; print } }' file | column -t
key tom dick harry jan
a 1 2 3 4
b 1 2 3 4
c 1 2 3 4
d 4 5 7 3
e 4 5 7 3
You can trivially remove the word "key" from the output if you don't like it but IMHO having some columns with and some without headers is a very bad idea - just makes any further processing more difficult.
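If you do want to keep the original header untouched, a minimal variant simply passes line 1 through, at the cost of the header misalignment warned about above:

```shell
# Recreate the question's input inline
printf 'tom dick harry jan\na;b;c 1 2 3 4\nd;e 4 5 7 3\n' > file

awk 'NR==1 { print; next }            # header: pass through unchanged
     {
         n = split($1, a, /;/)        # break the id list on ";"
         for (i = 1; i <= n; i++) { $1 = a[i]; print }   # one line per id
     }' file
```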
#!/bin/bash
read header
printf "%4s %s\n" "" "$header"
while read ids values
do
    for id in $(tr ';' ' ' <<< "$ids")
    do
        printf "%-4s %s\n" "$id" "$values"
    done
done
This reads the header and prints it (just slightly differently formatted); then, for each subsequent line, it prints one output line per id found at the beginning of that line. To find the ids, the id string is split on the semicolon (;).

Linux command and/or script for duplicate line retrieval

I would like to know if there's an easy way to locate duplicate lines in a text file that contains many entries (about 200,000 or more) and output a file with the duplicates' line numbers, keeping the source file intact. For instance, I have a file with tweets like this:
1. i got red apple
2. i got red apple in my stomach
3. i got green apple
4. i got red apple
5. i like blue bananas
6. i got red apple
7. i like blues music
8. i like blue bananas
9. i like blue bananas
I want the output to be a separate file like this:
4
6
8
9
where numbers will indicate the lines with duplicate entries (excluding the first occurrence of the duplicates). Also note that the matching pattern must be exactly the same sentence (like line 1 is different than line 2, 5 is different than 7 and so on).
Everything I could find with sort | uniq doesn't seem to match the whole sentence, only the first word, so I'm wondering whether an awk script would be better for this task, or whether there is another command that can do it.
I also need the first file to be intact (not sorted or reordered in any way) and get only the line numbers as shown above because I want to manually delete these lines from two files. The first file contains the tweets and the second the hashtags of these tweets, so I want to delete the lines that contain duplicate tweets in both files, keeping the first occurrence.
You can try this awk:
awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
As per the comments, to print every repeated occurrence (not just the second):
awk '$0 in a{print NR} {a[$0]++}' file
Output:
$ awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
4
8
$ awk '$0 in a{print NR} {a[$0]++}' file
4
6
8
9
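Both variants can be checked end to end against the sample tweets (the "1.", "2.", ... prefixes in the question are line numbers, not file content); redirect the output with > dupes.txt to get the separate file the question asks for:

```shell
# Recreate the nine sample tweets, one per line
printf '%s\n' \
    'i got red apple' \
    'i got red apple in my stomach' \
    'i got green apple' \
    'i got red apple' \
    'i like blue bananas' \
    'i got red apple' \
    'i like blues music' \
    'i like blue bananas' \
    'i like blue bananas' > file

# Print the line number of every occurrence after the first
awk '$0 in a { print NR } { a[$0]++ }' file
```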
You could use a Python script to do the same:
f = open("file")
lines = f.readlines()
count = len(lines)
ignore = []
for i in range(count):
    if i in ignore:
        continue
    for j in range(count):
        if j <= i:
            continue
        if lines[i] == lines[j]:
            ignore.append(j)
            print(j + 1)
output :
4
6
8
9
Here is a method combining a few command line tools:
nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' |
cut -f 1
This
numbers the lines with nl, left adjusted with no leading zeroes (-n ln)
sorts them (ignoring the first field, i.e., the line number) with sort
finds duplicate lines, ignoring the first field with uniq; the --all-repeated=prepend adds an empty line before each group of duplicate lines
removes all the empty lines and the first one of each group of duplicates with sed
removes everything but the line number with cut
This is what the output looks like at the different stages:
$ nl -n ln file
1 i got red apple
2 i got red apple in my stomach
3 i got green apple
4 i got red apple
5 i like blue bananas
6 i got red apple
7 i like blues music
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2
3 i got green apple
1 i got red apple
4 i got red apple
6 i got red apple
2 i got red apple in my stomach
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
7 i like blues music
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend

1 i got red apple
4 i got red apple
6 i got red apple

5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}'
4 i got red apple
6 i got red apple
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' | cut -f 1
4
6
8
9

Change format of text file

I have a file with many lines of tab separated data in the following format:
1 1 2 2
3 3 4 4
5 5 6 6
...
and I would like to change the format to:
1 1
2 2
3 3
4 4
5 5
6 6
Is there a not too complicated way to do this? I don't have any experience with using awk, sed, etc.
Thanks
If you just want to group your file in blocks of X columns, you can make use of xargs -nX:
$ xargs -n2 < file
1 1
2 2
3 3
4 4
5 5
6 6
To have more control, and to print an empty line after the 4th field, you can also use this awk:
$ awk 'BEGIN{FS=OFS="\t"} {for (i=1;i<=NF;i++) printf "%s%s", $i, (i%2?OFS:RS); print ""}' file
1 1
2 2
3 3
4 4
5 5
6 6
# <-- note there is an empty line here
Explanation
On odd fields, it prints OFS after them.
On even fields, it prints RS.
Note that FS stands for field separator (whose output counterpart is OFS, defaulting to a space) and RS for record separator, which defaults to a newline. As the file is tab-separated, FS and OFS are both set to tab in the BEGIN block.
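The i%2 test is what pairs the fields up; a hypothetical generalization reflows rows into blocks of any k fields (k=2 reproduces the output above, minus the per-record blank line, and assumes NF is a multiple of k):

```shell
# Recreate the question's tab-separated input
printf '1\t1\t2\t2\n3\t3\t4\t4\n' > file

awk -v k=2 '
BEGIN { FS = OFS = "\t" }
{ for (i = 1; i <= NF; i++) printf "%s%s", $i, (i % k ? OFS : RS) }  # OFS inside a block, RS after it
' file
```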
This is probably the simplest way that allows for customisation:
awk '{print $1,$2"\n"$3,$4}' file
For a blank line between records:
awk '{print $1,$2"\n"$3,$4"\n"}' file
although fedorqui's answer with xargs is probably the simplest if this isn't needed.
As Ed pointed out, this wouldn't work if there were blanks in the fields; that can be resolved using
awk 'BEGIN{FS=OFS="\t"} {print $1,$2 ORS $3,$4 ORS}' file
Using perl:
perl -pe 's/\t(\d\t\d)$/\n$1\n/g' file
Feed the above command's output to sed to delete the trailing blank line:
perl -pe 's/\t(\d\t\d)$/\n$1\n/g' file | sed '$d'
