add header to columns from list text file awk - linux

I have a very large text file with hundreds of columns. I want to add a header to every column from an independent text file containing a list.
My large file looks like this:
largefile.txt
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
my list of headers:
headers.txt
h1
h2
h3
wanted output:
output.txt
h1 h2 h3 h4 h5 h6 h7 etc..
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc

$ awk 'NR==FNR{h=h OFS $0; next} FNR==1{print OFS OFS h} 1' headers.txt largefile.txt | column -s ' ' -t
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
or if you prefer:
$ awk -v OFS='\t' 'NR==FNR{h=h OFS $0; next} FNR==1{print OFS OFS h} {$1=$1}1' headers.txt largefile.txt
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc

Well, here's one. OFS is tab for eye candy. From the OP I concluded that the headers should start from the fourth field, hence the +3s in the code.
$ awk -v OFS="\t" ' # tab OFS
NR==FNR { a[NR]=$1; n=NR; next } # has headers
FNR==1 { # print headers in the beginning of 2nd file
$1=$1 # rebuild record for tabs
b=$0 # buffer record
$0="" # clear record
for(i=1;i<=n;i++) # spread head to fields
$(i+3)=a[i]
print $0 ORS b # output head and buffered first record
}
{ $1=$1 }1' head data # implicit print with record rebuild
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
Then again, this would also do the trick:
$ awk 'NR==FNR{h=h (NR==1?"":OFS) $0;next}FNR==1{print OFS OFS OFS h}1' headers.txt largefile.txt
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc

Use paste to pivot the headers into a single line and then cat them together with the main file (- instead of a file name means stdin to cat):
$ paste -s -d' ' headers.txt | cat - largefile.txt
If you really need the headers to line up as in your example output, you can preprocess the headers file (either manually or with a command), or finish with sed (just one of several options), as below:
$ paste -s -d' ' headers.txt | cat - largefile.txt | sed '1 s/^/ /'
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
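If largefile.txt is actually tab-separated (as interval files like this often are), the same idea works with tabs; a sketch assuming GNU sed for the \t escape, with the number of leading tabs adjusted to taste (three shown, one per unlabelled leading column):
$ paste -s headers.txt | cat - largefile.txt | sed '1 s/^/\t\t\t/'
paste -s joins lines with tabs by default, so the header line comes out tab-separated to match the data.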

Related

deleting part after a specific character and that character

I'm working in my command line/bash on a large file with millions of rows. I'm analyzing the data with software that requires the rsIDs to be less than 40 characters.
awk 'length($2)>40' 1000G_All_chr_merged.bim > IDtoolong.bim
head IDtoolong.bim
1 rs540674385;rs540674385;rs540674385;rs576523156 0 4439107 AAG AAGGAGG
1 rs561687032;rs546685337;rs528205989;rs370782231 0 4804685 GCACACA GCA
1 rs561021122;rs542858700;rs527502051;rs560257256;rs545143128 0 6210427 AGG GGAAT
1 rs529037702;rs561824298;rs539915961;rs528175459 0 12122415 CCCATCCAT AT
1 rs571308260;rs549871057;rs537509991;rs587738155 0 12611561 CAAA CAAAA
1 rs553093917;rs553093917;rs534535365;rs570185860 0 16657917 AAAT AAATAAT
How can I run through the second column and delete the first semicolon, ;, and anything after that?
I tried this:
awk '{sub(/;.*/,"", $2)}' 1000G_All_chr_merged.bim > adjusted_IDlength.bim
I also tried using sed but found myself ruining the file at one point. Any help is appreciated!
I'm guessing that by "ruining the file" you mean changing the white space between fields. If that's the problem, the following won't do that:
$ sed 's/;[^[:space:]]*//' file
1 rs540674385 0 4439107 AAG AAGGAGG
1 rs561687032 0 4804685 GCACACA GCA
1 rs561021122 0 6210427 AGG GGAAT
1 rs529037702 0 12122415 CCCATCCAT AT
1 rs571308260 0 12611561 CAAA CAAAA
1 rs553093917 0 16657917 AAAT AAATAAT
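If you would rather stay with awk, your original attempt was only missing a print action, which is why it produced no output; a sketch of the fix (note that rebuilding the record collapses the spacing between fields to single spaces, unlike the sed version above):
awk '{sub(/;.*/, "", $2)} 1' 1000G_All_chr_merged.bim > adjusted_IDlength.bim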

Nested Loop Over Two Files

I have two test files: the first one contains 3rd party names, the second file contains message statuses like sent, failed, technical errors, etc.
I want to search a log file for each 3rd party name (from the first file) and get a count of each message status (listed in the second file).
example of 1st file.txt (3rd party names)
BNF_IPL
one97
pajwok
RadioAzadi
SPICDIGITAL
U2OPIA
UNIFUN
UNIFUNRS
vectracom
VNTAF
YRMP
INFOTT
second file.txt (message status):
success
partial
failed
Error absentSubscriber
UnknownSubscriber
smDeliveryFailure
userSpecificReason
CallBarred
systemFailure
My goal is to produce a report containing the status totals for each 3rd party, something like:
sent | failed | TechError | Absent | subscriber
IBM someValue someValue someValue someValue someValue
Microsoft someValue someValue someValue someValue someValue
Oracle someValue someValue someValue someValue someValue
google someValue someValue someValue someValue someValue
To get the values I will grep those names and statuses in a log file and get the totals. For that I am trying to use a nested loop, but with no luck. Something like:
for ((i = 0; i < wc -l 3rdPList.txt ; i++)); do
    for ((j = i; j < wc -l status.txt ; j++)); do
        grep 3rdPList.txt logFile | grep status.txt | wc -l > outputFile.txt
        echo $st[j]
    done
done
example of the log file:
2018-10-30 00:07:19,640 DEBUG [org.mobicents.smsc.library.CdrGenerator] 2018-10-29 14:42:45,789 +0430,588,5,0,93706315646,1,1,temp_failed,BNF_IPL,26674477,0702700006,412012004908984,null,ایید.,Error absentSubscriber after MtForwardSM Request: MAPErrorMessageAbsentSubscriber []
2018-10-30 00:07:41,034 DEBUG [org.mobicents.smsc.library.CdrGenerator] 2018-10-29 16:21:27,260 +0430,588,5,0,0700375593,1,1,temp_failed,BNF_IPL,27008401,null,null,null,عدد1 را به588 ارسال ,AbsentSubscriber response from HLR: MAPErrorMessageAbsentSubscriber []
This does pretty much what you ask, but I didn't work too much on pretty formatting!
{ sed 's/^/1,/' 1.txt; sed 's/^/2,/' 2.txt; cat log.txt; } | awk -F, '$1==1{c=substr($0,3);cc[c]++;next} $1==2{s=substr($0,3); ss[s]++;next} {s=$10;c=$11;res[c SEP s]++} END{for(s in ss){printf("%s ",s)};printf("\n");for(c in cc){printf("%s ",c);for(s in ss){printf("%d ",res[c SEP s]+0)}printf("\n")}}'
Sample Output
systemFailure temp_failed CallBarred userSpecificReason smDeliveryFailure UnknownSubscriber Error absentSubscriber partial success
pajwok 0 0 0 0 0 0 0 0 0
SPICDIGITAL 0 0 0 0 0 0 0 0 0
YRMP 0 0 0 0 0 0 0 0 0
UNIFUN 0 0 3 0 0 0 0 0 0
U2OPIA 0 0 0 0 0 0 0 0 0
UNIFUNRS 0 0 0 0 0 0 0 0 0
RadioAzadi 0 0 0 0 0 0 0 0 0
one97 0 0 0 0 0 0 0 0 0
BNF_IPL 0 2 0 0 0 0 0 0 0
VNTAF 0 0 0 0 0 0 0 0 0
INFOTT 0 0 0 0 0 0 0 0 0
vectracom 0 0 0 0 0 0 0 0 0
If you want to understand it, try running the parts separately. So, for the first part, I prefix all the company names with a 1 so that awk can differentiate them from status codes and log lines:
sed 's/^/1,/' 1.txt
Output
1,BNF_IPL
1,one97
1,pajwok
1,RadioAzadi
1,SPICDIGITAL
1,U2OPIA
1,UNIFUN
1,UNIFUNRS
1,vectracom
1,VNTAF
1,YRMP
1,INFOTT
Then, I prefix all the status messages with a 2 so that awk can differentiate those from company names and log lines:
sed 's/^/2,/' 2.txt
Output
2,success
2,partial
2,temp_failed
2,Error absentSubscriber
2,UnknownSubscriber
2,smDeliveryFailure
2,userSpecificReason
2,CallBarred
2,systemFailure
Then I cat the log file into awk:
cat log.txt
The awk can be written across multiple lines and commented:
{ sed ...; sed ...; cat ...; } | awk -F, '
$1==1 {c=substr($0,3); cc[c]++; next} # Process company name in "1.txt", "c" holds name, "cc[]" is an array of names
$1==2 {s=substr($0,3); ss[s]++; next} # Process status code in "2.txt", "s" holds status, "ss[]" is an array of statuses
{s=$10; c=$11; res[c SEP s]++} # Process line from log, status is field 10, company is field 11. Increment results array "res[]"
END {
    # Print line of status codes
    for(s in ss){printf("%s ",s)};
    printf("\n");
    for(c in cc){
        printf("%s ",c);
        for(s in ss){printf("%d ",res[c SEP s]+0)}
        printf("\n")
    }
}'
SEP is just a separator to fake 2-D arrays.
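Incidentally, awk has a built-in SUBSEP variable for exactly this: res[c,s] is shorthand for res[c SUBSEP s], where SUBSEP defaults to the control character "\034". A minimal illustration:
awk 'BEGIN { c="BNF_IPL"; s="temp_failed"; res[c,s]++; print res[c,s] }'
Using an unset variable like SEP also works, but the effective subscript is then the bare concatenation of the two keys, which could in principle collide (e.g. "AB" and "C" give the same subscript as "A" and "BC").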

Counting number of rows depending on more than 1 column condition

I have a data file like this
H1 H2 H3 E1 E2 E3 C1 C2 C3
0 0 0 0 0 0 0 0 1
1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 1 0 1
Now I want to count the rows where H1,H2,H3 has the same pattern as E1,E2,E3. For example, I want to count the number of times H1,H2,H3 and E1,E2,E3 are both 010 or 000.
I tried to use this code but it doesn't really work:
awk -F "" '!($1==0 && $2==1 && $3==0 && $4==0 && $5==1 && $6==0)' file | wc -l
Something like
>>> awk '$1$2$3 == $4$5$6' input | wc -l
2
What it does?
$1$2$3 == $4$5$6 checks whether the string formed by columns 1, 2 and 3 is equal to the string formed by columns 4, 5 and 6. When it is true, awk takes the default action of printing the entire line, and wc takes care of counting those lines.
Or, if you want complete awk solution, you can write
>>> awk '$1$2$3 == $4$5$6{count++} END{print count}' input
2
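One caveat worth noting (not in the original answer): comparing the concatenated strings can in principle match rows whose fields differ individually but concatenate to the same string (e.g. 1,10,0 versus 11,0,0). With single-character 0/1 flags that cannot happen, but a field-by-field comparison avoids the question entirely:
>>> awk '$1==$4 && $2==$5 && $3==$6 {count++} END {print count+0}' input
2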

How to find common rows in multiple files using awk

I have tab-delimited text files in which the common rows between them are to be found, using columns 1 and 2 as the key columns.
Sample files:
file1.txt
aba 0 0
aba 0 0 1
abc 0 1
abd 1 1
xxx 0 0
file2.txt
xyz 0 0
aba 0 0 0 0
aba 0 0 0 1
xxx 0 0
abc 1 1
file3.txt
xyx 0 0
aba 0 0
aba 0 1 0
xxx 0 0 0 1
abc 1 1
The code below does this and returns the rows only if the key columns are found in all N files (3 files in this case).
awk '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
num_files = ARGC -1
for ( key in arr ) {
if ( arr[key] < num_files ) { continue }
split( line[ key ], line_arr, SUBSEP )
for ( i = 1; i <= length( line_arr ); i++ ) {
printf "%s\n", line_arr[ i ]
}
}
}
' *.txt > commoninall.txt
Output:
xxx 0 0
aba 0 0
aba 0 0 1
However, now I would like to get the output if the key columns are found in 'x' of the files.
For example, x=2, i.e. rows which are common to two files based on key columns 1 and 2. The output in this case would be:
xyz 0 0
abc 1 1
In the real scenario I have to specify different values for x. Can anybody suggest an edit to this, or a new solution?
First attempt
I think you just need to modify the END block a little, and the command invocation:
awk -v num_files=${x:-0} '
…
…script as before…
…
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
'
Basically, this takes a command line parameter based on $x, defaulting to 0, and assigning it to the awk variable num_files. In the END block, the code checks for num_files being zero, and resets it to the number of files passed on the command line. (Interestingly, the value in ARGC discounts any -v var=value options and either a command line script or -f script.awk, so the ARGC-1 term remains correct. The array ARGV contains awk (or whatever name you invoked it with) in ARGV[0] and the files to be processed in ARGV[1] through ARGV[ARGC-1].) The loop then checks for the required number of matches and prints as before. You can change == to >= if you want the 'or more' option.
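A quick way to see this for yourself (a throwaway example, not part of the solution):
$ awk -v x=1 'BEGIN { for (i = 0; i < ARGC; i++) print i, ARGV[i] }' file1.txt file2.txt
0 awk
1 file1.txt
2 file2.txt
The -v option and the program text are not counted, which is why the ARGC - 1 term gives the number of data files.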
Does it work?
I observed in a comment:
I'm not clear what you are asking. I took it that your code was working for the example with three files and producing the right answer. I simply suggested how to modify the working code to handle N files and at least M of them sharing an entry. I have just realized, while typing this, that there is a bit more work to do. An entry could be missing from the first file but present in the others, and would therefore need to be processed. It is easy to report all occurrences in every file, or the first occurrence in any file. It is harder to report all occurrences only in the first file with a key.
The response was:
It is perfectly fine to report first occurrence in any file and need not be only from the first file. However, the issue with the suggested modification is, it is producing the same output for different values of x.
That's curious: I was able to get sane output from the amended code with different values for the number of files where the key must appear. I used this shell script. The code in the awk program up to the END block is the same as in the question; the only change is in the END processing block.
#!/bin/bash
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1;;
esac
done
shift $(($OPTIND - 1))
awk -v num_files=${num_files:-$#} '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] (line[$1,$2] ? SUBSEP : "") $0
next
}
FNR == 1 { delete found }
{ if (arr[$1,$2] && ! found[$1,$2]) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
' "$#"
Sample runs (data files from question):
$ bash common.sh file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 3 file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 2 file?.txt
$ bash common.sh -n 1 file?.txt
abc 0 1
abd 1 1
$
That shows different answers depending on the value specified via -n. Note that this only shows lines that appear in the first file and appear in exactly N files in total. The only key that appears in two files (abc/1) does not appear in the first file, so it is not listed by this code which stops paying attention to new keys after the first file is processed.
Rewrite
However, here's a rewrite, using some of the same ideas, but working more thoroughly.
#!/bin/bash
# SO 30428099
# Given that the key for a line is the first two columns, this script
# lists all appearances in all files of a given key if that key appears
# in N different files (where N defaults to the number of files). For
# the benefit of debugging, it includes the file name and line number
# with each line.
usage()
{
echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1
}
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) usage;;
esac
done
shift $(($OPTIND - 1))
if [ "$#" = 0 ]
then usage
fi
# Record count of each key, regardless of file: keys
# Record count of each key in each file: key_file
# Count of different files containing each key: files
# Accumulate line number, filename, line for each key: lines
awk -v num_files=${num_files:-$#} '
{
keys[$1,$2]++;
if (++key_file[$1,$2,FILENAME] == 1)
files[$1,$2]++
#printf "%s:%d: Key (%s,%s); keys = %d; key_file = %d; files = %d\n",
# FILENAME, FNR, $1, $2, keys[$1,$2], key_file[$1,$2,FILENAME], files[$1,$2];
sep = lines[$1,$2] ? RS : ""
#printf "B: [[\n%s\n]]\n", lines[$1,$2]
lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0
#printf "A: [[\n%s\n]]\n", lines[$1,$2]
}
END {
#print "END"
for (key in files)
{
#print "Key =", key, "; files =", files[key]
if (files[key] == num_files)
{
#printf "TAG\n%s\nEND\n", lines[key]
print lines[key]
}
}
}
' "$#"
Sample output (given the data files from the question):
$ bash common.sh file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 2 file?.txt
file2.txt 5 abc 1 1
file3.txt 5 abc 1 1
$ bash common.sh -n 1 file?.txt
file1.txt 3 abc 0 1
file3.txt 1 xyx 0 0
file1.txt 4 abd 1 1
file2.txt 1 xyz 0 0
$ bash common.sh -n 3 file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 4 file?.txt
$
You can fettle this to give the output you want (probably missing file name and line number). If you only want the lines from the first file containing a given key, you only add the information to lines when files[$1,$2] == 1. You can separate the recorded information with SUBSEP instead of RS and OFS if you prefer.
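As a sketch of that first suggestion (record occurrences only from the first file in which each key appears), the accumulation in the main block could be guarded like this:
{
    keys[$1,$2]++
    if (++key_file[$1,$2,FILENAME] == 1)
        files[$1,$2]++
    if (files[$1,$2] == 1) {   # still in the first file that contains this key
        sep = lines[$1,$2] ? RS : ""
        lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0
    }
}
The END block is unchanged; only occurrences from each key's first file are accumulated into lines[].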
Can't you simply use uniq to search for repeated lines in your files?
Something like:
sort file1.txt file2.txt file3.txt | uniq -d
(uniq only reports adjacent duplicates, so the files are sorted together first.)
For your complete scenario, you could use uniq -c to get the number of repetitions of each line and filter that with grep.
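For instance, a sketch of that idea (note this still compares whole lines rather than just the key columns, so lines that share a key but differ elsewhere are counted separately):
sort file1.txt file2.txt file3.txt | uniq -c | grep '^ *2 '
Here uniq -c prefixes each distinct line with its count, and the grep keeps only lines that occurred exactly twice.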

read line by line awk and if

I have a file called contenido.txt.
The file contains the following table:
Nombre column1 column2 valor3
Marcos 1 0 0
Jose 1 0 0
Andres 0 0 0
Oscar 1 0 0
Pablo 0 0 0
I need a final file, or a printout, of the lines that have 0 in column2.
Could you help me please?
cat contenido.txt | while read LINE; do
    var=$(cat $LINE | awk '{print $2}')
    if ["$var" == 0]
    then
        echo $LINE | awk '{print $1}'
    fi
done
After reading your code, the column 2 you mean is actually the 2nd column (the one with the header "column1"), not the column with the header "column2". So this line will help you:
awk 'NR==1{print;next}$2==0' file
test with your data
kent$ echo "Nombre column1 column2 valor3
Marcos 1 0 0
Jose 1 0 0
Andres 0 0 0
Oscar 1 0 0
Pablo 0 0 0"|awk 'NR==1{print;next}$2==0'
Nombre column1 column2 valor3
Andres 0 0 0
Pablo 0 0 0
And the 2nd part of your code seems to extract the first column (the names?). You can do this in one shot with awk (ignoring the header):
kent$ echo "Nombre column1 column2 valor3
Marcos 1 0 0
Jose 1 0 0
Andres 0 0 0
Oscar 1 0 0
Pablo 0 0 0"|awk '$2==0{print $1}'
Andres
Pablo
column2 is $3 in awk. So:
$ awk '$3 == 0' < in.txt
Marcos 1 0 0
Jose 1 0 0
Andres 0 0 0
Oscar 1 0 0
Pablo 0 0 0
{print $0} is the implicit action.
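If you also need the result in a final file, as the question mentions, redirect the output; a sketch using a hypothetical output name:
$ awk '$3 == 0' contenido.txt > resultado.txt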
