awk command to split filename based on substring - linux

I have a directory in which the file names look like this:
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
I would like to divide them into 4 variables as below:
V1=Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
V2=Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
V3=Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
V4=Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
If the number of files increases, the extra files should go into any of the above variables. I'm looking for an awk one-liner to achieve this.

I would do it using GNU AWK in the following way. Let the content of file.txt be
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
then
awk '{arr[NR%4]=arr[NR%4] "," $0}END{print substr(arr[1],2);print substr(arr[2],2);print substr(arr[3],2);print substr(arr[0],2)}' file.txt
output
Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
Explanation: I store lines in the array arr and decide where each line goes based on the line number (NR) modulo (%) four (4). Each line ($0) is appended to what is currently stored (an empty string if nothing so far), preceded by a comma. This leaves a leading comma, which I remove with the substr function by starting at the 2nd character.
(tested in GNU Awk 5.0.1)

Related

Replacing a string in the beginning of some rows in two columns with another string in linux

I have a tab-separated text file. Columns 1 and 2 hold family and individual IDs that start with letters followed by a number, as follows:
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
NA1008 NA1008
NA1009 NA1009
I would like to replace NA with HG in both columns. I am very new to Linux and tried the following code and some others:
awk '{sub("NA","HG",$2)';print}' input file > output file
Any help is highly appreciated.
Converting my comment to an answer: use gsub instead of sub here, because it substitutes every occurrence of NA with HG, not just the first one.
awk 'BEGIN{FS=OFS="\t"} {gsub("NA","HG");print}' inputfile > outputfile
OR use following in case you have several fields and you want to perform substitution only in 1st and 2nd fields.
awk 'BEGIN{FS=OFS="\t"} {sub("NA","HG",$1);sub("NA","HG",$2);print}' inputfile > outputfile
Change sub to gsub in the 2nd command if multiple occurrences of NA need to be changed within a single field.
The $2 in your call to sub only replaces the first occurrence of NA in the second field.
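A small sketch of the difference between sub and gsub, run on a made-up string that contains NA twice (the function name sub_vs_gsub and the input NA1NA2 are illustrative only):

```shell
# Hypothetical demo: apply sub and gsub to copies of the same input line.
sub_vs_gsub() {
  awk '{
    s = $0; sub("NA", "HG", s)         # sub replaces only the first match
    g = $0; n = gsub("NA", "HG", g)    # gsub replaces every match and returns the count
    print s, g, n
  }'
}

echo 'NA1NA2' | sub_vs_gsub    # prints: HG1NA2 HG1HG2 2
```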
Note that while sed is more typical for such scenarios:
sed 's/NA/HG/g' inputfile > outputfile
you can still use awk:
awk '{gsub("NA","HG")}1' inputfile > outputfile
Since no target variable is passed to gsub (which replaces every match, not just the first), the default $0 is used, i.e. the whole current record. The code above is therefore equivalent to awk '{gsub("NA","HG",$0)}1' inputfile > outputfile.
The 1 at the end is an always-true pattern with no action, so it triggers the default action of printing the current record; it is a shorter variant of print.
Note that /^NA/ anchors the match to the beginning of the field:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
HG1008 HG1008
HG1009 HG1009
and save it:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
If you have a tab as separator:
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
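If only the two ID columns should change, the anchored substitution can be limited to fields 1 and 2. The sketch below assumes tab-separated input; the function name fix_ids and the third column BANANA are invented to show that other columns, and NA in the middle of a value, are left alone.

```shell
# Hypothetical helper: replace a leading NA with HG in the first two tab-separated fields only.
fix_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
       { for (i = 1; i <= 2 && i <= NF; i++) sub(/^NA/, "HG", $i) }
       1' "$@"
}

# The third column contains NA but is not touched.
printf 'NA1008\tNA1008\tBANANA\n' | fix_ids
```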

How do I concatenate each line of 2 variables in bash?

I have 2 variables, NUMS and TITLES.
NUMS contains the string
1
2
3
TITLES contains the string
A
B
C
How do I get output that looks like:
1 A
2 B
3 C
paste -d' ' <(echo "$NUMS") <(echo "$TITLES")
Having multi-line strings in variables suggests that you are probably doing something wrong. But you can try
paste -d ' ' <(echo "$nums") - <<<"$titles"
The basic syntax of paste is to read two or more file names; you can use a process substitution, <(...), in place of a file anywhere, and you can use a here string or other redirection to supply one of the "files" on standard input (where the file name is then conventionally replaced with the pseudo-file -).
The default column separator from paste is a tab; you can replace it with a space or some other character with the -d option.
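For shells without process substitution or here strings, the same paste call can be sketched with a temporary file plus standard input. The function name join_lines and its optional third argument are assumptions made for this example.

```shell
# Hypothetical helper: join two multi-line strings line by line with paste.
# $1 and $2 are the strings; $3 is an optional delimiter (a space if omitted).
join_lines() {
  tmp=$(mktemp) || return 1
  printf '%s\n' "$2" > "$tmp"                        # second "file" on disk
  printf '%s\n' "$1" | paste -d "${3:- }" - "$tmp"   # first "file" via stdin (-)
  rm -f "$tmp"
}

nums='1
2
3'
titles='A
B
C'
join_lines "$nums" "$titles"
```

Passing a third argument such as ',' switches the separator, which is the role the -d option plays in the commands above.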
You should avoid upper case for your private variables; see also Correct Bash and shell script variable capitalization
Bash variables can contain even very long strings, but this is often clumsy and inefficient compared to reading straight from a file or pipeline.
Convert them to arrays, like this:
NUMS=($NUMS)
TITLES=($TITLES)
Then loop over the indexes of either array, let's say NUMS, like this:
for i in "${!NUMS[@]}"; do
# and echo desired output
echo "${NUMS[$i]} ${TITLES[$i]}"
done
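A loop can also read both strings in lockstep without converting them to arrays, which keeps lines containing spaces intact. This is a different technique from the array answer: a sketch that feeds each string through its own file descriptor with a here-document. The function name zip_vars and the descriptor numbers 3 and 4 are arbitrary choices.

```shell
# Hypothetical helper: print line N of $1 next to line N of $2.
zip_vars() {
  # fd 3 carries the first string, fd 4 the second; stop when either runs out.
  while IFS= read -r num <&3 && IFS= read -r title <&4; do
    printf '%s %s\n' "$num" "$title"
  done 3<<NUMS_EOF 4<<TITLES_EOF
$1
NUMS_EOF
$2
TITLES_EOF
}

zip_vars '1
2
3' 'A
B
C'
```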
Awk alternative:
awk 'FNR==NR { map[FNR]=$0; next } { print map[FNR] " " $0 }' <(echo "$NUMS") <(echo "$TITLES")
For the first input (where NR==FNR), store each line in an array called map, indexed by the per-file record number (FNR). Then, for the second input, print the stored entry followed by the current line, separated by a space.
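The two-pass pattern is easier to try out with real files. The sketch below assumes two small temporary files; zip_files and the file names n and t are hypothetical. Note that if the first file is empty, NR==FNR also holds for the second file, so nothing is printed.

```shell
# Hypothetical helper: pair line N of file $1 with line N of file $2.
zip_files() {
  awk 'NR == FNR { first[FNR] = $0; next }    # pass 1: remember lines of the first file
       { print first[FNR], $0 }               # pass 2: print remembered line + current line
      ' "$1" "$2"
}

dir=$(mktemp -d)
printf '1\n2\n3\n' > "$dir/n"
printf 'A\nB\nC\n' > "$dir/t"
zip_files "$dir/n" "$dir/t"
rm -rf "$dir"
```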

How to extract two part-numerical values from a line in shell script

I have multiple text files in this format. I would like to extract lines matching this pattern "pass filters and QC".
File1:
Before main variant filters, 309 founders and 0 nonfounders present.
0 variants removed due to missing genotype data (--geno).
9302015 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
7758518 variants and 309 people pass filters and QC.
Calculating allele frequencies... done.
I was able to grep the line, but when I tried to assign to line variable it just doesn't work.
grep 'people pass filters and QC' File1
line="$(echo grep 'people pass filters and QC' File1)"
I am new to shell script and would appreciate if you could help me do this.
I want to create a tab separated file with just
"File1" "7758518 variants" "309 people"
GNU awk
gawk '
BEGIN { patt = "([[:digit:]]+ variants) .* ([[:digit:]]+ people) pass filters and QC" }
match($0, patt, m) {printf "\"%s\" \"%s\" \"%s\"\n", FILENAME, m[1], m[2]}
' File1
You are almost there, just remove double quotes and echo from your command:
line=$(grep 'people pass filters and QC' File1)
Now view the value stored in the variable (quote it so that whitespace and newlines are preserved):
echo "$line"
If the file structure is always the same, i.e. the matching line always has the form 7758518 variants and 309 people pass filters and QC, you can use awk to pick the columns you need. The field numbers below assume that each matched line starts with the file name and a colon (File1:7758518 variants ...), which grep prints when it is given several files or the -H option. The complete command would be:
OIFS=$IFS;IFS=$'\n';for i in $line;do echo "$i";done | awk -F "[: ]" '{print $1"\t"$2" "$3"\t"$5" "$6}';IFS=$OIFS
Explanation:
IFS is the internal field separator. We set it to a newline so that the for loop splits $line into whole lines rather than words.
Before that, we back up its value in another variable, OIFS, so we can restore it afterwards.
The for loop iterates over all the matched lines, and awk selects the 1st, 2nd, 3rd, 5th and 6th fields as required.
But please note, if your file structure varies, we may need to use a different technique to extract "7758518 variants" and "309 people" part.
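The whole extraction can also be sketched as a single awk call that does the matching itself and takes the file name from awk's built-in FILENAME variable. It assumes the matching line always begins with "<count> variants and <count> people"; the function name qc_summary is made up for this example.

```shell
# Hypothetical helper: print "file<TAB>N variants<TAB>M people" for each matching line.
qc_summary() {
  awk 'BEGIN { OFS = "\t" }
       /people pass filters and QC/ { print FILENAME, $1 " " $2, $4 " " $5 }' "$@"
}

# Try it on a sample file that reuses two lines from the question.
dir=$(mktemp -d)
printf '%s\n' '0 variants removed due to missing genotype data (--geno).' \
              '7758518 variants and 309 people pass filters and QC.' > "$dir/File1"
(cd "$dir" && qc_summary File1)
rm -rf "$dir"
```

Several files can be passed at once, for example qc_summary File1 File2 > summary.tsv, and each matching line is labelled with the file it came from.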

sed - Delete lines only if they contain multiple instances of a string

I have a text file that contains numerous lines that have partially duplicated strings. I would like to remove lines where a string match occurs twice, such that I am left only with lines with a single match (or no match at all).
An example output:
g1: sample1_out|g2039.t1.faa sample1_out|g334.t1.faa sample1_out|g5678.t1.faa sample2_out|g361.t1.faa sample3_out|g1380.t1.faa sample4_out|g597.t1.faa
g2: sample1_out|g2134.t1.faa sample2_out|g1940.t1.faa sample2_out|g45.t1.faa sample4_out|g1246.t1.faa sample3_out|g2594.t1.faa
g3: sample1_out|g2198.t1.faa sample5_out|g1035.t1.faa sample3_out|g1504.t1.faa sample5_out|g441.t1.faa
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
In this case I would like to remove lines 1, 2, and 3 because sample1 is repeated multiple times on line 1, sample 2 is twice on line 2, and sample 5 is repeated twice on line 3. Line 4 would pass because it contains only one instance of each sample.
I am okay repeating this operation multiple times using different 'match' strings (e.g. sample1_out , sample2_out etc in the example above).
Here is one in GNU awk:
$ awk -F"[| ]" '{         # pipe or space is the field separator
    delete a              # delete previous hash
    for(i=2;i<=NF;i+=2)   # iterate every other field, ie right side of space
        if($i in a)       # if it has been seen already
            next          # skip this record
        else              # well, else
            a[$i]         # hash this entry
    print                 # output if you make it this far
}' file
Output:
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
The following sed command will accomplish what you want.
sed -ne '/.* \(.*\)|.*\1.*/!p' file.txt
Using grep:
grep -vE '(sample[0-9]).*\1' file
Inspired by Glenn's answer, the same can be done with sed; add -i to edit the file in place.
sed -r '/(sample[0-9]).*\1/d' txt_file
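The pattern sample[0-9] captures only one digit, so sample1 and sample10 would be treated as the same sample. A stricter sketch, assuming every entry has the form " sampleN_out|..." as in the example data, matches the full name up to the pipe. The function name drop_repeated_samples is hypothetical.

```shell
# Hypothetical helper: delete lines in which the same sampleN_out name appears twice.
drop_repeated_samples() {
  # " sampleN_out|" ... " sampleN_out|" : \1 must repeat the whole captured name
  sed '/ \(sample[0-9][0-9]*_out\)|.* \1|/d' "$@"
}

sample_input='g1: sample1_out|a.faa sample1_out|b.faa
g2: sample1_out|a.faa sample10_out|b.faa
g4: sample1_out|a.faa sample2_out|b.faa'
printf '%s\n' "$sample_input" | drop_repeated_samples
```

Only the g1 line is dropped here; the g2 line survives because sample1_out and sample10_out are different names.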

Convert data into desired form using linux

I have data in a tab separated file in the following form (filename.tsv):
#a 0 Espert A trius
#b 9 def J
I want to convert the data into the following form (I am introducing <abc> in every second line):
##<a>
<0 Espert> <abc> <A trius>.
##<b>
<9 def> <abc> <J>.
I am introducing <abc> in every line. I know how to do this in Python with the csv module, but I am trying to learn Linux commands. Is there a way to do the same in a Linux terminal using commands like grep?
awk seems like the right tool for the job:
awk '{
  printf "##<%s>\n<%s %s> <abc> <%s%s%s>.\n",
    substr($1,2),
    $2,
    $3,
    $4,
    (length($5) ? " " : ""),
    $5
}' filename.tsv
awk loops over all lines in the input file and breaks each line into fields by runs of tabs and/or spaces; $1 refers to the first field, $2, to the second, ...
printf functions the same as in C: a format (template) string containing placeholders is followed by corresponding arguments to substitute for the placeholders.
substr($1,2) returns the substring of the 1st field starting at the 2nd character (i.e., a for the 1st line, b for the 2nd) - note that indices in awk are 1-based.
(length($5) ? " " : "") is a C-style ternary expression that returns a single space if the 5th field is nonempty, and an empty string otherwise.
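If the file really is tab-separated into three columns (for example "#a", "0 Espert" and "A trius"), splitting on tabs avoids counting words altogether. That column layout is an assumption, since the sample in the question does not show where the tabs are; the function name to_triples is also made up.

```shell
# Hypothetical helper: assumes exactly three tab-separated columns per line.
to_triples() {
  awk -F '\t' '{
    # substr($1, 2) drops the leading "#" from the first column
    printf "##<%s>\n<%s> <abc> <%s>.\n", substr($1, 2), $2, $3
  }' "$@"
}

printf '#a\t0 Espert\tA trius\n#b\t9 def\tJ\n' | to_triples
```

With whole columns as fields, the ternary that conditionally inserts a space is no longer needed.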
