Append multiple lines from one file to another - excel

Hello guys,
I have two files with the same number of lines (exactly 503), and I want to append each line of one file to the corresponding line of the other, directly after it with no space or tab in between. Suppose the input files contain:
One.txt
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
Two.txt
one
val_ilu_girl
pacmanhall
four_stars
squares3
Now I want like this:
C:\Users\Desktop.VG\New-folder\?filename=one
C:\Users\Desktop.VG\New-folder\?filename=val_ilu_girl
C:\Users\Desktop.VG\New-folder\?filename=pacmanhall
C:\Users\Desktop.VG\New-folder\?filename=four_stars
C:\Users\Desktop.VG\New-folder\?filename=squares3
Is there a way to do this using anything from sed, grep, or awk to Excel, Notepad++, etc.?
Thanks in advance...!

Could you please try the following:
awk 'FNR==NR{a[FNR]=$0;next} {print $0 a[FNR]}' two.txt one.txt
Explanation of the above code:
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file is being read.
a[FNR]=$0 ##Creating an array named a whose index is FNR and value is current line.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
{ ##These statements will be executed when 2nd Input_file is being read.
print $0 a[FNR] ##Printing current line along with array a value with index of FNR.
}
' two.txt one.txt ##Mentioning Input_file names here.

That's the job paste was invented to do:
$ paste -d '' one.txt two.txt
C:\Users\Desktop.VG\New-folder\?filename=one
C:\Users\Desktop.VG\New-folder\?filename=val_ilu_girl
C:\Users\Desktop.VG\New-folder\?filename=pacmanhall
C:\Users\Desktop.VG\New-folder\?filename=four_stars
C:\Users\Desktop.VG\New-folder\?filename=squares3
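Since the question relies on both files having the same number of lines, a small guard before joining may be worthwhile. The following is only a sketch: the two-line sample files stand in for the 503-line originals, and '\0' is the POSIX spelling of the empty delimiter (GNU paste also accepts -d '' as shown above).

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Two-line stand-ins for the 503-line files from the question.
printf '%s\n' 'C:\Users\Desktop.VG\New-folder\?filename=' \
              'C:\Users\Desktop.VG\New-folder\?filename=' > one.txt
printf '%s\n' 'one' 'val_ilu_girl' > two.txt

# Join only when the line counts agree; otherwise report the mismatch.
if [ "$(wc -l < one.txt)" -eq "$(wc -l < two.txt)" ]; then
    # '\0' asks paste for an empty delimiter (no tab, no space).
    result=$(paste -d '\0' one.txt two.txt)
    printf '%s\n' "$result"
else
    echo "line counts differ" >&2
fi
```

If the counts differ, paste would otherwise pad the shorter file with empty lines, which is easy to miss in a long file.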

Related

Match lines based on patterns and reformat file Bash/ Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
2nd solution: With a single read of Input_file, try the following. Note that for(i in arr) returns the keys in an unspecified order.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation(1st solution): Adding detailed explanation for 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating array with index of 2nd field and keep adding its value with comma here.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if 2nd field is present in arr then do following.
print $2"("arr[$2]")" ##Printing 2nd field ( arr[$2] ) here.
delete arr[$2] ##Deleteing arr value with 2nd field index here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value (if it isn't, just run sort -k2,2 on it first; in the unsorted sample below, the interleaved CC_LlanR and EN_DavaW lines come out as separate groups), this uses 1 pass, stores only one token at a time in memory, and produces the output in the same order of $2s as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
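For input that is not grouped, the sort step mentioned above could be put in front of the same awk script. A sketch under some assumptions: the three-line sample is made up, -s (stable sort, so the $1 values keep their input order within a group) is supported by GNU and BSD sort but is not required by POSIX, and LC_ALL=C is set only to make the key order predictable.

```shell
# Work in a scratch directory with a made-up, ungrouped sample.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 'a1 X' 'b1 Y' 'a2 X' > input.txt

# Group the lines by the 2nd field first, then run the one-pass script.
result=$(LC_ALL=C sort -s -k2,2 input.txt | awk '
BEGIN { ORS = "" }
# A new key closes the previous group (if any) and opens a new one.
$2 != prev { printf "%s%s(", ORS, $2; ORS = ")\n"; sep = ""; prev = $2 }
# Every line contributes its 1st field to the current group.
{ printf "%s%s", sep, $1; sep = "," }
# Close the final group.
END { print "" }
')
printf '%s\n' "$result"
```

The groups then come out in sorted key order rather than in order of first appearance.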
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.

In file B find patterns from file A and replace with patterns from file C, line by line

I have a file of patterns (fileA.txt) which need to be searched in a large file (fileB.txt) and they need to be replaced with patterns in another file (fileC.txt)
Example:
fileB.txt
4472534
8BC4232
3533221
333553D
8645141
2412AAA
I want to search for these patterns in fileB:
fileA.txt
BC423
33221
12AAA
Then I want to replace them with patterns in fileC, line by line:
fileC.txt
66FF7
11GYT
2HHJK
Expected output:
4472534
866FF72
3511GYT
333553D
8645141
242HHJK
I wrote something like this:
grep -f fileA.txt fileB.txt | xargs sed -i fileC.txt
However, while it finds the patterns correctly, the substitution is probably not correct.
Any advice?
fileA (pattern to search)
CAAGATTTTCTTTGCCGAGACTCAGTGGGG
fileB
>AMP_4 RS0255 CENPF__ENST00000366955.7__6322__30__0.43333__69.25__1 RS0247
CAGTTGTGCAATTTGGTTTTCCAGCTCACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__10108__30__0.5__71.1396__1 RS0247
GAAGCCTGCAGCCCTCACTGGAAATAAACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__9236__30__0.5__69.816__1 RS0332
CAAGATTTTCTTTGCCGAGACTCAGTGGGG
>AMP_4 RS0451 CENPF__ENST00000366955.7__8140__30__0.43333__68.033__1RS0255
GAGCTCCTTCAATTGATCTTTGCTGCTCTT
fileC (pattern to replace)
GGAGGATGGTGCCTGAATCTACTGGGCTCC
This should be a task for awk. Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==NR{
arr[$0]=FNR
next
}
FILENAME=="fileC.txt"{
arrVal[++count]=$0
next
}
FILENAME=="fileB.txt"{
for(key in arr){
if(sub(key,arrVal[arr[key]])){
break
}
}
print
}
' fileA.txt fileC.txt fileB.txt
Output will be as follows.
4472534
866FF72
3511GYT
333553D
8645141
242HHJK
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when fileA.txt is being read.
arr[$0]=FNR ##Creating arr with index of current line and value of current line number.
next ##next will skip all further statements from here.
}
FILENAME=="fileC.txt"{ ##Checking condition if file name is fileC.txt then do following.
arrVal[++count]=$0 ##Creating arrVal with index of count increasing value of 1 and having current line as its value.
next ##next will skip all further statements from here.
}
FILENAME=="fileB.txt"{ ##Checking condition if file name is fileB.txt then do
for(key in arr){ ##Traversing through array arr here.
if(sub(key,arrVal[arr[key]])){ ##Checking condition if substitution of arrVal[arr[key]] is successfully done with key in current line, which basically changes the values in fileB values.
break ##Come out of loop to save some cycles.
}
}
print ##Printing current line here.
}
' fileA.txt fileC.txt fileB.txt ##Mentioning Input_file names here.
NOTE: We could also use ARGC conditions check in place of file name checks too in above.
paste fileA fileC \
|awk 'NR==FNR{ mapping[$1] =$2; next }
{ for(pat in mapping){
gsub(pat, mapping[pat])
};
print
}' - fileB
You could use sed to generate a sed script that would replace them:
sed "$(paste fileA.txt fileC.txt | sed 's/\(.*\)\t\(.*\)/s#\1#\2#g/')" fileB.txt
Here is a one liner with paste + awk + sed:
sed -f <(awk '{printf "s/%s/%s/g\n",$1,$2}' <(paste file{A,C}.txt)) fileB.txt
4472534
866FF72
3511GYT
333553D
8645141
242HHJK
This might work for you (GNU sed & parallel):
parallel echo 's/{1}/{2}/' ::::+ file[AC] | sed -f - fileB
Build a sed script and then run the script with fileB as input.
N.B. ::::+ emulates the paste command and {1} and {2} the values of each line from fileA and fileC.
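The answers above all treat each line of fileA as a regular expression (and the sed-based ones treat each fileC line as sed replacement text), so patterns containing characters such as . or * could match more than intended. If the patterns should be taken literally, something along these lines might work. It is a sketch only, using index() and substr() instead of sub(); the array names pat and rep are introduced here for illustration.

```shell
# Work in a scratch directory with the sample files from the question.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 4472534 8BC4232 3533221 333553D 8645141 2412AAA > fileB.txt
printf '%s\n' BC423 33221 12AAA > fileA.txt
printf '%s\n' 66FF7 11GYT 2HHJK > fileC.txt

result=$(awk '
# 1st file (fileA): store each search string under its line number.
FNR == NR { pat[FNR] = $0; n = FNR; next }
# 2nd file (fileC): store each replacement under the same line number.
FILENAME == ARGV[2] { rep[FNR] = $0; next }
# 3rd file (fileB): replace the first literal occurrence of the first string that is found.
{
    for (i = 1; i <= n; i++) {
        p = index($0, pat[i])
        if (p) {
            $0 = substr($0, 1, p - 1) rep[i] substr($0, p + length(pat[i]))
            break
        }
    }
    print
}
' fileA.txt fileC.txt fileB.txt)
printf '%s\n' "$result"
```

Like the first awk answer, this stops at the first pattern that matches a line, and it assumes fileA is not empty.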

How to edit output rows from awk with defined position?

Is there a way to solve this?
I have a bash script which creates .dat and .log files from source files.
I'm using awk with print and the positions of the fields I need to print. The problem is with the last position - ID2 (lower). It should be just \*[0-9]{3}\*#, but in some cases there is a string before it: [0-9]{12}\[00]\>.
Then row looks for example like this:
2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
What I need is to remove that leading string, so that the row in the file becomes:
2020-01-11 01:01:01;test;test123;*123*#
File structure:
YYYY-DD-MM HH:MM:SS;string;ID1;ID2
I will be happy for any advice, thanks.
awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1' file
Here we keep only the last 6 characters of the last field, with the semicolon as the field separator. If there is nothing else in front of that *ID*#, then we keep all of it.
Delete everything before the first *:
$ awk 'BEGIN{FS=OFS=";"}{sub(/^[^*]*/,"",$NF)}1' file
Output:
2020-01-11 01:01:01;test;test123;*123*#
Could you please try the following, written and tested with the shown samples in GNU awk. Note that it prints only the lines that contain the leading string.
awk '
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{ ##Using match function to match regex: 12 digits, then [, then one or more digits, then ] and >. Also checking condition that the line contains *, 3 digits, then *#.
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) ##If above condition is TRUE then printing sub-string from 1st character to RSTART-1 and then sub-string from RSTART+RLENGTH value to till last of line.
}
' Input_file ##Mentioning Input_file name here.
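If rows without the leading string should be passed through unchanged as well, the substitution could be limited to the last field and every line printed. A sketch only: the second sample row is made up for illustration, and [0-9]+ is used instead of {12} so that awks without interval-expression support can run it too.

```shell
# Work in a scratch directory; the second row is a made-up example without the prefix.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#' \
              '2020-01-11 01:01:02;test;test456;*456*#' > file

# Strip a leading "digits[digits]>" from the last field only, then print every line.
result=$(awk 'BEGIN { FS = OFS = ";" } { sub(/^[0-9]+\[[0-9]+\]>/, "", $NF) } 1' file)
printf '%s\n' "$result"
```

Anchoring the pattern with ^ keeps the substitution from touching anything other than a prefix of ID2.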

gsub in awk with variable

I want to replace the ">" with the variable name, starting with ">" and ending with ".". But the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta..
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk has knowledge of the filename it currently operates on through the built-in variable FILENAME; I strip the .fasta extension using gensub, and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" and the content of my variable stub. After that I print it.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try the following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines(except line that are starting from >).
' sample01.fasta ##Mentioning Input_file name here.
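The original loop fails because shell variables are not expanded inside the single-quoted awk program, so $nam never reaches awk. If the loop and basename are to be kept, the value could be handed over with -v. A minimal sketch, assuming the sample file from the question; plain concatenation is used rather than gsub so that an & in the file name is not treated specially.

```shell
# Work in a scratch directory with the sample file from the question.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '>textofDNA' 'ATCCCCGGG' '>textofDNA2' 'ATCCCCGGGTTTT' > sample01.fasta

result=$(for f in *.fasta; do
    nam=$(basename "$f" .fasta)
    # -v copies the shell variable into an awk variable of the same name.
    awk -v nam="$nam" '/^>/ { $0 = ">" nam "." substr($0, 2) } 1' "$f"
done)
printf '%s\n' "$result"
```

One caveat: -v interprets backslash escapes in the value, so a file name containing backslashes would need different handling.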

How to search a specific expression in multiple files with Awk

I have about 500 text documents. In every one of them the expression "Numero de expediente" appears at least once. I want to locate every file where it appears at least twice. Every file has its own name; I'm not sure if that's a problem (I don't know if *.txt works as it does in cmd on Windows). So I would like to know which documents contain that expression at least twice, and I don't know which command is more useful for that, grep or cat.
Thanks.
I would add another way with grep and awk. grep is responsible for matching. awk filters out the files with matched counter>=2:
grep -o -m2 'YOUR_PATTERN' *.txt |
awk -F: '{a[$1]++}END{for(x in a)if(a[x]>1)print x}'
Note:
-o prints each match on its own line, so multiple occurrences on the same line are all counted.
-m2 improves performance: grep stops reading a file after two matching lines.
The awk line builds a hash table of match counts per file and prints the file names with a match count > 1.
EDIT: As per @kent's and @tripleee's comments, this version sums multiple occurrences of the string on a single line. And in case someone's awk does NOT support nextfile, it uses a flag, no_processing, which simply skips the remaining lines of a file once it is TRUE (after seeing 2 instances of the string in that file).
awk 'FNR==1{count=0;no_processing=""} no_processing{next} {count+=gsub("Numero de expediente","")} count>=2{print FILENAME;no_processing=1}' *.txt
OR (non-one-liner form of the solution)
awk '
FNR==1{
count=0
no_processing=""
}
no_processing{
next
}
{
count+=gsub("Numero de expediente","")
}
count>=2{
print FILENAME
no_processing=1
}
' *.txt
Could you please try the following; it should work with GNU awk.
awk 'FNR==1{count=0} /Numero de expediente/{count++} count==2{print FILENAME " has at least 2 instances of searched string in it.";nextfile}' *.txt
The above will print, e.g., test.txt has at least 2 instances of searched string in it. In case you want to simply print the file names, try the following.
awk 'FNR==1{count=0} /Numero de expediente/{count++} count==2{print FILENAME;nextfile}' *.txt
Explanation: Adding explanation for the above code now.
awk ' ##Starting awk program here.
FNR==1{ ##Checking condition FNR==1 which will check if this is a 1st line for any new Input_file(since we are reading multiple Input_files from awk in this code).
count=0 ##Setting value of variable count as ZERO here.
} ##Closing BLOCK for FNR condition here.
/Numero de expediente/{ ##Checking condition here if a line contains string Numero de expediente in it then do following.
count++ ##Incrementing variable named count value with 1 here.
} ##Closing BLOCK for string checking condition here.
count==2{ ##Checking condition if variable count value is 2 then do following.
print FILENAME ##Printing Input_file name here, where FILENAME is out of the box awk variable contains current Input_file name in it.
nextfile ##nextfile will skip current Input_file, since we got 2 instances so need NOT to read this Input_file as per OP requirement and SAVE some time here.
} ##Closing BLOCK for count condition here.
' *.txt ##Mentioning *.txt which will pass all .txt extension files to it.
You can try with Perl as well
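perl -lne ' $x++ for(/Numero de expediente/g); if($x>=2) { print $ARGV;close(ARGV);$x=0 } elsif(eof) { $x=0 } ' *.txt
$x starts at 0 and is incremented for every match of the pattern (Numero de expediente), even if the pattern appears twice on the same line. Once there are at least 2 matches, the file name is printed, the file handle is closed using close(ARGV), and the next file is read. The counter is also reset at the end of each file (the eof check) so that a single match in one file does not carry over into the next.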
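Since the question asks about grep specifically, a plain shell loop around grep -o is another possibility. This is a sketch under assumptions: the two sample files are made up, and -o (print each match on its own line) is available in GNU and BSD grep but is not part of POSIX.

```shell
# Work in a scratch directory with two made-up sample files.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 'Numero de expediente 1' 'texto' > once.txt
printf '%s\n' 'Numero de expediente 1, Numero de expediente 2' > twice.txt

result=$(for f in *.txt; do
    # grep -o prints one line per match, so wc -l counts occurrences, not lines.
    n=$(grep -o 'Numero de expediente' "$f" | wc -l)
    if [ "$n" -ge 2 ]; then
        printf '%s\n' "$f"
    fi
done)
printf '%s\n' "$result"
```

This starts one grep per file, so for 500 files it would likely be slower than the single awk or Perl invocations above, though still quick in practice.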

Resources