If I have 3 CSV files and I want to merge all the data into one file, but side by side, how would I do it? For example:
Initial Merged file:
,,,,,,,,,,,,
File 1:
20,09/05,5694
20,09/06,3234
20,09/08,2342
File 2:
20,09/05,2341
20,09/06,2334
20,09/09,342
File 3:
20,09/05,1231
20,09/08,3452
20,09/10,2345
20,09/11,372
Final merged File:
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
Basically data from each file goes into a specific column of the merged file.
I know awk can be used for this, but I have no clue how to start.
EDIT: Only the 2nd and 3rd columns of each file are being printed. I was using this to print out the 2nd and 3rd columns:
awk -v f="${i}" -F, 'match ($0,f) { print $2","$3 }' file3.csv > d$i.csv
However, if, say, file1 and file2 were null in that row, the data for that row would be shifted to the left, so I came up with this to account for the shift:
awk -v x="${i}" -F, 'match($0,x) { if ($2 == "NULL") print ","; else print $2","$3 }' alld.csv > d$i.csv
Using GNU awk for ARGIND:
$ gawk '{ a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
for (i=1;i<=maxFnr;i++) {
for (j=1;j<ARGC;j++)
printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
print ""
}
}
' file1 file2 file3
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
If you don't have GNU awk, just add an initial line that says FNR==1{ARGIND++}.
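With that line added, a POSIX-awk version of the same script would look something like this (a sketch; ARGIND here is just an ordinary variable we increment ourselves):
$ awk '
FNR==1 { ARGIND++ }   # emulate gawk ARGIND: bump the file counter at each new file
{ a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
    for (i=1;i<=maxFnr;i++) {
        for (j=1;j<ARGC;j++)
            printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
        print ""
    }
}
' file1 file2 file3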
Commented version per request:
$ gawk '
{ a[FNR,ARGIND]=$0; # Store the current line in a 2-D array `a` indexed by
# the current line number `FNR` and file number `ARGIND`.
maxFnr=(FNR>maxFnr?FNR:maxFnr) # save the max FNR value
}
END{
for (i=1;i<=maxFnr;i++) { # Loop from 1 to the max number of lines
# seen across all files and for each:
for (j=1;j<ARGC;j++) # Loop from 1 to total number of files parsed and:
printf "%s%s", # Print 2 strings, specifically:
(j==1?"":",,,"), # A field separator - empty if were printing
# the first field, three commas otherwise.
(a[i,j]?a[i,j]:",") # The value stored in the array if it was
# present in the files, a comma otherwise.
print "" # Print a newline
}
}
' file1 file2 file3
I was originally using an array fnr[FNR] to track the max value of FNR, but IMHO that's kind of obscure, and it has a flaw: a loop like for (i=1; i in fnr; i++) in the END section bails out at the first missing index, so if nothing populated, say, index 2, it would never get to index 3.
paste is made for exactly this:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Note that paste alone will output just one comma between files:
$ paste -d, f1 f2 f3
09/05,5694,09/05,2341,09/05,1231
09/06,3234,09/06,2334,09/08,3452
09/08,2342,09/09,342,09/10,2345
,,09/11,372
So to get multiple commas, we can use another delimiter such as ; and then replace it with ,,, using sed:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Using pr:
$ pr -mts',,,' file[1-3]
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
I have a source.txt file like the one below, containing two columns of data. The columns of source.txt are wrapped in [ ] (square brackets), as shown:
[hot] [water]
[16] [boots and, juice]
and I have another target.txt file, which contains empty lines plus full stops at the end of each line:
the weather is today (foo) but we still have (bar).
= (
the next bus leaves at (foo) pm, we can't forget to take the (bar).
I want to replace foo on each nth line of target.txt with the respective contents of the first column of source.txt, and likewise replace bar on each nth line of target.txt with the respective contents of the second column of source.txt.
I searched other sources to try to understand how to do this. I already have a command that I use to replace each nth occurrence of 'foo' with the numerically respective nth line of a supplied file, but I couldn't adapt it:
awk 'NR==FNR {a[NR]=$0; next} /foo/{gsub("foo", a[++i])} 1' source.txt target.txt > output.txt;
I remember seeing a way to use gsub with a file containing two columns of data, but I don't remember exactly what the difference was.
EDIT: the target.txt text sometimes contains symbols such as =, (, and ). I added these symbols to the sample because some answers will not work when they are present in target.txt.
Note: the number of target.txt lines, and therefore the number of occurrences of foo and bar in that file, can vary; I just showed a sample. But foo and bar each occur exactly once per line.
With your shown samples, please try the following answer, written and tested in GNU awk.
awk -F'\\[|\\] \\[|\\]' '
FNR==NR{
foo[FNR]=$2
bar[FNR]=$3
next
}
/\<foo\>/{
gsub(/\<foo\>/,foo[++count])
gsub(/\<bar\>/,bar[count])
}
1
' source.txt FS=" " target.txt
Explanation: adding a detailed explanation of the above.
awk -F'\\[|\\] \\[|\\]' ' ##Setting field separator as [ OR ] [ OR ] here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when source.txt will be read.
foo[FNR]=$2 ##Creating foo array with index of FNR and value of 2nd field here.
bar[FNR]=$3 ##Creating bar array with index of FNR and value of 3rd field here.
next ##next will skip all further statements from here.
}
/\<foo\>/{ ##If the line contains the word foo (per the question, each such line also contains one bar) then do following.
gsub(/\<foo\>/,foo[++count]) ##Globally substituting foo with array foo value, whose index is count.
gsub(/\<bar\>/,bar[count]) ##Globally substituting bar with array of bar with index of count.
}
1 ##printing line here.
' source.txt FS=" " target.txt ##Mentioning Input_files names here.
EDIT: Adding the following solution as well, which will handle any number of occurrences of [...] in the source file and match them in the target file. Since this is a working solution for the OP (confirmed in comments), I'm adding it here. Fair warning: this will fail when source.txt contains a &.
awk '
FNR==NR{
while(match($0,/\[[^]]*\]/)){
arr[++count]=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
}
next
}
{
line=$0
while(match(line,/\(?[[:space:]]*(\<foo\>|\<bar\>)[[:space:]]*\)?/)){
val=substr(line,RSTART,RLENGTH)
sub(val,arr[++count1])
line=substr(line,RSTART+RLENGTH)
}
}
1
' source.txt target.txt
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN {
FS="[][]"
tags["foo"]
tags["bar"]
}
NR==FNR {
map["foo",NR] = $2
map["bar",NR] = $4
next
}
{
found = 0
head = ""
while ( match($0,/\([^)]+\)/) ) {
tag = substr($0,RSTART+1,RLENGTH-2)
if ( tag in tags ) {
if ( !found++ ) {
lineNr++
}
val = map[tag,lineNr]
}
else {
val = substr($0,RSTART,RLENGTH)
}
head = head substr($0,1,RSTART-1) val
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}
$ awk -f tst.awk source.txt target.txt
the weather is today hot but we still have water.
= (
the next bus leaves at 16 pm, we can't forget to take the boots and, juice.
awk '
NR==FNR { # build lookup
# delete gumph
gsub(/(^[[:space:]]*\[)|(\][[:space:]]*$)/, "")
# split
split($0, a, /\][[:space:]]+\[/)
# store
foo[FNR] = a[1]
bar[FNR] = a[2]
next
}
!/[^[:space:]]/ { next } # ignore blank lines
{ # do replacements
VFNR++ # FNR - (ignored lines)
# can use sub if foo/bar only appear once
gsub(/\<foo\>/, foo[VFNR])
gsub(/\<bar\>/, bar[VFNR])
print
}
' source.txt target.txt
Note: \< and \> (word-boundary anchors) are not in POSIX but are accepted by some versions of awk (e.g. gawk); POSIX awk regular expressions have no word-boundary operator.
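If you need to stay strictly POSIX, one way to emulate a word-boundary match is to check the characters around each hit yourself. A minimal sketch (val here is a hypothetical stand-in for the replacement value looked up from source.txt):
awk -v val="hot" '
{
    out = ""
    while (match($0, /foo/)) {                        # find each literal "foo"
        pre  = (RSTART > 1) ? substr($0, RSTART-1, 1) : ""
        post = substr($0, RSTART+RLENGTH, 1)          # "" at end of line
        if (pre !~ /[[:alnum:]_]/ && post !~ /[[:alnum:]_]/)
            out = out substr($0, 1, RSTART-1) val     # boundary on both sides: replace
        else
            out = out substr($0, 1, RSTART+RLENGTH-1) # part of a longer word: keep
        $0 = substr($0, RSTART+RLENGTH)
    }
    print out $0
}' target.txt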
How to get information from specimen1 to specimen3 and paste it into another file 'DNA_combined.txt'?
I tried the cut command and the awk command, but I found that it is tricky to cut by paragraph(?) or sequence.
My trial was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt
In vim, you can display the line number for each row by pressing Esc + : and typing set nu.
Once you have the line number corresponding to each row:
Note down the line number of the line containing >Specimen1 (say X) and of the last line of the Specimen3 block (say Y).
Then use the sed command to print the text between those two lines:
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
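For example, if >Specimen1 is on line 1 and the Specimen3 block ends on line 12 (hypothetical numbers, for illustration):
sed -n '1,12p' dna1.fasta > DNA_combined.txt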
Please let me know if you have any questions.
If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence
for such sequences, increment the count variable
if count reaches 4, the exit command will terminate the script
1 is the idiomatic way to print the contents of the input record
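If you instead wanted, say, only the 2nd and 3rd sequences, a sketch along the same counter-based lines:
$ awk -v beg=2 -v end=3 '/^>/{++count} count>end{exit} count>=beg' ip.txt
>two
TGACA
>three
ACTG
AAAAC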
Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt
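If you would rather keep the individual >Specimen headers instead of printing a combined one, a sketch of the same flag idea:
awk '/^>/ { f = (match($0, "^>Specimen[1-3]") ? 1 : 0) } f' dna1.fasta > DNA_combined.txt
Here the header line sets the flag and, since there is no next, is itself printed together with its sequence lines.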
I have 2 files with settings:
file1.txt    file2.txt
A=1          A=2
B=3          B=3
C=5          C=4
D=6          .
.            E=7
I am looking for the best approach to replace the values of the file1.txt with the diff values of file2.txt, so the file1.txt would look like:
file1.txt:
A=2
B=3
C=4
D=6
E=7
Currently I haven't written any code; the only approach I can think of is to write a bash script that diffs both files (provided as positional arguments) and uses sed to replace the non-matching strings. Something in this vein:
./diffreplace.bash file1.txt file2.txt > NEWfile1.txt
I wonder whether something more elegant already exists?
All of the following solutions may change the order of assignments. I assumed that would be ok.
Lazy Solution
If you use these assignments in some way that allows overwriting, then you can simply append file2 to the end of file1. All old values will be overwritten by the new ones when result is executed.
cat old new > result
Slightly Better Solution
Extending the previous approach, you can iterate over the combined lines and, for every variable, keep only the first assignment seen (with new first, that is the newest one):
cat new old |
awk -F= '{ if (a[$1] != "x") { print $0; a[$1] = "x" } }'
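For example, with the sample values (assuming new holds file2.txt's contents and old holds file1.txt's):
$ cat new old | awk -F= '{ if (a[$1] != "x") { print $0; a[$1] = "x" } }'
A=2
B=3
C=4
E=7
D=6
The order changes, as noted above.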
Alternative Solution
Use join to combine both files, then use cut to keep only the first value for each key (the one from new). If your files are sorted, use
join -t= -a1 -a2 new old | cut -d= -f1,2
if not, use
join -t= -a1 -a2 <(sort new) <(sort old) |
cut -d= -f1,2
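With the sample data (again assuming new has file2.txt's values and old has file1.txt's), the sorted variant produces exactly the desired result:
$ join -t= -a1 -a2 <(sort new) <(sort old) | cut -d= -f1,2
A=2
B=3
C=4
D=6
E=7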
I'm a little puzzled by your comment that the structure of the file must remain untouched. sort mixes up the order, so I'm assuming that the As are always on line 1, or line 1 is ., etc.:
$ awk '
BEGIN { RS="\r?\n" } # in case of Windows line-endings
$0!="." { # we dont store . (change it to null if you need to)
a[FNR]=$0 # hash using line number as key
}
END { # after all that hashing
for(i=1;i<=FNR;i++) # iterate in line number order
print a[i] # output the last met version
}' file1 file2 # mind the file order
Output:
A=2
B=3
C=4
D=6
E=7
Edit: A version with a whitelist:
$ cat whitelist
A
B
E
Script:
$ awk -F= '
NR==FNR { # process the whitelist
a[FNR]=$1 # for a, the key is the line number and the value is the record
b[$1]=FNR # for b, the key is the record and the value is the line number
n=FNR # remember the count for END
next
} # process file1 and file2 ... filen
($1 in b) { # if record is found in b
a[b[$1]]=$0 # we set the record to a[linenumber]=record
}
END {
for(i=1;i<=n;i++) # here we loop on linenumbers, 1 to n
print a[i]
}' whitelist file1 file2
Output:
A=2
B=3
E=7
Problem Statement:
I have a delimited text file offloaded from Teradata which happens to have "\n" (newline characters or EOL markers) inside data fields.
The same EOL marker also appears at the end of each complete line (record).
I need to split this file into two or more files (based on a number of records I specify) while retaining the newline chars inside data fields, splitting only at the line breaks that terminate each record.
Example:
1|Alan
Wake|15
2|Nathan
Drake|10
3|Gordon
Freeman|11
Expectation :
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
What I have tried:
awk 'BEGIN{RS="\n"}NR%2==1{x="SplitF"++i;}{print > x}' inputfile.txt
The code can't distinguish the newlines inside data fields from the actual record-ending newlines. Is there a way this can be achieved?
EDIT: I have changed the problem statement and example. Please share your thoughts on the new example.
Use the following awk approach:
awk '{ r=(r!="")?r RS $0 : $0; if(NR%4==0){ print r > "file"++i".txt"; r="" } }
END{ if(r) print r > "file"++i".txt" }' inputfile.txt
NR%4==0 - your logical single line occupies two physical records, so we expect to separate on each 4 records
Results:
> cat file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
> cat file2.txt
3|Gordon
Freeman|11
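If the number of logical records per output file should be configurable, a sketch of the same approach with n passed in (still assuming each logical record spans exactly two physical lines):
awk -v n=2 '
{ r = (r != "" ? r RS $0 : $0) }    # accumulate physical lines
NR % (2*n) == 0 { print r > ("file" ++i ".txt"); close("file" i ".txt"); r = "" }
END { if (r != "") print r > ("file" ++i ".txt") }
' inputfile.txt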
If you are using GNU awk you can do this by setting RS appropriately, e.g.:
parse.awk
BEGIN { RS="[0-9]\\|" }
# Skip the empty first record by checking NF (Note: this will also skip
# any empty records later in the input)
NF {
# Send record with the appropriate key to a numbered file
printf("%s", d $0) > "file" i ".txt"
}
# When we found enough records, close current file and
# prepare i for opening the next one
#
# Note: NR-1 because of the empty first record
(NR-1)%n == 0 {
close("file" i ".txt")
i++
}
# Remember the record key in d, again,
# because of the empty first record
{ d=RT }
Run it like this:
gawk -f parse.awk n=2 infile
Where n is the number of records to put into each file.
Output:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
Given an input file of variable values (example):
A
B
D
What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:
A
B
C
D
Would end up being:
C
The input file is of the order of 100,000 variable values. The file to be mangled is of the order of several million lines.
awk '
NR==FNR { # IF this is the first file in the arg list THEN
list[$0] # store the contents of the current record as an index or array "list"
next # skip the rest of the script and so move on to the next input record
} # ENDIF
{ # This MUST be the second file in the arg list
for (i in list) # FOR each index "i" in array "list" DO
if (index($0,i) == 1) # IF "i" starts at the 1st char on the current record THEN
next # move on to the next input record
}
1 # Specify a true condition and so invoke the default action of printing the current record.
' file1 file2
An alternative approach to building up an array and then doing a string comparison on each element would be to build up a Regular Expression, e.g.:
...
list = list "|" $0
...
and then doing an RE comparison:
...
if ($0 ~ list)
next
...
but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
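For illustration, a minimal complete sketch of that RE-based approach, assuming the values in file1 contain no RE metacharacters:
awk '
NR==FNR { list = (list == "" ? "" : list "|") $0; next }  # build "A|B|D..."
$0 ~ ("^(" list ")") { next }                             # skip lines starting with any value
1
' file1 file2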
If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:
awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
You can also achieve this using egrep:
egrep -vf <(sed 's/^/^/' file1) file2
Let's see it in action:
$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB
This would remove lines that start with one of the values in file1.
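Note that egrep is nowadays a deprecated alias for grep -E, so the same command can also be written as:
grep -vEf <(sed 's/^/^/' file1) file2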
You can use comm to display the lines that are not common to both files, like this:
comm -3 file1 file2
Will print:
C
Notice that for this to work, both files have to be sorted; if they aren't, you can work around that using
comm -3 <(sort file1) <(sort file2)
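Note that comm prints lines unique to the second file indented by a tab (they belong to the second output column). If you only want the lines of file2 that are missing from file1, without the indentation, suppress the other two columns:
comm -13 <(sort file1) <(sort file2)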