I am looking for a way to filter a (~12 GB) largefile.txt, which has long strings in each line, for each of the words (one per line) in a queryfile.txt. But afterwards, instead of outputting/saving the whole line that each query word is found in, I'd like to save only that query word and a second word which I only know the start of (e.g. "ABC") and which I know for certain is in the same line the first word was found in.
For example, if queryfile.txt has the words:
this
next
And largefile.txt has the lines:
this is the first line with an ABCword # contents of first line will be saved
and there is an ABCword2 in this one as well # contents of 2nd line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
(Notice that the largefile.txt always has a word starting with ABC included in every line. It's also impossible for one of the query words to start with "ABC")
The save file should look similar to:
this ABCword
this ABCword2
next ABCword2
So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:
LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt
The problem is that not only is the query word not being saved, but -F"," '$2~/ABC/' doesn't seem to be the correct way to fetch words beginning with "ABC" either.
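Looking closer, I think the reason is that the lines contain no commas at all, so with a comma as field separator the whole line ends up in $1 and $2 is always empty:
$ echo 'this is the first line with an ABCword' | awk -F"," '{print NF, $2}'
1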
I also found ways of using awk alone, but still haven't managed to adapt the code to save just the two words rather than the whole line:
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt
2nd attempt based on updated sample input/output in question:
$ cat tst.awk
FNR==NR { words[$1]; next }
{
    queryWord = otherWord = ""
    for (i=1; i<=NF; i++) {
        if ( $i in words ) {
            queryWord = $i
        }
        else if ( $i ~ /^ABC/ ) {
            otherWord = $i
        }
    }
    if ( (queryWord != "") && (otherWord != "") ) {
        print queryWord, otherWord
    }
}
$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2
Original answer:
This MAY be what you're trying to do (untested):
awk '
FNR==NR { word2lgth[$1] = length($1); next }
($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
print substr($0,1,word2lgth[$1]+1+RSTART+RLENGTH)
}
' queryfile.txt largefile.txt > results.txt
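For reference, match() sets RSTART (where the match begins) and RLENGTH (how long it is), which is what the print above uses to cut the line, e.g.:
$ echo 'foo ABCbar baz' | awk '{ if (match($0, / ABC[[:alnum:]_]+/)) print RSTART, RLENGTH }'
4 7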
Given:
cat large_file
this is the first line with an ABCword
and the next line has an ABCword2 too CRABCAKE
third line has an ABCword3
ABCword4 and this is behind
cat query_file
this
next
(The comments you have on each line of large_file were removed for this test, otherwise the ABCword3 line would print too, since its comment contains "this".)
You can actually do this entirely with GNU sed and tr manipulation of the query file:
pat=$(gsed -E 's/^(.+)$/\\b\1\\b/' query_file | tr '\n' '|' | gsed 's/|$//')
gsed -nE "s/.*(${pat}).*(\<ABC[a-zA-Z0-9]*).*/\1 \2/p; s/.*(\<ABC[a-zA-Z0-9]*).*(${pat}).*/\1 \2/p" large_file
Prints:
this ABCword
next ABCword2
ABCword4 this
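For the sample query_file, the generated pattern looks like this:
$ gsed -E 's/^(.+)$/\\b\1\\b/' query_file | tr '\n' '|' | gsed 's/|$//'
\bthis\b|\bnext\b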
This one assumes your queryfile has more entries than there are words on a line in the largefile. Also, it does not treat your comments as comments but processes them as regular data, so if cut'n'pasted, the third record is a match too.
$ awk '
NR==FNR { # process queryfile
a[$0] # hash those query words
next
}
{ # process largefile
for(i=1;i<=NF && !(f1 && f2);i++) # iterate until both words found
if(!f1 && ($i in a)) # f1 holds the matching query word
f1=$i
else if(!f2 && ($i~/^ABC/)) # f2 holds the ABC starting word
f2=$i
if(f1 && f2) # if both were found
print f1,f2 # output them
f1=f2=""
}' queryfile largefile
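With the question's sample files (comments removed), this should output:
this ABCword
this ABCword2
next ABCword2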
Using sed in a while loop (note that this re-reads largefile.txt once per query word, which gets expensive on a 12 GB file):
$ cat queryfile.txt
this
next
$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2
I have written an awk command
awk 'NR==5 {sub(substr($1,14,1),(substr($1,14,1) + 1)); print "test.py"}' > test.py
This is trying to change the 14th character on the 5th line of a python file. For some reason this doesn't stop executing and I have to break it. It also deletes the contents of the file.
Sample input:
import tools
tools.setup(
name='test',
tagvisc='0.0.8',
packages=tools.ges(),
line xyz
)
Output:
import tools
tools.setup(
name='test',
tagvisc='0.0.9',
packages=tools.ges(),
line xyz
)
If I understand the nuances of what you need to do now, you will need to split the first field of the 5th record into an array using "." as the field separator, then remove the "\"," from the end of the 3rd element of the array (optional) before incrementing the number and putting the field back together. You can do so with:
awk '{split($1,a,"."); sub(/["],/,"",a[3]); $1=a[1]"."a[2]"."(a[3]+1)"\","}1'
(NR==5 omitted for example)
Example Use/Output
$ echo 'tagvisc="3.4.30"', |
awk '{split($1,a,"."); sub(/["],/,"",a[3]); $1=a[1]"."a[2]"."(a[3]+1)"\","}1'
tagvisc="3.4.31",
I'll leave redirecting to a temp file and then back to the original to you. Let me know if this isn't what you need.
Adding NR == 5 you would have
awk 'NR==5 {split($1,a,"."); sub(/["],/,"",a[3]); $1=a[1]"."a[2]"."(a[3]+1)"\","}1' test.py > tmp; mv -f tmp test.py
Get away from the fixed line number (NR==5) and fixed character position (14) and instead look at dynamically finding what you want to change/increment, eg:
$ cat test.py
import tools
tools.setup(
name='test',
tagvisc='0.0.10',
packages=tools.ges(),
line xyz
)
One awk idea to increment the 10 (the 3rd numeric string in the tagvisc line):
awk '
/tagvisc=/ { split($0,arr,".") # split line on periods
sub("." arr[3]+0 "\047","." arr[3]+1 "\047") # replace .<oldvalue>\047 with .<newvalue>\047; \047 == single quote
}
1
' test.py
NOTES:
arr[3] holds 10',; with arr[3]+0 awk takes the leftmost all-numeric content and strips off everything else before adding 0, leaving us with arr[3]+0 = 10; the same logic applies for arr[3]+1 (= 11); basically a trick for discarding any suffix that is not numeric
if there are multiple lines in the file with the string tagvisc='x.y.z' then this will change z in all of those lines; we can get around this by adding some more logic to only change the first occurrence (see the sketch after the output below), but I'll leave it out of the main answer assuming it's not an issue
This generates:
import tools
tools.setup(
name='test',
tagvisc='0.0.11',
packages=tools.ges(),
line xyz
)
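If you did need to restrict the change to the first occurrence, one way might be a done flag (an untested sketch along the same lines as above):
awk '
!done && /tagvisc=/ { split($0,arr,".")
                      sub("." arr[3]+0 "\047","." arr[3]+1 "\047")
                      done = 1                 # skip any later tagvisc= lines
}
1
' test.py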
If the objective is to overwrite the original file with the new values you have a couple options:
# use temporary file:
awk '...' test.py > tmp ; mv tmp test.py
# if using GNU awk, and once accuracy of script has been verified:
awk -i inplace '...' test.py
Using awk to make changes to nth character in [mth] line in a file:
$ awk 'BEGIN{FS=OFS=""}NR==5{$18=9}1' file # > tmp && mv tmp file
Outputs:
import tools
tools.setup(
name='test',
tagvisc='0.0.9',           <----- the arrow is not part of the output; it marks what changed
packages=tools.ges(),
line xyz
)
Explained:
$ awk '
BEGIN {
FS=OFS="" # set the field separators to empty and you can reference
} # each char in record by a number
NR==5 { # 5th record
$18=9 # and 18th char is replaced with a 9
}1' file # > tmp && mv tmp file # output to a tmp file and replace
Notice: some awks (probably all but GNU awk) will fail if you try to replace a multibyte char with a single-byte one (for example, replacing a UTF-8 ä (0xc3 0xa4) with an a (0x61) will result in 0x61 0xa4). Naturally, an ä before the position you'd like to replace will throw your calculations off by 1.
Oh yeah, you can replace one char with multiple chars but not vice versa.
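For example, with GNU awk:
$ echo 'abcdef' | awk 'BEGIN{FS=OFS=""}{$3="XY"}1'
abXYdef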
something like this...
$ awk 'function join(a,k,s,sep) {for(k in a) {s=s sep a[k]; sep="."} return s}
BEGIN {FS=OFS="\""}
/^tagvisc=/{v[split($2,v,".")]++; $2=join(v)}1' file > newfile
Using GNU awk for the 3rd arg to match() and "inplace" editing:
$ awk -i inplace '
match($0,/^([[:space:]]*tagvisc=\047)([^\047]+)(.*)/,a) {
split(a[2],ver,".")
$0 = a[1] ver[1] "." ver[2] "." ver[3]+1 a[3]
}
{ print }
' test.py
$ cat test.py
import tools
tools.setup(
name='test',
tagvisc='0.0.9',
packages=tools.ges(),
line xyz
)
I have 2 files test1.py and test2.py with this content
test1.py :
#first
#endoffirst
#second
#endofsecond
#3rd
#endof3rd
and test2.py :
#first
this is first command
#endoffirst
#second
this is second command
#endofsecond
#3rd
this is 3rd command
#endof3rd
I want to first check the test2.py file, copy the content between #first and #endoffirst, and put it between the same tags in the test1.py file, using bash scripting or other operations on Linux. In other words, all content between two unique tags or comments in one file should be copied and put between the same tags or comments in the other file.
I have already tried many things, like the sed command, but I can't get the right answer.
I would appreciate any help with this.
If you just want to use sed, I'd do it in two passes.
sed -n '/#first/,/#endoffirst/w tmp' test2.py
sed '/#first/,/#endoffirst/{
/#endoffirst/!d;
/#endoffirst/{ z; r tmp
} }' test1.py
#first
this is first command
#endoffirst
#second
#endofsecond
#3rd
#endof3rd
The weird formatting is because if you use r (or w) then the filename has to be the only thing on the rest of the line. Semicolons, spaces, closing curlies or pretty much anything else but a newline will be included in the filename by sed.
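You can see the quirk with a quick test (GNU sed silently treats an unreadable file as empty, so the stray ; just makes the read a no-op):
$ echo hello > tmp
$ echo x | sed 'r tmp'
x
hello
$ echo x | sed 'r tmp;'
x
The filename in the second call is taken to be "tmp;", which doesn't exist, so nothing is appended.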
I'd probably use awk. Here's a clumsy pass at that.
$: awk '/#first/,/#endoffirst/{
if (NR == FNR) { x=x$0; if ($0 ~ "#endoffirst") { nextfile } else { x=x"\n" } }
else { if ($0 ~ "#endoffirst") { print x; } }
next } {print}' test2.py test1.py
#first
this is first command
#endoffirst
#second
#endofsecond
#3rd
#endof3rd
This may be what you're trying to do:
$ cat tst.awk
/^#/ {
    inBlock = !inBlock                  # a "#" line opens or closes a block
    if ( inBlock ) {
        tag = $0                        # remember the opening tag
    }
}
NR == FNR {                             # first file (test2.py): save each block, tag line included
    if ( inBlock ) {
        val[tag] = (tag in val ? val[tag] ORS : "") $0
    }
    next
}
$0 in val {                             # second file (test1.py): replace an opening tag with its saved block
    print val[$0]
}
!inBlock                                # print lines outside blocks, including closing tags
$ awk -f tst.awk test2.py test1.py
#first
this is first command
#endoffirst
#second
this is second command
#endofsecond
#3rd
this is 3rd command
#endof3rd
from os import fdopen, remove
from tempfile import mkstemp
from shutil import copymode, move

try:
    start_flag_count = end_flag_count = 0
    content_dict = {}
    with open('/tmp/test2.py') as old_file:
        for line in old_file:
            ori_line = line
            line = line.strip()
            if line and line.startswith('#'):  # change '#' for your tags startswith
                if start_flag_count == end_flag_count:
                    cur_tag = line
                    start_flag_count += 1
                    mul_content_lines = ''
                elif start_flag_count == end_flag_count + 1:
                    content_dict[cur_tag] = mul_content_lines
                    end_flag_count += 1
            else:
                if start_flag_count == end_flag_count + 1:
                    mul_content_lines += ori_line

    ori_file = '/tmp/test1.py'
    fd, tmp_file_path = mkstemp()
    with fdopen(fd, 'w') as new_file:
        with open(ori_file) as old_file:
            for line in old_file:
                new_file.write(line)
                line = line.strip()
                if line and line.startswith('#') and line in content_dict:
                    new_file.write(content_dict[line])
    copymode(ori_file, tmp_file_path)
    remove(ori_file)
    move(tmp_file_path, ori_file)
except Exception as e:
    print str(e)
This runs in Python 2.7. First it reads test2.py and saves the content of each tagged block in a dictionary keyed by the opening tag; then it scans test1.py and writes each block's content after the corresponding opening tag line. (Note that the paths /tmp/test1.py and /tmp/test2.py are hardcoded.)
I have a large data file in text format and I want to convert it to CSV by specifying each column's length.
number of columns = 5
column length
[4 2 5 1 1]
sample observations:
aasdfh9013512
ajshdj 2445df
Expected Output
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
GNU awk (gawk) supports this directly with FIELDWIDTHS, e.g.:
gawk '$1=$1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
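The $1=$1 assignment just forces awk to rebuild the record with OFS between the fields; FIELDWIDTHS by itself only controls how the line is split, e.g.:
$ echo 'aasdfh9013512' | gawk -v FIELDWIDTHS='4 2 5 1 1' '{print $2}'
fh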
I would use sed and catch the groups with the given length:
$ sed -r 's/^(.{4})(.{2})(.{5})(.{1})(.{1})$/\1,\2,\3,\4,\5/' file
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Here's a solution that works with regular awk (does not require gawk).
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'
It uses awk's substr function to define each field's start position and length. OFS defines what the output field separator is (in this case, a comma).
(Side note: This only works if the source data does not have any commas. If the data has commas, then you have to escape them to be proper CSV, which is beyond the scope of this question.)
Demo:
echo 'aasdfh9013512
ajshdj 2445df' |
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Here is a generic way of handling this in awk (an alternative to the FIELDWIDTHS option) where we need not hardcode substring positions; it inserts commas at whatever positions the user provides. Written and tested in GNU awk. To use it, put the column widths (like OP showed in the samples) in the awk variable colLengh, with a space between the numbers.
awk -v colLengh="4 2 5 1 1" '
BEGIN{
  num=split(colLengh,arr,OFS)
}
{
  j=sum=0
  while(++j<=num){
    if(length($0)>sum+arr[j]){
      sub("^.{"arr[j]+sum"}","&,")
    }
    sum+=arr[j]+1
  }
}
1
' Input_file
Explanation: Simply put, we create an awk variable named colLengh holding the widths of the columns, which determine the positions where commas need to be inserted. In the BEGIN section we create an array arr holding those widths.
In the main program section we first nullify the variables j and sum, then run a while loop from j=1 until j exceeds num. On each pass, provided the current line is longer than sum plus the current width (otherwise it makes no sense to perform the substitution, hence the additional check), we substitute everything from the start of the line with the matched text plus a comma. E.g. the sub regex becomes ^.{4} the first time the loop runs, then ^.{7} because the 7th position is where the next comma must be inserted, and so on. The final 1 prints all lines, edited or not.
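Run against the sample observations above, this should produce:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f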
If anyone is still looking for a solution, I have developed a small script in Python. It's easy to use, provided you have Python 3.5:
https://github.com/just10minutes/FixedWidthToDelimited/blob/master/FixedWidthToDelimiter.py
"""
This script will convert Fixed width File into Delimiter File, tried on Python 3.5 only
Sample run: (Order of argument doesnt matter)
python ConvertFixedToDelimiter.py -i SrcFile.txt -o TrgFile.txt -c Config.txt -d "|"
Inputs are as follows
1. Input FIle - Mandatory(Argument -i) - File which has fixed Width data in it
2. Config File - Optional (Argument -c, if not provided will look for Config.txt file on same path, if not present script will not run)
Should have format as
FieldName,fieldLength
eg:
FirstName,10
SecondName,8
Address,30
etc:
3. Output File - Optional (Argument -o, if not provided will be used as InputFIleName plus Delimited.txt)
4. Delimiter - Optional (Argument -d, if not provided default value is "|" (pipe))
"""
from collections import OrderedDict
import argparse
from argparse import ArgumentParser
import os.path
import sys
def slices(s, args):
position = 0
for length in args:
length = int(length)
yield s[position:position + length]
position += length
def extant_file(x):
"""
'Type' for argparse - checks that file exists but does not open.
"""
if not os.path.exists(x):
# Argparse uses the ArgumentTypeError to give a rejection message like:
# error: argument input: x does not exist
raise argparse.ArgumentTypeError("{0} does not exist".format(x))
return x
parser = ArgumentParser(description="Please provide your Inputs as -i InputFile -o OutPutFile -c ConfigFile")
parser.add_argument("-i", dest="InputFile", required=True, help="Provide your Input file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-o", dest="OutputFile", required=False, help="Provide your Output file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE")
parser.add_argument("-c", dest="ConfigFile", required=False, help="Provide your Config file name here,File should have value as fieldName,fieldLength. if file is on different path than where this script resides then provide full path of the file", metavar="FILE",type=extant_file)
parser.add_argument("-d", dest="Delimiter", required=False, help="Provide the delimiter string you want",metavar="STRING", default="|")
args = parser.parse_args()
#Input file madatory
InputFile = args.InputFile
#Delimiter by default "|"
DELIMITER = args.Delimiter
#Output file checks
if args.OutputFile is None:
OutputFile = str(InputFile) + "Delimited.txt"
print ("Setting Ouput file as "+ OutputFile)
else:
OutputFile = args.OutputFile
#Config file check
if args.ConfigFile is None:
if not os.path.exists("Config.txt"):
print ("There is no Config File provided exiting the script")
sys.exit()
else:
ConfigFile = "Config.txt"
print ("Taking Config.txt file on this path as Default Config File")
else:
ConfigFile = args.ConfigFile
fieldNames = []
fieldLength = []
myvars = OrderedDict()
with open(ConfigFile) as myfile:
for line in myfile:
name, var = line.partition(",")[::2]
myvars[name.strip()] = int(var)
for key,value in myvars.items():
fieldNames.append(key)
fieldLength.append(value)
with open(OutputFile, 'w') as f1:
fieldNames = DELIMITER.join(map(str, fieldNames))
f1.write(fieldNames + "\n")
with open(InputFile, 'r') as f:
for line in f:
rec = (list(slices(line, fieldLength)))
myLine = DELIMITER.join(map(str, rec))
f1.write(myLine + "\n")
Portable awk
Generate an awk script with the appropriate substr commands
cat cols
4
2
5
1
1
<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1
Output:
substr($0,1,4)
substr($0,5,2)
substr($0,7,5)
substr($0,12,1)
substr($0,13,1)
Combine lines and make it a valid awk-script:
<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1 |
paste -sd, | sed 's/^/{ print /; s/$/ }/'
Output:
{ print substr($0,1,4),substr($0,5,2),substr($0,7,5),substr($0,12,1),substr($0,13,1) }
Redirect the above to a file, e.g. /tmp/t.awk and run it on the input-file:
<infile awk -f /tmp/t.awk
Output:
aasd fh 90135 1 2
ajsh dj 2445 d f
Or with comma as the output separator:
<infile awk -f /tmp/t.awk OFS=,
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
I have to replace a single nth character of each (alternate) row of a large file with the corresponding line of another file. E.g., here I am changing every 5th character.
file1:
>chr1:101842566-101842576
CCTCAACTCA
>chr1:101937281-101937291
GAATTGGATA
>chr1:101964276-101964286
AAAAAATAGG
>chr1:101972950-101972960
ggctctcatg
>chr1:101999969-101999979
CATCATGACG
file2:
G
A
T
A
C
output:
>chr1:101842566-101842576
CCTCGACTCA
>chr1:101937281-101937291
GAATAGGATA
>chr1:101964276-101964286
AAAATATAGG
>chr1:101972950-101972960
ggctAtcatg
>chr1:101999969-101999979
CATCCTGACG
The number of characters in each (alternate) row can be large, and the number of rows is large too. How can this be done efficiently?
Here is one way with awk:
awk 'NR==FNR{a[NR]=$1;next}!/^>/{$1=substr($1,1,n-1) a[++i] substr($1,n+1)}1' n=5 f2 f1
Explanation:
We iterate over the second file and store it in an array indexed by line number.
Once the second file is loaded in memory, we move on to the first file.
We look for lines not starting with >.
When found, we substitute the value from our array. We do this by using the substr function.
The variable n defined on the command line lets you choose which character to modify.
The trailing 1, awk's default print action, prints every line, so the > lines pass through unmodified.
This solution assumes the format of the file is as shown above. That is, the first file will always start with > followed by the line you want to make changes on. Substitutions from the second file are made in the order they are seen.
Demo:
Every 5th character:
$ awk 'NR==FNR{a[NR]=$1;next}!/^>/{$1=substr($1,1,n-1) a[++i] substr($1,n+1)}1' n=5 f2 f1
>chr1:101842566-101842576
CCTCGACTCA
>chr1:101937281-101937291
GAATAGGATA
>chr1:101964276-101964286
AAAATATAGG
>chr1:101972950-101972960
ggctAtcatg
>chr1:101999969-101999979
CATCCTGACG
Every 3rd character:
$ awk 'NR==FNR{a[NR]=$1;next}!/^>/{$1=substr($1,1,n-1) a[++i] substr($1,n+1)}1' n=3 f2 f1
>chr1:101842566-101842576
CCGCAACTCA
>chr1:101937281-101937291
GAATTGGATA
>chr1:101964276-101964286
AATAAATAGG
>chr1:101972950-101972960
ggAtctcatg
>chr1:101999969-101999979
CACCATGACG
This is how I would use perl. First read all of file2 into an array, then iterate over that array, reading two lines at a time from file1, printing the first line unmodified and changing the 5th character of the second line:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
#use Data::Printer;
# Read all of file2
my $lines;
open(FILE, $ARGV[1]);
{
local $/;
$lines = <FILE>;
}
close(FILE);
my @new_chars = split(/\n/, $lines);
# Read and process file1
open(FILE, $ARGV[0]);
foreach my $new_char (@new_chars) {
# >chr1:101842566-101842576
my $line = <FILE>;
print $line;
# CCTCAACTCA
$line = <FILE>;
$line =~ s/^(....)./$1$new_char/; # Replace 5th character
print $line;
}
close(FILE);
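Usage would be something like this (the script name is mine; it prints to stdout, so redirect wherever you like):
$ perl replace5th.pl file1 file2 > output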
You could replace a column in the file inplace using mmap in Python:
#!/usr/bin/env python3
"""Replace inplace a column of a large file.

Usage:
    $ ./replace-inplace file1 file2 5
"""
import sys
from mmap import ACCESS_WRITE, mmap

def main():
    ncolumn = int(sys.argv[3]) - 1  # 1st column is 1
    with open(sys.argv[1], 'r+b') as file1:
        with mmap(file1.fileno(), 0, access=ACCESS_WRITE) as mm:
            with open(sys.argv[2], 'rb') as file2:
                while True:
                    mm.readline()  # ignore every other line
                    pos = mm.tell()  # remember current position
                    if not mm.readline():  # EOF
                        break
                    replacement = file2.readline().strip()[0]
                    mm[pos + ncolumn] = replacement  # replace the column

main()
It assumes that you are replacing a byte with a byte i.e., no content is moved in the file.
This might work for you (GNU sed, paste and cat):
cat file1 | paste -d\\n\\t\\n - file2 - | sed -r 's/^(.)\t(.{4})./\2\1/' >file3
Embed the data from file2 into file1 and then re-arrange it.
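To see the intermediate re-arrangement, run just the paste stage on the sample files:
$ cat file1 | paste -d\\n\\t\\n - file2 - | head -4
>chr1:101842566-101842576
G	CCTCAACTCA
>chr1:101937281-101937291
A	GAATTGGATA
The sed then moves the leading replacement character past the first four bases, turning "G<TAB>CCTCAACTCA" into "CCTCGACTCA".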