My command:
awk 'NR==FNR{a[$0]=1;next;} substr($0,50,6) in a' file1 file2
The problem is that file2 contains \000 characters, so awk considers it a binary file.
Replacing each \000 with a space character:
tr '\000' ' ' < file2 > file2_not_binary
solves the binary-file problem.
However, file2 is a 20 GB file, and I don't want to run tr separately and save the result as another file. I want to pipe the output of tr into awk.
I have tried:
awk 'NR==FNR{a[$0]=1;next;} substr($0,50,6) in a' file1 < (tr '\000' ' ' < file2)
But the result is:
The system cannot find the file specified.
Another question: can my memory or awk handle such a big file at once? I'm working on a PC with 12 GB of RAM.
EDIT
One of the answers works as I expected (credit to Ed Morton):
tr '\000' ' ' < file2 | awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 -
However, it is about twice as slow as doing the same thing in two steps: first removing the \000 characters and saving the result, then using awk to search. How can I speed it up?
EDIT2
My bad. Ed Morton's solution is actually a little faster than running the same thing as two separate commands.
Two commands separately: 08:37:053
Two commands piped: 08:07:204
Since awk isn't storing your 2nd file in memory, the size of that file is irrelevant except for speed of execution. Try this:
tr '\000' ' ' < file2 | awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 -
It should be:
awk ... <(tr '\000' ' ' < file2)
# -------^ no space!
Check the manual about Process Substitution.
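For example, applied to the command from the question (a sketch; note that process substitution is a bash/ksh/zsh feature, so it won't work under plain sh):
awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 <(tr '\000' ' ' < file2)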
You could do the replacement inside awk itself using gsub(/\000/," "). To test, let's make a test file:
$ awk 'BEGIN{print "a b\000c d"}' > foo
$ hexdump -C foo
00000000 61 20 62 00 63 20 64 0a |a b.c d.|
00000008
And then:
$ awk '{print; gsub(/\000/," "); print}' foo
a bc d
a b c d
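Based on that, the substitution could also be done inline in the pipeline from the question (an untested sketch; whether your awk still flags file2 as binary may depend on the variant):
awk 'NR==FNR{a[$0];next} {gsub(/\000/," ")} substr($0,50,6) in a' file1 file2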
Related
The purpose here is to copy the first row of the file to the end.
Here is the input file:
335418.75,2392631.25,36091,38466,1
335418.75,2392643.75,36092,38466,1
335418.75,2392656.25,36093,38466,1
335418.75,2392668.75,36094,38466,1
335418.75,2392681.25,36095,38466,1
335418.75,2392693.75,36096,38466,1
335418.75,2392706.25,36097,38466,1
335418.75,2392718.75,36098,38466,1
335418.75,2392731.25,36099,38466,1
Using the following code I got the desired output. Is there another, easier option?
awk 'NR==1 {print}' FF1-1.csv > tmp1
cat FF1-1.csv tmp1
Desired output:
335418.75,2392631.25,36091,38466,1
335418.75,2392643.75,36092,38466,1
335418.75,2392656.25,36093,38466,1
335418.75,2392668.75,36094,38466,1
335418.75,2392681.25,36095,38466,1
335418.75,2392693.75,36096,38466,1
335418.75,2392706.25,36097,38466,1
335418.75,2392718.75,36098,38466,1
335418.75,2392731.25,36099,38466,1
335418.75,2392631.25,36091,38466,1
Thanks in advance.
Save the first line in a variable and print it at the end using the END block:
$ seq 5 | awk 'NR==1{fl=$0} 1; END{print fl}'
1
2
3
4
5
1
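Applied to FF1-1.csv from the question, redirecting to a destination file:
awk 'NR==1{fl=$0} 1; END{print fl}' FF1-1.csv > dest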
head can produce the same output as your awk command, so you can cat that instead.
You can use process substitution to avoid the temporary file.
cat FF1-1.csv <(head -n 1 FF1-1.csv)
As mentioned by Sundeep, if process substitution isn't available you can simply cat the file and then head it sequentially to obtain the same result, putting both in a subshell if you need to redirect the output:
(cat FF1-1.csv; head -n1 FF1-1.csv) > dest
Another alternative is to pipe the output of head to cat and refer to it with -, which for cat represents standard input:
head -1 FF1-1.csv | cat FF1-1.csv -
When you want to overwrite the existing file, normal solutions can fail: never write to a file you are still reading from.
A solution that edits the file in place is:
printf "%s\n" 1y $ x w q | ed -s file > /dev/null
Explanation:
printf prints each command on its own line, which is how ed expects them.
1y yanks (copies) the first line into a buffer.
$ moves to the last line.
x pastes the contents of the buffer after the current line.
w writes the result back to the file.
q quits the editor.
ed is the editor that performs all the work.
-s suppresses diagnostics.
file is your input file.
> /dev/null suppresses output to your screen.
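To see exactly what printf feeds to ed, run it on its own:
$ printf "%s\n" 1y $ x w q
1y
$
x
w
q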
With GNU sed:
seq 1 5 | sed '1h;$G'
Output:
1
2
3
4
5
1
1h: on the first line, copy the current line (pattern space) into sed's hold space
$G: on the last line ($), append the content of the hold space to the pattern space
See: man sed
The following solutions may also help:
Solution 1: simply using awk with RS and FS (with RS="" awk reads the whole file as a single record, and FS="\n" makes each line a field, so $1 is the first line):
awk -v RS="" -v FS="\n" '{print $0 ORS $1}' Input_file
Solution 2: using cat and head:
cat Input_file && head -n1 Input_file
I have this file that contains numbers in a column:
[root@server1]# cat numbers.txt
30
25
15
I need to add them together so I do this:
[root@autonoc cattests]# cat numbers.txt | paste -s -d+ | bc
70
But after I add them together I need to divide the sum by 60, something like this:
[root@server1]# cat numbers.txt | paste -s -d+ | "new command" | bc
How can I do that?
Using awk:
$ awk '{s+=$1} END{print s/60}' numbers.txt
1.16667
How it works
s+=$1
The number on each line of numbers.txt is added to the variable s.
END{print s/60}
After we have finished reading the file, we print the value of s divided by 60.
bc -l <<< '('$(paste -s -d+ numbers.txt)')/60'
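Here the command substitution expands to a single expression before bc sees it:
$ paste -s -d+ numbers.txt
30+25+15
$ bc -l <<< '(30+25+15)/60'
1.16666666666666666666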
awk -v div=60 '{tot+=$0}END{print tot/div}' numbers.txt
Using -v div=60 can be extended further to accept any user input, like:
read -p "Enter the div value: " div
awk -v div="$div" ...
IHTH
You can use dc:
dc -f numbers.txt -e '++3k60/[result = ]np'
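A rough breakdown of that dc program (assuming the three-line numbers.txt above; it should print result = 1.166):
# -f numbers.txt   push 30, 25 and 15 onto the stack
# ++               two additions fold them into 70
# 3k               set the precision to 3 fractional digits
# 60/              divide the sum by 60
# [result = ]n     print the literal string "result = "
# p                print the top of the stack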
I am a Java programmer and a newbie to shell scripting. I have a daunting task: parse multi-gigabyte logs and look for lines where '1' (just 1, no quotes) is present at the 446th position of the line. I am able to verify that the character 1 is present by running cat *.log | cut -c 446-446 | sort | uniq -c, but I am not able to extract the matching lines and print them to an output file.
awk '{if (substr($0,446,1) == "1") {print $0}}' file
is the basic approach.
You can use FILENAME in the print statement to add the filename to the output, so you could do:
awk '{if (substr($0,446,1) == "1") {print FILENAME ":" $0}}' file1 file2 ...
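Since the question mentions multiple logs and an output file, the same command can be pointed at all of them at once (output.txt is a name assumed here):
awk '{if (substr($0,446,1) == "1") {print FILENAME ":" $0}}' *.log > output.txt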
IHTH
Try adding grep to the pipe:
grep '^.\{445\}1.*$'
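For example, to collect the matching lines from all the logs into a file (output.txt is a hypothetical name; with more than one input file, grep prefixes each line with its filename unless you pass -h):
grep '^.\{445\}1.*$' *.log > output.txt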
You can use an awk command for that:
awk 'substr($0, 446, 1) == "1"' file.log
The substr function extracts 1 character at position 446, and == "1" checks that this character is 1.
Another one in awk. To make a saner example, we print lines where the third char is 3:
$ cat file
123 # this
456 # not this
$ awk -F '' '$3==3' file
123 # this
Based on that example but untested (note that an empty FS, which splits each character into its own field, is not POSIX; it works in gawk, for example):
$ awk -F '' '$446==1' file
I have something I need help with and would appreciate your help.
Let's take an example.
I have file1 with this data:
"eno", "ename", "salary"
"1","john","50000"
"2","steve","30000"
"3","aku","20000"
and I have file2 with this data:
"eno", "ename", "incentives"
"1","john","2000"
"2","steve","5000"
"4","akshi","200"
And the expected output I want in a third file is:
"eno", "ename", "t_salary"
"1","john","52000"
"2","steve","35000"
This is the expected result; eno and ename should be used as the primary key.
If your files are sorted and the first field is the key, you can join the files and work on the combined fields. That is:
$ join -t, file1 file2
"eno", "ename", "salary", "ename", "incentives"
"1","john","50000","john","2000"
"2","steve","30000","steve","5000"
and your awk can be:
... | awk -F, -v OFS=, 'NR==1{print ...}
NR>1{gsub(/"/,"",$3);
gsub(/"/,"",$5);
print $1,$2,$3+$5}'
Printing the header and quoting the total field is left as an exercise.
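One way to fill in that exercise (a sketch; the header text follows the expected output above):
join -t, file1 file2 |
awk -F, -v OFS=, '
    NR==1 { print $1, $2, "\"t_salary\""; next }   # header row
    { gsub(/"/,"",$3); gsub(/"/,"",$5)             # strip quotes from the numbers
      print $1, $2, "\"" $3+$5 "\"" }'             # re-quote the sum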
$ cat tst.awk
BEGIN { FS="\"[[:space:]]*,[[:space:]]*\""; OFS="\",\"" }
{ key = $1 FS $2 }
NR==FNR { sal[key] = $NF; next }
key in sal { $3 = (FNR>1 ? $3+sal[key] : "t_salary") "\""; print }
$ awk -f tst.awk file1 file2
"eno","ename","t_salary"
"1","john","52000"
"2","steve","35000"
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Abbreviating the input files to f1 & f2, and breaking out the swiss-army-knife utils (plus a bashism):
head -n 1 f1 | sed 's/sal/t_&/' ; \
grep -h -f <(tail -qn +2 f1 f2 | tr ',' '\t' | sort -k1,2 | \
rev | uniq -d -f1 | rev | \
cut -f 2) \
f1 f2 | \
tr -s ',"' '\t' | datamash -s -g2,3 sum 4 | sed 's/[^\t]*/"&"/g;s/\t/,/g'
Output:
"eno", "ename", "t_salary"
"1","john","52000"
"2","steve","35000"
The main job is fairly simple:
grep searches for only those lines with duplicate (and therefore add-able) fields #1 & #2, and this is piped to...
datamash which does the adding.
The rest of the code is reformatting needed to please the various text utils which all seem to have ugly but minor format inconsistencies.
Those revs are only needed because uniq lacks most of sort's field functions.
The trs are because uniq also lacks a field separator switch, and datamash can't sum quoted numbers. The sed at the end is to undo all that tr-ing.
I have a binary file and want to extract part of it, starting from a known byte string (e.g. FF D8 FF D0) and ending with another known byte string (AF FF D9).
In the past I've used dd to cut a part off the beginning or end of a binary file, but that command doesn't seem to support what I'm asking for here.
What terminal tool can do this?
Locate the start/end position, then extract the range.
$ xxd -g0 input.bin | grep -im1 FFD8FFD0 | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^
0009590
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin
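The ^FFD8FFD0^AFFFD9^ line is the shell's quick history substitution: it reruns the previous pipeline with the start pattern replaced by the end pattern, i.e. it is equivalent to:
xxd -g0 input.bin | grep -im1 AFFFD9 | awk -F: '{print $1}'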
In a single pipe:
xxd -c1 -p file |
awk -v b="ffd8ffd0" -v e="aaffd9" '
found == 1 {
print $0
str = str $0
if (str == e) {found = 0; exit}
if (length(str) == length(e)) str = substr(str, 3)}
found == 0 {
str = str $0
if (str == b) {found = 1; print str; str = ""}
if (length(str) == length(b)) str = substr(str, 3)}
END{ exit found }' |
xxd -r -p > new_file
test ${PIPESTATUS[1]} -eq 0 || rm new_file
The idea is to use awk between two xxd commands to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found, then exits.
The case where the 1st pattern is found but the 2nd is not must be taken into account. It is handled in the END part of the awk script, which returns a non-zero exit status. This is caught by bash's ${PIPESTATUS[1]}, in which case I decided to delete the new file.
Note that an empty file also means that nothing was found.
This should work with standard tools (xxd, tr, grep, awk, dd). It correctly handles the "pattern split across lines" issue, and it looks for the pattern only when aligned at a byte offset (not at a nibble).
file=<yourfile>
outfile=<youroutputfile>
startpattern="ff d8 ff d0"
endpattern="af ff d9"
xxd -g0 -c1 -ps ${file} | tr '\n' ' ' > ${file}.hex
start=$((($(grep -bo "${startpattern}" ${file}.hex\
| head -1 | awk -F: '{print $1}')-1)/3))
len=$((($(grep -bo "${endpattern}" ${file}.hex\
| head -1 | awk -F: '{print $1}')-1)/3-${start}))
dd ibs=1 count=${len} skip=${start} if=${file} of=${outfile}
Note: the script above uses a temporary file to avoid doing the binary-to-hex conversion twice. A space/time trade-off is to pipe the result of xxd directly into the two greps. A one-liner is also possible, at the expense of clarity.
One could also use tee and a named pipe to avoid storing a temporary file and converting twice, but I'm not sure it would be faster (xxd is fast), and it is certainly more complex to write.
See this link for a way to do a binary grep. Once you have the start and end offsets, you should be able to get what you need with dd.
A variation on the awk solution that assumes that your binary file, once converted to hex with spaces, fits in memory:
xxd -c1 -p file |
tr "\n" " " |
sed -n -e 's/.*\(ff d8 ff d0.*aa ff d9\).*/\1/p' |
xxd -r -p > new_file
Another solution in sed, but using less memory:
xxd -c1 -p file |
sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' |
sed -n -e '1{N;N}' -e '/aa\nff\nd9/{p;Q1}' -e 'P;N;D' |
xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file
The 1st sed prints from ff d8 ff d0 until the end of the file. Note that you need as many N commands in -e '1{N;N;N}' as there are bytes in your 1st pattern, less one.
The 2nd sed prints from the beginning of the file to aa ff d9. Again, you need as many N commands in -e '1{N;N}' as there are bytes in your 2nd pattern, less one.
Again, a test is needed to check whether the 2nd pattern was found, and the file is deleted if it was not.
Note that the Q command is a GNU extension to sed. If you do not have it, you need to discard the rest of the file once the pattern is found (in a loop like the 1st sed, but without printing), and check after the hex-to-binary conversion that new_file ends with the right pattern.