How to get 1st field of a file only when 2nd field matches a given string?
$ cat temp.txt
Ankit pass
amit pass
aman fail
abhay pass
asha fail
ashu fail
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'*
gives no output
Another syntax with awk:
awk '$2 ~ /^fail$/{print $1}' input_file
This also avoids a useless 'cat' command.
^ matches the start of the string
$ matches the end of the string
Anchoring both ends is the reliable way to match the pattern exactly.
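A quick illustration of why the anchors matter, using made-up names (alice/bob are assumptions, not from the original data):

```shell
# Without anchors, /fail/ also matches "failed"; with ^...$ it must be exact.
printf 'alice failed\nbob fail\n' | awk '$2 ~ /fail/  {print $1}'   # prints alice and bob
printf 'alice failed\nbob fail\n' | awk '$2 ~ /^fail$/{print $1}'   # prints only bob
```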
Either:
- your fields are not tab-separated, or
- you have blanks at the end of the relevant lines, or
- you have DOS line endings, so there is a CR at the end of every line (and therefore at the end of every $2); see "Why does my tool output overwrite itself and how do I fix it?"
With GNU cat you can run cat -Tev temp.txt to see tabs (^I), CRs (^M) and line endings ($).
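A minimal sketch of such a diagnosis, assuming GNU coreutils; the sample line is fabricated to contain a tab and a DOS line ending:

```shell
# -T shows tabs as ^I, -e shows CRs as ^M and marks line ends with $
printf 'aman\tfail\r\n' | cat -Tev    # -> aman^Ifail^M$
# Stripping the CRs first makes the original awk command work again
printf 'aman\tfail\r\n' | tr -d '\r' | awk -F'\t' '$2 == "fail" { print $1 }'   # -> aman
```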
Your code seems to work fine when I remove the * at the end
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'
The other thing to check is if your file is using tab or spaces. My copy/paste of your data file copied spaces, so I needed this line:
cat temp.txt | awk '$2 == "fail" { print $1 }'
The other way of doing this is with grep:
cat temp.txt | grep 'fail$' | awk '{ print $1 }'
First, I need to extract the character at a known position in file.txt using bash, where the position refers to the second line:
>header
cgatgcgctctgtgcgtgcgtgcg
so let's assume I want position 10 from the second line, the output should be:
c
second, I want to include the surrounding ±5 characters, resulting in
gcgctctgtgc
{ read -r; read -r; echo "${REPLY:9:1}"; echo "${REPLY:4:11}"; } < file.txt
Output:
c
gcgctctgtgc
The ${parameter:offset:length} syntax for substrings is explained in https://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion.
The read command is explained in https://www.gnu.org/software/bash/manual/bashref.html#index-read.
Input redirection: https://www.gnu.org/software/bash/manual/bashref.html#Redirections.
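A generalized sketch of the same idea, with the position and window pulled out into variables (pos and window are hypothetical names, and the file contents are recreated for the example):

```shell
#!/usr/bin/env bash
pos=10     # 1-indexed position in the sequence line
window=5   # number of characters to keep on each side
printf '>header\ncgatgcgctctgtgcgtgcgtgcg\n' > file.txt
{
  read -r                                   # consume the header line
  read -r                                   # REPLY now holds the second line
  echo "${REPLY:pos-1:1}"                   # the single character at pos
  echo "${REPLY:pos-1-window:2*window+1}"   # pos with +/- window characters
} < file.txt
```

Bash evaluates the offset and length as arithmetic expressions, so no separate $((...)) is needed.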
With awk:
To get the character at position 10, 1-indexed:
awk 'NR==2 {print substr($0, 10, 1)}'
NR==2 checks whether the current record (line) is the second one; if so, the statements inside {} are executed.
substr($0, 10, 1) will extract 1 character starting from position 10 from field $0 (the whole record) i.e. only the 10-th character will be extracted. The format for substr() is substr(field, offset, length).
Similarly, to get ±5 characters around 10-th:
awk 'NR==2 {print substr($0, (10-5), 11)}'
(10-5) instead of 5 is just to make the arithmetic behind the offset explicit.
Example:
% cat file.txt
>header
cgatgcgctctgtgcgtgcgtgcg
% awk 'NR==2 {print substr($0, 10, 1)}' file.txt
c
% awk 'NR==2 {print substr($0, (10-5), 11)}' file.txt
gcgctctgtgc
Use sed and cut:
sed -n '2p' file | cut -c 5-15
sed selects the second line and cut prints the desired character range.
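To make the mapping explicit (a sketch; the character range 5-15 is position 10 minus 5 through 10 plus 5, 1-indexed and inclusive):

```shell
printf '>header\ncgatgcgctctgtgcgtgcgtgcg\n' > file.txt
sed -n '2p' file.txt | cut -c 10      # -> c
sed -n '2p' file.txt | cut -c 5-15    # -> gcgctctgtgc
```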
I need to print all the lines of a CSV file whose 3rd field matches a pattern from a pattern file.
I have tried grep with no luck, because it matches against any field, not only the third:
grep -f FILE2 FILE1 > OUTPUT
FILE1
dasdas,0,00567,1,lkjiou,85249
sadsad,1,52874,0,lkjiou,00567
asdasd,0,85249,1,lkjiou,52874
dasdas,1,48555,0,gfdkjh,06793
sadsad,0,98745,1,gfdkjh,45346
asdasd,1,56321,0,gfdkjh,47832
FILE2
00567
98745
45486
54543
48349
96349
56485
19615
56496
39493
RIGHT OUTPUT
dasdas,0,00567,1,lkjiou,85249
sadsad,0,98745,1,gfdkjh,45346
WRONG OUTPUT
dasdas,0,00567,1,lkjiou,85249
sadsad,1,52874,0,lkjiou,00567 <---- I don't want this to appear
sadsad,0,98745,1,gfdkjh,45346
I have already searched everywhere and tried different formulas.
EDIT: thanks to Wintermute, I managed to write something like this:
csvquote file1.csv | awk -F '"' 'FNR == NR { patterns[$0] = 1; next } patterns[$6]' file2.csv - | csvquote -u > result.csv
csvquote helps awk parse CSV files that contain quoted fields.
Thank you very much everybody, great community!
With awk:
awk -F, 'FNR == NR { patterns[$0] = 1; next } patterns[$3]' file2 file1
This works as follows:
FNR == NR { # when processing the first file (the pattern file)
patterns[$0] = 1 # remember the patterns
next # and do nothing else
}
patterns[$3] # after that, select lines whose third field
# has been seen in the patterns.
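A self-contained run of the two-file approach on a subset of the sample data (file names are the ones from the question):

```shell
cat > file2 <<'EOF'
00567
98745
EOF
cat > file1 <<'EOF'
dasdas,0,00567,1,lkjiou,85249
sadsad,1,52874,0,lkjiou,00567
sadsad,0,98745,1,gfdkjh,45346
EOF
# Only lines whose third field appears in file2 are printed
awk -F, 'FNR == NR { patterns[$0] = 1; next } patterns[$3]' file2 file1
# -> dasdas,0,00567,1,lkjiou,85249
# -> sadsad,0,98745,1,gfdkjh,45346
```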
Using grep and sed:
grep -f <( sed -e 's/^\|$/,/g' file2) file1
dasdas,0,00567,1,lkjiou,85249
sadsad,0,98745,1,gfdkjh,45346
Explanation:
We insert a comma at the beginning and at the end of each pattern in file2 (without changing the file itself), then grep as you were already doing.
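To see what the process substitution actually feeds to grep (a sketch; note that \| alternation in a basic regex is a GNU sed extension):

```shell
printf '00567\n98745\n' | sed -e 's/^\|$/,/g'
# -> ,00567,
# -> ,98745,
```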
This can be a start:
for i in $(cat FILE2); do cut -d',' -f3 FILE1 | grep "$i"; done
sed 's#.*#/^[^,]*,[^,]*,&,/p#' File2 >/tmp/File2.sed && sed -n -f /tmp/File2.sed FILE1; rm /tmp/File2.sed
This is hard to do in plain sed compared to awk, but it works if awk is not available: the generated script prints a line when any of the patterns matches its third field.
Same idea with egrep (useful on huge files):
sed 's#.*#^[^,]*,[^,]*,&,#' File2 >/tmp/File2.egrep && egrep -f /tmp/File2.egrep FILE1; rm /tmp/File2.egrep
I have a very large file (2.5M records) with 2 columns separated by |.
I would like to keep only the records that do not contain the value "-1" in the second column and write them to a new file.
I tried to use:
grep -v "-1" norm_cats_21_07_assignments.psv > norm_cats_21_07_assignments.psv
but no luck.
For quick and dirty solution, you can simply add | to your grep:
grep -v "|-1" input.psv > output.psv
This assumes that rows to be ignored look like
something|-1
Note that if you ever need to use grep -v "-1", you have to add -- after options, otherwise grep will treat -1 as an option, something like this:
grep -v -- "-1"
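A quick check of the -- behavior (the sample lines are made up):

```shell
# Without --, grep would parse -1 as an option; with it, -1 is the pattern
printf -- '-1\nfoo\n' | grep -v -- '-1'   # -> foo
```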
You could do this with awk:
awk -F"|" '$2~/^-1$/{next}1' file > newfile
Example:
$ cat r
foo|-1
foo|bar
$ awk -F"|" '$2~/^-1$/{next}1' r
foo|bar
You can use:
awk -F'|' '$2 != "-1"' file.psv > new_file.psv
Or
awk -F'|' '$2 !~ /-1/' file.psv > new_file.psv
!= compares against the whole field, while !~ matches if the pattern occurs anywhere in it.
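A small sketch of the difference, using fabricated rows (b|-12 is the interesting case):

```shell
printf 'a|-1\nb|-12\nc|ok\n' > file.psv
awk -F'|' '$2 != "-1"' file.psv   # keeps b|-12 and c|ok
awk -F'|' '$2 !~ /-1/' file.psv   # keeps only c|ok, since "-12" contains "-1"
```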
Edit: Just noticed that your input file and output file are the same. You can't do that as the output file which is the same file would get truncated even before awk starts reading it.
With awk after making the new filtered file (e.g. new_file.psv), you can save it back by using cat new_file.psv > file.psv or mv new_file.psv file.psv.
But if you have exactly 2 columns separated by |, with no spaces in between, no quotes around fields, etc., you can use in-place editing with sed:
sed -i '/|-1/d' file.psv
Or perhaps something equivalent to awk -F'|' '$2 !~ /-1/':
sed -i '/|.*-1/d' file.psv
I need to write a script to count the number of tabs in each line of a file and print the output to a text file (e.g., output.txt).
How do I do this?
awk '{print gsub(/\t/,"")}' inputfile > output.txt
If you treat \t as the field delimiter, there will be one fewer \t than fields on each line:
awk -F'\t' '{ print NF-1 }' input.txt > output.txt
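The two approaches agree on typical input (a sketch with a fabricated two-line file); the one visible difference is a completely empty line, where NF is 0, so NF-1 prints -1 while gsub prints 0:

```shell
printf 'a\tb\tc\nno tabs here\n' > input.txt
awk '{ print gsub(/\t/,"") }' input.txt    # -> 2, then 0
awk -F'\t' '{ print NF-1 }'   input.txt    # -> 2, then 0
```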
Alternatively, delete every non-tab character and count what is left (note that \t in sed is a GNU extension):
sed 's/[^\t]//g' input.txt | awk '{ print length }' > output.txt
Based on this answer.
This will give the total number of tabs in the file (od -c renders each tab as the two-character escape \t, which the quoted pattern matches literally):
od -c infile | grep -o '\\t' | wc -l > output.txt
This will give you number of tabs line by line:
awk '{print gsub(/\t/,"")}' infile > output.txt