How to get 1st field of a file only when 2nd field matches a given string?
$ cat temp.txt
Ankit pass
amit pass
aman fail
abhay pass
asha fail
ashu fail
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'*
gives no output
Another syntax with awk:
awk '$2 ~ /^fail$/{print $1}' input_file
This also avoids a useless 'cat' command.
^ matches the start of the string
$ matches the end of the string
Anchoring both ends is the reliable way to match the pattern exactly.
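A quick illustration of why the anchors matter, using made-up names (alice/bob are assumptions, not from the original data):

```shell
# Without anchors, /fail/ also matches "failed"; with ^...$ it must be exact.
printf 'alice failed\nbob fail\n' | awk '$2 ~ /fail/  {print $1}'   # prints alice and bob
printf 'alice failed\nbob fail\n' | awk '$2 ~ /^fail$/{print $1}'   # prints only bob
```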
Either:
- your fields are not tab-separated, or
- you have blanks at the end of the relevant lines, or
- you have DOS line endings, so there is a CR at the end of every line (and therefore at the end of every $2); see "Why does my tool output overwrite itself and how do I fix it?"
With GNU cat you can run cat -Tev temp.txt to see tabs (^I), CRs (^M) and line endings ($).
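A minimal sketch of such a diagnosis, assuming GNU coreutils; the sample line is fabricated to contain a tab and a DOS line ending:

```shell
# -T shows tabs as ^I, -e shows CRs as ^M and marks line ends with $
printf 'aman\tfail\r\n' | cat -Tev    # -> aman^Ifail^M$
# Stripping the CRs first makes the original awk command work again
printf 'aman\tfail\r\n' | tr -d '\r' | awk -F'\t' '$2 == "fail" { print $1 }'   # -> aman
```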
Your code seems to work fine when I remove the * at the end
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'
The other thing to check is if your file is using tab or spaces. My copy/paste of your data file copied spaces, so I needed this line:
cat temp.txt | awk '$2 == "fail" { print $1 }'
The other way of doing this is with grep:
cat temp.txt | grep 'fail$' | awk '{ print $1 }'
First, I need to extract the character at a known position in file.txt using bash, where the position refers to the second line:
>header
cgatgcgctctgtgcgtgcgtgcg
so let's assume I want position 10 from the second line, the output should be:
c
second, I want to include the surrounding ±5 characters, resulting in
gcgctctgtgc
{ read -r; read -r; echo "${REPLY:9:1}"; echo "${REPLY:4:11}"; } < file.txt
Output:
c
gcgctctgtgc
The ${parameter:offset:length} syntax for substrings is explained in https://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion.
The read command is explained in https://www.gnu.org/software/bash/manual/bashref.html#index-read.
Input redirection: https://www.gnu.org/software/bash/manual/bashref.html#Redirections.
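A generalized sketch of the same idea, with the position and window pulled out into variables (pos and window are hypothetical names, and the file contents are recreated for the example):

```shell
#!/usr/bin/env bash
pos=10     # 1-indexed position in the sequence line
window=5   # number of characters to keep on each side
printf '>header\ncgatgcgctctgtgcgtgcgtgcg\n' > file.txt
{
  read -r                                   # consume the header line
  read -r                                   # REPLY now holds the second line
  echo "${REPLY:pos-1:1}"                   # the single character at pos
  echo "${REPLY:pos-1-window:2*window+1}"   # pos with +/- window characters
} < file.txt
```

Bash evaluates the offset and length as arithmetic expressions, so no separate $((...)) is needed.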
With awk:
To get the character at position 10, 1-indexed:
awk 'NR==2 {print substr($0, 10, 1)}'
NR==2 checks whether the current record (line) is the second one; if so, the statements inside {} are executed.
substr($0, 10, 1) will extract 1 character starting from position 10 from field $0 (the whole record) i.e. only the 10-th character will be extracted. The format for substr() is substr(field, offset, length).
Similarly, to get ±5 characters around 10-th:
awk 'NR==2 {print substr($0, (10-5), 11)}'
(10-5) instead of 5 is just to make the arithmetic behind the offset explicit.
Example:
% cat file.txt
>header
cgatgcgctctgtgcgtgcgtgcg
% awk 'NR==2 {print substr($0, 10, 1)}' file.txt
c
% awk 'NR==2 {print substr($0, (10-5), 11)}' file.txt
gcgctctgtgc
Use sed and cut:
sed -n '2p' file | cut -c 5-15
sed selects the second line and cut prints the desired character range.
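To make the mapping explicit (a sketch; the character range 5-15 is position 10 minus 5 through 10 plus 5, 1-indexed and inclusive):

```shell
printf '>header\ncgatgcgctctgtgcgtgcgtgcg\n' > file.txt
sed -n '2p' file.txt | cut -c 10      # -> c
sed -n '2p' file.txt | cut -c 5-15    # -> gcgctctgtgc
```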
I need to print all the lines of a CSV file whose 3rd field matches a pattern from a pattern file.
I have tried grep with no luck, because it matches against any field, not only the third:
grep -f FILE2 FILE1 > OUTPUT
FILE1
dasdas,0,00567,1,lkjiou,85249
sadsad,1,52874,0,lkjiou,00567
asdasd,0,85249,1,lkjiou,52874
dasdas,1,48555,0,gfdkjh,06793
sadsad,0,98745,1,gfdkjh,45346
asdasd,1,56321,0,gfdkjh,47832
FILE2
00567
98745
45486
54543
48349
96349
56485
19615
56496
39493
RIGHT OUTPUT
dasdas,0,00567,1,lkjiou,85249
sadsad,0,98745,1,gfdkjh,45346
WRONG OUTPUT
dasdas,0,00567,1,lkjiou,85249
sadsad,1,52874,0,lkjiou,00567 <---- I don't want this to appear
sadsad,0,98745,1,gfdkjh,45346
I have already searched everywhere and tried different formulas.
EDIT: thanks to Wintermute, I managed to write something like this:
csvquote file1.csv | awk -F '"' 'FNR == NR { patterns[$0] = 1; next } patterns[$6]' file2.csv - | csvquote -u > result.csv
csvquote helps awk parse CSV files that contain quoted fields.
Thank you very much everybody, great community!
With awk:
awk -F, 'FNR == NR { patterns[$0] = 1; next } patterns[$3]' file2 file1
This works as follows:
FNR == NR { # when processing the first file (the pattern file)
patterns[$0] = 1 # remember the patterns
next # and do nothing else
}
patterns[$3] # after that, select lines whose third field
# has been seen in the patterns.
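A self-contained run of the two-file approach on a subset of the sample data (file names are the ones from the question):

```shell
cat > file2 <<'EOF'
00567
98745
EOF
cat > file1 <<'EOF'
dasdas,0,00567,1,lkjiou,85249
sadsad,1,52874,0,lkjiou,00567
sadsad,0,98745,1,gfdkjh,45346
EOF
# Only lines whose third field appears in file2 are printed
awk -F, 'FNR == NR { patterns[$0] = 1; next } patterns[$3]' file2 file1
# -> dasdas,0,00567,1,lkjiou,85249
# -> sadsad,0,98745,1,gfdkjh,45346
```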
Using grep and sed:
grep -f <( sed -e 's/^\|$/,/g' file2) file1
dasdas,0,00567,1,lkjiou,85249
sadsad,0,98745,1,gfdkjh,45346
Explanation:
We insert a comma at the beginning and at the end of each pattern in file2 (without changing the file itself), then grep as you were already doing.
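To see what the process substitution actually feeds to grep (a sketch; note that \| alternation in a basic regex is a GNU sed extension):

```shell
printf '00567\n98745\n' | sed -e 's/^\|$/,/g'
# -> ,00567,
# -> ,98745,
```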
This can be a start:
for i in $(cat FILE2); do cut -d',' -f3 FILE1 | grep "$i"; done
sed 's#.*#/^[^,]*,[^,]*,&,/p#' File2 >/tmp/File2.sed && sed -n -f /tmp/File2.sed FILE1; rm /tmp/File2.sed
This is hard to do in plain sed compared to awk, but it works if awk is not available: the generated script prints a line when any of the patterns matches its third field.
Same idea with egrep (useful on huge files):
sed 's#.*#^[^,]*,[^,]*,&,#' File2 >/tmp/File2.egrep && egrep -f /tmp/File2.egrep FILE1; rm /tmp/File2.egrep
I have a very large file (2.5M records) with 2 columns separated by |.
I would like to keep only the records that do not contain the value "-1" in the second column and write them to a new file.
I tried to use:
grep -v "-1" norm_cats_21_07_assignments.psv > norm_cats_21_07_assignments.psv
but no luck.
For quick and dirty solution, you can simply add | to your grep:
grep -v "|-1" input.psv > output.psv
This assumes that rows to be ignored look like
something|-1
Note that if you ever need to use grep -v "-1", you have to add -- after options, otherwise grep will treat -1 as an option, something like this:
grep -v -- "-1"
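A quick check of the -- behavior (the sample lines are made up):

```shell
# Without --, grep would parse -1 as an option; with it, -1 is the pattern
printf -- '-1\nfoo\n' | grep -v -- '-1'   # -> foo
```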
You could do this with awk:
awk -F"|" '$2~/^-1$/{next}1' file > newfile
Example:
$ cat r
foo|-1
foo|bar
$ awk -F"|" '$2~/^-1$/{next}1' r
foo|bar
You can use:
awk -F'|' '$2 != "-1"' file.psv > new_file.psv
Or
awk -F'|' '$2 !~ /-1/' file.psv > new_file.psv
!= compares against the whole field, while !~ matches if the pattern occurs anywhere in it.
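A small sketch of the difference, using fabricated rows (b|-12 is the interesting case):

```shell
printf 'a|-1\nb|-12\nc|ok\n' > file.psv
awk -F'|' '$2 != "-1"' file.psv   # keeps b|-12 and c|ok
awk -F'|' '$2 !~ /-1/' file.psv   # keeps only c|ok, since "-12" contains "-1"
```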
Edit: Just noticed that your input file and output file are the same. You can't do that as the output file which is the same file would get truncated even before awk starts reading it.
With awk after making the new filtered file (e.g. new_file.psv), you can save it back by using cat new_file.psv > file.psv or mv new_file.psv file.psv.
But if you have exactly 2 columns separated by |, with no spaces in between, no quotes around fields, etc., you can use in-place editing with sed:
sed -i '/|-1/d' file.psv
Or perhaps something equivalent to awk -F'|' '$2 !~ /-1/':
sed -i '/|.*-1/d' file.psv
I need to write a script to count the number of tabs in each line of a file and print the output to a text file (e.g., output.txt).
How do I do this?
awk '{print gsub(/\t/,"")}' inputfile > output.txt
If you treat \t as the field delimiter, there will be one fewer \t than fields on each line:
awk -F'\t' '{ print NF-1 }' input.txt > output.txt
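The two approaches agree on typical input (a sketch with a fabricated two-line file); the one visible difference is a completely empty line, where NF is 0, so NF-1 prints -1 while gsub prints 0:

```shell
printf 'a\tb\tc\nno tabs here\n' > input.txt
awk '{ print gsub(/\t/,"") }' input.txt    # -> 2, then 0
awk -F'\t' '{ print NF-1 }'   input.txt    # -> 2, then 0
```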
Alternatively, delete every non-tab character and count what is left (note that \t in sed is a GNU extension):
sed 's/[^\t]//g' input.txt | awk '{ print length }' > output.txt
Based on this answer.
This will give the total number of tabs in the file (od -c renders each tab as the two-character escape \t, which the quoted pattern matches literally):
od -c infile | grep -o '\\t' | wc -l > output.txt
This will give you number of tabs line by line:
awk '{print gsub(/\t/,"")}' infile > output.txt