Finding contents of one file in another file - linux

I'm using the following shell script to find the contents of one file in another:
#!/bin/ksh
file="/home/nimish/contents.txt"
while read -r line; do
grep $line /home/nimish/another_file.csv
done < "$file"
I'm executing the script, but it does not display the matching lines from the CSV file. My contents.txt file contains numbers such as "08915673" or "123223", which are present in the CSV file as well. Is there anything wrong with what I'm doing?

grep itself is able to do so. Simply use the flag -f:
grep -f <patterns> <file>
<patterns> is a file containing one pattern per line, and <file> is the file in which you want to search.
Note that, to force grep to treat each line as a fixed string to be matched literally, even if it looks like a regular expression, you should use the flag -F, --fixed-strings.
grep -F -f <patterns> <file>
If your file is a CSV, as you said, you may do:
grep -f <(tr ',' '\n' < data.csv) <file>
As an example, consider the file "a.txt", with the following lines:
alpha
0891234
beta
Now, the file "b.txt", with the lines:
Alpha
0808080
0891234
bEtA
The output of the following command is:
grep -f "a.txt" "b.txt"
0891234
You don't need a loop here at all; grep itself offers this feature.
Now using your file names:
#!/bin/bash
patterns="/home/nimish/contents.txt"
search="/home/nimish/another_file.csv"
grep -f <(tr ',' '\n' < "${patterns}") "${search}"
You may change ',' to the separator you have in your file.
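As a small sketch of why -F matters, consider a pattern containing a regex metacharacter (hypothetical file names):

```shell
# A pattern with a dot matches too much without -F.
printf '1.23\n' > patterns.txt
printf '1x23\n1.23\n' > data.txt

# Without -F the dot is a regex wildcard: both lines match.
grep -f patterns.txt data.txt
# With -F each pattern is a literal string: only the exact line matches.
grep -F -f patterns.txt data.txt
```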

Another solution:
use awk and build your own hash (e.g. hash), entirely under your control.
Replace $0 with $i to match on whichever field you want.
awk -F"," '
{
    if (nowfile == "") { nowfile = FILENAME }
    if (FILENAME == nowfile) {
        hash[$0] = $0           # first file: remember each line
    } else if ($0 in hash) {    # second file: print lines seen in the first
        print $0
    }
}' xx yy
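A minimal runnable sketch of this two-file approach, with hypothetical files xx and yy, using the common NR==FNR idiom:

```shell
# First file holds the patterns, second file the data to filter.
printf '08915673\n123223\n' > xx
printf '999999\n08915673\nabc\n' > yy

awk '
NR == FNR { seen[$0]; next }   # first file: record each line as a key
$0 in seen                     # second file: print lines present in the first
' xx yy
```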

I don't think you really need a script to perform what you're trying to do.
One command is enough. In my case, I needed an identification number in column 11 in a CSV file (with ";" as separator):
grep -f <(awk -F";" '{print $11}' FILE_TO_EXTRACT_PATTERNS_FROM.csv) TARGET_FILE.csv


Replace pattern in one column bash

I have multiple *.csv files that, concatenated with cat, look like:
#sample,time,N
SPH-01-HG00186-1_R1_001,8.33386,93
SPH-01-HG00266-1_R1_001,7.41229,93
SPH-01-HG00274-1_R1_001,7.63903,93
SPH-01-HG00276-1_R1_001,7.94798,93
SPH-01-HG00403-1_R1_001,7.99299,93
SPH-01-HG00404-1_R1_001,8.38001,93
And I'm trying to wrangle the concatenated CSV into:
#sample,time,N
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93
I did:
for i in $(ls *csv); do line=$(cat ${i} | grep -v "#" | cut -d'-' -f3); sed 's/*${line}*/${line}/g'; done
Yet no result showed up... Any advice on how to do this? Thanks.
With awk and the logic of splitting each line by , then split their first field by -:
awk -v FS=',' -v OFS=',' 'NR > 1 { split($1,w,"-"); $1 = w[3] } 1' file.csv
With sed and a robust regex that cannot possibly modify the other fields:
sed -E 's/^([^,-]*-){2}([^,-]*)[^,]*/\2/' file.csv
# or
sed -E 's/^(([^,-]*)-){3}[^,]*/\2/' file.csv
Use this Perl one-liner:
perl -i -pe 's{.*?-.*?-(.*?)-.*?,}{$1,}' *.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak (you can omit .bak, to avoid creating any backup files).
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
You can use
sed -E 's/^[^-]+-[0-9]+-([^-]+)[^,]+/\1/' file > newfile
Details:
-E - enabling the POSIX ERE regex flavor
^[^-]+-[0-9]+-([^-]+)[^,]+ - the regex pattern that searches for
^ - start of string
[^-]+ - one or more non-hyphen chars
- - a hyphen
[0-9]+ - one or more digits
- - a hyphen
([^-]+) - Group 1: one or more non-hyphens
[^,]+ - one or more non-comma chars
\1 - replace the match with Group 1 value.
For example:
#!/bin/bash
s='SPH-01-HG00186-1_R1_001,8.33386,93
SPH-01-HG00266-1_R1_001,7.41229,93
SPH-01-HG00274-1_R1_001,7.63903,93
SPH-01-HG00276-1_R1_001,7.94798,93
SPH-01-HG00403-1_R1_001,7.99299,93
SPH-01-HG00404-1_R1_001,8.38001,93'
sed -E 's/^[^-]+-[0-9]+-([^-]+)[^,]+/\1/' <<< "$s"
Output:
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93
You can mangle text using bash parameter expansion, without resorting to external tools like awk and sed:
IFS=","
while read -r -a line; do
x="${line[0]%-*}"
x="${x##*-}"
printf "%s,%s,%s\n" "$x" "${line[1]}" "${line[2]}"
done < input.txt
Or you could do it with simple awk, as others have done.
awk '{print $3,$5,$6}' FS='[-,]' OFS=, < input.txt
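The two parameter expansions used in the loop can be checked on a single sample value:

```shell
# Strip from the last hyphen onward, then keep only what follows
# the (new) last hyphen: SPH-01-HG00186-1_R1_001 -> HG00186.
s='SPH-01-HG00186-1_R1_001'
x="${s%-*}"    # SPH-01-HG00186
x="${x##*-}"   # HG00186
echo "$x"
```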
If you must use cut, then I suggest the following solution. Let file.txt's content be:
#sample,time,N
SPH-01-HG00186-1_R1_001,8.33386,93
SPH-01-HG00266-1_R1_001,7.41229,93
SPH-01-HG00274-1_R1_001,7.63903,93
SPH-01-HG00276-1_R1_001,7.94798,93
SPH-01-HG00403-1_R1_001,7.99299,93
SPH-01-HG00404-1_R1_001,8.38001,93
then
head -1 file.txt && tail -6 file.txt | tr '-' ',' | cut --delimiter=',' --fields=3,5,6
gives output
#sample,time,N
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93
Explanation: output the 1st line as-is using head, then pipe the last 6 lines into tr to replace '-' with ',', and finally use cut with the ',' delimiter to select the desired fields.
{m,n,g}awk NF++ FS='^[^-]+-[^-]+-|-[^,]+' OFS=
Here FS consumes both the leading "SPH-01-" prefix and the trailing "-1_R1_001" suffix as field separators, and NF++ forces awk to rebuild the record using the empty OFS. Output:
#sample,time,N
HG00186,8.33386,93
HG00266,7.41229,93
HG00274,7.63903,93
HG00276,7.94798,93
HG00403,7.99299,93
HG00404,8.38001,93

Bash: Move specific line of content in file to the top of the file

I'm making a script (in vim) that goes through a file and looks for a specific line. If the line matches, I want it to be moved to the top of the file.
The file looks something like this:
6596628c9cbab49b80d6a07d0304377768f5114e7f8b21edffa820aab1c508be, ./favicon.ico
150a04dd76f733e5ef05ece49de115d05f71efa8e73025e015dd4e0fb3217553, ./about.php
28acfc4b0d378c22a2c0e4913cae3d15aef9b21938de81be92f74aef85b0cc0e, ./info.php
67976bbbd1b62a00d454da3a9f95e72d97d0fd156d4c65a12707f2602cbea582, ./missing.php
6ed318718f4cc617c82121db7cde54188eac6f89c355f0bfe9d198218de7fffc, ./browse.php
abd277cc3453be980bb48cbffe9d1f7422ca1ef4bc0b7d035fda87cea4d55cbc, ./composer.phar
73ac79eccac12120dc601cd6cce1282a1d8a920d440d3d1141d257db1ed4b0f0, ./search.php
f412aabd74f4c99bd32c5e534132c565f52c2bd32fbf7f629eb5a4495ac46351, ./index.php
c2d49a4873088fbe635d8653494f7f1425b6ad9f55d63ee4de52170d8a8d01b8, ./content/style.css
18e7d61367d80bc125b309ac002bb3946c5e7ba419ef59537afc939eff799dfd, ./content/logo.png
d8da15f62d55641320f7e7c21d9be86db6d81f7667bbd35c738b4c917cad3ce9, ./robots.txt
How would I be able to move the content on line 8 (index.php) to the top so it looks like this:
f412aabd74f4c99bd32c5e534132c565f52c2bd32fbf7f629eb5a4495ac46351, ./index.php
6596628c9cbab49b80d6a07d0304377768f5114e7f8b21edffa820aab1c508be, ./favicon.ico
150a04dd76f733e5ef05ece49de115d05f71efa8e73025e015dd4e0fb3217553, ./about.php
28acfc4b0d378c22a2c0e4913cae3d15aef9b21938de81be92f74aef85b0cc0e, ./info.php
67976bbbd1b62a00d454da3a9f95e72d97d0fd156d4c65a12707f2602cbea582, ./missing.php
6ed318718f4cc617c82121db7cde54188eac6f89c355f0bfe9d198218de7fffc, ./browse.php
abd277cc3453be980bb48cbffe9d1f7422ca1ef4bc0b7d035fda87cea4d55cbc, ./composer.phar
73ac79eccac12120dc601cd6cce1282a1d8a920d440d3d1141d257db1ed4b0f0, ./search.php
c2d49a4873088fbe635d8653494f7f1425b6ad9f55d63ee4de52170d8a8d01b8, ./content/style.css
18e7d61367d80bc125b309ac002bb3946c5e7ba419ef59537afc939eff799dfd, ./content/logo.png
d8da15f62d55641320f7e7c21d9be86db6d81f7667bbd35c738b4c917cad3ce9, ./robots.txt
(The file contains about 9000 lines)
How can this be done most efficiently?
I think you can use sed to do the operations needed.
This command will get the content of the file at a specific line:
sed -n "$LINE_NUMBER"p "$FILENAME"
This command will delete the content of the file at the specified line:
sed -i.bak -e "$LINE_NUMBER"'d' "$FILENAME"
And finally, this command will insert a string at the top of the file:
sed -i -e "1i${LINE_CONTENT}" "$FILENAME"
Combining the 3 commands, you can perform the operation mentioned in your question (test.log is the file that contains your sample input):
FILENAME="test.log"
LINE_NUMBER=8
LINE_CONTENT=$(sed -n "$LINE_NUMBER"p "$FILENAME")
sed -i.bak -e "$LINE_NUMBER"'d' "$FILENAME"
sed -i -e "1i${LINE_CONTENT}" "$FILENAME"
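As a quick sanity check, the same three steps on a hypothetical 4-line file (GNU sed assumed, for -i and the one-line 1i form):

```shell
# Move line 3 ("c") of a 4-line file to the top.
printf 'a\nb\nc\nd\n' > test.log
FILENAME="test.log"
LINE_NUMBER=3
LINE_CONTENT=$(sed -n "$LINE_NUMBER"p "$FILENAME")   # grab line 3
sed -i.bak -e "$LINE_NUMBER"'d' "$FILENAME"          # delete line 3
sed -i -e "1i${LINE_CONTENT}" "$FILENAME"            # insert it before line 1
cat "$FILENAME"
```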
With ed, this will do in-place editing as well:
printf '8m0\nwq\n' | ed -s ip.txt -
Or, if you don't know line number:
printf '/index\.php/m0\nwq\n' | ed -s ip.txt -
Supposing your file name is test.txt, you can try this:
(Please backup your file first)
grep "./index.php" test.txt > out && grep -v "./index.php" test.txt >> out && mv out test.txt
To do this job reading the input file just once without reading the whole file into memory, and using any awk in any shell on every Unix box all you need is either of these:
awk 'f{print; next} NR==8{print $0 buf; f=1} {buf=buf ORS $0}' file
awk 'f{print; next} $2=="./index.php"{print $0 buf; f=1} {buf=buf ORS $0}' file
If you want to move the 8th line to the top:
$ awk 'f{print; next} NR==8{print $0 buf; f=1} {buf=buf ORS $0}' file
f412aabd74f4c99bd32c5e534132c565f52c2bd32fbf7f629eb5a4495ac46351, ./index.php
6596628c9cbab49b80d6a07d0304377768f5114e7f8b21edffa820aab1c508be, ./favicon.ico
150a04dd76f733e5ef05ece49de115d05f71efa8e73025e015dd4e0fb3217553, ./about.php
28acfc4b0d378c22a2c0e4913cae3d15aef9b21938de81be92f74aef85b0cc0e, ./info.php
67976bbbd1b62a00d454da3a9f95e72d97d0fd156d4c65a12707f2602cbea582, ./missing.php
6ed318718f4cc617c82121db7cde54188eac6f89c355f0bfe9d198218de7fffc, ./browse.php
abd277cc3453be980bb48cbffe9d1f7422ca1ef4bc0b7d035fda87cea4d55cbc, ./composer.phar
73ac79eccac12120dc601cd6cce1282a1d8a920d440d3d1141d257db1ed4b0f0, ./search.php
c2d49a4873088fbe635d8653494f7f1425b6ad9f55d63ee4de52170d8a8d01b8, ./content/style.css
18e7d61367d80bc125b309ac002bb3946c5e7ba419ef59537afc939eff799dfd, ./content/logo.png
d8da15f62d55641320f7e7c21d9be86db6d81f7667bbd35c738b4c917cad3ce9, ./robots.txt
If you want to move the line containing "index" to the top:
$ awk 'f{print; next} $2=="./index.php"{print $0 buf; f=1} {buf=buf ORS $0}' file
f412aabd74f4c99bd32c5e534132c565f52c2bd32fbf7f629eb5a4495ac46351, ./index.php
6596628c9cbab49b80d6a07d0304377768f5114e7f8b21edffa820aab1c508be, ./favicon.ico
150a04dd76f733e5ef05ece49de115d05f71efa8e73025e015dd4e0fb3217553, ./about.php
28acfc4b0d378c22a2c0e4913cae3d15aef9b21938de81be92f74aef85b0cc0e, ./info.php
67976bbbd1b62a00d454da3a9f95e72d97d0fd156d4c65a12707f2602cbea582, ./missing.php
6ed318718f4cc617c82121db7cde54188eac6f89c355f0bfe9d198218de7fffc, ./browse.php
abd277cc3453be980bb48cbffe9d1f7422ca1ef4bc0b7d035fda87cea4d55cbc, ./composer.phar
73ac79eccac12120dc601cd6cce1282a1d8a920d440d3d1141d257db1ed4b0f0, ./search.php
c2d49a4873088fbe635d8653494f7f1425b6ad9f55d63ee4de52170d8a8d01b8, ./content/style.css
18e7d61367d80bc125b309ac002bb3946c5e7ba419ef59537afc939eff799dfd, ./content/logo.png
d8da15f62d55641320f7e7c21d9be86db6d81f7667bbd35c738b4c917cad3ce9, ./robots.txt
To modify the original file if you have GNU awk:
awk -i inplace '...script...' file
or with any awk:
awk '...script...' file > tmp && mv tmp file
This might work for you (GNU sed):
sed -E '1,8{H;8!d;p;x;s/^(.)(.*)\1.*/\2/}' file
Create a range of lines in the hold space from the first to the line number to be inserted at the start of the file (in the example above line 8).
When the current line is the desired line, print it, then swap to the hold space, remove the newline introduced at the start of the range along with the last line appended (the desired line), and print the remainder too.
All other lines will be printed as normal.
Alternative:
sed -E '1{:a;N;8!ba;s/(.*)(\n)(.*)/\3\2\1/}' file
Yet another quirky solution:
sed '1,7{H;d};8{p;x;D}' file

Replace filename to a string of the first line in multiple files in bash

I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work, I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!
you can use something like this:
grep '>' *.fasta | while read -r line ; do
new_name="$(echo $line | cut -d' ' -f 6)"
old_name="$(echo $line | cut -d':' -f 1)"
mv $old_name "$new_name.fasta"
done
It searches the *.fasta files and handles every matched line:
it splits each grep result by spaces and takes the 6th element as the new name
it splits each grep result by : and takes the first element as the old name
it moves/renames the old filename to the new filename
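As a sanity check, the two cut extractions can be tried on one hypothetical line of grep output:

```shell
# A grep match line has the form "filename:matched line".
line='File_1.fasta:>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds'
new_name="$(echo "$line" | cut -d' ' -f 6)"   # 6th space-separated word
old_name="$(echo "$line" | cut -d':' -f 1)"   # text before the first colon
echo "$old_name -> $new_name.fasta"
```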
There are several things going on with this code.
For a start, I don't actually get this particular error; that might be down to different versions.
It may come down to how grep ends up interpreting '>' after shell expansion. I would suggest using "\>" instead.
Secondly:
fname=$("echo $match|awk '{print $6}'")
The inner quotes serve an unintended purpose. Your code should look like this, if anything:
fname="$(echo $match|awk '{print $6}')"
Lastly, to properly retrieve your data, this should be your final code:
grep -Hr "\>" | while read -r match; do
fname="$(echo "$match" | cut -d: -f1)"
new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
echo mv "$fname" "$new_fname"
done
Explanations:
grep -H -> you want grep to explicitly include the filename, just in case some shell environment decides to alias grep to grep -h (no filenames)
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
Although, I don't see why you would search for '>' rather than directly for 'sP', as such:
grep -Hro "sP[0-9]*" | while read -r match; do
This is not the exact same behaviour, and has different edge cases, but it just might work for you.
Quite straightforward in (g)awk :
create a file "script.awk":
FNR == 1 {
    for (i = 1; i <= NF; i++) {
        if (index($i, "sP") == 1) {
            print "mv", FILENAME, $i ".fasta"
            nextfile
        }
    }
}
use it :
awk -f script.awk *.fasta > cmmd.txt
Check the content of the output:
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
If it looks OK, execute the renames with: . cmmd.txt
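Assuming a hypothetical File_1.fasta, the script's logic can be exercised inline:

```shell
# Single-file check: print the rename command derived from the
# first header word starting with "sP".
printf '>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds\nACGT\n' > File_1.fasta
awk 'FNR == 1 { for (i = 1; i <= NF; i++) if (index($i, "sP") == 1) { print "mv", FILENAME, $i ".fasta"; nextfile } }' File_1.fasta
```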
For all fasta files in the directory, search the first line for the first word starting with sP and rename the file using that word as the basename.
Using a bash array:
for f in *.fasta; do
arr=( $(head -1 "$f") )
for word in "${arr[@]}"; do
[[ "$word" =~ ^sP ]] && echo mv "$f" "${word}.fasta" && break
done
done
or using grep:
for f in *.fasta; do
word=$(head -1 "$f" | grep -o "\bsP\w*")
[ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.
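The grep variant can be checked on a hypothetical header file (the \b and \w escapes are GNU grep extensions):

```shell
# Pull the sP-word out of the first line of a fasta file.
printf '>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds\n' > File_2.fasta
word=$(head -1 File_2.fasta | grep -o "\bsP\w*")
echo "$word"
```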

Write a file using AWK on linux

I have a file that has several lines of which one line is
-xxxxxxxx()xxxxxxxx
I want to add the contents of this line to a new file
I did this :
awk ' /^-/ {system("echo" $0 ">" "newline.txt")} '
but this does not work , it returns an error that says :
Unnexpected token '('
I believe this is due to the () present in the line. How to overcome this issue?
You need to add proper spaces!
With your erroneous awk ' /^-/ {system("echo" $0 ">" "newline.txt")} ', the shell command is essentially echo-xxxxxxxx()xxxxxxxx>newline.txt, which surely doesn't work. You need to construct a proper shell command inside the awk string and obey awk's string-concatenation rules, i.e. your intended script should look like this (which is still broken, because $0 is not properly quoted in the resulting shell command):
awk '/^-/ { system("echo " $0 " > newline.txt") }'
However, if you really just need to echo $0 into a file, you can simply do:
awk '/^-/ { print $0 > "newline.txt" }'
Or even more simply:
awk '/^-/' file > newline.txt
This applies the default action to all records matching /^-/; the default action is to print the current record, so the script simply filters out the desired records. The > newline.txt redirection outside awk puts them into a file.
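A minimal check, on a hypothetical input.txt, that the in-awk redirection captures the "-" line, parentheses and all:

```shell
printf 'foo\n-xxxxxxxx()xxxxxxxx\nbar\n' > input.txt
# No shell is involved, so "()" in the data needs no quoting tricks.
awk '/^-/ { print $0 > "newline.txt" }' input.txt
cat newline.txt
```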
You don't need the system, echo commands, simply:
awk '/^-/ {print $1}' file > newfile
This will capture lines starting with - and truncate the rest if there's a space.
awk '/^-/ {print $0}' file > newfile
Would capture the entire line including spaces.
You could use grep also:
grep -o '^-.*' file > newfile
Captures any lines starting with -
grep -o '^-.*().*' file > newfile
Would be more specific and capture lines starting with - also containing ()
First of all, for simple extraction of patterns from a file you do not need awk; it is overkill. grep is more than enough for the task:
INPUT:
$ more file
123
-xxxxxxxx()xxxxxxxx
abc
-xyxyxxux()xxuxxuxx
123
abc
123
command:
$ grep -oE '^-[^(]+\(\).*' file
-xxxxxxxx()xxxxxxxx
-xyxyxxux()xxuxxuxx
explanations:
Option: -oE to define the output as the pattern and not the whole line (can be removed)
Regex: ^-[^(]+\(\).* will select lines that start with - and contain ()
You can redirect your output to a new_file by adding > new_file at the end of your command.

Iterative Bash Script Bug

Using a bash script, I'm trying to iterate through a text file that only has around 700 words, line-by-line, and run a case-insensitive grep search in the current directory using that word on particular files. To break it down, I'm trying to output the following to a file:
Append a newline to a file, then the searched word, then another newline
Append the results of the grep command using that search
Repeat steps 1 and 2 until all words in the list are exhausted
So for example, if I had this list.txt:
search1
search2
I'd want the results.txt to be:
search1:
grep result here
search2:
grep result here
I've found some answers throughout the stack exchanges on how to do this and have come up with the following implementation:
#!/usr/bin/bash
while IFS = read -r line;
do
"\n$line:\n" >> "results.txt";
grep -i "$line" *.in >> "results.txt";
done < "list.txt"
For some reason, however, this (and the numerous variants I've tried) isn't working. Seems trivial, but it's been frustrating me beyond belief. Any help is appreciated.
Your script would work if you changed it to:
while IFS= read -r line; do
printf '\n%s:\n' "$line"
grep -i "$line" *.in
done < list.txt > results.txt
but it'd be extremely slow. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice for why you should think long and hard before writing a shell loop just to manipulate text. The standard UNIX tool for manipulating text is awk:
awk '
NR==FNR { words2matches[$0]; next }
{
for (word in words2matches) {
if ( index(tolower($0),tolower(word)) ) {
words2matches[word] = words2matches[word] $0 ORS
}
}
}
END {
for (word in words2matches) {
print word ":" ORS words2matches[word]
}
}
' list.txt *.in > results.txt
The above is untested of course since you didn't provide sample input/output we could test against.
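Here is a minimal run of the same script against hypothetical list.txt and sample.in fixtures:

```shell
# Tiny fixture: two search words, one sample input file.
printf 'search1\nsearch2\n' > list.txt
printf 'has Search1 here\nnothing\n' > sample.in
awk '
NR==FNR { words2matches[$0]; next }      # first file: collect the words
{
    for (word in words2matches)
        if ( index(tolower($0), tolower(word)) )   # case-insensitive substring match
            words2matches[word] = words2matches[word] $0 ORS
}
END {
    for (word in words2matches)
        print word ":" ORS words2matches[word]
}
' list.txt sample.in
```

Note that for-in iteration order over the words is unspecified in awk.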
Possible problems:
bash path - use /bin/bash path instead of /usr/bin/bash
blank spaces - remove ' ' after IFS
echo - use -e option for handling escape characters (here: '\n')
semicolons - not required at end of line
Try following script:
#!/bin/bash
while IFS= read -r line; do
echo -e "$line:\n" >> "results.txt"
grep -i "$line" *.in >> "results.txt"
done < "list.txt"
You do not even need to write a bash script for this purpose:
INPUT FILES:
$ more file?.in
::::::::::::::
file1.in
::::::::::::::
abc
search1
def
search3
::::::::::::::
file2.in
::::::::::::::
search2
search1
abc
def
::::::::::::::
file3.in
::::::::::::::
abc
search1
search2
def
search3
PATTERN FILE:
$ more patterns
search1
search2
search3
CMD:
$ grep -inf patterns file*.in | sort -t':' -k3 | awk -F':' 'BEGIN{OFS=FS}{if($3==buffer){print $1,$2}else{print $3; print $1,$2}buffer=$3}'
OUTPUT:
search1
file1.in:2
file2.in:2
file3.in:2
search2
file2.in:1
file3.in:3
search3
file1.in:4
file3.in:5
EXPLANATIONS:
grep -inf patterns file*.in greps all the file*.in files for all the patterns located in the patterns file, thanks to the -f option; -i forces case-insensitive matching and -n adds the line numbers
sort -t':' -k3 sorts the output on the 3rd column to group the patterns together
awk -F':' 'BEGIN{OFS=FS}{if($3==buffer){print $1,$2}else{print $3; print $1,$2}buffer=$3}' then prints the desired layout, using : as both field separator and output field separator; the buffer variable saves the pattern (3rd field), so the pattern header is printed only when it changes ($3 != buffer)
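A hypothetical mini-run of the whole pipeline, with a one-pattern patterns file and two small .in files:

```shell
printf 'search1\n' > patterns
printf 'abc\nsearch1\n' > file1.in
printf 'search1\nxyz\n' > file2.in
# Pattern header once, then file:line for each match.
grep -inf patterns file1.in file2.in | sort -t':' -k3 \
  | awk -F':' 'BEGIN{OFS=FS}{if($3==buffer){print $1,$2}else{print $3; print $1,$2}buffer=$3}'
```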
