Linux: copy files whose first line contains "genome" to another directory

I have many files in a directory, and I want to copy the ones whose first line contains the word "genome" to a new folder. How should I do that? I can match the lines, but I do not know how to act on the matching files afterwards.
I built a bash loop like this:
for i in *
do
if [ sed -ne 'genome' $i ]
then cp $i OTHERDIR
fi
done
But it seems the if statement is very limited and cannot take a sed command the way other programming languages would.

Try this:
for i in *
do
head -1 "$i" | grep -q "genome" && cp "$i" OTHERDIR
done
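The && works because the shell's if and && both just test a command's exit status, so the same check can be written with an explicit if. A minimal sketch (grep -q suppresses the output; OTHERDIR stands for the destination, as above):
for i in *
do
# grep -q exits 0 when the first line matches, printing nothing
if head -1 "$i" | grep -q "genome"
then cp "$i" OTHERDIR
fi
done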

You can use head -1 to peek at the first line:
for i in *
do
fline=$(head -1 "$i")
if [[ "$fline" == *genome* ]]
then cp "$i" OTHERDIR
fi
done
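If this has to run under plain POSIX sh, where [[ ]] is unavailable, a case statement performs the same containment test. A sketch with the same OTHERDIR placeholder:
for i in *
do
# the *genome* pattern matches when the first line contains "genome"
case "$(head -1 "$i")" in
*genome*) cp "$i" OTHERDIR ;;
esac
done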

First of all, you seem to make no attempt to check only the first line. Your expression should use head to isolate the first line, as the other posters mentioned.
Anyway, you can accomplish this task with sed or grep inside the if condition. Since a pipeline is involved, capture its output with $() and test whether it is non-empty:
With sed (you don't need the -e, it is the default):
if [ -n "$(head -1 "$i" | sed -n '/genome/=')" ]
With grep (probably the easier solution; sed is overpowered for this job):
if [ -n "$(head -1 "$i" | grep genome)" ]
http://www.unix.com/unix-dummies-questions-answers/65705-how-grep-only-1st-line.html

Amazingly, I tried the following script myself and it works. I haven't checked the posted answers yet.
for i in *
do
grep -l "genome" $i | xargs -I one cp one DESTDIR
done
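Note that grep -l matches "genome" anywhere in the file, not only on the first line. To keep the xargs pipeline but honor the first-line requirement, GNU awk can emit only the qualifying filenames. A sketch (DESTDIR as above):
# print each file's name if its first line matches, then skip to the next file
gawk 'FNR==1{if (/genome/) print FILENAME; nextfile}' * | xargs -I one cp one DESTDIR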

Related

How to apply my sed command to some lines of all my files?

I have 95 files that look like:
2019-10-29-18-00/dev/xx;512.00;0.4;/var/x/xx/xxx
2019-10-29-18-00/dev/xx;512.00;0.68;/xx
2019-10-29-18-00/dev/xx;512.00;1.84;/xx/xx/xx
2019-10-29-18-00/dev/xx;512.00;80.08;/opt/xx/x
2019-10-29-18-00/dev/xx;20480.00;83.44;/var/x/x
2019-10-29-18-00/dev/xx;3584.00;840.43;/var/xx/x
2019-10-30-00-00/dev/xx;2048.00;411.59;/
2019-10-30-00-00/dev/xx;7168.00;6168.09;/usr
2019-10-30-00-00/dev/xx;3072.00;1036.1;/var
2019-10-30-00-00/dev/xx;5120.00;348.72;/tmp
2019-10-30-00-00/dev/xx;20480.00;2033.19;/home
2019-10-30-12-00;/dev/xx;5120.00;348.72;/tmp
2019-10-30-12-00;/dev/hd1;20480.00;2037.62;/home
2019-10-30-12-00;/dev/xx;512.00;0.43;/xx
2019-10-30-12-00;/dev/xx;3584.00;794.39;/xx
2019-10-30-12-00;/dev/xx;512.00;0.4;/var/xx/xx/xx
2019-10-30-12-00;/dev/xx;512.00;0.68;/xx
2019-10-30-12-00;/dev/xx;512.00;1.84;/var/xx/xx
2019-10-30-12-00;/dev/xx;512.00;80.08;/opt/xx/x
2019-10-30-12-00;/dev/xx;20480.00;83.44;/var/xx/xx
2019-10-30-12-00;/dev/x;3584.00;840.43;/var/xx/xx
Some lines have 2019-10-29-18-00/dev and others have 2019-10-30-12-00;/dev/.
I want to add the ; before /dev/ where it is missing, so I use this sed command:
sed 's/\/dev/\;\/dev/'
But how can I apply this command to each line where the ; is missing? I tried this:
for i in $(cat /home/xxx/xxx/xxx/*.txt | grep -e "00/dev/")
do
sed 's/\/dev/\;\/dev/' $i > $i
done
But it doesn't work... Can you help me?
Could you please try the following with GNU awk, if you are OK with it.
awk -i inplace '/00\/dev\//{gsub(/00\/dev\//,"00;/dev/")} 1' *.txt
sed solution, tested with GNU sed on a few files, where it worked fine:
sed -i.bak '/00\/dev/s/00\/dev/00\;\/dev/g' *.txt
This might work for you (GNU sed & parallel):
parallel -q sed -i 's#;*/dev#;/dev#' ::: *.txt
or if you prefer:
sed -i 's#;*/dev#;/dev#' *.txt
Ignore lines that already have ;/dev:
sed '/;\/dev/{p;d}; s^/dev^;/dev^'
The /;\/dev/ address checks whether the line has ;/dev. If it does: p prints the current line and d deletes it, starting the next cycle, so the substitution is skipped.
You can use any character as the delimiter of sed's s command. Also, there is no need to escape the semicolon: plain ; works.
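For instance, the same script with a comma as the s delimiter and the bare semicolon:
sed '/;\/dev/{p;d}; s,/dev,;/dev,'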
But how can I apply this command to each line where the ; is missing? I tried this
Don't edit a file by redirecting to the same file ($i > $i). Think about it: how can you rewrite and read the same file at the same time? You can't. The resulting file will in most cases be empty, because the > $i redirection is performed first, truncating the file, and only then does sed start and read the now-empty file. Use a temporary file (sed ... "$i" > temp.txt; mv temp.txt "$i") or use GNU sed's -i option to edit in place.
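A sketch of the temporary-file variant using the command from this answer (temp.txt is just an illustrative name):
for i in /home/xxx/xxx/xxx/*.txt
do
# write the fixed output to a scratch file, then move it over the original
sed '/;\/dev/{p;d}; s^/dev^;/dev^' "$i" > temp.txt && mv temp.txt "$i"
done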
What you really want to do is:
grep -l '00/dev/' /home/xxx/xxx/xxx/*.txt |
xargs -n1 sed -i '/;\/dev/{p;d}; s^/dev^;/dev^'
grep -l prints the list of files that match the pattern; then xargs executes sed for each single file (-n1), and sed's -i edits the files in place.
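If the file names may contain whitespace, a null-delimited variant of the same pipeline is safer (GNU grep and xargs):
grep -lZ '00/dev/' /home/xxx/xxx/xxx/*.txt |
xargs -0 -n1 sed -i '/;\/dev/{p;d}; s^/dev^;/dev^'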
The grep filtering step can be eliminated in your case; a single sed command per file accomplishes the task:
for f in /home/xxx/xxx/xxx/*.txt
do
[[ -f "$f" ]] && sed -Ei '/00\/dev/ s/([^;])(\/dev)/\1;\2/' "$f"
done
The easiest way would be to adjust your regex so that it's looking a bit wider than '/dev/', e.g.
sed -i -E 's|([0-9])/dev|\1;/dev|'
(Note that I'm taking advantage of sed's flexible approach to delimiters in the substitute command. Also, -E changes the grouping syntax.)
Alternatively, sed lets you filter which lines it handles:
sed -i '/[0-9]\/dev/ s|/dev|;/dev|'
This uses the same substitution you already have, but applies it only to lines that match the filter regex.

How to move files where the first line contains a string?

I am currently using the following command:
grep -l -Z -E '.*?FindMyRegex' /home/user/folder/*.csv | xargs -0 -I{} mv {} /home/destination/folder
This works fine. The problem is it uses grep on the entire file.
I would like to use the grep command on the FIRST line of the file only.
I have tried to use head -1 file | at the beginning, but it did not work.
A change I would make to your script:
for file in *.csv; do
head -1 "$file" | grep -l -Z -E '.*?FindMyRegex' | xargs -0 -I{} mv {} /home/destination/folder;
done
You can maybe try sed '1q' file.csv | grep ... to search the regexp in the first line only.
You don't need grep or find, as long as your files don't have embedded newlines.
I don't know an easy way off the top of my head to get sed to delimit with nulls.
mv $( for f in /home/user/folder/*.csv;
do sed -ns '1 { /yourPattern/F; q; }' "$f";
done ) /home/destination/folder/
EDIT
Rewrote with a loop. This will run a separate instance of sed to check each file, but at least it shouldn't read beyond the first line. It will fail syntactically if there are no hits.
You might need -E depending on your regex.
-n says don't print records from the files.
-s says treat each file as a distinct input; with the loop running sed once per file, it is not strictly needed.
This does require GNU sed for the F.
gawk 'FNR==1{if($0~/PATTERN/)
printf "mv %s %s\n",FILENAME, "/target";nextfile}' /path/*.csv
First of all, in your regex .*?FindMyRegex, the .*? doesn't make any sense and can be removed.
The above awk (gawk) one-liner builds mv file target command lines for you. Check them, and if you are satisfied with them, pipe the output to sh and the commands will be executed.
Replace PATTERN with your regex and /target with the real target directory.
The one-liner assumes the filenames contain no special characters (e.g. spaces); if they do, add quotes around the names in the generated mv command.
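For example, once the generated commands look right (with PATTERN and /target filled in as described), execute them by piping to the shell:
gawk 'FNR==1{if($0~/PATTERN/) printf "mv %s %s\n",FILENAME,"/target";nextfile}' /path/*.csv | sh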
Using GNU awk to find the filenames and piping them into xargs:
gawk -v pattern="myRegex" '
FNR == 1 {if ($0 ~ pattern) printf "%s\0", FILENAME; nextfile}
' *.csv | xargs -0 echo mv -t destination
If the output looks OK, remove the echo.
Try this Shellcheck-clean Bash code:
#! /bin/bash
shopt -s nullglob # Globs that match nothing expand to nothing
shopt -s dotglob # Globs match files whose names start with '.'
dest=/home/destination/folder
for file in *.csv ; do
head -n 1 -- "$file" | grep -qE '.*?FindMyRegex' && mv -- "$file" "$dest"
done
shopt -s nullglob prevents an error if there are no .csv files in the directory.
shopt -s dotglob ensures that files whose name starts with '.' are handled.
The -- in the options for head and mv ensures that files whose names begin with - are handled correctly.
The quotes in "$file" and "$dest" ensure that names that contain whitespace (actually $IFS) characters (including newlines) or glob metacharacters are handled correctly.
Note that the .*? in the regular expression is probably redundant, and may not do what you think it does (grep -E doesn't do non-greedy matching).

How to print the result of the first part of the pipe?

I have the following grep:
grep -Po '(?<=PROGRAM\()[^\)]+(?=\))' /home/programs/hello_word.sh
Which displays the string between PROGRAM( and ):
RECTONTER
Then I need to know whether the extracted string is contained in a file, so:
grep -Po '(?<=PROGRAM\()[^\)]+(?=\))' /home/programs/hello_word.sh | xargs -I % grep -e % /home/leherad/pgm_currentdate
File content:
RECTONTER
CORASFE
RENTOASD
UBICARP
If it's found, this returns the matching line of /home/leherad/pgm_currentdate, but I want to print the string extracted by the first grep (RECTONTER) instead. If it's not found, nothing should be printed.
Is there a simple way to do this, or should I not overcomplicate things and instead build a script that saves the output of the first grep in a variable?
You can store it in a variable first:
read -r FIRST < <(exec grep -Po '(?<=PROGRAM\()[^\)]+(?=\))' /home/programs/hello_word.sh) && grep -e "$FIRST" /home/leherad/pgm_currentdate
Update 01
#!/bin/bash
shopt -s nullglob
for FILE in /home/programs/*; do
read -r FIRST < <(exec grep -Po '(?<=PROGRAM\()[^\)]+(?=\))' "$FILE") && grep -e "$FIRST" /home/leherad/pgm_currentdate && echo "$FIRST"
done
I think a straightforward way to solve this is to use a function.
Also, your grep pattern will match shell comments, which could cause unexpected behavior in your xargs command when there is more than one match; you might want to take steps to grab only the first match. It's hard to say without actually seeing the input files, so I'm guessing this is either OK or comments are actually the expected place for your target pattern.
Anyway, here's my best guess at a function that would work for you.
get_program() {
local filename="$1"
local program="$( grep -m1 -Po '(?<=PROGRAM\()[^\)]+(?=\))' "$filename" )"
if grep -q -e "$program" /home/leherad/pgm_currentdate; then
echo "$program"
grep -e "$program" /home/leherad/pgm_currentdate
fi
}
get_program /home/programs/hello_word.sh
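To apply it to a whole directory, the call can sit in a loop; the glob below is illustrative:
for f in /home/programs/*.sh; do
get_program "$f"
done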

How do I insert a new line before concatenating?

I have about 80000 files which I am trying to concatenate. This one:
cat files_*.raw >> All
is extremely fast whereas the following:
for f in `ls files_*.raw`; do cat $f >> All; done;
is extremely slow. For this reason I am trying to stick with the first option, except that I need to insert a newline after each file as it is concatenated to All. Is there any fast way of doing this?
What about
ls files_*.raw | xargs -L1 sed -e '$s/$/\n/' >> All
That will insert an extra newline at the end of each file as you concat them.
And a parallel version if you don't care about the order of concatenation:
find ./ -name "*.raw" -print | xargs -n1 -P4 sed -e '$s/$/\n/' >>All
The second command is probably slow because you open the All file for appending 80,000 times, versus once in the first command. Try a simple variant of the second command:
for f in files_*.raw; do cat "$f" ; echo '' ; done >> All
I don't know why it would be slow, but I don't think you have much choice:
for f in files_*.raw; do cat "$f" >> All; echo '' >> All; done
Each time awk opens another file to process, FNR resets, so it equals 1 on the first line of each file:
awk 'FNR==1{print ""} {print}' files_*.raw >> All
Note, it's all done in one awk process. Performance should be close to the cat command from the question.
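If you want the blank line after each file instead of before it, GNU awk's ENDFILE rule expresses that directly; a sketch:
gawk '{print} ENDFILE{print ""}' files_*.raw >> All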

Linux using grep to print the file name and first n characters

How do I use grep to perform a search which, when a match is found, will print the file name as well as the first n characters in that file? Note that n is a parameter that can be specified and it is irrelevant whether the first n characters actually contains the matching string.
grep -l pattern *.txt |
while read line; do
echo -n "$line: ";
head -c "$n" "$line";
echo;
done
Change -c to -n if you want to see the first n lines instead of bytes.
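If the matching file names may contain spaces or newlines, a null-delimited version of the same loop is more robust (GNU grep plus bash; n=25 is arbitrary):
n=25
grep -lZ pattern *.txt |
while IFS= read -r -d '' file; do
printf '%s: ' "$file"
head -c "$n" -- "$file"
echo
done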
You need to pipe the output of grep to sed to accomplish what you want. Here is an example:
grep mypattern *.txt | sed 's/^\([^:]*:.......\).*/\1/'
The number of dots is the number of characters you want to print. Many versions of sed often provide an option, like -r (GNU/Linux) and -E (FreeBSD), that allows you to use modern-style regular expressions. This makes it possible to specify numerically the number of characters you want to print.
N=7
grep mypattern *.txt /dev/null | sed -r "s/^([^:]*:.{$N}).*/\1/"
Note that this solution is a lot more efficient than the others proposed, which invoke multiple processes.
There are few tools that print 'n characters' rather than 'n lines'. Are you sure you really want characters and not lines? The whole thing can perhaps be best done in Perl. As specified (using grep), we can do:
pattern="$1"
shift
n="$2"
shift
grep -l "$pattern" "$#" |
while read file
do
echo "$file:" $(dd if="$file" count=${n}c)
done
The quotes around $file preserve multiple spaces in file names correctly. We can debate the command line usage, currently (assuming the command name is 'ngrep'):
ngrep pattern n [file ...]
I note that #litb used 'head -c $n'; that's neater than the dd command I used. There might be some systems without head (but they'd be pretty archaic). I note that the POSIX version of head only supports -n and a number of lines; the -c option is probably a GNU extension.
Two thoughts here:
1) If efficiency was not a concern (like that would ever happen), you could check $status [csh] after running grep on each file. E.g. (for n characters = 25):
foreach FILE ( file1 file2 ... fileN )
grep targetToMatch ${FILE} > /dev/null
if ( $status == 0 ) then
echo -n "${FILE}: "
head -c25 ${FILE}
endif
end
2) GNU [FSF] head contains a --verbose [-v] switch, and GNU grep and xargs offer --null to accommodate filenames with spaces. And there's '--', to handle filenames like "-c". So you could do:
grep --null -l targetToMatch -- file1 file2 ... fileN |
xargs --null head -v -c25 --
