How to express "map a linux command to each line in a file"? - linux

Often I need to delete all the files that are not under version control in an svn source tree. To get the list of their file names, I use:
svn st | grep ^? | awk '{print $2}'
This command will give me a list of file names, one name per line. Then how can I express the idea of
for (each line in $(svn st | grep ^? | awk '{print $2}'))
    rm -f line
?

Use xargs:
svn st | grep '^?' | awk '{print $2}' | xargs rm -rf
Note: your command doesn't handle file names containing whitespace, because of {print $2}. You also have to be careful with xargs, since it splits its input on whitespace; it's safer to use the -0 option if your file names may contain whitespace.
This is an example of a more correct xargs usage:
find -type f | tr '\n' '\0' | xargs -0 wc
Or, using find -print0 option:
find -type f -print0 | xargs -0 wc
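If you want that extra safety for the svn case in the question, one hedged option is to turn the newline-separated list into a NUL-separated one first. The cut offset below is an assumption about the width of svn st's status columns, so verify it against your svn version:
# Sketch: drop the status columns, convert newlines to NULs, then delete.
# cut -c9- assumes the path starts at column 9 of `svn st` output;
# names that themselves contain newlines would still be mangled.
svn st | grep '^?' | cut -c9- | tr '\n' '\0' | xargs -0 rm -rf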
xargs is one of the most powerful UNIX tools. I use it every day.

Why are so many people invoking grep in the pipeline and either using bash's for loops or xargs on the back end when they have bloody awk right in the middle of things?
First let's get rid of the whimsical use of grep:
svn st | awk '/^?/ { print $2 }'
Since awk allows you to filter lines based on regular expressions, the use of grep is entirely unnecessary. The regular expressions of awk aren't that different from grep (depending on which implementation of awk and which of grep you're using) so why add a whole new process and a whole new bottleneck in the pipeline?
From there you already have two choices that would be shorter and more readable:
# option 1
svn st | awk '/^?/ { print $2 }' | xargs rm -f
# option 2
rm -f $(svn st | awk '/^?/ { print $2 }')
Now that second option will only work if your file list doesn't exceed the maximum command line size, so I recommend the xargs version.
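If you're curious what that limit actually is on your system, getconf reports it:
# Combined size limit (in bytes) for a new process's arguments plus environment:
getconf ARG_MAX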
Or, perhaps even better, use awk again.
svn st | awk '/^?/ { system("rm -f " $2) }'
This will be the functional equivalent of what you did above with the for loop. It's grotesquely inefficient, seeing as it will execute rm once per file, but it's at least more readable than your for loop example. It can be improved still farther, however. I won't go into full details here, but will instead give you comments as clues as to what the final solution would look like.
svn st | awk 'BEGIN{ /*set up an array*/ }; /^?/ { /*add $2 to the array*/ }; END{ /*system("rm -f ...") safe chunks of the array*/ }'
OK, so that last one is a bit of a mouthful and too much to type off as a routine one-liner. Since you have to "often" do this, however, it shouldn't be too bad to put this into a script:
#!/usr/bin/awk -f
BEGIN {
    # set up your accumulator array
}
/^?/ {
    # add $2 to the array
}
END {
    # invoke system("rm -f ...") on safe chunks of the accumulator array
}
Now your command line will be svn st | myawkscript.
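For illustration only, here is one way those placeholder comments might be filled in, written as a shell one-liner rather than a separate script. The chunk size of 100 is arbitrary, and the naive concatenation still breaks on names containing spaces or shell metacharacters:
svn st | awk '
    /^?/ { files[n++] = $2 }                  # accumulate untracked paths
    END {
        for (i = 0; i < n; i++) {
            cmd = cmd " " files[i]            # naive joining; unsafe for odd names
            if ((i + 1) % 100 == 0 || i == n - 1) {
                system("rm -f" cmd)           # delete this chunk
                cmd = ""
            }
        }
    }'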
Now I'll warn you I'm not equipped to check all this (since I avoid SVN and CVS like I avoid MS-DOS -- and for much the same reason). You might have to monkey with the #! line in the script, for example, or with the regular expression you use to filter, but the general principle remains about the same. And me personally? I'd use svn st | awk '/^?/ { print $2 }' | xargs rm -f for something I do infrequently. I'd only do the full script for something I do several times a week or more.

There are a few ways to do this, each with its own drawbacks. One of the simpler ones is a while read loop. Another way is to use a for loop and change IFS to a newline, but that's a bit uglier without much benefit; you may still see it done, though, especially when word splitting on spaces is wanted.
git status | while read -r line; do
    echo "I'm an albatross! $line"
done
or, using process substitution (which gets rid of the subshell created by the pipe, avoiding some nasty complications):
while read -r line; do
    echo "I'm an albatross! $line"
done < <(git status)
There are caveats to this which may make things more complicated in specific situations. You should read help read, and Bash FAQ #1 contains lots of gory details.
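Applied to the original svn example, a minimal sketch of the while read form (using IFS= and read -r, as the FAQ advises):
# Note: names containing spaces are already truncated by {print $2} upstream.
svn st | grep '^?' | awk '{print $2}' | while IFS= read -r file; do
    rm -f -- "$file"
done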

Use Command Substitution using back-ticks: http://tldp.org/LDP/abs/html/commandsub.html
In your case:
rm -f `svn st | grep ^? | awk '{print $2}'`

xargs is your friend:
svn st | awk '/^?/{print $2}' | xargs rm -rf

Related

Move a file list based upon grep pattern in command line [duplicate]

I want to pass each output from a command as multiple argument to a second command, e.g.:
grep "pattern" input
returns:
file1
file2
file3
and I want to copy these outputs, e.g:
cp file1 file1.bac
cp file2 file2.bac
cp file3 file3.bac
How can I do that in one go? Something like:
grep "pattern" input | cp $1 $1.bac
You can use xargs:
grep 'pattern' input | xargs -I% cp "%" "%.bac"
You can use $() to interpolate the output of a command. So, you could use kill -9 $(grep -hP '^\d+$' $(ls -lad /dir/*/pid | grep -P '/dir/\d+/pid' | awk '{ print $9 }')) if you wanted to.
In addition to Chris Jester-Young's good answer, I would say that xargs is also a good solution for these situations:
grep ... `ls -lad ... | awk '{ print $9 }'` | xargs kill -9
will do it. All together:
grep -hP '^\d+$' `ls -lad /dir/*/pid | grep -P '/dir/\d+/pid' | awk '{ print $9 }'` | xargs kill -9
For completeness, I'll also mention command substitution and explain why this is not recommended:
cp $(grep -l "pattern" input) directory/
(The backtick syntax cp `grep -l "pattern" input` directory/ is roughly equivalent, but it is obsolete and unwieldy; don't use that.)
This will fail if the output from grep produces a file name which contains whitespace or a shell metacharacter.
Of course, it's fine to use this if you know exactly which file names the grep can produce, and have verified that none of them are problematic. But for a production script, don't use this.
Anyway, for the OP's scenario, where you need to refer to each match individually and add an extension to it, the xargs or while read alternatives are superior anyway.
In the worst case (meaning problematic or unspecified file names), pass the matches to a subshell via xargs:
grep -l "pattern" input |
xargs -r sh -c 'for f; do cp "$f" "$f.bac"; done' _
... where obviously the script inside the for loop could be arbitrarily complex.
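For instance, a hypothetical body that skips files which already have a backup and reports what it copied:
# Sketch only; the "skip existing backups" behaviour is a made-up example.
grep -l "pattern" input |
xargs -r sh -c 'for f; do
    [ -e "$f.bac" ] && continue          # already backed up, skip it
    cp -- "$f" "$f.bac" && echo "backed up $f"
done' _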
In the ideal case, the command you want to run is simple (or versatile) enough that you can simply pass it an arbitrarily long list of file names. For example, GNU cp has a -t option to facilitate this use of xargs (the -t option allows you to put the destination directory first on the command line, so you can put as many files as you like at the end of the command):
grep -l "pattern" input | xargs cp -t destdir
which will expand into
cp -t destdir file1 file2 file3 file4 ...
for as many matches as xargs can fit onto the command line of cp, repeated as many times as it takes to pass all the files to cp. (Unfortunately, this doesn't match the OP's scenario; if you need to rename every file while copying, you need to pass in just two arguments per cp invocation: the source file name and the destination file name to copy it to.)
So in other words, if you use the command substitution syntax and grep produces a really long list of matches, you risk bumping into ARG_MAX and "Argument list too long" errors; but xargs will specifically avoid this by instead copying only as many arguments as it can safely pass to cp at a time, and running cp multiple times if necessary instead.
The above will still work incorrectly if you have file names which contain newlines. Perhaps see also https://mywiki.wooledge.org/BashFAQ/020
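A hedged sketch that also survives newlines, assuming GNU grep (whose -Z/--null option NUL-terminates the file names printed by -l) and an xargs that supports -0:
# grep -lZ emits NUL-terminated file names, so xargs -0 reconstructs them exactly.
grep -lZ "pattern" input |
xargs -0 -r sh -c 'for f; do cp -- "$f" "$f.bac"; done' _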
#!/bin/bash
for f in files; do
if grep -q PATTERN "$f"; then
echo cp -v "$f" "${f}.bac"
fi
done
Replace files with a glob such as *.txt or *.text (that is, files ending in .txt or .text), or whatever you want/need, and of course replace PATTERN with your own pattern. Remove the echo once you're satisfied with the output. For a recursive solution, take a look at the bash shell option globstar, sketched below.
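A recursive variant using globstar might look like this (a sketch; requires bash 4 or newer):
#!/bin/bash
# ** matches files in the current directory and all subdirectories.
shopt -s globstar
for f in **/*.txt; do
    if grep -q PATTERN "$f"; then
        echo cp -v "$f" "${f}.bac"
    fi
done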

Can find push the filenames of the found files into the pipe?

I would like to run find in some dir, run awk on each file found in that directory, and then replace each original file with its result.
find dir | xargs cat | awk ... | mv ... > filename
So I need the filename (of each of the files found by find) in the last command. How can I do that?
I would use a loop, like:
for filename in `find . -name "*test_file*" -print0 | xargs -0`
do
# some processing, then
echo "what you like" > "$filename"
done
EDIT: as noted in the comments, the benefits of -print0 | xargs -0 are lost because of the for loop, and filenames containing whitespace are still not handled correctly.
The following while loop would not handle truly unusual filenames either (good to know, though that was not part of the question), but it does at least handle filenames with ordinary spaces, so it works better:
find . -name "*test*file*" -print > files_list
while IFS= read -r filename
do
# some process
echo "what you like" > "$filename"
done < files_list
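If you do need to cope with genuinely arbitrary names, including embedded newlines, a hedged sketch using NUL delimiters end to end (bash-specific):
# -print0 and read -d '' keep each file name intact, whatever it contains.
find . -name "*test*file*" -print0 |
while IFS= read -r -d '' filename
do
    # some process
    echo "what you like" > "$filename"
done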
You could do something like this (but I wouldn't recommend it at all).
find dir -print0 |
xargs -0 -n 2 awk -v OFS='\0' '<process the input and write to temporary file>
END {print "temporaryfile", FILENAME}' |
xargs -0 -n 2 mv
This passes the files to awk directly two at a time (which avoids the problem with your original where cat will get hundreds (perhaps more) files as arguments all at once and spit all their content at awk via standard input at once and thus lose their individual contents and filenames entirely).
It then has awk write the processed output to a temporary file and then outputs the temporary filename and the original filename where xargs picks them up (again two at a time) and runs mv on the pairs of temporary file/original file names.
As I said at the beginning however this is a terrible way to do this.
If you have a new enough version of GNU awk (version 4.1.0 or newer) then you could just use its in-place editing option, -i inplace, and do something like (I believe):
find dir | xargs gawk -i inplace '......'
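For concreteness, a sketch of that in-place route; the substitution program here is just a placeholder, not something from the question:
# GNU awk 4.1+ only: -i inplace rewrites each input file in place.
find dir -type f -exec gawk -i inplace '{ sub(/old/, "new"); print }' {} +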
Without that I would use a while loop of the form in Bash FAQ 001 to read the find output line-by-line and operate on it in the loop.

How to grep within a grep

I have a bunch of massive text files, about 100MB each.
I want to grep to find entries that have 'INDIANA JONES' in it:
$ grep -ir 'INDIANA JONES' ./
Then, I would like to find the entries where there is the word PORTUGAL within 5,000 characters of the INDIANA JONES term. How would I do this?
# in pseudocode
grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 char
Use grep's -o flag to output the 5000 characters surrounding the match, then search those characters for the second string. For example:
grep -ioE ".{5000}INDIANA JONES.{5000}" file.txt | grep "PORTUGAL"
If you need the original match, add the -n flag to the second grep and pipe into:
cut -f1 -d: > line_numbers.txt
then you could use awk to print those lines:
awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt
To avoid the temporary file, this could be written like:
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{5000}INDIANA JONES.{5000}" file.txt | grep -n "PORTUGAL" | cut -f1 -d:) file.txt
For multiple files, use find and a bash loop:
for i in $(find . -type f); do
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{5000}INDIANA JONES.{5000}" "$i" | grep -n "PORTUGAL" | cut -f1 -d:) "$i"
done
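If the file names may contain spaces, a hedged variant of that loop reads NUL-delimited find output instead of word-splitting a command substitution:
# read -d '' consumes NUL-terminated names, so spaces in paths are safe.
find . -type f -print0 | while IFS= read -r -d '' i; do
    awk 'FNR==NR { a[$0]; next } FNR in a' \
        <(grep -ioE ".{5000}INDIANA JONES.{5000}" "$i" | grep -n "PORTUGAL" | cut -f1 -d:) "$i"
done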
One way to deal with this is with gawk. You could set the record separator to either INDIANA JONES or PORTUGAL and then perform a length check on the record (after stripping newlines, assuming newlines do not count towards the limit of 5000). You may have to resort to find to run this recursively within a directory
awk -v RS='INDIANA JONES|PORTUGAL' '{a = $0;
gsub("\n", "", a)};
((RT ~ /IND/ && prevRT ~/POR/) || (RT ~ /POR/ && prevRT ~/IND/)) && length(a) < 5000{found=1};
{prevRT=RT};
END{if (found) print FILENAME}' file.txt
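To run that recursively, as mentioned, one hedged option is to wrap it in find and invoke gawk once per file so the END block fires for each file. Here check_proximity.awk is a hypothetical file holding the program above:
# Hypothetical: save the gawk program above as check_proximity.awk, then:
find . -type f -exec gawk -f check_proximity.awk {} \;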
Consider installing ack-grep.
sudo apt-get install ack-grep
ack-grep is a more powerful version of grep.
There's no trivial solution to your question (that i can think of) outside of a full batch script, but you can use the -A and -B flags on ack-grep to specify a number of trailing or leading lines to output, resp.
This may not be a number of chars, but is a step further in that direction.
While this may not be a solution, it might give you some idea as to how to do this. Lookup filters like ack, awk, sed, etc. and see if you can find one with a flag for this kind of behaviour.
The ack-grep manual:
http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html
EDIT:
I think the sad news is, what you might think you're looking for is something like:
grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename
The problem is, even on a small file, querying this is going to be impossible time-wise.
I got this one to work with a different number; it's a size problem.
For such a large set of files, you'll need to do this in more than one step.
A Solution:
The only solution I know of is the leading and trailing output from ack-grep.
Step 1: how long are your lines?
If you knew how many lines out you had to go (and you could estimate or calculate this a few ways), then you'd be able to grep the output of the first grep. Depending on what's in your file, you should be able to get a decent upper bound on how many lines make up 5000 characters: if a line averages 100 characters, 50-odd lines should cover you, but at 10 characters per line you'll need 500 or more. Pick a generous figure if you like; it's your data, so it's up to you.
With that, call: (if you needed 100 lines for 5000 chars)
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename
and
ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename
replace the 100s with what you need.
Step 2: parse the output
you'll need to take the matches that ack-grep returns and parse them, looking for any matches again, within these sub-ranges.
Look for INDIANA JONES in the first PORTUGAL ack-grep match output, and look for PORTUGAL in the second set of matches.
This should take a bit more work, likely involving a bash script (I might see if I can get one working this week), but it solves your massive-data problem, by breaking it down into more manageable chunks.
grep 'INDIANA JONES' . -iR -l | while read filename; do head -c 5000 "$filename" | grep -n PORTUGAL -H --label="$filename" ; done
This works as follows:
grep 'INDIANA JONES' . -iR -l. Search for all files in or below the current directory. Case insensitive (-i). And only print the names of the files that match (-l), don't print any content.
| while read filename; do ...|...|...; done for each line of input, store it in variable $filename and execute the pipeline.
Now, for each file that matched 'INDIANA JONES', we do
head -c 5000 "$filename" - extract the first 5000 characters
grep ... - search for PORTUGAL. Print the file name (-H), but since the input comes from a pipe, tell grep which name to display with --label="$filename". Print line numbers too, with -n.

ksh storing result of a command to a variable

I want to store the result of a command in a variable in my shell script. I can't seem to get it to work. I want the most recently dated file in the directory.
PRODUCT= 'ls -t /some/dir/file* | head -1 | xargs -n1 basename'
It won't work.
You have two options: either $() or backticks (``).
1) x=$(ls -t /some/dir/file* | head -1 | xargs -n1 basename)
or
2) x=`ls -t /some/dir/file* | head -1 | xargs -n1 basename`
echo $x
Edit: removing unnecessary bracket for (2).
The problem that you're having is that the command needs to be surrounded by back-ticks rather than single quotes. This is known as 'Command Substitution'.
Bash allows you to use $() for command substitution, but this is not available in all shells. I don't know if it's available in KSH; if it is, it's probably not available in all versions.
If the $() syntax is available in your version of ksh, you should definitely use it; it's easier to read (back ticks are too easy to confuse with single quotes); back-ticks are also hard to nest.
This only addresses one of the problems with your command, however: ls returns directories as well as files, so if the most recent thing modified in the specified directory is a sub-directory, that is what you will see.
If you only want to see files, I suggest using some version of the following (I'm using Bash, which supports default parameter values; you'll probably have to play around with the ${1:-.} syntax in other shells):
lastfile ()
{
find ${1:-.} -maxdepth 1 -type f -printf "%T+ %p\n" | sort -n | tail -1 | sed 's/[^[:space:]]\+ //'
}
This runs find on the directory, and only pulls files from that directory. It formats all of the files like this:
2012-08-29+16:21:40.0000000000 ./.sqlite_history
2013-01-14+08:52:14.0000000000 ./.davmail.properties
2012-04-04+16:16:40.0000000000 ./.DS_Store
2010-04-21+15:49:00.0000000000 ./.joe_state
2008-09-05+17:15:28.0000000000 ./.hplip.conf
2012-01-31+13:12:28.0000000000 ./.oneclick
sorts the list, takes the last line, and chops off everything before the first space.
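Tying it back to the question, usage might look like this (a sketch):
# Capture the newest regular file in /some/dir ("." is the default if omitted):
PRODUCT=$(lastfile /some/dir)
echo "$PRODUCT"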
You want $() (preferred) or backticks (``) (older style), rather than single quotes:
PRODUCT=$(ls -t /some/dir/file* | head -1 | xargs -n1 basename)
or
PRODUCT=`ls -t /some/dir/file* | head -1 | xargs -n1 basename`
You need the double quotes to keep the name intact even if it contains spaces (and also in case you later want more than one file), and "$(..)" to run the command and capture its output (command substitution).
I believe you also need the '-1' option to ls; otherwise you could get several names per line (you only keep one line, but it could contain several files).
PRODUCT="$(ls -1t /some/dir/file* | head -1 | xargs -n1 basename)"
Please do not put spaces around the "=" in variable assignments (as I saw in other solutions here); with a space, the shell no longer treats it as an assignment at all.
I would do something like:
Your version corrected:
PRODUCT=$(ls -t /some/dir/file* | head -1 | xargs -n1 basename)
Or simpler:
PRODUCT=$(cd /some/dir && ls -1t file* | head -1)
change to the directory
list one filename per line and sort by time/date
grab the first line

Linux using grep to print the file name and first n characters

How do I use grep to perform a search which, when a match is found, will print the file name as well as the first n characters in that file? Note that n is a parameter that can be specified and it is irrelevant whether the first n characters actually contains the matching string.
grep -l pattern *.txt |
while read line; do
echo -n "$line: ";
head -c $n "$line";
echo;
done
Change -c to -n if you want to see the first n lines instead of bytes.
You need to pipe the output of grep to sed to accomplish what you want. Here is an example:
grep mypattern *.txt | sed 's/^\([^:]*:.......\).*/\1/'
The number of dots is the number of characters you want to print. Many versions of sed often provide an option, like -r (GNU/Linux) and -E (FreeBSD), that allows you to use modern-style regular expressions. This makes it possible to specify numerically the number of characters you want to print.
N=7
grep mypattern *.txt /dev/null | sed -r "s/^([^:]*:.{$N}).*/\1/"
Note that this solution is a lot more efficient than the others proposed, which invoke multiple processes.
There are few tools that print 'n characters' rather than 'n lines'. Are you sure you really want characters and not lines? The whole thing can perhaps be best done in Perl. As specified (using grep), we can do:
pattern="$1"
shift
n="$2"
shift
grep -l "$pattern" "$#" |
while read file
do
echo "$file:" $(dd if="$file" count=${n}c)
done
The quotes around $file preserve multiple spaces in file names correctly. We can debate the command line usage, currently (assuming the command name is 'ngrep'):
ngrep pattern n [file ...]
I note that #litb used 'head -c $n'; that's neater than the dd command I used. There might be some systems without head (but they'd be pretty archaic). I note that the POSIX version of head only supports -n and a number of lines; the -c option is probably a GNU extension.
Two thoughts here:
1) If efficiency was not a concern (like that would ever happen), you could check $status [csh] after running grep on each file. E.g.: (For N characters = 25.)
foreach FILE ( file1 file2 ... fileN )
grep targetToMatch ${FILE} > /dev/null
if ( $status == 0 ) then
echo -n "${FILE}: "
head -c25 ${FILE}
endif
end
2) GNU [FSF] head has a --verbose [-v] switch, and GNU grep and xargs both offer --null, to accommodate filenames with spaces. There's also '--', to handle filenames like "-c". So you could do:
grep --null -l targetToMatch -- file1 file2 ... fileN |
xargs --null head -v -c25 --
