Linux command to grab lines similar between files

I have one file that has one word per line.
I have a second file that has many words per line.
I would like to go through each line in the first file and, for every line in the second file that contains it, copy that line from the second file into a new third file.
Is there a way to do this simply with a Linux command?
Edit: Thanks for the input. But, I should specify better:
The first file is just a list of numbers (one number per line).
463463
43454
33634
The second file is very messy, and I am only looking for that number string to appear somewhere in a line (not necessarily as an individual word). So, for instance
ewjleji jejeti ciwlt 463463.52%
would return a hit. I think what was suggested to me does not work in this case (please forgive my having to edit for not being detailed enough).

If n is the number of lines in your first file and m is the number of lines in your second file, then you can solve this problem in O(nm) time in the following way:
cat firstfile | while read word; do
grep "$word" secondfile >>thirdfile
done
If you need to solve it more efficiently than that, though, I don't think there are any built-in utilities for it.
As for your edit, this method does work the way you describe.
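That said, grep can also read the patterns from a file itself, which avoids the shell loop entirely; a minimal sketch, using -F so the numbers are treated as fixed strings rather than regexes:
grep -F -f firstfile secondfile > thirdfile
Since this matches the numbers anywhere in a line, it also handles cases like 463463.52% from the edit.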

Here is a short script that will do it. It takes 3 command-line arguments: 1 - a file with one word per line, 2 - a file with many lines that you want to match against each word in file 1, and 3 - your output file:
#!/bin/bash
## test input and show usage on error
test -n "$1" && test -n "$2" && test -n "$3" || {
    printf "Error: insufficient input, usage: %s file1 file2 file3\n" "${0//*\//}"
    exit 1
}
while read line || test -n "$line" ; do
    grep "$line" "$2" 1>>"$3" 2>/dev/null
done <"$1"
example:
$ cat words.txt
me
you
them
$ cat lines.txt
This line is for me
another line for me
maybe another for me
one for you
another for you
some for them
another for them
here is one that doesn't match any
$ bash ../lines.sh words.txt lines.txt outfile.txt
$ cat outfile.txt
This line is for me
another line for me
maybe another for me
some for them
one for you
another for you
some for them
another for them
(Yes, I know that me also matches some in the example file, but that's not really the point.)
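If you do want whole-word matches instead, grep's -w option restricts matches to complete words; a sketch of the changed line in the script above:
    grep -w "$line" "$2" 1>>"$3" 2>/dev/null
With -w, me no longer matches some for them.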

Related

Get last line from grep search on multiple files, and write them in a output file

I have multiple files located in multiple directories. In them I search for the keyword 'ENERGY' with grep. In each file I get multiple matches. I want to take the last matching line from each file and save the results in the output.txt file. I wrote the following code:
labl=SubDir
ENERGY=`grep 'ENERGY' MyDir*${labl}*/*.txt`
cat > output.txt << EOF
${ENERGY}
EOF
This code saves all match cases from each file. But as mentioned, I need the last match case from each file. For that I modified the grep command as:
ENERGY=`grep 'ENERGY' MyDir*${labl}*/*.txt|tail -1`
Unfortunately this doesn't do the job either. Instead, it saves all the match cases from the last file only.
How to solve it?
There's no need to run multiple processes/pipes to achieve this:
gawk '/ENERGY/{last=$0} ENDFILE{if(last!="") print last; last=""}' MyDir*"$labl"*/*.txt
/ENERGY/{last=$0}: On lines which match the regex ENERGY, set variable last to the contents of the entire line $0
ENDFILE{...}: Run this {action} at the end of every input file supplied by the glob.
if(last!="") print last: print last if it's not null
last="": reset this variable to null, avoiding duplication
MyDir*"${labl}"*/*.txt: Quoted variable in glob will match directory names that include spaces
Use a for loop:
for f in MyDir*"$labl"*/*.txt; do
    grep ENERGY "$f" | tail -1 >> output.txt
done
Yet another possible approach is to use parallel, like this. You can probably achieve the same with xargs, but I personally prefer parallel as it is simpler and makes it easy to scale the process.
ls -1 file* | parallel -j1 "grep ENERGY {} | tail -n 1" > output.txt
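If you'd rather not parse the output of ls, GNU parallel can also take the file list directly on the command line; a sketch of the same idea:
parallel -j1 "grep ENERGY {} | tail -n 1" ::: file* > output.txt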

Edit text in a file with a script

I have a file called flw.py and would like to write a bash script that will replace some text in the file (take out the last two lines and add two new lines). I apologize if this seems like a stupid question. A thorough explanation would be appreciated since I am still learning to script. Thanks!
head -n -2 flw.py > tmp # (1)
echo "your first new line here..." >> tmp # (2)
echo "your second new line here...." >> tmp #
mv tmp flw.py # (3)
Explanation:
head normally prints out the first ten lines of a file. The -n argument can change the number of lines printed out. So if you wanted to print out the first 15 lines you would use head -n 15. If you give negative numbers to head it means the opposite: print out all lines but the last N lines. Which happens to be what you want: head -n -2
Then we redirect the output of our head command to a temporary file named tmp. > does the redirecting magic here. tmp now contains everything of flw.py but the last two lines.
Next we add the two new lines by using the echo command. We append the output of the echo "your first new line here..." to our tmp file. >> appends to an existing file, whereas > will overwrite an existing file.
We do the same thing for the second line we want to append.
Last, we move the tmp file to flw.py and the job is done.
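If you prefer, the same steps can be grouped into a single command line; a sketch using the placeholder lines from above:
{ head -n -2 flw.py; echo "your first new line here..."; echo "your second new line here...."; } > tmp && mv tmp flw.py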
You can use a single sed command to get the expected result:
sed -n 'N;$!P;$!D;a\line1\nline2' fly.py
Example:
cat fly.py
1
2
3
4
5
sed -n 'N;$!P;$!D;a\line1\nline2' fly.py
Output :
1
2
3
line1
line2
Note: use the -i option to update the file in place.
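For example, to update fly.py in place (a sketch assuming GNU sed, since -i and the escape handling in a\ are GNU extensions):
sed -i -n 'N;$!P;$!D;a\line1\nline2' fly.py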

How can I remove the last character of a file in unix?

Say I have some arbitrary multi-line text file:
sometext
moretext
lastline
How can I remove only the last character (the e, not the newline or null) of the file without making the text file invalid?
A simpler approach (outputs to stdout, doesn't update the input file):
sed '$ s/.$//' somefile
$ is a Sed address that matches the last input line only, thus causing the following function call (s/.$//) to be executed on the last line only.
s/.$// replaces the last character on the (in this case last) line with an empty string; i.e., effectively removes the last char. (before the newline) on the line.
. matches any character on the line, and following it with $ anchors the match to the end of the line; note how the use of $ in this regular expression is conceptually related, but technically distinct from the previous use of $ as a Sed address.
Example with stdin input (assumes Bash, Ksh, or Zsh):
$ sed '$ s/.$//' <<< $'line one\nline two'
line one
line tw
To update the input file too (do not use if the input file is a symlink):
sed -i '$ s/.$//' somefile
Note:
On macOS, you'd have to use -i '' instead of just -i; for an overview of the pitfalls associated with -i, see the bottom half of this answer.
If you need to process very large input files and/or performance / disk usage are a concern and you're using GNU utilities (Linux), see ImHere's helpful answer.
truncate
truncate -s-1 file
Removes one (-1) character from the end of the same file, in place, just as >> appends to the same file.
The problem with this approach is that it doesn't retain a trailing newline if it existed.
The solution is:
if [ -n "$(tail -c1 file)" ]   # if the file does not end with a newline
then
    truncate -s-1 file         # remove one char, as the question requests
else
    truncate -s-2 file         # remove the last char and the trailing newline
    echo "" >> file            # add the trailing newline back
fi
This works because tail takes the last byte (not char).
It takes almost no time even with big files.
Why not sed
The problem with a sed solution like sed '$ s/.$//' file is that it reads the whole file first (taking a long time with large files), then you need a temporary file (of the same size as the original):
sed '$ s/.$//' file > tempfile
rm file; mv tempfile file
And then move the tempfile to replace the file.
Here's another using ex, which I find not as cryptic as the sed solution:
printf '%s\n' '$' 's/.$//' wq | ex somefile
The $ goes to the last line, the s deletes the last character, and wq is the well known (to vi users) write+quit.
After a whole bunch of playing around with different strategies (and avoiding sed -i or perl), the best way I found to do this was with:
sed '$! { P; D; }; s/.$//' somefile
If the goal is to remove the last character in the last line, this awk should do:
awk '{a[NR]=$0} END {for (i=1;i<NR;i++) print a[i];sub(/.$/,"",a[NR]);print a[NR]}' file
sometext
moretext
lastlin
It stores all the data in an array, then prints it out, changing the last line.
Just a remark: sed -i will temporarily remove the file (it writes a new copy and renames it over the original).
So if you are tailing the file, you'll get a "No such file or directory" warning until you reissue the tail command.
EDITED ANSWER
I created a script and put your text inside a file on my Desktop. This test file is saved as "old_file.txt":
sometext
moretext
lastline
Afterwards I wrote a small script to take the old file and eliminate the last character in the last line:
#!/bin/bash
no_of_new_line_characters=`wc -l < '/root/Desktop/old_file.txt'`
let "no_of_lines=no_of_new_line_characters+1"
sed -n 1,"$no_of_new_line_characters"p '/root/Desktop/old_file.txt' > '/root/Desktop/my_new_file'
sed -n "$no_of_lines","$no_of_lines"p '/root/Desktop/old_file.txt'|sed 's/.$//g' >> '/root/Desktop/my_new_file'
Opening the new file I created showed the following output:
sometext
moretext
lastlin
I apologize for my previous answer (wasn't reading carefully)
sed '$ s/.$//' filename | tee newFilename
This should do your job.
A couple perl solutions, for comparison/reference:
(echo 1a; echo 2b) | perl -e '$_=join("",<>); s/.$//; print'
(echo 1a; echo 2b) | perl -e 'while(<>){ if(eof) {s/.$//}; print }'
I find the first read-whole-file-into-memory approach generally quite useful (less so for this particular problem). You can then write regexes that span multiple lines, for example to combine every 3 lines of a certain format into 1 summary line.
For this problem, truncate would be faster and the sed version is shorter to type. Note that truncate requires a file to operate on, not a stream. Normally I find sed to lack the power of perl and I much prefer the extended-regex / perl-regex syntax. But this problem has a nice sed solution.

Does shell confuse <space> with <new line> when reading text files?

I tried to run this script:
for line in $(cat song.txt)
do echo "$line" >> out.txt
done
I'm running it on Ubuntu 11.04.
When "song.txt" contains:
I read the news today oh boy
About a lucky man who made the grade
After running the script, "out.txt" looks like this:
I
read
the
news
today
oh
boy
About
a
lucky
man
who
made
the
grade
Can anyone tell me what I am doing wrong here?
For per-line input you should use while read, for example:
cat song.txt | while read line
do
echo "$line" >> out.txt
done
Better (more efficient really) would be the following method:
while read line
do
echo "$line"
done < song.txt > out.txt
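A slightly more robust variant of the same loop (a sketch) turns off backslash interpretation and preserves leading whitespace:
while IFS= read -r line
do
    printf '%s\n' "$line"
done < song.txt > out.txt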
That's because the for command takes every word from the list it is given (in your case, the contents of the song.txt file), whether the words are separated by spaces or newline characters.
What's misleading you here is that your for variable name is line. for only works with words. Reread your script changing line to word and it should make sense.
In the for-each-in loop, the list that is specified is assumed to be whitespace-separated; whitespace includes spaces, newlines, tabs, etc. In your case the list is the entire text of the file, and hence the loop runs once for every word in the file.
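If you do want to keep the for loop, one workaround (a sketch, assuming bash) is to restrict word splitting to newlines and disable globbing:
IFS=$'\n'    # split only on newlines
set -f       # don't glob-expand words such as *
for line in $(cat song.txt)
do echo "$line" >> out.txt
done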

Quick unix command to display specific lines in the middle of a file?

Trying to debug an issue with a server and my only log file is a 20GB log file (with no timestamps even! Why do people use System.out.println() as logging? In production?!)
Using grep, I've found an area of the file that I'd like to take a look at, line 347340107.
Other than doing something like
head -<$LINENUM + 10> filename | tail -20
... which would require head to read through the first 347 million lines of the log file, is there a quick and easy command that would dump lines 347340100 - 347340200 (for example) to the console?
Update: I totally forgot that grep can print the context around a match ... this works well. Thanks!
I found two other solutions if you know the line number but nothing else (no grep possible):
Assuming you need lines 20 to 40,
sed -n '20,40p;41q' file_name
or
awk 'FNR>=20 && FNR<=40' file_name
When using sed it is more efficient to quit processing after having printed the last line than continue processing until the end of the file. This is especially important in the case of large files and printing lines at the beginning. In order to do so, the sed command above introduces the instruction 41q in order to stop processing after line 41 because in the example we are interested in lines 20-40 only. You will need to change the 41 to whatever the last line you are interested in is, plus one.
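Applied to the line numbers from the question, that would be something like (filename standing in for the actual log file):
sed -n '347340100,347340200p;347340201q' filename
awk 'FNR>=347340100 && FNR<=347340200; FNR>347340200{exit}' filename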
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
Method 3 is efficient on large files and is the fastest way to display specific lines.
with GNU-grep you could just say
grep --context=10 ...
No, there isn't; files are not line-addressable.
There is no constant-time way to find the start of line n in a text file. You must stream through the file and count newlines.
Use the simplest/fastest tool you have to do the job. To me, using head makes much more sense than grep, since the latter is way more complicated. I'm not saying "grep is slow", it really isn't, but I would be surprised if it's faster than head for this case. That'd be a bug in head, basically.
What about:
tail -n +347340107 filename | head -n 100
I didn't test it, but I think that would work.
I prefer just going into less and
typing 50% to go to halfway through the file,
43210G to go to line 43210
:43210 to do the same
and stuff like that.
Even better: hit v to start editing (in vim, of course!) at that location. Note that vim has the same key bindings!
You can use the ex command, a standard Unix editor (part of Vim now), e.g.
display a single line (e.g. 2nd one):
ex +2p -scq file.txt
corresponding sed syntax: sed -n '2p' file.txt
range of lines (e.g. 2-5 lines):
ex +2,5p -scq file.txt
sed syntax: sed -n '2,5p' file.txt
from the given line till the end (e.g. 5th to the end of the file):
ex +5,p -scq file.txt
sed syntax: sed -n '5,$p' file.txt
multiple line ranges (e.g. 2-4 and 6-8 lines):
ex +2,4p +6,8p -scq file.txt
sed syntax: sed -n '2,4p;6,8p' file.txt
Above commands can be tested with the following test file:
seq 1 20 > file.txt
Explanation:
+ or -c followed by a command - execute the (vi/vim) command after the file has been read,
-s - silent mode, which also uses the current terminal as the default output,
q following -c is the command to quit the editor (add ! to force quit, e.g. -scq!).
I'd first split the file into a few smaller ones, like this:
$ split --lines=50000 /path/to/large/file /path/to/output/file/prefix
and then grep on the resulting files.
If the line number you need to read is, say, 100:
head -100 filename | tail -1
Get ack
Ubuntu/Debian install:
$ sudo apt-get install ack-grep
Then run:
$ ack --lines=$START-$END filename
Example:
$ ack --lines=10-20 filename
From $ man ack:
--lines=NUM
Only print line NUM of each file. Multiple lines can be given with multiple --lines options or as a comma separated list (--lines=3,5,7). --lines=4-7 also works.
The lines are always output in ascending order, no matter the order given on the command line.
sed will need to read the data too in order to count the lines.
The only way a shortcut would be possible is if there were context/order in the file to operate on, for example if the log lines were prepended with a fixed-width time/date, etc.
You could then use the look Unix utility to binary-search through the files for particular dates/times.
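For example, if the log were sorted and every line began with a fixed-width timestamp, something like this would binary-search straight to it (a sketch; the timestamp prefix is made up):
look "2011-06-07 14:3" sorted.log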
Use
x=`cat -n <file> | grep <match> | awk '{print $1}'`
Here you will get the line number where the match occurred.
Now you can use the following command to print 100 lines
awk -v var="$x" 'NR>=var && NR<=var+100{print}' <file>
or you can use "sed" as well:
sed -n "${x},$((x+100))p" <file>
With sed -e '1,N d; M q' you'll print lines N+1 through M. This is probably a bit better than grep -C, as it doesn't try to match lines to a pattern.
Building on Sklivvz' answer, here's a nice function one can put in a .bash_aliases file. It is efficient on huge files when printing stuff from the front of the file.
function middle()
{
    startidx=$1
    len=$2
    endidx=$(($startidx+$len))
    filename=$3

    awk "FNR>=${startidx} && FNR<=${endidx} { print NR\" \"\$0 }; FNR>${endidx} { print \"END HERE\"; exit }" $filename
}
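Usage would then look something like this for the range in the question (filename standing in for the log file):
middle 347340100 100 filename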
To display a line from a <textfile> by its <line#>, just do this:
perl -wne 'print if $. == <line#>' <textfile>
If you want a more powerful way to show a range of lines with regular expressions -- I won't say why grep is a bad idea for doing this, it should be fairly obvious -- this simple expression will show you your range in a single pass which is what you want when dealing with ~20GB text files:
perl -wne 'print if m/<regex1>/ .. m/<regex2>/' <filename>
(tip: if your regex has / in it, use something like m!<regex>! instead)
This would print out <filename> starting with the line that matches <regex1> up until (and including) the line that matches <regex2>.
It doesn't take a wizard to see how a few tweaks can make it even more powerful.
One last thing: Perl, being a mature language, has many hidden enhancements that favor speed and performance, which makes it an obvious choice for such an operation, since it was originally developed for handling large log files, text, databases, etc.
print line 5
sed -n '5p' file.txt
sed '5q' file.txt
print everything except line 5
sed '5d' file.txt
And my creation, using Google:
#!/bin/bash
# removeline.sh
# removes a line; see the comment below to turn it into "move line" xD
usage() { # Function: Print a help message.
    echo "Usage: $0 -l LINENUMBER -i INPUTFILE [ -o OUTPUTFILE ]"
    echo "line is removed from INPUTFILE"
    echo "line is appended to OUTPUTFILE"
}
exit_abnormal() { # Function: Exit with error.
    usage
    exit 1
}
while getopts l:i:o:b flag
do
    case "${flag}" in
        l) line=${OPTARG};;
        i) input=${OPTARG};;
        o) output=${OPTARG};;
    esac
done
if [ -f tmp ]; then
    echo "Temp file 'tmp' exists. Delete it yourself :)"
    exit
fi
if [ -f "$input" ]; then
    re_isanum='^[0-9]+$'
    if ! [[ $line =~ $re_isanum ]] ; then
        echo "Error: LINENUMBER must be a positive, whole number."
        exit 1
    elif [ $line -eq "0" ]; then
        echo "Error: LINENUMBER must be greater than zero."
        exit_abnormal
    fi
    if [ ! -z $output ]; then
        sed -n "${line}p" $input >> $output
    fi
    if [ ! -z $input ]; then
        # remove this sed command and this becomes "move line to other file"
        sed "${line}d" $input > tmp && cp tmp $input
    fi
fi
if [ -f tmp ]; then
    rm tmp
fi
You could try this command:
egrep -n "^" <filename> | egrep "^<line number>:"
Easy with perl! If you want to get lines 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
I am surprised that only one other answer (by Ramana Reddy) suggested adding line numbers to the output. The following searches for the required line number and colours the output.
file=FILE
lineno=LINENO
wb="107"; bf="30;1"; rb="101"; yb="103"
cat -n ${file} | { GREP_COLORS="se=${wb};${bf}:cx=${wb};${bf}:ms=${rb};${bf}:sl=${yb};${bf}" grep --color -C 10 "^[[:space:]]\\+${lineno}[[:space:]]"; }
