Faster way to merge multiple files - Linux

I have multiple small files in Linux (about 70,000 files) and I want to add a word to the end of each line of the files and then merge them all into a single file.
I'm using this script:
for fn in *.sms.txt
do
    sed 's/$/'$fn'/' $fn >> sms.txt
    rm -f $fn
done
Is there a faster way to do this?

I tried with these files:
for ((i=1;i<70000;++i)); do printf -v fn 'file%.5d.sms.txt' $i; echo -e "HAHA\nLOL\nBye" > "$fn"; done
I tried your solution; it took about 4 minutes (real time) to process. The problem with your solution is that you're forking sed 70,000 times, and forking is rather slow.
#!/bin/bash
filename="sms.txt"
# Create file "$filename" or empty it if it already existed
> "$filename"
# Start editing with ed, the standard text editor
ed -s "$filename" < <(
    # Go into insert mode:
    echo i
    # Loop through files
    for fn in *.sms.txt; do
        # Loop through lines of file "$fn"
        while read l; do
            # Insert line "$l" with "$fn" appended to it
            echo "$l$fn"
        done < "$fn"
    done
    # Tell ed to quit insert mode (.), to save (w) and quit (q)
    echo -e ".\nwq"
)
This solution took ca. 6 seconds.
Don't forget, ed is the standard text editor, and don't overlook it! If you enjoyed ed, you'll probably also enjoy ex!
Cheers!

Almost the same as gniourf_gniourf's solution, but without ed:
for i in *.sms.txt
do
    while read line
    do
        echo $line $i
    done < $i
done > sms.txt

What, no love for awk?
awk '{print $0" "FILENAME}' *.sms.txt >sms.txt
Using gawk, this took 1-2 seconds on gniourf_gniourf's sample on my machine (according to time).
mawk is about 0.2 seconds faster than gawk here.
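If you want to compare them yourself, here is a quick sketch (assuming both gawk and mawk are installed; timings will of course vary with your machine and disk cache):
time gawk '{print $0" "FILENAME}' *.sms.txt > sms.txt
time mawk '{print $0" "FILENAME}' *.sms.txt > sms.txt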

This perl script adds the actual filename at the end of each line.
#!/usr/bin/perl
use strict;
while (<>) {
    chomp;
    print $_, $ARGV, "\n";
}
Call it like this:
scriptname *.sms.txt > sms.txt
Since there is only one process and no regular-expression processing involved, it should be quite fast.

Related

Remove a specific line from a file WITHOUT using sed or awk

I need to remove a specific line number from a file using a bash script.
I get the line number from the grep command with the -n option.
I cannot use sed for a variety of reasons, least of which is that it is not installed on all the systems this script needs to run on and installing it is not an option.
awk is out of the question because in testing, on different machines with different UNIX/Linux OS's (RHEL, SunOS, Solaris, Ubuntu, etc.), it gives (sometimes wildly) different results on each. So, no awk.
The file in question is just a flat text file, with one record per line, so nothing fancy needs to be done, except removing the line by number.
If at all possible, I need to avoid doing something like extracting the contents of the file, not including the line I want gone, and then overwriting the original file.
Since you have grep, the obvious thing to do is:
$ grep -v "line to remove" file.txt > /tmp/tmp
$ mv /tmp/tmp file.txt
$
But it sounds like you don't want to use any temporary files - I assume the input file is large and this is an embedded system where memory and storage are in short supply. I think you ideally need a solution that edits the file in place. I think this might be possible with dd but haven't figured it out yet :(
Update - I figured out how to edit the file in place with dd. Also grep, head and cut are needed. If these are not available then they can probably be worked around for the most part:
#!/bin/bash
# get the line number to remove
rline=$(grep -n "$1" "$2" | head -n1 | cut -d: -f1)
# number of bytes before the line to be removed
hbytes=$(head -n$((rline-1)) "$2" | wc -c)
# number of bytes to remove
rbytes=$(grep "$1" "$2" | wc -c)
# original file size
fsize=$(cat "$2" | wc -c)
# dd will start reading the file after the line to be removed
ddskip=$((hbytes + rbytes))
# dd will start writing at the beginning of the line to be removed
ddseek=$hbytes
# dd will move this many bytes
ddcount=$((fsize - hbytes - rbytes))
# the expected new file size
newsize=$((fsize - rbytes))
# move the bytes with dd. strace confirms the file is edited in place
dd bs=1 if="$2" skip=$ddskip seek=$ddseek conv=notrunc count=$ddcount of="$2"
# truncate the remainder bytes of the end of the file
dd bs=1 if="$2" skip=$newsize seek=$newsize count=0 of="$2"
Run it thusly:
$ cat > file.txt
line 1
line two
line 3
$ ./grepremove "tw" file.txt
7+0 records in
7+0 records out
0+0 records in
0+0 records out
$ cat file.txt
line 1
line 3
$
Suffice to say that dd is a very dangerous tool. You can easily unintentionally overwrite files or entire disks. Be very careful!
Try ed. The here-document-based example below deletes line 2 from test.txt
ed -s test.txt <<!
2d
w
!
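The line number need not be hard-coded; a minimal sketch, assuming $n holds the number you got from grep -n and $file is the file to edit:
ed -s "$file" <<EOF
${n}d
w
q
EOF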
You can do it without grep using POSIX shell builtins, which should be available on any *nix.
while read LINE || [ "$LINE" ]; do
    case "$LINE" in
        *thing_you_are_grepping_for*) continue ;;
        *) echo "$LINE" ;;
    esac
done < infile > outfile
If n is the line you want to omit:
{
    head -n $(( n-1 )) file
    tail -n +$(( n+1 )) file
} > newfile
Given that dd is deemed too dangerous for this in-place line removal, we need some other method where we have fairly fine-grained control over the file system calls. My initial urge is to write something in C, but while possible, I think that is a bit of overkill. Instead it is worth looking to common scripting (not shell-scripting) languages, as these typically have fairly low-level file APIs which map to the file syscalls in a fairly straightforward manner. I'm guessing this can be done using Python, Perl, Tcl or one of many other scripting languages that might be available. I'm most familiar with Tcl, so here we go:
#!/bin/sh
# \
exec tclsh "$0" "$@"
package require Tclx
set removeline [lindex $argv 0]
set filename [lindex $argv 1]
set infile [open $filename RDONLY]
for {set lineNumber 1} {$lineNumber < $removeline} {incr lineNumber} {
    if {[eof $infile]} {
        close $infile
        puts "EOF at line $lineNumber"
        exit
    }
    gets $infile line
}
set bytecount [tell $infile]
gets $infile rmline
set outfile [open $filename RDWR]
seek $outfile $bytecount start
while {[gets $infile line] >= 0} {
    puts $outfile $line
}
ftruncate -fileid $outfile [tell $outfile]
close $infile
close $outfile
Note on my particular box I have Tcl 8.4, so I had to load the Tclx package in order to use the ftruncate command. In Tcl 8.5, there is chan truncate which could be used instead.
You can pass the line number you want to remove and the filename to this script.
In short, the script does this:
open the file for reading
read the first n-1 lines
get the offset of the start of the next line (line n)
read line n
open the file with a new FD for writing
move the file location of the write FD to the offset of the start of line n
continue reading the remaining lines from the read FD and write them to the write FD until the whole read FD is read
truncate the write FD
The file is edited exactly in place. No temporary files are used.
I'm pretty sure this can be re-written in python or perl or ... if necessary.
Update
Ok, so in-place line removal can be done in almost-pure bash, using similar techniques to the Tcl script above. But the big caveat is that you need to have the truncate command available. I do have it on my Ubuntu 12.04 VM, but not on my older Red Hat-based box. Here is the script:
#!/bin/bash
n=$1
filename=$2
exec 3<> $filename
exec 4<> $filename
linecount=1
bytecount=0
while IFS="" read -r line <&3 ; do
    if [[ $linecount == $n ]]; then
        echo "omitting line $linecount: $line"
    else
        echo "$line" >&4
        ((bytecount += ${#line} + 1))
    fi
    ((linecount++))
done
exec 3>&-
exec 4>&-
truncate -s $bytecount $filename
#### or if you can tolerate dd, just to do the truncate:
# dd of="$filename" bs=1 seek=$bytecount count=0
#### or if you have python
# python -c "open(\"$filename\", \"ab\").truncate($bytecount)"
I would love to hear of a more generic (bash-only?) way to do the partial truncate at the end and complete this answer. Of course the truncate can be done with dd as well, but I think that was already ruled out for my earlier answer.
And for the record this site lists how to do an in-place file truncation in many different languages - in case any of these could be used in your environment.
If you can indicate under which circumstances on which platform(s) the most obvious Awk script is failing for you, perhaps we can devise a workaround.
awk "NR!=$N" infile >outfile
Of course, obtaining $N with grep just to feed it to Awk is pretty bass-ackwards. This will delete the line containing the first occurrence of foo:
awk '/foo/ { if (!p++) next } 1' infile >outfile
Based on Digital Trauma's answer, I found an improvement that just needs grep and echo, but no tempfile:
echo $(grep -v PATTERN file.txt) > file.txt
Depending on the kind of lines your file contains and whether your pattern requires a more complex syntax or not, you can enclose the grep command in double quotes:
echo "$(grep -v PATTERN file.txt)" > file.txt
(useful when deleting from your crontab)

How to check if sed has changed a file

I am trying to find a clever way to figure out if the file passed to sed has been altered successfully or not.
Basically, I want to know if the file has been changed or not without having to look at the file modification date.
The reason why I need this is because I need to do some extra stuff if sed has successfully replaced a pattern.
I currently have:
grep -q $pattern $filename
if [ $? -eq 0 ]
then
sed -i s:$pattern:$new_pattern: $filename
# DO SOME OTHER STUFF HERE
else
# DO SOME OTHER STUFF HERE
fi
The above code is a bit expensive and I would love to be able to use some hacks here.
A bit late to the party but for the benefit of others, I found the 'w' flag to be exactly what I was looking for.
sed -i "s/$pattern/$new_pattern/w changelog.txt" "$filename"
if [ -s changelog.txt ]; then
# CHANGES MADE, DO SOME STUFF HERE
else
# NO CHANGES MADE, DO SOME OTHER STUFF HERE
fi
changelog.txt will contain each change (i.e. the changed text) on its own line. If there were no changes, changelog.txt will be zero bytes.
A really helpful sed resource (and where I found this info) is http://www.grymoire.com/Unix/Sed.html.
I believe you may find these GNU sed extensions useful
t label
If a s/// has done a successful substitution since the last input line
was read and since the last t or T command, then branch to label; if
label is omitted, branch to end of script.
and
q [exit-code]
Immediately quit the sed script without processing any more input, except
that if auto-print is not disabled the current pattern space will be printed.
The exit code argument is a GNU extension.
It seems like exactly what you are looking for.
This might work for you (GNU sed):
sed -i.bak '/'"$old_pattern"'/{s//'"$new_pattern"'/;h};${x;/./{x;q1};x}' file || echo changed
Explanation:
/'"$old_pattern"'/{s//'"$new_pattern"'/;h} if the pattern space (PS) contains the old pattern, replace it by the new pattern and copy the PS to the hold space (HS).
${x;/./{x;q1};x} on encountering the last line, swap to the HS and test it for the presence of any string. If a string is found in the HS (i.e. a substitution has taken place) swap back to the original PS and exit using the exit code of 1, otherwise swap back to the original PS and exit with the exit code of 0 (the default).
You can diff the original file with the sed output to see if it changed:
sed -i.bak s:$pattern:$new_pattern: "$filename"
if ! diff "$filename" "$filename.bak" &> /dev/null; then
    echo "changed"
else
    echo "not changed"
fi
rm "$filename.bak"
You could use awk instead:
awk '$0 ~ p { gsub(p, r); t=1 } 1; END{ exit (!t) }' p="$pattern" r="$repl"
I'm ignoring the -i feature: you can use the shell to do redirections as necessary.
Sigh. Many comments below are asking for a basic tutorial on the shell. You can use the above command as follows:
if awk '$0 ~ p { gsub(p, r); t=1 } 1; END{ exit (!t) }' \
        p="$pattern" r="$repl" "$filename" > "${filename}.new"; then
    cat "${filename}.new" > "${filename}"
    # DO SOME OTHER STUFF HERE
else
    # DO SOME OTHER STUFF HERE
fi
It is not clear to me if "DO SOME OTHER STUFF HERE" is the same in each case. Any similar code in the two blocks should be refactored accordingly.
On macOS I just do it as follows:
changes=""
changes+=$(sed -i '' "s/$to_replace/$replacement/g w /dev/stdout" "$f")
if [ "$changes" != "" ]; then
echo "CHANGED!"
fi
I checked, and this is faster than md5, cksum and sha comparisons
I know it is an old question and using awk instead of sed is perhaps the best idea, but if one wants to stick with sed, an idea is to use the w flag of the s command. The file argument to the w flag only contains the lines with a match. So, we only need to check that it is not empty.
perl -sple '$replaced++ if s/$from/$to/g;
END{if($replaced != 0){ print "[Info]: $replaced replacement done in $ARGV(from/to)($from/$to)"}
else {print "[Warning]: 0 replacement done in $ARGV(from/to)($from/$to)"}}' -- -from="FROM_STRING" -to="$DESIRED_STRING" </file/name>
Example:
The command will produce the following output, stating the number of changes made per file.
perl -sple '$replaced++ if s/$from/$to/g;
END{if($replaced != 0){ print "[Info]: $replaced replacement done in $ARGV(from/to)($from/$to)"}
else {print "[Warning]: 0 replacement done in $ARGV(from/to)($from/$to)"}}' -- -from="timeout" -to="TIMEOUT" *
[Info]: 5 replacement done in main.yml(from/to)(timeout/TIMEOUT)
[Info]: 1 replacement done in task/main.yml(from/to)(timeout/TIMEOUT)
[Info]: 4 replacement done in defaults/main.yml(from/to)(timeout/TIMEOUT)
[Warning]: 0 replacement done in vars/main.yml(from/to)(timeout/TIMEOUT)
Note: I have removed -i from the above command, so it will not update the files for people who are just trying out the command. If you want to enable in-place replacement of the files, add -i after perl in the above command.
check if sed has changed MANY files
recursive replace of all files in one directory
produce a list of all modified files
workaround with two stages: match + replace
g='hello.*world'
s='s/hello.*world/bye world/g;'
d='./' # directory of input files
o='modified-files.txt'
grep -r -l -Z -E "$g" "$d" | tee "$o" | xargs -0 sed -i "$s"
the file paths in $o are zero-delimited
$ echo hi > abc.txt
$ sed "s/hi/bye/g; t; q1;" -i abc.txt && (echo "Changed") || (echo "Failed")
Changed
$ sed "s/hi/bye/g; t; q1;" -i abc.txt && (echo "Changed") || (echo "Failed")
Failed
https://askubuntu.com/questions/1036912/how-do-i-get-the-exit-status-when-using-the-sed-command/1036918#1036918
Don't use sed to tell if it has changed a file; instead, use grep to tell if it is going to change a file, then use sed to actually change the file. Notice the single line of sed usage at the very end of the Bash function below:
# Usage: `gs_replace_str "regex_search_pattern" "replacement_string" "file_path"`
gs_replace_str() {
    REGEX_SEARCH="$1"
    REPLACEMENT_STR="$2"
    FILENAME="$3"
    num_lines_matched=$(grep -c -E "$REGEX_SEARCH" "$FILENAME")
    # Count number of matches, NOT lines (`grep -c` counts lines),
    # in case there are multiple matches per line; see:
    # https://superuser.com/questions/339522/counting-total-number-of-matches-with-grep-instead-of-just-how-many-lines-match/339523#339523
    num_matches=$(grep -o -E "$REGEX_SEARCH" "$FILENAME" | wc -l)
    # If num_matches > 0
    if [ "$num_matches" -gt 0 ]; then
        echo -e "\n${num_matches} matches found on ${num_lines_matched} lines in file"\
            "\"${FILENAME}\":"
        # Now show these exact matches with their corresponding line 'n'umbers in the file
        grep -n --color=always -E "$REGEX_SEARCH" "$FILENAME"
        # Now actually DO the string replacing on the files 'i'n place using the `sed`
        # 's'tream 'ed'itor!
        sed -i "s|${REGEX_SEARCH}|${REPLACEMENT_STR}|g" "$FILENAME"
    fi
}
Place that in your ~/.bashrc file, for instance. Close and reopen your terminal and then use it.
Usage:
gs_replace_str "regex_search_pattern" "replacement_string" "file_path"
Example: replace do with bo so that "doing" becomes "boing" (I know, we should be fixing spelling errors not creating them :) ):
$ gs_replace_str "do" "bo" test_folder/test2.txt
9 matches found on 6 lines in file "test_folder/test2.txt":
1:hey how are you doing today
2:hey how are you doing today
3:hey how are you doing today
4:hey how are you doing today hey how are you doing today hey how are you doing today hey how are you doing today
5:hey how are you doing today
6:hey how are you doing today?
$SHLVL:3
References:
https://superuser.com/questions/339522/counting-total-number-of-matches-with-grep-instead-of-just-how-many-lines-match/339523#339523
https://unix.stackexchange.com/questions/112023/how-can-i-replace-a-string-in-a-files/580328#580328

How to change a word in a file with linux shell script

I have a text file which has lots of lines
I have a line in it which is:
MyCar on
how can I turn my car off?
You could use sed:
sed -i 's/MyCar on/MyCar off/' path/to/file
sed 's/MyCar on/MyCar off/' path/to/file > path/to/newfile
more on sed
You can do this with the shell only. The case statement is unnecessary for this particular example, but I included it to show how you could incorporate multiple replacements. Although the code is larger than a sed one-liner, it is typically much faster for small files since it uses only shell builtins (as much as 20x).
REPLACEOLD="old"
WITHNEW="new"
FILE="tmpfile"
OUTPUT=""
while read LINE || [ "$LINE" ]; do
case "$LINE" in
*${REPLACEOLD}*)OUTPUT="${OUTPUT}${LINE//$REPLACEOLD/$WITHNEW}
";;
*)OUTPUT="${OUTPUT}${LINE}
";;
esac
done < "${FILE}"
printf "${OUTPUT}" > "${FILE}"
For the simple case one could omit the case statement:
while read LINE || [ "$LINE" ]; do
    OUTPUT="${OUTPUT}${LINE//$REPLACEOLD/$WITHNEW}
"
done < "${FILE}"
printf "${OUTPUT}" > "${FILE}"
Note: the ...|| [ "$LINE" ]... bit is to prevent losing the last line of a file that doesn't end in a newline
(now you know at least one reason why your text editor keeps adding those)
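A quick demonstration of that pitfall (a sketch; nofinalnewline.txt is just a throwaway file name):
printf 'a\nb' > nofinalnewline.txt                                            # no newline after "b"
while read LINE; do echo "$LINE"; done < nofinalnewline.txt                   # prints only "a"
while read LINE || [ "$LINE" ]; do echo "$LINE"; done < nofinalnewline.txt    # prints "a" and "b"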
Try this command when the file is open in vi/vim:
:%s/old text/new text/g
Using sed with variables:
host=$(hostname)
se1=$(cat /opt/splunkforwarder/etc/system/local/server.conf | grep serverName)
sed -i "s/${se1}/serverName = ${host}/g" /opt/splunkforwarder/etc/system/local/server.conf

Looping through the content of a file in Bash

How do I iterate through each line of a text file with Bash?
With this script:
echo "Start!"
for p in (peptides.txt)
do
echo "${p}"
done
I get this output on the screen:
Start!
./runPep.sh: line 3: syntax error near unexpected token `('
./runPep.sh: line 3: `for p in (peptides.txt)'
(Later I want to do something more complicated with $p than just output to the screen.)
The environment variable SHELL is (from env):
SHELL=/bin/bash
/bin/bash --version output:
GNU bash, version 3.1.17(1)-release (x86_64-suse-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
cat /proc/version output:
Linux version 2.6.18.2-34-default (geeko@buildhost) (gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)) #1 SMP Mon Nov 27 11:46:27 UTC 2006
The file peptides.txt contains:
RKEKNVQ
IPKKLLQK
QYFHQLEKMNVK
IPKKLLQK
GDLSTALEVAIDCYEK
QYFHQLEKMNVKIPENIYR
RKEKNVQ
VLAKHGKLQDAIN
ILGFMK
LEDVALQILL
One way to do it is:
while read p; do
echo "$p"
done <peptides.txt
As pointed out in the comments, this has the side effects of trimming leading whitespace, interpreting backslash sequences, and skipping the last line if it's missing a terminating linefeed. If these are concerns, you can do:
while IFS="" read -r p || [ -n "$p" ]
do
printf '%s\n' "$p"
done < peptides.txt
Exceptionally, if the loop body may read from standard input, you can open the file using a different file descriptor:
while read -u 10 p; do
...
done 10<peptides.txt
Here, 10 is just an arbitrary number (different from 0, 1, 2).
cat peptides.txt | while read line
do
# do something with $line here
done
and the one-liner variant:
cat peptides.txt | while read line; do something_with_$line_here; done
These options will skip the last line of the file if there is no trailing line feed.
You can avoid this by the following:
cat peptides.txt | while read line || [[ -n $line ]];
do
# do something with $line here
done
Option 1a: While loop: Single line at a time: Input redirection
#!/bin/bash
filename='peptides.txt'
echo Start
while read p; do
echo "$p"
done < "$filename"
Option 1b: While loop: Single line at a time:
Open the file, read from a file descriptor (in this case file descriptor #4).
#!/bin/bash
filename='peptides.txt'
exec 4<"$filename"
echo Start
while read -u4 p ; do
echo "$p"
done
This is no better than other answers, but is one more way to get the job done in a file without spaces (see comments). I find that I often need one-liners to dig through lists in text files without the extra step of using separate script files.
for word in $(cat peptides.txt); do echo $word; done
This format allows me to put it all in one command-line. Change the "echo $word" portion to whatever you want and you can issue multiple commands separated by semicolons. The following example uses the file's contents as arguments into two other scripts you may have written.
for word in $(cat peptides.txt); do cmd_a.sh $word; cmd_b.py $word; done
Or if you intend to use this like a stream editor (learn sed) you can dump the output to another file as follows.
for word in $(cat peptides.txt); do cmd_a.sh $word; cmd_b.py $word; done > outfile.txt
I've used these as written above because I have used text files where I've created them with one word per line. (See comments) If you have spaces that you don't want splitting your words/lines, it gets a little uglier, but the same command still works as follows:
OLDIFS=$IFS; IFS=$'\n'; for line in $(cat peptides.txt); do cmd_a.sh $line; cmd_b.py $line; done > outfile.txt; IFS=$OLDIFS
This just tells the shell to split on newlines only, not spaces, then returns the environment back to what it was previously. At this point, you may want to consider putting it all into a shell script rather than squeezing it all into a single line, though.
Best of luck!
A few more things not covered by other answers:
Reading from a delimited file
# ':' is the delimiter here, and there are three fields on each line in the file
# IFS set below is restricted to the context of `read`, it doesn't affect any other code
while IFS=: read -r field1 field2 field3; do
    # process the fields
    # if the line has less than three fields, the missing fields will be set to an empty string
    # if the line has more than three fields, `field3` will get all the values, including the third field plus the delimiter(s)
done < input.txt
Reading from the output of another command, using process substitution
while read -r line; do
# process the line
done < <(command ...)
This approach is better than command ... | while read -r line; do ... because the while loop here runs in the current shell rather than a subshell as in the case of the latter. See the related post A variable modified inside a while loop is not remembered.
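A minimal illustration of that difference (a sketch, reusing peptides.txt from the question):
count=0
cat peptides.txt | while read -r line; do ((count++)); done
echo "$count"    # still 0: the loop body ran in a subshell

count=0
while read -r line; do ((count++)); done < <(cat peptides.txt)
echo "$count"    # 10 (the number of lines): the loop body ran in the current shell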
Reading from a null delimited input, for example find ... -print0
while read -r -d '' line; do
# logic
# use a second 'read ... <<< "$line"' if we need to tokenize the line
done < <(find /path/to/dir -print0)
Related read: BashFAQ/020 - How can I find and safely handle file names containing newlines, spaces or both?
Reading from more than one file at a time
while read -u 3 -r line1 && read -u 4 -r line2; do
# process the lines
# note that the loop will end when we reach EOF on either of the files, because of the `&&`
done 3< input1.txt 4< input2.txt
Based on @chepner's answer here:
-u is a bash extension. For POSIX compatibility, each call would look something like read -r X <&3.
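A POSIX-friendly version of the two-file loop above might look like this (a sketch):
while read -r line1 <&3 && read -r line2 <&4; do
    # process the lines
    printf '%s | %s\n' "$line1" "$line2"
done 3< input1.txt 4< input2.txt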
Reading a whole file into an array (Bash versions earlier to 4)
while read -r line; do
my_array+=("$line")
done < my_file
If the file ends with an incomplete line (newline missing at the end), then:
while read -r line || [[ $line ]]; do
my_array+=("$line")
done < my_file
Reading a whole file into an array (Bash versions 4x and later)
readarray -t my_array < my_file
or
mapfile -t my_array < my_file
And then
for line in "${my_array[#]}"; do
# process the lines
done
More about the shell builtins read and readarray commands - GNU
More about IFS - Wikipedia
BashFAQ/001 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
Related posts:
Creating an array from a text file in Bash
What is the difference between these approaches to reading a file that has just one line?
Bash while read loop extremely slow compared to cat, why?
Use a while loop, like this:
while IFS= read -r line; do
echo "$line"
done <file
Notes:
If you don't set the IFS properly, you will lose indentation.
You should almost always use the -r option with read.
Don't read lines with for
If you don't want your read to lose the last line when there is no terminating newline character, use -
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "$line"
done < "$1"
Then run the script with file name as parameter.
Suppose you have this file:
$ cat /tmp/test.txt
Line 1
 Line 2 has leading space
Line 3 followed by blank line

Line 5 (follows a blank line) and has trailing space 
Line 6 has no ending CR
There are four elements that will alter the meaning of the file output read by many Bash solutions:
The blank line 4;
Leading or trailing spaces on two lines;
Maintaining the meaning of individual lines (i.e., each line is a record);
Line 6 is not terminated with a CR.
If you want the text file line by line including blank lines and terminating lines without CR, you must use a while loop and you must have an alternate test for the final line.
Here are the methods that may change the file (in comparison to what cat returns):
1) Lose the last line and leading and trailing spaces:
$ while read -r p; do printf "%s\n" "'$p'"; done </tmp/test.txt
'Line 1'
'Line 2 has leading space'
'Line 3 followed by blank line'
''
'Line 5 (follows a blank line) and has trailing space'
(If you do while IFS= read -r p; do printf "%s\n" "'$p'"; done </tmp/test.txt instead, you preserve the leading and trailing spaces but still lose the last line if it is not terminated with CR)
2) Using command substitution with cat reads the entire file in one gulp and loses the meaning of individual lines:
$ for p in "$(cat /tmp/test.txt)"; do printf "%s\n" "'$p'"; done
'Line 1
Line 2 has leading space
Line 3 followed by blank line
Line 5 (follows a blank line) and has trailing space
Line 6 has no ending CR'
(If you remove the " from $(cat /tmp/test.txt) you read the file word by word rather than one gulp. Also probably not what is intended...)
The most robust and simplest way to read a file line-by-line and preserve all spacing is:
$ while IFS= read -r line || [[ -n $line ]]; do printf "'%s'\n" "$line"; done </tmp/test.txt
'Line 1'
' Line 2 has leading space'
'Line 3 followed by blank line'
''
'Line 5 (follows a blank line) and has trailing space '
'Line 6 has no ending CR'
If you want to strip leading and trailing spaces, remove the IFS= part:
$ while read -r line || [[ -n $line ]]; do printf "'%s'\n" "$line"; done </tmp/test.txt
'Line 1'
'Line 2 has leading space'
'Line 3 followed by blank line'
''
'Line 5 (follows a blank line) and has trailing space'
'Line 6 has no ending CR'
(A text file without a terminating \n, while fairly common, is considered broken under POSIX. If you can count on the trailing \n you do not need || [[ -n $line ]] in the while loop.)
More at the BASH FAQ
I like to use xargs instead of while. xargs is powerful and command line friendly
cat peptides.txt | xargs -I % sh -c "echo %"
With xargs, you can also add verbosity with -t and validation with -p
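For instance (a sketch): -t echoes each constructed command to stderr before running it, and -p additionally prompts for confirmation:
cat peptides.txt | xargs -I % -t sh -c "echo %"
cat peptides.txt | xargs -I % -p sh -c "echo %"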
This might be the simplest answer and maybe it doesn't work in all cases, but it is working great for me:
while read line;do echo "$line";done<peptides.txt
if you need to enclose the output in quotes because of spaces:
while read line;do echo \"$line\";done<peptides.txt
Ahhh this is pretty much the same as the answer that got upvoted most, but it's all on one line.
#!/bin/bash
#
# Change the file name from "test" to desired input file
# (The comments in bash are prefixed with #'s)
for x in $(cat test.txt)
do
echo $x
done
Here is my real-life example of how to loop over the lines of another program's output, check for substrings, drop the double quotes from a variable, and use that variable outside of the loop. I guess quite many are asking these questions sooner or later.
## Parse FPS from first video stream, drop quotes from fps variable
## streams.stream.0.codec_type="video"
## streams.stream.0.r_frame_rate="24000/1001"
## streams.stream.0.avg_frame_rate="24000/1001"
FPS=unknown
while read -r line; do
    if [[ $FPS == "unknown" ]] && [[ $line == *".codec_type=\"video\""* ]]; then
        echo ParseFPS $line
        FPS=parse
    fi
    if [[ $FPS == "parse" ]] && [[ $line == *".r_frame_rate="* ]]; then
        echo ParseFPS $line
        FPS=${line##*=}
        FPS="${FPS%\"}"
        FPS="${FPS#\"}"
    fi
done <<< "$(ffprobe -v quiet -print_format flat -show_format -show_streams -i "$input")"
if [ "$FPS" == "unknown" ] || [ "$FPS" == "parse" ]; then
    echo ParseFPS Unknown frame rate
fi
echo Found $FPS
Declaring the variable outside of the loop, setting its value inside the loop and using it after the loop requires the done <<< "$(...)" syntax. The application needs to run within the context of the current shell, and the quotes around the command keep the newlines of the output stream.
The loop matches for substrings, then reads the name=value pair, splits off the right-hand side of the last = character, drops the first quote, drops the last quote, and we have a clean value to use elsewhere.
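The quote-stripping on its own, as a tiny standalone sketch:
line='streams.stream.0.r_frame_rate="24000/1001"'
FPS=${line##*=}      # everything after the last '=' -> "24000/1001" (still quoted)
FPS="${FPS%\"}"      # drop the trailing double quote
FPS="${FPS#\"}"      # drop the leading double quote
echo "$FPS"          # -> 24000/1001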
This is coming rather late, but with the thought that it may help someone, I am adding the answer. Also, this may not be the best way. The head command can be used with the -n argument to read n lines from the start of a file, and likewise the tail command can be used to read from the bottom. Now, to fetch the nth line from a file, we head n lines and pipe the data to tail, which gives only 1 line from the piped data.
TOTAL_LINES=`wc -l $USER_FILE | cut -d " " -f1`
echo $TOTAL_LINES # To validate total lines in the file
for (( i=1; i <= $TOTAL_LINES; i++ ))
do
    LINE=`head -n$i $USER_FILE | tail -n1`
    echo $LINE
done
@Peter: This could work out for you -
echo "Start!"; for p in $(cat ./pep); do
    echo $p
done
This would return the output-
Start!
RKEKNVQ
IPKKLLQK
QYFHQLEKMNVK
IPKKLLQK
GDLSTALEVAIDCYEK
QYFHQLEKMNVKIPENIYR
RKEKNVQ
VLAKHGKLQDAIN
ILGFMK
LEDVALQILL
Another way to go about using xargs
<file_name | xargs -I {} echo {}
echo can be replaced with other commands or piped further.
for p in `cat peptides.txt`
do
    echo "${p}"
done

Quick unix command to display specific lines in the middle of a file?

Trying to debug an issue with a server and my only log file is a 20GB log file (with no timestamps even! Why do people use System.out.println() as logging? In production?!)
Using grep, I've found an area of the file that I'd like to take a look at, line 347340107.
Other than doing something like
head -<$LINENUM + 10> filename | tail -20
... which would require head to read through the first 347 million lines of the log file, is there a quick and easy command that would dump lines 347340100 - 347340200 (for example) to the console?
update I totally forgot that grep can print the context around a match ... this works well. Thanks!
I found two other solutions if you know the line number but nothing else (no grep possible):
Assuming you need lines 20 to 40,
sed -n '20,40p;41q' file_name
or
awk 'FNR>=20 && FNR<=40' file_name
When using sed it is more efficient to quit processing after having printed the last line than continue processing until the end of the file. This is especially important in the case of large files and printing lines at the beginning. In order to do so, the sed command above introduces the instruction 41q in order to stop processing after line 41 because in the example we are interested in lines 20-40 only. You will need to change the 41 to whatever the last line you are interested in is, plus one.
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
with GNU-grep you could just say
grep --context=10 ...
No there isn't, files are not line-addressable.
There is no constant-time way to find the start of line n in a text file. You must stream through the file and count newlines.
Use the simplest/fastest tool you have to do the job. To me, using head makes much more sense than grep, since the latter is way more complicated. I'm not saying "grep is slow", it really isn't, but I would be surprised if it's faster than head for this case. That'd be a bug in head, basically.
What about:
tail -n +347340107 filename | head -n 100
I didn't test it, but I think that would work.
I prefer just going into less and
typing 50% to go to halfway through the file,
43210G to go to line 43210
:43210 to do the same
and stuff like that.
Even better: hit v to start editing (in vim, of course!), at that location. Now, note that vim has the same key bindings!
You can use the ex command, a standard Unix editor (part of Vim now), e.g.
display a single line (e.g. 2nd one):
ex +2p -scq file.txt
corresponding sed syntax: sed -n '2p' file.txt
range of lines (e.g. 2-5 lines):
ex +2,5p -scq file.txt
sed syntax: sed -n '2,5p' file.txt
from the given line till the end (e.g. 5th to the end of the file):
ex +5,p -scq file.txt
sed syntax: sed -n '5,$p' file.txt
multiple line ranges (e.g. 2-4 and 6-8 lines):
ex +2,4p +6,8p -scq file.txt
sed syntax: sed -n '2,4p;6,8p' file.txt
Above commands can be tested with the following test file:
seq 1 20 > file.txt
Explanation:
+ or -c followed by the command - execute the (vi/vim) command after file has been read,
-s - silent mode, also uses current terminal as a default output,
q followed by -c is the command to quit editor (add ! to do force quit, e.g. -scq!).
I'd first split the file into few smaller ones like this
$ split --lines=50000 /path/to/large/file /path/to/output/file/prefix
and then grep on the resulting files.
If the line number you want to read is 100:
head -100 filename | tail -1
Get ack
Ubuntu/Debian install:
$ sudo apt-get install ack-grep
Then run:
$ ack --lines=$START-$END filename
Example:
$ ack --lines=10-20 filename
From $ man ack:
--lines=NUM
Only print line NUM of each file. Multiple lines can be given with multiple --lines options or as a comma separated list (--lines=3,5,7). --lines=4-7 also works.
The lines are always output in ascending order, no matter the order given on the command line.
sed will need to read the data too to count the lines.
The only way a shortcut would be possible would be if there were context/order in the file to operate on. For example, if the log lines were prepended with a fixed-width time/date etc.,
you could use the look Unix utility to binary-search through the file for particular dates/times.
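A sketch of that idea, assuming the lines start with a sortable timestamp and the file is sorted on it (the timestamp value and file name here are hypothetical); look prints the lines beginning with the given prefix using a binary search:
look "2013-06-04 12:3" huge.log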
Use
x=`cat -n <file> | grep <match> | awk '{print $1}'`
Here you will get the line number where the match occurred.
Now you can use the following command to print 100 lines
awk -v var="$x" 'NR>=var && NR<=var+100{print}' <file>
or you can use "sed" as well
sed -n "${x},${x+100}p" <file>
With sed -e '1,N d; M q' you'll print lines N+1 through M. This is probably a bit better than grep -C as it doesn't try to match lines to a pattern.
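Applied to the line numbers from the question, that would be (a sketch):
sed -e '1,347340099 d; 347340200 q' filename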
Building on Sklivvz' answer, here's a nice function one can put in a .bash_aliases file. It is efficient on huge files when printing stuff from the front of the file.
function middle()
{
    startidx=$1
    len=$2
    endidx=$(($startidx+$len))
    filename=$3
    awk "FNR>=${startidx} && FNR<=${endidx} { print NR\" \"\$0 }; FNR>${endidx} { print \"END HERE\"; exit }" $filename
}
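For example, with the numbers from the question (a hypothetical invocation), this prints lines 347340100 through 347340200, each prefixed with its line number:
middle 347340100 100 filename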
To display a line from a <textfile> by its <line#>, just do this:
perl -wne 'print if $. == <line#>' <textfile>
If you want a more powerful way to show a range of lines with regular expressions -- I won't say why grep is a bad idea for doing this, it should be fairly obvious -- this simple expression will show you your range in a single pass which is what you want when dealing with ~20GB text files:
perl -wne 'print if m/<regex1>/ .. m/<regex2>/' <filename>
(tip: if your regex has / in it, use something like m!<regex>! instead)
This would print out <filename> starting with the line that matches <regex1> up until (and including) the line that matches <regex2>.
It doesn't take a wizard to see how a few tweaks can make it even more powerful.
Last thing: perl, since it is a mature language, has many hidden enhancements to favor speed and performance. With this in mind, it makes it the obvious choice for such an operation since it was originally developed for handling large log files, text, databases, etc.
print line 5
sed -n '5p' file.txt
sed '5q;d' file.txt
print everything other than line 5
sed '5d' file.txt
and my creation using google
#!/bin/bash
# removeline.sh
# removes a line by number; with an output file it effectively moves the line xD

usage() { # Function: Print a help message.
    echo "Usage: $0 -l LINENUMBER -i INPUTFILE [ -o OUTPUTFILE ]"
    echo "line is removed from INPUTFILE"
    echo "line is appended to OUTPUTFILE"
}

exit_abnormal() { # Function: Exit with error.
    usage
    exit 1
}

while getopts l:i:o:b flag
do
    case "${flag}" in
        l) line=${OPTARG};;
        i) input=${OPTARG};;
        o) output=${OPTARG};;
    esac
done

if [ -f tmp ]; then
    echo "Temp file tmp exists. Delete it yourself :)"
    exit
fi

if [ -f "$input" ]; then
    re_isanum='^[0-9]+$'
    if ! [[ $line =~ $re_isanum ]] ; then
        echo "Error: LINENUMBER must be a positive, whole number."
        exit 1
    elif [ $line -eq "0" ]; then
        echo "Error: LINENUMBER must be greater than zero."
        exit_abnormal
    fi
    if [ ! -z $output ]; then
        sed -n "${line}p" $input >> $output
    fi
    if [ ! -z $input ]; then
        # remove this sed command and the script copies the line instead of moving it
        sed "${line}d" $input > tmp && cp tmp $input
    fi
fi

if [ -f tmp ]; then
    rm tmp
fi
You could try this command:
egrep -n "*" <filename> | egrep "<line number>"
Easy with perl! If you want to get lines 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
I am surprised only one other answer (by Ramana Reddy) suggested adding line numbers to the output. The following searches for the required line number and colours the output.
file=FILE
lineno=LINENO
wb="107"; bf="30;1"; rb="101"; yb="103"
cat -n ${file} | { GREP_COLORS="se=${wb};${bf}:cx=${wb};${bf}:ms=${rb};${bf}:sl=${yb};${bf}" grep --color -C 10 "^[[:space:]]\\+${lineno}[[:space:]]"; }
