How to remove certain lines of a large file (>5G) using Linux commands

I have files which are very large (> 5G), and I want to remove some lines by their line numbers without rewriting (copying) the whole file.
I know this command works for a small file (my sed does not recognize the -i option):
sed "${line}d" file.txt > file.tmp && mv file.tmp file.txt
This command takes a relatively long time because of the size. I mainly need to remove the first line and the last line, but I would also like to know how to remove line number n, for example.

Because of the way files are stored on standard filesystems (NTFS, ext4, ...), you cannot remove parts of a file in place.
The only things you can do in place are:
append to the end of a file (append mode)
modify existing data in a file (read-write mode)
Any other operation must use a temporary file, or read the file fully into memory and write it back modified.
EDIT: you can also "shrink" a file, as described here, using a C program (Linux or Windows would work), which means you could remove the last line (but still not the first line or any line in between).
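For instance, removing just the last line in place can be done from the shell with the truncate utility (a minimal sketch, GNU coreutils assumed; it relies on the file ending with a newline):
# number of bytes in the last line, including its trailing newline
last_len=$(tail -n 1 file.txt | wc -c)
# shrink the file by that many bytes, removing the last line in place
truncate -s "-$last_len" file.txt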

You can use the ed command, which is quite similar to sed:
ed -s file.txt
You can then use the d command: $d will delete the last line, 1d will delete the first one, and wq will write and exit.
The following command does everything at once (deletes the first and last line, writes, and exits):
echo -e '1d\n$d\nwq' | ed -s file.txt
Using sed you can express the same edits as sed '1d;$d' file.txt, but without -i you still have to redirect the output to a temporary file.
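To remove an arbitrary line number n instead, the same ed pattern works (a sketch, assuming n=42 and a file named file.txt):
# delete line 42 in place, then write and quit
n=42
printf '%s\n' "${n}d" w q | ed -s file.txt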

If you are using a recent Linux kernel with a supporting filesystem (e.g. ext4 or XFS), you can remove a range of bytes from any position of a file in place: https://lwn.net/Articles/415889/
The fallocate command exposes this: -c/--collapse-range actually removes the byte range and shrinks the file, whereas -p/--punch-hole only deallocates the range (it still reads back as zeros). Collapse-range requires the offset and length to be multiples of the filesystem block size, so it cannot remove arbitrary text lines directly.
See: https://manpages.ubuntu.com/manpages/xenial/man1/fallocate.1.html
For example: fallocate -c -o 10G -l 1G qqq
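A minimal sketch of the collapse-range approach (GNU stat and a supporting filesystem assumed; note that it removes whole blocks, not lines):
# filesystem block size for the file, e.g. 4096
blksz=$(stat -f -c '%S' bigfile)
# remove the first blksz bytes of the file in place, regardless of line boundaries
fallocate --collapse-range --offset 0 --length "$blksz" bigfile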

Related

How to remove 1st empty column from file using bash commands?

I have an output file that can be read by a visualizing tool; the problem is that at the beginning of each line there is one space. Because of this the visualizing tool isn't able to recognize the scripts and hence crashes. When I manually remove the first column of spaces, it works fine. Is there a way to remove the 1st empty column of spaces using a bash command?
What I want is to remove that excess column of empty space using a bash command.
At present I use the Vim editor to remove the 1st column manually, but I would like to do it using a bash command so that I can automate the process. My file is not just columns; it also has some independent data lines.
Using cut or sed would be two simple solutions.
cut
cut -c2- file > file.cut && mv file.cut file
cut cannot modify a file, therefore you need to redirect its output to a different file and then overwrite the old file with the new file.
sed
sed -i 's/^.//' file
-i modifies the file in-place.
I would use
sed -i 's/^ //' file
to remove just one leading space (lines that do not start with a space are left untouched). Note that sed -ie (no space) is not the same as sed -i -e: GNU sed treats the e as a backup suffix and creates a backup file named filee.
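If you prefer awk, a roughly equivalent sketch (plain awk has no in-place mode, so it goes through a temporary file):
awk '{ sub(/^ /, ""); print }' file > file.tmp && mv file.tmp file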

Improve performance of Bash loop that removes windows line endings

Editor's note: This question was always about loop performance, but the original title led some answerers - and voters - to believe it was about how to remove Windows line endings.
The bash loop below just removes the Windows line endings, converting them to Unix, and appears to run, but it is slow. The input files are small (4 files ranging from 167 bytes to 1 KB), all with the same structure (a list of names); the only thing that varies is the length (i.e. some files have 10 names, others 50). Is it supposed to take over 15 minutes to complete this task on a Xeon processor? Thank you :)
for f in /home/cmccabe/Desktop/files/*.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
sed 's/\r//' $f - $f > /home/cmccabe/Desktop/files/${pref}_unix.txt
done
Input .txt files
AP3B1
BRCA2
BRIP1
CBL
CTC1
EDIT
This is not a duplicate: I was asking why my bash loop that uses sed to remove Windows line endings was running so slowly, not how to remove the line endings. I was asking for ideas that might speed up the loop, and I got many. Thank you :). I hope this helps.
Use the utilities dos2unix and unix2dos to convert between unix and windows style line endings.
Your sed command looks wrong. I believe the trailing $f - $f should simply be $f (the - makes sed also wait on standard input). Running your script as written hangs for a very long time on my system, but making this change causes it to complete almost instantly.
Of course, the best answer is to use dos2unix, which was designed to handle this exact thing:
cd /home/cmccabe/Desktop/files
for f in *.txt ; do
pref=$(basename -s '.txt' "$f")
dos2unix -q -n "$f" "${pref}_unix.txt"
done
This always works for me:
perl -pe 's/\r\n/\n/' inputfile.txt > outputfile.txt
You can use dos2unix as stated before, or this small sed command:
sed 's/\r//' file
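If dos2unix is not available, tr is another common option (a sketch; it deletes every CR byte in the input, not only those that precede LF):
tr -d '\r' < infile.txt > outfile.txt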
The key to performance in Bash is to avoid loops in general, and in particular those that call one or more external utilities in each iteration.
Here is a solution that uses a single GNU awk command:
awk -v RS='\r\n' '
BEGINFILE { outFile=gensub("\\.txt$", "_unix&", 1, FILENAME) }
{ print > outFile }
' /home/cmccabe/Desktop/files/*.txt
-v RS='\r\n' sets CRLF as the input record separator; by leaving ORS, the output record separator, at its default of \n, simply printing each input line terminates it with \n.
the BEGINFILE block is executed every time processing of a new input file starts; in it, gensub() is used to insert _unix before the .txt suffix of the input file at hand to form the output filename.
{print > outFile} simply prints the \n-terminated lines to the output file at hand.
Note that use of a multi-char. RS value, the BEGINFILE block, and the gensub() function are GNU extensions to the POSIX standard.
Switching from the OP's sed solution to a GNU awk-based one was necessary in order to provide a single-command solution that is both simpler and faster.
Alternatively, here's a solution that relies on dos2unix for conversion of Windows line endings (for instance, you can install dos2unix with sudo apt-get install dos2unix on Debian-based systems); except for requiring dos2unix, it should work on most platforms (no GNU utilities required):
It uses a loop only to construct the array of filename arguments to pass to dos2unix - this should be fast, given that no call to basename is involved; Bash-native parameter expansion is used instead.
then uses a single invocation of dos2unix to process all files.
# cd to the target folder, so that the operations below do not need to handle
# path components.
cd '/home/cmccabe/Desktop/files'
# Collect all *.txt filenames in an array.
inFiles=( *.txt )
# Derive output filenames from it, using Bash parameter expansion:
# '%.txt' matches '.txt' at the end of each array element, and replaces it
# with '_unix.txt', effectively inserting '_unix' before the suffix.
outFiles=( "${inFiles[#]/%.txt/_unix.txt}" )
# Create an interleaved array of *input-output filename pairs* to be passed
# to dos2unix later.
# To inspect the resulting array, run `printf '%s\n' "${fileArgs[@]}"`
# You'll see pairs like these:
# file1.txt
# file1_unix.txt
# ...
fileArgs=(); i=0
for inFile in "${inFiles[#]}"; do
fileArgs+=( "$inFile" "${outFiles[i++]}" )
done
# Now, use a *single* invocation of dos2unix, passing all input-output
# filename pairs at once.
dos2unix -q -n "${fileArgs[#]}"

How can I remove the last character of a file in unix?

Say I have some arbitrary multi-line text file:
sometext
moretext
lastline
How can I remove only the last character (the e, not the newline or null) of the file without making the text file invalid?
A simpler approach (outputs to stdout, doesn't update the input file):
sed '$ s/.$//' somefile
$ is a Sed address that matches the last input line only, thus causing the following function call (s/.$//) to be executed on the last line only.
s/.$// replaces the last character on the (in this case last) line with an empty string; i.e., effectively removes the last char. (before the newline) on the line.
. matches any character on the line, and following it with $ anchors the match to the end of the line; note how the use of $ in this regular expression is conceptually related, but technically distinct from the previous use of $ as a Sed address.
Example with stdin input (assumes Bash, Ksh, or Zsh):
$ sed '$ s/.$//' <<< $'line one\nline two'
line one
line tw
To update the input file too (do not use if the input file is a symlink):
sed -i '$ s/.$//' somefile
Note:
On macOS, you'd have to use -i '' instead of just -i (an example follows after this note); for an overview of the pitfalls associated with -i, see the bottom half of this answer.
If you need to process very large input files and/or performance / disk usage are a concern and you're using GNU utilities (Linux), see ImHere's helpful answer.
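For example, the in-place form on macOS (BSD sed) would look like this (a sketch):
sed -i '' '$ s/.$//' somefile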
truncate
truncate -s-1 file
Removes one byte (-1) from the end of the file, operating on the file in place, just as >> appends to the same file.
The problem with this approach is that it does not preserve a trailing newline: if the file ends with a newline, -1 removes the newline itself rather than the last visible character.
The solution is:
if [ -n "$(tail -c1 file)" ] # if the file has not a trailing new line.
then
truncate -s-1 file # remove one char as the question request.
else
truncate -s-2 file # remove the last two characters
echo "" >> file # add the trailing new line back
fi
This works because tail takes the last byte (not char).
It takes almost no time even with big files.
Why not sed
The problem with a sed solution like sed '$ s/.$//' file is that it reads and rewrites the whole file (which takes a long time with large files), and you need a temporary file of about the same size as the original:
sed '$ s/.$//' file > tempfile
rm file; mv tempfile file
And then move the tempfile to replace the file.
Here's another using ex, which I find not as cryptic as the sed solution:
printf '%s\n' '$' 's/.$//' wq | ex somefile
The $ goes to the last line, the s deletes the last character, and wq is the well known (to vi users) write+quit.
After a whole bunch of playing around with different strategies (and avoiding sed -i and perl), the best way I found to do this was:
sed '$! { P; D; }; s/.$//' somefile
If the goal is to remove the last character in the last line, this awk should do:
awk '{a[NR]=$0} END {for (i=1;i<NR;i++) print a[i];sub(/.$/,"",a[NR]);print a[NR]}' file
sometext
moretext
lastlin
It stores all the data in an array, then prints it out, changing the last line.
Just a remark: sed used this way effectively replaces the file with a new one.
So if you are tailing the file, you may get a "No such file or directory" warning until you reissue the tail command.
EDITED ANSWER
I created a test file on my Desktop containing your text and saved it as "old_file.txt":
sometext
moretext
lastline
Afterwards I wrote a small script to take the old file and eliminate the last character of the last line:
#!/bin/bash
# count the newlines; the script assumes the file does not end with a trailing
# newline, so the last line is newline count + 1
no_of_new_line_characters=$(wc -l < '/root/Desktop/old_file.txt')
let "no_of_lines=no_of_new_line_characters+1"
sed -n 1,"$no_of_new_line_characters"p '/root/Desktop/old_file.txt' > '/root/Desktop/my_new_file'
sed -n "$no_of_lines","$no_of_lines"p '/root/Desktop/old_file.txt'|sed 's/.$//g' >> '/root/Desktop/my_new_file'
Opening the new file I created showed the following output:
sometext
moretext
lastlin
I apologize for my previous answer (wasn't reading carefully)
sed '$ s/.$//' filename | tee newFilename
This should do the job; the $ address restricts the change to the last line (without it, the last character of every line would be removed).
A couple perl solutions, for comparison/reference:
(echo 1a; echo 2b) | perl -e '$_=join("",<>); s/.$//; print'
(echo 1a; echo 2b) | perl -e 'while(<>){ if(eof) {s/.$//}; print }'
I find the first read-whole-file-into-memory approach generally quite useful (less so for this particular problem). You can then write regexes that span multiple lines, for example to combine every 3 lines of a certain format into 1 summary line (a sketch follows below).
For this problem, truncate would be faster and the sed version is shorter to type. Note that truncate requires a file to operate on, not a stream. Normally I find sed to lack the power of perl and I much prefer the extended-regex / perl-regex syntax, but this problem has a nice sed solution.
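As an illustration of that slurp approach (a sketch, unrelated to the last-character problem): join every group of three lines into one space-separated line:
perl -0777 -pe 's/(.*)\n(.*)\n(.*)\n/$1 $2 $3\n/g' infile.txt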

using Linux shell script to edit and rename a file

I am trying to use Vi or ex from a script to edit a file: delete the first five lines, replace x with y, remove extra spaces at the end of each line (but keep the line endings), and remove the last eight lines of the file, then rename the file into a shell script and run the new script from the current script.
This will be something scheduled in cron. I have been looking for a simple way to do it using the command line, a Vim script, or something similar.
Any ideas? The format of the input file does not change, just the number of lines, so I can't hard-code the line numbers for the last eight lines.
You actually have about half a dozen questions here. Here's an answer for the first five which are probably the ones you'll have the most difficulty solving:
sed -e ':label' -n -e '1d' -e 's/x/y/g' -e 's/[ \t]*$//g' -e '1,9!{P;N;D};N;b label' file.txt > script.sh
Vi is an interactive editor. You probably don't want to use it for something that'll be run by cron. Also, I agree with the comments saying this is probably a bad idea. Be that as it may:
printf 'one\ntwo\nthree\nfour\nfive\necho x \n1\n2\n3\n4\n5\n6\n7\n8\n' \
| sed '1,5d;s/ *$//;s/x/y/' \
| tail -r | sed 1,8d | tail -r \
| sh
Our first sed script does most of the work. We reverse the lines with tail -r, then delete the first 8 lines, then reverse again. That trims off the last 8 lines.
Note that tail -r is a BSD extension and is not available in GNU coreutils; on Linux systems, use tac, which reverses lines, instead.
Also, the final | sh simply runs the output. If you REALLY want to save this as a script, you can do that by redirecting the output to a file ... but I'll leave at least that to your imagination. Can't do all your scripting for you, can we?! :-)
To edit a file from a script, you could use ed (even if it is hard to learn and remember).
You could also use some scripting language (Python, Perl, AWK, Ruby) to achieve your goal.
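For instance, a minimal ed sketch for this particular task (assuming the file is named file.txt and that only spaces, not tabs, trail the lines):
printf '%s\n' '1,5d' '$-7,$d' 'g/x/s//y/g' ',s/ *$//' w q | ed -s file.txt
Renaming the result and running it is then ordinary shell, e.g. mv file.txt newscript.sh && sh newscript.sh.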

Replace whole line containing a string using Sed

I have a text file which has a particular line something like
sometext sometext sometext TEXT_TO_BE_REPLACED sometext sometext sometext
I need to replace the whole line above with
This line is removed by the admin.
The search keyword is TEXT_TO_BE_REPLACED
I need to write a shell script for this. How can I achieve this using sed?
You can use the change command to replace the entire line, and the -i flag to make the changes in-place. For example, using GNU sed:
sed -i '/TEXT_TO_BE_REPLACED/c\This line is removed by the admin.' /tmp/foo
You need to use wildcards (.*) before and after to replace the whole line:
sed 's/.*TEXT_TO_BE_REPLACED.*/This line is removed by the admin./'
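For example, applied in place to a file (GNU sed assumed, using /tmp/foo as in the accepted answer):
sed -i 's/.*TEXT_TO_BE_REPLACED.*/This line is removed by the admin./' /tmp/foo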
The answer above,
sed -i '/TEXT_TO_BE_REPLACED/c\This line is removed by the admin.' /tmp/foo
works fine if the replacement string/line is not a variable.
The issue is that on Red Hat 5 the \ after the c escapes the $. A double \\ did not work either (at least on Red Hat 5).
Through trial and error, I discovered that the \ after the c is redundant if your replacement string/line is only a single line. So I did not use a \ after the c, used a variable as a single replacement line, and it worked.
The code would look something like:
sed -i "/TEXT_TO_BE_REPLACED/c $REPLACEMENT_TEXT_STRING" /tmp/foo
Note the use of double quotes instead of single quotes.
The accepted answer did not work for me for several reasons:
my version of sed does not like -i with a zero length extension
the syntax of the c\ command is weird and I couldn't get it to work
I didn't realize some of my issues are coming from unescaped slashes
So here is the solution I came up with which I think should work for most cases:
function escape_slashes {
    sed 's/\//\\\//g'
}

function change_line {
    local OLD_LINE_PATTERN=$1; shift
    local NEW_LINE=$1; shift
    local FILE=$1
    local NEW=$(echo "${NEW_LINE}" | escape_slashes)
    # FIX: No space after the option i.
    sed -i.bak '/'"${OLD_LINE_PATTERN}"'/s/.*/'"${NEW}"'/' "${FILE}"
    mv "${FILE}.bak" /tmp/
}
So the sample usage to fix the problem posed:
change_line "TEXT_TO_BE_REPLACED" "This line is removed by the admin." yourFile
All of the answers provided so far assume that you know something about the text to be replaced which makes sense, since that's what the OP asked. I'm providing an answer that assumes you know nothing about the text to be replaced and that there may be a separate line in the file with the same or similar content that you do not want to be replaced. Furthermore, I'm assuming you know the line number of the line to be replaced.
The following examples demonstrate the removing or changing of text by specific line numbers:
# replace line 17 with some replacement text and make changes in file (-i switch)
# the "-i" switch indicates that we want to change the file. Leave it out if you'd
# just like to see the potential changes output to the terminal window.
# "17s" indicates that we're searching line 17
# ".*" indicates that we want to change the text of the entire line
# "REPLACEMENT-TEXT" is the new text to put on that line
# "PATH-TO-FILE" tells us what file to operate on
sed -i '17s/.*/REPLACEMENT-TEXT/' PATH-TO-FILE
# replace specific text on line 3
sed -i '3s/TEXT-TO-REPLACE/REPLACEMENT-TEXT/' PATH-TO-FILE
For manipulating config files,
I came up with this solution, inspired by skensell's answer:
configLine [searchPattern] [replaceLine] [filePath]
It will:
create the file if it does not exist
replace the whole line (every line) where searchPattern matches
add replaceLine at the end of the file if the pattern was not found
A usage example is shown after the function and the link below.
Function:
function configLine {
    local OLD_LINE_PATTERN=$1; shift
    local NEW_LINE=$1; shift
    local FILE=$1
    local NEW=$(echo "${NEW_LINE}" | sed 's/\//\\\//g')
    touch "${FILE}"
    sed -i '/'"${OLD_LINE_PATTERN}"'/{s/.*/'"${NEW}"'/;h};${x;/./{x;q100};x}' "${FILE}"
    if [[ $? -ne 100 ]] && [[ ${NEW_LINE} != '' ]]
    then
        echo "${NEW_LINE}" >> "${FILE}"
    fi
}
the crazy exit status magic comes from https://stackoverflow.com/a/12145797/1262663
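A hypothetical usage example of configLine (the file name and values are made up):
configLine 'DB_HOST=' 'DB_HOST=127.0.0.1' /etc/myapp.conf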
In my makefile I use this:
@sed -i '/.*Revision:.*/c\'"`svn info -R main.cpp | awk '/^Rev/'`"'' README.md
PS: DO NOT forget that -i actually changes the text in the file... so make sure the text you match with the "Revision" pattern still matches after the replacement, otherwise later runs will not find it.
Example output:
Abc-Project written by John Doe
Revision: 1190
So if you set the pattern to "Revision: 1190", it is obviously not the same as defining it as just "Revision:".
bash-4.1$ new_db_host="DB_HOSTNAME=good replaced with 122.334.567.90"
bash-4.1$
bash-4.1$ sed -i "/DB_HOST/c $new_db_host" test4sed
vim test4sed
'
'
'
DB_HOSTNAME=good replaced with 122.334.567.90
'
it works fine
To do this without relying on any GNUisms such as -i without a parameter or c without a linebreak:
sed '/TEXT_TO_BE_REPLACED/c\
This line is removed by the admin.
' infile > tmpfile && mv tmpfile infile
In this (POSIX compliant) form of the command
c\
text
text can consist of one or multiple lines, and linebreaks that should become part of the replacement have to be escaped:
c\
line1\
line2
s/x/y/
where s/x/y/ is a new sed command after the pattern space has been replaced by the two lines
line1
line2
cat find_replace | while read -r pattern replacement ; do
sed -i "/${pattern}/c ${replacement}" file
done
The find_replace file contains 2 columns: c1 with the pattern to match, and c2 (the rest of the line) with the replacement. The sed loop replaces each line containing one of the patterns from column 1.
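For illustration, find_replace might contain lines like these (hypothetical values); the first whitespace-separated field is the pattern and the rest of the line is the replacement:
TEXT_TO_BE_REPLACED This line is removed by the admin.
DB_HOST DB_HOST=127.0.0.1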
To replace the whole of each line containing a specified string with content captured from that line:
Text file:
Row: 0 last_time_contacted=0, display_name=Mozart, _id=100, phonebook_bucket_alt=2
Row: 1 last_time_contacted=0, display_name=Bach, _id=101, phonebook_bucket_alt=2
Single string:
$ sed 's/.* display_name=\([[:alpha:]]\+\).*/\1/'
output:
Mozart
Bach
Multiple strings delimited by white-space:
$ sed 's/.* display_name=\([[:alpha:]]\+\).* _id=\([[:digit:]]\+\).*/\1 \2/'
output:
Mozart 100
Bach 101
Adjust the regex to meet your needs.
[:alpha:] and [:digit:]
are character classes and bracket expressions.
This worked for me:
sed -i <extension> 's/.*<Line to be replaced>.*/<New line to be added>/' <file>
An example is:
sed -i .bak -e '7s/.*version.*/ version = "4.33.0"/' <file>
-i: The extension for the backup file made before the replacement. In this case, it is .bak. (BSD/macOS sed takes the extension as a separate argument, as shown; with GNU sed it is attached, e.g. -i.bak.)
-e: The sed script. In this case, it is '7s/.*version.*/ version = "4.33.0"/'. If you want to use a sed script file, use the -f flag instead.
7s: The address before the s command is the line number to operate on; 7s means "substitute on line 7".
Note:
If you want to do a recursive find and replace with sed, you can feed the matching files from grep to sed via xargs:
grep -rl --exclude-dir=<directory-to-exclude> --include=\*<Files to include> "<Line to be replaced>" ./ | xargs sed -i <extension> 's/.*<Line to be replaced>.*/<New line to be added>/'
The question asks for solutions using sed, but if that's not a hard requirement then there is another option which might be a wiser choice.
The accepted answer suggests sed -i and describes it as replacing the file in-place, but -i doesn't really do that and instead does the equivalent of sed pattern file > tmp; mv tmp file, preserving ownership and modes. This is not ideal in many circumstances. In general I do not recommend running sed -i non-interactively as part of an automatic process--it's like setting a bomb with a fuse of an unknown length. Sooner or later it will blow up on someone.
To actually edit a file "in place" and replace a line matching a pattern with some other content you would be well served to use an actual text editor. This is how it's done with ed, the standard text editor.
printf '%s\n' '/TEXT_TO_BE_REPLACED/' d i 'This line is removed by the admin' . w q | \
ed -s /tmp/foo > /dev/null
Note that this only replaces the first matching line, which is what the question implied was wanted. This is a material difference from most of the other answers.
That disadvantage aside, there are some advantages to using ed over sed:
You can replace the match with one or multiple lines without any extra effort.
The replacement text can be arbitrarily complex without needing any escaping to protect it.
Most importantly, the original file is opened, modified, and saved. A copy is not made.
How it works:
printf will use its first argument as a format string and print each of its other arguments using that format, effectively meaning that each argument to printf becomes a line of output, which is all sent to ed on stdin.
The first line is a regex pattern match which causes ed to move its notion of "the current line" forward to the first line that matches (if there is no match the current line is set to the last line of the file).
The next is the d command which instructs ed to delete the entire current line.
After that is the i command which puts ed into insert mode;
after that all subsequent lines entered are written to the current line (or additional lines if there are any embedded newlines). This means you can expand a variable (e.g. "$foo") containing multiple lines here and it will insert all of them.
Insert mode ends when ed sees a line consisting of .
The w command writes the content of the file to disk, and
the q command quits.
The ed command is given the -s switch, putting it into silent mode so it doesn't echo any information as it runs,
the file to be edited is given as an argument to ed,
and, finally, stdout is thrown away to prevent the line matching the regex from being printed.
Some Unix-like systems may (inappropriately) ship without ed installed, but may still ship with ex; if so, you can simply use it instead. If you have vim but no ex or ed, you can use vim -e instead. If you have only standard vi but no ex or ed, complain to your sysadmin.
It is similar to the one above:
sed 's/[A-Za-z0-9]*TEXT_TO_BE_REPLACED.[A-Za-z0-9]*/This line is removed by the admin./'
The command below works for me, and it works with variables:
sed -i "/\<$E\>/c $D" "$B"
I very often use regexes to extract data from files. Here I just used that to replace the literal quote \" with nothing :-)
cat file.csv | egrep '^\"([0-9]{1,3}\.[0-9]{1,3}\.)' | sed s/\"//g | cut -d, -f1 > list.txt
