Splitting a file and its lines under Linux/bash

I have a rather large file (150 million lines of 10 chars). I need to split it in 150 files of 2 million lines, with each output line being alternatively the first 5 characters or the last 5 characters of the source line.
I could do this in Perl rather quickly, but I was wondering if there was an easy solution using bash.
Any ideas?

Homework? :-)
I would think that a simple pipe with sed (to split each line into two) and split (to split things up into multiple files) would be enough.
The man command is your friend.
Added after confirmation that it is not homework:
How about
sed 's/\(.....\)\(.....\)/\1\n\2/' input_file | split -l 2000000 - out-prefix-
?
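To sanity-check what that sed expression does to a single 10-character line (ABCDEfghij is just a made-up sample; note that \n in the replacement is a GNU sed extension, so other seds may need a literal newline instead):
printf 'ABCDEfghij\n' | sed 's/\(.....\)\(.....\)/\1\n\2/'
ABCDE
fghij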

I think that something like this could work:
out_file=1
out_pairs=0
while IFS= read -r line; do
  if [ "$out_pairs" -ge 1000000 ]; then  # 1,000,000 pairs = 2,000,000 output lines per file
    out_file=$((out_file + 1))
    out_pairs=0
  fi
  echo "${line%?????}" >> "out${out_file}"  # first 5 characters
  echo "${line#?????}" >> "out${out_file}"  # last 5 characters
  out_pairs=$((out_pairs + 1))
done < "$in_file"
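To run it (a sketch; split_pairs.sh is a hypothetical name for a file holding the loop above, which reads its input path from the in_file variable):
in_file=bigfile.txt bash split_pairs.sh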
Not sure if it's simpler or more efficient than using Perl, though.

First-five-chars-of-each-line variant, assuming the large file is called x.txt and that it's OK to create files in the current directory with names x.txt.* :
split -l 2000000 x.txt x.txt.out && (for splitfile in x.txt.out*; do outfile="${splitfile}.firstfive"; echo "$splitfile -> $outfile"; cut -c 1-5 "$splitfile" > "$outfile"; done)
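The last-five-chars variant would be the same pipeline with cut -c 6-10 (an untested sketch, mirroring the command above):
split -l 2000000 x.txt x.txt.out && (for splitfile in x.txt.out*; do outfile="${splitfile}.lastfive"; echo "$splitfile -> $outfile"; cut -c 6-10 "$splitfile" > "$outfile"; done)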

Why not just use the native Linux split utility?
split -d -l 999999 input_filename
This will output the split files with names like x00, x01, x02, ...
for more info see the manual
man split

Related

Problem with splitting files based on numeric suffix

I have a file called files.txt and I need to split it based on lines. The command is as follows -
split -l 1 files.txt file --numeric-suffixes=1 --suffix-length=4
The numeric suffixes here run from file0001 to file9000, but I want them to run from 1 to 9000.
I can't just set --suffix-length=1, since split then exhausts its output file names. Any suggestions using the same split command?
I don't think split will do what you want it to do, though I'm on macOS, so the *nix I'm using is Darwin not Linux; however, a simple shell script would do the trick:
#!/bin/bash
N=1
while IFS= read -r line
do
echo "$line" > "file$N"
N=$((N + 1))
done < "$1"
Assuming you save it as mysplit (don't forget chmod +x mysplit), then you run it:
./mysplit files.txt
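Alternatively, if GNU split is available, you could keep the command from the question and strip the zero padding afterward; the 10# prefix forces base 10 so that a suffix like 0008 is not read as octal (a sketch, assuming the file0001 ... file9000 names from the question):
for f in file0*; do mv "$f" "file$((10#${f#file}))"; done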

Can you use 'less' or 'more' to output one page worth of text?

So in Linux, less is used to read files page by page for better readability. I was wondering if you can do something like less file.txt > output.txt to get one page worth of file.txt and write it to output.txt.
Apparently this does not work; output.txt is exactly the same as the original file. I'm wondering why this is the case, and if there are other work-arounds. Thank you!
When its output is not a terminal, less simply copies its input through like cat, which is why the redirection gives you the whole file. You can use the split command instead.
split -l 100 -d -a 3 input output
This will split the input file every 100 lines (-l 100), use numeric suffixes (-d), and use a 3-digit suffix (-a 3) in the output file names: output000, output001, output002, and so on.
You can use head to get a specific number of lines, and tput lines to see how many lines there are on your current terminal.
Here's a script that fetches a pageful, or the standard 25 lines if no terminal is available:
#!/bin/bash
lines=$(tput lines) || lines=25
head -n "$lines" file.txt > output.txt
You can use head and tail to get n lines from the top or bottom of a file:
tail -n 20 /var/log/messages
head -n 10 src/main.h

Splitting bulk text file every n line

I have a folder that contains multiple text files. I'm trying to split all text files at 10000 line per file while keeping the base file name i.e. if filename1.txt contains 20000 lines the output will be filename1-1.txt (10000 lines) and filename1-2.txt (10000 lines).
I tried to use split -10000 filename1.txt but this is not keeping the base filename, and I have to repeat the command for each text file in the folder. I also tried doing for f in *.txt; do split -10000 $f.txt; done. This didn't work either.
Any idea how I can do this? Thanks.
for f in filename*.txt; do split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"; done
Or, written over multiple lines:
for f in filename*.txt
do
split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"
done
How it works:
-d tells split to use numeric suffixes
-a1 tells split to use a single character (here, a single digit) for the suffix.
-l10000 tells split to split every 10,000 lines.
--additional-suffix=.txt tells split to add .txt to the end of the names of the new files.
"$f" tells split the name of the file to split.
"${f%.txt}-" tells split the prefix name to use for the split files.
Example
Suppose that we start with these files:
$ ls
filename1.txt filename2.txt
Then we run our command:
$ for f in filename*.txt; do split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"; done
When this is done, we now have the original files and the new split files:
$ ls
filename1-0.txt filename1-1.txt filename1.txt filename2-0.txt filename2-1.txt filename2.txt
Using older, less featureful forms of split
If your split does not offer --additional-suffix, then consider:
for f in filename*.txt
do
split -d -a1 -l10000 "$f" "${f%.txt}-"
for g in "${f%.txt}-"*
do
mv "$g" "$g.txt"
done
done
No need for shell loops; one simple awk command does it for all files:
awk 'FNR%10000==1{if(FNR==1)c=0; close(out); out=FILENAME; sub(/\.txt$/,"-"++c".txt",out)} {print > out}' *.txt
If a file is going to produce more output pieces than single-digit suffixes allow, you might need to add something like --suffix-length=3 to the split command in place of -a1.

linux cron truncate large file

I have a large csv file that I need to reduce to the last 1000 lines through a cron job every day.
Can anyone suggest how to accomplish this?
What I have so far are two commands but do not know how to combine them
For deleting lines from the beginning of the file, the command is
ed -s file.csv <<< $'1,123d\nwq'
where 123 is the number of lines needed to delete from the beginning of the file
For reading the number of lines in the file the command is
wc -l file.csv
I would need to subtract 1000 from this and pass the result to the first command
Is there any way to combine the result of wc command in the ed command?
Thank you in advance
Assuming bash is the shell and 'file' is the file (and that it exists):
sed -i "1,$(( $(wc -l < file) - 1000 ))d" file
Edit: the brief version above will not work cleanly for files with 1000 or fewer lines. A more robust script, handling all .csv files in a particular directory:
#!/usr/bin/env bash
DIR=/path/to/csv/files
N=1000
shopt -s nullglob  # skip the loop entirely if there are no .csv files
for csv in "$DIR"/*.csv; do
L=$(wc -l < "$csv")
[ "$L" -le "$N" ] && continue
sed -i "1,$((L - N))d" "$csv"
done
Next edit: shopt -s nullglob above handles a directory with no .csv files.
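For this particular job, a simpler alternative is to let tail do the counting, at the cost of a temporary copy (a sketch, using file.csv from the question):
tail -n 1000 file.csv > file.csv.tmp && mv file.csv.tmp file.csv
Unlike the sed version, this also works unchanged when the file already has 1000 or fewer lines.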

Quick unix command to display specific lines in the middle of a file?

Trying to debug an issue with a server and my only log file is a 20GB log file (with no timestamps even! Why do people use System.out.println() as logging? In production?!)
Using grep, I've found an area of the file that I'd like to take a look at, line 347340107.
Other than doing something like
head -<$LINENUM + 10> filename | tail -20
... which would require head to read through the first 347 million lines of the log file, is there a quick and easy command that would dump lines 347340100 - 347340200 (for example) to the console?
Update: I totally forgot that grep can print the context around a match; this works well. Thanks!
I found two other solutions if you know the line number but nothing else (no grep possible):
Assuming you need lines 20 to 40,
sed -n '20,40p;41q' file_name
or
awk 'FNR>=20 && FNR<=40' file_name
When using sed it is more efficient to quit processing after having printed the last line than continue processing until the end of the file. This is especially important in the case of large files and printing lines at the beginning. In order to do so, the sed command above introduces the instruction 41q in order to stop processing after line 41 because in the example we are interested in lines 20-40 only. You will need to change the 41 to whatever the last line you are interested in is, plus one.
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
Method 3 is efficient on large files because sed quits as soon as it has printed line 52, making it the fastest way to display a specific line.
with GNU-grep you could just say
grep --context=10 ...
No, there isn't: files are not line-addressable.
There is no constant-time way to find the start of line n in a text file. You must stream through the file and count newlines.
Use the simplest/fastest tool you have to do the job. To me, using head makes much more sense than grep, since the latter is way more complicated. I'm not saying "grep is slow", it really isn't, but I would be surprised if it's faster than head for this case. That'd be a bug in head, basically.
What about:
tail -n +347340107 filename | head -n 100
I didn't test it, but I think that would work.
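For the range in the question, that would be (equally untested):
tail -n +347340100 filename | head -n 101
tail -n +N starts output at line N, and head -n 101 then keeps lines 347340100 through 347340200 inclusive.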
I prefer just going into less and
typing 50% to go halfway through the file,
43210G to go to line 43210
:43210 to do the same
and stuff like that.
Even better: hit v to start editing (in vim, of course!), at that location. Now, note that vim has the same key bindings!
You can use the ex command, a standard Unix editor (part of Vim now), e.g.
display a single line (e.g. 2nd one):
ex +2p -scq file.txt
corresponding sed syntax: sed -n '2p' file.txt
range of lines (e.g. 2-5 lines):
ex +2,5p -scq file.txt
sed syntax: sed -n '2,5p' file.txt
from the given line till the end (e.g. 5th to the end of the file):
ex +5,p -scq file.txt
sed syntax: sed -n '5,$p' file.txt
multiple line ranges (e.g. 2-4 and 6-8 lines):
ex +2,4p +6,8p -scq file.txt
sed syntax: sed -n '2,4p;6,8p' file.txt
Above commands can be tested with the following test file:
seq 1 20 > file.txt
Explanation:
+ or -c followed by the command - execute the (vi/vim) command after the file has been read,
-s - silent mode, which also uses the current terminal as the default output,
q given via -c is the command to quit the editor (add ! to force quit, e.g. -scq!).
I'd first split the file into a few smaller ones, like this:
$ split --lines=50000 /path/to/large/file /path/to/output/file/prefix
and then grep on the resulting files.
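To map a hit back to a line number in the original file, note that each chunk holds exactly 50000 lines; with the default xaa, xab, ... names, a match reported by grep -n at local line 123 of the third chunk (xac, i.e. two full chunks before it) sits at line 2 * 50000 + 123 = 100123 of the original.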
If the line number you want to read is 100:
head -100 filename | tail -1
Get ack
Ubuntu/Debian install:
$ sudo apt-get install ack-grep
Then run:
$ ack --lines=$START-$END filename
Example:
$ ack --lines=10-20 filename
From $ man ack:
--lines=NUM
Only print line NUM of each file. Multiple lines can be given with multiple --lines options or as a comma separated list (--lines=3,5,7). --lines=4-7 also works.
The lines are always output in ascending order, no matter the order given on the command line.
sed will need to read the data too to count the lines.
The only way a shortcut would be possible would be if there were context/order in the file to operate on. For example, if log lines were prepended with a fixed-width time/date, you could use the look Unix utility to binary-search through the file for particular dates/times.
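For example, if every line begins with a sortable timestamp and the file is sorted on that prefix (which look requires), something like this would binary-search straight to the region of interest (hypothetical file name and timestamp format):
look '2009-06-12 14:' huge.log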
Use
x=`cat -n <file> | grep <match> | awk '{print $1}'`
Here you will get the line number where the match occurred.
Now you can use the following command to print 100 lines
awk -v var="$x" 'NR>=var && NR<=var+100{print}' <file>
or you can use "sed" as well
sed -n "${x},$((x + 100))p" <file>
With sed -e '1,N d; M q' you'll print lines N+1 through M. This is probably a bit better than grep -C, as it doesn't try to match lines to a pattern.
Building on Sklivvz' answer, here's a nice function one can put in a .bash_aliases file. It is efficient on huge files when printing stuff from the front of the file.
function middle()
{
local startidx=$1
local len=$2
local endidx=$((startidx + len))
local filename=$3
awk -v s="$startidx" -v e="$endidx" 'FNR>=s && FNR<=e { print NR" "$0 }; FNR>e { print "END HERE"; exit }' "$filename"
}
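Usage, e.g. for the range from the question (server.log being a hypothetical file name):
middle 347340100 100 server.log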
To display a line from a <textfile> by its <line#>, just do this:
perl -wne 'print if $. == <line#>' <textfile>
If you want a more powerful way to show a range of lines with regular expressions -- I won't say why grep is a bad idea for doing this, it should be fairly obvious -- this simple expression will show you your range in a single pass which is what you want when dealing with ~20GB text files:
perl -wne 'print if m/<regex1>/ .. m/<regex2>/' <filename>
(tip: if your regex has / in it, use something like m!<regex>! instead)
This would print out <filename> starting with the line that matches <regex1> up until (and including) the line that matches <regex2>.
It doesn't take a wizard to see how a few tweaks can make it even more powerful.
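For instance, with made-up markers, this prints every block between BEGIN and COMMIT lines, inclusive:
perl -wne 'print if m/^BEGIN/ .. m/^COMMIT/' app.log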
Last thing: Perl, being a mature language, has many hidden enhancements that favor speed and performance. This makes it the obvious choice for such an operation, since it was originally developed for handling large log files, text, databases, etc.
print line 5
sed -n '5p' file.txt
sed -n '5{p;q}' file.txt
print everything except line 5
sed '5d' file.txt
And my creation, using Google:
#!/bin/bash
#removeline.sh
#removes a line; with -o, the removed line is moved (appended) to another file xD
usage() { # Function: Print a help message.
echo "Usage: $0 -l LINENUMBER -i INPUTFILE [ -o OUTPUTFILE ]"
echo "line is removed from INPUTFILE"
echo "line is appended to OUTPUTFILE"
}
exit_abnormal() { # Function: Exit with error.
usage
exit 1
}
while getopts l:i:o: flag
do
case "${flag}" in
l) line=${OPTARG};;
i) input=${OPTARG};;
o) output=${OPTARG};;
esac
done
if [ -f tmp ]; then
echo "Temp file:tmp exist. delete it yourself :)"
exit
fi
if [ -f "$input" ]; then
re_isanum='^[0-9]+$'
if ! [[ $line =~ $re_isanum ]] ; then
echo "Error: LINENUMBER must be a positive, whole number."
exit 1
elif [ $line -eq "0" ]; then
echo "Error: LINENUMBER must be greater than zero."
exit_abnormal
fi
if [ -n "$output" ]; then
sed -n "${line}p" "$input" >> "$output"
fi
if [ -n "$input" ]; then
# deleting the line from the input here is what turns the copy above into a move
sed "${line}d" "$input" > tmp && cp tmp "$input"
fi
fi
if [ -f tmp ]; then
rm tmp
fi
You could try this command:
grep -n '' <filename> | grep '^<line number>:'
Easy with Perl! If you want to get lines 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
(Note that ~~ is the experimental smartmatch operator, so this may warn or stop working on recent Perls.)
I am surprised only one other answer (by Ramana Reddy) suggested adding line numbers to the output. The following searches for the required line number and colours the output.
file=FILE
lineno=LINENO
wb="107"; bf="30;1"; rb="101"; yb="103"
cat -n "${file}" | { GREP_COLORS="se=${wb};${bf}:cx=${wb};${bf}:ms=${rb};${bf}:sl=${yb};${bf}" grep --color -C 10 "^[[:space:]]\\+${lineno}[[:space:]]"; }
