How can I randomize the lines in a file using standard tools on Red Hat Linux?

How can I randomize the lines in a file using standard tools on Red Hat Linux?
I don't have the shuf command, so I am looking for something like a perl or awk one-liner that accomplishes the same task.

Um, let's not forget
sort --random-sort

shuf is the best way.
sort -R is painfully slow. I just tried to sort a 5GB file. I gave up after 2.5 hours. Then shuf sorted it in a minute.
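For reference, a typical invocation looks like this (file names are placeholders; shuf ships with newer GNU coreutils, which is why older Red Hat systems may lack it):
shuf big.log > big.shuffled.log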

And a Perl one-liner you get!
perl -MList::Util -e 'print List::Util::shuffle <>'
It uses a module, but the module is part of the core Perl distribution. If that's not good enough, you may consider rolling your own.
I tried using this with the -i flag ("edit-in-place") to have it edit the file. The documentation suggests it should work, but it doesn't. It still displays the shuffled file to stdout, but this time it deletes the original. I suggest you don't use it.
Consider a shell script:
#!/bin/bash
if [[ $# -eq 0 ]]
then
    echo "Usage: $0 file [file ...]"
    exit 1
fi
for i in "$@"
do
    perl -MList::Util -e 'print List::Util::shuffle <>' "$i" > "$i.new"
    if [[ $(wc -c < "$i") -eq $(wc -c < "$i.new") ]]
    then
        mv "$i.new" "$i"
    else
        echo "Error for file $i!"
    fi
done
Untested, but hopefully works.

cat yourfile.txt | while IFS= read -r f; do printf "%05d %s\n" "$RANDOM" "$f"; done | sort -n | cut -c7-
Read the file, prepend every line with a random number, sort the file on those random prefixes, cut the prefixes afterwards. One-liner which should work in any semi-modern shell.
EDIT: incorporated Richard Hansen's remarks.

A one-liner for python:
python -c "import random, sys; lines = open(sys.argv[1]).readlines(); random.shuffle(lines); print ''.join(lines)," myFile
And for printing just a single random line:
python -c "import random, sys; print random.choice(open(sys.argv[1]).readlines())," myFile
But see this post for the drawbacks of Python's random.shuffle(): its underlying random generator cannot produce every possible permutation once the list has many (more than 2080) elements.

Related to Jim's answer:
My ~/.bashrc contains the following:
unsort ()
{
    LC_ALL=C sort -R "$@"
}
With GNU coreutils's sort, -R = --random-sort, which generates a random hash of each line and sorts by it. In some locales, some older (buggy) versions ignored the randomized hash and returned normally sorted output, which is why I set LC_ALL=C.
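With that function in place, it works like any other filter (file name is a placeholder):
unsort myfile.txt > myfile.shuffled.txt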
Related to Chris's answer:
perl -MList::Util=shuffle -e'print shuffle<>'
is a slightly shorter one-liner. (-Mmodule=a,b,c is shorthand for -e 'use module qw(a b c);'.)
The reason giving it a simple -i doesn't work for shuffling in-place is because Perl expects that the print happens in the same loop the file is being read, and print shuffle <> doesn't output until after all input files have been read and closed.
As a shorter workaround,
perl -MList::Util=shuffle -i -ne'BEGIN{undef$/}print shuffle split/^/m'
will shuffle files in-place. (-n means "wrap the code in a while (<>) {...} loop; BEGIN{undef$/} makes Perl operate on files-at-a-time instead of lines-at-a-time, and split/^/m is needed because $_=<> has been implicitly done with an entire file instead of lines.)
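For example (hypothetical file names; the .bak suffix just makes Perl keep a backup of each original):
perl -MList::Util=shuffle -i.bak -ne'BEGIN{undef$/}print shuffle split/^/m' file1.txt file2.txt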

When I install coreutils with homebrew
brew install coreutils
shuf becomes available as gshuf.

Mac OS X with DarwinPorts:
sudo port install unsort
cat $file | unsort | ...

FreeBSD has its own random utility:
cat $file | random | ...
It's in /usr/games/random, so if you have not installed games, you are out of luck.
You could consider installing ports like textproc/rand or textproc/msort. These might well be available on Linux and/or Mac OS X, if portability is a concern.

On OSX, grabbing latest from http://ftp.gnu.org/gnu/coreutils/ and something like
./configure
make
sudo make install
...should give you /usr/local/bin/sort --random-sort
without messing up /usr/bin/sort

Or get it from MacPorts:
$ sudo port install coreutils
and/or
$ /opt/local/libexec/gnubin/sort --random-sort

Related

how to convert filename.bz2.gz file to filename.gz

I have a bunch of files named filename.bz2.gz which I want to convert to filename.gz.
Any help?
Thanks
Given the filename pattern *.bz2.gz, I assume the files were created with the following order of compressions:
echo test | bzip2 | gzip -f > file.bz2.gz
That means it is a gzipped bzip2 file (for whatever reason). If my assumption is correct, you can change its compression to gzip-only using the following command:
gunzip < file.bz2.gz | bunzip2 | gzip > file.gz
If you just want to rename the files, do this:
for i in `ls|awk -F. '{print $1}'`
do
mv "$i".bz2.gz "$i".gz
done
I would refine Ajit's solution in this way:
for i in *.bz2.gz; do
i=${i%.bz2.gz}
mv "$i.bz2.gz" "$i.gz"
done
Using a glob rather than command substitution avoids problems with word-splitting for filenames with whitespace. It also avoids the extra ls process, which is marginally more efficient, particularly on platforms like Cygwin with slow process forking. For the same reason, the awk command can be replaced with the ${parameter%word} parameter expansion syntax. (The quoting style of "$i".gz vs "$i.gz" makes no difference and is just personal preference.)
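As a quick illustration of that expansion (the value is made up):
i=photo.bz2.gz
echo "${i%.bz2.gz}"   # prints: photo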

Calculate Levenshtein Distances between many consecutive strings

I've got a text file with str1 str2 str3... and I want to output another text file with LD(str1,str2) LD(str2,str3) LD(str3,str4) and so on. How to do this? Any language will do.
#ASSUMING YOU'RE RUNNING SOME KIND OF UNIX
#install a perl module that computes it:
sudo cpan String::Approx
# (Note: there is also Text::Levenshtein module)
# if you need to, change your shell to:
bash
# so you can use command substitution:
perl -M'String::Approx(adist)' -ane 'print adist(@F), "\n"' <(paste <(ghead -n -1 in.txt) <(gtail -n +2 in.txt))
# note: I have gnu core utils installed with 'g' prefix. You might just use 'head' and 'tail' above.
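If you would rather have exact Levenshtein distances via the Text::Levenshtein module mentioned above, a sketch along the same lines (assuming the module is installed, one whitespace-free string per line in in.txt, and GNU head/tail, i.e. plain head/tail on Linux):
sudo cpan Text::Levenshtein
paste <(head -n -1 in.txt) <(tail -n +2 in.txt) | perl -MText::Levenshtein=distance -ane 'print distance($F[0], $F[1]), "\n"' > out.txt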

Linux: Move 1 million files into prefix-based created Folders

I have a directory called "images" filled with about one million images. Yep.
I want to write a shell command to rename all of those images into the following format:
original: filename.jpg
new: /f/i/l/filename.jpg
Any suggestions?
Thanks,
Dan
for i in *.*; do mkdir -p "${i:0:1}/${i:1:1}/${i:2:1}/"; mv "$i" "${i:0:1}/${i:1:1}/${i:2:1}/"; done
The ${i:0:1}/${i:1:1}/${i:2:1} part could probably be a variable, or shorter or different, but the command above gets the job done (a quoted, variable-based sketch follows the edit note below). You'll probably face performance issues, but if you really want to use it, narrow the *.* to fewer names at a time (a*.*, b*.*, or whatever fits you).
edit: added a $ before i for mv, as noted by Dan
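Here is that variable-based variant as a sketch, with quoting so unusual file names survive:
for i in *.*; do
    d="${i:0:1}/${i:1:1}/${i:2:1}"
    mkdir -p "$d" && mv -- "$i" "$d/"
done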
You can generate the new file name using, e.g., sed:
$ echo "test.jpg" | sed -e 's/^\(\(.\)\(.\)\(.\).*\)$/\2\/\3\/\4\/\1/'
t/e/s/test.jpg
So, you can do something like this (assuming all the directories are already created):
for f in *; do
mv -i "$f" "$(echo "$f" | sed -e 's/^\(\(.\)\(.\)\(.\).*\)$/\2\/\3\/\4\/\1/')"
done
or, if you can't use the bash $( syntax:
for f in *; do
mv -i "$f" "`echo "$f" | sed -e 's/^\(\(.\)\(.\)\(.\).*\)$/\2\/\3\/\4\/\1/'`"
done
However, considering the number of files, you may just want to use perl as that's a lot of sed and mv processes to spawn:
#!/usr/bin/perl -w
use strict;
# warning: untested; the target directories must already exist
opendir DIR, "." or die "opendir: $!";
my @files = readdir(DIR); # can't change dir while reading: read in advance
closedir DIR;
foreach my $f (@files) {
    next if $f eq '.' or $f eq '..';
    (my $new_name = $f) =~ s!^((.)(.)(.).*)$!$2/$3/$4/$1!;
    -e $new_name and die "$new_name already exists";
    rename($f, $new_name) or die "rename $f -> $new_name: $!";
}
That Perl script is limited to moves within the same filesystem (rename cannot cross filesystems), though you can use File::Copy::move to get around that.
You can do it as a bash script:
#!/bin/bash
base=base
mkdir -p "$base/shorts"
for n in *
do
    if [ "${#n}" -lt 3 ]
    then
        mv "$n" "$base/shorts"
    else
        dir="$base/${n:0:1}/${n:1:1}/${n:2:1}"
        mkdir -p "$dir"
        mv "$n" "$dir"
    fi
done
Needless to say, you might need to worry about spaces and the files with short names.
I suggest a short python script. Most shell tools will balk at that much input (though xargs may do the trick). Example below.
#!/usr/bin/python
import os, shutil
src_dir = '/src/dir'
dest_dir = '/dest/dir'
for fn in os.listdir(src_dir):
    d = os.path.join(dest_dir, fn[0], fn[1], fn[2])
    if not os.path.isdir(d):  # makedirs raises an error if the directory already exists
        os.makedirs(d)
    shutil.copyfile(os.path.join(src_dir, fn), os.path.join(d, fn))
Any of the proposed solutions which use a wildcard syntax in the shell will likely fail due to the sheer number of files you have. Of the current proposed solutions, the perl one is probably the best.
However, you can easily adapt any of the shell script methods to deal with any number of files thus:
ls -1 | \
while IFS= read -r filename
do
    # insert the loop body of your preference here, operating on "$filename"
done
I would still use perl, but if you're limited to only having simple unix tools around, then combining one of the above shell solutions with a loop like I've shown should get you there. It'll be slow, though.
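For example, combined with the prefix-directory logic from the earlier answers (a sketch assuming bash; it skips anything that is not a regular file, such as the freshly created single-letter directories):
ls -1 | while IFS= read -r filename
do
    [ -f "$filename" ] || continue
    d="${filename:0:1}/${filename:1:1}/${filename:2:1}"
    mkdir -p "$d" && mv -- "$filename" "$d/"
done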

Tail multiple files in CentOS

I want to tail multiple files (and follow them) in CentOS, I've tried this:
tail -f file1 file2 file3
but the output is very unfriendly
I've also had a look at multitail but can't find a CentOS version.
What other choices do I have?
Multitail is available for CentOS in the rpmforge repos. To add the rpmforge repository, check the documentation on 3rd Party Repositories.
I found that the solution described here works well on CentOS:
The link is http://www.thegeekstuff.com/2009/09/multitail-to-view-tail-f-output-of-multiple-log-files-in-one-terminal/
Thanks to Ramesh Natarajan
$ vi multi-tail.sh
#!/bin/sh
# When this exits, kill all background processes too.
trap 'kill $(jobs -p)' EXIT
# Iterate through each of the given file names,
for file in "$@"
do
    # and show the tail of each in the background.
    tail -f "$file" &
done
# wait ... until CTRL+C
wait
You could simulate multitail by opening multiple instances of tail -f in Emacs subwindows.
I usually just open another xterm and run a separate 'tail -f' there.
Otherwise if I'm using the 'screen' tool, I'll set up separate 'tail -f' commands there. I don't like that as much because it takes a few keystrokes to enable scrolling in screen before using the Page Up and Page Down keys. I prefer to just use xterm's scroll bar.
You can use the watch command; I use it to tail two files at the same time:
watch -n0 tail -n30 file1 file2
A better answer to an old question...
I create a shell function in my .bashrc (obviously assumes you're using bash as your shell) and use tmux. You can probably complicate this a whole lot and do it without the tempfile, but the quoting is just ugly if you're trying to ensure that files with spaces or other weird characters in the name still work.
multitail ()
{
    cmdfile=$(mktemp)
    echo "new-session -d \"tail -f '$1'\"" >"$cmdfile"
    shift
    for file in "$@"
    do
        echo "split-window -d \"tail -f '$file'\"" >>"$cmdfile"
    done
    echo "select-layout even-vertical" >>"$cmdfile"
    tmux source-file "$cmdfile" \; attach && rm -f "$cmdfile"
}
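Example usage (the log paths are just placeholders):
multitail /var/log/messages /var/log/secure /var/log/cron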

Quick unix command to display specific lines in the middle of a file?

Trying to debug an issue with a server and my only log file is a 20GB log file (with no timestamps even! Why do people use System.out.println() as logging? In production?!)
Using grep, I've found an area of the file that I'd like to take a look at, line 347340107.
Other than doing something like
head -<$LINENUM + 10> filename | tail -20
... which would require head to read through the first 347 million lines of the log file, is there a quick and easy command that would dump lines 347340100 - 347340200 (for example) to the console?
Update: I totally forgot that grep can print the context around a match... this works well. Thanks!
I found two other solutions if you know the line number but nothing else (no grep possible):
Assuming you need lines 20 to 40,
sed -n '20,40p;41q' file_name
or
awk 'FNR>=20 && FNR<=40' file_name
When using sed, it is more efficient to quit after printing the last wanted line than to keep processing until the end of the file. This matters especially for large files when the lines of interest are near the beginning. That is why the sed command above adds the instruction 41q: stop after line 41, since in the example we only want lines 20-40. Change the 41 to your last line of interest plus one.
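In general, for lines FIRST through LAST (both shell variables are placeholders you would set yourself):
sed -n "${FIRST},${LAST}p;$((LAST+1))q" file_name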
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
Method 3 is the most efficient on large files and is the fastest way to display a specific line.
With GNU grep you could just say
grep --context=10 ...
No there isn't, files are not line-addressable.
There is no constant-time way to find the start of line n in a text file. You must stream through the file and count newlines.
Use the simplest/fastest tool you have to do the job. To me, using head makes much more sense than grep, since the latter is way more complicated. I'm not saying "grep is slow", it really isn't, but I would be surprised if it's faster than head for this case. That'd be a bug in head, basically.
What about:
tail -n +347340107 filename | head -n 100
I didn't test it, but I think that would work.
I prefer just going into less and
typing 50% to go halfway into the file,
43210G to go to line 43210
:43210 to do the same
and stuff like that.
Even better: hit v to start editing (in vim, of course!) at that location. Note that vim has the same key bindings!
You can use the ex command, a standard Unix editor (part of Vim now), e.g.
display a single line (e.g. 2nd one):
ex +2p -scq file.txt
corresponding sed syntax: sed -n '2p' file.txt
range of lines (e.g. 2-5 lines):
ex +2,5p -scq file.txt
sed syntax: sed -n '2,5p' file.txt
from the given line till the end (e.g. 5th to the end of the file):
ex +5,p -scq file.txt
sed syntax: sed -n '5,$p' file.txt
multiple line ranges (e.g. 2-4 and 6-8 lines):
ex +2,4p +6,8p -scq file.txt
sed syntax: sed -n '2,4p;6,8p' file.txt
Above commands can be tested with the following test file:
seq 1 20 > file.txt
Explanation:
+ or -c followed by a command - execute the (vi/vim) command after the file has been read,
-s - silent mode, which also uses the current terminal as the default output,
the q in -c is the command that quits the editor (add ! to force quit, e.g. -scq!).
I'd first split the file into a few smaller ones, like this
$ split --lines=50000 /path/to/large/file /path/to/output/file/prefix
and then grep on the resulting files.
If the line number you want to read is 100:
head -100 filename | tail -1
Get ack
Ubuntu/Debian install:
$ sudo apt-get install ack-grep
Then run:
$ ack --lines=$START-$END filename
Example:
$ ack --lines=10-20 filename
From $ man ack:
--lines=NUM
Only print line NUM of each file. Multiple lines can be given with multiple --lines options or as a comma separated list (--lines=3,5,7). --lines=4-7 also works.
The lines are always output in ascending order, no matter the order given on the command line.
sed will need to read the data too to count the lines.
The only way a shortcut would be possible is if there were some context or ordering in the file to exploit. For example, if the log lines were prefixed with a fixed-width date/time, you could use the look unix utility to binary-search through the file for particular dates/times.
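For instance, if every line began with a sortable timestamp (format here is hypothetical), look could binary-search straight to the region of interest; note that look requires the file to be sorted on that prefix:
look '2009-09-14 12:3' huge.log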
Use
x=`cat -n <file> | grep <match> | awk '{print $1}'`
Here you will get the line number where the match occurred.
Now you can use the following command to print 100 lines
awk -v var="$x" 'NR>=var && NR<=var+100{print}' <file>
or you can use "sed" as well
sed -n "${x},${x+100}p" <file>
With sed -e '1,N d; M q' you'll print lines N+1 through M. This is probably a bit better than grep -C as it doesn't try to match lines to a pattern.
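Applied to the range from the question, that would be:
sed -e '1,347340099d;347340200q' filename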
Building on Sklivvz' answer, here's a nice function one can put in a .bash_aliases file. It is efficient on huge files when printing stuff from the front of the file.
function middle()
{
    startidx=$1
    len=$2
    endidx=$((startidx + len))
    filename=$3
    awk "FNR>=${startidx} && FNR<=${endidx} { print NR\" \"\$0 }; FNR>${endidx} { print \"END HERE\"; exit }" "$filename"
}
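For example, for the range asked about in the question:
middle 347340100 100 filename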
To display a line from a <textfile> by its <line#>, just do this:
perl -wne 'print if $. == <line#>' <textfile>
If you want a more powerful way to show a range of lines with regular expressions -- I won't say why grep is a bad idea for doing this, it should be fairly obvious -- this simple expression will show you your range in a single pass which is what you want when dealing with ~20GB text files:
perl -wne 'print if m/<regex1>/ .. m/<regex2>/' <filename>
(tip: if your regex has / in it, use something like m!<regex>! instead)
This would print out <filename> starting with the line that matches <regex1> up until (and including) the line that matches <regex2>.
It doesn't take a wizard to see how a few tweaks can make it even more powerful.
Last thing: perl, since it is a mature language, has many hidden enhancements to favor speed and performance. With this in mind, it makes it the obvious choice for such an operation since it was originally developed for handling large log files, text, databases, etc.
print line 5
sed -n '5p' file.txt
sed '5q' file.txt
print everything else than line 5
sed '5d' file.txt
And here is my own creation, put together with some help from Google:
#!/bin/bash
# removeline.sh
# removes a line from a file; with -o, the removed line is appended to another file (i.e. the line is moved)
usage() { # Function: Print a help message.
echo "Usage: $0 -l LINENUMBER -i INPUTFILE [ -o OUTPUTFILE ]"
echo "line is removed from INPUTFILE"
echo "line is appended to OUTPUTFILE"
}
exit_abnormal() { # Function: Exit with error.
usage
exit 1
}
while getopts l:i:o: flag
do
case "${flag}" in
l) line=${OPTARG};;
i) input=${OPTARG};;
o) output=${OPTARG};;
esac
done
if [ -f tmp ]; then
echo "Temp file:tmp exist. delete it yourself :)"
exit
fi
if [ -f "$input" ]; then
re_isanum='^[0-9]+$'
if ! [[ $line =~ $re_isanum ]] ; then
echo "Error: LINENUMBER must be a positive, whole number."
exit 1
elif [ $line -eq "0" ]; then
echo "Error: LINENUMBER must be greater than zero."
exit_abnormal
fi
if [ ! -z $output ]; then
sed -n "${line}p" $input >> $output
fi
if [ ! -z $input ]; then
# this sed command deletes the line from the input; together with the append above, the line is moved
sed "${line}d" $input > tmp && cp tmp $input
fi
fi
if [ -f tmp ]; then
rm tmp
fi
You could try this command:
egrep -n "*" <filename> | egrep "<line number>"
Easy with perl! If you want to get lines 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
I am surprised that only one other answer (by Ramana Reddy) suggested adding line numbers to the output. The following searches for the required line number and colours the output.
file=FILE
lineno=LINENO
wb="107"; bf="30;1"; rb="101"; yb="103"
cat -n ${file} | { GREP_COLORS="se=${wb};${bf}:cx=${wb};${bf}:ms=${rb};${bf}:sl=${yb};${bf}" grep --color -C 10 "^[[:space:]]\\+${lineno}[[:space:]]"; }
