I need to make a big test file for a sorting algorithm, so I need to generate 10 million random strings, each 10 characters long. I tried using cat on /dev/urandom, but it keeps running for minutes, and when I look in the file there are only around 8 pages of strings. How do I generate 10 million such strings in bash?
Using openssl:
#!/bin/bash
openssl rand -hex $(( 10000000 * 4 )) | \
while IFS= read -rn8 -d '' r; do
    echo "$r"
done
This will not guarantee uniqueness, but it gives you 10 million random lines in a file. Not too fast, but it ran in under 30 seconds on my machine:
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 10 | head -n 10000000 > file
Update: if you have shuf from GNU coreutils, you can use:
shuf -i 1-10000000 > file
Takes 2 sec on my computer. (Thanks rici!)
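Note that shuf -i 1-10000000 gives you a shuffled sequence of the integers 1 to 10,000,000, not 10-character strings. If you specifically want unique 10-character (here: 10-digit) strings, a variation on the same idea is to sample from a 10-digit range (a sketch, assuming a reasonably recent GNU shuf, which handles the large range without materializing it):
shuf -i 1000000000-9999999999 -n 10000000 > file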
You can use awk to generate sequential numbers and shuffle them with shuf:
awk 'BEGIN{for(i=1;i<10000001;i++){print i}}' | shuf > big-file.txt
This takes ~5 seconds on my computer.
If they don't need to be unique, you can do:
$ awk -v n=10000000 'BEGIN{for (i=1; i<=n; i++) printf "%010d\n", int(rand()*n)}' >big_file
That runs in about 3 seconds on my iMac.
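If you want 10-character alphanumeric strings instead of zero-padded numbers, a similar awk sketch (the character set and variable names here are my own choice) would be:
awk -v n=10000000 'BEGIN{
    srand()                      # seed from the current time
    c = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    for (i = 1; i <= n; i++) {
        s = ""
        for (j = 1; j <= 10; j++) {
            s = s substr(c, int(rand() * length(c)) + 1, 1)
        }
        print s
    }
}' > big_file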
Don't generate it, download it. For example, nic.funet.fi has the file 100Mrnd (size 104857600 bytes) in its /dev directory (abbreviated to just funet below). 10M rows of 10 bytes each is 100M bytes, but since xxd turns each binary byte into two hex characters (\x12 -> 12), we only need 50M bytes of input, so:
$ wget -S -O - ftp://funet/100Mrnd | head -c 50000000 | xxd -p | fold -w 10 > /dev/null
$ head -5 file
f961b3ef0e
dc0b5e3b80
513e7c37e1
36d2e4c7b0
0514e626e5
(Replace funet with the domain name and path given above, and /dev/null with your desired filename.)
I am using the following grep script to output all the unmatched patterns:
grep -oFf patterns.txt large_strings.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
The patterns file contains 12-character-long substrings; some instances are shown below:
6b6c665d4f44
8b715a5d5f5f
26364d605243
717c8a919aa2
The large_strings file contains extremely long strings, around 20-100 million characters each (a small piece of one string is shown below):
121b1f212222212123242223252b36434f5655545351504f4e4e5056616d777d80817d7c7b7a7a7b7c7d7f8997a0a2a2a3a5a5a6a6a6a6a6a7a7babbbcbebebdbcbcbdbdbdbdbcbcbcbcc2c2c2c2c2c2c2c2c4c4c4c3c3c3c2c2c3c3c3c3c3c3c3c3c2c2c1c0bfbfbebdbebebebfbfc0c0c0bfbfbfbebebdbdbdbcbbbbbababbbbbcbdbdbdbebebfbfbfbebdbcbbbbbbbbbcbcbcbcbcbcbcbcbcb8b8b8b7b7b6b6b6b8b8b9babbbbbcbcbbbabab9b9bababbbcbcbcbbbbbababab9b8b7b6b6b6b6b7b7b7b7b7b7b7b7b7b7b6b6b5b5b6b6b7b7b7b7b8b8b9b9b9b9b9b8b7b7b6b5b5b5b5b5b4b4b3b3b3b6b5b4b4b5b7b8babdbebfc1c1c0bfbec1c2c2c2c2c1c0bfbfbebebebebfc0c1c0c0c0bfbfbebebebebebebebebebebebebebdbcbbbbbab9babbbbbcbcbdbdbdbcbcbbbbbbbbbbbabab9b7b6b5b4b4b4b4b3b1aeaca9a7a6a9a9a9aaabacaeafafafafafafafafafb1b2b2b2b2b1b0afacaaa8a7a5a19d9995939191929292919292939291908f8e8e8d8c8b8a8a8a8a878787868482807f7d7c7975716d6b6967676665646261615f5f5e5d5b5a595957575554525
How can we speed up the above script (GNU parallel, xargs, fgrep, etc.)? I tried using --pipepart and --block, but they don't let you pipe two grep commands together.
Btw these are all hexadecimal strings and patterns.
The working code below is a little faster than the traditional grep:
rg -oFf patterns.txt large_strings.txt | rg -vFf - patterns.txt > unmatched_patterns.txt
grep took an hour to finish the pattern matching, while ripgrep took around 45 minutes.
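One possible middle ground (a sketch, assuming GNU parallel is installed) is to parallelize only the first, expensive grep with --pipepart, then feed the de-duplicated matches to a single second grep. Since --pipepart splits at line boundaries, a pattern occurrence should not be cut in half:
parallel --pipepart --block 100M -a large_strings.txt grep -oFf patterns.txt \
    | sort -u \
    | grep -vFf - patterns.txt > unmatched_patterns.txt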
If you do not need to use grep, try:
build_k_mers() {
    k="$1"
    slot="$2"
    perl -ne 'for $n (0..(length $_)-'"$k"') {
        $prefix = substr($_,$n,2);
        $fh{$prefix} or open $fh{$prefix}, ">>", "tmp/kmer.$prefix.'"$slot"'";
        $fh = $fh{$prefix};
        print $fh substr($_,$n,'"$k"'),"\n"
    }'
}
export -f build_k_mers
rm -rf tmp
mkdir tmp
export LC_ALL=C
# search strings must be sorted for comm
parsort patterns.txt | awk '{print >>"tmp/patterns."substr($1,1,2)}' &
# make shorter lines: Insert \n(last 12 char before \n) for every 32k
# This makes it easier for --pipepart to find a newline
# It will not change the kmers generated
perl -pe 's/(.{32000})(.{12})/$1$2\n$2/g' large_strings.txt > large_lines.txt
# Build 12-mers
parallel --pipepart --block -1 -a large_lines.txt 'build_k_mers 12 {%}'
# -j10 and 20s may be adjusted depending on hardware
parallel -j10 --delay 20s 'parsort -u tmp/kmer.{}.* > tmp/kmer.{}; rm tmp/kmer.{}.*' ::: `perl -e 'map { printf "%02x ",$_ } 0..255'`
wait
parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.??
I have tested this on patterns.txt (9 GB, 725,937,231 lines) and large_strings.txt (19 GB, 184 lines); on my 64-core machine it completes in 3 hours.
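To collect everything into a single file like the original script produced, the combined output of the last step can simply be redirected (a sketch reusing only the command above):
parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.?? > unmatched_patterns.txt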
I have a file that I am processing line by line, producing another file with the result. I want to monitor the percentage of completion. In my case, it is just the number of lines in the new file divided by the number of lines in the input file. A simple example would be:
$ cat infile
unix
is
awesome
$ cat infile | process.sh >> outfile &
Now, if I run my command, I should get 0.33 once process.sh has completed the first line.
Any suggestions?
You can use pv for progress (in Debian/Ubuntu it is in the package pv):
pv -l -s `wc -l < file.txt` file.txt | process.sh
This will use number of lines for progress.
Or you can use just the number of bytes:
pv file.txt | process.sh
The above commands will show you the percentage of completion and ETA.
You can use bc:
echo "scale=2; $(cat outfile | wc -l) / $(cat infile | wc -l) * 100" | bc
In addition, combine this with watch for updated progress:
watch -d "echo \"scale=2; \$(cat outfile | wc -l) / \$(cat infile | wc -l) * 100\" | bc"
TOTAL_LINES=`wc -l < infile`
LINES=`wc -l < outfile`
PERCENT=`echo "scale=2;${LINES}/${TOTAL_LINES}" | bc | sed -e 's_^\.__'`
echo "${PERCENT} % Complete"
scale=2 means bc keeps two decimal places; the sed then strips the leading dot, turning a fraction like .33 into the percentage 33.
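Putting the pieces together, a minimal polling sketch (assuming infile and outfile as in the question, and that outfile already exists because the pipeline has started writing to it):
#!/bin/bash
# print the completion percentage every 2 seconds until outfile catches up
total=$(wc -l < infile)
while :; do
    done_lines=$(wc -l < outfile)
    echo "$(echo "scale=2; $done_lines * 100 / $total" | bc) % complete"
    [ "$done_lines" -ge "$total" ] && break
    sleep 2
done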
I am trying to use the Bash variable $RANDOM to create a random string that consists of 8 characters from a variable that contains integer and alphanumeric digits, e.g., var="abcd1234ABCD".
How can I do that?
Use parameter expansion. ${#chars} is the number of possible characters and % is the modulo operator, so RANDOM%${#chars} is a random index into the string. ${chars:offset:length} then selects length character(s) starting at position offset, i.e. one character at a random position between 0 and ${#chars}-1.
chars=abcd1234ABCD
for i in {1..8} ; do
    echo -n "${chars:RANDOM%${#chars}:1}"
done
echo
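If you want the string in a variable instead of printed directly, the same idea can be written as follows (a small sketch; str is just an arbitrary variable name):
chars=abcd1234ABCD
str=
for ((i = 0; i < 8; i++)); do
    str+=${chars:RANDOM%${#chars}:1}   # append one random character per iteration
done
echo "$str"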
For those looking for a random alpha-numeric string in bash:
LC_ALL=C tr -dc A-Za-z0-9 </dev/urandom | head -c 64
The same as a well-documented function:
function rand-str {
# Return random alpha-numeric string of given LENGTH
#
# Usage: VALUE=$(rand-str $LENGTH)
# or: VALUE=$(rand-str)
local DEFAULT_LENGTH=64
local LENGTH=${1:-$DEFAULT_LENGTH}
LC_ALL=C tr -dc A-Za-z0-9 </dev/urandom | head -c $LENGTH
# LC_ALL=C: required for Mac OS X - https://unix.stackexchange.com/a/363194/403075
# -dc: delete complementary set == delete all except given set
}
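Usage then follows the comments in the function (TOKEN is an arbitrary variable name):
TOKEN=$(rand-str 16)   # 16-character alpha-numeric string
echo "$TOKEN"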
Another way to generate a hexadecimal string from, say, 32 random bytes:
xxd -l 32 -c 32 -p < /dev/random
add -u if you want uppercase characters instead.
OPTION 1 - No specific length, no openssl needed, only letters and numbers, slower than option 2
sed "s/[^a-zA-Z0-9]//g" <<< $(cat /dev/urandom | tr -dc 'a-zA-Z0-9!##$%*()-+' | fold -w 32 | head -n 1)
DEMO: x=100; while [ $x -gt 0 ]; do sed "s/[^a-zA-Z0-9]//g" <<< $(cat /dev/urandom | tr -dc 'a-zA-Z0-9!##$%*()-+' | fold -w 32 | head -n 1) <<< $(openssl rand -base64 17); x=$(($x-1)); done
Examples:
j0PYAlRI1r8zIoOSyBhh9MTtrhcI6d
nrCaiO35BWWQvHE66PjMLGVJPkZ6GBK
0WUHqiXgxLq0V0mBw2d7uafhZt2s
c1KyNeznHltcRrudYpLtDZIc1
edIUBRfttFHVM6Ru7h73StzDnG
OPTION 2 - No specific length, openssl needed, only letters and numbers, faster than option 1
openssl rand -base64 12 # only returns
rand=$(openssl rand -base64 12) # only saves to var
sed "s/[^a-zA-Z0-9]//g" <<< $(openssl rand -base64 17) # leave only letters and numbers
# The last command can go to a var too.
DEMO: x=100; while [ $x -gt 0 ]; do sed "s/[^a-zA-Z0-9]//g" <<< $(openssl rand -base64 17); x=$(($x-1)); done
Examples:
9FbVwZZRQeZSARCH
9f8869EVaUS2jA7Y
V5TJ541atfSQQwNI
V7tgXaVzmBhciXxS
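If you do need an exact length with option 2, one possible tweak (a sketch; 16 is an arbitrary target, and base64 of 17 random bytes leaves comfortably more than 16 alphanumeric characters after stripping in practice):
sed "s/[^a-zA-Z0-9]//g" <<< $(openssl rand -base64 17) | head -c 16; echo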
Other options, not necessarily related:
uuidgen or cat /proc/sys/kernel/random/uuid
After generating 1 billion UUIDs every second for the next 100 years,
the probability of creating just one duplicate would be about 50%. The
probability of one duplicate would be about 50% if every person on
earth owns 600 million UUIDs 😇 source
Not using $RANDOM, but worth mentioning.
Using shuf as the source of randomness (which, in turn, may take /dev/random or /dev/urandom as its entropy source, as in shuf -i1-10 --random-source=/dev/urandom) seems like a solution that uses fewer resources:
$ shuf -er -n8 {A..Z} {a..z} {0..9} | paste -sd ""
tf8ZDZ4U
head -1 <(fold -w 20 <(tr -dc 'a-zA-Z0-9' < /dev/urandom))
This is safe to use in a bash script if you have safety options turned on:
set -eou pipefail
It works around the bash exit status 141 (SIGPIPE) that you would otherwise get from the plain pipe version:
tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 20 | head -1
A slightly obscure, but short-to-write, solution is:
RANDSTR=$(mktemp XXXXX) && rm "$RANDSTR"
expecting that you have write access to the current directory ;-)
mktemp is part of coreutils
UPDATE:
As Bazi pointed out in the comment, mktemp can be used without creating the file ;-) so the command can be even shorter.
RANDSTR=$(mktemp --dry-run XXXXX)
Using a sparse array to shuffle characters:
#!/bin/bash
array=()
for i in {a..z} {A..Z} {0..9}; do
    array[$RANDOM]=$i
done
printf %s ${array[@]::8} $'\n'
(Or a lot of random strings:)
#!/bin/bash
b=()
while ((${#b[@]} <= 32768)); do
    a=(); for i in {a..z} {A..Z} {0..9}; do a[$RANDOM]=$i; done; b+=(${a[@]})
done
tr -d ' ' <<< ${b[@]} | fold -w 8 | head -n 4096
An abbreviated safe pipe workaround based on Radu Gabriel's answer and tested with GNU bash version 4.4.20 and set -euxo pipefail:
head -c 20 <(tr -dc [:alnum:] < /dev/urandom)
I want to convert binary data to hexadecimal, just that, with no fancy formatting at all. hexdump seems too clever, and it "overformats" for me. I want to take x bytes from /dev/random and pass them on as hexadecimal.
Preferably I'd like to use only standard Linux tools, so that I don't need to install it on every machine (there are many).
Perhaps use xxd:
% xxd -l 16 -p /dev/random
193f6c54814f0576bc27d51ab39081dc
Watch out!
hexdump and xxd give the results in a different endianness!
$ echo -n $'\x12\x34' | xxd -p
1234
$ echo -n $'\x12\x34' | hexdump -e '"%x"'
3412
Simply explained: hexdump's %x reads the input as native (little-endian) words, while xxd dumps it byte by byte :D
With od (GNU systems):
$ echo abc | od -A n -v -t x1 | tr -d ' \n'
6162630a
With hexdump (BSD systems):
$ echo abc | hexdump -ve '/1 "%02x"'
6162630a
From Hex dump, od and hexdump:
"Depending on your system type, either or both of these two utilities will be available--BSD systems deprecate od for hexdump, GNU systems the reverse."
Perhaps you could write your own small tool in C, and compile it on-the-fly:
#include <stdio.h>
#include <unistd.h>

/* read stdin in 1024-byte chunks and print each byte as two hex digits */
int main (void) {
    unsigned char data[1024];
    ssize_t numread, i;
    while ((numread = read(0, data, 1024)) > 0) {
        for (i = 0; i < numread; i++) {
            printf("%02x ", data[i]);
        }
    }
    return 0;
}
And then feed it from the standard input:
cat /bin/ls | ./a.out
You can even embed this small C program in a shell script using the heredoc syntax.
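For example, a sketch of that heredoc idea (assuming a C compiler is available as cc; hexer is an arbitrary output name):
cc -x c -o /tmp/hexer - <<'EOF'
#include <stdio.h>
#include <unistd.h>
int main(void) {
    unsigned char data[1024];
    ssize_t numread, i;
    while ((numread = read(0, data, sizeof data)) > 0)
        for (i = 0; i < numread; i++)
            printf("%02x ", data[i]);
    return 0;
}
EOF
/tmp/hexer < /bin/ls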
All the solutions seem to be hard to remember or too complex. I find using printf the shortest one:
$ printf '%x\n' 256
100
But as noted in the comments, this is not what the author wants, so to be fair, here is the full answer.
To use the above to output an actual binary data stream:
printf '%x\n' $(cat /dev/urandom | head -c 5 | od -An -vtu1)
What it does:
printf '%x\n' ... - prints a sequence of integers as hex, e.g. printf '%x,' 1 2 3 will print 1,2,3,
$(...) - captures the output of a shell command and passes it on as arguments
cat /dev/urandom - outputs random binary data
head -c 5 - limits the binary data to 5 bytes
od -An -vtu1 - octal dump command; here it converts the bytes to unsigned decimal integers
As a test case ('a' is hex 61, 'p' is hex 70, ...):
$ printf '%x\n' $(echo "apple" | head -c 5 | od -An -vtu1)
61
70
70
6c
65
Or, to test individual binary bytes: on input let's give decimal 61 (the '=' character), producing the binary data with the '\\x%x' printf format. The command above will correctly output 3d (61 in hex):
$ printf '%x\n' $(echo -ne "$(printf '\\x%x' 61)" | head -c 5 | od -An -vtu1)
3d
If you need a large stream (no newlines) you can use tr and xxd (part of Vim) for byte-by-byte conversion.
head -c1024 /dev/urandom | xxd -p | tr -d $'\n'
Or you can use hexdump (POSIX) for word-by-word conversion.
head -c1024 /dev/urandom | hexdump '-e"%x"'
Note that the difference is endianness.
dd + hexdump will also work:
dd bs=1 count=1 if=/dev/urandom 2>/dev/null | hexdump -e '"%x"'
Sometimes perl5 works better for portability if you target more than one platform. It comes with every Linux distribution and Unix OS. You can often find it in container images where other tools like xxd or hexdump are not available. Here's how to do the same thing in Perl:
$ head -c8 /dev/urandom | perl -0777 -ne 'print unpack "H*"'
5c9ed169dabf33ab
$ echo -n $'\x01\x23\xff' | perl -0777 -ne 'print unpack "H*"'
0123ff
$ echo abc | perl -0777 -ne 'print unpack "H*"'
6162630a
Note that this uses slurp mode, which causes Perl to read the entire input into memory; that may be suboptimal when the input is large.
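If slurping is a concern, a chunked variant (a sketch; the 4096-byte record size is arbitrary) processes fixed-size records instead:
head -c1024 /dev/urandom | perl -ne 'BEGIN { $/ = \4096 } print unpack "H*", $_'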
These three commands will print the same (0102030405060708090a0b0c):
n=12
echo "$a" | xxd -l "$n" -p
echo "$a" | od -N "$n" -An -tx1 | tr -d " \n" ; echo
echo "$a" | hexdump -n "$n" -e '/1 "%02x"'; echo
Given that n=12 and $a holds the byte values from 1 to 26:
a="$(printf '%b' "$(printf '\\0%o' {1..26})")"
That could be used to get $n random byte values in each program:
xxd -l "$n" -p /dev/urandom
od -vN "$n" -An -tx1 /dev/urandom | tr -d " \n" ; echo
hexdump -vn "$n" -e '/1 "%02x"' /dev/urandom ; echo
Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter would be outputting to a fixed filename in a variable directory: > "$FILE/data.dat".
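For instance, the same filter mechanism can compress each header-prefixed piece as it is written (a sketch, assuming GNU split and gzip):
tail -n +2 file.txt | split -l 4 --filter='{ head -n 1 file.txt; cat; } | gzip > $FILE.gz' - split_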
This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
You can use [mg]awk:
awk 'NR==1{
header=$0;
count=1;
print header > "x_" count;
next
}
!( (NR-1) % 100){
count++;
print header > "x_" count;
}
{
print $0 > "x_" count
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Below is a four-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard command-line utilities (head, split, find, grep, xargs, and sed), which should be available on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all the smallfile_ pieces and insert the csvheader as the FIRST line using xargs and sed -i. Note that sed needs the expression in "double quotes" so that the variable is expanded.
The first file, smallfile_00, will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with the sed -i '1d' command.
This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat $file >> tmp_file
mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
I liked marco's awk version and adapted from it this simplified one-liner, where you can easily specify the split fraction as granular as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines
Inspired by @Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
A simple but maybe not as elegant way: cut off the header beforehand, split the file, and then rejoin the header onto each piece with cat, or with whatever tool reads it in.
So something like:
head -n 1 file.txt > header.txt
tail -n +2 file.txt | split -l 1000 - body_
cat header.txt body_aa
I had a better result using the following code: every split file will have a header, and the generated files will have normalized names.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
output
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total