How to split a large variable? - linux

I'm working with large variables, and "looping" through them with while read line can be very slow. I've found that the smaller the variable is, the faster it goes.
How can I split a large variable into smaller variables and then read them one by one?
for example,
What I would like to achieve:
bigVar=$(echo "$bigVar" | split_var)
for var in "${bigVar[@]}"; do
    while read line; do
        ...
    done <<< "${var}"
done
Or maybe split it into bigVar1, bigVar2, bigVar3, etc., and then read them one by one.

Instead of doing
bigVar=$(someCommand)
while read line
do
...
done <<< "$bigVar"
Use
while read line
do
...
done < <(someCommand)
This way, you avoid the problem with big variables entirely, and someCommand can output gigabyte after gigabyte with no problem.
If the reason you put it in a variable was to do work on it in multiple steps, rewrite it as a pipeline.
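For example, here is a minimal sketch of such a rewrite, where grep and sort stand in for whatever intermediate steps you were storing in variables:
# Instead of bigVar=$(someCommand) followed by several passes over "$bigVar",
# chain the steps so nothing is ever held in a shell variable:
someCommand | grep 'pattern' | sort | while read -r line
do
    printf 'processing: %s\n' "$line"
done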

If $bigVar is made of words, you could use xargs to split it into lines no longer than the maximum length of a command line, usually 32 KB or 64 KB:
someCommand | xargs | while read line
do
...
done
In this case xargs uses its default command, which is echo.
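As a quick illustration of that batching (using -n 2 here only to make the effect visible on a short input; with no options, xargs packs as many words as fit under the system's command-line limit):
$ printf '%s\n' alpha beta gamma delta | xargs -n 2
alpha beta
gamma delta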
I'm curious about what you want to do in the while loop, as it may be optimized with a pipeline.

Related

Is there some way in Linux to create a "virtual file" that is the concatenation of two files?

I have two data sets that I want to run a program on. I want to compare the results to running the analysis on each set individually, and on the combined data. Since they're large files, I'd prefer not to create a whole new file that's the two data sets concatenated, doubling the disk space required. Is there some way in Linux I can create something like a symbolic link, but that points to two files, so that when it's read it will read two other files in sequence, as if they're concatenated into one file? I'm guessing probably not, but I thought I'd check.
Can your program read from standard input?
cat file-1 file-2 | yourprogram
If your program can only read from a file that is named on the command line, then this might work:
yourprogram <(cat file-1 file-2)
I think you need to be running the /bin/bash shell for the second example to work. The shell replaces <(foobar) with the name of a named pipe that your program can open and read like a file. The shell runs the foobar command in another process and sends its output into the other end of the pipe.
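You can see what the shell substitutes by echoing it; on Linux it is usually a /dev/fd path (the exact number varies):
$ echo <(cat file-1 file-2)
/dev/fd/63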

Read lines from file and write directly to FIFO, without buffering in RAM

I have a file with very, very long lines. A single line will not fit in memory. I need to process each line separately, so I'd like to write each line to a FIFO node, sending EOF between lines. This model works well for what I'm doing, and would be ideal.
The lines cannot be stored in RAM in their entirety, so the read built-in is not an option. Whatever command I use needs to write directly--unbuffered--to the FIFO while reading from the multi-gigabyte source file.
How can I achieve this, preferably with pure bash and basic Linux utilities, and no special programs that I have to install? I've tried things like sed -u 'p', but unfortunately, I can't seem to get any common programs to send EOF between lines.
bash version:
GNU bash, version 4.2.45(1)-release (x86_64-pc-linux-gnu)
Update
I'd like to avoid utilities/tricks that read "line number X" from the file. This file has thousands of very long lines; repeatedly evaluating the same line breaks would take far too long. For this to complete in a reasonable timeframe, whatever program is reading lines will need to read each line in sequence, saving its previous position, much like read + pipe.
Let's think about the question "hashing lines that don't fit in memory", because that's a straightforward problem that you can solve with only a few lines of Python.
import sys
import hashlib

def hash_lines(hashname, file):
    hash = hashlib.new(hashname)
    while True:
        data = file.read(1024 * 1024)
        if not data:
            break
        while True:
            i = data.find('\n')
            if i >= 0:
                hash.update(data[:i])  # See note
                yield hash.hexdigest()
                data = data[i+1:]
                hash = hashlib.new(hashname)
            else:
                hash.update(data)
                break
    yield hash.hexdigest()

for h in hash_lines('md5', sys.stdin):
    print h
It's a bit wacky because most languages do not have good abstractions for objects that do not fit in memory (an exception is Haskell; this would probably be four lines of Haskell).
Note: Use i+1 if you want to include the line break in the hash.
Haskell version
The Haskell version is like a pipe (read right to left). Haskell supports lazy IO, which means that it only reads the input as necessary, so the entire line doesn't need to be in memory at once.
A more modern version would use conduits instead of lazy IO.
module Main (main) where
import Crypto.Hash.MD5
import Data.ByteString.Lazy.Char8
import Data.ByteString.Base16
import Prelude hiding (interact, lines, unlines)
main = interact $ unlines . map (fromStrict . encode . hashlazy) . lines
The problem was that I should've been using a normal file, not a FIFO. I was looking at this wrong. read works the same way as head: it doesn't remember where it is in the file. The stream remembers for it. I don't know what I was thinking. head -n 1 will read one line from a stream, starting at whatever position the stream is already at.
#!/bin/bash
# So we don't leave a giant temporary file hanging around:
tmp=
trap '[ -n "$tmp" ] && rm -f "$tmp" &> /dev/null' EXIT
tmp="$(mktemp)" || exit $?
while head -n 1 > "$tmp" && [ -s "$tmp" ]; do
    # Run the $tmp file through several programs, like md5sum:
    md5sum "$tmp"
done < "$1"
This works quite effectively. head -n 1 reads each line in sequence from a basic file stream. I don't have to worry about background tasks or blocking when reading from empty FIFOs, either.
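Here is a quick demonstration of that behaviour with GNU coreutils and a regular (seekable) file; demo.txt is just a throwaway example:
$ printf 'one\ntwo\nthree\n' > demo.txt
$ { head -n 1; head -n 1; } < demo.txt
one
two
Each head call reads one line and leaves the file offset just past it, which is exactly what the loop above relies on.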
Hashing long lines should be perfectly feasible, but probably not so easily in bash.
If I may suggest Python, it could be done like this:
# Open the FIFO somehow. If it is stdin (as a pipe), fine. If not, simply open it.
# Below, f is the opened FIFO.
def read_blockwise(f, blocksize):
    while True:
        data = f.read(blocksize)
        if not data: break
        yield data

hasher = some_object_which_does_the_hashing()
for block in read_blockwise(f, 65536):
    spl = block.split('\n', 1)
    hasher.update(spl[0])
    if len(spl) > 1:
        hasher.wrap_over()  # or whatever you need to do when a new line comes along
        hasher.update(spl[1])
This is merely Python-oriented pseudo-code that shows the general direction of what you seem to want to do.
Note that it won't work entirely without memory, but I don't think that matters: the chunks are very small (and could be made even smaller) and are processed as they arrive.
Encountering a line break ends processing of the old line and starts processing of a new one.

Is there an efficient way to read line input in bash?

I want to split large, compressed CSV files into multiple smaller gzip files, split on line boundaries.
I'm trying to pipe gunzip into a bash script with a while read LINE loop. That script writes to a named pipe where a background gzip process recompresses the data. Every X characters read, I close the FD and start a new gzip process for the next split.
But in this scenario the script, with while read LINE, consumes 90% of the CPU, because read is so inefficient here (I understand that it makes a system call to read one character at a time).
Any thoughts on doing this efficiently? I would expect gzip to consume the majority of the CPU.
Use split with the -l option to specify how many lines you want per chunk, together with the --filter option. $FILE is the name split would have used for the output file (it has to be quoted with single quotes to keep the shell from expanding it too early):
zcat doc.gz | split -l 1000 --filter='gzip > $FILE.gz'
If you need any additional processing, just write a script that accepts the filename as an argument and processes standard input accordingly, and use that in place of plain gzip.
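For example, here is a minimal sketch of such a filter script; the name process-chunk.sh and the header line are made up for illustration:
#!/bin/bash
# process-chunk.sh: split passes the chunk on stdin and sets $FILE,
# which we receive here as $1
outfile="$1"
{ echo "id,name,value"; cat; } | gzip > "$outfile.gz"
It would then be used like this:
zcat doc.gz | split -l 1000 --filter='./process-chunk.sh $FILE'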
How about using the split command with the -l option?
gzcat large.csv.gz | split -l 1000 - xxx
gzip xxx*

Write a certain hex pattern to a file from bash

I am trying to do some memory tests and am trying to write a certain hex pattern to a regular file from bash. How would I go about doing this without using the xxd or hexdump tool/command?
Thanks,
Neco
The simplest thing is probably:
printf '\xde\xad\xbe\xef' > file
but it is often more convenient to do
perl -e 'print pack "H*", "deadbeef"' > file
If I understand your question correctly, printf should do:
$ printf %X 256
100
Can you use od -x instead? That's pretty universally available; od has been around since the dawn of time[1].
[1] Not really the dawn of time.
There are multiple ways to do this in bash. One of them is echo with hex escape sequences such as \x31:
'\' escapes the next character from bash's own processing (which is why the backslashes are doubled in the unquoted command below)
'x' indicates a hex number
echo -en \\x31\\x32\\x33 > test
-e enables interpretation of backslash escapes
-n suppresses the trailing newline (otherwise 0x0A would be appended at the end)
Memory testing is a much more complex subject than just writing and reading patterns in memory. It puts pretty hard limits on what a testing program can do and on what state the whole system is in. Technically, it's impossible to test 100% of memory while running a regular OS at all.
On the other hand, you can run a real test program from a shell, or schedule a test run on the next boot with some clever hacking. You might want to take a look at how it's done in Inquisitor, i.e. running memtester for in-OS testing and scheduling a memtest86* run on the next boot.
If you absolutely must remain in your currently booted OS, then memtester would probably be your tool of choice, although note that it's not a very precise memory test.
There are a lot of suggestions to use printf and echo, but there's one tiny difference: printf is not capable of producing zeros (binary zeros), while echo does the job properly. Consider these examples:
printf "\x31\x32\x00\x33\x00\x00\x00\x34">printf.txt
echo -en "\x31\x32\x00\x33\x00\x00\x00\x34">echo.txt
As a result, printf.txt has a size of 3 bytes (yep, it writes the first zero and stops), while echo.txt is 8 bytes long and contains the actual data.
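You can confirm the sizes with wc -c; the counts below are the ones described above and may differ with other bash versions or an external /usr/bin/printf:
$ wc -c printf.txt echo.txt
 3 printf.txt
 8 echo.txt
11 total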

Read data from pipe and write to standard out with a delay in between. Must handle binary files too

I have been trying for about an hour now to find an elegant solution to this problem. My goal is basically to write a bandwidth control pipe command which I could re-use in various situations (not just for network transfers; I know about scp -l 1234). What I would like to do is:
Delay for X seconds.
Read Y amount (or less than Y if there isn't enough) data from pipe.
Write the read data to standard output.
Where:
X could be 1..n.
Y could be 1 Byte up to some high value.
My problem is:
It must support binary data which Bash can't handle well.
Roads I've taken or at least thought of:
Using a while read data construct: it filters out whitespace characters in whatever encoding you're using.
Using dd bs=1 count=1 and looping. dd doesn't seem to have different exit codes for when there was something to read from its input and when there wasn't, which makes it harder to know when to stop looping. This method should work if I redirect standard error to a temporary file, read it to check whether something was transferred (it's in the statistics printed on stderr), and repeat. But I suspect that would be extremely slow on large amounts of data, and if possible I'd like to avoid creating any temporary files.
Any ideas or suggestions on how to solve this as cleanly as possible using Bash?
Maybe pv -qL RATE?
-L RATE, --rate-limit RATE
Limit the transfer to a maximum of RATE bytes per second. A
suffix of "k", "m", "g", or "t" can be added to denote kilobytes
(*1024), megabytes, and so on.
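A typical use in the middle of a pipeline, assuming you want to cap throughput at 100 kB/s (the surrounding commands are placeholders):
someCommand | pv -qL 100k | someOtherCommand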
It's not very elegant, but you can use a redirection trick to catch the number of bytes copied by dd and then use it as the exit condition for a while loop:
while [ -z "$byte_copied" ] || [ "$byte_copied" -ne 0 ]; do
    sleep $X;
    byte_copied=$(dd bs=$Y count=1 2>&1 >&4 | awk '$2 ~ /^bytes?$/ {print $1}');
done 4>&1
However, if your intent is to limit the transfer throughput, I suggest you use pv.
Do you have to do it in bash? Can you just use an existing program such as cstream?
cstream meets your goal of a bandwidth controlled pipe command, but doesn't necessarily meet your other criteria with regard to your specific algorithm or implementation language.
What about using head -c?
cat /dev/zero | head -c 10 > test.out
Gives you a nice 10-byte file.
