Linux: Using split on limited space

Linux: Using split on limited space - linux

I have a huge file on a linux machine. The file is ~20GB and the space on my box is ~25GB. I want to split the file into ~100mb parts. I know theres a 'split' command but that keeps the original file. I don't have enough space to keep the original. Any ideas on how this can be acomplished? I'll even work with any node modules if they make the task easier than bash.

My attempt:
#! /bin/bash
if [ $# -gt 2 -o $# -lt 1 -o ! -f "$1" ]; then
echo "Usage: ${0##*/} <filename> [<split size in M>]" >&2
exit 1
fi
bsize=${2:-100}
bucket=$( echo $bsize '* 1024 * 1024' | bc )
size=$( stat -c '%s' "$1" )
chunks=$( echo $size / $bucket | bc )
rest=$( echo $size % $bucket | bc )
[ $rest -ne 0 ] && let chunks++
while [ $chunks -gt 0 ]; do
let chunks--
fn=$( printf '%s_%03d.%s' "${1%.*}" $chunks "${1##*.}" )
skip=$(( bsize * chunks ))
dd if="$1" of="$fn" bs=1M skip=${skip} || exit 1
truncate -c -s ${skip}M "$1" || exit 1
done
The above assumes bash(1), and Linux implementations of stat(1), dd(1), and truncate(1). It should be pretty much as fast as it gets, since it uses dd(1) to copy chunks of the initial file. It also uses bc(1) to make sure arithmetic operations in the 20GB range don't overflow anything. However, the script was only tested on smaller files, so double check it before running it against your data.

You can use tail and truncate in a shell script to split a file in place, while destroying the original file. We are splitting the file in place backwards so that we can use the truncate. Here is a sample Bash script:
#!/bin/bash
if [ -z "$2" ]; then
echo "Usage: insplit.sh <splitsize> <filename>"
exit 1
fi
FILE="$2"
SPLITSIZE="$1"
FILESIZE=`stat -c '%s' $FILE`
BLOCKCOUNT=$(( (FILESIZE+SPLITSIZE-1)/SPLITSIZE ))
echo "Split count: $BLOCKCOUNT"
BLOCKCOUNT=$(($BLOCKCOUNT-1))
while [ $BLOCKCOUNT -ge 0 ]; do
FNAME="$FILE.$BLOCKCOUNT"
echo "writing $FNAME"
OFFSET=$((BLOCKCOUNT * SPLITSIZE))
BLOCKSIZE=$(( $FILESIZE - $OFFSET))
tail -c "$BLOCKSIZE" $FILE > $FNAME
truncate -s $OFFSET $FILE
FILESIZE=$((FILESIZE-BLOCKSIZE))
BLOCKCOUNT=$(( $BLOCKCOUNT-1 ))
done
I confirmed the results with a random file:
$ dd if=/dev/urandom of=largefile bs=512 count=1000
$ md5sum largefile
7ff913b62ef572265661a85f06417746 largefile
$ ./insplit.sh 200000 largefile
Split count: 3
writing largefile.2
writing largefile.1
writing largefile.0
$ cat largefile.0 largefile.1 largefile.2 | md5sum
7ff913b62ef572265661a85f06417746 -

Related

Why does my nohup bash script reading in file always stop outputting count around 6k before end of file?

I use nohup to run a bash script to read in each line of a file (and extract info I need). I've used it on multiple files with different line sizes, mostly between 50k and 100k. But no matter how many lines my file is, nohup always stops outputting info around 6k before the last line.
my script called: fetchStuff.sh
#!/bin/bash
urlFile=$1
myHost='http://example.com'
useragent='me'
count=0
total_lines=$(wc -l < $urlFile)
while read url; do
if [[ "$url" == *html ]]; then continue; fi
reqURL=${myHost}${url}
stuffInfo=$(curl -s -XGET -A "$useragent" "$reqURL" | jq -r '.stuff')
[ "$stuffInfo" != "null" ] && echo ${stuffInfo/unwanted_garbage/} >> newversion-${urlFile}
((count++))
if [ $(( $count%20 )) -eq 0 ]
then
sleep 1
fi
if [ $(( $count%100 )) -eq 0 ]; then echo "$urlFile read ${count} of $total_lines"; fi
done < $urlFile
I call it like so: nohup ./fetchStuff.sh file1.txt &
I get count info in nohup.out, e.g. "file1 read 100 of 60000", "file1 read 200 of 60000", etc.
But it always stops around 6k before end of file.
When I do tail nohup.out each time after running the script on the file, I get these as the last line in nohup.out:
file1.txt read 90000 of 96317
file2.txt read 68000 of 73376
file3.txt read 85000 of 91722
file4.txt read 93000 of 99757
I can't figure out why it always stops around 6k before end of file. (I put the sleep timer in to avoid flooding the api w/a lot of requests).

The loops skips lines that end with html, and they're not counted in $count. So I'll bet there are 6317 lines in file1.txt that end with html, 5376 in file2.txt, and so on.
If you want $count to include them, put ((count++)) before the if statement that checks the suffix.
while read url; do
((count++))
if [[ "$url" == *html ]]; then continue; fi
reqURL=${myHost}${url}
stuffInfo=$(curl -s -XGET -A "$useragent" "$reqURL" | jq -r '.stuff')
[ "$stuffInfo" != "null" ] && echo ${stuffInfo/unwanted_garbage/} >> newversion-${urlFile}
if [ $(( $count%20 )) -eq 0 ]
then
sleep 1
fi
if [ $(( $count%100 )) -eq 0 ]; then echo "$urlFile read ${count} of $total_lines"; fi
done < $urlFile
Alternatively you could leave them out of total_lines with:
total_lines=$(grep -c -v 'html$' "$urlFile")
And you could do away with the if statement by using
grep -v 'html$' "$urlFile" | while read url; do
...
done

Can't parse a string with brace expansion operations into a command

have some problem with shell script.
In our office we set up only few commands, that available for devs when they are trying ssh to server. It is configured with help of .ssh/authorized_keys file and available command for user there is bash script:
#!/bin/sh
if [[ $1 == "--help" ]]; then
cat <<"EOF"
This script has the purpose to let people remote execute certain commands without logging into the system.
For this they NEED to have a homedir on this system and uploaded their RSA public key to .ssh/authorized_keys (via ssh-copy-id)
Then you can alter that file and add some commands in front of their key eg :
command="/usr/bin/dev.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty
The user will do the following : ssh testuser#server tail testserver.example.com/2017/01/01/user.log
EOF
exit 0;
fi
# set global variable
set $SSH_ORIGINAL_COMMAND
# set the syslog path where the files can be found
PATH="/opt/syslog/logs"
# strip ; or any other unwanted signs out of the command, this prevents them from breaking out of the setup command
if [[ $1 != "" ]]; then
COMMAND=$1
COMMAND=${COMMAND//[;\`]/}
fi
if [[ $2 != "" ]]; then
ARGU1=$2
ARGU1=${ARGU1//[;\`]/}
fi
if [[ $3 != "" ]]; then
ARGU2=$3
ARGU2=${ARGU2//[;\`]/}
fi
if [[ $4 != "" ]]; then
ARGU3=$4
ARGU3=${ARGU3//[;\`]/}
fi
# checking for the commands
case "$COMMAND" in
less)
ARGU2=${ARGU1//\.\./}
FILE=$PATH/$ARGU1
if [ ! -f $FILE ]; then
echo "File doesn't exist"
exit 1;
fi
#echo " --------------------------------- LESS $FILE"
/usr/bin/less $FILE
;;
grep)
if [[ $ARGU2 == "" ]]; then
echo "Pls give a filename"
exit 1
fi
if [[ $ARGU1 == "" ]]; then
echo "Pls give a string to search for"
exit 1
fi
ARGU2=${ARGU2//\.\./}
FILE=$PATH/$ARGU2
/usr/bin/logger -t restricted-command -- "------- $USER Executing grep $ARGU1 \"$ARGU2\" $FILE"
if [ ! -f $FILE ]; then
echo "File doesn't exist"
/usr/bin/logger -t restricted-command -- "$USER Executing $#"
exit 1;
fi
/bin/grep $ARGU1 $FILE
;;
tail)
if [[ $ARGU1 == "" ]]; then
echo "Pls give a filename"
exit 1
fi
ARGU1=${ARGU1//\.\./}
FILE=$PATH/$ARGU1
if [ ! -f $FILE ]; then
echo "File doesn't exist"
/usr/bin/logger -t restricted-command -- "$USER Executing $# ($FILE)"
exit 1;
fi
/usr/bin/tail -f $FILE
;;
cat)
ARGU2=${ARGU1//\.\./}
FILE=$PATH/$ARGU1
if [ ! -f $FILE ]; then
echo "File doesn't exist"
exit 1;
fi
/bin/cat $FILE
;;
help)
/bin/cat <<"EOF"
# less LOGNAME (eg less testserver.example.com/YYYY/MM/DD/logfile.log)
# grep [ARGUMENT] LOGNAME
# tail LOGNAME (eg tail testserver.example.com/YYYY/MM/DD/logfile.log)
# cat LOGNAME (eg cat testserver.example.com/YYYY/MM/DD/logfile.log)
In total the command looks like this : ssh user#testserver.example.com COMMAND [ARGUMENT] LOGFILE
EOF
/usr/bin/logger -t restricted-command -- "$USER HELP requested $#"
exit 1
;;
*)
/usr/bin/logger -s -t restricted-command -- "$USER Invalid command $#"
exit 1
;;
esac
/usr/bin/logger -t restricted-command -- "$USER Executing $#"
The problem is next:
when i try to exec some command, it takes only first argument, if i do recursion in files by using {n,n1,n2} - it doesn't work:
[testuser#local ~]$ ssh testuser#syslog.server less srv1838.example.com/2017/02/10/local1.log |grep 'srv2010' | wc -l
0
[testuser#local ~]$ ssh testuser#syslog.server less srv2010.example.com/2017/02/10/local1.log |grep 'srv2010' | wc -l
11591
[testuser#local ~]$ ssh testuser#syslog.server less srv{1838,2010}.example.com/2017/02/10/local1.log |grep 'srv2010' | wc -l
0
[testuser#local ~]$ ssh testuser#syslog.server less srv{2010,1838}.example.com/2017/02/21/local1.log |grep 'srv2010' | wc -l
11591
Could someone help me, how can i parse\count command arguments to make it work?
Thank you and have a nice day!

The number of arguments for a bash script would be $#. As a quick example:
#!/bin/bash
narg=$#
typeset -i i
i=1
while [ $i -le $narg ] ; do
echo " $# $i: $1"
shift
i=$i+1
done
gives, for bash tst.sh a b {c,d}
4 1: a
3 2: b
2 3: c
1 4: d
In your script, the command to execute (cat, less, ...) gets explicitly only the second argument to the script. If you want to read all arguments, you should do something like this (note: only a hint, removed all sorts of checks etc..)
command="$1"
shift
case $command in
(grep) pattern="$1"
shift
while [ $# -gt 0 ] ; do
grep "$pattern" "$1"
shift
done
;;
esac
note: added some quotes as comment suggested, but, being only a hint, you should carefully look at quoting and your checks in your own script.

Less command working now:
case "$COMMAND" in
less)
if [[ $ARGU1 == "" ]]; then
echo "Pls give a filename"
exit 1
fi
FILES_LIST=${#:2}
FILE=(${FILES_LIST//\.\./})
for v in "${FILE[#]}";do
v=${v//[;\']/}
if [ ! -f $v ]; then
echo "File doesn't exist"
fi
/usr/bin/less $PATH/$v
done;;
tail command works too with 2 and more files, but i can't execute tail -f command on two files unfortunately.

Script to calculate odd file size

I need to write a script which will calculate a total size of files which size is odd number; could you help me please?
#!/bin/bash
echo "Directory <$1> contains the following filenames of odd size:"
ls -l $1 |
while read file_parm
do
size=`echo $file_parm | cut -f 5 -d " "`
name=`echo $file_parm | cut -f 9 -d " "`
let "div=size%2"
if [ ! -d $name ]
then
if [ $div -ne 0 ]
then
# this is listing odd numbers from this
# directory; I just need to add them together
# and print result
echo "[$name : $size]"
fi
fi
done

I virtually copied the code from my comment and ran it, and it worked -- I just had to ensure I had $1 set to somewhere sane, rather than empty.
$ set -- "."; totsize=0; for file in "$1"/*; do if [ -f "$file" ]; then size=$(stat -c '%s' "$file"); if ((size % 2 == 1)); then echo "[$file : $size]"; ((totsize += $size)); fi; fi; done; echo "Total size of odd-sized files = $totsize"
[./bash-assoc-arrays.sh : 417]
[./makefile : 1125]
[./xx.pl : 117]
Total size of odd-sized files = 1659
$
Or, formatted for readability:
set -- "."
totsize=0
for file in "$1"/*
do
if [ -f "$file" ]
then
size=$(stat -c '%s' "$file")
if ((size % 2 == 1))
then
echo "[$file : $size]"
((totsize += $size))
fi
fi
done
echo "Total size of odd-sized files = $totsize"
The repeated invocation of stat is a bit expensive. If you don't have files with newlines in their names (most people don't), you can speed it up with a single invocation of stat and some care:
stat -c '%s %F %n' "$1"/* |
{
totsize=0
while read size type name
do
if [ "X$type" = "X-" ] && ((size % 2 == 1))
then
((totsize+=$size))
echo "[$name : $size]"
fi
done
echo "Total size of odd-sized files = $totsize"
}
You could use (...) in place of {...} at a marginal (unmeasurable) cost in efficiency.
Answers to other questions explain the [ "X$type" = "X-" ] notation.

Why do this sample script, keep outputting error near token?

enter image description hereI was trying to see how a shell scripts work and how to run them, so I toke some sample code from a book I picked up from the library called "Wicked Cool Shell Scripts"
I re wrote the code verbatim, but I'm getting an error from Linux, which I compiled the code on saying:
'd.sh: line 3: syntax error near unexpected token `{
'd.sh: line 3:`gmk() {
Before this I had the curly bracket on the newline but I was still getting :
'd.sh: line 3: syntax error near unexpected token
'd.sh: line 3:`gmk()
#!/bin/sh
#format directory- outputs a formatted directory listing
gmk()
{
#Give input in Kb, output converted to Kb, Mb, or Gb for best output format
if [$1 -ge 1000000]; then
echo "$(scriptbc -p 2 $1/1000000)Gb"
elif [$1 - ge 1000]; then
echo "$$(scriptbc -p 2 $1/1000)Mb"
else
echo "${1}Kb"
fi
}
if [$# -gt 1] ; then
echo "Usage: $0 [dirname]" >&2; exit 1
elif [$# -eq 1] ; then
cd "$#"
fi
for file in *
do
if [-d "$file"] ; then
size = $(ls "$file"|wc -l|sed 's/[^[:digit:]]//g')
elif [$size -eq 1] ; then
echo "$file ($size entry)|"
else
echo "$file ($size entries)|"
fi
else
size ="$(ls -sk "$file" | awk '{print $1}')"
echo "$file ($(gmk $size))|"
fi
done | \
sed 's/ /^^^/g' |\
xargs -n 2 |\
sed 's/\^\^\^/ /g' | \
awk -F\| '{ printf "%39s %-39s\n", $1, $2}'
exit 0
if [$#-gt 1]; then
echo "Usage :$0 [dirname]" >&2; exit 1
elif [$# -eq 1]; then
cd "$#"
fi
for file in *
do
if [ -d "$file" ] ; then
size =$(ls "$file" | wc -l | sed 's/[^[:digit:]]//g')
if [ $size -eq 1 ] ; then
echo "$file ($size entry)|"
else
echo "$file ($size entries)|"
fi
else
size ="$(ls -sk "$file" | awk '{print $1}')"
echo "$file ($(convert $size))|"
fi
done | \
sed 's/ /^^^/g' | \
xargs -n 2 | \
sed 's/\^\^\^/ /g' | \
awk -F\| '{ printf "%-39s %-39s\n", $1, $2 }'
exit 0

sh is very sensitive to spaces. In particular assignment (no spaces around =) and testing (must have spaces inside the [ ]).
This version runs, although fails on my machine due to the lack of scriptbc.
You put an elsif in a spot where it was supposed to be if.
Be careful of column alignment between starts and ends. If you mismatch them it will easily lead you astray in thinking about how this works.
Also, adding a set -x near the top of a script is a very good way of debugging what it is doing - it will cause the interpreter to output each line it is about to run before it does.
#!/bin/sh
#format directory- outputs a formatted directory listing
gmk()
{
#Give input in Kb, output converted to Kb, Mb, or Gb for best output format
if [ $1 -ge 1000000 ]; then
echo "$(scriptbc -p 2 $1/1000000)Gb"
elif [ $1 -ge 1000 ]; then
echo "$(scriptbc -p 2 $1/1000)Mb"
else
echo "${1}Kb"
fi
}
if [ $# -gt 1 ] ; then
echo "Usage: $0 [dirname]" >&2; exit 1
elif [ $# -eq 1 ] ; then
cd "$#"
fi
for file in *
do
if [ -d "$file" ] ; then
size=$(ls "$file"|wc -l|sed 's/[^[:digit:]]//g')
if [ $size -eq 1 ] ; then
echo "$file ($size entry)|"
else
echo "$file ($size entries)|"
fi
else
size="$(ls -sk "$file" | awk '{print $1}')"
echo "$file ($(gmk $size))|"
fi
done | \
sed 's/ /^^^/g' |\
xargs -n 2 |\
sed 's/\^\^\^/ /g' | \
awk -F\| '{ printf "%39s %-39s\n", $1, $2}'
exit 0

By the way, with respect to the book telling you to modify your PATH variable, that's really a bad idea, depending on what exactly it advised you to do. Just to be clear, never add your current directory to the PATH variable unless you intend on making that directory a permanent location for all of your scripts etc. If you are making this a permanent location for your scripts, make sure you add the location to the END of your PATH variable, not the beginning, otherwise you are creating a major security problem.
Linux and Unix do not add your current location, commonly called your PWD, or present working directory, to the path because someone could create a script called 'ls', for example, which could run something malicious instead of the actual 'ls' command. The proper way to execute something in your PWD, is to prepend it with './' (e.g. ./my_new_script.sh). This basically indicates that you really do want to run something from your PWD. Think of it as telling the shell "right here". The '.' actually represents your current directory, in other words "here".

Existing file not found

There is a bash script that I use to extract the artwork from MP3 files before converting them.
#!/bin/bash
MUSIC_FILE=$1
IMAGE_FILE=""
TIMESTAMP=$2
if [ -z "$MUSIC_FILE" ] ; then
exit 2;
fi
if [ -z $TIMESTAMP ] ; then
TIMESTAMP=$(date +%s)
fi
IMAGE_FILE=`/usr/bin/eyeD3 --write-images=. "$MUSIC_FILE" 2>&1 | grep Writing | sed -e 's/Writing //g' -e 's/\.\.\.//g' | tr ' ' '_'`
if [ -z $IMAGE_FILE ] ; then
exit 3
fi
if [ -e $IMAGE_FILE ] ; then
/usr/bin/convert $IMAGE_FILE $TIMESTAMP.png
exit 0
else
exit 4
fi
The artwork file is well extracted, I can see it by a ls output, and the variable used to get the file name is correct (no heading/trailing spaces, etc), but within the script nor convert nor any additional ls finds it (no such file or directory)...
It really drives me nut...
Additional information : when I launch the script with the -x flag, every representation of my file name is yellow-colorized, can't figure out why...
Thanks for your help !
Jérémie

Instead of trying to obtain filename by filtering command's output, why did you not:
#!/bin/bash
MUSIC_FILE="$1"
[ -f "$MUSIC_FILE" ] || exit 2
TIMESTAMP=${2:-$(date +%s)}
mydir=$(mktemp -d)
/usr/bin/eyeD3 --write-images=$mydir "$MUSIC_FILE" >/dev/null 2>&1
filename=($(/bin/ls -1 $mydir))
[ -f "$mydir/$filename" ] &&
convert $mydir/$filename" $TIMESTAMP.png
rm -fR $mydir
Not tested, but could work... ( approx ;-)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Linux: Using split on limited space - linux

Related

Why does my nohup bash script reading in file always stop outputting count around 6k before end of file?

Can't parse a string with brace expansion operations into a command

Script to calculate odd file size

Why do this sample script, keep outputting error near token?

Existing file not found

Categories

Resources