Gunzip from string - linux

I have a problem. I'd like to decompress a string directly from a file. I have one bash script that creates another script:
#!/bin/bash
echo -n '#!/bin/bash
' > test.sh # generate the interpreter header
echo -n "echo '" >> test.sh # start an echo command in the generated file
echo -n "My name is Daniel" | gzip -f >> test.sh # append the gzip-compressed string
echo -n "' | gunzip;" >> test.sh # append the commands that decode the string
chmod a+x test.sh # make the file executable
I want to generate a script test.sh that is as short as possible. I'm trying to compress the string "My name is Daniel" and write it directly into the file test.sh.
But when I run test.sh I get gzip: stdin has flags 0x81 -- not supported
Do you know why I get this error?

gzip output is binary, so it can contain any byte value; because the script is generated with bash, those bytes are subject to the locale's character encoding (check echo $LANG).
The bytes that cause problems between single quotes are NUL (0x00), the single quote itself (0x27), and the non-ASCII bytes 128-255 (0x80-0xff).
A solution can be to use ANSI-C quoting ($'..') and to escape the NUL and non-ASCII bytes.
EDIT: a bash string can't contain the NUL character:
gzip -c <<<"My name is Daniel" | od -c -tx1
Trying to create an ANSI-C quoted string
echo -n $'\x1f\x8b\x08\x00\xf7i\xe2Y\x00\x03\xf3\xadT\xc8K\xccMU\xc8,VpI\xcc\xcbL\xcd\^C1\x00\xa5u\x87\xad\x11\x00\x00\x00' | od -c -tx1
shows that the string is truncated after the NUL character.
The best compromise may be to use base64 encoding:
gzip <<<"My name is Daniel"| base64
base64 --decode <<__END__ | gzip -cd
H4sIAPts4lkAA/OtVMhLzE1VyCxWcEnMy0zN4QIAgdbGlBIAAAA=
__END__
or
base64 --decode <<<H4sIAPts4lkAA/OtVMhLzE1VyCxWcEnMy0zN4QIAgdbGlBIAAAA=|gzip -cd
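Putting it together, here is a sketch of the generator rewritten to embed the payload as base64, so that test.sh stays pure ASCII (note that -w0, which disables line wrapping, is a GNU coreutils option):
#!/bin/bash
payload=$(echo -n "My name is Daniel" | gzip -f | base64 -w0) # compress, then encode as ASCII-safe base64
{
  echo '#!/bin/bash'
  printf "base64 --decode <<<'%s' | gunzip\n" "$payload"
} > test.sh
chmod a+x test.sh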

The problem was storing the null character (\0) in a bash script.
The null character cannot be passed through echo or stored in a shell variable; it can, however, live in files and pipes.
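A quick way to see this (a minimal demonstration; od just makes the bytes visible):
printf 'a\0b' | od -c   # the NUL survives in a pipe: a \0 b
v=$(printf 'a\0b')      # command substitution drops the NUL
echo "${#v}"            # prints 2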
I wanted to avoid using base64, but I fixed it with
printf "...%b....%b" "\0" "\0"
I edited the script with the Bless hex editor. It's working for me :)

Related

How to add UTF-16 characters at the beginning of an existing file using sed?

I have a large script that generates many files, and part of it doesn't work due to a missing BOM. I have to work with the file named pagecounts-${_date}, which is ultimately created like this:
cat $TMPDIR/*.filtered > $TMPDIR/pagecounts-${_date}
Then I use sort and try to work with it in another script, but I get the BOM error. My question is: can I add a BOM for UTF-16 at the beginning of an already existing file? If yes, how can I achieve that?
I was thinking of using a temporary file like this:
cat $TMPDIR/*.filtered > $TMPDIR/tmp_pagecounts-${_date}
echo '\ufeff' > $TMPDIR/pagecounts-${_date}
cat $TMPDIR/tmp_pagecounts-${_date} | sort >> $TMPDIR/pagecounts-${_date}
But this way seems to chop off some of the UTF-16 characters.
You could use echo -e to print the Unicode BOM character sequence as-is:
sed "1s/^/$(echo -ne '\ufeff')/" "$TMPDIR"/pagecounts-${_date}
or use printf too
sed "1s/^/$(printf '\ufeff')/" "$TMPDIR"/pagecounts-${_date}
Confirm that the sequence is accurate by running hexdump -c or hexdump -C on the output:
echo -ne '\ufeff' | hexdump -c
0000000 355 237 277 355 273 277
0000006
You can confirm that the same bytes appear when the change is applied to the file.
The above sed commands just print the file contents to stdout; to modify the file in place, use the -i flag (-i '' is required for macOS's sed):
sed -i '' ...
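One caveat worth checking here: in a UTF-8 locale, '\ufeff' expands to the UTF-8 encoding of U+FEFF, not a UTF-16 BOM. If the downstream tool specifically requires a UTF-16LE BOM (the raw bytes ff fe), a sketch that prepends those bytes directly (with_bom.tmp is an illustrative name):
printf '\xff\xfe' > with_bom.tmp                     # raw UTF-16LE BOM
cat "$TMPDIR/pagecounts-${_date}" >> with_bom.tmp    # then the existing content
mv with_bom.tmp "$TMPDIR/pagecounts-${_date}"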

Bash add line numbers to a file and save the output to the input file itself [duplicate]

Basically I want to take as input text from a file, remove a line from that file, and send the output back to the same file. Something along these lines if that makes it any clearer.
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > file_name
However, when I do this I end up with a blank file.
Any thoughts?
Use sponge for this kind of task. It's part of moreutils.
Try this command:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | sponge file_name
You cannot do that because bash processes the redirections first, then executes the command. So by the time grep looks at file_name, it is already empty. You can use a temporary file though.
#!/bin/sh
tmpfile=$(mktemp)
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > ${tmpfile}
cat ${tmpfile} > file_name
rm -f ${tmpfile}
Like that. mktemp is used here to create the temporary file, but note that it's not POSIX.
Use sed instead:
sed -i '/seg[0-9]\{1,\}\.[0-9]\{1\}/d' file_name
Try this simple one:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
Your file will not be blank this time :) and your output is also printed to your terminal.
You can't use a redirection operator (> or >>) to the same file, because it has a higher precedence and it will create/truncate the file before the command is even invoked. To avoid that, you should use appropriate tools such as tee, sponge, sed -i or any other tool which can write results to the file (e.g. sort file -o file).
Basically redirecting input to the same original file doesn't make sense and you should use appropriate in-place editors for that, for example Ex editor (part of Vim):
ex '+g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' -scwq file_name
where:
'+cmd'/-c - run any Ex/Vim command
g/pattern/d - remove lines matching a pattern using global (help :g)
-s - silent mode (man ex)
-c wq - execute :write and :quit commands
You may use sed to achieve the same (as already shown in other answers), however the in-place option (-i) is a non-standard FreeBSD extension (it may work differently between Unix/Linux), and sed is basically a stream editor, not a file editor. See: Does Ex mode have any practical use?
One liner alternative - set the content of the file as variable:
VAR=`cat file_name`; echo "$VAR"|grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' > file_name
Since this question is the top result in search engines, here's a one-liner based on https://serverfault.com/a/547331 that uses a subshell instead of sponge (which often isn't part of a vanilla install like OS X):
echo "$(grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name)" > file_name
The general case is:
echo "$(cat file_name)" > file_name
Edit: the above solution has some caveats:
printf '%s' <string> should be used instead of echo <string> so that files containing -n don't cause undesired behavior.
Command substitution strips trailing newlines (this is a bug/feature of shells like bash), so we should append a postfix character like x to the output and remove it on the outside via parameter expansion of a temporary variable like ${v%x}.
Using a temporary variable $v stomps the value of any existing variable $v in the current shell environment, so we should nest the entire expression in parentheses to preserve the previous value.
Another bug/feature of shells like bash is that command substitution strips unprintable characters like null from the output. I verified this by calling dd if=/dev/zero bs=1 count=1 >> file_name and viewing the file in hex with cat file_name | xxd -p; the output of echo "$(cat file_name)" | xxd -p shows the null byte stripped. So this answer should not be used on binary files or anything using unprintable characters, as Lynch pointed out.
The general solution (albeit slightly slower, more memory intensive and still stripping unprintable characters) is:
(v=$(cat file_name; printf x); printf '%s' "${v%x}" > file_name)
Test from https://askubuntu.com/a/752451:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do (v=$(cat file_uniquely_named.txt; printf x); printf '%s' ${v%x} > file_uniquely_named.txt); done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Should print:
hello
world
Whereas calling cat file_uniquely_named.txt > file_uniquely_named.txt in the current shell:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do cat file_uniquely_named.txt > file_uniquely_named.txt; done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Prints an empty string.
I haven't tested this on large files (probably over 2 or 4 GB).
I have borrowed this answer from Hart Simha and kos.
This is very much possible, you just have to make sure that by the time you write the output, you're writing it to a different file. This can be done by removing the file after opening a file descriptor to it, but before writing to it:
exec 3<file ; rm file; COMMAND <&3 >file ; exec 3>&-
Or line by line, to understand it better:
exec 3<file # open a file descriptor reading 'file'
rm file # remove file (but fd3 will still point to the removed file)
COMMAND <&3 >file # run command, with the removed file as input
exec 3>&- # close the file descriptor
It's still a risky thing to do, because if COMMAND fails to run properly, you'll lose the file contents. That can be mitigated by restoring the file if COMMAND returns a non-zero exit code:
exec 3<file ; rm file; COMMAND <&3 >file || cat <&3 >file ; exec 3>&-
We can also define a shell function to make it easier to use:
# Usage: replace FILE COMMAND
replace() { exec 3<"$1" ; rm "$1"; "${@:2}" <&3 >"$1" || cat <&3 >"$1" ; exec 3>&-; }
Example :
$ echo aaa > test
$ replace test tr a b
$ cat test
bbb
Also, note that this will keep a full copy of the original file (until the third file descriptor is closed). If you're using Linux, and the file you're processing is too big to fit twice on the disk, you can check out this script that will pipe the file to the specified command block-by-block while unallocating the already processed blocks. As always, read the warnings in the usage page.
The following will accomplish the same thing that sponge does, without requiring moreutils:
shuf --output=file --random-source=/dev/zero
The --random-source=/dev/zero part tricks shuf into doing its thing without doing any shuffling at all, so it will buffer your input without altering it.
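Used in place of sponge, the pipeline would look like this (a sketch; --output and --random-source are GNU shuf options):
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | shuf --output=file_name --random-source=/dev/zero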
However, it is true that using a temporary file is better, for performance reasons. So here is a function that I have written that will do that for you in a generalized way:
# Pipes a file into a command, and pipes the output of that command
# back into the same file, ensuring that the file is not truncated.
# Parameters:
# $1: the file.
# $2: the command. (With $3... being its arguments.)
# See https://stackoverflow.com/a/55655338/773113
siphon()
{
    local tmp file rc=0
    [ "$#" -ge 2 ] || { echo "Usage: siphon filename [command...]" >&2; return 1; }
    file="$1"; shift
    tmp=$(mktemp -- "$file.XXXXXX") || return
    "$@" <"$file" >"$tmp" || rc=$?
    mv -- "$tmp" "$file" || rc=$(( rc | $? ))
    return "$rc"
}
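For example, to apply the original grep to a file in place (an illustrative invocation, assuming the function above has been sourced):
siphon file_name grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}'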
There's also ed (as an alternative to sed -i):
# cf. http://wiki.bash-hackers.org/howto/edit-ed
printf '%s\n' H 'g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' wq | ed -s file_name
You can use slurp with POSIX Awk:
!/seg[0-9]\{1,\}\.[0-9]\{1\}/ {
    q = q ? q RS $0 : $0
}
END {
    print q > ARGV[1]
}
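For example, assuming the program above is saved as remove-seg.awk (an illustrative name), you can run it over a file in place:
awk -f remove-seg.awk file_name
The END block writes the filtered content back over the input file only after awk has finished reading it.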
This does the trick pretty nicely in most of the cases I faced:
cat <<< "$(do_stuff_with f)" > f
Note that while $(…) strips trailing newlines, <<< ensures a final newline, so generally the result is magically satisfying.
(Look for “Here Strings” in man bash if you want to learn more.)
Full example:
#! /usr/bin/env bash
get_new_content() {
sed 's/Initial/Final/g' "${1:?}"
}
echo 'Initial content.' > f
cat f
cat <<< "$(get_new_content f)" > f
cat f
This does not truncate the file and yields:
Initial content.
Final content.
Note that I used a function here for the sake of clarity and extensibility, but that’s not a requirement.
A common use case is JSON editing:
echo '{ "a": 12 }' > f
cat f
cat <<< "$(jq '.a = 24' f)" > f
cat f
This yields:
{ "a": 12 }
{
"a": 24
}
Try this
echo -e "AAA\nBBB\nCCC" > testfile
cat testfile
AAA
BBB
CCC
echo "$(grep -v 'AAA' testfile)" > testfile
cat testfile
BBB
CCC
I usually use the tee program to do this:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
Be aware, though, that tee does not buffer through a tempfile: it truncates file_name as soon as it starts, so this only works while grep's output still fits in the pipe buffer. It's convenient for small files but unsafe for large ones.

bash replace \ with / in a textfile

I have a file with some file paths, like this:
C:\Users\peter\workspace\etwas.txt
I tried to read it with this routine:
while read LINE
do
web=$LINE
echo $web
done <etwas.txt
The result:
C:Userspeterworkspaceetwas.txt
I want to read it in this form:
C:/Users/peter/workspace/etwas.txt
How can I read it that way?
Try doing this using only bash builtins:
while read -r LINE; do
web="${LINE//\\//}"
echo "$web"
done < etwas.txt
output
$ cat etwas.txt
C:\Users\peter\workspace\etwas.txt
$ while read -r LINE; do
> web="${LINE//\\//}"
> echo "$web"
> done < etwas.txt
C:/Users/peter/workspace/etwas.txt
I use bash parameter expansion to substitute \ with /.
The most portable way to do this (i.e. not relying on bash extensions) is with the tr command.
tr \\\\ / < etwas.txt | while read LINE
do
web=$LINE
echo $web
done
(Quadruple backslash?! Yeah, because both the shell and tr itself treat backslash as an escape character, so you need to write four of them to get a literal backslash in this context. '\\' would also work.)
WARNING: piping input to while may (or may not, depending on the shell) cause the contents of the loop to be executed in a subshell, so variables set inside the loop may not survive it.
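A minimal sketch of what that warning means in bash's default mode (with shopt -s lastpipe in a non-interactive shell, the count would survive):
count=0
tr '\\' '/' < etwas.txt | while read -r line; do
    count=$((count + 1))   # runs in a subshell here
done
echo "$count"              # prints 0: the increment was lost with the subshell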

How to find a windows end of line (EOL) character

I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if there are Windows EOL characters in the files. The data may or may not have Windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.
So my question is: in Cygwin, how can I figure out whether these files have Windows CRLF line endings?
I've tried creating some test data and running
sed -r 's/\r\n//' testdata.txt
But that appears to match regardless of whether dos2unix has been run or not.
Thanks.
The file(1) utility knows the difference:
$ file * | grep ASCII
2: ASCII text
3: ASCII English text
a: ASCII C program text
blah: ASCII Java program text
foo.js: ASCII C++ program text
openssh_5.5p1-4ubuntu5.dsc: ASCII text, with very long lines
windows: ASCII text, with CRLF line terminators
file(1) has been optimized to try to read as little of a file as possible, so you may be lucky and drastically reduce the amount of disk IO you need to perform when finding and fixing the CRLF terminators.
Note that some cases of CRLF should stay in place: captures of SMTP will use CRLF. But that's up to you. :)
#!/bin/bash
for i in $(find . -type f); do
    if file "$i" | grep CRLF ; then
        echo "$i"
        file "$i"
        #dos2unix "$i"
    fi
done
Uncomment the dos2unix "$i" line when you are ready to convert them.
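Note that the for i in $(find ...) pattern splits on whitespace, so it breaks on file names containing spaces. A sketch of a more defensive variant:
# -exec hands each path to sh intact, so spaces in names are safe
find . -type f -exec sh -c 'file "$1" | grep -q CRLF && echo "$1"' _ {} \;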
You can find out using file:
file /mnt/c/BOOT.INI
/mnt/c/BOOT.INI: ASCII text, with CRLF line terminators
CRLF is the significant value here.
If you expect sed's exit code to tell you whether it matched, it won't: sed performs the substitution (or not) depending on the match, but its exit code is true unless there's an error.
You can get a usable exit code from grep, however.
#!/bin/bash
for f in *
do
if head -n 10 "$f" | grep -qs $'\r'
then
dos2unix "$f"
fi
done
grep recursive, with file pattern filter
grep -Pnr --include=*file.sh '\r$' .
This outputs the file name, the line number, and the line itself:
./test/file.sh:2:here is windows line break
You can use dos2unix's -i option to get information about DOS, Unix, and Mac line breaks (in that order), BOMs, and text/binary type, without converting the file.
$ dos2unix -i *.txt
6 0 0 no_bom text dos.txt
0 6 0 no_bom text unix.txt
0 0 6 no_bom text mac.txt
6 6 6 no_bom text mixed.txt
50 0 0 UTF-16LE text utf16le.txt
0 50 0 no_bom text utf8unix.txt
50 0 0 UTF-8 text utf8dos.txt
With the "c" flag dos2unix will report files that would be converted, iow files have have DOS line breaks. To report all txt files with DOS line breaks you could do this:
$ dos2unix -ic *.txt
dos.txt
mixed.txt
utf16le.txt
utf8dos.txt
To convert only these files you simply do:
dos2unix -ic *.txt | xargs dos2unix
If you need to recurse over directories:
find -name '*.txt' | xargs dos2unix -ic | xargs dos2unix
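Like any whitespace-splitting xargs pipeline, the one above breaks on file names with spaces. A sketch of a more defensive variant (assuming GNU find and xargs; dos2unix -ic prints one file name per line):
find . -name '*.txt' -print0 | xargs -0 dos2unix -ic | xargs -d '\n' -r dos2unix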
See also the man page of dos2unix.
As stated above, the file solution works. Maybe the following code snippet will help.
#!/bin/ksh
EOL_UNKNOWN="Unknown" # Unknown EOL
EOL_MAC="Mac" # File EOL Classic Apple Mac (CR)
EOL_UNIX="Unix" # File EOL UNIX (LF)
EOL_WINDOWS="Windows" # File EOL Windows (CRLF)
SVN_PROPFILE="name-of-file" # Filename to check.
...
# Finds the EOL used in the requested File
# $1 Name of the file (requested filename)
# $r EOL_FILE set to enumerated EOL-values.
getEolFile() {
    EOL_FILE=$EOL_UNKNOWN
    # Check for Windows EOL
    EOL_CHECK=`file "$1" | grep "ASCII text, with CRLF line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_WINDOWS
        return
    fi
    # Check for Classic Mac EOL
    EOL_CHECK=`file "$1" | grep "ASCII text, with CR line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_MAC
        return
    fi
    # Check for Unix EOL (must come last: the plain "ASCII text" pattern
    # would also match the more specific descriptions above)
    EOL_CHECK=`file "$1" | grep "ASCII text"`
    if [[ -n $EOL_CHECK ]] ; then
        EOL_FILE=$EOL_UNIX
        return
    fi
    return
} # getEolFile
...
# Using this snippet
getEolFile $SVN_PROPFILE
echo "Found EOL: $EOL_FILE"
exit 1
Thanks for the tip to use the file(1) command, however it does need a bit more refinement. I had the situation where not only plain text files but also some ".sh" scripts had the wrong EOL, and file reports them as follows regardless of the EOL:
xxx/y/z.sh: application/x-shellscript
So the "file -e soft" option was needed (at least for Linux):
bash$ find xxx -exec file -e soft {} \; | grep CRLF
This finds all the files with DOS EOLs in directory xxx and its subdirectories.

Counting number of characters in a file through shell script

I want to count the number of characters in a file, from the start to the EOF character. Can anyone tell me how to do this with a shell script?
This will do it for counting bytes in file:
wc -c filename
If you want only the count without the filename being repeated in the output:
wc -c < filename
This will count characters in multibyte files (Unicode etc.):
wc -m filename
(as shown in Sébastien's answer).
#!/bin/sh
wc -m "$1" | awk '{print $1}'
wc -m counts the number of characters; the awk command prints the number of characters only, omitting the filename.
wc -c would give you the number of bytes (which can differ from the number of characters, since depending on the encoding a character may be encoded on several bytes).
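A quick illustration of the difference (assuming a UTF-8 locale, where ä is encoded as two bytes):
printf 'ä' | wc -c   # bytes: prints 2
printf 'ä' | wc -m   # characters: prints 1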
To get the exact character count of a string, use printf instead of echo, cat, or running wc -c directly on a file, because echo, cat, etc. will also count the trailing newline character. A file containing the text 'hello' will report 6 with echo, but printf returns the exact 5, because there's no newline to count.
How to use printf for counting characters within strings:
$ printf '6chars' | wc -m
6
To turn this into a script you can run on a text file to count characters, save the following in a file called print-character-amount.sh:
#!/bin/bash
characters=$(cat "$1")
printf "$characters" | wc -m
Run chmod +x on print-character-amount.sh containing the above text, place the file in your PATH (i.e. /usr/bin/ or any directory exported as PATH in your .bashrc file), then run the script on a text file:
print-character-amount.sh file-to-count-characters-of.txt
awk '{t+=length($0)}END{print t}' file3
awk only
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)c++}END{print "total chars:"c}' file
shell only
var=$(<file)
echo ${#var}
Ruby(1.9+)
ruby -0777 -ne 'print $_.size' file
The following script is tested and gives exactly the expected results:
#!/bin/bash
echo "Enter the file name"
read file
echo "enter the word to be found"
read word
count=0
for i in `cat $file`
do
    if [ $i == $word ]
    then
        count=`expr $count + 1`
    fi
done
echo "The number of words are $count"
I would have thought that it would be better to use stat to find the size of a file, since the filesystem knows it already, rather than causing the whole file to have to be read with awk or wc - especially if it is a multi-GB file or one that may be non-resident in the file-system on an HSM.
stat -c%s file
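Note that -c%s is GNU stat syntax; on BSD and macOS the equivalent uses the -f format flag:
stat -f%z file   # BSD/macOS: file size in bytes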
Yes, I concede it doesn't account for multi-byte characters, but would add that the OP has never clarified whether that is/was an issue.
Credits to user.py et al.
echo "ää" > /tmp/your_file.txt
cat /tmp/your_file.txt | wc -m
results in 3.
In my example the result is expected to be 2 (twice the letter ä). However, echo (or vi) adds a line break \n to the end of the output (or file). So two ä characters and one Unix line break \n are counted, which makes three.
Working with pipes (|) is not the shortest variant, but this way I have to memorize fewer wc parameters. In addition, cat is bullet-proof in my experience.
Tested on Ubuntu 18.04.1 LTS (Bionic Beaver).
