How do I implement "file -s <file>" on Linux in pure Go? - linux

Intent:
Does Go have the functionality (a package or otherwise) to perform a special file stat on Linux, akin to the command file -s <path>?
Example:
[root@localhost ~]# file /proc/uptime
/proc/uptime: empty
[root@localhost ~]# file -s /proc/uptime
/proc/uptime: ASCII text
Use Case:
I have a fileglob of files in /proc/* that I need to very quickly detect if they are truly empty instead of appearing to be empty.
Using The os Package:
Code:
result, _ := os.Stat("/proc/uptime")
fmt.Println("Name:", result.Name(), " Size:", result.Size(), " Mode:", int(result.Mode()))
fmt.Printf("%q", result)
Result:
Name: uptime Size: 0 Mode: 292
&{"uptime" '\x00' 'Ĥ' {%!q(int64=63606896088) %!q(int32=413685520) %!q(*time.Location=&{ [] [] 0 0 <nil>})} {'\x03' %!q(uint64=4026532071) '\x01' '脤' '\x00' '\x00' '\x00' '\x00' '\x00' 'Ѐ' '\x00' {%!q(int64=1471299288) %!q(int64=413685520)} {%!q(int64=1471299288) %!q(int64=413685520)} {%!q(int64=1471299288) %!q(int64=413685520)} ['\x00' '\x00' '\x00']}}
Obvious Workaround:
There is the obvious workaround below, but it's a little over the top to need to call out to a bash shell just to get file stats.
output, _ := exec.Command("bash", "-c", "file -s /proc/uptime").Output()
//parse output etc...
EDIT/MY PRACTICAL USE CASE:
Quickly determining which files are zero size without needing to read each one of them first.
file -s /cgroup/memory/lsf/<cluster>/*/tasks | <clean up commands> | uniq -c
6 /cgroup/memory/lsf/<cluster>/<jobid>/tasks: ASCII text
805 /cgroup/memory/lsf/<cluster>/<jobid>/tasks: empty
So in this case, I know that only those 6 jobs are running and the rest (805) have terminated. Reading the file works like this:
# cat /cgroup/memory/lsf/<cluster>/<jobid>/tasks
#
or
# cat /cgroup/memory/lsf/<cluster>/<jobid>/tasks
12352
53455
...

I'm afraid you might be confusing matters here: file is special precisely in that it "knows" a set of heuristics to carry out its task.
To my knowledge, Go does not have anything like this in its standard library, and I've not come across a 3rd-party package implementing file-like functionality (though I invite you to search by relevant keywords on http://godoc.org).
On the other hand, Go provides full access to the syscall interface of the underlying OS, so when it comes to querying the OS the way file does, there's nothing you could not do in plain Go.
So I suggest you just fetch the source code of file, learn what it does in the mode turned on by the "-s" command-line option, and implement that in your Go code.
We'll try to help you with specific problems doing that, should you have any.
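To give a feel for what that involves, here is a drastically simplified sketch, nowhere near what the real file command does (it consults a large magic-number database and many heuristics). It ignores the size reported by stat(2), reads up to 512 bytes, and applies a trivial ASCII test; the function name classify and the probe size are arbitrary choices, not anything from file itself.

package main

import (
	"fmt"
	"io"
	"os"
)

// classify mimics a tiny part of what `file -s` does: it ignores the
// size reported by stat(2), reads the start of the file and labels the
// contents. Only "empty", "ASCII text" and "data" are distinguished.
func classify(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	buf := make([]byte, 512)
	n, err := f.Read(buf)
	if err != nil && err != io.EOF {
		return "", err
	}
	if n == 0 {
		return "empty", nil
	}
	for _, b := range buf[:n] {
		if (b < 0x20 || b > 0x7e) && b != '\n' && b != '\r' && b != '\t' {
			return "data", nil
		}
	}
	return "ASCII text", nil
}

func main() {
	kind, err := classify("/proc/uptime")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("/proc/uptime:", kind)
}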
Update
Looks like I've managed to grasp what the OP is struggling with. A simple check:
$ stat -c %s /proc/$$/status && wc -c < $_
0
849
That is, the stat call on a file under /proc shows it has no contents, but actually reading from that file does return content.
OK, so the solution is simple: instead of calling os.Stat() while traversing the subtree of the filesystem, merely attempt to read a single byte from each file, like in:
var buf [1]byte
f, err := os.Open(fname)
if err != nil {
	// Do something, or maybe ignore the error.
	// A non-existent file is OK to ignore
	// (the POSIX error code will be ENOENT)
	// because after `path/filepath.Walk()` fetched an entry for
	// this file from its directory, the file might well have gone.
	return
}
defer f.Close()
_, err = f.Read(buf[:])
if err == io.EOF {
	// OK, we failed to read even 1 byte, so the file is empty.
} else if err != nil {
	// Otherwise, deal with the read error.
}
You might try to be more clever and first obtain the stat information (using a call to os.Stat()) to check that the file is a regular file, so as not to attempt reading from sockets etc., as in the sketch below.
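A minimal sketch combining that pre-check with the one-byte read above; it assumes the "io" and "os" imports, and the name isEmptyRegular is just for illustration:

func isEmptyRegular(fname string) (bool, error) {
	fi, err := os.Stat(fname)
	if err != nil {
		return false, err
	}
	if !fi.Mode().IsRegular() {
		return false, nil // sockets, FIFOs, devices: do not read them
	}
	f, err := os.Open(fname)
	if err != nil {
		return false, err
	}
	defer f.Close()
	var buf [1]byte
	_, err = f.Read(buf[:])
	if err == io.EOF {
		return true, nil // not a single byte could be read: the file is empty
	}
	return false, err
}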

I have a fileglob of files in /proc/* that I need to very quickly
detect if they are truly empty instead of appearing to be empty.
They are truly empty in some sense (e.g. they occupy no space on the file system). If you want to check whether any data can be read from them, try reading from them; that's what file -s does:
-s, --special-files
Normally, file only attempts to read and
determine the type of argument files which stat(2) reports are
ordinary files. This prevents problems, because reading special files
may have peculiar consequences. Specifying the -s option causes file
to also read argument files which are block or character special
files. This is useful for determining the filesystem types of the
data in raw disk partitions, which are block special files. This
option also causes file to disregard the file size as reported by
stat(2) since on some systems it reports a zero size for raw disk
partitions.

Related

Is it possible to display a file's contents and delete that file in the same command?

I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
For cases where it is possible to control the name of the temporary text file, and the file is not used by other code, it is possible to pass "/dev/stdout" as the name of the output file.
Regarding portability: see the Stack Exchange question "how portable ... /dev/stdout".
POSIX 7 says they are extensions.
Base Definitions,
Section 2.1.1 Requirements:
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example,  /dev/stdin, /dev/stdout,  and  /dev/stderr)
Using the mandatorily supported /dev/tty would force output to the "current" terminal, making it impossible to pipe the output of the whole command into a different program (or a log file), or to use the program when there is no connected terminal (cron jobs, or other automation tools).
No, you cannot easily remove the lines of a file while displaying them. It would be highly inefficient, as it would require removing characters from the beginning of the file each time you read a line. Current filesystems are pretty good at truncating data at the end of a file, but not at the beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
    head -1 output.json
    sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each time you remove the first line with sed -i 1d it will copy the whole content of the file but the first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by doing something like the following:
while [ -s output.json ]
do
    line=$(head -1 output.json)
    printf -- '%s\n' "$line"
    fallocate -c -o 0 -l $((${#line}+1)) output.json
done
But this does not account for variable newline characters (namely DOS-formatted newlines) and fallocate does not always work on xfs, among other issues.
Since you are trying to consume a file alongside its creation without leaving a trace of its existence on disk, you are essentially asking for a pipe functionality. In my opinion you should look into how your output.json file is produced and hopefully you can pipe it to a script of your own.
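As an aside, if you are free to write a tiny consumer yourself, POSIX semantics let you display and delete in one step: open the file, unlink it immediately, and stream the still-open descriptor, so nothing remains on disk once the program exits. A minimal Go sketch of that idea (output.json is just the example name from the question):

package main

import (
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("output.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Remove the directory entry immediately; the open descriptor
	// keeps the contents readable until the file is closed.
	if err := os.Remove("output.json"); err != nil {
		log.Fatal(err)
	}

	// Stream the (already unlinked) file to standard output.
	if _, err := io.Copy(os.Stdout, f); err != nil {
		log.Fatal(err)
	}
}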

Is it possible to partially unzip a .vcf file?

I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?
I am trying to look at a specific SNP at a given position in a subset of the samples. I have tried using bcftools to no avail (if anyone can identify what went wrong with that I would also really appreciate it; I created an empty file for the output, 722g.990.SNP.INDEL.chrAll.vcf.bgz, but it returns the following error):
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz
The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised
I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?
Double check your command line for bcftools view.
The error message 'The output type "something" is not recognized' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option like this -O something. Based on the error message you are getting it seems that you might have put the file name there.
Check that you don't have your input and output file names the wrong way around in your command. Note that the -o (lower-case o) command line option specifies the output file name, and the file name at the end of the command line is the input file name.
Also, you write that you created an empty file for the output. You don't need to do that, bcftools will create the output file.
I don't have that much experience with bcftools, but generically, if you want to use awk to manipulate a gzipped file, you can pipe to it so that you only decompress the file as needed. You can also pipe the result directly through gzip so it too is compressed, e.g.
gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz
Also, zcat is an alias for gzip -cd; -c writes the result to standard output, -d decompresses.
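For completeness, the same stream-as-you-go idea can be written in Go with compress/gzip, so the data is never decompressed onto disk. This is only a sketch: the file names are the placeholders from the command above, and the line filter (keep header lines plus lines mentioning chr9) merely stands in for whatever selection you actually need.

package main

import (
	"bufio"
	"compress/gzip"
	"log"
	"os"
	"strings"
)

func main() {
	in, err := os.Open("largeFile.vcf.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	zr, err := gzip.NewReader(in)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	out, err := os.Create("newfile.vcf.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	zw := gzip.NewWriter(out)
	defer zw.Close()
	w := bufio.NewWriter(zw)
	defer w.Flush()

	sc := bufio.NewScanner(zr)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // VCF lines can be very long
	for sc.Scan() {
		line := sc.Text()
		// Keep header lines and (as a stand-in) lines mentioning chr9.
		if strings.HasPrefix(line, "#") || strings.Contains(line, "chr9") {
			w.WriteString(line)
			w.WriteByte('\n')
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}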
As a side note, if you are trying to perform operations on just a part of a large file, you may also find the excellent tool less useful: it can be used to view your large file while loading only the needed parts. The -S option is particularly useful for wide formats with many columns, as it stops line wrapping, and -N shows line numbers.
less -S largefile.vcf.gz
Quit the view with q; g takes you to the top of the file.

Why are null characters randomly written into some of my output files?

I have some scripts on my RedHat server which run Microfocus COBOL programs that generate a huge file of approximately 3 GB in about 3 hours on average. The programs write their output files directly in the directory /my_test/files/.
The problem is that sometimes (randomly) some generated files contain sections of null characters in the middle of the file. When I check them and re-execute the script (with the same input parameters), the output file is generated perfectly (it doesn't contain any null chars). I've checked this many times and I'm pretty sure it is not the fault of the COBOL programs (they use quite simple operations). The space in use of that folder is 40%.
Some programs update the database, and if they finish with return code 0 the changes are committed; I don't have any backup, so that is the point of what I'm doing here.
This is an example of a file declaration of one of the problematic COBOL programs:
FILE-CONTROL.
SELECT MYFILE
ASSIGN TO MYFILE
ORGANIZATION IS SEQUENTIAL
ACCESS MODE IS SEQUENTIAL
FILE STATUS IS FILE-STATUS.
DATA DIVISION.
FILE SECTION.
FD MYFILE
LABEL RECORD STANDARD
RECORDING MODE F.
01 REG-OUTPUT PIC X(400).
I've also checked for the nulls in the COBOL programs before the null files appear, but unfortunately no nulls were spotted.
Then I thought about creating a crontab entry which executes the following script every 5 seconds:
if [[ -f /tmp/sorry_im_working ]]; then
    exit
fi
trap 'rm -rf /tmp/sorry_im_working' EXIT
touch /tmp/sorry_im_working
lsof | awk 'BEGIN{
    sfiles="";
} {
    if($1=="PROGRAM" && $9~/my_test\/files/){
        sfiles=sfiles" "$9
    }
} END{
    comm="find "sfiles" -newermt \x27-2 seconds\x27 -exec env LC_ALL=C bash -c \x27grep -Pq \x22\x5Cx00{200}\x22 <(tail -c 1000 {}) && echo {}\x27 \x5C\x3B";
    while((comm | getline sout) > 0){
        print sout;
    };
    close(comm);
}' >> /home/ouhma/nullfiles.txt
Therefore, I would like to ask you the following questions:
Any idea of what's going on here?
Do you have any other way to track the latest modified files?
What other information of interest could I add to my log?
If you construct a file d containing only the literal string \x00:
hexdump -C d
00000000 5c 78 30 30 0a |\x00.|
00000005
and you run:
grep -Faq '\x00' d;echo $?
0
But there is no null character inside d.
It is probably better to use grep -Paq '\x00'.
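If a small helper program is acceptable, here is a rough Go equivalent of that check on the last 1000 bytes of a file, mirroring the grep -Pq "\x00{200}" on tail -c 1000 used in the monitoring script above. It assumes the "bytes", "io" and "os" imports; the function name is illustrative.

func tailHasNullRun(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return false, err
	}
	off := fi.Size() - 1000
	if off < 0 {
		off = 0
	}
	buf := make([]byte, 1000)
	n, err := f.ReadAt(buf, off)
	if err != nil && err != io.EOF {
		return false, err
	}
	// Look for a run of 200 NUL bytes in the tail of the file.
	return bytes.Contains(buf[:n], make([]byte, 200)), nil
}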
Depending on the configuration and the record structure used for the file, MF will pad records with hex nulls.
Please copy the 'ASSIGN' clause and the 'FD' clause of the COBOL program.
BTW: if your COBOL programs take three hours to do some calculations and write 3 GB of data back, you should investigate the storage and/or get a COBOL programmer to check the programs; that sounds much too slow.
I suspect you have non-printable characters in your file. The null inserts can be controlled; take a look at the INSERTNULL file configuration.

How to use sed command to delete lines without backup file?

I have a large file with a size of 130 GB.
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
So I need to reduce the file size by deleting lines which start with "Nov 2", so I gave the following command:
sed -i '/Nov 2/d' syslog.log
The file is so big that I can't even edit it using the vim editor.
When I run the sed command, it creates a backup file as well, but I don't have much space left in root. Please give an alternate solution to delete particular lines from this file without using extra space on the server.
It does not create a real backup file. sed is a stream editor. When applied to a file with option -i, it will stream that file through the sed process, write the output to a new (temporary) file, and when everything is done, rename the new file to the original name.
(There are options to create backup files also, but you didn't give them, so I won't mention that further.)
In your case you have a very large file and don't want to create any copy, however temporary. For this you need to open the file for reading and writing at the same time, then your sed process can overwrite the original. After this, you will have to truncate the file at the end of the writing.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, lets say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This will open the file for reading and writing as STDOUT (1), and for reading as STDIN. The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This will effectively overwrite the beginning of the original file.
You will not know how long the output is, so after the output the original contents of the file will appear:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will perform the step above and additionally print out the number of bytes written to the terminal; in this example case the output will be 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i. e. without interaction, you can use this:
length=$( (grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c )
truncate -s "$length" x
I cannot guarantee that this will work for files >2GB or >4GB on your machine; depending on your operating system (32bit?) and the versions of the installed tools you might run into largefile issues. I'd perform tests with large files first (>4GB as this is typically a limit for many things) and then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort during the running of the process (power failure, signal caught, etc.) will leave the file in an undefined state. But re-running the command again after such a mishap will in most cases produce the correct output; some lines might be doubled, but not more than a single line should be corrupted then.
The output must be smaller than the input, of course, otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there will be missing (or truncated at the start).
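For the record, the same read-and-overwrite-then-truncate technique can be written as a small Go program, which sidesteps the shell quoting subtleties. This is a sketch under the same assumptions and caveats as above; the file name and the "Nov 2" prefix are taken from the question.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
)

// filterInPlace drops every line starting with `prefix` from the file
// at path, writes the surviving lines back over the same file and
// truncates it afterwards, so no second copy of the data is created.
// The same caveats apply: nothing may append to the file while this
// runs, and an interruption leaves the file in an undefined state.
func filterInPlace(path string, prefix []byte) error {
	in, err := os.Open(path) // read descriptor, like <x
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.OpenFile(path, os.O_WRONLY, 0) // write descriptor, like 1<>x
	if err != nil {
		return err
	}
	defer out.Close()

	w := bufio.NewWriter(out)
	sc := bufio.NewScanner(in)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // allow long log lines
	var written int64
	for sc.Scan() {
		line := sc.Bytes()
		if bytes.HasPrefix(line, prefix) {
			continue // this is a line we delete
		}
		if _, err := w.Write(line); err != nil {
			return err
		}
		if err := w.WriteByte('\n'); err != nil {
			return err
		}
		written += int64(len(line)) + 1
	}
	if err := sc.Err(); err != nil {
		return err
	}
	if err := w.Flush(); err != nil {
		return err
	}
	// The equivalent of the final `truncate -s "$length"` step.
	return out.Truncate(written)
}

func main() {
	if err := filterInPlace("syslog.log", []byte("Nov 2")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}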

How to tell binary from text files in linux

The linux file command does a very good job in recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files, producing a different output.
Is there a way to tell binary files from text files? All I want is a yes/no answer whether a given file is binary. Because it's difficult to define binary, let's say I want to know if diff will attempt a text-based comparison.
To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.
file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
The diff manual specifies that
diff determines whether a file is text
or binary by checking the first few
bytes in the file; the exact number of
bytes is system dependent, but it is
typically several thousand. If every
byte in that part of the file is
non-null, diff considers the file to
be text; otherwise it considers the
file to be binary.
A quick-and-dirty way is to look for a NUL character (a zero byte) in the first K or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
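A minimal sketch of that quick-and-dirty check in Go, assuming the "bytes", "io" and "os" imports; the function name and the 2 KiB probe size are arbitrary choices. Note that under this check an empty file contains no NUL and is therefore reported as text, unlike the grep-based test further down.

func looksLikeText(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	// Read the first couple of KiB and look for a NUL byte.
	buf := make([]byte, 2048)
	n, err := f.Read(buf)
	if err != nil && err != io.EOF {
		return false, err
	}
	return bytes.IndexByte(buf[:n], 0) < 0, nil
}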
This approach defers to the grep command in determining whether a file is binary or text:
is_text_file() { grep -qIF '' "$1"; }
grep options used:
-q Quiet; Exit immediately with zero status if any match is found
-I Process a binary file as if it did not contain matching data
-F Interpret PATTERNS as fixed strings, not regular expressions.
grep pattern used:
'' Empty string. All files (except an empty file)
will match this pattern.
Notes
An empty file is not considered a text file according to this test. (The GNU file command agrees with this assessment.)
A file with one printable character, say a, is considered a text file according to this test. (Makes sense to me.) (The file command disagrees with this assessment. (Tested with GNU file))
This approach requires only one child process to test whether a file is text or binary.
Test
# cd into a temp directory
cd "$(mktemp -d)"
# Create 3 corner-case test files
touch empty_file # An empty file
echo -n a >one_byte_a # A file containing just `a`
echo a >one_line_a # A file containing just `a` and a newline
# Another test case: a 96KiB text file that ends with a NUL
head -c 98303 /usr/share/dict/words > file_with_a_null_96KiB
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB
# Last test case: a 96KiB text file plus a NUL added at the end
head -c 98304 /usr/share/dict/words > file_with_a_null_96KiB_plus1
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB_plus1
# Defer to grep to determine if a file is a text file
is_text_file() { grep -qIF '' "$1"; }
# Test harness
do_test() {
printf '%22s ... ' "$1"
if is_text_file "$1"; then
echo "is a text file"
else
echo "is a binary file"
fi
}
# Test each of our test cases
do_test empty_file
do_test one_byte_a
do_test one_line_a
do_test file_with_a_null_96KiB
do_test file_with_a_null_96KiB_plus1
Output
empty_file ... is a binary file
one_byte_a ... is a text file
one_line_a ... is a text file
file_with_a_null_96KiB ... is a binary file
file_with_a_null_96KiB_plus1 ... is a text file
On my machine, it seems grep checks the first 96 KiB of a file for a NUL. (Tested with GNU grep). The exact crossover point depends on your machine's page size.
Relevant source code: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c?h=v3.6#n1550
You could try running
strings yourfile
and comparing the size of the result with the file size. I'm not totally sure, but if they are the same, the file is really a text file.
These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.
A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The type column will show you whether it's text or binary.
Commands like less and grep detect it quite easily (and quickly). You can have a look at their source.

Resources