PHP Convert a big binary file to a readable file - file-get-contents

I am having problems with this simple script...
<?php
$file = "C:\\Users\\Alejandro\\Desktop\\integers\\integers";
$file = file_get_contents($file, NULL, NULL, 0, 34359738352);
echo $file;
?>
It gives me always this error:
file_get_contents(): length must be greater than or equal to zero
I am really tired, can someone help me?
EDIT:
After dividing the file in 32 parts...
The error is
PHP Fatal error: Allowed memory size of 1585446912 bytes exhausted (tried to allocate 2147483648 bytes)
EDIT (2):
Now the goal is to convert this binary file into a list of numbers from 0 to (2^31)-1. That is why I need to read the file, so I can convert the binary data to decimal numbers.

Looks like your file path is missing a backslash, which could be causing file_get_contents() to fail to load the file.
Try this instead:
$file = "C:\\\\Users\\Alejandro\\Desktop\\integers\\integers";
EDIT: Now that you can read your file, we're finding that because your file is so large, it's exceeding your memory limit and causing your script to crash.
Try buffering one piece at a time instead, so as not to exceed memory limits. Here's a script that will read binary data, one 4-byte number at a time, which you can then process to suit your needs.
By buffering only one piece at a time, you allow PHP to use only 4 bytes at a time to store the file's contents (while it processes each piece), instead of storing the entire contents in a huge buffer and causing an overflow error.
Here you go:
$file = "C:\\\\Users\\Alejandro\\Desktop\\integers\\integers";
// Open the file to read binary data
$handle = #fopen($file, "rb");
if ($handle) {
while (!feof($handle)) {
// Read a 4-byte number
$bin = fgets($handle, 4);
// Process it
processNumber($bin);
}
fclose($handle);
}
function processNumber($bin) {
// Print the decimal equivalent of the number
print(bindec($bin));
}
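If you want to sanity-check the values with another tool, here is a rough Python sketch of the same loop (the path is the one from the question; the byte order is an assumption, so swap "<I" for ">I" if the numbers come out wrong):
import struct

# Read the file as a stream of 4-byte unsigned integers.
with open(r"C:\Users\Alejandro\Desktop\integers\integers", "rb") as f:
    while True:
        chunk = f.read(4)
        if len(chunk) < 4:
            break               # stop at EOF (ignore a trailing partial record)
        (value,) = struct.unpack("<I", chunk)
        print(value)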

The last argument is an int, and apparently in your implementation of PHP int only has 31 bits for magnitude. The number you provide is larger than that, so this call would never work on that system.
There are also other peculiarities:
why do you pass NULL for the second argument instead of false?
you are saying that you want to read up to more than 2GB from that file -- is the system and PHP configuration up to doing that?


Extending ext4 File system's filename size limit to 1012 characters

I am pulling data from a server, and one of the folder names is longer than 256 bytes, so my CentOS is throwing an error that the name is too long. I have searched all over the internet but couldn't find any way to create a folder/file name longer than 256 bytes under ext2/ext3/ext4 file systems.
However, one suggested solution was to create a reiserfs file system alongside ext4 to handle files/folders with longer names. That might work, but I just read in a book that it is possible to extend the filename size limit from 255 characters to 1012 characters if needed.
The maximal file name size is 255 characters. This limit could be extended to `1012` if needed.
Source: Google
But I couldn't find any website that explains how the filesystem could be modified to extend the limit to 1012.
Can someone please help me with that?
I don't know where the 1012 figure comes from (it is mentioned in http://e2fsprogs.sourceforge.net/ext2intro.html - "Design and Implementation of the Second Extended Filesystem", ISBN 90-367-0385-9, 1995), but in the modern Linux kernel the file name length is fixed in struct ext2_dir_entry_2, with a maximum of 255 chars (bytes):
https://elixir.bootlin.com/linux/v4.10/source/fs/ext2/ext2.h#L600
/*
 * The new version of the directory entry. Since EXT2 structures are
 * stored in intel byte order, and the name_len field could never be
 * bigger than 255 chars, it's safe to reclaim the extra byte for the
 * file_type field.
 */
struct ext2_dir_entry_2 {
	__le32	inode;		/* Inode number */
	__le16	rec_len;	/* Directory entry length */
	__u8	name_len;	/* Name length */
	__u8	file_type;
	char	name[];		/* File name, up to EXT2_NAME_LEN */
};
There was an older struct ext2_dir_entry with a wider name length field, but the extra byte of name_len was redefined as file_type:
	__le16	name_len;	/* Name length */
So the current maximum file name length for ext2 is 255:
https://elixir.bootlin.com/linux/v4.10/source/include/linux/ext2_fs.h#L22
#define EXT2_NAME_LEN 255
https://elixir.bootlin.com/linux/v4.10/source/fs/ext2/namei.c#L62
	if (dentry->d_name.len > EXT2_NAME_LEN)
		return ERR_PTR(-ENAMETOOLONG);
Same for ext3/ext4:
https://elixir.bootlin.com/linux/v4.10/source/fs/ext4/ext4.h#L1892
/*
* Structure of a directory entry
*/
#define EXT4_NAME_LEN 255
https://elixir.bootlin.com/linux/v4.10/source/fs/ext4/namei.c
* `len <= EXT4_NAME_LEN' is guaranteed by caller.
	if (namelen > EXT4_NAME_LEN)
		return NULL;
	if (dentry->d_name.len > EXT4_NAME_LEN)
		return ERR_PTR(-ENAMETOOLONG);
The on-disk format is described with an 8-bit name_len too (file_type uses only 3 bits in older docs - EXT2_FT_MAX - but the modern driver will not handle file names longer than 255; ext4 has an extra file type of 0xde):
http://www.nongnu.org/ext2-doc/ext2.html#IFDIR-NAME-LEN "4.1.3. name_len - 8bit unsigned value indicating how many bytes of character data are contained in the name."
http://cs.smith.edu/~nhowe/262/oldlabs/ext2.html#direntry "The file_type field indicates what kind of file the entry is referring to... The maximum length of a file name is EXT2_NAME_LEN, which is usually 255."
https://oss.oracle.com/projects/ocfs2/dist/documentation/disklayout.pdf#page=16 "__u8 name_len"
See https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits; not many filesystems handle filenames of more than 255 bytes.
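You can see the limit being enforced from user space with a few lines of Python (errno 36 is ENAMETOOLONG on Linux):
import os

name = "x" * 300          # longer than the 255-byte limit
try:
    os.mkdir(name)
except OSError as e:
    print(e)              # [Errno 36] File name too long: 'xxx...'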
While it may not be a direct reply to your question, I do not think you should try to go this route (changing the maximum length).
From what server do you retrieve your files? Why not change their names during retrieval?
Are you sure the whole name is relevant? If you keep only the first or last 250 characters, isn't that enough to reference each file without ambiguity?
You have many options, depending on your constraints:
either generate "random" names (or sequential ones), and just store in a text file the mapping between old true name and the new fake name
or split the initial name in 250 characters path elements (or something like that) and create intermediate directories with them
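A minimal sketch of the first option (Python; the names and mapping file are hypothetical): hash each long name into a short, stable stand-in and keep a lookup table so you can always recover the original.
import hashlib
import json

def shorten(long_name, mapping):
    # 40 hex characters, comfortably under the 255-byte limit and stable
    # across runs, so the same remote name always maps to the same folder.
    short = hashlib.sha1(long_name.encode("utf-8")).hexdigest()
    mapping[short] = long_name
    return short

mapping = {}
local_dir = shorten("some-very-long-remote-folder-name" * 20, mapping)

# Persist the lookup table next to the downloaded data.
with open("name-mapping.json", "w") as f:
    json.dump(mapping, f, indent=2)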
You can find similar discussion at https://serverfault.com/questions/264339/how-to-increase-filename-size-limit-on-ext3-on-ubuntu

How to check the available free space in Haxe?

I can't find a way to check the free space available on a device using Haxe, OpenFL, Lime or another library.
I would like to avoid downloading data that would exceed the size recommended for an app on each device.
What do you do to check that?
Try creating a file of that size! Then either delete it or reopen and write (not append) over its contents.
I don't know whether all platforms Haxe supports will work fine with this trick, but this algorithm is reported to work in many places and languages (I personally tested it in Ruby and saw the same suggestion for C++/.NET). To check whether X bytes of disk space are available:
open a new file for writing
seek X-1 bytes from the beginning
write a byte of data (whatever you want, 0, 42...)
close the file (probably unrelated to the task at hand, but don't forget to do that anyway)
If there's insufficient disk space, you'll likely get an exception at some point in this algorithm. You'll have to find out what errors to expect and process them properly.
Using ihx I've found that this works and requires nothing but the Haxe standard library:
haxe interactive shell v0.3.4
type "help" for help
>> import sys.io.*;
>> var f = File.write('loca', true)
sys.io.FileOutput : { __f => #abstract }
>> f.seek(39999, FileSeek.SeekBegin)
Void : null
>> f.writeByte(0)
Void : null
>> f.close()
Void : null
After these manipulations, I had a file named loca of exactly 40000 bytes in my working directory.
By the way, be careful when doing things like these in ihx since it re-runs the entire session with the last entered line appended each time.
Ongoing experimentation:
However, when there's insufficient disk space, it may not fail with errors. In this case you'll have to check the real size with sys.FileSystem.stat(path).size. And don't forget to delete the file if there's not enough space.
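For reference, the same trick sketched in Python (a rough sketch only: on filesystems that create sparse files, the seek-and-write may not actually reserve space, which is exactly the caveat above):
import os

def has_free_space(probe_path, size_bytes):
    # Try to materialise a file of the requested size, then verify what landed on disk.
    try:
        with open(probe_path, "wb") as f:
            f.seek(size_bytes - 1)
            f.write(b"\0")
        return os.path.getsize(probe_path) >= size_bytes
    except OSError:
        return False
    finally:
        if os.path.exists(probe_path):
            os.remove(probe_path)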

J2ME - PNG created in PhotoShop not displaying in emulator

My midlet is showing some images fine, but not others.
They are all 8-bit PNGs, but the ones that aren't displaying are the ones I have created myself in PhotoShop.
So I am thinking maybe my PhotoShop (CS6) settings are wrong...
PNG-8, Selective, Diffusion, Colors: 256, Dither: 100%, Matte: None, Web
Snap: 0%, Convert to sRGB: ticked, Width: 48, Height: 48, Percent: 100%,
Quality: Bicubic.
I've experimented with a few of these settings, but to no avail.
Any ideas?
There is a similar problem here, but it is the opposite of mine in that Photoshop mends things in that case rather than breaking them...
My code is...
image = Image.createImage("/img/loading1.png");
...and here is my stack trace:
java.io.EOFException
at javax.imageio.stream.ImageInputStreamImpl.readFully(ImageInputStreamImpl.java:353)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at javax.imageio.stream.ImageInputStreamImpl.readUTF(ImageInputStreamImpl.java:332)
at com.sun.kvem.png.PNGImageReader.parse_iTXt_chunk(PNGImageReader.java:447)
at com.sun.kvem.png.PNGImageReader.readMetadata(PNGImageReader.java:650)
at com.sun.kvem.png.PNGImageReader.readImage(PNGImageReader.java:1312)
at com.sun.kvem.png.PNGImageReader.read(PNGImageReader.java:1582)
at com.sun.kvem.midp.GraphicsBridge.loadImage(GraphicsBridge.java:2602)
at com.sun.kvem.midp.GraphicsBridge.createImageFromData(GraphicsBridge.java:2511)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.sun.kvem.sublime.MethodExecution.process(MethodExecution.java:42)
at com.sun.kvem.sublime.SublimeExecutor.processRequest(SublimeExecutor.java:63)
at javax.microedition.lcdui.Image.createImage(Image.java:315)
The image in question does exist - both in the project and in the jar that is built.
Here is the image in question:
According to the crash log, the PNG decoder in J2ME fails inside the non-critical chunk iTXt [1]:
> com.sun.kvem.png.PNGImageReader.readMetadata
> com.sun.kvem.png.PNGImageReader.parse_iTXt_chunk
> javax.imageio.stream.ImageInputStreamImpl.readUTF
> java.io.DataInputStream.readUTF
According to libpng documentation, the text part of an iTXt chunk must be valid UTF8:
... The remaining chunk data is the main UTF-8 text, either zlib-compressed or not, according to the compression flag. Since its length can be determined from the chunk length, it is not null-terminated. As with the other two text chunks, newlines should be represented by single line-feed characters (decimal 10), and all other control characters (1-9, 11-31, and 127-159) are discouraged.
and so normally this would indicate that the stream read is not valid UTF8 text - it contains 'raw' bytes higher than the plain ASCII range 0..127 that do not conform to UTF8 rules.
I found that not to be the case in the sample image. There is only one set of consecutive bytes that form a UTF8 code sequence, and it is a valid one:
<?xpacket begin="EFBBBF" id=" ..
(EFBBBF stands for 3 data bytes, shown in hexadecimal notation). I first suspected this was the error:
If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be only used as a BOM.
(http://en.wikipedia.org/wiki/Byte_order_mark)
.. and so a fully conforming UTF8 reader should inspect its bytes and throw an UTFDataFormatException when it encounters a BOM anywhere else than as the very first value. Surprisingly, this does not seem to be the problem! First of all, there is no indication any of the readUTF sources do anything else than only verify if the UTF8 code is valid on its own, irrespective of its value. There are lots of 'invalid' Unicode code points (values that do not represent a valid Unicode character or instruction), but it appears to me they are all silently ignored. But I noticed the common readUTF functions only implement a small subset of UTF8/Unicode (see, e.g., Modified UTF-8 in Oracle's documentation).
So the problem lies elsewhere. Another clue to this is that the error thrown is not UTFDataFormatException but rather EOFException, indicating the read buffer ran out of the number of bytes it was promised to contain.
(warning: pure conjecture follows)
Looking at a source of DataInputStream, I find this snippet of code:
public final static String readUTF(DataInput in) throws IOException {
    int utflen = in.readUnsignedShort();
followed by a loop to read utflen bytes (not "Unicode characters"). This is wrong for an iTXt chunk, as it does not have a 'first word' to indicate its length. The number of bytes in the plain text can be derived from the chunk length (which is, per PNG convention, the total data length excluding the length long word, the iTXt signature itself, and the final CRC32 code) minus the length of the zero-terminated keyword name, language, and "translated keyword" strings, and the two bytes which indicate compression of the full plain text.
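To see what the decoder is up against, here is a rough Python sketch (standard library only; the filename is the one from the question) that walks the PNG chunks and shows that the iTXt text length has to be derived from the chunk length, not from any embedded length prefix:
import struct

def list_chunks(path):
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n"    # PNG signature
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            data = f.read(length)
            f.read(4)                               # skip the CRC32
            print(ctype.decode("ascii"), length)
            if ctype == b"iTXt":
                # keyword, compression flag/method, language tag and translated
                # keyword come first; everything after them is the text, whose
                # size follows only from the chunk length.
                keyword, _rest = data.split(b"\x00", 1)
                print("  keyword:", keyword.decode("latin-1"))

list_chunks("loading1.png")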
As a work-around, remove the iTXt chunks from your PNG images. The data itself -- XMP Metadata -- is most likely not interesting at all for your purposes (but feel free to read what benefits Adobe thinks it has). And if your workflow does not use it, it's just a useless hunk of uncompressed text, taking up 814 bytes of the total of 981 bytes in your sample image -- a whopping 83%!
You can use an external utility to remove extraneous data chunks; the command line for the popular pngcrush, for example, is
pngcrush -rem alla -rem text InputFile.png OutputFile.png
(from en.wikipedia.org/wiki/Pngcrush).
Or directly from Photoshop: if you save a PNG 'the usual way' with the "Save As" menu option, the metadata goes in and there is no checkbox to get rid of it. If you use "Save for Web & Devices" instead, you get a large dialog with a lot of handy options, such as a drop down list labelled "Metadata".
Choosing "All" I got an even larger file; my version of Photoshop creates a massive 3K chunk of XMP Metadata, including a 2K totally empty 'filler' block...
Selecting "Copyright" or "None" finally got rid of all the crud (presumably because I did not fill in any copyright information), and then you get a nice 169 bytes long PNG, in which the only metadata is that software used is called "Adobe ImageReady".
[1] Which is kind of ironic. Per the PNG specification,
.. A decoder encountering an unknown chunk in which the ancillary bit is 1 can safely ignore the chunk and proceed to display the image.
(source)
This "ancillary bit" is the 5th bit of the first byte of the chunk ID: 0 (uppercase) = critical, 1 (lowercase) = ancillary, i.e., if the first character of the chunk ID is a capital, a PNG reader must read and interpret its data correctly, and if it's not, it can be skipped silently.
So technically, the writers of J2ME could safely have ignored this entire chunk. But they messed it up: they attempt to read it, and the code crashes in any program that merely tries to read the image data of a PNG which happens to contain an iTXt chunk.
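For reference, the check itself is a one-liner; in Python:
def is_ancillary(chunk_type: bytes) -> bool:
    # Bit 5 of the first byte: 0 (uppercase) = critical, 1 (lowercase) = ancillary.
    return bool(chunk_type[0] & 0x20)

print(is_ancillary(b"iTXt"))   # True  - safe to skip
print(is_ancillary(b"IHDR"))   # False - must be understood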

file command generating 'invalid argument'

I have a Perl script that traverses a set of directories, and when it hits one of them it blows up with "Invalid argument". I want to be able to skip that entry programmatically. I thought I could start by finding out the file type with the file command, but it too blows up:
$ file /sys/devices/virtual/net/br-ex/speed
/sys/devices/virtual/net/br-ex/speed: ERROR: cannot read `/sys/devices/virtual/net/br-ex/speed' (Invalid argument)
If I print out the mode of the file with the Perl or Python stat function, it tells me 33060, but I'm not sure what all the bits mean, and I'm hoping a particular one would tell me not to try to look inside. Any suggestions?
To understand the stat number you got, you need to convert it to octal (in Python, oct(...)).
Then you'll see that 33060 translates to 100444. You're interested only in the last three digits (444). The first digit is the file owner's permissions, the second is the group's and the third is everyone else's.
You can look at each of these digits (in your case all are 4) as 3 binary bits in this order: read-write-execute.
Since owner, group & others all have 4, which is 100 in binary, only the read bit is set for each of them - meaning that all three can only read the file.
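For example, a quick check in Python (the stat module does the decoding for you):
import stat

mode = 33060
print(oct(mode))             # '0o100444'
print(stat.filemode(mode))   # '-r--r--r--': a regular file, read-only for owner, group and others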
As far as file permissions go, you should have been successful reading /sys/devices/virtual/net/br-ex/speed.
There are two usual reasons for a read to fail:
- Either speed is a directory (directories require execute permission to read inside),
- or it's a special file - which can be tested using the -f flag in perl or bash, or using os.path.isfile(...) in python (see the sketch after this list).
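In practice, entries under /sys can pass the isfile test and still fail with "Invalid argument" when read (as in your case), so it is worth wrapping the read itself as well. A small Python sketch:
import os

def safe_read(path):
    # Skip anything that is not a plain file.
    if not os.path.isfile(path):
        return None
    try:
        with open(path, "rb") as f:
            return f.read(4096)
    except OSError:
        # e.g. "Invalid argument" from /sys pseudo-files such as
        # /sys/devices/virtual/net/br-ex/speed
        return None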
Anyhow, you can use the following links to filter files & directories according to their permissions in the 3 languages you mentioned:
ways to test permissions in perl.
ways to test permissions in python.
ways to test permissions in bash.
Not related to this particular case, but I hit the same error when I ran file on a malicious ELF (Linux executable) file. In that case it was because the program headers of the ELF were intentionally corrupted. Looking at the source code of the file command, this is clear: it checks the ELF program headers and bails out with the same error if the headers are corrupted:
/*
 * Loop through all the program headers.
 */
for ( ; num; num--) {
	if (pread(fd, xph_addr, xph_sizeof, off) <
	    CAST(ssize_t, xph_sizeof)) {
		file_badread(ms);
		return -1;
	}
	/* ... */
}
TL;DR: the file command checks not only the magic bytes; it also performs other checks to validate a file type.

Resolving Out of Memory error when executing Perl script

I'm attempting to build an n-gram language model based on the top 100K words found in the English-language Wikipedia dump. I've already extracted the plain text with a modified XML parser written in Java, but need to convert it to a vocab file.
In order to do this, I found a perl script that is said to do the job, but it lacks instructions on how to execute it. Needless to say, I'm a complete newbie to Perl and this is the first time I've encountered a need for it.
When I run this script on a 7.2GB text file, I get an Out of Memory error on two separate dual-core machines with 4GB RAM, running Ubuntu 10.04 and 10.10.
When I contacted the author, he said the script ran fine on a MacBook Pro with 4GB RAM, with a total in-memory usage of about 78 MB when executed on a 6.6GB text file with perl 5.12. The author also said that the script reads the input file line by line and creates a hashmap in memory.
The script is:
#! /usr/bin/perl

use FindBin;
use lib "$FindBin::Bin";
use strict;

require 'english-utils.pl';

## Create a list of words and their frequencies from an input corpus document
## (format: plain text, words separated by spaces, no sentence separators)
## TODO should words with hyphens be expanded? (e.g. three-dimensional)

my %dict;
my $min_len = 3;
my $min_freq = 1;

while (<>) {
    chomp($_);
    my @words = split(" ", $_);
    foreach my $word (@words) {
        # Check validity against regexp and acceptable use of apostrophe
        if ((length($word) >= $min_len) && ($word =~ /^[A-Z][A-Z\'-]+$/)
            && (index($word,"'") < 0 || allow_apostrophe($word))) {
            $dict{$word}++;
        }
    }
}

# Output words which occur with the $min_freq or more often
foreach my $dictword (keys %dict) {
    if ( $dict{$dictword} >= $min_freq ) {
        print $dictword . "\t" . $dict{$dictword} . "\n";
    }
}
I'm executing this script from the command line via mkvocab.pl corpus.txt
The included extra script is simply a regex script that tests the placement of apostrophes and whether they match English grammar rules.
I thought the memory leak was due to the different versions, as 5.10 was installed on my machine. So I upgraded to 5.14, but the error still persists. According to free -m, I have approximately 1.5GB free memory on my system.
As I am completely unfamiliar with the syntax and structure of the language, can you point out the problem areas, along with why the issue exists and how to fix it?
Loading a 7.2GB file into a hash could be possible if there is a lot of repetition in the words, e.g. "the" occurring 17,000 times, etc. It seems to be rather a lot, though.
Your script assumes that the lines in the file are reasonably short. If your file does not contain line breaks, you will load the whole file into memory in $_, then double that memory load with split, and then add quite a lot more to your hash - which would strain any system.
One idea may be to use space " " as your input record separator. It will do approximately what you are already doing with split, except that it will leave other whitespace characters alone, and will not trim excess whitespace as prettily. For example:
$/ = " ";
while (<>) {
for my $word ( split ) { # avoid e.g. "foo\nbar" being considered one word
if (
(length($word) >= $min_len) &&
($word =~ /^[A-Z][A-Z\'-]+$/) &&
(index($word,"'") < 0 || allow_apostrophe($word))
) {
$dict{$word}++;
}
}
}
This will allow even very long lines to be read in bite size chunks, assuming you do have spaces between the words (and not tabs or newlines).
Try running
dos2unix corpus.txt
It is possible that you are reading the entire file as one line...
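One quick way to check whether the corpus really is one giant line (a Python sketch; adjust the filename) is to count the newlines in a prefix of the file without loading it all:
# Count newlines in the first 100 MB of the corpus.
with open("corpus.txt", "rb") as f:
    prefix = f.read(100 * 1024 * 1024)
print(prefix.count(b"\n"))   # 0 (or very few) suggests the whole file is effectively one line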
