SystemML comes packaged with a range of scripts that generate random input data files for use by the various algorithms. Each script accepts an option 'format' which determines whether the data files should be written in CSV or binary format.
I've taken a look at the binary files but they're not in any format I recognize. There doesn't appear to be documentation anywhere online. What is the binary format? What fields are in the header? For dense matrices, are the data contiguously packed at the end of the file (IEEE-754 32-bit float), or are there metadata fields spaced throughout the file?
Essentially, our binary formats for matrices and frames are Hadoop sequence files (a single file or a directory of part files) of type <MatrixIndexes,MatrixBlock> (with MatrixIndexes being a long-long pair of row/column block indexes) and <LongWritable,FrameBlock>, respectively. So anybody with the Hadoop I/O libraries and SystemML in the classpath can consume these files.
In detail, this binary blocked format is our internal tiled matrix representation (with a default blocksize of 1K x 1K entries, and hence a fixed logical but potentially variable physical size). Any external format provided to SystemML, such as CSV or Matrix Market, is automatically converted into binary block format, and all operations work over these binary intermediates. Depending on the backend, though, there are different representations:
For singlenode, in-memory operations and storage, the entire matrix is represented as a single block in deserialized form (where we use linearized double arrays for dense and MCSR, CSR, or COO for sparse).
For spark operations and storage, a matrix is represented as JavaPairRDD<MatrixIndexes, MatrixBlock> and we use MEMORY_AND_DISK (deserialized) as default storage level in aggregated memory.
For mapreduce operations and storage, matrices are actually persisted to sequence files (similar to inputs/outputs).
Furthermore, in serialized form (as written to sequence files or during shuffle), matrix blocks are encoded in one of the following: (1) empty (header: int rows, int cols, byte type), (2) dense (header plus serialized double values), (3) sparse (header plus for each row: nnz per row, followed by column index, value pairs), (4) ultra-sparse (header plus triples of row/column indexes and values, or pairs of row indexes and values for vectors). Note that we also redirect java serialization via writeExternal(ObjectOutput os) and readExternal(ObjectInput is) to the same serialization code path.
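For illustration, here is a minimal Python sketch of how one might parse a single serialized dense block (e.g., the value bytes of one sequence file record), following the field order described above. The big-endian reads match Java's DataOutput conventions; the exact layout (for example, whether a non-zero count is also written) may differ across SystemML versions, so treat this as a sketch rather than a reference implementation.

import struct

def read_dense_block(buf):
    # Header as described above: int rows, int cols, byte block type,
    # followed by rows*cols doubles (all big-endian, per Java DataOutput).
    rows, cols, btype = struct.unpack_from('>iib', buf, 0)
    offset = 4 + 4 + 1
    values = struct.unpack_from('>%dd' % (rows * cols), buf, offset)
    return rows, cols, btype, values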
There are more details, especially with regard to the recently added compressed matrix blocks and frame blocks - so please ask if you're interested in anything specific here.
I have a standard word2vec output which is a .txt file formatted as follows:
[number of words] [dimension (300)]
word1 [300 float numbers separated by spaces]
word2 ...
Now I want to read at most M word representations out of this file. A simple way is to loop over the first M+1 lines of the file and store the M vectors in a numpy array, but this is super slow. Is there a faster way?
What do you mean, "is super slow"? Compared to what?
Because it's a given text format, there's no way around reading the file line-by-line, parsing the floats, and assigning them into a usable structure. But you might be doing things very inefficiently – without seeing your code, it's hard to tell.
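For example, one reasonably efficient approach is to preallocate the numpy array once and fill it row by row. The function name and the exact file layout below are assumptions based on the format sketched in the question:

import numpy as np

def load_first_m(path, m):
    # Read at most m vectors from a word2vec-style text file:
    # a header line 'vocab_size dim', then one word plus its floats per line.
    with open(path, 'r', encoding='utf-8') as f:
        vocab_size, dim = map(int, f.readline().split())
        m = min(m, vocab_size)
        words = []
        vecs = np.empty((m, dim), dtype=np.float32)  # preallocate once
        for i in range(m):
            parts = f.readline().rstrip().split(' ')
            words.append(parts[0])
            vecs[i] = np.asarray(parts[1:], dtype=np.float32)
    return words, vecs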
The gensim library in Python includes classes for working with word-vectors in this format, and its routines include an optional limit argument for reading just a certain number of vectors from the front of a file. For example, this will read the first 1000 from a file named word-vectors.txt:
from gensim.models import KeyedVectors

word_vecs = KeyedVectors.load_word2vec_format('word-vectors.txt',
                                               binary=False,
                                               limit=1000)
I've never noticed it as being a particularly slow operation, even when loading something like the 3GB+ set of word-vectors Google released. (If it does seem super-slow, it could be you have insufficient RAM, and the attempted load is relying on virtual memory paging – which you never want to happen with a random-access data structure like this.)
If you then save the vectors in gensim's native format, via .save(), and if the constituent numpy arrays are large enough to be saved as separate files, then you'd have the option of using gensim's native .load() with the optional mmap='r' argument. This would entirely skip any parsing of the raw on-disk numpy arrays, just memory-mapping them into addressable space – making .load() complete very quickly. Then, as ranges of the array are accessed, they'd be paged into RAM. You'd still be paying the cost of reading all the data from disk – but incrementally, as needed, rather than in a big batch up front.
For example...
word_vecs.save('word-vectors.gensim')
...then later...
word_vecs2 = KeyedVectors.load('word-vectors.gensim', mmap='r')
(There's no 'limit' option for the native .load().)
For a project I am decoding wav files and am using the values in the data channel. I am using the node package "node-wav". From what I understand the values should be in the thousands, but I am seeing values that are scaled between -1 and 1. If I want the actual values, do I need to multiply the scaled value by some number?
Part of the reason I am asking is that I still do not fully understand how WAV files store the necessary data.
I don't know node.js specifically, but audio data is commonly stored as floating-point values, so it makes sense to see it scaled between -1 and 1.
What I pulled from the website:
Data format
Data is always returned as Float32Arrays. While reading and writing 64-bit float WAV files is supported, data is truncated to 32-bit floats.
And endianness if you need it for some reason:
Endianness
This module assumes a little endian CPU, which is true for pretty much every processor these days (in particular Intel and ARM).
If you need to scale from float to fixed-point integer, you'd multiply the value by the maximum value of the target integer type. For example, to convert to 16-bit signed integers: y = (2^15 - 1) * x, where x is the float sample and y is the scaled value.
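As a small illustration of that conversion (in Python rather than node, since the arithmetic is the same for the Float32Array values node-wav returns), a rough sketch:

import numpy as np

def float_to_int16(samples):
    # Scale float samples in [-1.0, 1.0] to 16-bit signed integers.
    scaled = np.asarray(samples, dtype=np.float64) * (2**15 - 1)
    # Round and clip, in case a sample falls slightly outside [-1, 1].
    return np.clip(np.rint(scaled), -2**15, 2**15 - 1).astype(np.int16)

print(float_to_int16([0.0, -1.0, 1.0]))  # yields 0, -32767, 32767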
I have a 3D triangulated surface. Nodes and Conn variables store the coordinates and connectivity of the triangles. At each vertex, a scalar quantity, S, and a vector with three components, V, are stored. These data are time-dependent. Also, my geometry does not change over time and I have one surface for all the timesteps.
How should I approach writing a VTK file that holds the transient data over this surface? In other words, I want to write the values of S and V at each timestep on this 3D surface in a single VTK file. I ultimately want to import this VTK file into ParaView for visualization. vtkTemporalDataSet seems to be the solution for me, but I could not find an example on how to write an ASCII or binary file for this VTK class. Could vtkPolyData somehow be used to define time so that ParaView knows the transient nature of my dataset? I would appreciate any help or comment.
The VTK file format does not support transient data. However, you can write a series of files that ParaView will interpret as a time sequence. This will work fine with poly data in the VTK file. The file series is defined as files of the same name with a number identifier in them. For example, if you have a series of files named:
MyFile_000.vtk
MyFile_001.vtk
MyFile_002.vtk
ParaView will group these files together in its file browser and when you read them together, it will treat them as a file sequence with 3 time steps.
The bad part of this representation is that you will have to replicate the Nodes and Conn in each file. If that is a problem, you will have to use a different file format that supports multiple time steps using the same connection information (such as the Exodus II file format).
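As a rough sketch, here is one way such a series could be written from Python as legacy ASCII VTK polydata files, repeating the unchanging geometry in each file. The names nodes/conn/S/V and their shapes are assumptions based on the question (zero-based triangle indices, one scalar and one 3-component vector per vertex per timestep):

def write_vtk_series(basename, nodes, conn, S, V):
    # nodes: (n_pts, 3) coordinates; conn: (n_tri, 3) zero-based indices;
    # S: (n_steps, n_pts) scalars; V: (n_steps, n_pts, 3) vectors.
    n_pts, n_tri = len(nodes), len(conn)
    for t in range(len(S)):
        with open('{}_{:03d}.vtk'.format(basename, t), 'w') as f:
            f.write('# vtk DataFile Version 3.0\n')
            f.write('timestep {}\nASCII\nDATASET POLYDATA\n'.format(t))
            # Geometry: identical in every file of the series.
            f.write('POINTS {} float\n'.format(n_pts))
            for x, y, z in nodes:
                f.write('{} {} {}\n'.format(x, y, z))
            f.write('POLYGONS {} {}\n'.format(n_tri, 4 * n_tri))
            for a, b, c in conn:
                f.write('3 {} {} {}\n'.format(a, b, c))
            # Per-vertex data for this timestep.
            f.write('POINT_DATA {}\n'.format(n_pts))
            f.write('SCALARS S float 1\nLOOKUP_TABLE default\n')
            for s in S[t]:
                f.write('{}\n'.format(s))
            f.write('VECTORS V float\n')
            for vx, vy, vz in V[t]:
                f.write('{} {} {}\n'.format(vx, vy, vz))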
Please explain how the -scans file works in libjpeg.
In progressive JPEG encoding there is a practically infinite number of possibilities on how the image can be encoded. The amount of complexity is so great that it does not lend itself to parameter passing or command line arguments. LibJpeg allows you to specify a file to indicate how this is done.
In sequential JPEG, each component is encoded in a single scan. A scan can contain multiple components, in which case it is "interleaved".
In progressive JPEG, each component is encoded in 2 or more scans. As in sequential JPEG, a scan may or may not be interleaved.
The DCT produces 64 coefficients. The first is referred to as the "DC" coefficient. The others are the "AC" coefficients.
A progressive scan can divide the DCT data up in two ways:
1. By coefficient range (aka spectral selection). This can be either the DC coefficient or a range of contiguous AC coefficients. (You must send some DC data before sending any AC).
2. By sending the bits of the coefficients in different scans (called successive approximation).
Your choices in a scan are then:
1. Which components
2. Spectral selection (0 or a range within 1 .. 63)
3. Successive approximation (a range within 0 .. 13)
There are semantic rules as well. You must have a DC scan for each component before an AC scan. You cannot send any data twice.
If you have a grayscale image (one component), you could send the image in as many as 64*14 = 896 separate scans or as few as two.
There are so many choices that Libjpeg uses a file to specify them.
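For reference, a small example of what such a scans script might look like for a three-component (YCbCr) image, roughly in the style documented in libjpeg's wizard.txt: each line is one scan, giving the component indexes, then the Ss-Se spectral range, then the Ah and Al successive-approximation bits, and '#' starts a comment. Treat the exact syntax as something to verify against your libjpeg version.

# Interleaved DC scan for Y, Cb, Cr:
0,1,2: 0-0,  0, 0 ;
# AC scans, split by spectral selection:
0:     1-5,  0, 0 ;   # first few Y AC coefficients
1:     1-63, 0, 0 ;   # all AC coefficients for Cb
2:     1-63, 0, 0 ;   # all AC coefficients for Cr
0:     6-63, 0, 0 ;   # remaining Y AC coefficients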
Any help please. I want to build a simple framework for identifying and cleaning duplicate data in a big data context. This preprocessing must be performed in real time (streaming).
We represent our database by a file.csv; this file contains patient (medical) records without duplicates.
We want to cluster file.csv into 4 clusters using an incremental, parallel k-means clustering for mixed categorical and numeric values, so that each cluster contains similar records.
Every time a structured record arrives from the data stream, we must compare it with the representatives of the clusters (M1, M2, M3, M4).
If the record is not a duplicate, we save it in file.csv; if it is a duplicate, it is not saved in file.csv.
1) So what is the more efficient tool in my case, Hadoop or Spark?
2) How can I implement clustering for mixed categorical and numeric values with MLlib (Spark) or Mahout (Hadoop)?
3) What does incremental clustering mean? Is it the same as streaming clustering?
As already noted a dozen times here on SO/CV:
k-means computes means
Unless you can define a least-squares mean for categorical data (one that is still useful in practice), using k-means on such data doesn't work.
Sure, you can do one-hot encoding and similar hacks, but they make the results next to meaningless. "Least-squares" is not a meaningful objective on binary input data.
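As a toy illustration of that last point (the data and variable names here are made up): the least-squares "mean" of one-hot encoded categories is a fractional vector that corresponds to no actual category.

import numpy as np

# A single categorical attribute for five records in one cluster.
colors = ['red', 'green', 'red', 'blue', 'red']
categories = sorted(set(colors))   # ['blue', 'green', 'red']
onehot = np.array([[c == cat for cat in categories] for c in colors], dtype=float)
print(onehot.mean(axis=0))         # [0.2 0.2 0.6], which is not a valid category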
KMeans dealing with categorical variable
Why am I not getting points around clusers in this kmeans implementation?
https://stats.stackexchange.com/questions/58910/kmeans-whether-to-standardise-can-you-use-categorical-variables-is-cluster-3