I am saving a blob in Cassandra by converting text into a compressed blob using the Erlang function term_to_binary.
https://www.erlang.org/doc/man/erlang.html#term_to_binary-2
Is there a way to decode the above compressed blob back to the original text using a cqlsh query?
You can convert the blob back to an Erlang term with binary_to_term, but this cannot be done in cqlsh; it has to be done at the application level, in Erlang/Elixir.
If you want to read from the database using a language which doesn't have binary_to_term, you might be interested in BERT, which is almost 100% compatible with the Erlang Term Format and has libraries for JavaScript and Ruby.
Alternatively, any other serialisation format such as Piqi, JSON, or XML would work.
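For completeness: if the consuming application happens to run on the JVM, the blob can also be decoded with the JInterface library that ships with Erlang/OTP. The sketch below is only illustrative; it assumes the raw bytes have already been fetched from the blob column by whatever driver the application uses, and you should verify that your JInterface version accepts the compressed variant produced by term_to_binary(Term, [compressed]).

    import com.ericsson.otp.erlang.OtpErlangObject;
    import com.ericsson.otp.erlang.OtpInputStream;

    public class EtfBlobDecoder {

        /**
         * Decode a blob written by Erlang's term_to_binary back into a term.
         * The byte[] is assumed to have been fetched from the Cassandra blob
         * column by whatever driver the application already uses.
         */
        public static OtpErlangObject decode(byte[] blob) throws Exception {
            try (OtpInputStream in = new OtpInputStream(blob)) {
                // read_any() parses the external term format; check that your
                // JInterface version also handles the compressed variant that
                // term_to_binary(Term, [compressed]) produces.
                return in.read_any();
            }
        }

        public static void main(String[] args) throws Exception {
            // For a quick test, read a blob that was dumped to a file.
            byte[] blob = java.nio.file.Files.readAllBytes(
                    java.nio.file.Paths.get(args[0]));
            System.out.println(decode(blob));
        }
    }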
I have explored the cassandra-stress tool a bit using a YAML profile and it is working fine. I just wanted to know whether there is any way to specify the location of an external CSV file in the YAML profile to insert data into a Cassandra table using cassandra-stress.
So instead of random data, I want to see the cassandra-stress test results for a specific data load on this data model.
Standard cassandra-stress doesn't have such functionality, but you can use the NoSQLBench tool that was recently open-sourced by DataStax. It also uses YAML to describe workloads, but it's much more flexible and has a number of functions for sampling data from CSV files.
P.S. There is also a separate Slack workspace for this project (to get an invite, fill in this form).
In Apache NiFi I can take an input with compressed data, unpack it using the UnpackContent processor, and then connect the output to further record processing or other processors.
Is it possible to operate directly on the compressed input? In a normal programming environment, one might easily wrap the record processor in a container that more or less transparently unpacks the data in a stream-processing fashion.
If this is not supported out of the box, would it be reasonable to implement a processor that extends, for example, ConvertRecord to accept compressed input?
The motivation for this is to work efficiently with large CSV data files, converting them into a binary record format without having to spill the uncompressed CSV data to disk.
Compressed input for record processing is not currently supported, but it is a great idea for an improvement.
Instead of implementing it in a particular processor (e.g. ConvertRecord), I'd suggest the following two approaches:
1. Create a CompressedRecordReaderFactory implementing RecordReaderFactory
Like a Java compressed stream such as GZIPInputStream, CompressedRecordReaderFactory would wrap another RecordReaderFactory; the user specifies the compression type (or the reader factory may be able to auto-detect it by looking at FlowFile attributes, etc.).
The benefit of this approach is that once we add it, we can support reading a compressed input stream with any existing RecordReader and any processor using the Record API, not only CSV but also XML, JSON, etc. (see the sketch after this list).
2. Wrap the InputStream in each RecordReaderFactory (e.g. CSVReader)
We could implement the same thing in each RecordReaderFactory and add support for compressed input gradually.
This may provide a better UX because no additional ControllerService has to be configured.
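To make approach 1 concrete, here is a very rough sketch of the wrapping idea. The interfaces below are deliberately simplified stand-ins, not NiFi's actual RecordReaderFactory/RecordReader API; the point is only that the factory wraps the incoming stream with GZIPInputStream (or another decompressor) before delegating, so downstream record processing never sees compressed bytes and nothing is spilled to disk.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    // Hypothetical, simplified interfaces standing in for NiFi's record-reader
    // abstractions; the real RecordReaderFactory/RecordReader signatures differ.
    interface SimpleRecordReader extends AutoCloseable {
        Object nextRecord() throws IOException;
    }

    interface SimpleRecordReaderFactory {
        SimpleRecordReader createReader(InputStream in) throws IOException;
    }

    /**
     * A "compressed" factory that wraps the incoming stream with
     * GZIPInputStream and then delegates to any other reader factory
     * (CSV, JSON, XML, ...), keeping decompression transparent to the
     * downstream record processing.
     */
    class GzipWrappingReaderFactory implements SimpleRecordReaderFactory {

        private final SimpleRecordReaderFactory delegate;

        GzipWrappingReaderFactory(SimpleRecordReaderFactory delegate) {
            this.delegate = delegate;
        }

        @Override
        public SimpleRecordReader createReader(InputStream in) throws IOException {
            // Decompress on the fly; no uncompressed data is written to disk.
            return delegate.createReader(new GZIPInputStream(in));
        }
    }

In a real implementation the compression type (or auto-detection from FlowFile attributes) would be a property of the controller service, as described above.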
What do you think? For further discussion, I suggest creating a NiFi JIRA ticket. If you're willing to contribute, that would be even better.
I need to optimize disk usage and the amount of data transferred during replication with my CouchDB instance. Does storing numerical data as ints/floats instead of as strings make a difference to file storage and/or to HTTP requests? I've read that JSON treats everything as strings, but newer JSON specs make use of different datatypes (float/int/boolean). What about PouchDB?
CouchDB stores JSON data using native JSON types, so ints and floats are actual number types when serialised to disk. But I doubt you save much disk space compared to storing them as strings. The replication protocol uses JSON, and the internal encoding has no effect on it.
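As a quick illustration of how little is at stake, compare the UTF-8 size of the same field encoded as a JSON number versus a JSON string; the difference is just the two quote characters (plain Java, no CouchDB involved):

    import java.nio.charset.StandardCharsets;

    public class JsonSizeCheck {
        public static void main(String[] args) {
            // The same field encoded as a JSON number vs. a JSON string:
            String asNumber = "{\"count\":1234567}";
            String asString = "{\"count\":\"1234567\"}";

            // UTF-8 byte lengths; the string form only costs the two quotes.
            System.out.println(asNumber.getBytes(StandardCharsets.UTF_8).length); // 17
            System.out.println(asString.getBytes(StandardCharsets.UTF_8).length); // 19
        }
    }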
PouchDB over WebSQL and SQLite stores your documents as strings (I don't know about IndexedDB).
So to optimize disk usage, just keep less data. :)
Using the .NET SDK, I'm doing some log file parsing with Azure HDInsight. Seemingly simple things like changing the output file format from "part-xxxxx" to something related to the input file name seems to be quite complicated, and documentation is scant.
Based on what I've seen about output file formats in Hadoop in general, it looks like this isn't a setting I can change based on a template (which could then be fed in with HadoopJobConfiguration.AdditionalGenericArguments in the .NET SDK), but something that requires actual Java code, which seems to suggest that the only way to get this done is to recode my solution as an actual Java class.
Suggestions?
This is a fundamental Hadoop thing.
Hadoop jobs will always output files in part-nnnnn format; the only thing you can specify is the baseOutputDirectory path they will go in, so you could certainly use the directory to relate the output to the input.
The reason for this is that each reducer has to have its own output file.
If you're doing any further processing on the output in Hadoop, with Hive for example, then this shouldn't be too much of a hardship, since the InputFormats used will pick up all the part-nnnnn files for you.
That said, you could provide a subclass of the MultipleOutputFormat class to control the pattern of the filenames, but that will need to be in Java, since you can't write OutputFormats with the streaming API.
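For reference, such a subclass can be quite small. The sketch below uses the old org.apache.hadoop.mapred API (the one streaming jobs are built on) via MultipleTextOutputFormat and assumes Text keys and values; adapt the types and naming scheme to your job.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    /**
     * Sketch of an OutputFormat that names output files after the record key
     * instead of the default part-nnnnn.
     */
    public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {

        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // "name" is the default leaf name (e.g. part-00000); keep it as a
            // suffix so two reducers writing the same key don't collide.
            return key.toString() + "-" + name;
        }
    }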
Another option might be to use the Azure Storage client to merge, and rename the output files once the
I'm a total newb when it comes to MongoDB, but I do have previous experience with NoSQL stores like HBase and Accumulo. When I used those other NoSQL platforms, I ended up writing my own data-ingest frameworks (typically in Java) to perform ETL-like functions, plus inline enrichment.
I haven't found a tool that has similar functionality for Mongo, but maybe I'm missing it.
To date I have a Logstash instance that collects logs from multiple sources and saves them to disk as JSON. I know there is a MongoDB output plugin for Logstash, but it doesn't have any options for configuring how the records should be indexed (i.e. aggregate documents, etc.).
For my needs, I would like to create multiple aggregated documents for each event that arrives via Logstash -- which requires some preprocessing and specific inserts into Mongo.
Bottom line -- before I go build ingest tooling (probably in python, or node) -- is there something that exists already?
Try node-datapumps, an ETL tool for Node.js. Just fill the input buffer with your JSON objects, enrich the data in .process(), and use a Mongo mixin to write to MongoDB.
Pentaho ETL has good support for MongoDB functionality.
You can have a look at http://community.pentaho.com/projects/data-integration/
http://wiki.pentaho.com/display/EAI/MongoDB+Output
I just found another ETL tool, Talend Open Studio, which has support for many file formats. I just uploaded multiple XML files to MongoDB using Talend. It is also backed by a Talend forum where many Q&As can be found.