MongoDB Ingest ETL Design Options - node.js

I'm a total newb when it comes to MongoDB, but I do have previous experience with NoSQL stores like HBase and Accumulo. When I used these other NoSQL platforms, I ended up writing my own data ingest frameworks (typically in Java) to perform ETL-like functions, plus inline enrichment.
I haven't found a tool that has similar functionality for Mongo, but maybe I'm missing it.
To date I have a Logstash instance that collects logs from multiple sources and saves them to disk as JSON. I know there is a mongodb output plugin for Logstash, but it doesn't have any options for configuring how the records should be indexed (i.e. aggregate documents, etc.).
For my needs, I would like to create multiple aggregated documents for each event that arrives via Logstash -- which requires some preprocessing and specific inserts into Mongo.
Bottom line -- before I go build ingest tooling (probably in Python or Node) -- is there something that exists already?

Try node-datapumps, an ETL tool for Node.js. Just fill the input buffer with JSON objects, enrich the data in .process(), and use a mongo mixin to write to MongoDB.
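For comparison, the same read, enrich, write shape can be sketched in plain Python. This is only an illustration under assumptions: the input is taken to be newline-delimited JSON, the event fields (`host`, `level`, `@timestamp`), the collection names, and the `build_upserts` helper are all hypothetical, and the script stops at building the filter/update pairs that would be handed to pymongo's `update_one(filter, update, upsert=True)`.

```python
import json
from collections import namedtuple

# One pending write: target collection, filter document, update document.
Upsert = namedtuple("Upsert", ["collection", "filter", "update"])


def build_upserts(event):
    """Turn one Logstash event into several aggregate-document upserts.

    Field and collection names here are hypothetical placeholders.
    """
    day = event["@timestamp"][:10]  # date part of an ISO-8601 timestamp
    return [
        # Per-host, per-day event counter.
        Upsert("events_by_host_day",
               {"host": event["host"], "day": day},
               {"$inc": {"count": 1}}),
        # Per-level, per-day event counter.
        Upsert("events_by_level_day",
               {"level": event["level"], "day": day},
               {"$inc": {"count": 1}}),
    ]


def read_events(lines):
    """Parse newline-delimited JSON, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    sample = ['{"@timestamp": "2015-03-01T12:00:00Z", "host": "web1", "level": "ERROR"}']
    for event in read_events(sample):
        for op in build_upserts(event):
            # With pymongo this would be roughly:
            #   db[op.collection].update_one(op.filter, op.update, upsert=True)
            print(op.collection, op.filter, op.update)
```

Because each event fans out into several upserts keyed by the aggregate's identity, replaying the same file would double the counters; a real pipeline would need to track which events were already ingested.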

Pentaho ETL has good support for MongoDB functionality.
You can have a look at http://community.pentaho.com/projects/data-integration/
http://wiki.pentaho.com/display/EAI/MongoDB+Output

I just found an ETL tool, Talend Open Studio, which supports many file formats. I just uploaded multiple XML files to MongoDB using Talend. It is also backed by a Talend forum where many Q&As can be found.

Related

Is there any way to insert csv file data using cassandra stress?

I have explored the cassandra-stress tool a bit using a YAML file, and it is working fine. I just wanted to know: is there any way to specify the location of an external CSV file in the YAML profile, so that cassandra-stress inserts that data into a Cassandra table?
So instead of random data, I want to see the cassandra-stress test results for a specific data load on this data model.
Standard cassandra-stress doesn't have such functionality, but you can use the NoSQLBench tool that was recently open sourced by DataStax. It also uses YAML to describe workloads, but it's much more flexible, and has a number of functions for sampling data from CSV files.
P.S. There is also a separate Slack workspace for this project (to get an invite, fill in this form).

How to get/export all of the data from a Cassandra DB

In a project that I am working on, we are planning to migrate from Cassandra to another technology.
The problem is how to get all of the data out of Cassandra. (We are talking about 4M-8M records.)
I would try exporting the data to a CSV file and then importing it into the other database.
To export data to CSV, you can start with the COPY TO command. If that does not work, a simple Java program described in this can help with a bigger data set.
More importantly, you should understand the data model of the other technology before importing data into it. You may need to change your data model.
You can also look at other tools like https://github.com/brianmhess/cassandra-loader. I have imported and exported hundreds of millions of records using this application.
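If a custom export script turns out to be necessary, a minimal Python sketch of the CSV-writing half could look like the following. It assumes the rows arrive as an iterable of dicts from whatever driver is in use; the `export_rows_to_csv` helper and the column names are hypothetical. Rows are consumed one at a time, so memory use stays flat as long as the source iterable is lazy.

```python
import csv
import io


def export_rows_to_csv(rows, columns, out):
    """Write an iterable of dict-like rows to `out` as CSV; return rows written."""
    writer = csv.writer(out)
    writer.writerow(columns)  # header line
    count = 0
    for row in rows:
        writer.writerow([row[c] for c in columns])
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for rows streamed from the database.
    fake_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    buf = io.StringIO()
    print(export_rows_to_csv(fake_rows, ["id", "name"], buf))
    print(buf.getvalue())
```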

How to create an hbase sequencefile key in Spark for load to Bigtable?

I want to be able to easily create test data files that I can save and re-load into a dev Bigtable instance at will, and pass to other members of my team so they can do the same. The suggested way of using Dataflow to Bigtable seems ridiculously heavy-weight (anyone loading a new type of data--not for production purposes, even just playing around with Bigtable for the first time--needs to know Apache Beam, Dataflow, Java, and Maven??--that's potentially going to limit Bigtable adoption for my team) and my data isn't already in HBase so I can't just export a sequencefile.
However, per this document, it seems like the sequencefile key for HBase should be constructible in regular Java/Scala/Python code:
The HBase Key consists of: the row key, column family, column qualifier, timestamp and a type.
It just doesn't go into enough detail for me to actually do it. What delimiters exist between the different parts of the key? (This is my main question).
From there, Spark at least has a method to write a sequencefile so I think I should be able to create the files I want as long as I can construct the keys.
I'm aware that there's an alternative (described in this answer, whose example link is broken) that would involve writing a script to spin up a Dataproc cluster, push a TSV file there, and use HBase ImportTsv to push the data to Bigtable. This also seems overly heavy-weight to me but maybe I'm just not used to the cloud world yet.
The sequence file solution is meant for situations where large sets of data need to be imported and/or exported from Cloud Bigtable. If your file is small enough, then create a script that creates a table, reads from a file, and uses a BufferedMutator (or batch writes in your favorite language) to write to Cloud Bigtable.
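The read-and-batch part of such a script can be sketched in Python as below. This is an illustration under assumptions, not a Bigtable client example: the TSV layout (row key, column family, qualifier, value), the `load_tsv` and `batches` helpers, and the batch size are all hypothetical, and `write_batch` is a stand-in callable where a real script would turn each record into a mutation and send the batch to Cloud Bigtable.

```python
import csv
import io
from itertools import islice


def batches(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_tsv(lines, write_batch, batch_size=500):
    """Read TSV records of (row_key, family, qualifier, value) and write them in batches."""
    reader = csv.reader(lines, delimiter="\t")
    total = 0
    for chunk in batches(reader, batch_size):
        write_batch(chunk)  # hypothetical sink: one call per batch
        total += len(chunk)
    return total


if __name__ == "__main__":
    data = io.StringIO("r1\tcf\tcol\tv1\nr2\tcf\tcol\tv2\nr3\tcf\tcol\tv3\n")
    # Stand-in sink; a real one would issue the batched writes.
    print(load_tsv(data, lambda chunk: print("writing", len(chunk), "rows"), batch_size=2))
```

Keeping the sink as a plain callable means the same test file can be replayed against a dev instance or a dry-run printer without changing the reader.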

How to migrate data between two tables in Cassandra properly

I have to change the schema of one of my tables in Cassandra. It cannot be done by simply using the ALTER TABLE command, because there are some changes to the primary key.
So the question is: How to do such a migration in the best way?
Using the COPY command in CQL is not an option here because the dump file can be really huge.
Can I solve this problem without creating a custom application?
As Guillaume suggested in the comment, you can't do this directly in Cassandra. Schema-altering operations are very limited here. You have to perform such a migration manually using one of the tools suggested there, or, if you have very large tables, you can leverage Spark.
Spark can efficiently read data from your nodes, transform it locally, and save it back to the database. Remember that such a migration requires reading the whole table content, so it might take a while. It might be the most performant solution, but it needs more preparation: a Spark cluster has to be set up.
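A rough PySpark sketch of that approach follows. It assumes the Spark Cassandra Connector is on the classpath, that the target table with the new primary key has already been created, and that an existing `SparkSession` is passed in; the `migrate` and `cassandra_options` helpers and all keyspace, table, and column names are hypothetical.

```python
def cassandra_options(keyspace, table):
    """Build the keyspace/table options for the Cassandra data source."""
    return {"keyspace": keyspace, "table": table}


def migrate(spark, keyspace, source_table, target_table, columns):
    """Copy `columns` from source_table into target_table.

    Assumes target_table already exists with the new primary key.
    """
    source = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(**cassandra_options(keyspace, source_table))
              .load())
    (source.select(*columns)
     .write
     .format("org.apache.spark.sql.cassandra")
     .options(**cassandra_options(keyspace, target_table))
     .mode("append")
     .save())
```

Any reshaping needed for the new key (renaming or deriving columns) would go between the `load()` and the `write`; the sketch only selects columns as-is.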

What is a good Bulk data loading tool for Cassandra

I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place, you can try the sstableloader utility (available from Cassandra 0.8.x onwards) to bulk load the data. For more details see: cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job, available from cassandra-1.1.x onwards.
For more details see: Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic mapping from CSV to Cassandra may not be appropriate, since secondary indexes and denormalisation are commonly needed.
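To illustrate what such a hand-designed mapping might look like, here is a small Python sketch that turns one CSV record into rows for two query-specific tables. The CSV columns, the table names, and the `map_row` helper are hypothetical; the point is only that one input record fans out into several denormalised rows, each shaped for a particular query.

```python
import csv
import io


def map_row(row):
    """Map one CSV record to rows for two query-specific tables (hypothetical schema)."""
    return {
        # Serves "all orders for a customer", partitioned by customer_id.
        "orders_by_customer": {
            "customer_id": row["customer_id"],
            "order_id": row["order_id"],
            "total": float(row["total"]),
        },
        # Serves "all orders on a day", partitioned by order_date.
        "orders_by_date": {
            "order_date": row["order_date"],
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
        },
    }


if __name__ == "__main__":
    data = io.StringIO("order_id,customer_id,order_date,total\n42,c7,2012-08-01,19.99\n")
    for record in csv.DictReader(data):
        for table, values in map_row(record).items():
            print(table, values)
```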
For Cassandra 1.1.3 and higher, there is the CQL COPY command, which imports data into (or exports data from) a table. According to the documentation, if you are importing fewer than roughly 2 million rows, this is a good option. It is much easier to use than sstableloader and less error-prone. sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
For larger data sets, you should use sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.
