I am looking for a complete example of bulk loading data into JanusGraph using TinkerPop 3's hadoop-gremlin package. I read the documentation and didn't find any complete example, only fragments of code and suggestions.
Related
I need to create an index and load data into Elasticsearch. How can I do it using Python? I need to do bulk data operations, and I want to do it without cURL.
I am using Elasticsearch 8+ and Python 3+, and need to do bulk data insertion into an index without cURL.
Have a look at the Python Elasticsearch documentation here.
Otherwise, you can also find help in this Stack Overflow discussion.
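For reference, a minimal sketch of a bulk insert using the official elasticsearch-py client (8.x) and its helpers.bulk function; the host, credentials, index name, and documents are placeholders you would replace with your own:

```python
# Bulk-index documents with elasticsearch-py; no cURL involved.
# Host, credentials, index name, and documents are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "changeme"),  # placeholder credentials
    verify_certs=False,                  # only for local testing
)

# Create the index if it does not exist yet (mappings are optional).
if not es.indices.exists(index="my-index"):
    es.indices.create(index="my-index")

docs = [
    {"title": "first doc", "value": 1},
    {"title": "second doc", "value": 2},
]

# helpers.bulk takes an iterable of actions; "_index" routes each document.
actions = ({"_index": "my-index", "_source": doc} for doc in docs)
ok, errors = helpers.bulk(es, actions)
print(f"indexed {ok} documents, errors: {errors}")
```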
I have explored the cassandra-stress tool a bit using a YAML profile and it is working fine. I just wanted to know: is there any way to specify the location of an external CSV file in the YAML profile, so that cassandra-stress inserts that data into a Cassandra table?
In other words, instead of random data, I want to see cassandra-stress results for a specific data load on this data model.
Standard cassandra-stress doesn't have such functionality, but you can use the NoSQLBench tool that was recently open-sourced by DataStax. It also uses YAML to describe workloads, but it's much more flexible and has a number of functions for sampling data from CSV files.
P.S. There is also a separate Slack workspace for this project (to get an invite, fill out this form).
I have looked for existing Intake components, such as a driver or plugin, that support GCP BigQuery. If none exists, please advise on how to implement a subclass of intake.source.base.DataSource.
Pandas can read from BigQuery with the function read_gbq. If you are only interested in reading the whole result in a single shot, then this is all you need. You would need to do something like the SQL source, which calls pandas to load the data in the _get_schema method.
There is currently no GBQ reader for Dask, so you cannot load out-of-core or in parallel, but see the discussion in this thread.
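To illustrate, here is a rough sketch of such a subclass that wraps pandas.read_gbq in _get_schema, modelled loosely on the SQL source; the class name, parameters, and query handling are illustrative assumptions rather than an existing Intake driver:

```python
# A hypothetical Intake source backed by pandas.read_gbq.
# Class name and constructor parameters are assumptions for illustration.
import pandas as pd
from intake.source.base import DataSource, Schema


class GBQSource(DataSource):
    container = "dataframe"
    name = "gbq"
    version = "0.0.1"
    partition_access = False

    def __init__(self, query, project_id, metadata=None):
        self._query = query
        self._project_id = project_id
        self._df = None
        super().__init__(metadata=metadata)

    def _get_schema(self):
        # Like the SQL source, load the whole result here in one shot.
        if self._df is None:
            self._df = pd.read_gbq(self._query, project_id=self._project_id)
        return Schema(
            datashape=None,
            dtype=self._df.dtypes,
            shape=self._df.shape,
            npartitions=1,
            extra_metadata={},
        )

    def _get_partition(self, _):
        return self._df

    def read(self):
        self._load_metadata()
        return self._df

    def _close(self):
        self._df = None
```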
I'd like to build this workflow:
preprocess some data with Spark, ending with a data frame
write such dataframe to Neo4j as a set of nodes
My idea is really basic: write each row of the dataframe as a node, where each column value represents the value of a node attribute.
I have seen many articles, including neo4j-spark-connector and Introducing the Neo4j 3.0 Apache Spark Connector, but they all focus on importing data from a Neo4j db into Spark... so far, I haven't been able to find a clear example of writing a Spark data frame to a Neo4j database.
Any pointers to documentation or very basic examples are much appreciated.
Reading this issue answered my question.
Long story short, neo4j-spark-connector can write Spark data to a Neo4j db, and yes, the documentation of the new release is lacking.
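For what it's worth, a sketch of what the write side looks like with the newer (4.x) connector's DataSource API from PySpark; the connection URL, credentials, and the :Person label are placeholders, and the connector jar must be on the Spark classpath:

```python
# Write a Spark dataframe to Neo4j: one node per row, one property per column.
# Assumes neo4j-spark-connector 4.x is available via --packages or --jars.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-to-neo4j").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 29), ("Bob", 35)],
    ["name", "age"],
)

(
    df.write
    .format("org.neo4j.spark.DataSource")
    .mode("Append")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")  # placeholder
    .option("labels", ":Person")
    .save()
)
```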
You can write your own routine and use the open-source Neo4j Java driver, for example:
https://github.com/neo4j/neo4j-java-driver
Simply serialise the result of an RDD (using rdd.toJson) and then use the driver above to create your Neo4j nodes and push them into your Neo4j instance.
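The answer above suggests the Java driver; here is a comparable sketch with the official Neo4j Python driver (the neo4j package, 5.x), batching the rows through a single UNWIND query rather than one call per row. The URI, credentials, and the :Row label are placeholders:

```python
# Push rows (e.g. collected from a Spark dataframe) into Neo4j in one batch.
# URI, credentials, and the :Row label are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# e.g. rows = [r.asDict() for r in df.collect()]
rows = [
    {"name": "Alice", "age": 29},
    {"name": "Bob", "age": 35},
]

def create_nodes(tx, batch):
    # UNWIND creates one node per row and copies every column as a property.
    tx.run("UNWIND $rows AS row CREATE (n:Row) SET n = row", rows=batch)

with driver.session() as session:
    session.execute_write(create_nodes, rows)

driver.close()
```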
I know the question is pretty old, but I don't think the neo4j-spark-connector can solve your issue. The full story, sample code, and the details are available here, but to cut a long story short: if you look carefully at the Neo4jDataFrame.mergeEdgeList example (which has been suggested), you'll notice that it instantiates a driver for each row in the dataframe. That will work in a unit test with 10 rows, but you can't expect it to work in a real-world scenario with millions or billions of rows. Besides, there are other defects explained in the link above, where you can find a CSV-based solution. Hope it helps.
I'm a total newb when it comes to MongoDB, but I do have previous experience with NoSQL stores like HBase and Accumulo. When I used those other NoSQL platforms, I ended up writing my own data ingest frameworks (typically in Java) to perform ETL-like functions, plus inline enrichment.
I haven't found a tool that has similar functionality for Mongo, but maybe I'm missing it.
To date I have a Logstash instance that collects logs from multiple sources and saves them to disk as JSON. I know there is a MongoDB output plugin for Logstash, but it doesn't have any options for configuring how the records should be indexed (i.e. aggregated documents, etc.).
For my needs, I would like to create multiple aggregated documents for each event that arrives via Logstash -- which requires some preprocessing and specific inserts into Mongo.
Bottom line -- before I go build ingest tooling (probably in python, or node) -- is there something that exists already?
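In case it helps frame the decision, a minimal sketch of what such tooling could look like in Python: read the JSON files Logstash wrote to disk, preprocess each event into the aggregated shape, and bulk-insert with pymongo. The paths, database/collection names, and the preprocessing logic are placeholders:

```python
# Minimal custom ingest: Logstash JSON files on disk -> preprocess -> MongoDB.
# Paths, db/collection names, and the aggregation logic are placeholders.
import glob
import json

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["logs"]["aggregated_events"]


def preprocess(event):
    # Stand-in for enrichment / building an aggregated document.
    return {"source": event.get("host"), "message": event.get("message")}


docs = []
for path in glob.glob("/var/logstash/output/*.json"):
    with open(path) as fh:
        for line in fh:  # one JSON event per line
            docs.append(preprocess(json.loads(line)))

if docs:
    collection.insert_many(docs)
```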
Try node-datapumps, an ETL tool for Node.js. Just fill the input buffer with JSON objects, enrich the data in .process(), and use a Mongo mixin to write to MongoDB.
Pentaho ETL has good support for MongoDB functionality.
You can have a look at http://community.pentaho.com/projects/data-integration/
http://wiki.pentaho.com/display/EAI/MongoDB+Output
I just found an ETL tool, Talend Open Studio, which has support for many file formats. I just uploaded multiple XML files to MongoDB using Talend. It is also backed by a Talend forum where many Q&As can be found.