How can Spark speed up bulk loading into JanusGraph? - apache-spark

I need to load a large number of vertices and edges into JanusGraph (with a Cassandra backend) from another storage system. I've read about bulk loading and configuring Spark (https://docs.janusgraph.org/advanced-topics/bulk-loading/ and https://docs.janusgraph.org/advanced-topics/hadoop/).
It's clear how to configure JanusGraph for use with Spark, but I'm still not sure how to actually use Spark, or whether Spark can speed up inserts into the graph.
Please give some use cases and a code example of using Hadoop MapReduce or Spark to speed up bulk loading into JanusGraph (Java or Python preferred). Any help is welcome!

I recently worked on a POC project to bulk load data into JanusGraph using Apache Spark. We got pretty good load performance with Spark. The setup and sample code are described in the articles below.
https://medium.com/#nitinpoddar/bulk-loading-data-into-janusgraph-ace7d146af05
https://medium.com/#nitinpoddar/bulk-loading-data-into-janusgraph-part-2-ca946db26582
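
As a rough illustration of why Spark can help here (this is not taken from the linked articles): Spark splits the input into partitions and processes them in parallel, so each task can open its own graph connection and commit in batches. In the sketch below, `open_writer`, `add_element`, `commit` and `close` are hypothetical names standing in for whatever JanusGraph client you use; the only real Spark call, `foreachPartition`, appears in a comment.

```python
from typing import Callable, Iterable, Iterator, List


def batches(rows: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` rows."""
    batch: List = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


def load_partition(rows: Iterable, open_writer: Callable, batch_size: int = 1000) -> int:
    """Load one partition: one connection per partition, one commit per batch.

    `open_writer` is a hypothetical factory returning an object with
    add_element(row), commit() and close() methods.
    """
    writer = open_writer()
    loaded = 0
    try:
        for batch in batches(rows, batch_size):
            for row in batch:
                writer.add_element(row)
            writer.commit()  # one transaction per batch, not per element
            loaded += len(batch)
    finally:
        writer.close()
    return loaded


# With PySpark this would be wired up roughly as:
#   rdd.foreachPartition(lambda rows: load_partition(rows, open_writer))
```

Keeping the writer behind a factory means the connection is created on the executor rather than serialized from the driver, and it makes the loader easy to exercise with a fake writer.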

Alternatively, you can write a Kafka consumer application that loads data from Kafka into JanusGraph. Parallelism is capped by the number of partitions of the input topic your application reads from. Each instance of the application is single-threaded, but you can scale out by running more instances, up to the number of input partitions. Each instance can open its own connection and write to JanusGraph in transactions. You can batch writes into transactions of a fixed batch size to spread the load.
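
To make the parallelism cap concrete, here is a simplified model of how partitions are spread over consumer instances. It is only a round-robin sketch, not Kafka's actual assignor, and it assumes all instances are in one consumer group, where each partition is read by at most one instance. The function names are made up for the illustration.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, num_instances: int) -> Dict[int, List[int]]:
    """Spread partition ids over instances round-robin (simplified model).

    Assumes num_instances >= 1.
    """
    assignment: Dict[int, List[int]] = {i: [] for i in range(num_instances)}
    for partition in range(num_partitions):
        assignment[partition % num_instances].append(partition)
    return assignment


def active_instances(assignment: Dict[int, List[int]]) -> int:
    """Count instances that received at least one partition."""
    return sum(1 for partitions in assignment.values() if partitions)


if __name__ == "__main__":
    # 4 partitions, 6 instances: only 4 instances get work, 2 sit idle.
    print(active_instances(assign_partitions(4, 6)))
```

Adding instances beyond the partition count therefore buys nothing; to load faster you would need more partitions on the input topic.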

Related

Process real time data using kafka

I have a requirement to implement a solution for the use case below.
Currently, applications store data in a Postgres database, but that database is running into storage issues. The plan is to move the data from Postgres to Hadoop and keep it near real time there. We came up with the following solution:
1) Write a Kafka producer application that listens to the Postgres tables, captures changed data, and writes it to a Kafka topic.
2) Write a Kafka sink application that reads from the Kafka topic and writes to Hive tables (Parquet, external tables, partitioned and non-partitioned). For non-partitioned tables, applying updates/deletes means touching the whole table in the Spark code, right? That would degrade performance for every record coming from the Kafka topic. We have already developed a Sqoop incremental job that runs every 5 minutes to do the same thing, but the client needs real-time data in Hadoop, so Kafka + Spark processing came into the discussion.
Could you provide the pros and cons of step 2 compared to the Sqoop incremental job?
Please share any code snippets or links that would help my thought process.
Getting data into Kafka is easy - use Debezium.
For getting it out...
I wouldn't use Hive at all for this. Real-time data (depending on the volume of the data, obviously) results in tiny files in HDFS. As a result, Hive queries become slower and slower over time.
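
A back-of-the-envelope sketch of the tiny-files effect, under made-up assumptions: 12 parallel writers, each flushing one file every 60 seconds, and about 10,000 MB of data per day. None of these numbers come from the question; plug in your own.

```python
import math

SECONDS_PER_DAY = 86_400


def files_per_day(writers: int, flush_interval_s: int) -> int:
    """Files produced per day if every writer flushes one file per interval."""
    return writers * math.ceil(SECONDS_PER_DAY / flush_interval_s)


def avg_file_size_mb(daily_volume_mb: float, files: int) -> float:
    """Average file size when the daily volume is spread evenly over the files."""
    return daily_volume_mb / files


if __name__ == "__main__":
    files = files_per_day(writers=12, flush_interval_s=60)
    print(files, "files/day, about", round(avg_file_size_mb(10_000, files), 2), "MB each")
```

With these assumed numbers that is 17,280 files per day at well under 1 MB each, far below the default 128 MB HDFS block size, which is why periodic compaction or a different storage layer is usually needed.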
Hive is not a replacement for Postgres. In fact, the Hive metastore requires a relational database still, such as Postgres.
I also wouldn't use Spark. You'd have to write code, whereas ingesting Kafka topics into queryable formats is already a solved problem with other tools.
Popular options include Apache Pinot, Druid, or Apache Iceberg storage with Presto (some of which may overlap with HDFS storage, but will be much, much faster than Hive to query). Only the third option requires writing Kafka consumer code; the other two have native Kafka ingestion.
And even so, if you're stuck with HDFS, the Kafka Connect framework ships with Kafka. There's an HDFS Sink plugin, written by Confluent, which supports Hive integration.
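
A minimal sketch of what a connector definition for that HDFS Sink could look like, built as the JSON payload you would POST to the Kafka Connect REST API (`/connectors`). The topic name, host names, connector name and flush size are placeholders invented for the example; check the property names against the connector version you install.

```python
import json


def build_hdfs_sink_config(topic: str, hdfs_url: str, metastore_uri: str) -> dict:
    """Build a Kafka Connect payload for Confluent's HDFS Sink with Hive integration."""
    return {
        "name": f"hdfs-sink-{topic}",  # placeholder connector name
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "hdfs.url": hdfs_url,
            # Records written per file before it is committed; larger values
            # mean fewer, bigger files in HDFS.
            "flush.size": "10000",
            "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
            "hive.integration": "true",
            "hive.metastore.uris": metastore_uri,
            # Hive integration needs a schema compatibility mode other than NONE.
            "schema.compatibility": "BACKWARD",
        },
    }


if __name__ == "__main__":
    payload = build_hdfs_sink_config(
        "pg-changes", "hdfs://namenode:8020", "thrift://metastore:9083"
    )
    print(json.dumps(payload, indent=2))
```

Writing Parquet this way assumes the topic carries records with a schema (for example Avro with a schema registry), since the sink needs a schema to create the Hive table.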

Kafka to Spark batch processing

I'm looking for an optimal data architecture.
I'm dealing with time-series (TS) data that is flushed from a Redis database to an OpenTSDB database each week.
OpenTSDB stores its data in HBase, which runs on a Hadoop cluster.
The time-series data available in OpenTSDB then has to be batch processed (at intervals of 1 to 6 months).
Since OpenTSDB data is stored in HBase in binary large object (BLOB) format, I can't currently work with the HBase HTTP API.
Spark cannot directly access the OpenTSDB API (while Kafka seems to be fine with the HTTP API), so I'm facing an architecture question. Would it be more convenient to:
Use Apache Kafka to extract the batch data (terabytes) and use it as a pipeline to ingest and analyze the data in Spark DataFrames?
Flush the Redis data directly into HBase and then use Spark directly on top of it?
That said, I want to be sure that Spark can handle batch analytics on terabytes of data, and that Kafka can handle that volume before it is loaded as a Spark RDD.
Any suggestion or help is welcome. Thanks.

Redshift with Spark Streaming

I have a Kafka + Spark Streaming application that ingests and processes 60K events per minute. I need a database to store my transformed DataFrames so that a visualization layer can access them. Can Redshift be used for this with Spark Streaming, or should I use Cassandra? I will be processing and storing the DataFrames in every Spark window of 30 seconds, and I also need to read from the datastore in every window. My understanding is that Redshift is primarily a data warehousing database, not meant for OLTP-style processing. Any ideas?
You should check out SnappyData. SnappyData deeply integrates an in-memory database with Spark that allows hybrid OLTP/OLAP applications. You can write Spark Streaming applications on top of Snappy that can update/delete data from the database. Further, because it does not go over a connector, it performs better than the myriad datastores that have Spark connectors and even the native Spark cache. There may be other datastores that offer hybrid OLTP/OLAP applications on Spark in the aforementioned link.
Disclaimer: I am a SnappyData employee.

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools (NiFi, Gobblin, etc.), I noticed that Databricks is now promoting Spark for data ingestion/on-boarding.
We have a Spark (Scala) based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs.
Now that we are planning to make our application available to clients, we expect files of any type and number (mainly CSV, JSON, XML, etc.) from any data source (FTP, SFTP, any relational or NoSQL database), of huge size (ranging from GB to PB).
With this in mind, we are looking for options for data on-boarding and data sanity checks before pushing data into HDFS.
The options we are considering, in order of priority:
1) Spark for data ingestion and sanity checks: since our application is written for and runs on a Spark cluster, we are planning to use Spark for the ingestion and sanity tasks as well.
We are a bit worried about Spark's support for many data sources and file types. Also, if we copy data from, say, an FTP/SFTP server, will all workers write data to HDFS in parallel? Are there any limitations? Does Spark maintain any audit trail during such a copy?
2) NiFi in clustered mode: how good would NiFi be for this purpose? Can it be used for any data source and any file size? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we copy GB to PB of data and run sanity checks on it before pushing it to HDFS?
3) Gobblin in clustered mode: we would like answers to the same questions as for NiFi.
4) Is there any other good option for this purpose with lower infrastructure cost and better performance?
Any guidance, pointers, or comparisons for the tools and technologies mentioned above would be appreciated.
Best Regards,
Bhupesh
After some R&D, and considering that using NiFi or Gobblin would require more infrastructure cost, I have started testing Spark for data on-boarding.
So far I have tried a Spark job that imports data (present at a remote staging area/node) into my HDFS, and I was able to do that by mounting the remote location on all my Spark cluster worker nodes. This made the location local to those workers, so the Spark job ran properly and the data was on-boarded to my HDFS.
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra. So far it is going well. I would therefore suggest to others: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost (where cost could be a major constraint), use a Spark job for data on-boarding.

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series database is needed.
However, I noticed that Apache Spark is often mentioned when people talk about Cassandra.
The question is: as long as I can use a Cassandra cluster to serve my app and to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and fits a lot of scenarios, but CQL syntax has some limitations if your schema is not designed for the queries you want to run.
If you want to use your data without those restrictions, run analytical workloads on your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has a tight integration with Cassandra.
I recommend checking these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
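
As a minimal sketch of the kind of join mentioned above, assuming the DataStax Spark Cassandra Connector is on the Spark classpath: the keyspace `iot`, the tables `readings` and `devices`, and the `device_id` column are invented for the example, and `spark` is an existing `SparkSession`.

```python
# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_table(spark, keyspace: str, table: str):
    """Load one Cassandra table as a Spark DataFrame."""
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


def readings_with_devices(spark, keyspace: str = "iot"):
    """Join two hypothetical tables, something CQL alone cannot express."""
    readings = read_table(spark, keyspace, "readings")
    devices = read_table(spark, keyspace, "devices")
    return readings.join(devices, on="device_id", how="inner")
```

The result is an ordinary DataFrame, so aggregations or further joins with non-Cassandra sources work the same way from there.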
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
For computations in particular, the DataStax Cassandra connector can exploit data locality. If you need to do a computation that modifies a row (but doesn't depend on anything else), that operation is optimized to run locally on each machine in the cluster, without moving data over the network.
The same goes for a lot of other Spark workloads: the actions (functions that modify the data) run locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to do any operations on the data, you can still use Spark for other purposes, as described below.
Spark Streaming can be used to ingest data into or export data from Cassandra (I have used it a lot personally). The same import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines, with multi-machine, multi-threaded execution built in and an admin UI where I can see the job progress.
With Spark + Zeppelin, we can visualize Cassandra data using Spark, and we can build beautiful UIs with little Spark code where users can even enter input and see the result as a graph, table, etc.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are very hard to set up, and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I have used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra offers fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra.
As a use case, consider a video application that uploads video via streaming and stores it directly in a Cassandra blob.
