Spark Java with Google Cloud Storage

Can someone please share examples of reading and writing files from Google Cloud Storage with Spark in Java?
Thank you in advance.

Google Cloud Storage is backed by Google Cloud, which has a lot of Apache Spark and Hadoop API examples for Java that may be useful to you:
https://cloud.google.com/java/samples
https://cloud.google.com/hadoop/
There are also plenty of Java examples of Spark jobs that read and write files from Google Cloud Storage:
https://github.com/GoogleCloudPlatform/spark-examples/tree/master/gaming/java/src/main/java/com/google/cloud/sparkdemo
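As a minimal sketch, assuming the Google Cloud Storage Hadoop connector jar is on the classpath (the property keys below are the ones the connector uses; the project ID, bucket, and file names are placeholders), reading and writing gs:// paths from Spark in Java looks roughly like this:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GcsReadWrite {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("GcsReadWrite");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Register the GCS connector as the handler for gs:// URIs.
            sc.hadoopConfiguration().set("fs.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            sc.hadoopConfiguration().set("fs.gs.project.id", "my-gcp-project"); // placeholder

            // Read a text file from a bucket, transform it, and write the result back.
            JavaRDD<String> lines = sc.textFile("gs://my-bucket/input.txt");   // placeholder path
            JavaRDD<String> upper = lines.map(String::toUpperCase);
            upper.saveAsTextFile("gs://my-bucket/output");                     // placeholder path

            sc.stop();
        }
    }

Once the connector is configured, gs:// paths behave like any other Hadoop filesystem path, so the rest of your Spark code does not change.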

Related

What exactly is the need for Spark when using Talend?

I am new to both Spark and Talend.
But I read everywhere that both of these are ETL tools, and I read another Stack Overflow answer here. From that answer, what I understood is that Talend does use Spark for large data processing. But can Talend do all the ETL work that Spark does, and do it efficiently, without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually put into the Spark engine inside Talend for processing?
I am quite confused by this. Can someone clarify?
Unlike Informatica BDM, which has its own Blaze framework for native processing on Hadoop, Talend relies on other frameworks such as MapReduce (on Hadoop, possibly with Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that Talend, being graphical, can offer some productivity gains: it is handy when there are many fields, and it may not require the most highly skilled staff.
For NoSQL stores like HBase, they provide specific connectors, or you could use the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Securing a Spark Job on a Cluster

I am dealing with a Spark requirement where the client (a banking client, for whom security is a major concern) needs all Spark processing to happen securely.
For example, all communication between the Spark client and server (driver and executor communication) should be over a secure channel. Also, when Spark spills to disk based on the storage level (Mem+Disk), the data should not be written to local disk in unencrypted form, or there should be some workaround to prevent the spill.
I did some research but could not find any concrete solution. Let me know if someone has done this.
Any guidance would be a great help. Thanks in advance.
This sounds like the right job for Apache Commons Crypto.
Instead of preventing the spill, which usually happens during the shuffle phase, you can use the Crypto library to encrypt the output that is spilled.
Here are a few recommended reads:
Securing Apache Spark Shuffle using Apache Commons Crypto
Spark Encryption
Java-based examples:
CipherByteBuffer
Stream Example
These examples are in Java, but I don't think you should have any problem implementing them with Spark/Scala as well. I haven't implemented them myself, so I can't speak to any underlying issues.
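To give a flavor of the API, here is a minimal, hypothetical Java sketch using Commons Crypto's CTR streams to encrypt data on its way to local disk and decrypt it on the way back. The key, IV, and file name are demo placeholders; in a real deployment the key material would come from a secure key store, never a hard-coded string:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.commons.crypto.stream.CtrCryptoInputStream;
    import org.apache.commons.crypto.stream.CtrCryptoOutputStream;

    public class EncryptedSpillSketch {
        public static void main(String[] args) throws Exception {
            // Demo-only key and IV (16 bytes each for AES-128 in CTR mode).
            byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
            byte[] iv  = "fedcba9876543210".getBytes(StandardCharsets.UTF_8);
            Properties props = new Properties(); // defaults select a JCE or OpenSSL cipher

            // Encrypt transparently while writing to local disk.
            try (OutputStream out = new CtrCryptoOutputStream(props,
                    new FileOutputStream("spill.enc"), key, iv)) {
                out.write("sensitive shuffle output".getBytes(StandardCharsets.UTF_8));
            }

            // Decrypt transparently while reading back.
            try (InputStream in = new CtrCryptoInputStream(props,
                    new FileInputStream("spill.enc"), key, iv)) {
                byte[] buf = new byte[1024];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }

Also worth checking: newer Spark releases (2.1 and later, if I recall correctly) expose a built-in spark.io.encryption.enabled setting that encrypts shuffle spill files without custom code, which may cover this requirement directly.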

How to set Cassandra as my Distributed Storage (File System) for my Spark Cluster

I am new to big data and Spark (PySpark).
I recently set up a Spark cluster and want to use the Cassandra File System (CFS) on it to help upload files.
Can anyone tell me how to set it up, and briefly introduce how to use CFS (e.g., how to upload files, and from where)?
By the way, I don't even know how to use HDFS (I downloaded the pre-built spark-bin-hadoop package, but I can't find Hadoop on my system).
Thanks in advance!
CFS only exists in DataStax Enterprise and isn't appropriate for most distributed file applications. Its primary focus is serving as a substitute for HDFS for map/reduce jobs and for small, temporary, but distributed files.
To use it, just use the cfs:// URI and make sure you launch your application with dse spark-submit.
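For illustration, here is a minimal, hypothetical Java sketch (the paths are placeholders, and it assumes the job is launched with dse spark-submit so that DSE registers the cfs:// scheme for you):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CfsExample {
        public static void main(String[] args) {
            // Launched via `dse spark-submit`, so no extra filesystem configuration is needed.
            SparkConf conf = new SparkConf().setAppName("CfsExample");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Read from and write to CFS exactly as you would with HDFS paths.
            JavaRDD<String> lines = sc.textFile("cfs:///user/demo/input.txt"); // placeholder path
            lines.saveAsTextFile("cfs:///user/demo/output");                   // placeholder path

            sc.stop();
        }
    }

The same cfs:// URIs work from PySpark; only the surrounding language changes.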

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try my hand at the Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported. There are a few pieces of JNI beneath my Java API as well.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive use?
Can I create something like a pipeline (where the output of one stage goes to another) with Bluemix, or do I need to code that myself?
I would appreciate any and all help with the above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code / a batch program with spark-submit, along with a notebook interface for both Python and Scala.
Earlier, the beta offering was limited to the interactive notebook interface.
Regards,
Anup
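For example, a minimal Java batch job of the kind you would package into a jar and hand to spark-submit could look like the sketch below. The class name and the process method are placeholders standing in for your own Java API; input and output paths come in as arguments:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BillionRecordBatch {
        // Placeholder standing in for a call into your own Java API.
        private static String process(String record) {
            return record.trim().toUpperCase();
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("BillionRecordBatch");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // args[0] is the input path, args[1] the output path, both passed via spark-submit.
            JavaRDD<String> records = sc.textFile(args[0]);
            JavaRDD<String> processed = records.map(BillionRecordBatch::process);
            processed.saveAsTextFile(args[1]);

            sc.stop();
        }
    }

One caveat for the JNI pieces you mention: any native libraries your Java API loads would also need to be present on every executor node, not just on the driver.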

Data types supported in Storm and Spark

I am new to Storm and Spark. I would just like to ask how these two frameworks store files. Can they use HDFS? Also, can they support the XML format?
Thanks,
I am not sure about Spark, but Storm can write to HDFS:
https://github.com/apache/storm/tree/master/external/storm-hdfs
Apart from HDFS, it also lets you write to HBase and Redis, and it provides a JDBC connector module as well. You can see the full list here:
https://github.com/apache/storm/tree/master/external
As far as Spark is concerned, it can read and write to HDFS, and it supports the XML format just as Hadoop does.
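As a small illustration in Java (the HDFS URIs are placeholders): wholeTextFiles is handy for XML because it keeps each document intact instead of splitting it into lines:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HdfsXmlExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("HdfsXmlExample");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Each pair is (file path, entire file content), so XML documents stay whole.
            JavaPairRDD<String, String> xmlDocs =
                sc.wholeTextFiles("hdfs://namenode:8020/data/xml/"); // placeholder URI

            System.out.println("XML documents read: " + xmlDocs.count());

            // Writing back to HDFS works the same as with any other path.
            xmlDocs.values().saveAsTextFile("hdfs://namenode:8020/data/out"); // placeholder URI

            sc.stop();
        }
    }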
