Spark NiFi site to site connection - apache-spark

I am new to NiFi. I am trying to send data from NiFi to Spark, i.e. to establish a stream from a NiFi output port to Spark, following this tutorial.
NiFi is running on Kubernetes and I am using the Spark Operator on the same cluster to submit my applications.
It seems that Spark can reach the NiFi web API and it starts a streaming receiver. However, no data arrives in the Spark app through the output port and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information that could help me solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
  .keystoreFilename("..")
  .keystorePass("...")
  .keystoreType(...)
  .truststoreFilename("..")
  .truststorePass("..")
  .truststoreType(...)
  .url("https://...../nifi")
  .portName("spark")
  .buildConfig()

val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
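For reference, a minimal sketch of how such a receiver stream could be consumed to check whether any flow files arrive at all (this assumes the NiFiDataPacket type from the nifi-spark-receiver package):

import org.apache.nifi.spark.NiFiDataPacket

// Each record is a NiFiDataPacket carrying the flow file content and its attributes.
val contents = lines.map(packet => new String(packet.getContent))
contents.print()   // prints a few flow files per batch; stays empty if nothing is pulled from the port

ssc.start()
ssc.awaitTermination()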

Related

How can Spark read / write from Azurite

I am trying to read (and eventually write) from Azurite (version 3.18.0) using Spark (3.1.1).
I can't work out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that I'm running; the uri val is one of the values below:
val uri = ...

val spark = SparkSession.builder()
  .appName(appName)
  .master("local")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()

spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set("spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)

spark.read.format("avro").load(uri)
uri values I have tried - which is the correct one?
http://127.0.0.1:10000/container1/file1.avro
I get an UnsupportedOperationException when I call spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which does not support listStatus.
wasb://container1@devstoreaccount1.blob.core.windows.net/file1.avro
Spark tries to authenticate against the real Azure servers (and fails, for obvious reasons).
I have tried to follow this Stack Overflow post, without success.
I have also tried removing the blob.core.windows.net suffix from the configuration keys, but then I don't know how to give Spark the endpoint for the Azurite container.
So my question is: what are the correct configurations to give Spark so that it can read from Azurite, and what is the correct file path format to pass as the URI?

What does "avoid multiple Kudu clients per cluster" mean?

I am looking at Kudu's documentation.
Below is a partial description of kudu-spark.
https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster
Avoid multiple Kudu clients per cluster.
One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
Does this mean that I can only run one kudu-spark task at a time?
If I have a Spark Streaming program that is always writing data to Kudu, how can I connect to Kudu from other Spark programs?
In a non-Spark program you use a KuduClient to access Kudu. In a Spark application you use a KuduContext, which already owns such a client for that Kudu cluster.
A simple Java program requires a KuduClient, using the Java API and a Maven
setup:
KuduClient kuduClient = new KuduClient.KuduClientBuilder("kudu-master-hostname").build();
See http://harshj.com/writing-a-simple-kudu-java-api-program/
A Spark / Scala program, of which many can run at the same time against the
same cluster, uses the Spark-Kudu integration. Snippet borrowed from the
official guide, as it has been quite some time since I looked at this:
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu._
import collection.JavaConverters._

// Read a table from Kudu
val df = spark.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .format("kudu").load

// Query using the Spark API...
df.select("id").filter("id >= 5").show()

// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
spark.sql("select id from kudu_table where id >= 5").show()

// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("key").asJava, 3))

// Insert data
kuduContext.insertRows(df, "test_table")
See https://kudu.apache.org/docs/developing.html
A clearer statement of "avoid multiple Kudu clients per cluster" would be "avoid multiple Kudu clients per Spark application".
Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
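For example, a minimal sketch of that recommendation, reusing the kuduContext and test_table from the snippet above instead of building a second KuduClient:

// Reuse the client that the KuduContext already owns rather than
// constructing another KuduClient against the same cluster.
val client = kuduContext.syncClient
val table = client.openTable("test_table")   // e.g. inspect the table's schema
println(table.getSchema)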

Apache Kafka + Spark Integration (REST API is needed?)

I have some fundamental questions; I hope someone can clear them up.
I want to use Apache Kafka and Apache Spark for my application. I have gone through numerous tutorials and have got the basic idea of what they are and how they work.
Use case:
Data will be generated by mobile devices (multiple devices, let's say 1000) at an interval of 40 seconds, and I need to process that data and add values to the database, which in turn will be reflected back in a dashboard.
What I wanted to do is use Spark Streaming: make a POST request from Android itself, have that data processed by the Spark application, and that's it.
Issues:
Apache Spark
I am following this tutorial to get it up and running. (I am using Java, not Scala.)
Link : https://www.santoshsrinivas.com/installing-apache-spark-on-ubuntu-16-04/
After everything is done, I execute spark-shell and it starts. I have also installed ZooKeeper and Kafka on my server and started Kafka in the background, so that's not an issue.
When I open http://161.xxx.xxx.xxx:4040/jobs/ I get this page.
In all the tutorials I have gone through, there is a page like this: https://i.stack.imgur.com/gF1fN.png but I don't get it. Is Spark not properly installed?
When I deploy a standalone jar to Spark (using this link: http://data-scientist-in-training.blogspot.in/2015/03/apache-spark-cluster-deployment-part-1.html), I am able to run it.
That is, with the command spark-submit --class SimpleApp.SimpleApp --master spark://http://161.xxx.xxx.xxx:7077 --name "try" /opt/spark/bin/try-0.0.1-SNAPSHOT.jar, I get the output.
Do I need to submit the application every time I want to use it?
This is my program:
package SimpleApp;

/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "/opt/spark/README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        //System.setProperty("hadoop.home.dir", "C:/winutil");
        sc.setLogLevel("ERROR"); // Don't want the INFO stuff

        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("word count : " + logData.first());

        sc.stop();
    }
}
Now how do I integrate Kafka into it?
How do I configure the app so that it gets executed every time Kafka receives a message?
Moreover, do I need to build a REST API through which I send the data to Kafka, i.e. would the REST API be used as the producer? Something like the Spark Java framework? http://sparkjava.com/
If yes, wouldn't the bottleneck then be at the REST API level, i.e. how many requests it can handle? Everywhere I read that Kafka has very high throughput.
Is the final structure going to be SPARK JAVA -> KAFKA -> APACHE SPARK?
Lastly, how do I set up the development environment on my local machine? I have Kafka and Apache Spark installed, and I am using Eclipse.
Thanks
Well, you are having some trouble understanding how Spark works with Kafka.
First, let's clarify a few things:
Kafka is a stream processing platform built for low latency and high throughput. It allows you to store and read lots of data really fast.
Spark has two kinds of processing, Spark batch and Spark Streaming. What you are studying is batch; for your problem I suggest you look at Spark Streaming.
What is Streaming?
Streaming is a way to transport and transform your data in real time or near real time. You don't need to create a process that you invoke every 10 minutes or every 10 seconds; you start the job once, and it consumes the source and posts to the sink.
Kafka is a passive platform, so it can be either the source or the sink of a stream process.
In your case, what I suggest is:
Create a streaming producer for Kafka: you will read the log of your mobile application on your web server, so you need to plug something into your web server to start consuming the data. What I suggest is Fluentd, a really strong streaming application; it is written in Ruby but is really easy to use. If you want something more robust and more focused on big data, I suggest Apache NiFi; it is harder to work with, but you can build data-flow pipelines to transfer your information to your cluster. Something really simple that will also solve your problem is Apache Flume.
Start your Kafka; you can use Docker to run it. Kafka will hold your data for a period, and will let you fetch it really fast when you need it, even with a lot of data. Please read the docs to understand how it works.
Spark Streaming - It doesn't make sense to use Kafka if you don't have a stream process; your idea of using REST to produce the data into Kafka is slow, and if it is batch it doesn't make sense. So if you write as a stream, you should analyse as a stream too. I suggest you read about Spark Streaming here, and about integrating Spark with Kafka here.
So, as you asked:
Do I need a REST API?
The answer is No.
The architecture will be like this:
Web Server -> Fluentd -> Apache Kafka -> Spark Streaming -> Output
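As a rough Scala sketch of the Spark Streaming piece of that pipeline (the bootstrap server, topic name and consumer group below are placeholders; the 40-second batch interval mirrors the device interval from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf().setAppName("MobileEvents").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(40))   // batch interval matching the 40 s device interval

// Kafka connection settings - broker address, topic and group id are assumptions.
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.GROUP_ID_CONFIG -> "mobile-events-consumer"
)

// Subscribe to the topic that the producer (Fluentd / NiFi / Flume) writes to.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("mobile-events"), kafkaParams)
)

// Placeholder processing: parse each record here and write to the database behind the dashboard.
val values = stream.map(record => record.value)
values.print()

ssc.start()
ssc.awaitTermination()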
I hope that will help

Accessing Spark RDDs from a web browser via thrift server - java

We have processed our data using Spark 1.2.1 with Java and stored it in Hive tables. We want to access this data as RDDs from a web browser.
I read the documentation and understood the steps to do the task.
I am unable to find a way to interact with Spark SQL RDDs via the thrift server. The examples I found have the line below in the code, and I cannot find the class for it in the Spark 1.2.1 Java API docs.
HiveThriftServer2.startWithContext
On GitHub I saw Scala examples using
import org.apache.spark.sql.hive.thriftserver, but I don't see this in the Java API docs. Not sure if I am missing something.
Has anybody had luck accessing Spark SQL RDDs from a browser via thrift? Can you post a code snippet? We are using Java.
I've got most of this working. Let's dissect each part of it (references at the bottom of the post).
HiveThriftServer2.startWithContext is defined in Scala. I was never able to access it from Java or from Python using Py4J, and I am no JVM expert, but I ended up switching to Scala. This may have something to do with the annotation @DeveloperApi. This is how I imported it in Scala in Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For anyone reading this and not using Hive, a plain Spark SQL context won't do; you need a HiveContext. Its constructor takes a SparkContext, so if what you are holding is a JavaSparkContext, convert it:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
var hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the thrift server
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we have to convert them into Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then, we need to turn them into Spark SQL tables. You do this either by persisting them to Hive, or by registering the DataFrame as a temporary table.
Persist to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing
someDF.write.saveAsTable("someTable")
Or, use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note - temporary tables are isolated to an SQL session.
Spark's hive thrift server is multi-session by default
in version 1.6 (one session per connection). Therefore,
for clients to access temporary tables you've registered,
you'll need to set the option spark.sql.hive.thriftServer.singleSession to true
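A small sketch of one way to set that option (as an assumption, via the SparkConf before the contexts are created; passing it with spark-submit's --conf flag should work as well):

import org.apache.spark.SparkConf

// Set before the SparkContext / HiveContext are created so the thrift server
// picks it up (alternatively: --conf spark.sql.hive.thriftServer.singleSession=true).
val sparkConf = new SparkConf()
  .setAppName("thrift-server-app")
  .set("spark.sql.hive.thriftServer.singleSession", "true")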
You can test this by querying the tables in beeline, a command line utility for interacting with the hive thrift server. It ships with Spark.
Finally, you need a way of accessing the hive thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can use the thrift protocol over AJAX requests from the browser. A simpler strategy might be to create an IPython notebook, and use pyhive to connect to the thrift server.
Data Frame Reference:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession option pull request:
https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31#git.apache.org%3E
HTTP mode and beeline howto:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive:
https://github.com/dropbox/PyHive
HiveThriftServer2 startWithContext definition:
https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56-73
Thrift is a JDBC/ODBC server.
You can connect to it via JDBC/ODBC connections and access content through the HiveDriver.
You cannot get RDDs back from it, because the HiveContext is not available.
What you referred to is an experimental feature that is not available for Java.
As a workaround, you could re-parse the results and build your own structures for your client.
For example:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";

// Load the Hive JDBC driver and query the thrift server like any other JDBC source.
Class c = Class.forName(driverName);
Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
Statement stmt = con.createStatement();
String sql = "select * from " + tableName;
ResultSet res = stmt.executeQuery(sql);
parseResultsToObjects(res);

Real time log processing using Apache Spark Streaming

I want to create a system where I can read logs in real time and use Apache Spark to process them. I am confused about whether I should use something like Kafka or Flume to pass the logs to the Spark stream, or whether I should pass the logs using sockets. I have gone through a sample program in the Spark Streaming documentation - Spark stream example. But I would be grateful if someone could show me a better way to pass logs to a Spark stream. It's kind of new turf for me.
Apache Flume may help to read the logs in real time.
Flume provides log collection and transport to the application, where Spark Streaming is then used to analyze the required information.
1. Download Apache Flume from the official site or follow the instructions from here.
2. Set up and run Flume
Modify flume-conf.properties.template in the directory where Flume is installed (FLUME_INSTALLATION_PATH\conf); here you need to provide the log source, channel and sinks (output). More details about the setup are here.
Here is an example of launching Flume that collects log information from a ping command running on a Windows host and writes it to a file:
flume-conf.properties
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = for() { ping google.com }
agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.type = file_roll
agent.sinks.loggerSink.channel = memoryChannel
agent.sinks.loggerSink.sink.directory = D:\\TMP\\flu\\
agent.sinks.loggerSink.serializer = text
agent.sinks.loggerSink.appendNewline = false
agent.sinks.loggerSink.rollInterval = 0
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
To run the example go to FLUME_INSTALLATION_PATH and execute
java -Xmx20m -Dlog4j.configuration=file:///%CD%\conf\log4j.properties -cp .\lib\* org.apache.flume.node.Application -f conf\flume-conf.properties -n agent
Or you may create your own Java application that has the Flume libraries on its classpath and invokes org.apache.flume.node.Application, passing the corresponding arguments.
How do you set up Flume to collect and transport logs?
You can use a script for gathering logs from the specified location:
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = your script here
Instead of a Windows script you can also launch a Java application (put 'java path_to_main_class arguments' in that field) that provides smarter log collection. For example, if the file is modified in real time you can use Tailer from Apache Commons IO, as sketched below.
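For instance, a minimal sketch of tailing a growing log file with Commons IO's Tailer (the file path and poll delay are placeholders); each new line could then be forwarded into the Flume/Kafka pipeline:

import java.io.File
import org.apache.commons.io.input.{Tailer, TailerListenerAdapter}

// React to every new line appended to the log file.
val listener = new TailerListenerAdapter {
  override def handle(line: String): Unit = println(line)   // forward the line to Flume/Kafka here
}

// Poll the file every second, starting from its current end.
val tailer = Tailer.create(new File("/var/log/app/app.log"), listener, 1000L, true)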
To configure Flume to transport the log information, read this article.
3. Get the Flume stream from your source code and analyze it with Spark.
Take a look at the code sample on GitHub: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java
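For orientation, a minimal Scala sketch of the Spark side, equivalent to the linked Java example (the host, port and batch interval are assumptions, and the Flume agent would need an avro sink pointed at that host/port instead of the file_roll sink above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val sparkConf = new SparkConf().setAppName("FlumeLogAnalysis").setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Spark starts an Avro receiver on this host/port; the Flume agent's avro sink pushes events to it.
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 9999)

// Each record is a SparkFlumeEvent; the raw log line is in the event body.
flumeStream.map(event => new String(event.event.getBody.array()))
  .print()

ssc.start()
ssc.awaitTermination()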
You can use Apache Kafka as a queuing system for your logs. The system that generates your logs, e.g. a web server, will send the logs to Apache Kafka. Then you can use Apache Storm or the Spark Streaming library to read from the Kafka topic and process the logs in real time.
You need to create a stream of logs, which you can do using Apache Kafka. There are integrations available for Kafka with both Storm and Apache Spark; each has its pros and cons.
For Storm Kafka Integration look here
For Apache Spark Kafka Integration take a look here
Although this is an old question, I am posting a link from Databricks, which has a great step-by-step article on log analysis with Spark covering many areas.
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/index.html
Hope this helps.
