Spark structured streaming Elasticsearch integration issue - apache-spark

I am writing a Spark Structured Streaming application in which data processed with Spark needs to be written to Elasticsearch.
This is my development environment:
Hadoop 2.6.0-cdh5.16.1
Spark version 2.3.0.cloudera4
elasticsearch 6.8.0
I started spark-shell with:
spark2-shell --jars /tmp/elasticsearch-hadoop-2.3.2/dist/elasticsearch-hadoop-2.3.2.jar
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}
import java.util.Calendar
import org.elasticsearch.spark.sql._
import sys.process._
val checkPointDir = "/tmp/rt/checkpoint/"
val spark = SparkSession.builder
  .config("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .config("fs.s3n.awsAccessKeyId", "aaabbb")
  .config("fs.s3n.awsSecretAccessKey", "aaabbbccc")
  .config("spark.sql.streaming.checkpointLocation", checkPointDir)
  .config("es.index.auto.create", "true")
  .getOrCreate()
import spark.implicits._
// All columns are read as strings; the schema is built from the ordered list of field names
val requestFields = Seq(
  "log_type", "time_stamp", "host_name", "data_center", "build", "ip_trace", "client_ip",
  "protocol", "latency", "status", "response_size", "request_id", "user_id", "pageview_id",
  "impression_id", "source_impression_id", "rnd", "publisher_id", "site_id", "zone_id",
  "slot_id", "tile", "content_id", "post_id", "postgroup_id", "brand_id", "provider_id",
  "geo_country", "geo_region", "geo_city", "geo_zip_code", "geo_area_code", "geo_dma_code",
  "browser_group", "page_url", "document_referer", "user_agent", "cookies", "kvs", "notes",
  "request")
val requestSchema = requestFields.foldLeft(new StructType())((schema, name) => schema.add(name, StringType))
// Capture the current time once so that year, month, day and hour all come from the same instant
val now = Calendar.getInstance()
val inputPath = "s3n://aa/logs/cc.com/r/year=" + now.get(Calendar.YEAR) +
  "/month=" + "%02d".format(now.get(Calendar.MONTH) + 1) +
  "/day=" + "%02d".format(now.get(Calendar.DAY_OF_MONTH)) +
  "/hour=" + "%02d".format(now.get(Calendar.HOUR_OF_DAY)) + "/*.log"
val requestDF = spark.readStream.option("delimiter", "\t").format("com.databricks.spark.csv").schema(requestSchema).load(inputPath)
requestDF.writeStream.format("org.elasticsearch.spark.sql").option("es.resource", "rt_request/doc").option("es.nodes", "localhost").outputMode("Append").start()
I have tried the following two ways to sink the data in the Dataset to ES:
1. ds.writeStream().format("org.elasticsearch.spark.sql").start("rt_request/doc");
2. ds.writeStream().format("es").start("rt_request/doc");
In both cases I am getting the following error:
Caused by:
java.lang.UnsupportedOperationException: Data source es does not support streamed writing
java.lang.UnsupportedOperationException: Data source org.elasticsearch.spark.sql does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:320)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:293)
... 57 elided

The ES-Hadoop jar I used, elasticsearch-hadoop-2.3.2.jar, is too old; version 6 or above is needed.
I now use an elasticsearch-hadoop 6.x or newer jar, and it works as a streaming sink.
I downloaded it from https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-7.1.1.zip
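As a rough illustration of the version requirement described above, the check could be expressed as a small helper. This is only a sketch: EsHadoopJar, majorVersion and supportsStreamingSink are hypothetical names (they are not part of ES-Hadoop), and it assumes the jar follows the elasticsearch-hadoop-<major>.<minor>.<patch>.jar naming seen in this post.

```scala
object EsHadoopJar {
  // Assumed naming scheme: elasticsearch-hadoop-<major>.<minor>.<patch>.jar
  private val JarName = """elasticsearch-hadoop-(\d+)\.\d+\.\d+\.jar""".r

  // Extracts the major version from a jar path, or None if the file name does not match.
  def majorVersion(jarPath: String): Option[Int] = {
    val fileName = jarPath.substring(jarPath.lastIndexOf('/') + 1)
    fileName match {
      case JarName(major) => Some(major.toInt)
      case _              => None
    }
  }

  // Per the answer above, version 6 or above is needed for use as a streaming sink.
  def supportsStreamingSink(jarPath: String): Boolean =
    majorVersion(jarPath).exists(_ >= 6)
}
```

Running such a check against the path passed to --jars before starting the shell would flag the 2.3.2 jar from the question, while a 7.1.1 jar would pass.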

Related

Azure HDI Spark import sqlContext.implicits._ error

I've got problems importing data from an Azure Blob storage CSV file into Spark from a Jupyter notebook. I'm working through one of the tutorials about ML and Spark. When I fill in the Jupyter notebook like this:
import sqlContext.implicits._
val flightDelayTextLines = sc.textFile("wasb://sparkcontainer@[my account].blob.core.windows.net/sparkcontainer/Scored_FlightsAndWeather.csv")
case class AirportFlightDelays(OriginAirportCode:String,OriginLatLong:String,Month:Integer,Day:Integer,Hour:Integer,Carrier:String,DelayPredicted:Integer,DelayProbability:Double)
val flightDelayRowsWithoutHeader = flightDelayTextLines.map(s => s.split(",")).filter(line => line(0) != "OriginAirportCode")
val resultDataFrame = flightDelayRowsWithoutHeader.map(
  s => AirportFlightDelays(
    s(0), //Airport code
    s(13) + "," + s(14), //Lat,Long
    s(1).toInt, //Month
    s(2).toInt, //Day
    s(3).toInt, //Hour
    s(5), //Carrier
    s(11).toInt, //DelayPredicted
    s(12).toDouble //DelayProbability
  )
).toDF()
resultDataFrame.write.mode("overwrite").saveAsTable("FlightDelays")
I receive an error like this:
SparkSession available as 'spark'.
<console>:23: error: not found: value sqlContext
import sqlContext.implicits._
^
I tried shorter paths as well, such as ("wasb:///sparkcontainer/Scored_FlightsAndWeather.csv"), with the same error.
Any ideas?
BR,
Marek
Your code snippet never creates sqlContext. Create it as in the following code, and then import its implicits:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

Difference between spark_session and sqlContext on loading a local file

I tried to load a local file as a DataFrame using both spark_session and sqlContext.
df = spark_session.read...load(localpath)
It couldn't read local files; df is empty.
But after creating sqlContext from spark_context, it could load a local file.
sqlContext = SQLContext(spark_context)
df = sqlContext.read...load(localpath)
It worked fine, but I can't understand why. What is the cause?
Environment: Windows 10, Spark 2.2.1
EDIT
I finally resolved this problem. The root cause was a version mismatch between the PySpark installed with pip and the PySpark installed on the local file system. PySpark failed to start because py4j was failing.
I am pasting some sample code that might help. We have used this to create a SparkSession object and read a local file with it:
import org.apache.spark.sql.SparkSession
object SetTopBox_KPI1_1 {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("SetTopBox Data Analysis <Input-File> OR <Output-File> is missing")
      System.exit(1)
    }
    val spark = SparkSession.builder().appName("KPI1_1").getOrCreate()
    val record = spark.read.textFile(args(0)).rdd
.....
On the whole, in Spark 2.2 the preferred way to use Spark is by creating a SparkSession object.

What if OOM happens for the complete output mode in spark structured streaming

I am new to Spark Structured Streaming and still learning it.
I have the following code, which uses complete as the output mode:
import java.util.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("StreamingWordCount")
      .config("spark.sql.shuffle.partitions", 1)
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._
    val lines = spark
      .readStream
      .schema(new StructType().add("value", "string"))
      .option("maxFilesPerTrigger", 1)
      .text("file:///" + data_path)
      .as[String]
    val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()
    val query = wordCounts.writeStream
      .queryName("t")
      .outputMode("complete")
      .format("memory")
      .start()
    while (true) {
      spark.sql("select * from t").show(truncate = false)
      println(new Date())
      Thread.sleep(1000)
    }
    query.awaitTermination()
  }
}
My question: over time, the Spark runtime has to remember more and more word and count state, so an OOM should eventually happen.
How is this kind of scenario handled in practice?
The memory sink should be used only for debugging on low data volumes, because the entire output is collected and stored in the driver’s memory as an in-memory table.
So if an OOM error occurs, the driver will crash and all the state held in the driver's memory will be lost.
The same applies to the console sink as well.

how to use a whole hive database in spark and read sql queries from external files?

I am using the Hortonworks sandbox in Azure with Spark 1.6.
I have a Hive database populated with TPC-DS sample data. I want to read some SQL queries from external files and run them on the Hive dataset in Spark.
I followed the topic "Using hive database in spark", but it uses only a single table from my dataset and writes the SQL query inline in the Spark code. I need to define the whole dataset as my source and query against it. I think I should use DataFrames, but I am not sure and do not know how.
I also want to import the SQL queries from an external .sql file rather than writing them out again.
Could you please guide me on how to do this?
thank you very much,
bests!
Spark can read data directly from Hive tables. You can create and drop Hive tables using Spark, and you can run any HiveQL operation through Spark. For this you need to use Spark's HiveContext.
From the Spark documentation:
Spark HiveContext, provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup.
For more information you can visit Spark Documentation
To avoid writing SQL in code, you can put all your Hive queries in a properties file and then reference them by key in your code.
Below is an implementation that uses Spark HiveContext and a properties file in Scala.
package com.spark.hive.poc
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}
object ReadPropertyFiles extends Serializable {
  val conf = new SparkConf().setAppName("read local file")
  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new HiveContext(sc)
  def main(args: Array[String]): Unit = {
    // Load the SQL statements from the properties file passed as args(0)
    val fileSystem = FileSystem.get(new org.apache.hadoop.conf.Configuration())
    val inputStream = fileSystem.open(new Path(args(0)))
    val properties = new java.util.Properties
    properties.load(inputStream)
    inputStream.close()
    // Create an RDD
    val people = sc.textFile("/user/User1/spark_hive_poc/input/")
    // The schema is encoded in a string
    val schemaString = "name address"
    // Generate the schema based on the string of schema
    val schema = StructType(
      schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    // Convert records of the RDD (people) to Rows
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
    // Apply the schema to the RDD
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()
    peopleDataFrame.registerTempTable("tbl_temp")
    val data = sqlContext.sql(properties.getProperty("temp_table"))
    // Drop Hive table
    sqlContext.sql(properties.getProperty("drop_hive_table"))
    // Create Hive table
    sqlContext.sql(properties.getProperty("create_hive_tavle"))
    // Insert data into Hive table
    sqlContext.sql(properties.getProperty("insert_into_hive_table"))
    // Select data from Hive table
    sqlContext.sql(properties.getProperty("select_from_hive")).show()
    sc.stop()
  }
}
Entries in the properties file:
temp_table=select * from tbl_temp
drop_hive_table=DROP TABLE IF EXISTS default.test_hive_tbl
create_hive_tavle=CREATE TABLE IF NOT EXISTS default.test_hive_tbl(name string, city string) STORED AS ORC
insert_into_hive_table=insert overwrite table default.test_hive_tbl select * from tbl_temp
select_from_hive=select * from default.test_hive_tbl
spark-submit command to run this job:
[User1@hadoopdev ~]$ spark-submit --num-executors 1 \
--executor-memory 100M --total-executor-cores 2 --master local \
--class com.spark.hive.poc.ReadPropertyFiles Hive-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
/user/User1/spark_hive_poc/properties/sql.properties
Note: The properties file location should be an HDFS location.
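The key-based lookup described above can be tried in isolation, without Spark or HDFS. The following is a minimal sketch, not the answer's own code: SqlProperties, load and query are hypothetical names, and the properties text is passed in as a string rather than read from a file. It also fails with a clear message when a key is missing, since getProperty would otherwise return null.

```scala
import java.io.StringReader
import java.util.Properties

object SqlProperties {
  // Parses properties-file text (key=value lines) into a Properties object.
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  // Looks up a SQL statement by key; throws if the key is not defined
  // instead of handing a null query string to the caller.
  def query(props: Properties, key: String): String =
    Option(props.getProperty(key)).getOrElse(
      throw new NoSuchElementException(s"No SQL statement configured for key '$key'"))
}
```

With the entries shown earlier, SqlProperties.query(props, "temp_table") would return the select statement that is then passed to sqlContext.sql.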

error: value cassandraTable is not a member of org.apache.spark.SparkContext

I want to access a Cassandra table in Spark. Below are the versions that I am using:
spark: spark-1.4.1-bin-hadoop2.6
cassandra: apache-cassandra-2.2.3
spark cassandra connector: spark-cassandra-connector-java_2.10-1.5.0-M2.jar
Below is the script:
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val test_spark_rdd = sc.cassandraTable("test1", "words")
When I run the last statement I get an error:
:32: error: value cassandraTable is not a member of
org.apache.spark.SparkContext
val test_spark_rdd = sc.cassandraTable("test1", "words")
Hints to resolve the error would be helpful.
Thanks
In the shell you only need to import the relevant packages; nothing extra is required.
e.g. scala> import com.datastax.spark.connector._
