I am trying to write data into Couchbase using Structured Streaming. I referred to the Spark Connector 2.0 documentation and declared the Couchbase connection in the SparkSession, but I am not able to connect to Couchbase. Below is the example I tried. Any help will be appreciated, thanks in advance.
SparkSession spark = SparkSession
.builder()
.appName("couch")
.master("local[*]")
.config("spark.couchbase.nodes", "127.0.0.1")
.config("spark.couchbase.bucket.mybucket", "123")
.config("com.couchbase.username", "Administrator")
.config("com.couchbase.password", "password")
.getOrCreate();
Dataset<Row> df = ... // Dataset reading data from Kafka
StreamingQuery query = df.writeStream()
.outputMode("complete")
.option("checkpointLocation", "mycheckpointlocation")
.option("idField", "value")
.format("com.couchbase.spark.sql")
.start();
query.awaitTermination();
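For comparison, here is a minimal Scala sketch of the same pipeline, reusing the connection settings and the com.couchbase.spark.sql sink options from the snippet above. The Kafka broker address, topic name, JSON schema and the "id" column passed to idField are placeholders for illustration, not details from the original setup, and the sink behavior should be checked against the connector version in use.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("couch")
  .master("local[*]")
  .config("spark.couchbase.nodes", "127.0.0.1")
  .config("spark.couchbase.bucket.mybucket", "123")
  .config("com.couchbase.username", "Administrator")
  .config("com.couchbase.password", "password")
  .getOrCreate()

// Assumed JSON layout of the Kafka message value; adjust to the real messages.
val schema = new StructType().add("id", StringType).add("payload", StringType)

val docs = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "mytopic")                      // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("doc"))
  .select("doc.*")

val query = docs.writeStream
  .outputMode("append") // no aggregation here, so "complete" mode would be rejected
  .option("checkpointLocation", "mycheckpointlocation")
  .option("idField", "id") // column whose value becomes the Couchbase document id
  .format("com.couchbase.spark.sql")
  .start()

query.awaitTermination()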
What's the best way to read, transform and write streams with different structures?
I have an application with two read streams and two write streams from two different Kafka topics. The two streams have different structures and go through different transformations. When writing the individual DataFrames to the console, I see only one stream in the console output instead of two:
val df1 = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribe", "topic-event-1")
.load()
val df2 = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribe", "topic-event-2")
.load()
eventOneStream(df1)
eventTwoStream(df2)
def eventOneStream(dataframe: DataFrame): Unit = {
// do some transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination()
}
def eventTwoStream(dataframe: DataFrame): Unit = {
// do some other type of transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination()
}
You see only one stream because the first query blocks the second one when you call awaitTermination() on it. What you can do is start both queries and then use StreamingQueryManager.awaitAnyTermination():
val query1 = eventOneStream(df1)
val query2 = eventTwoStream(df2)
spark.streams.awaitAnyTermination()
def eventOneStream(dataframe: DataFrame): StreamingQuery = {
// do some transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start()
}
def eventTwoStream(dataframe: DataFrame): StreamingQuery = {
// do some other type of transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start()
}
I am trying to read data from Kafka in Spark using .NET Core 3.1. I get a NullPointerException and cannot find the reason for it. Maybe someone has encountered this error and found a solution?
Reading from a file works.
I also tried changing the connection details to an external Kafka broker (with authentication), but I still get the same error.
The source topic exists on the broker.
Exception:
WARN KafkaOffsetReaderConsumer: Error in attempt 1 getting Kafka offsets:
java.lang.NullPointerException
at org.apache.spark.kafka010.KafkaConfigUpdater.setAuthenticationConfigIfNeeded(KafkaConfigUpdater.scala:60)
at org.apache.spark.sql.kafka010.ConsumerStrategy.setAuthenticationConfigIfNeeded(ConsumerStrategy.scala:61)
at org.apache.spark.sql.kafka010.ConsumerStrategy.setAuthenticationConfigIfNeeded$(ConsumerStrategy.scala:60)
at org.apache.spark.sql.kafka010.SubscribeStrategy.setAuthenticationConfigIfNeeded(ConsumerStrategy.scala:102)
at org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:106)
at org.apache.spark.sql.kafka010.KafkaOffsetReaderConsumer.consumer(KafkaOffsetReaderConsumer.scala:82)
at org.apache.spark.sql.kafka010.KafkaOffsetReaderConsumer.$anonfun$partitionsAssignedToConsumer$2(KafkaOffsetReaderConsumer.scala:533)
at org.apache.spark.sql.kafka010.KafkaOffsetReaderConsumer.$anonfun$withRetriesWithoutInterrupt$1(KafkaOffsetReaderConsumer.scala:578)
at ...
Code:
static void Main(string[] args)
{
SparkSession spark = SparkSession
.Builder()
.AppName("kafka_sample2")
.GetOrCreate();
var stream = spark.ReadStream()
.Format("kafka")
.Option("kafka.bootstrap.servers", "127.0.0.1:9093")
.Option("subscribe", "spark-input")
.Option("startingOffsets", "earliest")
.Option("failOnDataLoss", "false");
var dataFrame = stream.Load();
dataFrame.WriteStream()
.Format("console")
.Start();
spark.Stop();
}
Spark version:
version 3.1.2
Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_311
Branch HEAD
Compiled by user centos on 2021-05-24T04:46:13Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Command line:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp3.1\microsoft-spark-3-1_2.12-2.0.0.jar dotnet bin\Debug\netcoreapp3.1\SparkTest.dll
I found a solution: the streaming query has to be awaited before the session is stopped.
static void Main(string[] args)
{
SparkSession spark = SparkSession
.Builder()
.AppName("kafka_sample2")
.GetOrCreate();
var stream = spark.ReadStream()
.Format("kafka")
.Option("kafka.bootstrap.servers", "127.0.0.1:9093")
.Option("subscribe", "spark-input")
.Option("startingOffsets", "earliest")
.Option("failOnDataLoss", "false");
var dataFrame = stream.Load();
var query = dataFrame.WriteStream()
.Format("console")
.Start();
query.AwaitTermination(); // this call was missing; without it, execution falls through to spark.Stop() before the query processes anything
spark.Stop();
}
We are trying to write a Spark streaming application that writes to HDFS. However, whenever we write the files, lots of duplicates show up. This happens whether or not we crash the application with a kill, and with both the DStream and Structured Streaming APIs. The source is a Kafka topic. The behavior of the checkpoint directory seems very random, and I have not come across much relevant information on the issue.
The question is: can the checkpoint directory provide exactly-once behavior?
scala version: 2.11.8
spark version: 2.3.1.3.0.1.0-187
kafka version : 2.11-1.1.0
zookeeper version : 3.4.8-1
HDP : 3.1
Any help is appreciated.
Thanks,
Gautam
object sparkStructuredDownloading {
val kafka_brokers="kfk01.*.com:9092,kfk02.*.com:9092,kfk03.*.com:9092"
def main(args: Array[String]): Unit = {
var topic = args(0).trim().toString()
new downloadingAnalysis(kafka_brokers ,topic).process()
}
}
class downloadingAnalysis(brokers: String,topic: String) {
def process(): Unit = {
// try{
val spark = SparkSession.builder()
.appName("sparkStructuredDownloading")
// .appName("kafka_duplicate")
.getOrCreate()
spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
println("Application Started")
import spark.implicits._
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.streaming.Trigger
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "latest")
//.option("kafka.group.id", "testduplicate")
.load()
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)") //Converting binary to text
println("READ STREAM INITIATED")
val filteredDF = personJsonDf.filter(line => new ParseLogs().validateLogLine(line.get(0).toString()))
spark.sqlContext.udf.register("parseLogLine", (logLine: String) => {
new ParseLogs().parseLogLine(logLine) // assumed parser call; the actual UDF body is not shown in the question
})
val result = filteredDF.selectExpr("parseLogLine(value) as result")
println(result.schema)
println("WRITE STREAM INITIATED")
val checkpoint_loc = "/warehouse/test_duplicate/download/checkpoint1"
val kafkaOutput = result.writeStream
.outputMode("append")
.format("orc")
.option("path", "/warehouse/test_duplicate/download/data1")
.option("checkpointLocation", checkpoint_loc) // a file sink needs an explicit checkpoint location
.option("maxRecordsPerFile", 10)
.trigger(Trigger.ProcessingTime("10 seconds"))
.start()
kafkaOutput.awaitTermination()
}
}
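For what it is worth, one common pattern for this kind of duplicate problem (a sketch only, not necessarily the fix for this particular setup) is to de-duplicate on a stable record key before the file sink, so that Kafka records replayed after a restart are dropped using state kept under the checkpoint location. The key column below reuses the parsed "result" column from the snippet above purely for illustration; a real unique id from the log line would normally be used.

import org.apache.spark.sql.streaming.Trigger

// Sketch: assumes the parsed "result" column can serve as a unique key per record.
val deduped = result.dropDuplicates("result")

deduped.writeStream
  .outputMode("append")
  .format("orc")
  .option("path", "/warehouse/test_duplicate/download/data1")
  .option("checkpointLocation", checkpoint_loc) // Kafka offsets and dedup state are stored here
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
  .awaitTermination()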
I have this code:
import org.apache.spark.sql.SparkSession
object TopicIngester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[*]") // remove this later
.appName("Ingester")
.getOrCreate()
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "sandbox-hdp.hortonworks.com:6667" /*my.cluster.com:9092 in case of cloudera*/)
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
.write
.parquet("/user/maria_dev/test")
spark.stop()
}
}
When I run it in the Hortonworks sandbox, everything works fine. All available data is read from the test topic and saved into the /user/maria_dev/test folder.
I also have a topic with the same name on my Cloudera cluster, and for some reason it gets stuck at .parquet("/path/to/folder") and never finishes, as if it were waiting for more data forever.
What could be the problem?
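A batch read with explicit startingOffsets/endingOffsets should not wait for new data, so one possibility is that the driver simply cannot reach the Cloudera brokers and is stuck fetching topic metadata or offsets. A quick reachability check from the machine running the job, using the plain Kafka AdminClient (the bootstrap address below is a placeholder):

import java.util.Properties
import java.util.concurrent.TimeUnit
import org.apache.kafka.clients.admin.AdminClient

val props = new Properties()
props.put("bootstrap.servers", "my.cluster.com:9092") // placeholder: the Cloudera broker address
val admin = AdminClient.create(props)
try {
  // Fails with a timeout instead of hanging forever if the brokers are unreachable from this host.
  val topics = admin.listTopics().names().get(15, TimeUnit.SECONDS)
  println(s"Visible topics: $topics")
} finally {
  admin.close()
}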
I am trying to use Spark Structured Streaming with Kinesis. The following is my code.
val kinesis = spark
.readStream
.format("kinesis")
.option("streams", streamName)
.option("region", "us-east-1")
.option("initialPosition", "TRIM_HORIZON")
.option("endpointUrl", "kinesis.us-east-1.amazonaws.com")
.option("awsAccessKey", accessKey)
.option("awsSecretKey", secretKey)
.option("format", "json")
.option("inferSchema", "true")
.schema(schema)
.load
val query = kinesis.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
I also exported the access key and secret key as environment variables, but when I execute it, I get the following exception:
ERROR StreamExecution: Query [id = 45b3e45a-6a3b-48c6-9183-8531f85ea537, runId = ab5bbfd6-d9f7-455c-8065-fa65de932914] terminated with error
com.amazonaws.services.kinesis.model.ResourceNotFoundException: Stream dev-rt-stream-ds-5 under account 438156723281 not found. (Service: AmazonKinesis; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: f1b1384c-2528-e50f-a358-2e59f9ad3b45)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
The following code returns the correct number of shards:
val credentials = new BasicAWSCredentials(accessKey, secretKey)
val kinesisClient = new AmazonKinesisClient(credentials)
kinesisClient.setEndpoint("kinesis.us-east-1.amazonaws.com")
val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size
println("num shards => " + numShards)