I have a Spark application (Spark 2.3.2) which runs multiple structured streams.
The problem is that they all log to the same file.
Is there a way to set a separate log file per stream?
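For context, this is roughly the shape of such an application, as a minimal sketch in Scala; topic names, output paths and query names are hypothetical. Naming each query with queryName at least tags its progress output, but all of the queries below still share the one driver log, which is the problem described.

import org.apache.spark.sql.SparkSession

// Minimal sketch: one application running several structured streams.
val spark = SparkSession.builder().appName("multi-stream-app").getOrCreate()

def startStream(topic: String, name: String) = {
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", topic)
    .load()

  df.writeStream
    .queryName(name)                              // the name shows up in progress logging
    .format("parquet")
    .option("path", s"/data/out/$name")
    .option("checkpointLocation", s"/data/chk/$name")
    .start()
}

startStream("topicA", "streamA")
startStream("topicB", "streamB")
spark.streams.awaitAnyTermination()               // both queries log into the same driver log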
Related
Hi, I am trying to use Kafka as a log aggregation and filtering layer so the logs can be fed into Splunk, for example.
The input side of Kafka will be Kafka S3 connectors and other connectors pulling logs from S3 and Amazon Kinesis Data Streams.
What I want to know is: inside the Kafka pipeline, is it necessary to use Spark jobs for the processing or filtering, or can that be done with a simple Kafka Streams app? And if we have to repeat this design for several different logs, what would be an efficient way to implement it? I am looking for a solution we can replicate across different log streams without major changes each time.
Thank you
Spark (or Flink) can essentially replace Kafka Streams and Kafka Connect for transforming topics and writing to S3.
If you want to write directly to Splunk, there is a Kafka Connect connector written explicitly for that, and you could use any Kafka client to consume and produce the processed data before writing it downstream.
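To illustrate the "simple Kafka Streams app" option, here is a minimal filtering sketch in Scala using the Kafka Streams Scala DSL (2.x import paths); the topic names, application id and the ERROR/WARN predicate are hypothetical placeholders for whatever filtering you need before a Splunk sink connector picks up the output topic.

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

object LogFilterApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-filter")          // hypothetical app id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Read raw log lines, keep only ERROR/WARN entries, and write them to a
  // filtered topic that a Splunk sink connector can consume.
  builder.stream[String, String]("raw-logs")
    .filter((_, line) => line.contains("ERROR") || line.contains("WARN"))
    .to("filtered-logs")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}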
I am using Spark Structured Streaming: I read a stream from Kafka and, after some transformation, write the resulting stream back to Kafka.
I see a lot of hidden ..*tmp.crc files within my checkpoint directory. These files are not getting cleaned up and keep growing in number.
Am I missing some configuration?
I am not running Spark on Hadoop; I am using an EBS-based volume for checkpointing.
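For reference, a minimal sketch of the kind of pipeline described, with hypothetical topic names, a placeholder transformation and an assumed EBS mount point for the checkpoint; the .crc files are presumably the checksum files Hadoop's local filesystem implementation writes next to the checkpoint files.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("kafka-to-kafka").getOrCreate()
import spark.implicits._

// Kafka source -> simple transformation -> Kafka sink, checkpointing to a local (EBS) path.
val transformed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .withColumn("value", upper($"value"))           // placeholder transformation

transformed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/mnt/ebs/checkpoints/kafka-to-kafka")
  .start()
  .awaitTermination()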
I am using Spark Streaming to do analysis. After the analysis I have to save the Kafka messages to HDFS. Each Kafka message is an XML file. I can't use rdd.saveAsTextFile because it would save the whole RDD. Each element of the RDD is a Kafka message (an XML file). How do I save each RDD element (file) to HDFS using Spark?
I would go about this a different way. Stream your transformed data back into Kafka, and then use the HDFS connector for Kafka Connect to stream the data to HDFS. Kafka Connect is part of Apache Kafka. The HDFS connector is open source and available standalone or as part of Confluent Platform.
Doing it this way, you decouple your processing from writing your data to HDFS, which makes it easier to manage, troubleshoot and scale.
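A rough sketch of the "stream it back into Kafka" step in Scala, assuming the analysed XML messages end up in a DStream[String]; the broker address and topic name are hypothetical, and the Kafka Connect HDFS sink then moves that topic into HDFS.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Write each analysed XML message back to Kafka, one producer per partition.
def writeBackToKafka(analysedXml: DStream[String], topic: String): Unit =
  analysedXml.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      records.foreach(xml => producer.send(new ProducerRecord[String, String](topic, xml)))
      producer.close()
    }
  }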
I'm building a Spark Streaming pipeline from Apache Kafka to our columnar database.
To ensure fault tolerance I'm using HDFS checkpointing and a write-ahead log.
Apache Kafka topic -> Spark Streaming -> HDFS checkpoint -> Spark SQL (for message manipulation) -> Spark JDBC to our DB.
When I use a Spark job for one topic and one table, everything works fine.
When I try to stream multiple Kafka topics in one Spark job and write to multiple tables, the problems start with the checkpoint (which is per topic/table).
The problem is with the checkpoints :(
1) Use "KafkaUtils.createDirectStream" with a list of topics and "groupBy" topic name (roughly as sketched after this list). But there is only one checkpoint folder, and if, for example, I need to increase resources during the ongoing streaming (change the number of cores due to Kafka lag), today that is only possible by deleting the checkpoint folder and restarting the Spark job.
2) Use multiple Spark StreamingContexts; I will try this today and see if it works.
3) Use multiple Spark Streaming jobs with high-level consumers (offsets saved in Kafka, 0.10...).
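For reference, option 1 looks roughly like the sketch below; topic names, batch interval, group id and checkpoint path are hypothetical, and the per-topic Spark SQL manipulation and JDBC write are elided.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("multi-topic-stream")
val ssc  = new StreamingContext(conf, Seconds(30))
ssc.checkpoint("hdfs:///checkpoints/multi-topic-stream")   // one checkpoint folder for the whole job

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "multi-topic-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Seq("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)

stream.foreachRDD { rdd =>
  topics.foreach { topic =>
    val values = rdd.filter(_.topic == topic).map(_.value)
    // ... Spark SQL manipulation and JDBC write for this topic's table
  }
}

ssc.start()
ssc.awaitTermination()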
Are there any other ideas/solutions that I'm missing?
Does Structured Streaming with multiple Kafka topics and checkpoints behave differently?
Thx
I have an Apache access log file and I want to store access counts (total/daily/hourly) for each page in a Cassandra table.
I am trying to do this by using Kafka Connect to stream from the log file to a Kafka topic. Can I use Kafka Connect again to increment the metric counters in Cassandra? If not, which other tool should be used here, e.g. Kafka Streams, Spark, Flink, Kafka Connect, etc.?
You're talking about doing stream processing, which Kafka can do - either with Kafka's Streams API, or KSQL. KSQL runs on top of Kafka Streams, and gives you a very simple way to build the kind of aggregations that you're talking about.
Here's an example of doing aggregations over streams of data in KSQL:
SELECT PAGE_ID, COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID;
See more at: https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
You can take the output of KSQL, which is actually just a Kafka topic, and stream that through Kafka Connect, e.g. to Elasticsearch, Cassandra, and so on.
You mention other stream processing tools; they're valid too. It depends in part on existing skills and language preferences (e.g. Kafka Streams is a Java library, KSQL is … KSQL, Spark Streaming has Python as well as Java, etc.), but also on deployment preferences: Kafka Streams is just a Java library to deploy within your existing application, whereas KSQL is deployed as its own cluster, and so on.
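For comparison with the KSQL statement above, roughly the same hourly count expressed with the Kafka Streams Scala DSL (2.x import paths); the topic names mirror the hypothetical PAGE_CLICKS example, and the input topic is assumed to be keyed by page id.

import java.time.Duration
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

// Count clicks per page per hour and emit the rolled-up counts to another topic,
// which a sink connector (e.g. for Cassandra) can then pick up.
val builder = new StreamsBuilder()
builder.stream[String, String]("page_clicks")
  .groupByKey
  .windowedBy(TimeWindows.of(Duration.ofHours(1)))
  .count()
  .toStream
  .map((windowedPage, clicks) => (windowedPage.key(), clicks.toString))
  .to("page_click_counts_hourly")
// builder.build() is then run with new KafkaStreams(...) as in any Kafka Streams app.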
This can easily be done with Flink, either as a batch or streaming job, and either with or without Kafka (Flink can read from files and write to Cassandra). This sort of time-windowed aggregation is easy to express with Flink's SQL API; see the examples here.