Problem:
Flink task manager reports: apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
Deployment overview:
A Java project to try out Stateful Functions. The streaming app reads messages from Kafka, processes them, and sends the final result to a Kafka egress.
Deployed on Azure:
Azure Event Hubs (Kafka endpoint) as ingress and egress
Azure Kubernetes Service for the k8s deployment
Azure Data Lake Storage Gen2 as checkpoint storage
The deployment looks good: the job manager and task manager have been launched, but then the task fails to run with the exception above.
Diagnostics:
I created a simple Java consumer with an identical Kafka config, just with a different consumer group. The Java app works well both on my laptop and in AKS (deployed in the same namespace as the Stateful Functions app), so I conclude that the Event Hub and my Kafka config are both fine.
I checked the task manager log (kubectl logs xxx), and the Kafka properties have been loaded correctly. sasl.jaas.config shows as "sasl.jaas.config = [hidden]", but I assume this is by design.
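For anyone checking the same thing: the effective consumer settings usually show up in the task manager log as a "ConsumerConfig values:" block, so something like the following (pod name and namespace are placeholders) will surface them:

kubectl logs <taskmanager-pod-name> -n <namespace> | grep -A 20 "ConsumerConfig values"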
My Kafka Settings:
I'm using the following config:
kind: io.statefun.kafka.v1/ingress
spec:
  id: io.streaming/eventhub-ingress
  address: xxxx.servicebus.windows.net:9093
  consumerGroupId: group-receiver-00
  startupPosition:
    type: group-offsets
  topics:
    - topic: streaming-topic-rec-32
      valueType: streaming.types/rec
      targets:
        - streaming.fns/bronze_rec
    - topic: streaming-topic-eng-32
      valueType: streaming.types/eng
      targets:
        - streaming.fns/bronze_eng
  properties:
    - request.timeout.ms: 60000
    - security.protocol: SASL_SSL
    - sasl.mechanism: PLAIN
    - sasl.jaas.config: org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<primary connection string of the Event Hubs namespace>";
Can anyone help me with this? Thank you!
Resolved after reducing the number of task manager replicas. No config was changed.
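For context, the fix was just scaling the task manager Deployment down, along the lines of the command below (the Deployment and namespace names are placeholders, not taken from my manifests):

kubectl scale deployment <taskmanager-deployment> --replicas=1 -n <namespace>

My working assumption (not confirmed in the logs) is that the extra consumers were hitting Event Hubs connection or throttling limits, which surfaced as the metadata timeout.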
Related
I have a test Flink app that I am trying to run on Azure Kubernetes, connected to Azure Storage. In my Flink app I have configured the following:
Configuration cfg = new Configuration();
cfg.setString("fs.azure.account.key.<storage-account>.blob.core.windows.net", "<access-key>");
FileSystem.initialize(cfg, null);
I have also enabled checkpointing as follows:
env.enableCheckpointing(10000);
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
env.setStateBackend(new EmbeddedRocksDBStateBackend());
env.getCheckpointConfig().setCheckpointStorage("wasbs://<container>@<storage-account>.blob.core.windows.net/checkpoint/");
The storage account has been created on the Azure Portal. I have used the Access Key in the code above.
When I deploy the app to Kubernetes, the JobManager runs and creates the checkpoint folder in the Azure Storage container; however, the size of the block blob data is always 0 B. The app also continuously throws this exception.
The fun error I am getting is:
Caused by: org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.azure.AzureException: No credentials found for account <storage-account>.blob.core.windows.net in the configuration, and its container <container> is not accessible using anonymous credentials. Please check if the container exists first. If it is not publicly available, you have to provide account credentials.
org.apache.flink.fs.azure.shaded.com.microsoft.azure.storage.StorageException: Public access is not permitted on this storage account
The part that has been scratching my head (apart from the fleas) is the fact that it does create the checkpoint folders and files and continues to create further checkpoints.
This account is not publicly accessible and company policy has restricted enabling public access.
I also tried using the flink-conf.yaml and this was my example:
state.backend: rocksdb
state.checkpoints.dir: wasbs://<container>@<storage-account>.blob.core.windows.net/checkpoint/
fs.azure.account.key.flinkstorage.blob.core.windows.net: <access-key>
fs.azure.account.key.<storage-account>.blob.core.windows.net: <access-key>
I tried both account.key options above. I tried with the wasb protocol as well. I also tried rotating the access keys on Azure Storage, all resulting in the same errors.
I eventually got this working by moving all of my checkpointing configuration to flink-conf.yaml. All references to checkpointing were removed from my code, i.e. from the StreamExecutionEnvironment.
My flink-conf.yaml looks like this:
execution.checkpointing.interval: 10s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb
state.checkpoints.dir: wasbs://<container>@<storage-account>.blob.core.windows.net/checkpoint/
# azure storage access key
fs.azure.account.key.psbombb.blob.core.windows.net: <access-key>
Checkpoints are now being written to Azure Storage with the size of the metadata files no longer 0B.
I deployed my Flink cluster to Kubernetes as follows with Azure Storage plugins enabled:
./bin/kubernetes-session.sh -Dkubernetes.cluster-id=<cluster-name> -Dkubernetes.namespace=<your-namespace> -Dcontainerized.master.env.ENABLE_BUILT_IN_PLUGINS=flink-azure-fs-hadoop-1.14.0.jar -Dcontainerized.taskmanager.env.ENABLE_BUILT_IN_PLUGINS=flink-azure-fs-hadoop-1.14.0.jar
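As a side note (an assumption on my part, not something I tested in this exact setup), the same keys that live in flink-conf.yaml can also be passed as -D dynamic properties when starting the session cluster, for example:

./bin/kubernetes-session.sh -Dkubernetes.cluster-id=<cluster-name> -Dkubernetes.namespace=<your-namespace> -Dstate.backend=rocksdb -Dstate.checkpoints.dir=wasbs://<container>@<storage-account>.blob.core.windows.net/checkpoint/ -Dfs.azure.account.key.<storage-account>.blob.core.windows.net=<access-key>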
I then deployed the job to the Flink cluster as follows:
./bin/flink run --target kubernetes-session -Dkubernetes.namespace=<your-namespace> -Dkubernetes.cluster-id=<cluster-name> ~/path/to/project/<your-jar>.jar
The TaskManager view in the WebUI will not show stdout logs; you'll need to run kubectl logs -f <taskmanager-pod-name> -n <your-namespace> to see the job logs.
Remember to port-forward 8081 if you want to see the Flink WebUI:
kubectl port-forward svc/<cluster-name> 8081 -n <namespace>
e.g. http://localhost:8081
If you're using Minikube and you wish to access the cluster through the Flink LoadBalancer external IP you need to run minikube tunnel
e.g. http://<external-ip>:8081
How to send logs (stdout / stderr) from all container pods in Azure Kubernetes to the Event Hub?
I am able to see all logs under Log Analytics workspaces >> Logs using the Azure query language.
I want to send all logs to the Event Hub.
Can anyone suggest how to do this?
You can easily forward container logs to Event Hubs via Fluent-Bit's Kafka output.
Here is Fluent-Bit documentation for Kafka - https://docs.fluentbit.io/manual/pipeline/outputs/kafka
And here is Kafka client integration with Event Hubs - https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
This worked for me, using the Fluent Bit Kafka output to Azure Event Hubs.
td-agent-bit.conf
[INPUT]
    Name              tail
    Path              xxx.log
    Refresh_Interval  10

[OUTPUT]
    Name                       kafka
    Match                      *
    brokers                    xxx.xxx.windows.net:9093
    topics                     xxx
    rdkafka.security.protocol  SASL_SSL
    rdkafka.sasl.username      $ConnectionString
    rdkafka.sasl.password      Endpoint=sb://xxx.xxx.windows.net/;SharedAccessKeyName=xxx;SharedAccessKey=xxx
    rdkafka.sasl.mechanism     PLAIN

[OUTPUT]
    name   stdout
    match  *
Run it inside a Docker container (a must-have in my setup; otherwise the broker was reported as down or the SSL handshake failed).
docker-compose.yml
version: "3.7"
services:
  fluent-bit:
    image: fluent/fluent-bit:1.6.2
    container_name: fluentbit
    restart: always
    volumes:
      - ./td-agent-bit.conf:/fluent-bit/etc/fluent-bit.conf
      - ./xxx.log:/fluent-bit/etc/xxx.log:ro
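For AKS specifically (rather than a single mounted log file), the usual approach, which is an assumption here and not part of my original setup, is to run Fluent Bit as a DaemonSet and point the tail input at the container log directory, keeping the same Kafka [OUTPUT] as above:

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            cri               # assumes the stock parsers.conf is loaded; use docker for older runtimes
    Tag               kube.*
    Refresh_Interval  10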
I'm using the Python 3.8 SDK for Azure Service Bus (azure-servicebus v0.50.3). I use the following code to send a message to a topic:
# azure-servicebus 0.50.x: the legacy HTTP-based client used below lives under
# azure.servicebus.control_client (import path assumed; adjust if your version differs)
from azure.servicebus.control_client import ServiceBusService, Message
import json

service = ServiceBusService(service_namespace,
                            shared_access_key_name=key_name,
                            shared_access_key_value=key_value)
msg = Message(json.dumps({'type': 'my_message'}))
service.send_topic_message(topic_name, msg)
How do I create a Docker image that runs Service Bus with a topic or two already created? I found this image:
version: '3.7'
services:
  azure_sb:
    container_name: azure_sb
    image: microsoft/azure-storage-emulator
    tty: true
    restart: always
    ports:
      - "10000:10000"
      - "10001:10001"
      - "10002:10002"
but I'm unclear on how to connect to it using the code I have, or whether the above is even a valid Service Bus image.
Azure Service Bus does not provide a docker image. The image that you are using (microsoft/azure-storage-emulator) is for the Azure Storage system, which can provide similar queuing capabilities with Azure Storage Queues. For more details check out How to use Azure Queue storage from Python.
If you need to use Azure Service Bus locally, check out the GitHub Issue: Local Development story?. TLDR: Use AMQP libraries and connect to another AMQP provider for local, and swap out for Service Bus in production.
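A minimal sketch of that swap-out idea, assuming RabbitMQ (via pika) as the local AMQP provider and reusing the legacy Service Bus client from the question; the make_sender factory and the SB_* environment variables are purely illustrative:

import json
import os

def make_sender(topic_name):
    # Hypothetical factory: pick a publisher based on the environment (illustration only).
    if os.environ.get("USE_LOCAL_AMQP") == "1":
        import pika  # local dev: RabbitMQ standing in as the AMQP provider (assumption)
        channel = pika.BlockingConnection(
            pika.ConnectionParameters(host="localhost")).channel()
        channel.exchange_declare(exchange=topic_name, exchange_type="topic", durable=True)
        return lambda payload: channel.basic_publish(
            exchange=topic_name, routing_key="events", body=json.dumps(payload))
    # Production: Azure Service Bus via the legacy client shown in the question.
    from azure.servicebus.control_client import ServiceBusService, Message
    service = ServiceBusService(os.environ["SB_NAMESPACE"],
                                shared_access_key_name=os.environ["SB_KEY_NAME"],
                                shared_access_key_value=os.environ["SB_KEY_VALUE"])
    return lambda payload: service.send_topic_message(topic_name, Message(json.dumps(payload)))

send = make_sender("my_topic")
send({"type": "my_message"})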
I am trying to use Spark on Kubernetes. The idea is to use spark-submit against a k8s cluster which is running the Prometheus Operator. I know that the Prometheus Operator can respond to ServiceMonitor YAML, but I am confused about how to provide some of the things required in the YAML using spark-submit.
Here is the YAML:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sparkloads-metrics
  namespace: runspark
spec:
  selector:
    matchLabels:
      app: runspark
  namespaceSelector:
    matchNames:
      - runspark
  endpoints:
    - port: 8192   # ---> How do I provide the name for this port using spark-submit?
      interval: 30s
      scheme: http
You cannot provide additional ports and their names to the Service created by SparkSubmit yet (as of Spark v2.4.4). This may change in later versions.
What you can do is create an additional Kubernetes Service (a Spark monitoring Service, e.g. of type ClusterIP) per Spark job after the job submission with SparkSubmit, for instance by running spark-submit ... && kubectl apply ..., or by using any of the available Kubernetes clients with the language of your choice.
Note that you can use Kubernetes OwnerReference to configure automatic Service deletion/GC on Spark Driver Pod deletion.
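A minimal sketch of such a monitoring Service, assuming the default spark-role: driver label on the driver Pod and the 8192 metrics port from the question (adjust the selector to your job's labels):

apiVersion: v1
kind: Service
metadata:
  name: spark-metrics
  namespace: runspark
  labels:
    k8s-app: spark-metrics    # label the ServiceMonitor below selects on
spec:
  type: ClusterIP
  selector:
    spark-role: driver        # assumption: label Spark sets on the driver Pod
  ports:
    - name: metrics           # port name referenced by the ServiceMonitor
      port: 8192
      targetPort: 8192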
Then you can supply the ServiceMonitors via the Prometheus Operator Helm values:
prometheus:
  additionalServiceMonitors:
    - name: spark-metrics           # <- Spark monitoring Service name
      selector:
        matchLabels:
          k8s-app: spark-metrics    # <- Spark monitoring Service label
      namespaceSelector:
        any: true
      endpoints:
        - interval: 10s
          port: metrics             # <- Spark monitoring Service port name
Be aware that Spark doesn't provide a way to customize Spark Pods yet, so the Pod ports which should expose metrics are not exposed at the Pod level and won't be accessible via the Service. To overcome this you can add an additional EXPOSE ... 8088 statement in the Dockerfile and rebuild the Spark image.
This guide should help you to set up Spark monitoring with a PULL strategy, using for example the JMX Exporter.
There is an alternative (recommended only for short-running Spark jobs, but you can try it in your environment if you do not run huge workloads):
Deploy Prometheus Pushgateway and integrate it with your Prometheus Operator
Configure Spark Prometheus Sink
By doing that your Spark Pods will PUSH metrics to the Gateway, and Prometheus will in turn PULL them from the Gateway.
You can refer to the Spark Monitoring Helm chart example with the Prometheus Operator and Prometheus Pushgateway combined.
Hope it helps.
I have deployed 5 apps using Azure Container Instances. These are working fine; the issue is that currently all containers are running all the time, which gets expensive.
What I want to do is start/stop instances when required, using for this a master container or VM that will be running all the time.
E.G.
This master service gets a request to spin up service number 3 for 2 hours, then shut it down, and all other containers stay off until they receive a similar request.
For my use case, each service will be used for less than 5 hours a day most of the time.
Now, I know Kubernetes is an engine made to manage containers, but all examples I have found are for high-scale services, not for 5 services with only one container each; I'm also not sure whether Kubernetes allows keeping all the containers off most of the time.
What I was thinking of is handling all this through some API, but I'm not finding any service in Azure that allows something similar to this; I have only found options to create new containers, not to start and stop them.
EDIT:
Also, these apps run processes that are too heavy to host on a serverless platform.
A solution is to define a Horizontal Pod Autoscaler for your deployment.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can’t be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.
The configuration file should look like this:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-images-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: example-deployment
  minReplicas: 2
  maxReplicas: 100
  targetCPUUtilizationPercentage: 75
scaleTargetRef should refer to your deployment definition, and you can set minReplicas as low as you need (scaling all the way to zero requires the HPAScaleToZero feature gate); targetCPUUtilizationPercentage you can set according to your preferences. Such an approach should help you save money, since the number of pods is reduced when CPU utilization drops.
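A quick usage sketch (the file name is an assumption): apply the manifest and watch the autoscaler's status:

kubectl apply -f hpa-images-service.yaml
kubectl get hpa hpa-images-service --watch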
Kubernetes official documentation: kubernetes-hpa.
GKE autoscaler documentation: gke-autoscaler.
Useful blog about saving cash using GCP: kubernetes-google-cloud.