I am using cloudwatch-exporter to scrape metrics from CloudWatch and expose them at localhost:9106/metrics.
The configuration for this is the following:
region: us-east-1
set_timestamp: false
metrics:
- aws_namespace: AWS/CloudFront
  aws_metric_name: TotalErrorRate
  aws_statistics: [Average]
  aws_dimensions: [DistributionId, Region]
  aws_dimensions_select:
    Region: [Global]
And I can indeed see the fetched metrics:
$> curl localhost:9106/metrics
# HELP aws_cloudfront_total_error_rate_average CloudWatch metric AWS/CloudFront TotalErrorRate Dimensions: [DistributionId, Region] Statistic: Average Unit: Percent
# TYPE aws_cloudfront_total_error_rate_average gauge
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="E1XXXXXX",} 26.666666666666668
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="EXXXXXXX",} 0.0
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="E38XXXXXX",} 0.0
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="E6XXXXXXX",} 100.0
# HELP cloudwatch_exporter_scrape_duration_seconds Time this CloudWatch scrape took, in seconds.
# TYPE cloudwatch_exporter_scrape_duration_seconds gauge
cloudwatch_exporter_scrape_duration_seconds 14.487444391
# HELP cloudwatch_exporter_scrape_error Non-zero if this scrape failed.
# TYPE cloudwatch_exporter_scrape_error gauge
cloudwatch_exporter_scrape_error 0.0
However, Prometheus does not scrape them, and outputs the following logs:
level=warn ts=2018-06-20T07:00:37.578384931Z caller=scrape.go:932 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://100.106.248.21:9106/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=24
level=warn ts=2018-06-20T07:01:36.821700134Z caller=scrape.go:932 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://100.106.248.21:9106/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=24
level=warn ts=2018-06-20T07:02:35.593731873Z caller=scrape.go:932 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://100.106.248.21:9106/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=24
Specifically:
msg="Error on ingesting samples that are too old or are too far into
the future"
My guess is that, since CloudWatch is located in Virginia while our cloudwatch_exporter and Prometheus run in the EU, there is a timestamp difference that prevents Prometheus from ingesting these metrics.
Hence my attempt to use set_timestamp: false, as suggested in this merge request.
However, that does not work.
I am not a Prometheus expert, and maybe it is misconfigured. How can I further investigate?
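One thing worth checking (a hedged suggestion, not a confirmed fix for your setup): if your Prometheus is version 2.9 or newer, there is a knob on the Prometheus side as well. Setting honor_timestamps: false in the scrape config tells Prometheus to discard whatever timestamps the target exposes and stamp samples with the scrape time, independently of the exporter's set_timestamp setting. A minimal sketch, where the job name and static target are placeholders (your real target comes from kubernetes-service-endpoints discovery, so the option would go on that scrape job instead):
scrape_configs:
  - job_name: 'cloudwatch-exporter'   # illustrative static job, not your actual kubernetes-discovered one
    honor_timestamps: false           # ignore exporter-supplied timestamps, use the scrape time instead
    static_configs:
      - targets: ['localhost:9106']
You can also verify that set_timestamp: false actually took effect: with timestamps enabled, each sample line in the /metrics output ends with an extra millisecond-epoch column after the value; if that column is absent, the exporter side is already doing what you intended.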
I have been reading this article - https://www.databricks.com/session_na20/native-support-of-prometheus-monitoring-in-apache-spark-3-0 - which mentions that we can get Spark Streaming metrics like input rows, processing rate, and batch duration into Prometheus.
I was able to get host/infra metrics like memory, disk, etc. via the API below:
https://eastus-c3.databricks.net/driver-proxy-api/o/<org-id>/<cluster-id>/40001/metrics/executors/prometheus
I couldn't find any APIs or references for getting the streaming metrics, processing info, etc.
Any help on how to get those streaming UI metrics into Prometheus?
Spark configs set on the cluster:
spark.ui.prometheus.enabled true
spark.sql.streaming.metricsEnabled true
Here is the prometheus config file:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus'
    scheme: https
    scrape_interval: 5s
    static_configs:
      - targets: ['eastus-c3.azuredatabricks.net']
    metrics_path: '/driver-proxy-api/o/<orgid>/<clusterid>/40001/metrics/executors/prometheus'
    basic_auth:
      username: 'token'
      password: 'user gen token'
Streaming metrics are emitted from the driver, not the executors.
Try /driver-proxy-api/o/<orgid>/<clusterid>/40001/metrics/prometheus.
For driver metrics, you'll need to enable the PrometheusServlet. You can do this on Databricks by attaching an init script like:
#!/bin/bash
cat << EOF > /databricks/spark/conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
EOF
And don't forget to name your streaming queries (Python, Scala); this helps identify the query and differentiate queries when executing multiple streaming queries on the same cluster.
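A minimal PySpark sketch of a named query, assuming an existing SparkSession called spark (the source, sink, and query name are placeholders, not taken from your job):
# Name the streaming query so its metrics and UI entries can be told apart
(spark.readStream
      .format("rate")                   # placeholder source, just for illustration
      .load()
      .writeStream
      .queryName("orders_per_minute")   # hypothetical name; used in the UI and the metrics source name
      .format("console")                # placeholder sink
      .start())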
I'm trying to understand how the price estimation works for Azure Data Factory, from the official guide, section "Estimating Price - Use Azure Data Factory to migrate data from Amazon S3 to Azure Storage".
I managed to understand everything except the 292 hours that are required to complete the migration.
Could you please explain how they got that number?
Firstly, feel free to submit feedback here with the MS docs team to get an official clarification on this.
Meanwhile, as they mention "In total, it takes 292 hours to complete the migration", I take it that this would include listing from the source, reading from the source, writing to the sink, and other activities, beyond the data movement itself.
If we consider, approximately, a data volume of 2 PB and an aggregate throughput of 2 GBps:
2 PB = 2,097,152 GB (binary)
2,097,152 GB / 2 GBps = 1,048,576 secs
1,048,576 secs / 3600 = 291.271 hours
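The same arithmetic as a quick Python sanity check (nothing here beyond the figures already quoted above):
data_gb = 2 * 1024 ** 2      # 2 PB in binary GB = 2,097,152 GB
gb_per_sec = 2               # assumed aggregate copy throughput
hours = data_gb / gb_per_sec / 3600
print(round(hours, 3))       # 291.271 -> roughly the 292 hours quoted in the guide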
Again, these numbers are hypothetical. Further, you can refer to Plan to manage costs for Azure Data Factory and Understanding Data Factory pricing through examples.
I have a vast database comprising ~2.4 million JSON files, each of which contains several records. I've created a simple apache-beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, others.
Write to BigQuery.
# (ExtractRecordsFn, TransformDataFn, known_args, pipeline_args, files_pattern, etc. are defined elsewhere)
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it is working as expected. But I'm pretty doubtful regarding the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and due to the massive number of files to parse and load, I want to know if I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I haven't finished understanding some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
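For what it's worth, a hedged sketch of how those knobs are passed (values are purely illustrative, not tuning advice; batch_size applies only to streaming inserts, while max_file_size and max_files_per_bundle only affect file loads, so double-check the WriteToBigQuery documentation for your SDK version):
output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # batch load jobs (the usual default for a bounded pipeline)
    max_file_size=4 * 1024 * 1024 * 1024,              # illustrative cap on each temp file handed to a load job
    max_files_per_bundle=500,                          # illustrative cap on temp files written per worker bundle
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
    temp_file_format='NEWLINE_DELIMITED_JSON')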
Update
I'm not only concerned about BigQuery quotas, but GCP quotas of other resources used by this pipeline are also a matter of concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't understand that message completely. The process activated 8 workers successfully and is using 8 of the 8 available in-use IP addresses. Is this a problem? How can I fix it?
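On the Compute Engine side, a hedged sketch of worker options that can keep the Dataflow job inside regional quotas (flag names may vary slightly by SDK version, and running workers without public IPs requires Private Google Access on the subnetwork):
# Appended to the args already fed into PipelineOptions above
pipeline_args.extend([
    '--region=us-central1',
    '--max_num_workers=8',     # cap autoscaling so instance/CPU/in-use-IP quotas are not exceeded
    '--disk_size_gb=50',       # request smaller boot disks than the 250 GB noted in the quota warning
    '--no_use_public_ips',     # workers use internal IPs only, freeing the in-use IP address quota
])
pipeline_options = PipelineOptions(pipeline_args)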
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
To achieve what you want to do, you can try the Google-provided templates or just refer to their code.
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the Google BigQuery I/O connector documentation.
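If you try the streaming route suggested above, the change in the existing pipeline is essentially the write method (hedged sketch; streaming inserts have their own pricing and row/throughput limits):
output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
    batch_size=500,    # illustrative: rows batched per streaming insert request
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR')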
I am trying to save metrics (loss, validation loss, and mAP) at every epoch for runs of 100 and 50 epochs, but at the end of the experiment I get this error:
Run failed: RunHistory finalization failed: ServiceException: Code: 400 Message: (ValidationError) Metric Document is too large
I am using this code to save the metrics:
run.log_list("loss", history.history["loss"])
run.log_list("val_loss", history.history["val_loss"])
run.log_list("val_mean_average_precision", history.history["val_mean_average_precision"])
I don't understand why trying to save only 3 metrics exceeds the limits of Azure ML Service.
You could break the run history list writes into smaller blocks like this:
run.log_list("loss", history.history["loss"][:N])
run.log_list("loss", history.history["loss"][N:])
Internally, the run history service concatenates the blocks with the same metric name into a contiguous list.
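A hedged generalization of that idea, chunking every history list before logging (the block size of 250 is arbitrary; pick whatever keeps each write under the service's document limit):
CHUNK = 250  # arbitrary block size, chosen to stay under the metric document limit
for name in ("loss", "val_loss", "val_mean_average_precision"):
    values = history.history[name]
    for start in range(0, len(values), CHUNK):
        run.log_list(name, values[start:start + CHUNK])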
Can I set different reporting frequencies for different types of metrics in Micrometer? For example, I want to send endpoint metrics with a 10s step and the others with a 5s step.
There's a property for reporting frequency per meter registry, but AFAICT there's no concept of reporting frequency per meter. If you'd like to pursue this, you can create an issue to request the feature in its issue tracker: https://github.com/micrometer-metrics/micrometer/issues