Collect performance metrics using Google Cloud Ops Agent and send to Google Cloud Monitoring

I'm looking for a general way to collect performance metrics on several Linux VM instances (Azure, GCP, other) and monitor the metrics in GCP.
On an Ubuntu VM in Azure, I have installed Google Cloud Ops Agent, which uses fluentd (to collect logs) and OpenTelemetry (to collect performance metrics) behind the scenes.
I added overrides for the two services to set environment variables so that they pick up the service account JSON credentials file, as follows:
google-cloud-ops-agent-fluent-bit.service → GOOGLE_SERVICE_CREDENTIALS
google-cloud-ops-agent-opentelemetry-collector.service → GOOGLE_APPLICATION_CREDENTIALS
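For reference, the override I used looks roughly like this (a sketch: the drop-in path matches the Drop-In line in the status output below, but the credentials file location is an assumption from my setup, so point it at your own service account JSON key):

```shell
# Create a systemd drop-in that sets the credentials variable for the collector.
# /etc/google/auth/service-account.json is a placeholder path from my setup.
sudo mkdir -p /etc/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
sudo tee /etc/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d/override.conf <<'EOF'
[Service]
Environment="GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/service-account.json"
EOF
# Pick up the drop-in and restart the sub-agent.
sudo systemctl daemon-reload
sudo systemctl restart google-cloud-ops-agent-opentelemetry-collector.service
```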
See this post for more details on authentication.
I could see log messages appearing in Google Cloud Logging, which must have been scraped and sent by google-cloud-ops-agent-fluent-bit.service. However, I couldn't find any performance metrics from google-cloud-ops-agent-opentelemetry-collector. Where should I expect to find these in GCP? I'm convinced that there is some additional configuration I need to get this working, but the documentation seems to be about getting Ops Agent running on GCP Compute Engine instances.
Update 1:
I can see that the service is running (sudo systemctl status google-cloud-ops-agent-opentelemetry-collector.service), but I now notice errors that I hadn't spotted before, which might explain why metrics are not making it to Google Cloud:
exporterhelper/queued_retry.go:215 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0]\nerror details: name = Unknown desc = total_point_count:1 errors:{status:{code:9} point_count:1}", "interval": "5.52330144s"}
I don't know where to find the logs for the service other than the excerpt printed by systemctl status.
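In case it helps anyone else: since the sub-agents run as systemd units, their full logs should be readable from the journal rather than the short tail that systemctl status prints:

```shell
# Full log history for the metrics sub-agent (status only shows a short excerpt).
sudo journalctl -u google-cloud-ops-agent-opentelemetry-collector.service --no-pager
# Or follow new entries live while reproducing the problem:
sudo journalctl -u google-cloud-ops-agent-opentelemetry-collector.service -f
```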
The command line for the service is /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml. I took a look at the config file and see a few mentions of googlecloud as an exporter, e.g.
exporters:
  googlecloud:
    metric:
      prefix: ""
    user_agent: Google-Cloud-Ops-Agent-Metrics/2.11.0 (BuildDistro=focal;Platform=linux;ShortName=ubuntu;ShortVersion=20.04)
Update 2: Output of service status
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
Drop-In: /etc/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
└─override.conf
Active: active (running) since Tue 2022-03-15 06:36:44 UTC; 1 day 17h ago
Process: 1053790 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
Main PID: 1053796 (otelopscol)
Tasks: 10 (limit: 19198)
Memory: 381.2M
CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
└─1053796 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/collector#v0.44.0/exporter/exporterhelper/metrics.go:134
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/collector#v0.44.0/exporter/exporterhelper/queued_retry_inmemory.go:105
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/collector#v0.44.0/exporter/exporterhelper/internal/bounded_memory_queue.go:99
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/collector#v0.44.0/exporter/exporterhelper/internal/bounded_memory_queue.go:78
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: 2022-03-16T23:47:37.980Z info exporterhelper/queued_retry.go:215 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud", "error": "[rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-199]\nerror details: name = Unknown desc = total_point_count:200 errors:{status:{code:9} point_count:200}; rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-199]\nerror details: name = Unknown desc = total_point_count:200 errors:{status:{code:9} point_count:200}; rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-199]\nerror details: name = Unknown desc = total_point_count:200 errors:{status:{code:9} point_count:200}; rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-111]\nerror details: name = Unknown desc = total_point_count:112 errors:{status:{code:9} point_count:112}]", "interval": "10.435795045s"}
Mar 16 23:47:49 HOSTNAME otelopscol[1053796]: 2022-03-16T23:47:49.299Z info exporterhelper/queued_retry.go:215 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-4]\nerror details: name = Unknown desc = total_point_count:5 errors:{status:{code:9} point_count:5}", "interval": "44.913550864s"}

Related

What's the meaning of "verb=LIST" in Kubernetes apiserver logs and alerts?

Recently I renewed certificates on the cluster using kubeadm alpha certs renew, and then I saw logs like the one below in the Kubernetes apiserver pods:
kubectl -n kube-system logs --tail 10 kube-apiserver-master-1
I1011 07:27:25.703052 1 trace.go:116] Trace[989041745]: "List" url:/api/v1/persistentvolumeclaims (started: 2022-10-11 07:27:22.702071048 +0000 UTC m=+165036.176710383) (total time: 3.000954622s):
and I received too many alerts from Alertmanager (I'm using the Prometheus Operator on Kubernetes).
This is a sample alert:
FIRING
Alert: - critical
Description:
Details:
• alertname: KubeAPIErrorsHigh
• cluster: myCluster
• prometheus: monitoring/prometheus-prometheus-oper-prometheus
• resource: persistentvolumeclaims
• severity: critical
• verb: LIST
The alert expression created by the Prometheus Operator is:
expr: sum by(resource, subresource, verb) (rate(apiserver_request_total{code=~"5..",job="apiserver"}[5m]))
    / sum by(resource, subresource, verb) (rate(apiserver_request_total{job="apiserver"}[5m]))
  > 0.1
I want to know whether there are any problems in the cluster or not.
verb: LIST is the Kubernetes API verb (like GET, POST, PUT; a LIST is served as an HTTP GET on a collection). What's most likely happening is that something is trying to LIST all the PVCs and that's taking too long or failing, in which case yes, you need to look into why.
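To narrow down which requests are failing, one option is to run the failing half of the alert expression yourself against Prometheus. A sketch, assuming the service name from the alert's "prometheus" label and a local port-forward (adjust namespace and service to your install):

```shell
# Forward the Prometheus UI/API to localhost; service name is an assumption
# derived from the alert label "monitoring/prometheus-prometheus-oper-prometheus".
kubectl -n monitoring port-forward svc/prometheus-prometheus-oper-prometheus 9090 >/dev/null &
sleep 2
# Rate of 5xx responses for LIST requests, broken down by resource and status code.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by(resource, code) (rate(apiserver_request_total{job="apiserver",verb="LIST",code=~"5.."}[5m]))'
```

A non-empty result for persistentvolumeclaims would confirm that the LISTs themselves are erroring, not just running slowly.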

Cloud Service (Classic) Roles and Instances reverts all updates and are not maintained by Azure

I have an issue with my roles and instances in Azure (Cloud Service classic). Windows Server 2012 R2 frequently reverts the installed updates. Also, Windows has not been activated since the last reboot of the instances (environments).
We are using PaaS, which means the provider should maintain the OS and the health of our cloud services.
This is the error, which I receive:
Busy (Recovering role... Application startup task 0 is running.)
Other collected information:
[Information] (2:54 AM) Execute:Monitoring cycled started
[Warning] (2:54 AM) CaptureResourceStatus:Resource status is Down because Azure Instance State is not Ready or Stopped: Down and
[Warning] (2:54 AM) CaptureResourceMetrics:Instance Holden.eFox.Web_IN_0 status is BusyRole (expected ReadyRole). State Details: Recovering role... Failed to load role entrypoint. System.Reflection.ReflectionTypeLoadException: Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information.
at System.Reflection.RuntimeModule.GetTypes(RuntimeModule module)
at System.Reflection.Assembly.GetTypes()
at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)
--- End of inner exception stack trace ---
at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)
at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CreateRoleEntryPoint(RoleType roleTypeEnum)
at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.InitializeRoleInternal(RoleType roleTypeEnum)' [2022-09-01T02:53:36.000Z] Last exit time: [2022/09/01, 02:53:37.605]. Last exit code: 0.
[Warning] (2:54 AM) CaptureResourceStatus:Resource status is Down because Azure Instance State is not Ready or Stopped: Down and
[Information] (2:54 AM) CleanDuplicates:Removed 28 duplicate metrics. (Before cleaning: 44 metrics, after 16.)
[Warning] (2:54 AM) CaptureResourceMetrics:Metric SystemEventLogs: no values captured
[Warning] (2:54 AM) CaptureResourceMetrics:Metric ApplicationEventLogs: no values captured
Could you help me resolve this problem, or direct me to whom I should contact for support? (I do not pay for a technical support plan.)

Azure Storage: How to avoid clock skew issues with a Blob level SAS token

I'm occasionally having trouble with Azure Storage SAS tokens generated on the server. I don't set anything for start time since this was recommended to avoid clock skew issues, and I set my expiry time to 1 hour after DateTime.UtcNow. Every now and then, the SAS tokens don't work, and I'm guessing this might have to do with a clock skew issue. Here are two errors I received recently:
<Error>
<Code>AuthenticationFailed</Code>
<Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:cb371f2b-801e-0063-16a1-08d06f000000 Time:2021-02-21T22:35:53.9832140Z</Message>
<AuthenticationErrorDetail>Signed expiry time [Sun, 21 Feb 2021 20:39:40 GMT] must be after signed start time [Sun, 21 Feb 2021 22:35:53 GMT]</AuthenticationErrorDetail>
</Error>
<Error>
<Code>AuthenticationFailed</Code>
<Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:8818c581-401e-0058-6477-08717d000000 Time:2021-02-21T17:35:37.1284611Z</Message>
<AuthenticationErrorDetail>Signature not valid in the specified time frame: Start [Sat, 20 Feb 2021 00:15:01 GMT] - Expiry [Sat, 20 Feb 2021 01:30:01 GMT] - Current [Sun, 21 Feb 2021 17:35:37 GMT]</AuthenticationErrorDetail>
</Error>
This is how I generate the token:
var blobSasBuilder = new BlobSasBuilder
{
BlobContainerName = containerName,
BlobName = fileName,
Resource = "b",
ExpiresOn = DateTime.UtcNow.AddHours(1),
Protocol = SasProtocol.Https
};
How can I fix this issue? According to the above error, it looks like I tried to access this resource after the token expired, but in reality I tried to access it immediately after the token was generated and sent to the client. As I said, this does not happen often, but it's a recurring problem.
On second thought, I wonder if this is a bug in the v12 SDK.
According to the errors, the start time is later than your expiry time and the current time. Please set the start time to at least 15 minutes in the past to allow for clock skew.
For example
I use the .NET SDK Azure.Storage.Blobs:
// Creates a client to the blob service using the connection string.
var blobServiceClient = new BlobServiceClient(storageConnectionString);
// Gets a reference to the container.
var blobContainerClient = blobServiceClient.GetBlobContainerClient(<ContainerName>);
// Gets a reference to the blob in the container.
BlobClient blobClient = blobContainerClient.GetBlobClient(<BlobName>);
// Defines the resource being accessed and for how long the access is allowed.
var blobSasBuilder = new BlobSasBuilder
{
    // Back-date the start time to tolerate clock skew between machines.
    StartsOn = DateTimeOffset.UtcNow.AddMinutes(-15),
    ExpiresOn = DateTimeOffset.UtcNow.AddHours(1),
    BlobContainerName = <ContainerName>,
    BlobName = <BlobName>,
    Resource = "b",
};
// Defines the type of permission.
blobSasBuilder.SetPermissions(BlobSasPermissions.Write);
// Builds an instance of StorageSharedKeyCredential.
var storageSharedKeyCredential = new StorageSharedKeyCredential(<AccountName>, <AccountKey>);
// Builds the SAS query parameters.
var sasQueryParameters = blobSasBuilder.ToSasQueryParameters(storageSharedKeyCredential);
The code that generates the SAS must run on a machine where the date, time and time zone are correctly set.
The error messages are a little different in the two cases.
First error: it says the signed expiry time is about 1h 56m before the error time. How can that be? Maybe the SAS expiry time was set to a value that is too early (almost 2 hours earlier, not 15 minutes), or, more likely, the SAS start time is greater than the SAS expiry time?
Second error: the time of the error is 21 February, but the SAS expired on 20 February. Again it looks like the SAS has expired, but this time by more than 35 hours, not 15 minutes.
Maybe the machine that runs the SAS-generating code has issues with its clock? This can be checked by polling that machine for its time at regular intervals (once per minute, for example) and comparing the results.
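If the SAS is generated on a Linux box, a quick sketch of such a check (query mode only, nothing is changed; assumes ntpdate and timedatectl are installed):

```shell
# Query an NTP server without setting the clock; the reported offset is the skew.
sudo ntpdate -q pool.ntp.org
# On systemd machines, check whether NTP synchronisation is active at all.
timedatectl status
```

A persistent offset of hours here would explain SAS tokens that appear already expired the moment they are issued.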

Azure IoT Device Offline Commands Issue

I have an Azure IoT device in an IoT Central application.
We don't want it to execute offline commands. Is there any way to switch off this offline-command execution capability?
Based on my test (a sync command), the "offline command" behavior works as expected. When the device is disconnected from the Azure IoT Central app, a NotFound error is returned after 30 seconds; see my example:
{
"error": {
"code": "NotFound",
"message": "Could not connect to device in order to send command. You can contact support at https://aka.ms/iotcentral-support. Please include the following information. Request ID: cic9xs38, Time: Sun, 09 Aug 2020 05:08:00 GMT.",
"requestId": "cic9xs38",
"time": "Sun, 09 Aug 2020 05:08:00 GMT"
}
}
and the command history in the IoT Central app shows the same failure.
Note that in the present version there is no feature for re-executing (retrying) a sync or async command on a re-connected device. If the device is not connected, the command completes with a failed status of NotFound; in other words, the command is invoked synchronously. See more details here.

DCOS unable to install & run ArangoDB

I have installed DC/OS with one agent and 3 masters and tried installing ArangoDB, but it is failing to deploy.
Below is the config seen in the log.
ArangoDB Image: arangodb/arangodb-mesos:3.0
Mode: cluster
Asynchronous replication flag: 0
SecondariesWithDBservers: 0
CoordinatorsWithDBservers: 0
SecondarySameServer: 0
ArangoDBForcePullImage: 1
ArangoDBPrivilegedImage: 0
Minimal resources agent: mem(*):2048;cpus(*):0.25;disk(*):2048
Minimal resources DBserver: mem(*):4096;cpus(*):1;disk(*):4096
Minimal resources secondary DBserver: mem(*):4096;cpus(*):1;disk(*):4096
Minimal resources coordinator: mem(*):4096;cpus(*):1;disk(*):1024
Number of agents in agency: 3
Number of DBservers: 2
Number of coordinators: 2
zookeeper: zk://master.mesos:2181/arangodb3
And below is the error seen in the log file.
I0901 07:07:34.769537 23 CaretakerCluster.cpp:422] planned agent instances: 3, running agent instances: 1
I0901 07:07:34.769601 23 Caretaker.cpp:400] Declining offer e2301ebe-fff0-46a5-b71b-ef77b9a7a764-O11
I0901 07:07:37.474743 24 HttpServer.cpp:439] handling http request 'GET /v1/health.json'
I0901 07:07:40.802276 23 CaretakerCluster.cpp:416] And here the offer:
{"id":{"value":"e2301ebe-fff0-46a5-b71b-ef77b9a7a764-O12"},"framework_id":{"value":"37ac79b8-bc37-4493-9558-aa72638290db-0002"},"slave_id":{"value":"37ac79b8-bc37-4493-9558-aa72638290db-S0"},"hostname":"192.168.12.167","url":{"scheme":"http","address":{"hostname":"192.168.12.167","ip":"192.168.12.167","port":5051},"path":"/slave(1)","query":[]},"resources":[{"name":"ports","type":1,"ranges":{"range":[{"begin":1026,"end":2180},{"begin":2182,"end":3887},{"begin":3889,"end":5049},{"begin":5052,"end":8079},{"begin":8082,"end":8180},{"begin":8182,"end":17140},{"begin":17144,"end":32000}]},"role":""},{"name":"disk","type":0,"scalar":{"value":1.17866e+06},"role":""},{"name":"cpus","type":0,"scalar":{"value":7.5},"role":""},{"name":"mem","type":0,"scalar":{"value":12298},"role":""}],"attributes":[],"executor_ids":[]}
I0901 07:07:40.802320 23 CaretakerCluster.cpp:422] planned agent instances: 3, running agent instances: 1
I0901 07:07:40.802383 23 Caretaker.cpp:400] Declining offer e2301ebe-fff0-46a5-b71b-ef77b9a7a764-O12
I believed one agent server would be sufficient. Does the number of agents also need to be 3 servers?
I also need to know how to restart the entire cluster, or a single service if need be. (Killing processes doesn't seem to be the right way.)
Can someone suggest what exactly needs to be done here?
Thanks in advance!
Do I understand correctly that you only have one Agent node (which would explain only one instance running)? ArangoDB needs at least 3 agent nodes.
See the pre-install note: https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/A/arangodb3/4/package.json#L10
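On the restart question: rather than killing processes, a sketch using the DC/OS CLI is to restart the Marathon app that the framework runs under (the app id /arangodb3 is an assumption, so check the listing first):

```shell
# Find the ArangoDB framework's Marathon app id.
dcos marathon app list
# Restart it (replace /arangodb3 with the id shown in the listing).
dcos marathon app restart /arangodb3
```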
