I am a beginner with the aws-cli and am still on my first exploratory queries. For a given resource I only find circuitous (and possibly costly) ways of fetching all the data and then counting it client-side. For example: the number of instances, the number of regions, the number of S3 objects, etc. It would be helpful if this were supported in the CLI. Is it supported already?
There is no problem with the current CLI; this is more of a generic enhancement request to avoid circuitous logic on the client side.
Example:
I'd like to know how many instances I have in a given region. I know I can start with aws ec2 describe-instances and then do some post-processing (a --query workaround is sketched after the pattern below).
The same applies to getting the count of any other resource.
A pattern of commands I'd be interested in:
aws ec2 count-instances
aws ec2 count-users
aws <> count-<>
wherever a count is applicable. In some cases other aggregators would also be useful (say, the total number of CPUs, etc.).
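For what it's worth, the CLI's built-in JMESPath --query option can at least collapse the fetch-and-count into a single command (it still retrieves the underlying data, so it doesn't remove the cost concern), e.g.:
aws ec2 describe-instances --query 'length(Reservations[].Instances[])' --output text
aws iam list-users --query 'length(Users)' --output text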
Yes, there is a command to get the resource count. See the example below:
aws configservice get-discovered-resource-counts --resource-types "AWS::EC2::Instance" --region us-east-1
Follow the documentation link here
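If I remember correctly, you can also omit --resource-types to get counts for every resource type AWS Config has discovered in the region, along with a totalDiscoveredResources total:
aws configservice get-discovered-resource-counts --region us-east-1
Note that this only counts resources that AWS Config is actually recording in that region.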
Is it also possible to get this information for a specific time range or date?
For example, I want to plot the data and therefore need the count of all EC2 instances from the last year on a daily basis.
I'm looking into Azure Workbooks now, and I'm wondering if the below is actually even possible.
Scenario
List all function apps under Subscription / Resource Group scope.
This step is done - simple Azure Resource Graph query with parameters does the job.
Execute an Azure Resource Manager action against each Function App returned by the query in step 1.
Specifically, I'm interested in the detectors/functionExecutionErrors ARM API, and in returning a parsed result from it. When doing that for a hardcoded resource, I can get the results I need. Using the JSON path $.properties.dataset[0].table.rows[0][1] I get back the summary: "All running functions in healthy state with execution failure rate less than 0.1%."
I realize this might either be undoable in Workbooks or something trivial that I missed - it would be easiest if I could just run 'calculated columns' when rendering outputs. So, the summarizing question is:
How, if possible, can I combine an Azure Resource Graph query with the Azure Resource Manager data source, so that the ARM query runs for each returned Graph resource and the results are displayed as a table in the form "Resource ID | ARM API results"?
I think I have come closest to this by marking the Resource Graph query output as a parameter (id -> FunctionAppId) and referencing it in the ARM query as /{FunctionAppId}/detectors/functionExecutionErrors. This works fine as long as only one resource is selected, but there are two obstacles: I want to execute against all query results regardless of whether they are selected, and I need Azure Resource Manager to understand that it has to loop over the resources rather than concatenate them (as seen in the invoked HTTP call in the F12 dev tools, the resource names are simply joined together).
Hopefully there's someone out there who could help out with this. Thanks! :-)
I'm also new to Workbooks and I think creating a parameter first with the functionId is best. I do the same ;)
With multiple functions the parameter will have them all. You can use split() to get an array and then loop.
Will that work for you?
Can you share your solution if you managed to solve this differently?
cloudsma.com is a resource I use a lot to understand the queries and options better. Like this one: https://www.cloudsma.com/2022/03/azure-monitor-alert-reports
Workbooks doesn't currently have the ability to run the ARM data source against many resources, though it is on our backlog and we are actively investigating a way to run any data source for a set of values and merge the results together.
The general workaround is as stated: either use a parameter to select a resource and run the one query for the selected item, or do something similar with a query step rendered as a grid, and have the grid selection output a parameter that is used as input to the ARM query step.
As the title states, my question pertains mostly to reading CSV data from AWS S3. I will provide details about the other technologies I am using, but they are not important to the core problem.
Context (not the core issue, just some extra detail)
I have a use case where I need to process some very large CSVs using a Node.js API on AWS Lambda and store some data from each CSV row to DynamoDB.
My implementation works well for small-to-medium-sized CSV files. However, for large CSV files (think 100k - 1m rows), the process takes way more than 15 minutes (the maximum execution time for an AWS Lambda function).
I really need this implementation to be serverless (because the rest of the project is serverless, because of a lack of predictable usage patterns, etc...).
So I decided to try and process the beginning of the file for 14.5 minutes or so, and then queue a new Lambda function to pick up where the last one left off.
I can easily pass the row number from the last function to the new function, so the new Lambda function knows where to start from.
So if the 1st function processed lines 1 - 15,000, then the 2nd function would pick up the processing job at row 15,001 and continue from there. That part is easy.
But I can't figure out how to start a read stream from S3 beginning in the middle. No matter how I set up my read stream, it always starts data flow from the beginning of the file.
Breaking the processing task into smaller pieces (like queueing a new Lambda for each row) won't help either; I have already done this kind of optimization and the per-row processing is as minimal as possible.
Even if the 2nd job starts reading at the beginning of the file and I set it up to skip the already-processed rows, it will still take too long to get to the end of the file.
And even if I do some other implementation (like using EC2 instead of Lambda), I still run into the same problem. What if the EC2 process fails at row 203,001? I would need to queue up a new job to pick up from the next row. No matter what technology I use or what container/environment, I still need to be able to read from the middle of a file.
Core Problem
So... let's say I have a CSV file saved to S3. And I know that I want to start reading from row 15,001. Or alternatively, I want to start reading from the 689,475th byte. Or whatever.
Is there a way to do that? Using the AWS SDK for Node.js or any other type of request?
I know how to set up a read stream from S3 in Node.js, but I don't know how it works under the hood as far as how the requests are made. Maybe that knowledge would be helpful.
Ah it was so much easier than I was making it... Here is the answer in Node.js:
// Assumes the AWS SDK for JavaScript v2 (the 'aws-sdk' package) is available in the Lambda runtime.
const aws = require('aws-sdk');

new aws.S3()
  .getObject({
    Key: 'bigA$$File.csv',
    Bucket: 'bucket-o-mine',
    // Fetch only this byte range instead of the whole object.
    Range: 'bytes=65000-100000',
  })
  .createReadStream()
Here is the doc: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html
You can do this in any of the AWS SDKs or via HTTP header.
Here's what AWS has to say about the range header:
Downloads the specified range bytes of an object. For more information about the HTTP Range header, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
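One practical note for the Lambda hand-off: Range works in bytes, not rows, so the first invocation also has to record the byte offset it stopped at, not just the row number. Below is a minimal sketch of that idea, assuming the bucket/key from the answer above; processRow and nearlyOutOfTime are hypothetical helpers (the per-row DynamoDB write and the ~14.5-minute timer), and multi-byte character edge cases at chunk boundaries are ignored:
const aws = require('aws-sdk');
const s3 = new aws.S3();

// Stream the CSV starting at `startByte`, process complete rows, and return
// the byte offset the next Lambda invocation should resume from.
async function processFrom(startByte) {
  const stream = s3
    .getObject({
      Bucket: 'bucket-o-mine',
      Key: 'bigA$$File.csv',
      Range: `bytes=${startByte}-`, // open-ended range: from startByte to the end of the object
    })
    .createReadStream();

  let offset = startByte; // total bytes consumed so far
  let leftover = '';      // trailing partial row carried between chunks

  for await (const chunk of stream) {
    offset += chunk.length;
    const lines = (leftover + chunk.toString('utf8')).split('\n');
    leftover = lines.pop(); // the last piece may be an incomplete row
    for (const line of lines) {
      await processRow(line); // hypothetical per-row handler (e.g. DynamoDB write)
    }
    if (nearlyOutOfTime()) { // hypothetical check against the remaining Lambda time
      stream.destroy();
      return offset - Buffer.byteLength(leftover); // resume right after the last full row
    }
  }
  return null; // reached the end of the file
}
The returned offset can then be passed to the next invocation as its startByte.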
I have time series daily data which I run a model on. The model runs in Spark.
I only want to run the model daily, and append the results to the historic results. It is important to have a 'merged single data source' containing historical data for the model to run successfully.
I have to use an AWS service to store the results. If I store them in S3, I end up with the backfill plus one file per day (too many files). If I store them in Redshift, it doesn't merge/upsert, so it becomes complicated. The customer-facing data is in Redshift, so dropping the table and reloading it daily is not an option.
I am not sure how to store the incremental data cleverly (meaning with minimal cost and minimal subsequent processing) without re-processing everything daily to produce a single file.
S3 is still your best shot. Your job doesn't seem to need real-time access; it's more of a rolling data set.
If you are worried about the number of files it generates, there are at least two things you can do:
S3 object lifecycle management
You can define your objects to be removed, or transitioned to another (cheaper) storage class, after x days.
More examples: https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-configuration-examples.html
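If you prefer to set this up programmatically rather than through the console, a sketch with the AWS SDK for Node.js might look like the following; the bucket name, prefix, and the 30/365-day thresholds are placeholders for illustration:
const aws = require('aws-sdk');
const s3 = new aws.S3();

// Example lifecycle rule: move daily result files to a cheaper storage class
// after 30 days and expire them after a year.
s3.putBucketLifecycleConfiguration({
  Bucket: 'my-model-results-bucket',          // placeholder bucket name
  LifecycleConfiguration: {
    Rules: [
      {
        ID: 'archive-daily-results',
        Status: 'Enabled',
        Filter: { Prefix: 'daily-results/' }, // placeholder prefix
        Transitions: [{ Days: 30, StorageClass: 'STANDARD_IA' }],
        Expiration: { Days: 365 },
      },
    ],
  },
}, (err) => {
  if (err) console.error(err);
  else console.log('Lifecycle configuration applied');
});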
S3 notification
Basically, you can set up a notification on your S3 bucket that 'listens for' all objects matching a specified prefix and suffix and triggers other AWS services. One easy option is to trigger a Lambda, do your processing there, and then do whatever you would like with the result.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
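For the Lambda side, a minimal handler sketch for an S3 "object created" notification could look like this; the merge/append step is left as a placeholder since it depends on how the historical data set is stored:
// Minimal AWS Lambda handler (Node.js) for S3 object-created notifications.
exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    // Object keys in S3 events are URL-encoded (spaces arrive as '+').
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    console.log(`New daily result file: s3://${bucket}/${key}`);
    // Placeholder: append/merge this object into the historical data set here.
  }
};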
Use S3 as your database whenever it's possible. It's damn cheap and it's AWS's backbone.
You can also switch to an ETL tool. A very efficient one, which is open source, specialized in big data, fully automatable, and easy to use, is the Pentaho Data Integrator.
It comes with ready-made plugins for S3, Redshift (and others), and there is a single step to compare with previous values. In my experience it runs pretty fast. Plus, it works for you during the night and sends you a morning mail saying everything went OK (or not).
Note to the moderators: this is an agnostic point of view; I could have recommended many other tools, but this one seems the most suited to the OP's needs.
I am working on a project to move my repository to Hazelcast.
I need to find some documents by date range, store type, and store IDs.
During my tests I got 90k throughput using one c3.large instance, but when I run the same test with more instances the scaling drops off significantly (10 instances: 500k; 20 instances: 700k).
These numbers were the best I could get by tuning some properties:
hazelcast.query.predicate.parallel.evaluation
hazelcast.operation.generic.thread.count
hz:query
I have tried changing the instance type to c3.2xlarge to get more processing power, but the numbers don't justify the price.
How can I optimize Hazelcast to be faster in this scenario?
My use case doesn't use map.get(key), only map.values(predicate).
Settings:
Hazelcast 3.7.1
Map as Data Structure;
Complex object using IdentifiedDataSerializable;
Map index configured;
Only 2000 documents on map;
Hazelcast embedded configured by Spring Boot Application (singleton);
All instances in same region.
Test
Gatling
New Relic as service monitor.
Any help is welcome. Thanks.
If your use case only uses map.values with a predicate, I would strongly suggest using OBJECT as the in-memory format. This way, there will be no serialization involved during query execution.
On the other hand, it is normal to get very high numbers when you only have one member, because no data moves across the network. To improve this, I would look at EC2 instances with higher network capacity; for example, c3.8xlarge has a 10 Gbit network, compared to the "High" network performance that comes with c3.2xlarge.
I can't promise how much of an increase you will get, but I would definitely try these changes first.
I need to run identical jobs on a schedule, and they differ only in a few strings.
As you may know, there is no convenient way to create identical jobs with different parameters. For now I prefer a "codeless" way to do this, or one with as little code as possible.
So let's imagine the parameters are stored as rows of a JobsConfigurations table in the website-related database.
How can I get the name of the currently running job, so I can pick the right configuration from the table?
Thanks for help!
See https://github.com/projectkudu/kudu/wiki/Web-Jobs#environment-settings
The WEBJOBS_NAME environment variable will give you the name of the current WebJob.
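For example, in a Node.js WebJob the lookup might look like the following sketch; the JobsConfigurations access is only a placeholder, so substitute whatever data access your site already uses:
// Read the name of the currently running WebJob from the Kudu-provided
// environment variable and use it to pick the matching configuration row.
const jobName = process.env.WEBJOBS_NAME;

// Placeholder: replace with your real data access (SQL, Table Storage, ...).
async function loadJobConfiguration(name) {
  // e.g. SELECT * FROM JobsConfigurations WHERE JobName = @name
  return { jobName: name /* , ...other settings from the table */ };
}

loadJobConfiguration(jobName).then((config) => {
  console.log(`Running ${jobName} with configuration`, config);
  // ...run the actual job logic with `config` here.
});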