Good Morning,
I am currently working on running and monitoring Databricks notebooks from Airflow. I am able to log in, spin up a new cluster, and execute a notebook. The issue I have is the monitoring side of things. Using the Jobs API 2.1, I am making the following call:
curl --location --request GET 'https://<MY-SERVER>.azuredatabricks.net/api/2.1/jobs/runs/list?job_id=<MY-Job-Id>'
On the API page it looks like that endpoint should return some stats about the job's runs. Most important to me is state.result_state, as I want to use it as a sensor.
My issue is that when I hit the list endpoint for any of my jobs, the only thing I get back is
{
"has_more": false
}
I can't find anywhere in the documentation where I need to add something to the notebook itself to emit the metric I'm looking for. Is there maybe an elevated set of permissions I need for my auth token that would give me the full set of metrics I am looking for?
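For reference, this is roughly the polling I am trying to wire up once the list call actually returns runs (a rough Node sketch; the host, token, and job id are placeholders for my real values, and Node 18+ is assumed for the built-in fetch):

// Rough sketch of the polling I want to do against the runs/list endpoint.
// DATABRICKS_HOST, DATABRICKS_TOKEN and JOB_ID are placeholders for my real values.
const host = process.env.DATABRICKS_HOST;   // e.g. https://<MY-SERVER>.azuredatabricks.net
const token = process.env.DATABRICKS_TOKEN; // same auth token Airflow uses
const jobId = process.env.JOB_ID;

async function latestResultState() {
  const res = await fetch(`${host}/api/2.1/jobs/runs/list?job_id=${jobId}`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  const body = await res.json();
  // I expect each run entry to carry state.life_cycle_state / state.result_state,
  // but today the whole response is just { "has_more": false }.
  const run = (body.runs || [])[0];
  return run ? run.state.result_state : undefined;
}

latestResultState().then((state) => console.log("result_state:", state));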
I have written some simple functions and enabled Application Insights.
It's all showing as connected, and I can see that it's tracking HTTP statuses, e.g. I get a failed request count, server response times, etc.
I understand that I can add Application Insights to Node with the following code:
let appInsights = require("applicationinsights");
appInsights.setup("[your ikey]").start();
But I was hoping it would just work without this. I can see that the function is outputting logs when I use the log stream,
but when I use Application Insights I don't see anything in any of the log tables.
Do I need to add App Insights via code to my function, or am I missing some secret config option?
Adding the Application Insights module to your Node project is also a good way to get monitoring for your function. Both the code-based and codeless approaches are good choices.
In my opinion, the biggest difference between code-based and codeless monitoring is custom telemetry data. But I think in most scenarios the information collected by default is enough for daily use; the official doc says:
Application Insights collects log, performance, and error data, and
automatically detects performance anomalies.
So I think you'll be fine once you can see traces and error messages after adding the applicationinsights module and creating a new Application Insights instance. You can also try the codeless configuration I mentioned in the comment (Azure portal -> your Node.js function -> Application Insights -> Enable -> Create new resource).
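If you do end up wanting custom telemetry later, a rough sketch with the applicationinsights module looks like this (the event name and properties below are made-up examples):

// Rough sketch of custom telemetry with the applicationinsights npm module.
// The instrumentation key, event name and properties are placeholder examples.
let appInsights = require("applicationinsights");
appInsights.setup("[your ikey]").start();

let client = appInsights.defaultClient;
client.trackTrace({ message: "function started" });               // lands in the traces table
client.trackEvent({
  name: "OrderProcessed",                                         // custom event
  properties: { orderId: "123", durationMs: "250" },              // custom properties
});
client.trackException({ exception: new Error("example error") }); // lands in exceptions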
My needs are as follows:
- Need to fetch data from a 3rd party API into SQL azure.
The APIs will be queried every day for incremental data and may require pagination, as by default any API response will give only the top N records.
The API also needs an auth token to work, which is the first call before we start downloading data from endpoints.
Due to the last two reasons, I've opted for a Function App, triggered daily, rather than Data Factory, which can also query web APIs.
Is there a better way to do this?
Also, I am thinking of pushing all the JSON into Blob storage and then parsing data from the JSON into SQL Azure. Any recommendations?
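To make the auth-token requirement concrete, this is roughly the flow I have in mind (a sketch only; the URLs, credential names and response fields are placeholders for the real third-party API, and Node 18+ fetch is assumed):

// Sketch of the token-then-data flow I have in mind.
// URLs, credential names and response fields are placeholders for the real API.
async function getToken() {
  const res = await fetch("https://api.example.com/oauth/token", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      clientId: process.env.CLIENT_ID,
      clientSecret: process.env.CLIENT_SECRET,
    }),
  });
  const body = await res.json();
  return body.access_token;
}

async function getPage(token, page) {
  const res = await fetch(`https://api.example.com/records?page=${page}`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  return res.json();
}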
How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the JSON data directly into a SQL database.
Azure Function
Azure Functions are very cost effective. The first million executions are free. If it takes longer than ten minutes, then have a look at Durable Functions. For handling pagination, we have plenty of examples. Your exact solution will depend on the API you are calling and the language you are using. Here is an example in C# using HttpClient. Here is one for Python using Requests. For both, the pattern is similar: get the total number of pages from the API, set a variable to that value, and loop over the pages, getting and saving your data in each iteration. If the API won't provide the max number of pages, then loop until you get an error. Pro tip: make sure to specify an upper bound for those loops. Also, if your API is flaky or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
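Here is a rough Node.js sketch of that loop pattern with an upper bound and exponential backoff (the page-count field, the page query parameter, and the saveToSql helper are assumptions about your API and code, not a definitive implementation):

// Sketch of the pagination pattern described above (Node 18+ fetch assumed).
// totalPages, the page parameter and saveToSql are assumptions about your API/code.
const MAX_PAGES = 1000; // always set an upper bound on the loop

async function fetchWithRetry(url, attempts = 5) {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(url);
    if (res.ok) return res.json();
    await new Promise((r) => setTimeout(r, 1000 * 2 ** i)); // backoff: 1s, 2s, 4s, ...
  }
  throw new Error(`Giving up on ${url}`);
}

async function saveToSql(rows) {
  // placeholder: insert the rows into Azure SQL here
}

async function loadAllPages(baseUrl) {
  const first = await fetchWithRetry(`${baseUrl}?page=1`);
  const totalPages = Math.min(first.totalPages ?? MAX_PAGES, MAX_PAGES);
  await saveToSql(first.items);
  for (let page = 2; page <= totalPages; page++) {
    const data = await fetchWithRetry(`${baseUrl}?page=${page}`);
    await saveToSql(data.items); // get and save in each iteration
  }
}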
Azure SQL JSON Indexed Calculated Columns
You mentioned storing your data as JSON files in a storage container. Are you sure you need that? If so, then you could create an external table link between the storage container and the database. That has the advantage of not having the data take up any space in the database. However, if the JSON will fit in the database, I would highly recommend dropping that JSON right into the SQL database and leveraging indexed calculated columns to make querying the JSON extremely quick.
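If you go that route, here is a rough sketch from Node using the mssql package; the table, column names and JSON path are made-up examples, not your schema:

// Sketch: promote a JSON property to an indexed computed column (mssql npm package).
// Table name, column names and JSON path are made-up examples.
const sql = require("mssql");

async function addIndexedJsonColumn() {
  await sql.connect(process.env.SQL_CONNECTION_STRING);
  await sql.query`ALTER TABLE dbo.ApiPayloads
                  ADD CustomerName AS JSON_VALUE(Payload, '$.customer.name')`;
  await sql.query`CREATE INDEX IX_ApiPayloads_CustomerName
                  ON dbo.ApiPayloads (CustomerName)`;
  await sql.close();
}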
Using this pairing should provide incredible performance per penny value! Let us know what you end up using.
Maybe you can create a timed task with SQL Server Agent.
SQL Server Agent -> New Job -> Steps -> New Step:
In the Command box, put in your 'Import JSON documents from Azure Blob Storage' SQL statements, for example.
Schedules -> New Schedule:
Set the execution time.
But I think an Azure Function is better for you to do this. Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.
If you could set the default top N values in your API, then you could use a Web Activity in Azure Data Factory to call your REST API and get the response data. Then configure the response data as the input of a Copy Activity (@activity('ActivityName').output) and the SQL database as the output. Please see this thread: Use output from Web Activity call as variable.
The Web Activity supports authentication properties for your access token.
Also I am thinking of pushing all JSON into Blob store and then
parsing data from the JSON into SQL Azure. Any recommendations?
Well, if you could dump the data into Blob storage, then Azure Stream Analytics is the perfect choice for you.
You could run a daily job to select or parse the JSON data with ASA SQL, then dump the data into the SQL database. Please see this official sample.
One thing to consider for scale would be to parallelize both the query and the processing: if there is no ordering requirement, if processing all records would take longer than the 10-minute function timeout, if you want to do some tweaking/transformation of the data in flight, if you have different destinations for different types of data, if you want to be insulated from a failure (e.g., your function fails halfway through processing and you don't want to re-query the API), or if you get data a different way and want to start processing at a specific step in the process rather than running from the entry point. All sorts of reasons.
I'll caveat here to say that the best degree of parallelism vs complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split specific steps and combine them into a single one. Durable Functions also help make orchestration of this potentially easier.
- A timer-driven function queries the API to understand the depth of pages required, or queues up the additional pages for a second function that actually makes the paged API call.
- That function then queries the API and writes to a scratch area (like Blob) or drops each row into a queue to be written/processed (e.g., something like a storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested, e.g., pub/sub).
- If writing to a scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a storage queue, since a storage queue would be cheap and fast for something like this).
- Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process, with a correctly-formatted message. If your processors encounter bad data, things like poison queues/dead letter queues would help with exception cases, so instead of your entire process dying, you can manually remediate the bad data.
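As a very rough sketch of the fan-out pieces in Node (the queue name, message shape, and the fetchPage/saveRow helpers are placeholders, and the queue-triggered handler is shown in the classic function.json style):

// Rough sketch of the fan-out: a timer function enqueues one message per page,
// and a queue-triggered function processes one page per message.
// Queue name, message shape, fetchPage and saveRow are placeholders.
const { QueueClient } = require("@azure/storage-queue");

// Called from the timer-triggered function once the page count is known.
async function enqueuePages(totalPages) {
  const queue = new QueueClient(process.env.AzureWebJobsStorage, "pages-to-fetch");
  await queue.createIfNotExists();
  for (let page = 1; page <= totalPages; page++) {
    // base64-encode because the queue trigger expects base64 messages by default
    await queue.sendMessage(Buffer.from(JSON.stringify({ page })).toString("base64"));
  }
}

// Queue-triggered function bound to "pages-to-fetch" in function.json.
module.exports = async function (context, pageMessage) {
  const { page } = pageMessage;        // JSON payloads are deserialized by the runtime
  const data = await fetchPage(page);  // placeholder: your paged API call
  for (const row of data.items) {
    await saveRow(row);                // placeholder: write to SQL or the next queue
  }
};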
I want to build a Node.js application to scrape data from a website every 20 minutes and store it in Firebase. Can you please tell me which Google product (Compute Engine, App Engine, or Cloud Functions) is effective for this requirement? Below are the things I am expecting to do:
1. Run Node.js and cheerio to scrape data from a website and store it in Firebase.
2. Schedule it to run every 20 minutes initially; later I may change it to 30 minutes or 1 hour.
After reading the docs, I know that there are many ways to implement this, but I am looking for a cost/resource-effective way.
Pointers and ideas would be good.
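To make it concrete, this is roughly the scrape-and-store step I have in mind (the URL, CSS selectors and database path below are just examples):

// Rough sketch of the scrape-and-store step.
// The URL, CSS selectors and database path are just examples.
const axios = require("axios");
const cheerio = require("cheerio");
const admin = require("firebase-admin");

admin.initializeApp(); // assumes default credentials / databaseURL are configured

async function scrapeOnce() {
  const { data: html } = await axios.get("https://example.com/listings");
  const $ = cheerio.load(html);
  const items = $(".listing")
    .map((i, el) => ({
      name: $(el).find(".name").text().trim(),
      price: $(el).find(".price").text().trim(),
    }))
    .get();
  await admin.database().ref("scrapes/latest").set({ items, scrapedAt: Date.now() });
}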
Host the Node.js application within App Engine[1], as Cloud Functions are event-driven[2]. You can use the App Engine standard[3] or App Engine flexible[4] environment. For the scheduling part, Google Cloud Platform has a Cron Service[5], and you can create a cron job for your task that hits App Engine[6]. You can find a sample design here[7].
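As a rough sketch of the App Engine side, the Cron Service just makes an HTTP request to a route in your app, and App Engine adds an X-Appengine-Cron header to those requests so you can reject outside traffic (the route name and runScrape helper are placeholders):

// Rough sketch of the App Engine endpoint hit by the Cron Service.
// The /tasks/scrape route and runScrape() are placeholders.
const express = require("express");
const app = express();

app.get("/tasks/scrape", async (req, res) => {
  // App Engine sets this header on requests coming from the Cron Service.
  if (req.get("X-Appengine-Cron") !== "true") {
    return res.status(403).send("Forbidden");
  }
  await runScrape(); // placeholder: cheerio + Firebase logic
  res.status(200).send("ok");
});

app.listen(process.env.PORT || 8080);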
It depends on how much time your script spends waiting on requests. During that time the script is idle but you're getting charged at a super-high rate.
If you're doing a lot of concurrency, then I would say do it with Cloud Functions. Another pro of doing it that way is your IP won't get blocked, because it will be different every time.
Regarding scheduling, I'm not sure if Google lets you do that, but I know AWS does.
A cost-effective/simple way would be to use cronjob.org and have it send an HTTP request to your Cloud Function's URL to trigger it. If you're worried about other people triggering it, tell your cron job to send an HTTP header with an API key, and check this API key in your Cloud Function code to verify cronjob.org sent the request. I don't think it gets any easier or cheaper than this.
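For example, the check inside an HTTP-triggered Cloud Function could look something like this (the header name, environment variable and scrapeAndStore helper are placeholders you'd pick yourself):

// Rough sketch of an HTTP-triggered Cloud Function guarded by an API key header.
// The header name, env var and scrapeAndStore() are placeholders.
exports.scrape = async (req, res) => {
  if (req.get("x-api-key") !== process.env.SCRAPE_API_KEY) {
    return res.status(403).send("Forbidden");
  }
  await scrapeAndStore(); // placeholder: cheerio + Firebase logic
  res.status(200).send("ok");
};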
I am struggling to work out something that seems like it would be so simple.
Here is some context:
I have a web app which has 6 graphs powered by D3, and this data is stored in one table in DynamoDB. I am using AWS and Node.js with the AWS SDK.
I need to have the graphs updating in real-time when new information is added.
I currently have it set so that the scan function runs every 30 seconds for each graph; however, when I have multiple users, the database gets hit so many times that it maxes out the reads.
I want it so that when data in the database is updated, the server saves that data to a document that the users can poll instead of the database itself, and that document simply updates when new info is added to the database.
Basically, I want some way to have DynamoDB scanned only when there is new information.
I was looking into using Streams; however, I am completely lost on where to start and whether that is the best approach to take.
You would want to configure a DynamoDB Stream of your table to trigger something like an AWS Lambda function. That function could then scan the table and generate your new file and store it somewhere like S3.
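A rough sketch of that Lambda in Node could look like this (the table, bucket and key names are placeholders):

// Rough sketch of a stream-triggered Lambda that rebuilds a cached document in S3.
// Table name, bucket and key are placeholders.
const AWS = require("aws-sdk");
const dynamo = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();

exports.handler = async (event) => {
  // event holds the stream records; here we simply rebuild the full snapshot.
  // (For large tables you would page the scan with LastEvaluatedKey.)
  const result = await dynamo.scan({ TableName: "GraphData" }).promise();
  await s3
    .putObject({
      Bucket: "my-graph-cache",
      Key: "graphs/latest.json",
      Body: JSON.stringify(result.Items),
      ContentType: "application/json",
    })
    .promise();
};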