I'm trying to choose between two patterns (or maybe another one I have yet to consider) for handling logging in my application.
I have a Node.js Express server serving clients in an auto-scaling group.
The goal is to be able to see each user's activity easily so that I can troubleshoot in production.
Approach 1: centralized logging, using ELK to query on JSON fields such as customerId, requestId, etc.
Approach 2: create a log file per customer and query each file as needed.
In both approaches, log files will be rotated.
Creating a log file per customer just doesn't feel right to me, especially when considering the scenario of having millions of customers. But in terms of performance, the comparison is:
search a million files by customer ID, then query the much smaller matching file for the information you need
OR
query a centralized log store, filtering the results by customerId, etc.
Is one approach significantly better in performance than the other? What is the current industry best practice for this scenario, and is there a better approach to consider?
Lastly, AWS services seem to charge based on the size of the files you are querying. As such, would one approach be more cost-effective than the other?
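For reference, this is roughly what I picture Approach 1 looking like in the Express app - a sketch only, using the pino JSON logger, with customerId/requestId as the fields I'd filter on in ELK (the header name and route are made up):

// Sketch of Approach 1: one structured JSON log stream, queryable in ELK by field.
const express = require('express');
const pino = require('pino');
const { randomUUID } = require('crypto');

const app = express();
const logger = pino(); // writes JSON lines to stdout; a shipper (e.g. Filebeat) forwards them to ELK

app.use((req, res, next) => {
  // Attach a per-request child logger carrying the fields to filter on.
  req.log = logger.child({
    requestId: randomUUID(),
    customerId: req.get('x-customer-id') || 'anonymous', // hypothetical header
  });
  req.log.info({ method: req.method, path: req.path }, 'request received');
  next();
});

app.get('/orders', (req, res) => {
  req.log.info('listing orders'); // customerId/requestId are inherited automatically
  res.json([]);
});

app.listen(3000);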
My needs are the following:
- I need to fetch data from a 3rd-party API into SQL Azure.
- The APIs will be queried every day for incremental data, and pagination may be required, as by default any API response will give only the top N records.
- The API also needs an auth token, which is obtained in the first call before we start downloading data from the endpoints.
Due to the last two reasons, I've opted for a Function App that will be triggered daily, rather than Data Factory, which can also query web APIs.
Is there a better way to do this?
Also, I am thinking of pushing all the JSON into Blob storage and then parsing the data from the JSON into SQL Azure. Any recommendations?
How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the JSON data directly into a SQL database.
Azure Function
Azure Functions are very cost-effective. The first million executions are free. If it takes longer than ten minutes, have a look at Durable Functions. For handling pagination, there are plenty of examples; your exact solution will depend on the API you are calling and the language you are using, but whether you use C# with HttpClient or Python with Requests, the pattern is similar: get the total number of pages from the API, set a variable to that value, and loop over the pages, getting and saving your data in each iteration. If the API won't provide the max number of pages, then loop until you get an error. Pro tip: make sure to specify an upper bound for those loops. Also, if your API is flaky or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
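As a rough sketch of that pattern in Node.js (the endpoint URL, page parameter, and response fields below are placeholders - adapt them to your API):

// Pagination with a bounded loop and exponential-backoff retries.
const axios = require('axios');

const MAX_PAGES = 500;   // pro tip: always bound the loop
const MAX_RETRIES = 5;

async function getWithBackoff(url, config) {
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      return await axios.get(url, config);
    } catch (err) {
      if (attempt === MAX_RETRIES - 1) throw err;
      // Exponential backoff: 1s, 2s, 4s, 8s ...
      await new Promise(r => setTimeout(r, 1000 * 2 ** attempt));
    }
  }
}

async function downloadAllPages(token) {
  const headers = { Authorization: `Bearer ${token}` };
  const first = await getWithBackoff('https://api.example.com/records?page=1', { headers });
  const totalPages = Math.min(first.data.totalPages ?? MAX_PAGES, MAX_PAGES);

  let rows = [...first.data.items];
  for (let page = 2; page <= totalPages; page++) {
    const res = await getWithBackoff(`https://api.example.com/records?page=${page}`, { headers });
    rows = rows.concat(res.data.items);
    // saving to SQL could also happen here, once per iteration
  }
  return rows;
}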
Azure SQL JSON Indexed Calculated Columns
You mentioned storing your data as JSON files in a storage container. Are you sure you need that? If so, then you could create an external table link between the storage container and the database. That has the advantage of not having the data take up any space in the database. However, if the JSON will fit in the database, I would highly recommend dropping that JSON right into the SQL database and leveraging indexed calculated columns to make querying the JSON extremely quick.
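As a rough illustration of that pairing from a Node.js function using the mssql package (the table, column names, and JSON path are placeholders, not a prescription):

// Sketch only: drop raw JSON into Azure SQL and index a property pulled out of it.
const sql = require('mssql');

async function run(rawJsonString) {
  const pool = await sql.connect(process.env.SQL_CONN_STRING);

  // One-time setup: raw JSON column plus a calculated column over it.
  await pool.request().batch(`
    CREATE TABLE dbo.ApiPayloads (
      Id INT IDENTITY PRIMARY KEY,
      Payload NVARCHAR(MAX) NOT NULL,
      CustomerId AS JSON_VALUE(Payload, '$.customerId')  -- calculated column
    )`);
  // Index the calculated column so filters on it don't parse the JSON per row.
  await pool.request().batch(
    'CREATE INDEX IX_ApiPayloads_CustomerId ON dbo.ApiPayloads (CustomerId)');

  // Insert the raw JSON exactly as it came back from the API.
  await pool.request()
    .input('payload', sql.NVarChar(sql.MAX), rawJsonString)
    .query('INSERT INTO dbo.ApiPayloads (Payload) VALUES (@payload)');

  // Queries on CustomerId now hit the index instead of scanning the JSON.
  const result = await pool.request()
    .input('cust', sql.NVarChar, '12345')
    .query('SELECT Payload FROM dbo.ApiPayloads WHERE CustomerId = @cust');
  return result.recordset;
}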
Using this pairing should provide incredible performance per penny value! Let us know what you end up using.
Alternatively, you could create a scheduled task with SQL Server Agent.
SQL Server Agent -- New Job -- Steps -- New Step:
In the Command box, put your "Import JSON documents from Azure Blob Storage" SQL statements, for example.
Schedules -- New Schedule:
Set the execution time.
But I think an Azure Function is better for you to do this. Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.
If you could set the default top N values in your API, then you could use a Web Activity in Azure Data Factory to call your REST API and get the response data. Then configure the response data as the input of a Copy Activity (@activity('ActivityName').output) and the SQL database as the output. Please see this thread: Use output from Web Activity call as variable.
The Web Activity supports authentication properties for your access token.
Also I am thinking of pushing all JSON into Blob store and then parsing data from the JSON into SQL Azure. Any recommendations?
Well, if you could dump the data into Blob storage, then Azure Stream Analytics is the perfect choice for you.
You could run a daily job that selects or parses the JSON data with ASA SQL, then dumps the data into the SQL database. Please see this official sample.
One thing to consider for scale would be to parallelize both the query and the processing. It helps if there is no ordering requirement, or if processing all records would take longer than the 10-minute function timeout. Or if you want to do some tweaking/transformation of the data in flight, or if you have different destinations for different types of data. Or if you want to be insulated from a failure - e.g., your function fails halfway through processing and you don't want to re-query the API. Or if you get data a different way and want to start processing at a specific step in the process (rather than running from the entry point). All sorts of reasons.
I'll caveat here to say that the best degree of parallelism vs complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split specific steps and combine them into a single one. Durable Functions also help make orchestration of this potentially easier.
- A timer-driven function queries the API to understand the depth of pages required, or queues up additional pages for a second function that actually makes the paged API calls.
- That function then queries the API and either writes to a scratch area (like Blob) or drops each row into a queue to be written/processed (e.g., a storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested, i.e. pub/sub).
- If writing to a scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a storage queue, since a storage queue would be cheap and fast for something like this).
- Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process, with a correctly-formatted message. If your processors encounter bad data, things like poison queues/dead letter queues would help with exception cases, so instead of your entire process dying, you can manually remediate the bad data.
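As a small sketch of the first step (the queue name, endpoint, and response shape are assumptions), a timer-triggered Node.js function can look up the page count and fan the pages out onto a queue for the next function:

// pages-fanout/index.js - timer trigger that queues one message per page.
// function.json would declare a timer trigger plus a queue output binding named "pageQueue".
const axios = require('axios');

module.exports = async function (context, myTimer) {
  // Ask the API how many pages exist today (hypothetical endpoint/shape).
  const meta = await axios.get('https://api.example.com/records/meta');
  const totalPages = meta.data.totalPages;

  // One queue message per page; the queue-triggered worker makes the paged call.
  context.bindings.pageQueue = Array.from(
    { length: totalPages },
    (_, i) => JSON.stringify({ page: i + 1 })
  );

  context.log(`Queued ${totalPages} pages for processing`);
};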
I have two different Node projects that access the same database with Sequelize.
One of the Node apps (a kind of back office) updates some tables, and the other one uses the data in those tables to perform some operations.
The thing is that this data should not change constantly, and the second app needs to be as fast as possible; that's why the second app queries the tables once (when the app starts) and then stores the data in memory so it can do the operations faster (because there is no I/O to the database).
My problem is that sometimes this data may change through the first app, and as the two apps have no contact with each other (for security reasons), the only way I see is to have a "dirty" flag in some table of the database, have the first app change it after an update, and have the second app query every X seconds to check whether the flag has changed.
I don't like this approach, and that's why I'm posting this question: does Sequelize provide a better or fancier way to do this, like some kind of "changes/dirty" watcher?
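For reference, this is roughly the polling workaround I have in mind and would like to avoid (sketch only; the Settings model and the table being cached are made up):

// App 2 side: poll a dirty flag that App 1 sets after it updates the tables.
const { Sequelize, DataTypes } = require('sequelize');

const sequelize = new Sequelize(process.env.DATABASE_URL);
const Settings = sequelize.define('Settings', {
  key: { type: DataTypes.STRING, primaryKey: true },
  value: { type: DataTypes.STRING },
});

let cache = null;

async function reloadCache() {
  // re-query whatever tables App 1 writes to and rebuild the in-memory copy
  cache = await sequelize.query('SELECT * FROM products', {
    type: Sequelize.QueryTypes.SELECT,
  });
}

// Every X seconds, check whether App 1 flipped the flag and reload if so.
setInterval(async () => {
  const flag = await Settings.findByPk('cache_dirty');
  if (flag && flag.value === 'true') {
    await reloadCache();
    await flag.update({ value: 'false' });
  }
}, 10000);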
Thanks in advance
To start: I've tried Loopback. Loopback is nice but does not allow relations across multiple REST data services; rather, it makes a call to the initial data service and passes query parameters that ask it to perform the joined query.
Before I go reinventing the wheel and writing a massive wrapper around Loopback's loopback-rest-connector, I need to find out if there are any existing libraries or frameworks that already tackle this. My extensive Googling has turned up nothing so far.
In a true microservice environment, there is a service per database.
http://microservices.io/patterns/data/database-per-service.html
From this article:
Implementing queries that join data that is now in multiple databases is challenging. There are various solutions:
- Application-side joins - the application performs the join rather than the database. For example, a service (or the API gateway) could retrieve a customer and their orders by first retrieving the customer from the customer service and then querying the order service to return the customer's most recent orders.
- Command Query Responsibility Segregation (CQRS) - maintain one or more materialized views that contain data from multiple services. The views are kept by services that subscribe to events that each service publishes when it updates its data. For example, the online store could implement a query that finds customers in a particular region and their recent orders by maintaining a view that joins customers and orders. The view is updated by a service that subscribes to customer and order events.
EXAMPLE:
I have 2 data microservices:
GET /pets - Returns an object like
{
  "name": "ugly",
  "type": "dog",
  "owner": "chris"
}
and on a completely different microservice....
GET /owners/{OWNER_NAME} - Returns the owner info
{
  "owner": "chris",
  "address": "under a bridge",
  "phone": "123-456-7890"
}
And I have an API-level microservice that is going to call these two data services. This is the microservice where I will be applying this.
I'd like to be able to establish a model for Pet such that, when I query pets, upon a successful response from GET /pets it will "join" with owners (send a GET /owners/{OWNER_NAME} for each response) and, to the user, simply return a list of pets that includes their owner's data.
So GET /pets (maybe something like Pets.find()) would return
{
  "name": "ugly",
  "type": "dog",
  "owner": "chris",
  "address": "under a bridge",
  "phone": "123-456-7890"
}
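For illustration, the application-side join I'm describing would look something like this in the API-level service (the service base URLs are placeholders, and I'm assuming GET /pets returns a list):

// Application-side join in the API-level microservice.
const axios = require('axios');

const PETS_URL = 'http://pets-service/pets';
const OWNERS_URL = 'http://owners-service/owners';

async function findPetsWithOwners() {
  const { data: pets } = await axios.get(PETS_URL);

  // Fan out one owner lookup per distinct owner, then merge the fields.
  const ownerNames = [...new Set(pets.map(p => p.owner))];
  const owners = await Promise.all(
    ownerNames.map(name =>
      axios.get(`${OWNERS_URL}/${encodeURIComponent(name)}`).then(r => r.data)
    )
  );
  const ownersByName = Object.fromEntries(owners.map(o => [o.owner, o]));

  // Return the "joined" shape shown above: pet fields + owner's address/phone.
  return pets.map(pet => ({
    ...pet,
    address: ownersByName[pet.owner]?.address,
    phone: ownersByName[pet.owner]?.phone,
  }));
}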
Applying any model/domain logic in your API gateway is a bad decision and is considered bad practice. The API gateway should only handle your system's CAS (relying on the auth service, which holds the logic), convert incoming external requests into inner-system requests (different headers/requester payload data), proxy the formatted requests to the services for any other work, receive their responses, take care of encapsulating errors, and present every response in the proper external form.
Another point: if a lot of joins between two models are required for the application's core flow (validation/scoping, etc.), then perhaps you should reconsider which business domain your models/services are bound to. If it's the same domain, perhaps they should be together. The principles of Domain-Driven Design helped me understand where the real boundaries between microservices are.
If you work with Loopback (like we do - we faced the same problem: Loopback has no proper join implementation), you can have a separate report/combined-data service, which is the only one that can access all the service databases, and only for READ purposes - i.e., queries. Provide it with separately set-up, read-only, wide access to the databases: instead of having only one datasource set up (a single database), it should be able to read from all the databases that are in scope for this query-join DB user.
Such a service should be able to generate proper joins with the expected output schema from configuration JSON - like Loopback models (that's what I did in the same situation). Once the abstraction is done, it's pretty simple to build/add any query with any complex joins. It's clean, and it's easy to reason about. Also, it's DBA-friendly. For me, this approach has worked well so far.
I'm just getting started with CouchDB and looking for some best practices. My current project is a CMS/Wiki-like tool that contains many pages of content. So far, this seems to fit well with CouchDB. The next thing I want to do is track every time a page on the site is accessed.
Each access log should contain the timestamp, the URI of the page that was accessed, and the UUID of the user who accessed it. What is the best way to structure this access log information in CouchDB? It's likely that any given page will be accessed up to 100 times per day.
A couple of thoughts I've had so far:
- 1 CouchDB document per page, containing ALL of that page's access logs.
- 1 CouchDB document per log entry.
If it's one document per log, should all the logs be in their own CouchDB database to keep the main DB cleaner?
Definitely not the 1st option. Because CouchDB is append-only storage, each time you update a document, a new document with the same ID but a different revision is created. If you have 100 hits for a page in a day, 100 new revisions will be created, and as a result your database will quickly get huge. So it's better to use your second option.
As for a separate database for the logs, it depends on your data and how you plan to use it. You can create a separate view just for your logs if you decide to keep all your data in the same place.
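For example, a per-access document and a view over them might look like this (a sketch with the nano client; the database name and fields are just the ones from your question):

// One CouchDB document per access, written with the nano client.
const nano = require('nano')('http://localhost:5984');
const logs = nano.db.use('access_logs');

async function recordAccess(uri, userUuid) {
  await logs.insert({
    type: 'access',
    uri,                                  // page that was accessed
    user: userUuid,                       // UUID of the user
    timestamp: new Date().toISOString(),  // when it happened
  });
}

// A map function for a view that counts hits per page per day,
// e.g. emit(["/wiki/home", "2024-01-31"], 1), queried with group=true.
const designDoc = {
  _id: '_design/stats',
  views: {
    hits_by_page_and_day: {
      map: function (doc) {
        if (doc.type === 'access') {
          emit([doc.uri, doc.timestamp.slice(0, 10)], 1);
        }
      }.toString(),
      reduce: '_count',
    },
  },
};
// await logs.insert(designDoc);  // create the view once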
I want to use the Spotify API to create a web app. Without going into too much detail about the project, I want to clear up whether it would be against the terms and conditions or not.
After reading the terms and conditions, I found this line under things NOT to do: "aggregate Metadata to create data bases, or any other compilations of Metadata".
I don't plan to do any automated requests, for example hammering the service with different queries to build a database... I'm just wondering whether I can store the results from users who have performed searches via my application against the API, so that I can build content from my database on other parts of the application.
Thanks
I'm not a lawyer, so you'll need to have a lawyer confirm this (contracts, including ToS contracts, are important), but the general gist is that if you cache the results of user-generated requests to create features then you're ok. If you start caching stuff not generated by a user, you're in muddy water.
Good:
Other users who searched for "Madonna" in MyAwesomeApp also searched for "Backstreet Boys"!
Bad:
Here's a list of all the blue cover arts on Spotify: [list]
To generate the first example, you can cache and work with searches explicitly done by users of your application. The second would require scraping all of the cover art in the service, which isn't allowed.
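A rough sketch of the allowed kind of caching - recording only searches your users actually made, then deriving the "also searched for" feature from them (an in-memory map stands in for your database):

// Store only user-initiated searches, then build "also searched for"
// suggestions from co-occurrence.
const searchesByUser = new Map(); // userId -> Set of search terms

function recordUserSearch(userId, term) {
  if (!searchesByUser.has(userId)) searchesByUser.set(userId, new Set());
  searchesByUser.get(userId).add(term.toLowerCase());
}

function alsoSearchedFor(term) {
  const needle = term.toLowerCase();
  const counts = new Map();
  for (const terms of searchesByUser.values()) {
    if (!terms.has(needle)) continue;
    for (const other of terms) {
      if (other !== needle) counts.set(other, (counts.get(other) || 0) + 1);
    }
  }
  // Most frequent co-searches first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).map(([t]) => t);
}

// recordUserSearch('user-1', 'Madonna');
// recordUserSearch('user-1', 'Backstreet Boys');
// alsoSearchedFor('Madonna'); // -> ['backstreet boys']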