How can Hudi metrics be accessed programmatically? After a commit I would like to get metrics like records updated / records inserted and log them into a database.
I tried setting hoodie.metrics.on=true and hoodie.metrics.reporter.type=INMEMORY. But how can I get a HoodieMetrics object that contains the actual information?
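For context, here is a minimal PySpark sketch of the configuration described above, plus one possible way to get at the per-commit counts by reading the commit metadata that Hudi writes to the table's timeline. The table name, key fields, and base path are placeholders, and the timeline layout and field names vary across Hudi versions, so treat this as an assumption to verify rather than a confirmed API:

import glob
import json

# Placeholder table/key names and base path -- not taken from the question.
base_path = "/tmp/hudi/my_table"
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.metrics.on": "true",                  # metrics switch mentioned in the question
    "hoodie.metrics.reporter.type": "INMEMORY",   # reporter type mentioned in the question
}

(df.write.format("hudi")      # assumes an existing Spark session and DataFrame `df`
   .options(**hudi_options)
   .mode("append")
   .save(base_path))

# Hudi also persists per-commit write statistics in its timeline under
# <base_path>/.hoodie. For a copy-on-write table on a local filesystem the
# completed commits typically appear as <instant>.commit JSON files, but the
# exact layout and field names depend on the Hudi version, so verify first.
commits = sorted(glob.glob(f"{base_path}/.hoodie/*.commit"))
if commits:
    with open(commits[-1]) as f:
        commit_metadata = json.load(f)
    print(commit_metadata)  # inspect for insert/update counts, then log them to your database

Whether the INMEMORY reporter itself can be read back from the driver is exactly the open question above; the commit metadata on the timeline is simply another place where per-commit write statistics are recorded.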
I want to access the Databricks Audit Logs to check user activity. For example, the number of times that a table was viewed by a user.
I'd like to know if there is any way to get the logs as a Databricks table. I mean, saving the logs as a table (let's say a Delta table).
Also, I want it to work continuously; adding new logs to the table when a new event happens (not just one time).
I am not sure if I can use SQL for this purpose or not (instead of the REST API).
Any idea how to do that?
Enable audit logging in the Databricks workspace settings, then follow the steps below:
1. The log data is stored at /mnt/audit-logs, so create a table over that location using SQL:
CREATE TABLE audit_logs
USING delta
LOCATION '/mnt/audit-logs'
This creates a Delta table named audit_logs that points to the /mnt/audit-logs directory in the Databricks File System (DBFS).
2. Run a job or execute some code that generates audit logs in Databricks. The audit logs will be written to the /mnt/audit-logs directory.
3. Use the following code to continuously update the audit_logs Delta table with new audit log events:
INSERT INTO audit_logs
SELECT *
FROM delta.`/mnt/audit-logs`
This appends the events read from the /mnt/audit-logs directory into the audit_logs Delta table. Note that, as written, it does not skip events that have already been inserted, so rerunning it will duplicate rows unless you add deduplication (for example with a MERGE, or the streaming approach sketched after these steps).
4. Run the code from step 3 periodically (for example, using a Databricks job or a cron job) to keep the audit_logs Delta table up to date with new audit log events.
In this way you can maintain the audit logs as a continuously updated Delta table.
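If re-running the INSERT on a schedule is not enough, a Structured Streaming job is another way to keep a table continuously up to date, since the checkpoint tracks which files have already been processed. This is only a sketch under assumptions not in the original answer: it streams from the /mnt/audit-logs location into a separate managed table, and audit_logs_processed and the checkpoint path are placeholder names:

# Continuously append new audit-log events from the Delta location into a
# separate table. The checkpoint directory records which files were already
# read, so events are not inserted twice.
(spark.readStream
      .format("delta")
      .load("/mnt/audit-logs")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/audit-logs-checkpoint")   # placeholder path
      .toTable("audit_logs_processed"))                             # placeholder table name

Run this as a Databricks job (continuous or triggered) instead of the cron-style re-run described in step 4.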
I am struggling to work out something that seems like it would be so simple.
Here is some context:
I have a web app which has 6 graphs powered by D3, and this data is stored in one table in DynamoDB. I am using AWS and Node.js with the aws-sdk.
I need to have the graphs updating in real-time when new information is added.
I currently have it set up so that the scan function runs every 30 seconds for each graph; however, when I have multiple users this hits the database so often that it maxes out the read capacity.
I want it so that when data in the database is updated, the server saves that data to a document that users can poll instead of the database itself, and that document is simply updated when new info is added to the database.
Basically, any way to have it where dynamodb is only scanned when there is new information.
I was looking into using streams however I am completely lost on where to start and if that is the best approach to take.
You would want to configure a DynamoDB Stream on your table to trigger something like an AWS Lambda function. That function could then scan the table, generate your new file, and store it somewhere like S3.
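For illustration, here is a minimal Python sketch of that pattern: a Lambda handler triggered by the stream that re-scans the table and publishes a JSON snapshot to S3. The table name, bucket, and key are placeholders, not from the question:

import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE_NAME = "graph-data"        # placeholder
BUCKET = "graph-snapshot-cache"  # placeholder
KEY = "graphs/latest.json"       # placeholder

def handler(event, context):
    # The stream event signals that something changed; re-scan the table and
    # publish a fresh snapshot that the web app can poll instead of DynamoDB.
    table = dynamodb.Table(TABLE_NAME)
    response = table.scan()
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    s3.put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=json.dumps(items, default=str),  # default=str handles DynamoDB Decimals
        ContentType="application/json",
    )

The clients then fetch graphs/latest.json (for example via your web server or a CDN), so DynamoDB is only scanned when the data actually changes.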
I'm just getting started with CouchDB and looking for some best practices. My current project is a CMS/Wiki-like tool that contains many pages of content. So far, this seems to fit well with CouchDB. The next thing I want to do is track every time a page on the site is accessed.
Each access log should contain the timestamp, the URI of the page that was accessed and the UUID of the user who accessed it. How is the best way to structure this access log information in CouchDB? It's likely that any given page will be accessed up to 100 times per day.
A couple of thoughts I've had so far:
1 CouchDB document per page that contains ALL access logs.
1 CouchDB document per log.
If it's one document per log, should all the logs be in their own CouchDB database to keep the main DB cleaner?
Definitely not the first option. Because CouchDB is an append-only store, each time you update a document a new revision with the same ID is created. If a page gets 100 hits in a day, that's 100 new revisions, and the database will quickly get huge. So it's better to use your second option.
As for a separate database for the logs, it depends on your data and how you plan to use it. You can create a separate view just for the logs if you decide to keep all your data in the same place.
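Here is a minimal Python sketch of the one-document-per-log approach, writing a log document per page view through CouchDB's HTTP API (host, database name, and field names are placeholders; add authentication as needed):

import datetime
import requests

COUCH_URL = "http://localhost:5984"  # placeholder host
LOG_DB = "access_logs"               # placeholder: a separate database for logs

def record_access(page_uri, user_uuid):
    doc = {
        "type": "access_log",
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "uri": page_uri,
        "user": user_uuid,
    }
    # POSTing to the database creates a new document with a server-assigned ID,
    # so every page view becomes its own small document.
    resp = requests.post(f"{COUCH_URL}/{LOG_DB}", json=doc)
    resp.raise_for_status()
    return resp.json()

A view keyed on uri (or [uri, timestamp]) can then aggregate hits per page, whether the logs live in their own database or alongside the content documents.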
I have set up an Azure website with a SQL Azure back-end. I used a migration tool to populate a single table with 80,000 rows of data. During the data migration I could access the new data via the website without any issues. Since the migration completed I keep getting an exception: [Win32Exception (0x80004005): The wait operation timed out].
This exception suggests to me that the database queries I am doing are taking more than 30 seconds to return. If I query the database from Visual Studio I can confirm that the queries are taking more than 30 seconds to return. I have indexes on my filter columns and on my local SQL database my queries take less than a second to return. Each row does contain a varchar(max) column that stores json which means that a bit of data is held in each row, but this shouldn't really affect the query performance.
Any input that could help me solve this issue would be much appreciated.
I seem to have gotten around the query timeout issues for now. What appeared to do the trick for me was updating the SQL Server statistics:
EXEC sp_updatestats;
Another performance enhancement that worked well was to enable JSON compression on my Azure website.
I have a very large document store - about 50 million JSON docs, with 50m more added per year. Each is about 10K. I would like to store them in cloud storage and retrieve them via a couple of structured metadata indices that I would update as I add documents to the store.
It looks like AWS S3, Google Cloud Storage and Azure allow custom metadata to be returned with an object, but not used as part of a GET request to filter a collection of objects.
Is there a good solution "out-of-the-box" for this? I can't find any, but it seems like my use case shouldn't really be unusual. I don't need to query by document attributes or to return partial documents, I just need to GET a collection of documents by filtering on a handful of metadata fields.
The AWS SimpleDB page mentions "Indexing Amazon S3 Object Metadata" as a use case, and links to a library that hasn't been updated since 2009.
They are simply saying that you can store and query the metadata in Amazon SimpleDB, which is a NoSQL database provided by Amazon. Depending on the kind of metadata you have, you could also store it in an RDBMS. A few hundred million rows isn't too much if you create the proper indices, and you can store URLs or file names to access the files stored on S3, Azure, etc. afterwards.
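For illustration, here is a minimal Python sketch of that pattern: the documents themselves live in S3 and the queryable metadata lives in a separate indexed store. SQLite stands in here for whatever RDBMS or NoSQL database you pick; the bucket, table, and column names are placeholders:

import json
import sqlite3
import boto3

s3 = boto3.client("s3")
BUCKET = "my-doc-store"  # placeholder bucket name

db = sqlite3.connect("metadata.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS documents ("
    "  s3_key TEXT PRIMARY KEY,"
    "  doc_type TEXT,"
    "  created_at TEXT)"
)
db.execute("CREATE INDEX IF NOT EXISTS idx_type_date ON documents (doc_type, created_at)")

def put_document(key, doc, doc_type, created_at):
    # Store the document body in S3 and only its metadata in the index.
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(doc))
    db.execute(
        "INSERT OR REPLACE INTO documents (s3_key, doc_type, created_at) VALUES (?, ?, ?)",
        (key, doc_type, created_at),
    )
    db.commit()

def get_documents(doc_type, since):
    # Filter on the metadata index first, then fetch only the matching objects.
    rows = db.execute(
        "SELECT s3_key FROM documents WHERE doc_type = ? AND created_at >= ?",
        (doc_type, since),
    )
    for (key,) in rows:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        yield json.loads(obj["Body"].read())

At a few hundred million rows the metadata table is still comfortably within what an indexed relational table can handle, which is the point the answer above is making.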