How to test data retention policy and cron job for deleting old data?

My company has retention policies for the data it uses. Different types of data have different rules, and these data types and files are used across different microservices and databases.
For example, file type "A" should be deleted after 30 days, file type "B" after 100 days, and file type "C" after 7 days.
We have a cron job that should delete these files for each microservice. I want to test this cron job (automated, not manual) and make sure our data is deleted at the right time. I need to do it with the bare minimum amount of work (I'm in a hurry), and I don't want to use an approach that could expose us to liability or security issues.
What is the best approach?
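One low-effort way to approach this, sketched under assumptions: keep the "is this file old enough to delete?" decision in a pure function that takes the current time as an argument, and test it against synthetic records with a fixed clock. The names `RETENTION_DAYS`, `is_expired` and `select_expired` are hypothetical, and the sketch assumes a file becomes deletable at exactly N days old; the real cron job would call the same function with the real clock.

```python
from datetime import datetime, timedelta, timezone

# Retention periods from the question; "A", "B", "C" are its placeholder type names.
RETENTION_DAYS = {"A": 30, "B": 100, "C": 7}


def is_expired(file_type, created_at, now):
    """True once a file has reached the retention period for its type.

    Assumption: a file becomes deletable at exactly N days old (>=).
    """
    return now - created_at >= timedelta(days=RETENTION_DAYS[file_type])


def select_expired(files, now):
    """files: iterable of (name, file_type, created_at); returns names to delete."""
    return [name for name, ftype, created in files if is_expired(ftype, created, now)]


# Synthetic records only: no production data is read or deleted.
NOW = datetime(2024, 1, 31, tzinfo=timezone.utc)
FILES = [
    ("a_old", "A", NOW - timedelta(days=30)),
    ("a_new", "A", NOW - timedelta(days=29)),
    ("b_old", "B", NOW - timedelta(days=101)),
    ("c_new", "C", NOW - timedelta(days=6)),
]
print(select_expired(FILES, NOW))
```

Because the clock is injected, the test never has to wait 30 or 100 days and never touches real files, which keeps the liability and security exposure low.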

Related

Is it bad to run cron jobs to poll from a huge table of scheduled job records?

I have a table that a cron job polls every minute to send out messages to other services. The records in the table are essentially activities that are scheduled to run at a certain time. The cron job simply checks which of those activities are ready to run and sends a message for each activity through SQS to the other services.
When the cron job finds an activity that is ready to run, that record is marked as done after the message is sent through SQS. There is an API which allows other services to check whether a scheduled activity has already been done, so I need to keep a history of those done records.
My concern, however, is whether a design like this is scalable in the long run. There are around 200k scheduled activities a day, or even more on some days. Since I'm keeping the records by marking them as done after they are completed, I'm worried that the table will eventually grow to tens of millions of rows and become a problem for a cron job that runs this frequently.
Even with a properly indexed table, is my concern valid? Otherwise, what alternatives are there if I have to persist those scheduled activities for a cron job or something similar to poll and check when they are ready to run?
I'm using a Postgres database.
As long as the number of rows that the cron job's query has to fetch stays constant and you can use an index, the size of the table won't matter.
Index scans are O(n) with respect to the number of rows scanned and O(log(n)) with respect to the table size. To be more specific, increasing the table size by a factor between 10 and 200 (smaller size of the index key leads to better fan-out) will make an index scan use one more block, and that block is normally cached.
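To make the logarithmic growth concrete, here is a minimal sketch of an idealized B-tree in which every page holds the same number of entries. This is a simplified model, not Postgres's actual page layout; `index_levels` and the fan-out of 200 are assumptions chosen for illustration.

```python
def index_levels(n_entries, fanout):
    """Levels of an idealized B-tree whose pages each hold `fanout` entries."""
    levels, capacity = 1, fanout
    while capacity < n_entries:
        capacity *= fanout  # one more level multiplies the reachable entries
        levels += 1
    return levels


# Each step multiplies the table size by 200 and adds exactly one level.
for rows in (200_000, 40_000_000, 8_000_000_000):
    print(rows, index_levels(rows, fanout=200))
```

In this model, going from 200 thousand to 40 million rows costs one extra page read per lookup.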
If the table gets large, you might still want to consider partitioning, but mostly so that you can get rid of old data efficiently.
With the right index, the cron job should have no serious problem. To keep the size of the index small, you can use a partial (filtered) index, like
create index on jobs (id) where status <> 'done';
The query's WHERE clause has to match the WHERE clause of the index.
I used (id) just because an empty list is not allowed and so something has to be there. Based on your comment, schedule_dt might be a better choice. If you include all the columns you select, you can get an index-only scan. But if you don't, it will still use the index; it just has to visit the table to fetch the columns for those specific rows. I suspect the index-only scan attempt won't be worth it to you, as the pages you need probably won't be marked all-visible, because modifications were made to neighboring tuples just one minute ago.
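The partial-index idea can be tried out locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example runs without a server; SQLite also supports partial indexes). The table layout, the index name `jobs_pending` and the `pending` status value are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, schedule_dt TEXT NOT NULL, status TEXT NOT NULL)"
)
# Partial index: only rows that are not yet done are indexed.
conn.execute("CREATE INDEX jobs_pending ON jobs (schedule_dt) WHERE status <> 'done'")
conn.executemany(
    "INSERT INTO jobs (id, schedule_dt, status) VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T10:00", "done"),
        (2, "2024-01-01T10:01", "pending"),
        (3, "2024-01-01T10:02", "pending"),
        (4, "2024-01-01T11:00", "pending"),
    ],
)

# The WHERE clause repeats the index predicate verbatim so the planner can match it.
DUE_SQL = (
    "SELECT id FROM jobs "
    "WHERE status <> 'done' AND schedule_dt <= ? "
    "ORDER BY schedule_dt"
)


def due_jobs(now):
    """Ids of jobs that are scheduled at or before `now` and not yet done."""
    return [row[0] for row in conn.execute(DUE_SQL, (now,))]


def query_plan(now):
    """Planner description for the polling query; the last column is the detail text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + DUE_SQL, (now,)).fetchall()
    return " ".join(row[-1] for row in rows)


print(due_jobs("2024-01-01T10:05"))
print(query_plan("2024-01-01T10:05"))
```

In Postgres the equivalent check would be EXPLAIN on the polling query; the point in both engines is that the index only ever contains the rows that are still pending.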
However, it does seem a bit odd to mark a job as done when it has only been scheduled, rather than being done.
"There is an API which allows other services to check whether a scheduled activity has already been done."
A table that increases in size without bound is likely to present management problems apart from the cron job. Surely the services aren't going to have to look back months in order to do this, are they? Could you delete 'done' jobs after a few days? What if a service tries to look up a job and rather than finding it 'done', it just doesn't find it at all?
I don't think the cron job is inherently a problem, but it would be cleaner not to have it. Why doesn't whoever inserts the job just invoke SQS in real time?

Preparing archive data for Stream Analytics Import

Before I had time to set up an ingestion strategy and process, I started collecting data that will eventually go through a Stream Analytics job. Now I'm sitting on an Azure blob storage container with over 500,000 blobs in it (no folder organization), another with 300,000 and a few others with 10,000 - 90,000.
The production collection process now writes these blobs to different containers in the YYYY-MM-DD/HH format, but that's only great going forward. This archived data I have is critical to get into my system and I'd like to just modify the inputs a bit for the existing production ASA job so I can leverage the same logic in the query, functions and other dependencies.
I know ASA doesn't like batches of more than a few hundred / thousand, so I'm trying to figure a way to stage my data in order to work well under ASA. This would be a one time run...
One idea was to write a script that looked at every blob, looked at the timestamp within the blob and re-create the YYYY-MM-DD/HH folder setup, but in my experience, the ASA job will fail when the blob's lastModified time doesn't match the folders it's in...
Any suggestions how to tackle this?
EDIT: Failed to mention (1) there are no folders in these containers... all blobs live at the root of the container and (2) the LastModifiedTime on the blobs is no longer useful or meaningful. The reason for the latter is that these blobs were collected from multiple other containers and merged together using the Azure CLI copy-batch command.
Can you please try the following?
1. Do this processing in two different jobs: one for the folders with date partitioning (say partitionedJob), and another for old blobs without any date partitioning (say RefillJob).
2. Since RefillJob has a fixed number of blobs, put a predicate on System.Timestamp to make sure that it only processes old events. Start this job with at least 6 SUs and run it until all the events have been processed. You can confirm by looking at LastOutputProcessedTime, by looking at the input event count, or by inspecting your output source. After this check, stop the job. This job is no longer needed.
3. Start the partitionedJob with timestamp > RefillJob. This assumes the folders for the timestamps exist.

Incremental load in Azure Data Lake

I have a big blob storage full of log files organized according to their identifiers at a number of levels: repository, branch, build number, build step number.
These are JSON files that contain an array of objects, each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.
Initial load
Now I am trying to load all of that data to ADL Store. I created a query that looks similar to this:
@entries =
    EXTRACT
        repo string,
        branch string,
        build int,
        step int,
        Line int,
        Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
    USING new MyJSONExtractor();
When I run this extraction query I get a compiler error - it exceeds the limit of 25 minutes of compilation time. My guess is: too many files. So I add a WHERE clause in the INSERT INTO query:
INSERT INTO Entries
    (Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == "myRepo") AND (branch == "master");
Still no luck - compiler times out.
(It does work, however, when I process a single build, leaving {step} as the only wildcard, and hard-coding the rest of names.)
Question: Is there a way to perform a load like that in a number of jobs - but without the need to explicitly (manually) "partition" the list of input files?
Incremental load
Let's assume for a moment that I succeeded in loading those files. However, a few days from now I'll need to perform an update - how am I supposed to specify the list of files? I have a SQL Server database where all the metadata is kept, and I could extract exact log file paths - but U-SQL's EXTRACT query forces me to provide a static string that specifies the input data.
A straightforward scenario would be to define a top-level directory for each date and process them day by day. But the way the system is designed makes this very difficult, if not impossible.
Question: Is there a way to identify files by their creation time? Or maybe there is a way to combine a query to a SQL Server database with the extraction query?
For your first question: Sounds like your FileSet pattern is generating a very large number of input files. To deal with that you may want to try the FileSets v2 preview which is documented under U-SQL Preview Features section in:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_04_24/USQL_Release_Notes_2017_04_24.md
Input File Set scales orders of magnitudes better (opt-in statement is now provided)
Previously, U-SQL's file set pattern on EXTRACT expressions ran into compile time time-outs around 800 to 5000 files.
U-SQL's file set pattern now scales to many more files and generates more efficient plans.
For example, a U-SQL script querying over 2500 files in our telemetry system previously took over 10 minutes to compile now compiles in 1 minute and the script now executes in 9 minutes instead of over 35 minutes using a lot less AUs. We also have compiled scripts that access 30'000 files.
The preview feature can be turned on by adding the following statement to your script:
SET @@FeaturePreviews = "FileSetV2Dot5:on";
If you wanted to generate multiple extract statements based on partitions of your filepaths, you'd have to do it with some external code that generates one or more U-SQL scripts.
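The "external code that generates U-SQL scripts" could look roughly like the sketch below. It assumes the list of (repo, branch, build) combinations comes from somewhere else, such as the metadata database mentioned in the question, and it emits one script per build with {step} as the only remaining wildcard, which is the shape the question reports as working. `USQL_TEMPLATE` and `generate_scripts` are hypothetical names, and the generated U-SQL is untested.

```python
# One script per build: repo, branch and build are hard-coded, {step} stays a wildcard.
USQL_TEMPLATE = '''@entries =
    EXTRACT step int,
            Line int,
            Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{{step}}.json"
    USING new MyJSONExtractor();

INSERT INTO Entries (Repo, Branch, Build, Step, Line, Entry)
SELECT "{repo}" AS Repo, "{branch}" AS Branch, {build} AS Build, step AS Step, Line, Entry
FROM @entries;
'''


def generate_scripts(builds):
    """builds: iterable of (repo, branch, build_number); returns one script per build."""
    unique_builds = sorted(set(builds))  # drop duplicates, keep a stable order
    return [
        USQL_TEMPLATE.format(repo=repo, branch=branch, build=build)
        for repo, branch, build in unique_builds
    ]


scripts = generate_scripts([("myRepo", "master", 2), ("myRepo", "master", 1)])
print(scripts[0])
```

Each generated script can then be submitted as its own job, so no single compilation has to enumerate the whole container.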
I don't have a good answer to your second question so I will get a colleague to respond. Hopefully the first part can get you unblocked for now.
To address your second question:
You could read your data from the SQL Server database using a federated query, and then use the information in a join with the virtual columns that you create from the fileset. The problem with that is that the values are only known at execution time and not at compile time, so you would not get the reduction in the accessed files.
Alternatively, you could write a SQL query that gets you the data you need and then parameterize your U-SQL script so you can pass that information into the U-SQL script.
As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting the following feature request and adding a comment that you also want to query file properties over a fileset: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr

update 40+ million entities in azure table with many instances how to handle concurrency issues

So here is the problem. I need to update about 40 million entities in an Azure table. Doing this with a single instance (select -> delete original -> insert with new partitionkey) will take until about Christmas.
My thought is to use an Azure worker role with many instances running. The problem here is that the query grabs the top 1000 records. That's fine with one instance, but with 20 running, their selects will obviously overlap... a lot. This would result in a lot of wasted compute trying to delete records that were already deleted by another instance and updating records that have already been updated.
I've run through a few ideas, but the best option I have is to have the roles fill up a queue with partition and row keys, then have the workers dequeue and do the actual processing.
Any better ideas?
Very interesting question!!! Extending @Brian Reischl's answer (and a lot of it is thinking out loud, so please bear with me :))
Assumptions:
Your entities are serializable in some shape or form. I would assume that you'll get raw data in XML format.
You have one separate worker role which is doing all the reading of entities.
You know how many worker roles would be needed to write modified entities. For the sake of argument, let's assume it is 20 as you mentioned.
Possible Solution:
First you will create 20 blob containers. Let's name them container-00, container-01, ... container-19.
Then you start reading entities - 1000 at a time. Since you're getting raw data in XML format out of table storage, you create an XML file and store those 1000 entities in container-00. You fetch next set of entities and save them in XML format in container-01 and so on and so forth till the time you hit container-19. Then the next set of entities go into container-00. This way you're evenly distributing your entities across all the 20 containers.
Once all the entities are written, your worker role for processing these entities would come into picture. Since we know that instances in Windows Azure are sequentially ordered, you get instance names like WorkerRole_IN_0, WorkerRole_IN_1, ... and so on.
What you would do is take the instance name, get the number "0", "1" etc. Based on this you would determine which worker role instance will read from which blob container...WorkerRole_IN_0 will read files from container-00, WorkerRole_IN_1 will read files from container-01 and so on.
Now your individual worker role instance will read the XML file, create the entities from that XML file, update those entities and save it back into table storage. Once this process is done, you would then delete the XML file and you move on to next file in that container. Once all files are read and processed, you can just delete the container.
As I said earlier, this is very much a "thinking out loud" kind of solution, and some things must still be considered, such as what happens when the "reader" worker role goes down.
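The round-robin distribution and the instance-to-container mapping described above can be sketched as follows. This only covers the naming arithmetic, not the blob or table storage calls; the function names are hypothetical, and the instance id format is taken from the WorkerRole_IN_0 example in the answer.

```python
CONTAINER_COUNT = 20  # number of writer instances assumed in the answer


def container_name(index):
    """Zero-padded container name, e.g. 3 -> 'container-03'."""
    return f"container-{index:02d}"


def container_for_batch(batch_number):
    """Round-robin: batch 0 -> container-00, ..., batch 20 -> container-00 again."""
    return container_name(batch_number % CONTAINER_COUNT)


def container_for_instance(instance_id):
    """Map an instance name such as 'WorkerRole_IN_3' to 'container-03'."""
    instance_number = int(instance_id.rsplit("_", 1)[1])
    return container_name(instance_number % CONTAINER_COUNT)


print(container_for_batch(21), container_for_instance("WorkerRole_IN_1"))
```

Because each writer instance only ever reads its own container, no two instances can pick up the same batch of entities.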
If your PartitionKeys and/or RowKeys fall into a known range, you could attempt to divide them into disjoint sets of roughly equal size for each worker to handle. E.g., Worker1 handles keys starting with 'A' through 'C', Worker2 handles keys starting with 'D' through 'F', etc.
If that's not feasible, then your queuing solution would probably work. But again, I would suggest that each queue message represent a range of keys if possible. E.g., a single queue message specifies deleting everything in the range 'A' through 'C', or something like that.
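A minimal sketch of splitting a known key space into disjoint ranges, assuming keys start with an uppercase ASCII letter (an assumption for illustration; the real key format is not given). `split_key_ranges` and `worker_for_key` are hypothetical helpers; each resulting range could be assigned to a worker directly or sent as one queue message.

```python
import string


def split_key_ranges(alphabet, workers):
    """Split an ordered alphabet into up to `workers` contiguous (first, last) ranges."""
    base, extra = divmod(len(alphabet), workers)
    ranges, start = [], 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        if size == 0:
            break
        ranges.append((alphabet[start], alphabet[start + size - 1]))
        start += size
    return ranges


def worker_for_key(key, ranges):
    """Index of the range that owns a key, based on its first character."""
    first_char = key[0]
    for worker, (low, high) in enumerate(ranges):
        if low <= first_char <= high:
            return worker
    raise ValueError(f"key {key!r} is outside the known key space")


RANGES = split_key_ranges(string.ascii_uppercase, 8)
print(RANGES)
```

Equal-width ranges only balance the load if keys are spread evenly over the alphabet; with skewed keys the boundaries would need to come from the actual key distribution.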
In any case, if you have multiple entities in the same PartitionKey then use batch transactions to your advantage for both inserting and deleting. That could cut down the number of transactions by almost a factor of ten in the best case. You should also use parallelism within each worker role. Ideally use the async methods (either Begin/End or *Async) to do the writing, and run several transactions (12 is probably a good number) in parallel. You can also run multiple threads, but that's somewhat less efficient. In either case, a single worker can push a lot of transactions with table storage.
As a side note, your process should go "Select -> Insert New -> Delete Old". Going "Select -> Delete Old -> Insert New" could result in permanent data loss if a failure occurs between steps 2 & 3.
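The batching and ordering advice can be sketched without any storage SDK. The example below groups entities by PartitionKey, chunks each group to the Azure Table storage limit of 100 entities per batch, and always inserts the re-keyed copies before deleting the originals. `insert_batch`, `delete_batch` and `rekey` are hypothetical callables standing in for the real storage operations.

```python
from collections import defaultdict

MAX_BATCH_SIZE = 100  # Azure Table storage limit for one entity group transaction


def plan_batches(entities):
    """Group entities by PartitionKey, then chunk each group to the batch limit."""
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)
    batches = []
    for partition_key in sorted(by_partition):
        group = by_partition[partition_key]
        for start in range(0, len(group), MAX_BATCH_SIZE):
            batches.append(group[start:start + MAX_BATCH_SIZE])
    return batches


def migrate(batches, insert_batch, delete_batch, rekey):
    """Insert re-keyed copies first; delete originals only after the inserts succeed."""
    for batch in batches:
        new_entities = [rekey(entity) for entity in batch]
        # Re-keyed entities may land in several partitions, so batch them again.
        for new_batch in plan_batches(new_entities):
            insert_batch(new_batch)
        delete_batch(batch)
```

If an insert raises, the loop stops before the matching delete, so a failure leaves a duplicate rather than a lost entity.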
I think you should mark your question as the answer ;) I can't think of a better solution since I don't know what your partition and row keys look like. But to enhance your solution, you may choose to pump multiple partition/row keys into each queue message to save on transaction cost. Also, when consuming from the queue, get the messages in batches of 32 and process them asynchronously. I was able to transfer 170 million records from SQL Server (Azure) to Table storage in less than a day.

Lotus notes agent runs slower in server compared to development PC

I have an attendance recording system that has 2 databases, one for current, another for archiving. The server processes attendance records, and puts records marked completed into the archive. There is no processing done in the archive database.
Here's the issue. One of the requirements was to build a blank record for each staff member every day, into which attendance records are put. The agent that does this calls a few procedures and does some checking within the database. Currently, roughly 1,800 blank records are created daily. On the development PC, processing each record takes roughly 2 to 3 seconds, which translates to about an hour and a half in total. However, when we deployed it on the server, processing each record takes roughly 7 seconds, which translates into about 3 and a half hours to complete. We have had instances when the agent takes 4.5 to 5 hours to complete.
Note that in both instances, agents are scheduled. There are no other Lotus apps on the server, and the server is free and idle most of the time (no other application except Windows Server and Lotus Notes). Is there anything that could cause the additional processing time on the server compared with the development PC?
Your process is generating 1800 new documents every day, and you have said that you are also archiving documents regularly, so I presume that means that you are deleting them after you archive them. Performance problems can build up over time in applications like this. You probably have a large number of deletion stubs in the database, and the NSF file is probably highly fragmented (internally and/or externally).
You should use the free NotesPeek utility to examine the database and see how many deletion stubs it contains. Then you should check the purge interval setting and consider lowering it to the smallest value that you are comfortable with. (I.e., big enough so you know that all servers and users will replicate within that time, but small enough to avoid allowing a large buildup of deletion stubs.) If you change the purge interval, you can wait 24 hours for the stubs to be purged, or you can manually run updall against the database on the server console to force it.
Then you should run compact -c on the NSF file, and also run a defrag on the server disk volume where the NSF lives.
If these steps do improve your performance, then you may want to take steps in your code to prevent recurrence of the problem by using coding techniques that minimize deletion stubs, database growth and fragmentation.
I.e., go into your code for archiving, and change it so it doesn't delete documents after archiving them. Instead, have your code mark them with a field such as FreeDocList := "1". Then add a hidden view called (FreeDocList) with a selection formula of FreeDocList = "1". Also go into every other view in the database and add & (!(FreeDocList = "1")) to the selection formulas. Then change the code that adds the new blank documents, so that instead of creating new docs it just goes to the FreeDocList view, finds the first document, sets FreeDocList = "0", and clears all the previous field values. Of course, if there aren't enough documents in the FreeDocList view, your code would revert to the old behavior and create a new document.
With the above changes, you will be re-using your existing documents whenever possible instead of deleting and creating new ones. I've run benchmarks on code like this and found that it can help; but I can't guarantee it in all cases. Much would depend on what else is going on in the application.
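The reuse pattern itself is language-independent. The sketch below models it in memory with Python dictionaries standing in for Notes documents; it is not LotusScript and makes no Notes API calls, and the `DocumentPool` class is a hypothetical illustration of the control flow only.

```python
class DocumentPool:
    """In-memory model of reusing archived documents instead of deleting them."""

    def __init__(self):
        self.docs = []  # every document ever created, active or free

    def archive(self, doc):
        """Instead of deleting, flag the document as reusable."""
        doc["FreeDocList"] = "1"

    def new_blank(self):
        """Reuse the first free document if one exists, else create a new one."""
        for doc in self.docs:
            if doc.get("FreeDocList") == "1":
                doc.clear()  # drop all previous field values
                doc["FreeDocList"] = "0"
                return doc
        doc = {"FreeDocList": "0"}  # fall back to the old behavior
        self.docs.append(doc)
        return doc


pool = DocumentPool()
first = pool.new_blank()
first["Staff"] = "example"
pool.archive(first)
second = pool.new_blank()
print(second is first, len(pool.docs))
```

The number of stored documents grows only when the free list is empty, which is what keeps deletion stubs and file growth down in the real database.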
