LOGSTREAMID parameter for VSAM - change-data-capture

I'm trying to alter a VSAM file so that it writes logs for any update operation.
I perform the updates through a CICS transaction.
Can anyone give me an idea of how I can immediately capture all updates in a log stream file?

To get update log records written by CICS for VSAM file updates, you will need to configure the recovery attributes for that VSAM file. The type of file, how the file is accessed (RLS or non-RLS) and the types of log records required determine which options can be set and where to set them.
To keep it simple, if you set the recovery attributes in the ICF catalog definition for the VSAM data set with LOG(ALL) and LOGSTREAMID(your_logstream_name), then before and after images will be written. Depending upon what the log records are needed for, also consider using the LOGREPLICATE(YES) option instead or as well.
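For illustration only, a minimal IDCAMS sketch of setting those attributes (the data set and log stream names are placeholders; check the exact options against your release and your site's standards):

//ALTRVSAM EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  /* Placeholder data set and log stream names - use your own.   */
  /* Add LOGREPLICATE(YES) if the log feeds change data capture. */
  ALTER MY.VSAM.KSDS -
        LOG(ALL) -
        LOGSTREAMID(MY.FWD.RECOVERY.LOG)
/*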
Be careful when turning recovery on: records (or CIs) in the file will be locked until the transaction making the updates completes. This could lead to deadlocks and rollbacks if multiple transactions make multiple updates to the file concurrently. Also, if the file is an ESDS, there are further complexities.
Make sure the general log stream or model log stream has been created, so that CICS has, or can create, somewhere to write the log records.
I'd also recommend reading more on the recovery options available so that only the log records that are needed are written; the CICS documentation on logging covers the options in more detail.

Related

Limiting Kismet log files to a size or duration

Looking for a solid way to limit the size of Kismet's database files (*.kismet) through the conf files located in /etc/kismet/. The version of Kismet I'm currently using is 2021-08-R1.
The end state would be to limit the file size (10 MB, for example) or, after X minutes of logging, have the current database file closed. Then a new database is created, connected, and starts getting written to. This process would continue until Kismet is killed. This way, rather than having one large database, there will be multiple smaller ones.
In the kismet_logging.conf file there are some timeout options, but that's for expunging old entries in the logs. I want to preserve everything that's being captured, but break the logs into segments as the capture process is being performed.
I'd appreciate anyone's input on how to do this either through configuration settings (some that perhaps don't exist natively in the conf files by default?) or through plugins, or anything else. Thanks in advance!
Two interesting ways:
One could let the old entries be taken out, but reach in with SQL and extract what you want as a time-bounded query.
A second way would be to automate the restarting of Kismet... which is a little less elegant, but seems to work.
https://magazine.odroid.com/article/home-assistant-tracking-people-with-wi-fi-using-kismet/
If you read that article carefully, there are lots of bits of interesting information there.
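To make the first way concrete: the .kismet file is a SQLite database, so you can pull a time-bounded slice out of it yourself. This is only a sketch; the devices table and the devmac/last_time columns are assumptions based on recent Kismet schemas, so check your own file with .schema devices first.

# Hypothetical sketch: list devices seen in a time window from a .kismet log.
# Table/column names are assumptions - verify them with `.schema devices`.
import sqlite3
import time

def devices_between(db_path, start_ts, end_ts):
    """Yield (mac, last_seen) for devices last seen inside [start_ts, end_ts]."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT devmac, last_time FROM devices "
            "WHERE last_time BETWEEN ? AND ?",
            (start_ts, end_ts),
        )
        yield from cur
    finally:
        conn.close()

if __name__ == "__main__":
    now = int(time.time())
    for mac, last_seen in devices_between("Kismet-example.kismet", now - 600, now):
        print(mac, last_seen)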

Best practices for ArangoDB compaction on-demand for file space reclamation

Part of my evaluation of ArangoDB involves importing a few CSV files of over 1M rows into a staging area, then deleting the resulting collections or databases. I will need to do this repeatedly for the production processes I envision.
I understand that the ArangoDB service invokes compaction periodically per this page:
https://docs.arangodb.com/3.3/Manual/Administration/Configuration/Compaction.html
After deleting a database, I waited over 24 hours and no disk space has been reclaimed, so I'm not sure this automated process is working.
I'd like answers to these questions:
What are the default values for the automatic compaction parameters shown in the link above?
Other than observing a change in file space, how do I know that a compaction worked? Is there a log file or other place that would indicate this?
How can I execute a compaction on-demand? All the references I found that discussed such a feature indicated that it was not possible, but they were from several years ago and I'm hoping this feature has been added.
Thanks!
The GET route /_api/collection/{collection-name}/figures contains a sub-attribute compactionStatus in the attribute figures, with the time and message of the last compaction, for debugging purposes. There is also some other information in the response that you might be interested in. Check whether doCompact is set to true at all.
https://docs.arangodb.com/3.3/HTTP/Collection/Getting.html#return-statistics-for-a-collection
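For example, a quick check from Python (the host, credentials and collection name are placeholders):

# Hypothetical sketch: inspect the compaction status of one collection (ArangoDB 3.3).
# Host, credentials and collection name are placeholders.
import requests

ARANGO = "http://localhost:8529"
AUTH = ("root", "password")

resp = requests.get(f"{ARANGO}/_db/_system/_api/collection/mycollection/figures", auth=AUTH)
resp.raise_for_status()
body = resp.json()

print("doCompact:        ", body.get("doCompact"))
print("compactionStatus: ", body.get("figures", {}).get("compactionStatus"))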
You can run arangod --help-compaction to see the startup options for compaction, including the default values. This information is also available online in the 3.4 docs:
https://docs.arangodb.com/3.4/Manual/Programs/Arangod/Options.html#compaction-options
The PUT route /_api/collection/{collection-name}/rotate, quoting the documentation directly:
Rotates the journal of a collection. The current journal of the collection will be closed and made a read-only datafile. The purpose of the rotate method is to make the data in the file available for compaction (compaction is only performed for read-only datafiles, and not for journals).
Saving new data in the collection subsequently will create a new journal file automatically if there is no current journal.
https://docs.arangodb.com/3.3/HTTP/Collection/Modifying.html#rotate-journal-of-a-collection
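A minimal way to trigger that from Python (same placeholder host and credentials as above; journals and this route only apply to the MMFiles engine):

# Hypothetical sketch: rotate the journal so its data becomes eligible for
# compaction (MMFiles engine only). Collection name is a placeholder.
import requests

ARANGO = "http://localhost:8529"
AUTH = ("root", "password")

resp = requests.put(f"{ARANGO}/_db/_system/_api/collection/mycollection/rotate", auth=AUTH)
print(resp.status_code, resp.json())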

Spark _temporary creation reason

Why does Spark, while saving the result to a file system, upload result files to a _temporary directory and then move them to the output folder, instead of uploading them directly to the output folder?
A two-stage process is the simplest way to ensure consistency of the final result when working with file systems.
You have to remember that each executor thread writes its result set independently of the other threads; writes can be performed at different moments in time, or even reuse the same set of resources. At the moment of the write, Spark cannot determine whether all writes will succeed.
In case of failure, the changes can be rolled back by removing the temporary directory.
In case of success, the changes can be committed by moving the temporary directory.
Another benefit of this model is the clear distinction between writes in progress and finalized output. As a result it can easily be integrated with simple workflow management tools, without the need for a separate state store or other synchronization mechanism.
This model is simple and reliable, and it works well with the file systems for which it was designed. Unfortunately it doesn't perform that well with object stores, which don't support cheap, atomic moves.
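The idea is the classic write-to-a-temporary-location-then-move pattern. A toy, filesystem-only illustration (this is not Spark's actual committer code, just the pattern it relies on):

# Toy illustration of write-to-_temporary-then-move; not Spark's actual
# FileOutputCommitter, just the underlying pattern.
import os
import shutil

def commit_write(parts, output_dir):
    tmp_dir = os.path.join(output_dir, "_temporary")
    os.makedirs(tmp_dir, exist_ok=True)
    try:
        # Each "task" writes its part file independently into the temp dir.
        for i, rows in enumerate(parts):
            with open(os.path.join(tmp_dir, f"part-{i:05d}"), "w") as f:
                f.writelines(row + "\n" for row in rows)
        # Job commit: move every finished part into the final location.
        for name in os.listdir(tmp_dir):
            os.rename(os.path.join(tmp_dir, name), os.path.join(output_dir, name))
    except Exception:
        # Job abort: throwing away the temp dir rolls back everything.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    shutil.rmtree(tmp_dir, ignore_errors=True)

commit_write([["a", "b"], ["c"]], "out")

On a real file system the final moves are cheap metadata operations; on an object store each "move" is a copy plus a delete, which is why the same protocol performs poorly there.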

Is it possible to retrieve the list of files when a DataFrame is written, or to have Spark store it somewhere?

With a call like
df.write.csv("s3a://mybucket/mytable")
I obviously know where files/objects are written, but because of S3's eventual consistency guarantees, I can't be 100% sure that getting a listing from that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, then I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?
The spark-redshift library can take care of this for you. If you want to do it yourself you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L299
EDIT: I avoid further worry about consistency by using df.coalesce(fileCount) to output a known number of file parts (for Redshift you want a multiple of the slices in your cluster). You can then check how many files are listed in the Spark code and also how many files were loaded in Redshift via stl_load_commits.
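A sketch of that approach (the bucket, prefix, file count and manifest key are placeholders made up for the example):

# Hypothetical sketch: write a known number of part files, list them, and build
# a Redshift COPY manifest. Bucket, prefix and file_count are placeholders.
import json
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)          # stand-in for the real DataFrame

file_count = 8                  # e.g. a multiple of the Redshift cluster's slices
df.coalesce(file_count).write.csv("s3a://mybucket/mytable")

s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="mybucket", Prefix="mytable/")
parts = [o["Key"] for o in listing.get("Contents", []) if "part-" in o["Key"]]
assert len(parts) == file_count, f"expected {file_count} parts, saw {len(parts)}"

manifest = {"entries": [{"url": f"s3://mybucket/{key}", "mandatory": True}
                        for key in parts]}
s3.put_object(Bucket="mybucket", Key="manifests/mytable.manifest",
              Body=json.dumps(manifest))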
It's good to be aware of the consistency risks; they can show up in listings as delayed visibility of newly created objects and as deleted objects still being listed.
AFAIK, you can't get a list of the files created: tasks can generate whatever they want into the task output dir, which is then marshalled (via listing and copy) into the final output dir.
In the absence of a consistency layer atop S3 (S3mper, S3Guard, etc.), you can read & spin for "a bit" to allow the shards to catch up. I have no good idea of what a good value of "a bit" is.
However, if you are calling df.write.csv(), you may have been caught by listing inconsistencies within the committer used to propagate task output to the job dir, as that's done in S3A via list + copy.
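A rough sketch of that "read and spin" waiting loop (bucket, prefix and expected count are placeholders, and there is no universally good timeout):

# Hypothetical sketch: poll the S3 listing until the expected number of part
# files appears, or give up. All names and timeouts are placeholders.
import time
import boto3

def wait_for_parts(bucket, prefix, expected, timeout_s=300, poll_s=5):
    s3 = boto3.client("s3")
    deadline = time.time() + timeout_s
    keys = []
    while time.time() < deadline:
        listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys = [o["Key"] for o in listing.get("Contents", []) if "part-" in o["Key"]]
        if len(keys) >= expected:
            return keys
        time.sleep(poll_s)
    raise TimeoutError(f"only saw {len(keys)} of {expected} parts under {prefix}")

# wait_for_parts("mybucket", "mytable/", expected=8)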

MongoDB, can I trigger secondary replication only at a given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about the server setup now.
I have a single instance running mongo 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework; the data set to process is something like all logs from the last month, and all the calculations take about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want to make a setup with a secondary for reads, where replication is not performed continuously but starts at a configured time, or better, is triggered before all the read operations are started.
I'm doing all my processing from Node.js, so one option I see here is to export the data created in some period like [yesterday, today] and import it into the read instance myself, making the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I didn't see how to configure them to achieve the described scenario.
So maybe I'm wrong and am missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and re-enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date; it will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
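One way to see that lag from a driver, for example with Python's pymongo (the connection string is a placeholder), is to compare each secondary's optimeDate against the primary's:

# Hypothetical sketch: report how far each secondary is behind the primary
# using replSetGetStatus. Connection string is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
status = client.admin.command("replSetGetStatus")

primary_optime = next(m["optimeDate"] for m in status["members"]
                      if m["stateStr"] == "PRIMARY")
for m in status["members"]:
    if m["stateStr"] == "SECONDARY":
        print(m["name"], "lags by", primary_optime - m["optimeDate"])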
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match, which matches the documents from the last month, followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change during the analysis, so your reports won't have any inconsistencies caused by the data changing between aggregations.
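That pipeline is tiny. In pymongo it would look roughly like this (the database, collection and field names are placeholders):

# Hypothetical sketch: snapshot last month's documents into a separate
# collection with $match + $out, then report against the snapshot.
# Database, collection and field names are placeholders.
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["logs"]

a_month_ago = datetime.utcnow() - timedelta(days=30)
db.events.aggregate([
    {"$match": {"createdAt": {"$gte": a_month_ago}}},
    {"$out": "events_last_month"},  # target collection is replaced on completion
])

# Heavy reporting queries now run against the snapshot, not the live collection.
print(db.events_last_month.count_documents({}))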
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend building a proper replica set (consisting of a primary, a secondary and an arbiter) and leaving replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
