Problem Statement: After successful completion of a Spring Batch job, I am not able to access data from the ExecutionContext that was set inside the Spring Batch partitioner.
Partition code:
for (String files : fileNameListmatch) {
    ExecutionContext executionContext = new ExecutionContext();
    executionContext.putString("file", files);
    partitionData.put("partition: " + partitionNo, executionContext);
    partitionNo++;
}
Inside the partitioner, I add one file name per partition to its ExecutionContext.
JobListener code:
@Value("#{stepExecutionContext['file']}")
String file;

@Override
public void afterJob(JobExecution jobExecution) {
    for (String file1 : file) {
        moveCSVFile = Files.move(Paths.get(inputFilePath + "/" + file1 + ".csv"),
                Paths.get(archiveFilePath + file1 + ".csv"));
        moveCTLFile = Files.move(Paths.get(inputFilePath + "/" + file1 + ".ctl"),
                Paths.get(archiveFilePath + file1 + ".ctl"));
    }
}
Inside afterJob, I tried to access the file names from the ExecutionContext after the job completed, but I get null. After the job completes successfully I have to move the input files to an archive folder, but I can't access the file names (the value from the executionContext is null).
There are two different execution contexts: one at the step level and one at the job level. Make sure to use the job-scoped one, since you want to access the execution context from a job listener.
If you use the step-scoped one, you can always promote keys to the job execution context using an ExecutionContextPromotionListener. Please refer to the Passing Data to Future Steps section for more details.
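For illustration, a minimal sketch of the promotion approach (the bean method and the ArchiveFilesListener class name are made up for this example; ExecutionContextPromotionListener and the job-level ExecutionContext lookup are standard Spring Batch API):

// Register this listener on the (worker) step so the step-scoped "file" key
// is copied into the job execution context when the step completes.
@Bean
public ExecutionContextPromotionListener promotionListener() {
    ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
    listener.setKeys(new String[] {"file"});
    return listener;
}

// In the job listener, read from the job-level execution context instead of
// injecting #{stepExecutionContext['file']}.
public class ArchiveFilesListener extends JobExecutionListenerSupport {
    @Override
    public void afterJob(JobExecution jobExecution) {
        // Note: with several partitions, each one promotes its own "file" value,
        // so for a list of files you may prefer distinct keys or to write the
        // whole collection into jobExecution.getExecutionContext() yourself.
        String file = jobExecution.getExecutionContext().getString("file");
        // move <file>.csv and <file>.ctl to the archive folder here
    }
}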
I'm trying to load blob names for filtering in my program, then after applying all filters I plan to download and process each blob.
Currently we have around 30k blobs in storage, stored inside the container like this:
year/month/day/hour/file.csv (or file.json for unprocessed files)
My program needs to take a dynamic start and end date (a window of at most 30 days) for downloading. Using Azure.Storage.Blobs.BlobContainerClient and its GetBlobs method lets me use a single string prefix for server-side filtering.
If my dates are 2020/06/01 and 2020/06/02, the program works very fast and takes around 2 seconds to get the blobs and apply the rest of the filters. However, if I have 2020/05/30 and 2020/06/01, I can't even use a month prefix because GetBlobs takes only one string, so my prefix is just 2020, and the call takes around 15 seconds to complete. The rest of the filtering is done locally, but the biggest delay is the GetBlobs() call.
Is there any other way to use multiple filters server-side from a .NET Core app?
Here are relevant functions:
BlobContainerClient container = new BlobContainerClient(resourceGroup.Blob, resourceGroup.BlobContainer);
var blobs = container.GetBlobs(prefix: CreateBlobPrefix(start, end))
    .Select(item => item.Name)
    .ToList();
blobs = FilterBlobList(blobs, filter, start, end);
private string CreateBlobPrefix(DateTime start, DateTime end)
{
    string prefix = null;
    bool sameYear = start.Year == end.Year;
    bool sameMonth = start.Month == end.Month;
    bool sameDay = start.Day == end.Day;

    if (sameYear)
    {
        prefix = start.Year.ToString();
        if (sameMonth)
        {
            if (start.Month < 10)
                prefix += "/0" + start.Month.ToString();
            else
                prefix += "/" + start.Month.ToString();

            if (sameDay)
            {
                if (start.Day < 10)
                    prefix += "/0" + start.Day.ToString();
                else
                    prefix += "/" + start.Day.ToString();
            }
        }
    }
    return prefix;
}
EDIT: here's how I did it in the end. Because it's faster to make multiple requests with more specific prefixes, I did the following:
create a list of day prefixes for the selected time window (coming from a UI where the user can input any window)
for each prefix created, send a request to Azure to get the blobs
concatenate all the blob names into one list
process the list, using a blob client for each blob name
Here's the code:
foreach (var blobPrefix in CreateBlobPrefix(start, end))
{
    var currentList = container.GetBlobs(prefix: blobPrefix)
        .Select(item => item.Name)
        .ToList();
    blobs = blobs.Concat(currentList).ToList();
}
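The multi-prefix version of CreateBlobPrefix isn't shown above; a minimal sketch of what it could look like, assuming the year/month/day path layout described earlier (the exact format string is an assumption):

// Sketch only: yields one "yyyy/MM/dd" prefix per day in the window, so each
// GetBlobs call lists only that day's blobs server-side.
private IEnumerable<string> CreateBlobPrefix(DateTime start, DateTime end)
{
    for (DateTime day = start.Date; day <= end.Date; day = day.AddDays(1))
    {
        // Quote the separators so they are emitted literally regardless of culture.
        yield return day.ToString("yyyy'/'MM'/'dd");
    }
}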
You could filter more than once, finding the common denominator between the dates:
First filter with the string prefix of the start month and year, 2020/05, and then filter locally for the exact dates.
Then you can gradually increase the day/month filter until you reach the end of the range.
The granularity of your stepping really depends on the time it takes to make a call to Azure for a given average number of results.
Another advantage is you could run these sub-queries in parallel.
I've used this code:
var prefixDateFilters = Enumerable.Range(0, 1 + endDateInclusive.Subtract(startDateInclusive).Days)
    .Select(offset => startDateInclusive.AddDays(offset))
    .Select(date => date.ToString(BlobFileDateTimeFormat))
    .ToList();

// One listing per day prefix, run in parallel and flattened into a single result.
var blobItems = prefixDateFilters.AsParallel()
    .SelectMany(filter => containerClient.GetBlobs(prefix: filter))
    .ToList();
I have a list of accounts and perform a hash join with ticks to return the accounts with tick data. After the hash join I drainTo an IListJet, then read it with DistributedStream and return it.
public List<Account> populateTicksInAccounts(List<Account> accounts) {
    ...
    ...
    Pipeline p = Pipeline.create();
    BatchSource<Tick> ticksSource = Sources.list(TICKS_LIST_NAME);
    BatchSource<Account> accountSource = Sources.fromProcessor(AccountProcessor.of(accounts));

    p.drawFrom(ticksSource)
     .hashJoin(p.drawFrom(accountSource), JoinClause.joinMapEntries(Tick::getTicker), accountMapper())
     .drainTo(Sinks.list(TEMP_LIST));

    jet.newJob(p).join();

    IListJet<Account> list = jet.getList(TEMP_LIST);
    return DistributedStream.fromList(list).collect(DistributedCollectors.toIList());
}
Is it possible to drainTo a plain Java List instead of an IListJet after performing the hash join?
Is something like the below possible?
List<Account> accountWithTicks = new ArrayList<>();
p.drawFrom(ticksSource)
 .hashJoin(p.drawFrom(accountSource), JoinClause.joinMapEntries(Tick::getTicker), accountMapper())
 .drainTo(<CustomSinkProcessor(accountWithTicks)>);
return accountWithTicks;
where CustomSinkProcessor would take the empty Java list and fill it with the accounts?
Keep in mind that the code you submit to Jet for execution runs outside the process where you submit it from. While it would be theoretically possible to provide the API you're asking for, under the hood it would just have to perform some tricks to run the code on each member of the cluster, let all members send their results to one place, and fill up a list to return to you. It would go against the nature of distributed computing.
If you think it will help the readability of your code, you can write a helper method such as this:
public <T, R> List<R> drainToList(GeneralStage<T> stage) {
    String tmpListName = randomListName();
    SinkStage sinkStage = stage.drainTo(Sinks.list(tmpListName));
    IListJet<R> tmpList = jet.getList(tmpListName);
    try {
        jet.newJob(sinkStage.getPipeline()).join();
        return new ArrayList<>(tmpList);
    } finally {
        tmpList.destroy();
    }
}
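Used from the method in the question, the helper could be called roughly like this (a sketch reusing the question's names; drainToList is the helper above, not Jet API):

public List<Account> populateTicksInAccounts(List<Account> accounts) {
    Pipeline p = Pipeline.create();
    BatchSource<Tick> ticksSource = Sources.list(TICKS_LIST_NAME);
    BatchSource<Account> accountSource = Sources.fromProcessor(AccountProcessor.of(accounts));

    BatchStage<Account> joined = p.drawFrom(ticksSource)
        .hashJoin(p.drawFrom(accountSource), JoinClause.joinMapEntries(Tick::getTicker), accountMapper());

    // Drains to a temporary IList, copies it into a plain ArrayList on the
    // client, and destroys the temporary list afterwards.
    return drainToList(joined);
}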
Especially note the line
return new ArrayList<>(tmpList);
as opposed to your
IListJet<Account> list = jet.getList(TEMP_LIST);
return DistributedStream.fromList(list).collect(DistributedCollectors.toIList());
This just copies one Hazelcast list to another one and returns a handle to it. Now you have leaked two lists in the Jet cluster. They don't automatically disappear when you stop using them.
Even the code I provided can still be leaky. The JVM process that runs it can die during Job.join() without reaching finally. Then the temporary list lingers on.
No, it's not, due to the distributed nature of Jet. The sink is executed by multiple parallel processors (workers), so it can't add to a plain Collection; it has to be able to insert items on multiple cluster members.
I need to be able to get some slave information to be used in one of my jobs.
I have a Groovy system script to access the slave information
for (aSlave in hudson.model.Hudson.instance.slaves) {
    println('====================');
    println('Name: ' + aSlave.name);
    println('getLabelString: ' + aSlave.getLabelString());
    // ... in here I can dig out the information that I need
}
Is there a way to get this information back so it can be used in a post-build job?
I need to put the output into a parameter or something else that a post-build job can use.
If you are running Windows, I have a solution for you: you can save your settings into environment variables, which are usable by the currently running job. They will no longer exist once the job is finished, but they are usable in post-build actions. Here is an example:
//Creating String to make my example more clear
String myString = 'this is just a test!';
//Create an environment variable, so the Jenkins job can use the parameter
def pa = new ParametersAction([new StringParameterValue('PARAMETER_NAME', myString)]);
// Add variable to current jobs environment variables.
Thread.currentThread().executable.addAction(pa)
println 'Script finished! \n';
After the script has run you can use %PARAMETER_NAME% (in post-build actions etc.) to access its content.
Additional hint: to see all available environment variables you can add the build step "Execute Windows batch command" and click "See the list of available environment variables" at the bottom (the variables you create while executing scripts are not listed there). You can use these variables within your Groovy script, e.g.:
String jenkinsHome = System.getenv('JENKINS_HOME');
I used the EnvInject plugin, which has an 'Evaluated Groovy Script' section where you can basically do anything, as long as it returns a property map that will be used as environment variables. I didn't know how to return a value from a Groovy script, so this worked well for me, since I can reference the properties (or environment variables) from almost anywhere:
import hudson.model.*

String labelIWantServersOf = TheLabelUsedOnTheElasticAxisPlugin; // This is the label associated with nodes for which I want the server names
String serverList = '';

for (aSlave in hudson.model.Hudson.instance.slaves) {
    out.println('Evaluating Server(' + aSlave.name + ') with label = ' + aSlave.getLabelString());
    if (aSlave.getLabelString().indexOf(labelIWantServersOf) > -1) {
        serverList += aSlave.name + ' ';
        out.println('Valid server found: ' + aSlave.name);
    }
}

out.println('Final server list where SOAP projects will run on = ' + serverList + ' which will be used in the environment envInject map');

Map<String, String> myMap = new HashMap<>(2);
myMap.put("serverNamesToExecuteSoapProjectOn", serverList);
return myMap;
Then I write the environment variable serverNamesToExecuteSoapProjectOn to a property file using a Windows batch script and pass the property file to the next build as a parameterized build.
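The batch step itself isn't shown; a rough sketch of what it might look like (the servers.properties file name is an assumption):

rem Write the value injected by the Groovy script into a property file
rem that the parameterized downstream build can pick up.
echo serverNamesToExecuteSoapProjectOn=%serverNamesToExecuteSoapProjectOn%> servers.properties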
My requirement is to do some repetitive file configuration using a loop, something like the following:
$no_of_managers = 2
$array = ['One','two','Three']

define loop() {
  notice("Configuring The Manager Nodes!!")
  if ($name == $no_of_managers+1) {
    notice("Loop Iteration Finished!!!")
  }
  else {
    notice("Iteration Number : $name \n")
    # Doing All Stuff Here
    resource { $array: }
    $next = $name + 1
    loop { $next: }
  }
}

loop { "1": }

define resource () {
  # Doing my other Stuff
  notice("The Parsed value Name : ${name}\n")
}
Now when the second iteration runs, the following error occurs:
Error: Duplicate declaration: Resource[One] is already declared in file
How can I overcome this? What I'm doing is a cluster setup. Is there a workaround? I'm a newbie to Puppet, so your kind guidance is highly appreciated.
The Use Case :
I'm trying to set up a cluster which has multiple manager/worker nodes, and with this script the user can choose how many manager nodes are needed. The first loop copies the necessary files and creates the required number of nodes.
The second loop pushes all the .erb templates. Because each node has slightly different configs, the .erb files have their own logic inside them.
So after each iteration I want to push the .erb templates to the respective node.
In Puppet 3.x, you cannot build a loop in the fashion you are trying.
resource { $array: }
is a loop over the contents of $array if you will.
It is not really clear what you are trying to solve. If you can make your question a bit more concrete, we may be able to suggest an actual implementation.
Update
If you really want to go down this road, you need to generate unique names for your derived resources.
$local_names = regsubst($array, '$', "-$name")
resource { $local_names: }
In your defined type, you will have to retrieve the original meaning by removing the suffix.
define resource() {
$orig_name = regsubst($name, '-[0-9]+$', '')
# use $orig_name where you used $name before
}
Note that even exported resources must have unique names, so the transformation may have to happen in the manifest of the receiving node.
My boss requires that I have a logging solution where
All log files have a date/time in the file name (e.g. myapp.2-28-2012.log)
Only the most recent log files remain. Older log files are deleted so the hard drive doesn't run out of space
It seems with log4j I can only satisfy one criterion or the other, but not both. With the log4j extras TimeBasedRollingPolicy I'm able to get log files to contain the date/time, which fulfills 1). However, there doesn't seem to be a way to make TimeBasedRollingPolicy delete old log files; according to this post it is not possible.
With the log4j extras FixedWindowRollingPolicy and SizeBasedTriggeringPolicy I can get log4j to delete all but the last 10 log files so my hard drive doesn't run out of space, fulfilling 2). However, I can't get this solution to put the date/time in the file name. With this configuration
def myAppAppender = new org.apache.log4j.rolling.RollingFileAppender(name: 'myApp', layout: pattern(conversionPattern: "%m%n"))
def rollingPolicy = new org.apache.log4j.rolling.FixedWindowRollingPolicy(fileNamePattern: '/tmp/myapp-%d{MM-dd-yyyy_HH:mm:ss}.log.%i', maxIndex: 10, activeFileName: '/tmp/myapp.log')
rollingPolicy.activateOptions()
def triggeringPolicy = new org.apache.log4j.rolling.SizeBasedTriggeringPolicy(maxFileSize: 10000000)
triggeringPolicy.activateOptions()
myAppAppender.setRollingPolicy(rollingPolicy)
myAppAppender.setTriggeringPolicy(triggeringPolicy)
the rolled log files do not contain the date/time. They look like this
myapp-.log.1
myapp-.log.2
...
Is it possible to fulfill both criteria 1) and 2) with log4j? Would I have to subclass TimeBasedRollingPolicy? If so, which methods should I override?
It is somewhat hard to do generically within log4j, but you can always extend the policy to suit your specific needs, as follows.
You first need to copy-paste the TimeBasedRollingPolicy source code into a new class, for example MyAppDeletingTimeBasedRollingPolicy. The Apache license permits this.
The critical part is implementing the delete logic for your needs. Below is my class, which deletes files that start with "myapp-debug", end with "gz", and have not been modified in the last 3 days. You will probably need other checks, so be careful about blindly copying and pasting the code.
private static class DeleteOldMyAppLogFilesInDirAction extends ActionBase {
    private static final long MAX_HISTORY = 3L * 24 * 60 * 60 * 1000; // 3 days
    private final File dir;

    public DeleteOldMyAppLogFilesInDirAction(File dir) {
        this.dir = dir;
    }

    @Override
    public boolean execute() throws IOException {
        for (File f : dir.listFiles()) {
            if (f.getName().startsWith("myapp-debug.") && f.getName().endsWith(".gz")
                    && f.lastModified() < System.currentTimeMillis() - MAX_HISTORY) {
                f.delete();
            }
        }
        return true;
    }
}
Then you need to change the return value from this:
return new RolloverDescriptionImpl(nextActiveFile, false, renameAction, compressAction);
To this:
Action deleteOldFilesAction = new DeleteOldMyAppLogFilesInDirAction(new File(currentActiveFile).getParentFile());
List<Action> asynchActions = new ArrayList<Action>();
if (compressAction != null) {
    asynchActions.add(compressAction);
}
asynchActions.add(deleteOldFilesAction);
return new RolloverDescriptionImpl(nextActiveFile, false, renameAction, new CompositeAction(asynchActions, false));
There are some hardcoded assumptions in this class, such as the file name and the old files residing in the same directory as the current file; be careful with those assumptions and feel free to adapt the code to your own needs.
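To tie this back to the Groovy configuration from the question, wiring in the custom policy might look roughly like this (the com.example package is an assumption; the time-based policy also acts as the triggering policy):

// Sketch only: replaces FixedWindowRollingPolicy with the extended time-based policy.
def myAppAppender = new org.apache.log4j.rolling.RollingFileAppender(name: 'myApp', layout: pattern(conversionPattern: "%m%n"))
def rollingPolicy = new com.example.MyAppDeletingTimeBasedRollingPolicy(fileNamePattern: '/tmp/myapp-%d{MM-dd-yyyy}.log.gz', activeFileName: '/tmp/myapp.log')
rollingPolicy.activateOptions()
myAppAppender.setRollingPolicy(rollingPolicy)
// TimeBasedRollingPolicy implements TriggeringPolicy as well, so reuse it here.
myAppAppender.setTriggeringPolicy(rollingPolicy)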