How to perform a series of steps in a single thread, with an async flow in spring-integration?

I currently have a spring-integration (v4.3.24) flow that looks like the following:
           |
           | list of
           | filepaths
      +----v---+
      |splitter|
      +----+---+
           | filepath
           |
+----------v----------+
|sftp-outbound-gateway|
|        "get"        |
+----------+----------+
           | file
+---------------------+
|     +----v----+     |
|     |decryptor|     |
|     +----+----+     |
|          |          |
|    +-----v------+   |   set of transformers
|    |decompressor|   |   (with routers before them
|    +-----+------+   |   because some steps are optional)
|          |          |   that process the file;
|       +--v--+       |   call this "FileProcessor"
|       | ... |       |
|       +--+--+       |
+---------------------+
           |
      +----v----+
      |save file|
      | to disk |
      +----+----+
           |
All of the channels above are DirectChannels. (Yup, I know this is a poor structure.) This was working fine for small numbers of files, but now I have to deal with thousands of files that need to go through the same flow, and benchmarks reveal that this takes about a day to finish processing. So I'm planning to introduce some parallel processing to this flow. I want to modify my flow to achieve something like this:
           |
           |
+----------v----------+
|sftp-outbound-gateway|
|       "mget"        |
+----------+----------+
           | list of files
           |
      +----v---+
      |splitter|
      +----+---+
one thread | one thread              ...
           +-------------------------+------+--+--+--+
           | file                    | file |  |  |  |
+---------------------+   +---------------------+
|     +----v----+     |   |     +----v----+     |
|     |decryptor|     |   |     |decryptor|     |
|     +----+----+     |   |     +----+----+     |
|          |          |   |          |          |
|    +-----v------+   |   |    +-----v------+   |   ...
|    |decompressor|   |   |    |decompressor|   |
|    +-----+------+   |   |    +-----+------+   |
|          |          |   |          |          |
|       +--v--+       |   |       +--v--+       |
|       | ... |       |   |       | ... |       |
|       +--+--+       |   |       +--+--+       |
+---------------------+   +---------------------+
           |                         |
      +----v----+               +----v----+
      |save file|               |save file|
      | to disk |               | to disk |
      +----+----+               +----+----+
           |                         |
           |                         |
For parallel processing, I output the files from the splitter onto an ExecutorChannel backed by a ThreadPoolTaskExecutor.
Some of the questions that I have:
1. I want all of the "FileProcessor" steps for one file to happen on the same thread, while multiple files are processed in parallel. How can I achieve this? I saw from this answer that an ExecutorChannel-to-MessageHandlerChain flow would offer such functionality. But some of the steps inside "FileProcessor" are optional (using selector-expression with routers to skip some of the steps), which rules out a MessageHandlerChain. I could rig up a couple of MessageHandlerChains with Filters inside, but that more or less becomes the approach mentioned in #2.
2. If #1 cannot be achieved, will changing all of the channel types starting from the splitter, from DirectChannel to ExecutorChannel, help introduce some parallelism? If yes, should I create a new TaskExecutor for each channel, or can I reuse one TaskExecutor bean for all channels (I cannot set scope="prototype" on a TaskExecutor bean)?
3. In your opinion, which approach (#1 or #2) is better? Why?
4. If I perform global error handling, like the approach mentioned here, will the other files continue to process even if one file errors out?

It will work as you need if you use an ExecutorChannel as the input channel to the decryptor and leave all the rest as DirectChannels; the remaining flow does not have to be a chain. Each file is handed off to one of the executor's threads, and since every downstream channel is direct, the entire sub-flow for that file then stays on that same thread.
You will need to be sure all your downstream components are thread-safe.
Error handling should remain as is; each sub-flow is independent.
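For reference, here is a minimal sketch of that arrangement as Java configuration; the bean and channel names are illustrative, not taken from the question:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.channel.ExecutorChannel;
import org.springframework.messaging.MessageChannel;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class ParallelFileFlowConfig {

    @Bean
    public ThreadPoolTaskExecutor fileProcessorExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);        // how many files are processed in parallel
        executor.setMaxPoolSize(8);
        executor.setQueueCapacity(1000);
        executor.setThreadNamePrefix("file-processor-");
        return executor;
    }

    // The splitter sends its output (one message per file) here, so each
    // file is handed off to one of the executor's threads.
    @Bean
    public MessageChannel decryptorInput() {
        return new ExecutorChannel(fileProcessorExecutor());
    }

    // Every channel downstream of the decryptor stays direct, so the
    // routers, the decompressor, and the save-to-disk step for a given
    // file all run on the same executor thread that picked it up.
    @Bean
    public MessageChannel decompressorInput() {
        return new DirectChannel();
    }
}

Because only the decryptor's input is an ExecutorChannel, one executor bean is all that is needed, and its pool size caps how many files are in flight at once.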

Related

Recursively add prefix to file names and moving these files from all subdirectories to a specified directory (linux environment)

I'd like to rename the files with the unique sample name (which is the name of the subdirectory two levels above the files).
Here is a snippet of the directory structure:
|-RNAdata
| |-Sample1
| | |-cufflinks
| | | |-genes.fpkm_tracking
| | | |-skipped.gtf
| | | |-isoforms.fpkm_tracking
| | | |-transcripts.gtf
| |-Sample2
| | |-cufflinks
| | | |-genes.fpkm_tracking
| | | |-skipped.gtf
| | | |-isoforms.fpkm_tracking
| | | |-transcripts.gtf
There are about 1000 files like this. I'd like to be able to see something like this:
|-RNAdata
| |-Sample1_genes.fpkm_tracking
| |-Sample1_skipped.gtf
| |-Sample1_isoforms.fpkm_tracking
| |-Sample1_transcripts.gtf
| |-Sample2_genes.fpkm_tracking
| |-Sample2_skipped.gtf
| |-Sample2_isoforms.fpkm_tracking
| |-Sample2_transcripts.gtf
I'm working in a Linux environment and have only very basic knowledge of file management from the command line. Any advice or suggestions for resources on this type of work would be great! I'd like to learn this so I can be more independent. Thank you!
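The task reduces to walking the tree, taking the name of the directory two levels above each file as a prefix, and moving the file to the top level. As a rough sketch of one way to do it (written in Java here; the class name and root path are illustrative, and it assumes exactly the layout shown above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlattenSamples {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("RNAdata");  // top-level directory from the question
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            // collect first so that moving files does not disturb the traversal
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            // the sample name is the directory two levels above the file, e.g.
            // RNAdata/Sample1/cufflinks/genes.fpkm_tracking -> "Sample1"
            String sample = file.getParent().getParent().getFileName().toString();
            Files.move(file, root.resolve(sample + "_" + file.getFileName()));
        }
    }
}

Run it from the directory that contains RNAdata, and test it on a copy of the data first.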

Formula to detect for doubles, and choose the one with the highest POP-number

I have a problem with a formula that I can't seem to wrap my head around. When the same Object appears twice, I need the formula to return a 1 at the row where the POP-number is the highest (which would be POP03 every time). That part works, but the problem appears when an Object is seen only once: it should give a 1 then as well, but I can't get it to work. What am I missing?
Sample data looks like the following:
+-------+------------+
| POP | Object |
+-------+------------+
| POP02 | B0005-8701 |
| POP02 | B0005-8702 |
| POP02 | B0005-8703 |
| POP02 | B0005-8704 |
| POP02 | B0006-4359 |
| POP02 | LBK-0013 |
| POP03 | LBK-0017 |
| POP02 | LBK-0017 |
| POP03 | LBK-0018 |
| POP02 | LBK-0018 |
| POP03 | LBK-0019 |
| POP02 | LBK-0019 |
| POP03 | LBK-0020 |
| POP02 | LBK-0020 |
| POP03 | LBK-0021 |
| POP02 | LBK-0021 |
+-------+------------+
The formula I used is as follows (POP is in column B, and Object in column C):
=IF(C2="";"";IF(C2=C3;IF(Q2<Q3;0;IF(Q2>Q3;1;))))
I would use a COUNTIFS like this:
=IF(B$2:B$20="","",IF(COUNTIFS(C$2:C$20,C2,B$2:B$20,">"&B2)=0,1,""))
It returns a 1 on the row where no other row contains the same Object with a higher POP, which also covers Objects that appear only once.

SparkSQL Get all prefixes of a word

Say I have a column in a SparkSQL DataFrame like this:
+-------+
| word |
+-------+
| chair |
| lamp |
| table |
+-------+
I want to explode out all the prefixes like so:
+--------+
| prefix |
+--------+
| c |
| ch |
| cha |
| chai |
| chair |
| l |
| la |
| lam |
| lamp |
| t |
| ta |
| tab |
| tabl |
| table |
+--------+
Is there a good way to do this WITHOUT using UDFs or functional-programming methods such as flatMap in Spark SQL? (I'm talking about a solution using the codegen-optimized functions in org.apache.spark.sql.functions._.)
Technically it is possible, but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap). Note that sequence requires Spark 2.4 or later:
val df = Seq("chair", "lamp", "table").toDF("word")
df.withColumn("len", explode(sequence(lit(1), length($"word"))))
.select($"word".substr(lit(1), $"len") as "prefix")
.show()
Output:
+------+
|prefix|
+------+
| c|
| ch|
| cha|
| chai|
| chair|
| l|
| la|
| lam|
| lamp|
| t|
| ta|
| tab|
| tabl|
| table|
+------+
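For comparison, the plain flatMap version mentioned above could look like this, sketched here with Spark's Java Dataset API (the session setup and class name are illustrative):

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Prefixes {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("prefixes")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark
                .createDataset(Arrays.asList("chair", "lamp", "table"), Encoders.STRING())
                .toDF("word");

        // emit every prefix of every word, one output row per prefix
        Dataset<String> prefixes = df.select("word").as(Encoders.STRING())
                .flatMap((FlatMapFunction<String, String>) w -> {
                    List<String> out = new ArrayList<>();
                    for (int i = 1; i <= w.length(); i++) {
                        out.add(w.substring(0, i));
                    }
                    return out.iterator();
                }, Encoders.STRING());

        prefixes.toDF("prefix").show();  // same 14 rows as the output above
    }
}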

Calc/Spreadsheet: combine/assign/migrate cols

Sorry, I don't know the right terms, but I'll try to explain my task:
In Calc, or Spreadsheet I have two worksheets with columns like this:
| ID|
| 32|
| 51|
| 51|
| 63|
| 70|
and
| ID|Name |
| 01|name1 |
| 02|name2 |
...
| 69|name69 |
| 70|name70 |
I need to combine/assign/merge these together, like:
| ID|Name |
| 32|name32 |
| 51|name51 |
| 51|name51 |
| 63|name63 |
| 70|name70 |
I have no idea how to start solving it. Please help!
Thank you #PsysicalChemist, the VLOOKUP function works in Calc too.

Write a command to increase or decrease the number of vertical splits

I usually have my Vim screen split into two vertical windows, each of which may be further horizontally split. Sometimes, I want to add or delete a vertical window. Is there a way to detect how many top-level vertical splits there are and add or remove vsplits as necessary?
For example, suppose my screen looks like this:
+--------+--------+
| | |
| | |
+--------+ |
| | |
| | |
| +--------+
| | |
+--------+--------+
I want :Columns 1 to give me
+--------+
| |
| |
+--------+
| |
| |
| |
| |
+--------+
by closing the two right-most windows.
I want :Columns 2 to do nothing, detecting that two columns are already open.
And I want :Columns 3 to give me
+--------+--------+--------+
| | | |
| | | |
+--------+ | |
| | | |
| | | |
| +--------+ |
| | | |
+--------+--------+--------+
I am fine if the function ignores vertical splits within horizontal splits. For example, if I had
+--------+
| |
| |
+---+----+
| | |
| | |
| | |
| | |
+---+----+
and I ran :Columns 2, I would get
+--------+--------+
| | |
| | |
+---+----+ |
| | | |
| | | |
| | | |
| | | |
+---+----+--------+
There is indeed a way, but it is involved. The first step is to count the currently open vertical windows, and I don't know of any built-in function that facilitates this. The working approach I found is to start at the first window (the top of the first vertical split, if not the entirety of it) and then, using wincmd l, move to the window to the right for as long as wincmd l lands on a new window, counting each one, including the first. (I think this is what Gary Fixler referred to in the comments on the question.)
I started trying to write the code for posting here, and it grew larger than any function I would want to put in my ~/.vimrc, so I ended up turning it into a plugin that takes the above approach and provides the :Columns command; see Columcille (on vim.org at http://www.vim.org/scripts/script.php?script_id=4742). The plugin also provides a command for similarly managing horizontal split windows: :Rows divides the current column (or the main window, if there are no open vertical splits) into the specified number of "rows."
