Design a batch job to process multiple files in an FTP folder - multithreading

I want to design a batch job to process multiple zip files in a folder. Each input zip file contains a directory structure whose leaf directory holds a CSV index file and a set of PDFs. The job should take a zip file, unzip it, and upload the contents to an external system and a database based on the index file in the leaf folder.
Example input zip file structure:
input1.zip
  Folder1/
    Folder2/
      abc.pdf
      ...
      cdf.pdf
      metadata.csv
I can add Spring Integration and invoke the job just after the FTP copy completes. However, my question is: how should I design the job so that it picks up multiple zip files and processes them in parallel?
Since each zip file takes around 10 minutes to process, I need multiple instances to process zip files in an efficient manner.
Appreciate any suggestions. Thank you.
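One way to approach this is to let the flow that detects the completed FTP copy hand each zip file to a pool of workers, so that several archives are unzipped and uploaded concurrently. Below is a minimal, framework-agnostic sketch of that fan-out using plain java.util.concurrent; the processZip method, the inbox path and the pool size are hypothetical placeholders for your own unzip-and-upload step.
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Stream;

public class ZipFolderProcessor {

    // Hypothetical placeholder for the real work: unzip the archive, read
    // metadata.csv from the leaf folder, upload the PDFs and update the database.
    static void processZip(Path zip) {
        System.out.println("Processing " + zip + " on " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws Exception {
        Path inbox = Paths.get("/data/ftp/inbox");              // assumed FTP landing folder
        ExecutorService pool = Executors.newFixedThreadPool(4); // ~10 min per zip, 4 at a time

        List<Future<?>> results = new ArrayList<>();
        try (Stream<Path> files = Files.list(inbox)) {
            files.filter(p -> p.toString().endsWith(".zip"))
                 .forEach(zip -> results.add(pool.submit(() -> processZip(zip))));
        }

        for (Future<?> f : results) {
            f.get();   // wait for every archive and surface any processing exception
        }
        pool.shutdown();
    }
}
The same fan-out can also be expressed inside Spring Batch itself, for example by partitioning a step over the list of zip files and backing it with a ThreadPoolTaskExecutor, which keeps restartability and job metadata.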

Related

Is there a tool to extract a file from a ZIP archive when that file is not present in the central directory but has its own LFH?

I'm looking for a tool that can extract files by searching aggressively through a ZIP archive. The compressed files are preceded by LFHs, but no CDHs are present; unzip just outputs an empty folder.
I found one called 'binwalk', but even though it finds the hidden files inside ZIP archives, it does not seem to know how to extract them.
Thank you in advance.
You can try sunzip. It reads the zip file as a stream and extracts files as it encounters local headers and compressed data.
Use the -r option to retain the files already decompressed in the event of an error. You will be left with a temporary directory starting with _z containing the extracted files, albeit with temporary, random names.
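If you want to script the same idea yourself, note that stream-oriented readers work purely from the local file headers and never consult the central directory. As a rough sketch (not sunzip itself, and assuming the archive begins at a valid local header), Java's ZipInputStream behaves the same way:
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class StreamUnzip {
    public static void main(String[] args) throws IOException {
        Path outDir = Paths.get("recovered");
        Files.createDirectories(outDir);

        // ZipInputStream walks the archive sequentially, reading each local file
        // header and its compressed data; the central directory is never used.
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream("damaged.zip"))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path target = outDir.resolve(entry.getName());
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
                }
                zin.closeEntry();
            }
        }
    }
}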

How to read / readStream a directory containing files with completely different schemas

What if I have this:
Data:
/user/1_data/1.parquet
/user/1_data/2.parquet
/user/1_data/3.parquet
/user/2_data/1.parquet
/user/2_data/2.parquet
/user/3_data/1.parquet
/user/3_data/2.parquet
Each directory has files containing completely different schemas.
I don't want to have to create a separate streaming job for each folder. At the same time, I also want to save them in different locations.
How would I read / readStream them all without having to collect data to the driver or hard-code the directory paths?
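One workable pattern, sketched below on the assumption that listing the directory names on the driver is acceptable (that touches only file-system metadata, not the data itself): enumerate the per-user folders with the Hadoop FileSystem API and launch an independent read and write for each one. The /output prefix is hypothetical, the Java API is shown (the Scala/Python calls are equivalent), and a plain batch read is used because readStream on Parquet would additionally require each directory's schema to be supplied explicitly.
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PerDirectoryRead {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("per-dir-read").getOrCreate();

        // Listing the sub-directories of /user is a metadata operation only;
        // no row data is collected to the driver.
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            if (!status.isDirectory()) {
                continue;
            }
            String dir = status.getPath().toString();

            // Each directory has its own schema, so read it independently and
            // write it to its own (hypothetical) output location.
            Dataset<Row> df = spark.read().parquet(dir);
            df.write().mode("overwrite").parquet("/output/" + status.getPath().getName());
        }

        spark.stop();
    }
}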

Parquet file format on S3: which is the actual Parquet file?

Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:
myDF.write.mode(SaveMode.Overwrite)
.parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3 I actually see a directory called "mydata.parquet", as well as a file called "mydata.parquet_$folder$"! If I go into the mydata.parquet directory I see two files under it:
_SUCCESS; and
part-<big-UUID>.snappy.parquet
Whereas I was just expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (if so, what?) or is this expected with the Parquet file format? If it's expected, which is the actual Parquet file that I should read from:
mydata.parquet directory?; or
mydata.parquet_$folder$ file?; or
mydata.parquet/part-<big-UUID>.snappy.parquet?
Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet file is the actual Parquet data file. However, tools like Spark often break data sets into multiple part files and expect to be pointed at a directory that contains them. The _SUCCESS file is a simple flag indicating that the write operation has completed.
According to the API, saving a Parquet file writes it inside the folder you provide. _SUCCESS is an indication that the process completed successfully.
S3 creates those $folder$ entries if you commit directly to S3: the output is first written to temporary folders and then copied to the final destination inside S3, because S3 has no concept of rename.
Look at s3-distcp and also the DirectCommitter to address the performance issue.
The $folder$ marker is used by s3n/Amazon's EMRFS to indicate an "empty directory". Ignore it.
The _SUCCESS file is, as the others note, a 0-byte file. Ignore it too.
All other .parquet files in the directory are the output; the number you end up with depends on the number of tasks executed on the input.
When Spark uses a directory (tree) as a data source, all files beginning with _ or . are ignored; s3n will strip out those $folder$ entries too. So if you use the path for a new query, it will only pick up the actual parquet files.
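Put differently, you normally just point the reader at the directory and let Spark resolve the part files. A small sketch (Java API here; the Scala call differs only in syntax):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadMyData {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("read-mydata").getOrCreate();

        // Point the reader at the directory; Spark skips _SUCCESS and the $folder$
        // markers and loads every part-*.snappy.parquet file it finds inside.
        Dataset<Row> df = spark.read().parquet("s3n://com.example.mybucket/mydata.parquet");
        df.show();

        spark.stop();
    }
}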

CORB batch process output report extraction issue

While running the CORB job, I am extracting 100,000 URIs and loading the data into one file on a Linux server. The expectation is that all output records should be stored in one file with a 100k count. However, the data was stored in multiple files with different counts. Can anyone help me with the root cause of why the CORB process is creating multiple files in the output directory?
Please find below the details of the CORB properties file that I configured in my local directory.
Properties file:
THREAD-COUNT=4
PROCESS-TASK=com.marklogic.developer.corb.extension.ResilientTransform
SSL-CONFIG-CLASS=com.marklogic.developer.corb.TwoWaySSLConfig
SSL-PROPERTIES-FILE=/eiestore/ssl-configs/common-corb-sslconfig.properties
DECRYPTER=com.marklogic.developer.corb.HostKeyDecrypter
MODULE-ROOT=/a/abcmodules/corb-process/
MODULES-DATABASE="abcmodules"
URIS-MODULE=corb-select-uris.xqy
XQUERY-MODULE=corb-get-process.xqy
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
EXPORT-FILE-TOP-CONTENT=Id,value,type
EXPORT-FILE-DIR=/a/b/c/d/

I want to add additional files to existing archives (ZIP / RAR) or have the files added when compressing

I know how to do this for one archive at a time, but I want to add files to multiple archives in the same folder simultaneously, if that is possible. I understand that I can do this with a batch file... but I don't know how to write the script / text.
So... I have several zip files in one folder. I want to add a specific text file and a specific image file to each/all of those zips. I don't want any other modifications of the zip files.
Or... is there a way to set WinRAR so that specific files will be automatically added whenever an archive is created?
Thanks
import zipfile

# Open the existing archive in append mode and add a file to it.
with zipfile.ZipFile('cal.zip', mode='a', compression=zipfile.ZIP_DEFLATED) as z:
    z.write('/your/file/path')  # or z.writestr('your-filename', 'file-content')
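If you would rather script it than configure WinRAR, the same append-in-place idea extends to every archive in a folder. A sketch in Java using the JDK's built-in zip file system, with hypothetical file names (note.txt, logo.png) standing in for the text and image files you want to add:
import java.net.URI;
import java.nio.file.*;
import java.util.Collections;

public class AddToAllZips {
    public static void main(String[] args) throws Exception {
        // Walk every .zip in the current folder and copy the two files into it.
        try (DirectoryStream<Path> zips = Files.newDirectoryStream(Paths.get("."), "*.zip")) {
            for (Path zip : zips) {
                URI uri = URI.create("jar:" + zip.toUri());
                try (FileSystem zipFs = FileSystems.newFileSystem(uri, Collections.emptyMap())) {
                    Files.copy(Paths.get("note.txt"), zipFs.getPath("/note.txt"),
                               StandardCopyOption.REPLACE_EXISTING);
                    Files.copy(Paths.get("logo.png"), zipFs.getPath("/logo.png"),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}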
