CORB batch process output report extraction issue - Linux

While running the CORB job, I am extracting 100,000 URIs and loading the data into one file on a Linux server. The expectation is that all of the output records will be stored in a single file with a 100k record count. However, the data was stored in multiple files with different counts. Can anyone help me with the root cause of why the CORB process is creating multiple files in the output directory?
Below are the details of the CORB properties file that I configured in my local directory.
Properties file:
THREAD-COUNT=4
PROCESS-TASK=com.marklogic.developer.corb.extension.ResilientTransform
SSL-CONFIG-CLASS=com.marklogic.developer.corb.TwoWaySSLConfig
SSL-PROPERTIES-FILE=/eiestore/ssl-configs/common-corb-sslconfig.properties
DECRYPTER=com.marklogic.developer.corb.HostKeyDecrypter
MODULE-ROOT=/a/abcmodules/corb-process/
MODULES-DATABASE="abcmodules"
URIS-MODULE=corb-select-uris.xqy
XQUERY-MODULE=corb-get-process.xqy
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
EXPORT-FILE-TOP-CONTENT=Id,value,type
EXPORT-FILE-DIR=/a/b/c/d/

Related

How to read / readStream a directory containing files with completely different schemas

What if I have this:
Data:
/user/1_data/1.parquet
/user/1_data/2.parquet
/user/1_data/3.parquet
/user/2_data/1.parquet
/user/2_data/2.parquet
/user/3_data/1.parquet
/user/3_data/2.parquet
Each directory has files containing completely different schemas.
I don't want to have to create a separate streaming job for each folder. At the same time, I also want to save the output of each folder to a different location.
How would I read / readStream them all without having to collect data to the driver or hard-code the directory paths?
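One possible approach (not from the original post) is to list only the subdirectory paths on the driver, which does not collect any row data, and then start one streaming query per directory. The sketch below uses Spark's Java API with Parquet sources, enables spark.sql.streaming.schemaInference so each directory keeps its own schema, and uses hypothetical checkpoint and output locations under /checkpoints and /output:

// One streaming query per schema directory; a sketch, not production code.
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.SparkSession;

public class PerDirectoryStreams {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("per-directory-streams")
                // Let each file stream infer its own (different) schema.
                .config("spark.sql.streaming.schemaInference", "true")
                .getOrCreate();

        // Only the directory *paths* are listed on the driver; no row data is collected.
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            if (!status.isDirectory()) {
                continue;
            }
            String dir = status.getPath().toString();   // e.g. /user/1_data
            String name = status.getPath().getName();   // e.g. 1_data

            // Each directory gets its own query, checkpoint, and output location.
            spark.readStream()
                 .format("parquet")
                 .load(dir)
                 .writeStream()
                 .format("parquet")
                 .option("checkpointLocation", "/checkpoints/" + name)  // hypothetical
                 .option("path", "/output/" + name)                     // hypothetical
                 .start();
        }

        // Block until any of the per-directory queries terminates.
        spark.streams().awaitAnyTermination();
    }
}

Each query writes to its own location, so the differing schemas never have to be unified, and new directories only require a restart of the job rather than code changes.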

Design a batch job to process multiple files in an FTP folder

I want to design a batch job to process multiple zip files in a folder. Basically, each input zip file contains a directory structure, and the last directory holds a CSV file and a set of PDFs. The job should take a zip file, unzip it, and upload the contents to an external system and a database based on the index file in the leaf-node folder.
Example input zip file structure:
input1.zip
--Folder 1
--> Folder2
--> abc.pdf
...
...
...
--> cdf.pdf
--> metadata.csv
I can add Spring Integration and invoke the job just after the FTP copy is completed. However, my question is: how should I design the job to pick up multiple zip files and allow them to be processed in parallel?
Since each zip file takes around 10 minutes to process, I need multiple instances to process the zip files efficiently.
Appreciate any suggestions. Thank you.
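One way to get that parallelism (a sketch under assumptions, not a definitive design) is Spring Batch partitioning: a MultiResourcePartitioner creates one partition per zip file, and a TaskExecutor runs several partitions concurrently. The configuration below assumes Spring Batch 4.x with Java config; the /ftp/inbox/*.zip pattern, the bean names, and the ZipTasklet placeholder (which would unzip the archive, read metadata.csv, and upload the PDFs) are hypothetical:

// Spring Batch partitioning sketch: one partition per zip file, processed in parallel.
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class ZipBatchConfig {

    @Bean
    public MultiResourcePartitioner zipPartitioner() throws Exception {
        // One partition per zip file dropped in the FTP folder (path is hypothetical).
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(new PathMatchingResourcePatternResolver()
                .getResources("file:/ftp/inbox/*.zip"));
        return partitioner;
    }

    @Bean
    @StepScope
    public ZipTasklet zipTasklet(@Value("#{stepExecutionContext['fileName']}") String fileName) {
        // Each partition exposes its own zip file under the "fileName" context key.
        return new ZipTasklet(fileName);
    }

    @Bean
    public Step partitionedStep(StepBuilderFactory steps, ZipTasklet zipTasklet,
                                MultiResourcePartitioner zipPartitioner) {
        Step workerStep = steps.get("workerStep").tasklet(zipTasklet).build();

        // Run up to four zip files at a time; each one takes ~10 minutes.
        SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("zip-");
        executor.setConcurrencyLimit(4);

        return steps.get("partitionedStep")
                .partitioner("workerStep", zipPartitioner)
                .step(workerStep)
                .taskExecutor(executor)
                .build();
    }

    @Bean
    public Job zipJob(JobBuilderFactory jobs, Step partitionedStep) {
        return jobs.get("zipJob").start(partitionedStep).build();
    }

    // Placeholder for the real work: unzip, read metadata.csv, upload PDFs, update the DB.
    public static class ZipTasklet implements Tasklet {
        private final String fileName;

        public ZipTasklet(String fileName) {
            this.fileName = fileName;
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
            System.out.println("Processing " + fileName);
            return RepeatStatus.FINISHED;
        }
    }
}

The Spring Integration flow that watches the FTP folder could then simply launch zipJob once the copy completes, and the partitioned step takes care of fanning out across the available zip files.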

SSIS package strange data flow issue, producing an empty Excel file with a large dataset

I am having an issue with an SSIS package: running it from BIDS, I could export 400K records successfully, but when I run it from the job, the package runs successfully yet the Excel file is empty.
The user I am running the package as has full access to the C:\Users folders, and I see it saving the data into the temporary folder but not writing that data into the file, so it finishes with an empty file.
For example, with 230,000 records (works fine):
Create the Excel file
Load the temporary data
Write data into the file
Close the file
With 330,000 records (not working):
Create the Excel file
Load the temporary data
Write data into the file <-- this step is missing from Process Monitor
Close the file
Suggested solution: granting the user executing the package permission to the folder C:\Users\Default doesn't work for me.
Please help!
Sorry for bugging you guys, I found the problem. There was only 1.6GB of disk space on the server; I thought the file would take just 200MB of space, but the export generates lots of temporary files, causing a disk-full error. Strange that the SSIS package ran successfully without giving any warning or error. Thanks for looking into it.

NetSuite SuiteScript: how to bridge the 10MB limit?

Hi, and thanks for any help. Is there a way to work with files larger than 10MB? I have to check for updates on items in a file that would be uploaded, but the file contains all items in the system and is approximately 20MB. This 10MB limit is killing me. I see streaming for file save and appending, but not for file reading, so I am open to any suggestions. The provider in this instance doesn't offer the facility to chunk the files. Thanks in advance for your help.
If you are using SuiteScript 2.0 to process a file from the file cabinet and you use file.lines.iterator() to read it, the size limit is 10MB per line.
I believe returning a file object from a map/reduce script's getInputData stage automatically parses the file into lines.
The 10MB file size limit comes into play if you try to create a file larger than 10MB.
If you are trying to read in an external file via script, then one approach that I've used is to proxy the call via an external service, e.g. query an AWS Lambda function that checks for and saves the file to S3. Return the file path and size to your SuiteScript. The SuiteScript then asks for "pages" of the file that are less than 10MB and saves those. If you are uploading something like a .csv, then the Lambda function can send the header with each paged request.
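The original answer doesn't include code, but a rough sketch of the proxy side of that approach could look like the Java Lambda handler below: it returns one byte-range "page" of an S3 object per request, keeping each response under the 10MB limit. The bucket/key/page request parameters and the 5MB page size are assumptions for illustration:

// AWS Lambda sketch: serves < 10MB "pages" of an S3 object to the SuiteScript caller.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

import java.util.Base64;
import java.util.Map;

public class FilePageHandler implements RequestHandler<Map<String, String>, String> {

    // Keep each page comfortably under NetSuite's 10MB limit (size is an assumption).
    private static final long PAGE_SIZE = 5L * 1024 * 1024;

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    public String handleRequest(Map<String, String> request, Context context) {
        // "bucket", "key" and "page" are hypothetical request parameters.
        String bucket = request.get("bucket");
        String key = request.get("key");
        long page = Long.parseLong(request.getOrDefault("page", "0"));

        long start = page * PAGE_SIZE;
        long end = start + PAGE_SIZE - 1; // inclusive byte range

        // Fetch only the requested byte range of the file stored in S3.
        GetObjectRequest rangeRequest = new GetObjectRequest(bucket, key).withRange(start, end);
        try (S3Object object = s3.getObject(rangeRequest)) {
            byte[] bytes = IOUtils.toByteArray(object.getObjectContent());
            // Base64-encode so the page survives a JSON response back to SuiteScript.
            return Base64.getEncoder().encodeToString(bytes);
        } catch (Exception e) {
            throw new RuntimeException("Failed to read page " + page + " of " + key, e);
        }
    }
}

The SuiteScript side would then loop over page numbers, decode each response, and append it to a file in the file cabinet until the returned page is shorter than the page size.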

File Name and Variable in Flume

Right now I am working on a project where we are trying to read the Tomcat access log using Flume, process that data in Spark, and dump it into a DB in the proper format. But the problem is that the Tomcat access log is a daily rolling file, and the file name changes every day. Something like...
localhost_access_log.2017-09-19.txt
localhost_access_log.2017-09-18.txt
localhost_access_log.2017-09-17.txt
and the source section of my Flume conf file is something like this:
# Describe/configure the source
flumePullAgent.sources.nc1.type = exec
flumePullAgent.sources.nc1.command = tail -F /tomcatLog/localhost_access_log.2017-09-17.txt
#flumePullAgent.sources.nc1.selector.type = replicating
This runs the tail command on a fixed file name (I used a fixed name for testing only). How can I pass the file name as a parameter in the Flume conf file?
In fact, even if I somehow managed to pass the file name as a parameter, that still would not be a real solution. Say I start Flume today with some file name (for example "localhost_access_log.2017-09-19.txt"); tomorrow, when the file name changes (from localhost_access_log.2017-09-19.txt to localhost_access_log.2017-09-20.txt), someone has to stop Flume and restart it with the new file name. In that case it is not a continuous process; I would have to stop/start Flume using a cron job or something similar. Another problem is that I would lose some data every day during the switchover (the server we are working with is a high-throughput server, almost 700-800 TPS), i.e. the time it takes to generate the new file name plus the time to stop Flume plus the time to start Flume.
Does anyone have an idea how to run Flume with a rolling file name in a production environment? Any help will be highly appreciated.
The exec source is not suitable for your task; you can instead use the Spooling Directory Source. From the Flume user guide:
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear.
Then, in the config file, you'd point the source at your logs directory like this:
agent.sources.spooling_src.type = spooldir
agent.sources.spooling_src.spoolDir = /tomcatLog
