Does Oracle sqlldr process multiple INFILE in parallel - multithreading

I am using sqlldr to load data into Oracle RAC (on Linux), and I am trying to improve the performance of my data loading. I am using 'Direct Path' and I've set 'parallel=true' for sqlldr. Moreover, since my servers are multi-core, multithreading is enabled by default.
Now I am thinking about splitting the input file on each server into several chunks and loading them in parallel. I learned that one can list multiple INFILE files in the control file for sqlldr. My question is:
if I list several INFILE files in a single control file and launch one sqlldr instance, does it process the files in parallel, or does it go through them sequentially?
Because another option for me is to launch, in parallel, as many sqlldr instances as the number of chunks that I create on each server, where each sqlldr instance has its own control file listing only one INFILE file. But this option only makes sense if sqlldr processes multiple INFILE files sequentially.
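(For reference, a minimal sketch of that second option, launching one sqlldr instance per chunk from Python's subprocess module and waiting for all of them; the control file names and the connect string below are hypothetical placeholders.)

import subprocess

# Hypothetical control files, each listing a single INFILE for one chunk.
control_files = ["chunk_01.ctl", "chunk_02.ctl", "chunk_03.ctl", "chunk_04.ctl"]

# Launch one sqlldr process per chunk; userid is a placeholder connect string.
processes = [
    subprocess.Popen([
        "sqlldr",
        "userid=scott/tiger@racdb",
        f"control={ctl}",
        f"log={ctl}.log",
        "direct=true",
        "parallel=true",
    ])
    for ctl in control_files
]

# Wait for every loader to finish and report its exit code.
for ctl, proc in zip(control_files, processes):
    proc.wait()
    print(ctl, "exited with", proc.returncode)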

Since you are using "direct load", you cannot parallelize it this way.
A direct load "locks" the high-water mark of the table/partition and writes the data above it; therefore another process cannot take that lock at the same time. A parallel process would have to wait for the current load to finish.
(I assume you don't control which partitions you load into. If you can control that, you can tune it at a finer grain, but usually the data to load is not divided into files the way it is divided into partitions, if you use partitions at all.)
If you give that up, the parallelism will be managed "automagically" for you by the parameters you pass.
But I would recommend you stay with the "direct load", since it is probably much, much faster than any other loading method that exists (although the lock it takes is very coarse).

Related

How to run multiple queries in Scylla using "Non Atomic" Batch/Pipeline

I understand that Scylla allows batch statements like these.
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
These statements have performance implications because they ensure atomicity. However, I simply have many insert statements that I want to perform from my node client in a single IO. Atomicity among these inserts is not needed. Any idea how I can do that? I can't find anything.
Batching multiple inserts in the Cassandra world is usually an antipattern (except when they all go into one partition, see the docs). When you send inserts into multiple partitions in one batch, the coordinator node has to take the data from the batch and send it to the nodes that own it. This puts additional load onto the coordinating node, which first needs to back up the content of the batch just so it isn't lost if the coordinator crashes in the middle of execution, and then needs to execute all the operations and wait for the results before sending a reply to the caller (see this diagram to understand how a so-called logged batch works).
When you don't need atomicity, the best performance comes from sending multiple inserts in parallel and waiting for their execution: it is faster, it puts less load on the nodes, and the driver can use a token-aware load balancing policy so requests are sent to the nodes that own the data (if you're using prepared statements). In node.js you can achieve this with the Concurrent Execution API; there are several variants of its usage, so it's best to look into the documentation to select what is best for your use case.
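(To illustrate the same pattern, here is a rough sketch using the Python driver's cassandra.concurrent helper instead of the node.js API; the contact point, keyspace, table and column names are made up, and the concurrency level is just an example.)

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # placeholder keyspace

# A prepared statement lets the token-aware policy route each insert to a
# replica that owns the partition.
insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

rows = [(i, "payload-%d" % i) for i in range(10000)]

# No batch, so no atomicity guarantee and no extra coordinator work; up to
# `concurrency` requests are kept in flight at a time.
results = execute_concurrent_with_args(
    session, insert, rows, concurrency=100, raise_on_first_error=False
)

for ok, result_or_exc in results:
    if not ok:
        print("insert failed:", result_or_exc)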

use of df.coalesce(1) in csv vs delta table

When saving to a Delta table we avoid 'df.coalesce(1)', but when saving to CSV or Parquet we (my team) add 'df.coalesce(1)'. Is it a common practice? Why? Is it mandatory?
In most cases where I have seen df.coalesce(1), it was done to generate only one file, for example to import a CSV file into Excel or a Parquet file into a Pandas-based program. But if you're doing .coalesce(1), then the write happens via a single task, and that task becomes the performance bottleneck because it needs to get the data from the other executors and write it.
If you're consuming the data from Spark or another distributed system, having multiple files is beneficial for performance because you can write and read them in parallel. By default, Spark writes N files into the directory, where N is the number of partitions. As @pltc noticed, this may generate a big number of files, which is often not desirable because you get performance overhead from accessing them. So we need a balance between the number of files and their size: for Parquet and Delta (which is based on Parquet), bigger files bring several performance advantages - you read fewer files, you get better compression of the data inside each file, etc.
For Delta specifically, .coalesce(1) has the same problem as for other file formats - you're writing via one task. Relying on the default Spark behaviour and writing multiple files is beneficial from a performance point of view - each node writes its data in parallel - but you can get too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as @Kafels correctly pointed out, there are optimizations that allow you to remove that .coalesce(N), do automatic tuning to achieve the best throughput (so-called "Optimized Writes"), and create bigger files ("Auto Compaction") - but they should be used carefully.
Overall, the optimal file size for Delta is an interesting topic in itself - if you have big files (1Gb is the default used by the OPTIMIZE command), you can get better read throughput, but if you're rewriting them with MERGE/UPDATE/DELETE, then big files are bad from a performance standpoint, and it's better to have smaller files (16-64-128Mb) so you rewrite less data.
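(A small PySpark sketch of the write patterns discussed above; the dataframe and output paths are placeholders, and the Delta write assumes the Delta Lake package is available in the session.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).toDF("value")   # placeholder dataframe

# Single CSV file: handy for Excel/Pandas consumers, but the whole write
# goes through one task on one executor.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/tmp/out_csv_single")

# Default behaviour: one file per partition, written in parallel
# (assumes the Delta Lake package is on the classpath).
df.write.format("delta").mode("overwrite").save("/tmp/out_delta")

# Middle ground: a handful of bigger files while keeping some parallelism.
df.coalesce(8).write.mode("overwrite").parquet("/tmp/out_parquet")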
TL;DR: it's not mandatory, it depends on the size of your dataframe.
Long answer:
If your dataframe is 10Mb and you have, for example, 1000 partitions, each file would be about 10Kb. Having so many small files would reduce Spark performance dramatically, not to mention that when you have too many files, you'll eventually reach the OS limit on the number of files. Anyhow, when your dataset is small enough, you should merge it into a couple of files with coalesce.
However, if your dataframe is 100G, technically you can still use coalesce(1) and save it to a single file, but later on you will have to deal with less parallelism when reading from it.

What is the fastest way to put a large amount of data on a local file system onto a distributed store?

I have a single local directory on the order of 1 terabyte. It is made up of millions of very small text documents. If I were to iterate through each file sequentially for my ETL, it would take days. What would be the fastest way for me to perform ETL on this data, ultimately loading it onto a distributed store like HDFS or a Redis cluster?
Generically: try to use several/many parallel asynchronous streams, one per file. How many depends on several factors (number of destination endpoints, disk IO for traversing/reading the data, network buffers, errors and latency...).
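(A rough sketch of the "one stream per file" idea using a Python thread pool; upload_to_store is a hypothetical stand-in for whatever HDFS or Redis client call you actually use, and the source directory and worker count are placeholders.)

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def upload_to_store(path):
    # Hypothetical placeholder: replace with e.g. an HDFS put or a Redis SET
    # of the file contents.
    data = path.read_bytes()
    # ... send `data` to the distributed store here ...
    return path

files = Path("/data/local_corpus").rglob("*.txt")   # placeholder source directory

# Many uploads in flight at once; tune max_workers to your disk and network.
with ThreadPoolExecutor(max_workers=64) as pool:
    futures = [pool.submit(upload_to_store, f) for f in files]
    for fut in as_completed(futures):
        try:
            fut.result()
        except Exception as exc:
            print("upload failed:", exc)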

Parallel loading of CSV files into one table by SQLLDR with SEQUENCE(MAX,1)

I have around 100 threads running in parallel and dumping data into a single table using an sqlldr ctl file. The control file generates values for ID using the expression ID SEQUENCE(MAX,1).
The process fails to load the files properly due to the parallel execution; it seems two or more threads can get the same ID. It works fine when I run it sequentially with one single thread.
Please suggest a workaround.
Each CSV file contains data associated with a test case, and the cases are supposed to run in parallel. I cannot concatenate all the files in one go.
You could load the data and then run a separate update that populates ID from a traditional Oracle sequence.
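(A sketch of that workaround using python-oracledb, assuming the SEQUENCE(MAX,1) expression is dropped from the control file so ID is loaded as NULL; the connect string, sequence, table and column names are all placeholders.)

import oracledb

conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")
cur = conn.cursor()

# One-time setup: a regular sequence, which is safe under concurrent use.
# cur.execute("CREATE SEQUENCE results_id_seq START WITH 1 INCREMENT BY 1")

# After all the parallel sqlldr runs have finished, assign ids to the rows
# that were loaded without one.
cur.execute("UPDATE test_results SET id = results_id_seq.NEXTVAL WHERE id IS NULL")
conn.commit()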

Should access to files stored in an hsqldb database be serialized?

Given:
One can access an HSQLDB database concurrently using connections pooled with the help of the Apache Commons DBCP package.
I store files in a CACHED table in an embedded HSQLDB database.
It is known that files on a conventional hard drive (as opposed to a solid-state drive) should not be accessed from multiple threads, because we are likely to get performance degradation rather than a boost. This is because of the time it takes to move the mechanical read head back and forth between the files with each thread context switch.
Question:
Does this rule hold for files managed in an HSQLDB database? The file sizes may range from several KB to several MB.
HSQLDB accesses two files for data storage during operations: one file for all CACHED table data, and another file for all the lobs. It manages access to these files internally.
With multiple threads, there is a possibility of reduced access speed in the following circumstances.
Simultaneous read and write access to large tables.
Simultaneous read and write access to lobs larger than 500KB.

Resources