Spool file larger than 4GB in sqlplus - linux

I am selecting billions of rows from an Oracle database, running sqlplus on RHEL 5. I am doing the following:
>spool on
>set heading off
>set trimsp on pages 0
1) The problem is that once the spool file grows past 4GB, sqlplus stops writing the selected rows to it. How can I capture the full output? (I estimate the resulting file should be approximately 9GB.)
2) Also, I don't want the result rows to be displayed on the console; they should go directly into the spool file (echoing them to the screen slows the operation down).

1) I'm not sure, but according to this source there seems to be a limit on the size of the spool file. The file system in use may also place a limit on maximum file size (4GB is a typical cap on older file systems and 32-bit builds), so this is most likely a file-size limit issue.
2) For this you can add SET TERMOUT OFF, which suppresses the display of query output on the console. Note that TERMOUT OFF only takes effect when the commands are run from a script file (e.g. via @script.sql), not when typed interactively.
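To rule out a per-process file-size limit (the same limit that ulimit -f reports in a shell), here is a small illustrative sketch that reads the current limit from Python; the spool cap could equally come from the file system itself, so this only checks one of the possibilities:

```python
import resource

# RLIMIT_FSIZE is the largest file (in bytes) this process may create.
soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)

def fmt(limit):
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"

print("soft file-size limit:", fmt(soft))
print("hard file-size limit:", fmt(hard))
```

If the soft limit prints as roughly 4GB rather than "unlimited", the cap is coming from the process environment, not from sqlplus.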

Related

PostgreSQL 12 - large cache files in "base" directory

I have noticed that my small 8MB database is cached by PostgreSQL 12 in this folder:
/var/lib/postgresql/12/main/base/16384
This folder contains multiple 1GB files, for example "16417", which is 1073741824 bytes in size.
How can I remove these files and limit this space to a maximum of 100 GB? Right now it uses as much space as it can and fills my disk (no space left).
In postgresql.conf I have changed these options:
temp_file_limit = 10000000
shared_buffers = 128MB
checkpoint_timeout = 12h
max_wal_size = 100MB
min_wal_size = 80MB
but unfortunately it did not help.
What else can I do to resolve this issue? In one hour these files grew to 80 GB...
EDIT: This issue occurs even with default settings. My system is Ubuntu 18.04.4.
This is not a cache, these are the actual tables and indexes. If you mess with these files, you will break your database and lose data.
Figure out what database 16384 is:
SELECT datname FROM pg_database WHERE oid = 16384;
Then connect to that database and figure out what 16417 is:
SELECT relname, relnamespace::regnamespace, relkind
FROM pg_class WHERE relfilenode = 16417;
If the size of that object is bigger than it should be, perhaps you have a bloated table or index, and VACUUM (FULL) tablename can make it smaller (but don't forget that the table is inaccessible while it is being rewritten!).
Again, make sure you don't manipulate any of those files yourself.

How to COPY a large Cassandra table without running out of memory?

I am attempting to run a simple Cassandra database COPY script, like the example below (or some variation that is very similar):
COPY my_keyspace_name.my_table_name TO 'cassandra_dump/my_keyspace_name.my_table_name.csv' WITH HEADER=true AND PAGETIMEOUT=40 AND PAGESIZE=20 AND DELIMITER='|';
It works on most tables except my largest one. In that case I get an error where it cannot allocate enough memory. The file size of the table is nowhere near as large as the error message claims (less than 1GB).
749314 rows exported to 1 files in 9 minutes and 11.240 seconds.
./dump_cassandra.sh: xmalloc: ../../.././lib/sh/strtrans.c:63: cannot allocate 18446744072166431589 bytes (6442528768 bytes allocated)", "stdout_lines": ["[Thu May 17 13:41:47 UTC 2018] Executing the following query:", "COPY my_keyspace_name.my_table_name TO 'cassandra_dump/my_keyspace_name.my_table_name.csv' WITH HEADER=true AND PAGETIMEOUT=40 AND PAGESIZE=20 AND DELIMITER='|';"
This answer seemed promising, but unfortunately it does not work for me.
Is there something I am missing that is preventing me from running a successful COPY on a large (relatively speaking) table?
--
EDIT: This error seems to be environmental. I have had mixed results on different servers with nearly identical amounts of data.
Setting MAXOUTPUTSIZE splits the backup data across multiple files, and the error no longer occurs:
COPY my_keyspace_name.my_table_name TO 'cassandra_dump/my_keyspace_name.my_table_name.csv' WITH HEADER=true AND PAGETIMEOUT=40 AND MAXOUTPUTSIZE=100000 AND DELIMITER='|';

Liferay: huge DLFileRank table

I have a Liferay 6.2 server that has been running for years and is starting to take a lot of database space, despite limited actual content.
Table            Size    Number of rows
---------------------------------------
DLFileRank       5 GB    16 million
DLFileEntry      90 MB   60,000
JournalArticle   2 GB    100,000
The size of the DLFileRank table seems abnormally large to me (if it is in fact normal, please let me know).
While the file ranking feature of Liferay is nice to have, we would not really mind resetting it if it halves the size of the database.
Question: Would a DELETE FROM DLFileRank be safe? (Stop Liferay, run that SQL command, maybe set dl.file.rank.enabled=false in portal-ext.properties, then start Liferay again.)
Is there any better way to do it?
Bonus if there is a way to keep recent ranking data and throw away only the old data (not a strong requirement).
Wow. According to the documentation here (Ctrl-F rank), I'd not have expected the number of entries to be so high - did you configure those values differently?
Set the interval in minutes on how often CheckFileRankMessageListener
will run to check for and remove file ranks in excess of the maximum
number of file ranks to maintain per user per file. Defaults:
dl.file.rank.check.interval=15
Set this to true to enable file rank for document library files.
Defaults:
dl.file.rank.enabled=true
Set the maximum number of file ranks to maintain per user per file.
Defaults:
dl.file.rank.max.size=5
And according to the implementation of CheckFileRankMessageListener, it should be enough to just trigger DLFileRankLocalServiceUtil.checkFileRanks() yourself (e.g. through the scripting console). Why you accumulate that large number of files is beyond me...
As you might know, I will never be quoted recommending direct database manipulation - in fact, I refuse to think about the problem that way.

Why does Spark job fail with "too many open files"?

I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files? What steps can I take to make my job succeed?
This has been answered on the spark user list:
The best way is definitely just to increase the ulimit if possible,
this is sort of an assumption we make in Spark that clusters will be
able to move it around.
You might be able to hack around this by decreasing the number of
reducers [or cores used by each node] but this could have some performance implications for your
job.
In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.
-Patrick Wendell
The default ulimit is 1024, which is ridiculously low for large-scale applications. HBase recommends up to 64K; modern Linux systems don't seem to have trouble with this many open files.
Use
ulimit -a
to see all current limits, including the maximum number of open files.
ulimit -n <new-limit>
temporarily changes the open-file limit; to make the change permanent you need to update the system configuration files and per-user limits. On CentOS and RedHat systems, these can be found in
/etc/sysctl.conf
/etc/security/limits.conf
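For example, a persistent per-user limit is typically set in /etc/security/limits.conf with entries like the following (the 65536 value is just an illustrative choice, in line with the HBase recommendation above):

```
# /etc/security/limits.conf
*    soft    nofile    65536
*    hard    nofile    65536
```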
Another solution for this error is reducing the number of partitions.
Check whether you have a lot of partitions with:
someBigSDF.rdd.getNumPartitions()
Out[]: 200
#if you need to persist the repartition, do it like this
someBigSDF = someBigSDF.repartition(20)
#if you just need it for one transformation/action,
#you can do the repartition inline like this
someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried sort -R, but it is very slow (it takes around 50 minutes for a 16M file). Is there a faster utility that I can use instead?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
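The idea behind shuf - a plain random permutation of the lines - can be sketched in a few lines of Python (an illustration of the approach, not shuf's actual implementation, and it still needs the whole file to fit in memory):

```python
import random

def shuffle_lines(lines, seed=None):
    """Return a random permutation of lines."""
    rng = random.Random(seed)
    out = list(lines)
    rng.shuffle(out)  # random.shuffle is an in-place Fisher-Yates shuffle
    return out

lines = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
shuffled = shuffle_lines(lines, seed=42)
# Same multiset of lines, different order.
```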
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
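The same decorate/sort/strip shape can be shown compactly in Python (hypothetical helper name, just to make the three steps of the transform explicit; the keys parameter stands in for the externally generated random-number file):

```python
import random

def schwartzian_shuffle(lines, keys=None):
    """Shuffle lines by pairing each with a random key, sorting, then dropping keys."""
    if keys is None:
        # Stand-in for the pre-generated random-number file.
        keys = [random.random() for _ in lines]
    decorated = list(zip(keys, lines))        # the "paste" step
    decorated.sort(key=lambda pair: pair[0])  # the "sort" step
    return [line for _, line in decorated]    # the "cut -f2-" step
```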
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize - the number of lines to keep in RAM while writing the output. The larger the better (unless you run out of RAM), because the total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not each batch separately.
The algorithm is as follows.
Count lines in sourceFile. This is done simply by reading the whole file line-by-line. (See some comparisons here.) This also measures how long a single full read of the file takes, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This would give us an order in which we want to have lines in a shuffled file. Note that this is a global order over the whole file, not per batch or chunk or something.
Now the actual code. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we split the task.
We go through sourceFile reading all lines, keeping in memory only those lines whose indices appear in the first batchSize entries of orderArray. Once we have all these lines, we write them into outFile in the required order, and that is batchSize/linesCount of the work done.
Then we repeat the whole process, taking the next part of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
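The multi-pass scheme above can be sketched in Python (a toy version of the described algorithm, not the actual HugeFileProcessor code; each pass holds at most batch_size lines in memory and reads the source sequentially):

```python
import random

def external_shuffle(src_path, dst_path, batch_size, seed=None):
    """Shuffle src_path into dst_path, holding at most batch_size lines in RAM."""
    rng = random.Random(seed)

    # Pass 0: count lines (one sequential read).
    with open(src_path) as f:
        lines_count = sum(1 for _ in f)

    # Global output order for the whole file: output line p is source line order[p].
    order = list(range(lines_count))
    rng.shuffle(order)  # Fisher-Yates shuffle of the index array

    with open(dst_path, "w") as out:
        # Ceil(lines_count / batch_size) passes, each a full sequential read.
        for start in range(0, lines_count, batch_size):
            wanted = order[start:start + batch_size]
            wanted_set = set(wanted)
            held = {}
            with open(src_path) as f:
                for i, line in enumerate(f):
                    if i in wanted_set:
                        held[i] = line
            # Emit this batch in the globally shuffled order.
            for i in wanted:
                out.write(held[i])
```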
Why it works?
Because all we do is read the source file from start to end. No seeking forward/backward, which is what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.
