I have a Linux server on AWS that hosts a PostgreSQL database server and a Node.js API server. It has 240GB of RAM, and the major portion is dedicated to the database. Approximately 32GB of memory is left to the OS to share among the
app server, various OS-level caches, and whatever other needs arise. There is no explicit fsync. The database + indexes add up to about 500-600 GB.
Within the database, some operations may spill over to disk for want of enough work memory, e.g. a sort operation writing a temp file. Thus, within a span
of, say, 1-2 seconds, a temp file may be created, written to, read back and then deleted. The temp file size, I am guessing, should seldom exceed 100MB.
Question - will the temp file in this scenario necessarily cause disk I/O? If not, how does one assess the likelihood? What parameters most influence this?
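To make the scenario concrete, here is a minimal C sketch of the same create/write/read-back/delete pattern (this is not PostgreSQL code; the path and sizes are invented for illustration). As far as I understand it, the knobs that matter most are work_mem on the PostgreSQL side (whether the spill happens at all) and, on the OS side, the amount of free page cache plus the vm.dirty_* writeback settings (whether dirty pages get flushed before the file is deleted).

/* Minimal sketch of the short-lived temp file pattern described above.
 * Not PostgreSQL code; it just mimics "create, write, read back, delete"
 * without any fsync. Whether these writes ever reach the disk depends on
 * the kernel's dirty-writeback settings (vm.dirty_background_bytes/ratio,
 * vm.dirty_expire_centisecs) and on how much free page cache is available:
 * if the file is deleted (unlinked and closed) before writeback kicks in,
 * the kernel can drop the dirty pages without ever writing the data out.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (1 << 20)          /* 1 MiB per write                     */
#define CHUNKS  100                /* ~100 MiB total, as in the question  */

int main(void)
{
    char *buf = malloc(CHUNK);
    if (!buf) return 1;
    memset(buf, 'x', CHUNK);

    /* Hypothetical path; a real sort spill would live under pgsql_tmp. */
    const char *path = "/tmp/spill_demo.tmp";
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < CHUNKS; i++)          /* write phase */
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); return 1; }

    lseek(fd, 0, SEEK_SET);
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)    /* read-back phase: served
                                                 from the page cache      */
        ;

    unlink(path);   /* deleted within a second or two; no fsync anywhere */
    close(fd);
    free(buf);
    return 0;
}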
Related
I want to know the maximum Excel file size that we can load into a database using a simple ETL SSIS package. If the file size limit depends on system configuration or resources, how can we calculate it? In my case I am trying to load an Excel file of 500+ MB.
My package hangs even while trying to map columns.
Thanks.
The only real limitation is the amount of memory (RAM) on the machine where the package is running, as SSIS loads the data into memory.
Thus, if you only have 2GB of RAM, I wouldn't try to load files bigger than 1GB (you must leave RAM for SQL Server to operate, and don't forget about all your other applications).
Also remember that if you're not pipelining your data flows properly and you have blocking parts such as Aggregate or SQL Command objects, you are going to be loading far more into memory than you should.
The file size is not as important if you have no blocking parts: SSIS won't load the entire object into memory, and you can specify how much memory it uses. But if there are blocking parts, it will need the entire object in memory.
Note that another big memory hog can be Lookup tasks with full caching - these can take up large amounts of memory if you are loading big tables.
Hope this helps.
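This is not SSIS code, but the streaming-versus-blocking distinction above can be illustrated in plain C: a streaming aggregate runs in constant memory no matter how large the input is, while a blocking operation such as a full sort has to hold every row before it can emit the first one (the row format here is just whitespace-separated numbers, invented for the sketch).

/* Illustration of streaming vs. blocking processing (not SSIS itself).
 * Streaming: a running sum needs only one row's worth of memory.
 * Blocking:  a full sort must buffer every row before emitting output,
 *            which is why input size matters much more for blocking parts.
 */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Streaming aggregate: O(1) memory regardless of row count. */
static double streaming_sum(FILE *in)
{
    double v, sum = 0.0;
    while (fscanf(in, "%lf", &v) == 1)
        sum += v;
    return sum;
}

/* Blocking operation: must load all n rows before sorting. O(n) memory. */
static double *blocking_sort(FILE *in, size_t *n_out)
{
    size_t n = 0, cap = 1024;
    double *rows = malloc(cap * sizeof *rows);
    double v;
    while (rows && fscanf(in, "%lf", &v) == 1) {
        if (n == cap) {
            cap *= 2;
            double *tmp = realloc(rows, cap * sizeof *rows);
            if (!tmp) { free(rows); return NULL; }
            rows = tmp;
        }
        rows[n++] = v;
    }
    if (rows)
        qsort(rows, n, sizeof *rows, cmp_double);
    *n_out = n;
    return rows;
}

int main(int argc, char **argv)
{
    /* Pipe whitespace-separated numbers on stdin. With any argument the
     * blocking path is used; otherwise the streaming path. */
    if (argc > 1) {
        size_t n;
        double *rows = blocking_sort(stdin, &n);
        if (!rows) return 1;
        printf("median ~ %f (held %zu rows in memory)\n",
               n ? rows[n / 2] : 0.0, n);
        free(rows);
    } else {
        printf("sum = %f (constant memory)\n", streaming_sum(stdin));
    }
    return 0;
}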
I am writing software in C, on Linux running on AWS, that has to handle 240 terabytes of data, in 72 million files.
The data will be spread across 24 or more nodes, so there will only be 10 terabytes on each node, and 3 million files per node.
Because I have to append data to each of these three million files every 60 seconds, the easiest and fastest thing to do would be to keep all of these files open at once.
I can't store the data in a database, because the performance in reading/writing the data will be too slow. I need to be able to read the data back very quickly.
My questions:
1) is it even possible to keep open 3 million files
2) if it is possible, how much memory would it consume
3) if it is possible, would performance be terrible
4) if it is not possible, I will need to combine all of the individual files into a couple of dozen large files. Is there a maximum file size in Linux?
5) if it is not possible, what technique should I use to append data every 60 seconds, and keep track of it?
The following is a very coarse description of an architecture that can work for your problem, assuming that the maximum number of file descriptors is irrelevant when you have enough instances.
First, take a look at this:
https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/
https://aws.amazon.com/efs/
EFS provides shared storage that you can mount as a filesystem.
You can store ALL your files in a single EFS storage unit. Then you will need a set of N worker machines running at their full capacity of file handles. You can then use a Redis queue to distribute the updates. Each worker dequeues a set of updates from Redis, then opens the necessary files and performs the updates.
Again: the maximum number of open file handles will not be a problem, because if you hit a maximum, you only need to increase the number of worker machines until you achieve the performance you need.
This is scalable, though I'm not sure if this is the cheapest way to solve your problem.
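If you want to probe how far a single node can go before scaling out, the per-process descriptor limit is easy to inspect and raise from C. A minimal sketch, assuming Linux/glibc; note that each open descriptor also costs kernel memory (roughly on the order of a kilobyte once the struct file and cached inode/dentry are counted), so 3 million open files means a few gigabytes of kernel memory on top of the limit itself.

/* Query and (attempt to) raise the per-process open-file limit.
 * Going beyond the hard limit (or the system-wide fs.file-max) requires
 * root privileges / sysctl changes; the numbers here are illustrative.
 */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("current soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    /* Try to raise the soft limit to the hard limit. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    else
        printf("soft limit raised to %llu\n",
               (unsigned long long)rl.rlim_cur);

    return 0;
}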
I'm currently running MongoDB 2.6.7 on CentOS Linux release 7.0.1406 (Core).
Soon after the server starts, as well as (sometimes) after a period of inactivity (at least a few minutes), all queries take longer than usual. After a while, their speeds increase and stabilize at a much shorter duration (for one particular, more complex query, the difference is from 30 seconds initially to about 7 after the "warm-up" period).
After monitoring my VM using both top and various network traffic monitoring tools, I've noticed that the bottleneck is hit because of hard page faults (which had been my hunch from the beginning).
Given that my data takes up < 2GB and that my machine has 3.5GB available, my collections should all fit in memory (even with their indexes). And they do end up being fetched, but only on demand, which can have a relatively negative impact on the user experience.
MongoDB uses memory-mapped files to operate on collections. Is there any way to force the operating system to prefetch the whole file into memory as soon as MongoDB starts up, instead of waiting for queries to trigger random page faults?
From the MongoDB docs:
The touch command loads data from the data storage layer into memory. touch can load the data (i.e. documents), the indexes, or both documents and indexes. Use this command to ensure that a collection and/or its indexes are in memory before another operation. By loading the collection or indexes into memory, mongod will ideally be able to perform subsequent operations more efficiently.
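At the OS level (outside of mongod), anything that reads the data files warms the page cache, so you can also prefetch with a small helper. A minimal C sketch, assuming Linux and the MMAPv1 on-disk layout; the path below is only an example, and the files should only ever be opened read-only:

/* Ask the kernel to prefetch a data file into the page cache.
 * This is an OS-level complement to MongoDB's touch command: mmap the
 * file and advise the kernel that the whole mapping will be needed soon.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Example path only; pass the real data file as an argument. */
    const char *path = (argc > 1) ? argv[1] : "/data/db/mydb.0";

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint the kernel to read the whole mapping ahead of time.
     * (readahead(2) on the fd is another option on Linux.) */
    int rc = posix_madvise(p, (size_t)st.st_size, POSIX_MADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_madvise failed: %d\n", rc);

    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}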
Given:
One can access an HSQLDB database concurrently using connections pooled with the help of the Apache Commons DBCP package.
I store files in a CACHED table in an embedded HSQLDB database.
It is known that files on a conventional hard drive (as opposed to a solid-state drive) should not be accessed from multiple threads, because we are likely to get a performance degradation rather than a boost. This is because of the time it takes to move the mechanical read head back and forth between the files with each thread context switch.
Question:
Does this rule hold to files managed in an HSQLDB database? The file sizes may range from several KB to several MB.
HSQLDB accesses two files for data storage during operations. One file for all CACHED table data, and another file for all the lobs. It manages access to these files internally.
With multiple threads, there is a possibility of reduced access speed in the following circumstances.
Simultaneous read and write access to large tables.
Simultaneous read and write access to lobs larger than 500KB.
I am using SQL Server 2008. How can I calculate the total memory occupied by the database, with its tables (>30) and the data in them?
I mean, if I have a DB (DB_name) with a few tables (tblabc, tbldef, ...) containing data, how do I calculate the total memory occupied by the database on the server?
Kindly help me.
Thanks
Ramm
See the sizes of the .mdf and log files.
EDIT: SQL Server stores its database in .mdf files (one or multiple). You need the log (.ldf) file too. See where your database is stored; these are the files you need.
Be aware that if you are using FILESTREAM, the actual files are not in the database (.mdf).
EDIT 2: From Books Online:
When you create a database, you must either specify an initial size for the data and log files or accept the default size. As data is added to the database, these files become full.
So, there is a file with some size even if you have no data..
By default, the data files grow as much as required until no disk space remains.
...
Alternatively, SQL Server lets you create data files that can grow automatically when they fill with data, but only to a predefined maximum size. This can prevent the disk drives from running out of disk space completely.
If data is added (and there is no more space in the file), the file grows; but when data is deleted, the file keeps its size - you need to shrink it...
I suppose that you are referring to disk space and not memory. That would be very hard to calculate correctly, since you would have to know exactly how SQL Server stores the data, indexes and so on. Fortunately, you do not have to calculate it; just fire up Microsoft SQL Server Management Studio and right-click on your database -> Reports -> Disk Usage.