How much RAM do different Python objects use? - python-3.x

I want to optimize my program because it consumes a lot of RAM and stores a lot of data. To decide which datatype to use, I need to know how much RAM a single Python object uses. How can I determine those sizes?

Just use sys.getsizeof().
If you want the size of a container plus all of its contents, the documentation even links to a recipe for doing that.
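For example, here is a minimal sketch: sys.getsizeof() reports the shallow size of a single object, and a simplified recursive helper (written in the spirit of the recipe the docs link to, not the exact one) adds up the contents of containers.

    import sys

    # Shallow sizes of individual objects (container contents are not included).
    print(sys.getsizeof(42))          # size of an int
    print(sys.getsizeof("hello"))     # size of a short str
    print(sys.getsizeof([1, 2, 3]))   # size of the list object only, not its items

    # Simplified recursive "deep" size, in the spirit of the recipe the docs link to.
    def total_size(obj, seen=None):
        """Approximate memory footprint of obj plus the objects it contains."""
        if seen is None:
            seen = set()
        if id(obj) in seen:           # avoid double-counting shared objects
            return 0
        seen.add(id(obj))
        size = sys.getsizeof(obj)
        if isinstance(obj, dict):
            size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
        elif isinstance(obj, (list, tuple, set, frozenset)):
            size += sum(total_size(item, seen) for item in obj)
        return size

    print(total_size({"a": [1, 2, 3], "b": "some text"}))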

Related

How do I read a large .conll file with Python?

My attempt to read a very large file with pyconll and conllu keeps running into memory errors. The file is 27 GB in size, and even using iterators to read it does not help. I'm using Python 3.7.
Both pyconll and conllu have iterative versions that keep less in memory at any given moment. If you call pyconll.load_from_file, it is going to try to read and parse the entire file into memory, and your machine likely has much less than 27 GB of RAM. Instead, use pyconll.iter_from_file. This will read the sentences one by one and use minimal memory, and you can extract what's needed from those sentences piecemeal.
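A rough sketch of what that could look like (the file name and the lemma counting are placeholders; the token attributes follow the usual CoNLL-U fields):

    import pyconll

    # Stream sentences one at a time instead of loading the whole 27 GB file.
    lemma_counts = {}
    for sentence in pyconll.iter_from_file('corpus.conllu'):   # placeholder path
        for token in sentence:
            lemma_counts[token.lemma] = lemma_counts.get(token.lemma, 0) + 1
        # the sentence goes out of scope here, so memory use stays roughly constant

    print(len(lemma_counts), 'distinct lemmas')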
If you need to do some larger processing that requires having all information at once, it's a bit outside the scope of either of these libraries to support that type of scenario.

Is there a way to dynamically determine the vhdSize flag?

I am using the MSIX manager tool to convert a *.msix (an application installer) to a *.vhdx so that it can be mounted in an Azure virtual machine. One of the flags the tool requires is -vhdSize, which is specified in megabytes. This has proven problematic because I have to guess what the size should be based on the MSIX. I have run into numerous creation errors due to too small a vhdSize.
I could set it to an arbitrarily high value in order to get around these failures, but that is not ideal. Alternatively, guessing the correct size is an imprecise science and a chore to do repeatedly.
Is there a way to have the tool set the vhdSize dynamically, or am I stuck guessing a value that is large enough to accommodate the files but not so large that it wastes disk space? Or is there a better way to create a *.vhdx file?
https://techcommunity.microsoft.com/t5/windows-virtual-desktop/simplify-msix-image-creation-with-the-msixmgr-tool/m-p/2118585
There is an MSIX Hero app that can select a size for you: it automatically checks how big the uncompressed files are, adds an extra buffer for safety (currently double the original size), and rounds up to the next 10 MB. Reference: https://msixhero.net/documentation/creating-vhd-for-msix-app-attach/
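If you have the unpacked package contents on disk, the same heuristic can be approximated with a small script. The directory path below is hypothetical and the buffer/rounding rules are simply the ones described above:

    import os

    def estimate_vhd_size_mb(unpacked_dir, safety_factor=2.0, round_to_mb=10):
        """Estimate a vhdSize (in MB) from the uncompressed package contents:
        total size, times a safety buffer, rounded up to the next 10 MB."""
        total_bytes = 0
        for root, _dirs, files in os.walk(unpacked_dir):
            for name in files:
                total_bytes += os.path.getsize(os.path.join(root, name))
        mb = (total_bytes * safety_factor) / (1024 * 1024)
        # round up to the next multiple of round_to_mb
        return int(-(-mb // round_to_mb) * round_to_mb)

    # Hypothetical usage: point it at the directory holding the unpacked MSIX contents.
    print(estimate_vhd_size_mb(r"C:\temp\unpacked_msix"))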

multithreading and reading from one file (perl)

Hi, sharp minds!
I need your expert guidance in making some choices.
Situation is like this:
1. I have approx. 500 flat files containing from 100 to 50000 records that have to be processed.
2. Each record in the files mentioned above has to be replaced using a value from a separate huge file (2-15 GB) containing 100-200 million entries.
So I thought I would do the processing using multiple cores - one file per thread/fork.
Is that a good idea, given that each thread needs to read from the same huge file? Loading it into memory is a problem due to its size. Using file::tie is an option, but does that work with threads/forks?
I need your advice on how to proceed.
Thanks
Yes, of course, using multiple cores for a multi-threaded application is a good idea, because that's what those cores are for. That said, it sounds like your problem involves heavy I/O, so it may be that you will not use that much CPU anyway.
Also, since you are only going to read that big file, tie should work perfectly; I haven't heard of problems with that. But if you are going to search that big file for each record in your smaller files, it will take a long time regardless of the number of threads you use. If the data in the big file can be indexed by some key, then I would advise putting it in some NoSQL database and accessing it from your program. That would probably speed up your task even more than using multiple threads/cores.
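The indexing idea is language-agnostic; here is a rough sketch in Python (not Perl) using a simple disk-backed key-value store. The file names and the key<TAB>value layout are assumptions for illustration only:

    import dbm

    # 1) One-off: build a disk-backed key -> value index from the huge file,
    #    so lookups no longer require scanning or loading the whole file.
    with dbm.open('big_file.index', 'c') as index, open('big_file.txt') as big:
        for line in big:
            key, value = line.rstrip('\n').split('\t', 1)   # assumed layout
            index[key] = value

    # 2) Per worker: replace records in a small file via index lookups.
    with dbm.open('big_file.index', 'r') as index, open('small_file.txt') as small:
        for record in small:
            key = record.strip()
            replacement = index.get(key.encode(), b'<missing>').decode()
            print(replacement)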

R stats - memory issues when allocating a big matrix / Linux

I have read several threads about memory issues in R and I can't seem to find a solution to my problem.
I am running a sort of LASSO regression on several subsets of a big dataset. For some subsets it works well, and for some bigger subsets it does not work, with errors of type "cannot allocate vector of size 1.6Gb". The error occurs at this line of the code:
example <- cv.glmnet(x=bigmatrix, y=price, nfolds=3)
It also depends on the number of variables that were included in "bigmatrix".
I tried R and R64 on Mac as well as R on a PC, but recently moved to a faster virtual machine on Linux, thinking I would avoid any memory issues. It was better, but there were still limits, even though memory.limit indicates "Inf".
Is there any way to make this work, or do I have to cut a few variables from the matrix or take a smaller subset of the data?
I have read that R looks for contiguous chunks of memory and that maybe I should pre-allocate the matrix? Any ideas?
Let me build slightly on what #richardh said. All of the data you load with R chews up RAM. So you load your main data and it uses some hunk of RAM. Then you subset the data so the subset is using a smaller hunk. Then the regression algorithm needs a hunk that is greater than your subset because it does some manipulations and gyrations. Sometimes I am able to make better use of RAM by doing the following:
1. save the initial dataset to disk using save()
2. take a subset of the data
3. rm() the initial dataset so it is no longer in memory
4. do the analysis on the subset
5. save the results from the analysis
6. totally dump all items in memory: rm(list=ls())
7. load the initial dataset from step 1 back into RAM using load()
8. loop steps 2-7 as needed
Be careful with step 6 and try not to shoot your eye out. That dumps EVERYTHING in R memory. If it's not been saved, it'll be gone. A more subtle approach would be to delete the big objects that you are sure you don't need and not do the rm(list=ls()).
If you still need more RAM, you might want to run your analysis in Amazon's cloud. Their High-Memory Quadruple Extra Large Instance has over 68GB of RAM. Sometimes when I run into memory constraints I find the easiest thing to do is just go to the cloud where I can be as sloppy with RAM as I want to be.
Jeremy Anglim has a good blog post that includes a few tips on memory management in R. In that blog post Jeremy links to this previous StackOverflow question which I found helpful.
I don't think this has to do with contiguous memory, but rather that R by default works only in RAM (i.e., it can't use the disk as a cache). Farnsworth's guide to econometrics in R mentions the package filehash to enable writing to disk, but I don't have any experience with it.
Your best bet may be to work with smaller subsets, manage memory manually by removing variables you don't need with rm (i.e., run the regression, store the results, remove the old matrix, load the new matrix, repeat), and/or get more RAM. HTH.
Try the bigmemory package. It is very easy to use. The idea is that the data are stored in a file on the HDD and you create an object in R as a reference to this file. I have tested it and it works very well.
There are some alternatives as well, like "ff". See the CRAN Task View: High-Performance and Parallel Computing with R for more information.

How to parallelize file reading and writing

I have a program that reads data from 2 text files and then saves the result to another file. Since there is a lot of data to be read and written, which causes a performance hit, I want to parallelize the reading and writing operations.
My initial thought is, using 2 threads as an example, to have one thread read/write from the beginning and another thread read/write from the middle of the file. Since my files are formatted as lines, not fixed-size records (each line may contain a different number of bytes), seeking by byte does not work for me. The only solution I could think of is to use getline() to skip over the previous lines first, which is probably not efficient.
Is there any good way to seek to a specified line in a file? Or do you have any other ideas for parallelizing file reading and writing?
Environment: Win32, C++, NTFS, Single Hard Disk
Thanks.
-Dbger
Generally speaking, you do NOT want to parallelize disk I/O. Hard disks do not like random I/O because they have to continuously seek around to get to the data. Assuming you're not using RAID and you're using hard drives as opposed to solid-state storage, you will see severe performance degradation if you parallelize I/O (and even with those technologies, you can still see some performance degradation when doing lots of random I/O).
To answer your second question, there really isn't a good way to seek to a certain line in a file; you can only explicitly seek to a byte offset using the read function (see this page for more details on how to use it).
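One common workaround, sketched here in Python rather than C++ and with placeholder file names, is to pre-scan the file once and record the byte offset at which each line starts; any worker can then jump to line N with a single seek instead of calling getline() repeatedly:

    # Build an index of byte offsets, one per line, by scanning the file once.
    def build_line_index(path):
        offsets = [0]
        with open(path, 'rb') as f:
            for line in f:
                offsets.append(offsets[-1] + len(line))
        return offsets[:-1]          # offsets[i] is where line i starts

    def read_line(path, offsets, n):
        with open(path, 'rb') as f:
            f.seek(offsets[n])       # seek by byte offset, the only seek the OS provides
            return f.readline()

    offsets = build_line_index('input.txt')
    # assumes the file has at least 5001 lines
    print(read_line('input.txt', offsets, 5000))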
Queuing multiple reads and writes won't help when you're running against one disk. If your app also performs a lot of CPU work, then you could do your reads and writes asynchronously and let the CPU work while the disk I/O occurs in the background. Alternatively, get a second physical hard drive: read from one, write to the other. For modestly sized data sets that's often effective and quite a bit cheaper than writing code.
This isn't really an answer to your question but rather a re-design (which we all hate but can't help doing). As already mentioned, trying to speed up I/O on a hard disk with multiple threads probably won't help.
However, it might be possible to use another approach depending on data sensitivity, throughput needs, data size, etc. It would not be difficult to create a structure in memory that maintains a picture of the data and allows easy/fast updates of the lines of text anywhere in the data. You could then use a dedicated thread that simply monitors that structure and whose job it is to write the data to disk. Writing data sequentially to disk can be extremely fast; it can be much faster than seeking randomly to different sections and writing it in pieces.
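A rough sketch of that dedicated-writer idea (shown in Python rather than C++, with placeholder names): worker threads push finished lines onto a queue, and a single writer thread appends them to disk sequentially.

    import queue
    import threading

    line_queue = queue.Queue()
    SENTINEL = None

    def writer(path):
        # Single thread owns the output file and only ever writes sequentially.
        with open(path, 'w') as out:
            while True:
                line = line_queue.get()
                if line is SENTINEL:          # shutdown signal
                    break
                out.write(line + '\n')

    writer_thread = threading.Thread(target=writer, args=('output.txt',))
    writer_thread.start()

    # Workers (here just the main thread) produce results in any order they like.
    for i in range(1000):
        line_queue.put(f'processed line {i}')

    line_queue.put(SENTINEL)                  # tell the writer we're done
    writer_thread.join()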
