I would like to understand this better, as it is new to me.
I am querying InfluxDB with Python to fetch data in 5-minute intervals. I used a simple for loop to get the data in small chunks and appended each chunk to an initially empty dataframe inside the loop. This worked smoothly and I can see my output. But when I try to perform mathematical operations on this large dataframe, I get the memory error below:
"Memory Error : Unable to allocate 6.95GiB for an array with shape (993407736) and datatype int64"
My system has 8.00 GB of RAM and a 64-bit OS on an x64-based processor.
Is my system unable to support this?
Is there an alternative way to append small dataframes to another dataframe without these memory issues? I am new to data work in Python and I need to handle a large amount of data, maybe a year's worth.
Even though your system has 8 GB of memory, part of it is used by the OS and the other applications running on your system, so 6.95 GiB cannot be allocated to this program alone. If you are building an ML model and need to run it on huge data, or you just need to process the data, consider one of the options below:
Use GPU machines offered by a cloud provider.
Process the data in small chunks (if it is not ML).
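As a rough sketch of the chunked approach: instead of appending every raw 5-minute chunk to one growing dataframe, each chunk can be reduced to the summary you actually need before it is kept. `fetch_chunk` below is a hypothetical stand-in for the InfluxDB query, and the column names and sampling rate are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fetch_chunk(start, periods=300):
    """Hypothetical stand-in for one 5-minute InfluxDB query (one row per second)."""
    index = start + pd.to_timedelta(np.arange(periods), unit="s")
    return pd.DataFrame({"value": np.arange(periods, dtype="int64")}, index=index)

def summarize(chunk):
    """Reduce one raw chunk to a single small summary row."""
    return {
        "start": chunk.index[0],
        "count": len(chunk),
        "total": int(chunk["value"].sum()),
    }

starts = pd.date_range("2021-01-01", periods=12, freq="5min")  # one hour of chunks
rows = []
for start in starts:
    chunk = fetch_chunk(start)      # only this chunk is held in memory
    rows.append(summarize(chunk))   # keep the summary, let the raw data go

# Build the combined dataframe once, from small rows, instead of growing it in the loop.
summary = pd.DataFrame(rows).set_index("start")
overall_mean = summary["total"].sum() / summary["count"].sum()
print(summary.head(3))
print("overall mean:", overall_mean)
```

If the raw rows really are needed, collecting the chunks in a list and calling `pd.concat` once at the end, with the smallest dtypes that fit the data, is usually cheaper than appending inside the loop.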
I am extracting data where I constantly need to manipulate part of the fetched data and then append it to another dataframe that holds the combined dataset. I save that dataframe each time using dataframe.to_excel. Since there is a lot of data, reading the previous file, appending to it and saving it again has become a time-consuming operation, despite ample CPU and RAM. I am using GCP, an N1 machine with 8 vCPUs and 30 GB of memory. Moreover, since I am running several instances of the same script for different projects at the same time, would using a GPU speed things up?
I have never done this myself, but I think it is possible with a pandas alternative.
I found this thread, in which users seem to offer some solutions to a similar question.
I have not tried this either, but I can offer a couple of suggestions:
Rather than to_excel, try to_csv; there may be small gains.
You can try this library: https://github.com/modin-project/modin. It seems to make reads and other operations faster, but I am not sure about write operations.
Or you could move the to_excel line into a separate function and run that operation in a new thread.
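The last suggestion could look roughly like this: the save is handed to a worker thread so the main loop can keep fetching and transforming data while the file is written. This sketch uses only the standard library; `save_rows` is a hypothetical stand-in for the `to_csv` or `to_excel` call, and the file name and the sample rows are made up. It also appends only the new rows rather than rereading and rewriting the whole file, which is an assumption about what your output format allows.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def save_rows(path, rows):
    """Hypothetical stand-in for the save call: append rows to a CSV file."""
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)

out_path = os.path.join(tempfile.mkdtemp(), "combined.csv")

# A single worker keeps the writes in order and avoids two threads
# touching the same file at once.
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = []
    for batch_id in range(3):
        rows = [(batch_id, i, i * i) for i in range(4)]  # the "manipulated" data
        pending.append(pool.submit(save_rows, out_path, rows))
        # The main thread is free to fetch the next batch here.
    written = sum(future.result() for future in pending)  # re-raises write errors

with open(out_path, newline="") as handle:
    saved = list(csv.reader(handle))
print(written, "rows written,", len(saved), "rows on disk")
```

A thread mainly helps while the write is waiting on disk. If most of the time goes into building the file contents in Python, the gain is likely to be small.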
I am currently working on a framework for the analysis application of a large-scale experiment. The experiment contains about 40 instruments, each generating about 1 GB/s of data with nanosecond timestamps. The data is intended to be analysed in time chunks.
For the implementation, I would like to know how big such a "chunk" (i.e. batch) can get before Flink or Spark stop processing the data. I think it goes without saying that I intend to collect the processed data again afterwards.
For live data analysis
In general, there is no hard limit on how much data you can process with these systems. It all depends on how many nodes you have and what kind of query you run.
Since it sounds like you mainly want to aggregate per instrument over a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then the question is how big your time chunks are and how complex the aggregations become. Assuming that your aggregation requires all data of a window to be present, the system needs to hold 1 GB for every second of the window. So if your window is one hour, the system needs to hold at least 3.6 TB of data.
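The arithmetic behind that figure, as a small sketch. It assumes decimal units (1 GB = 10^9 bytes) and reads the 3.6 TB as the amount for a single instrument; the total for all 40 instruments is my own extrapolation from the numbers in the question.

```python
GB = 10**9
TB = 10**12

def window_bytes(rate_bytes_per_s, window_s, instruments=1):
    """Bytes that must be held if a whole window has to be present at once."""
    return rate_bytes_per_s * window_s * instruments

per_instrument = window_bytes(1 * GB, 3600)        # one hour, one instrument
all_instruments = window_bytes(1 * GB, 3600, 40)   # one hour, all 40 instruments

print(per_instrument / TB)    # 3.6
print(all_instruments / TB)   # 144.0
```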
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you only need to calculate small values (like sums or averages), main memory shouldn't become an issue.
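To illustrate why such aggregates are cheap: a sum or an average only needs a few running numbers, no matter how many samples pass through. The following is plain Python rather than Flink or Spark code, and `batches` is a hypothetical stand-in for an instrument delivering its samples.

```python
class RunningStats:
    """Constant-memory count, sum and mean over a stream of samples."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, values):
        for value in values:
            self.count += 1
            self.total += value

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")

def batches(n_batches, batch_size):
    """Hypothetical stand-in for an instrument delivering samples in batches."""
    for b in range(n_batches):
        yield range(b * batch_size, (b + 1) * batch_size)

stats = RunningStats()
for batch in batches(n_batches=10, batch_size=1000):
    stats.update(batch)   # the batch can be discarded after this call

print(stats.count, stats.total, stats.mean)
```

The state kept between batches is two numbers, so the window length affects only the processing time, not the memory needed.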
For old data analysis
When analyzing old data, the system can do batch processing and has many more options for handling the volume, including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that, or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystem and at what suits you. Which languages do you want to use? It feels like Jupyter notebooks or Zeppelin would work best for your rather ad hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.
I have three GeoTIFFs, each roughly 500 MB in size, on AWS S3, which I am trying to process on an EMR cluster using Dask, but I get a MemoryError after processing the first TIFF.
After reading the GeoTIFF using xarray.open_rasterio(), I convert the grid values to boolean and then multiply the array by a floating-point value. This workflow has executed successfully on three GeoTIFFs of 50 MB each. Additionally, I have tried chunking when reading with xarray, but obtained the same results.
Is there a size limitation with Dask or another possible issue I could be running into?
Dask itself does not artificially impose any size limitations. It is just a normal Python process. I recommend thinking about normal Python or hardware issues. My first guess would be that you're using very small VMs, but that's just a guess. Good luck!
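One "normal Python" issue that may be worth checking is how much the array grows in memory. I don't know the dtype or compression of your files, so the sketch below assumes a small `uint8` raster purely for illustration. It shows that multiplying a boolean array by a Python float produces `float64`, which takes eight times as many bytes as the boolean mask.

```python
import numpy as np

# Assumed raster size and dtype, for illustration only (not taken from the question).
height, width = 2_000, 3_000
rng = np.random.default_rng(0)
grid = rng.integers(0, 255, size=(height, width), dtype=np.uint8)

mask = grid > 0                                        # boolean, 1 byte per pixel
scaled64 = mask * 0.5                                  # bool * Python float -> float64
scaled32 = mask.astype(np.float32) * np.float32(0.5)   # stays float32, 4 bytes per pixel

for name, arr in [("uint8", grid), ("bool", mask),
                  ("float64", scaled64), ("float32", scaled32)]:
    print(f"{name:8s} {arr.nbytes / 1e6:6.1f} MB")
```

If the same growth applies to your data, comparing the in-memory size of one decoded file against the memory of a single worker would tell you whether the VMs are simply too small.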
I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple threads.
[settings : python, Ubuntu 18.04]
I can't find an answer on which of the following is better in terms of data access and storage:
storing all the data in one huge HDF5 file (over 20 GB)
splitting it into many (over 16,000) small HDF5 files (approx. 1.4 MB each).
Is there any problem with multiple threads accessing one file? And in the other case, is there an impact from having that many files?
I would go for multiple files if I were you (but read till the end).
Intuitively, you could load at least some files into memory, speeding up the process a little (it is unlikely you would be able to do so with the full 20 GB; if you are, then you definitely should, as RAM access is much faster).
You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and afterwards retrieve the cached examples (say, from a list or, preferably, a more memory-efficient data structure with better cache locality) instead of reading from disk (a similar approach to TensorFlow's tf.data.Dataset object and its cache method).
On the other hand, this approach is more cumbersome and harder to implement correctly, though if you are only reading the file with multiple threads you should be fine and there shouldn't be any locks on this operation.
Remember to measure your approach with pytorch's profiler (torch.utils.bottleneck) to pinpoint exact problems and verify solutions.
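A minimal sketch of the caching idea described above, assuming one tile per file. `CachedTileDataset`, `fake_load` and the file names are hypothetical; in real code `load_fn` would open one small HDF5 file and return its tile. The import falls back to a plain class so the sketch also runs without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset
except ImportError:      # lets the sketch run without PyTorch installed
    Dataset = object

class CachedTileDataset(Dataset):
    """Map-style dataset that keeps each tile in memory after its first read."""

    def __init__(self, paths, load_fn):
        self.paths = list(paths)
        self.load_fn = load_fn   # e.g. a function that opens one small HDF5 file
        self.cache = {}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        if index not in self.cache:
            self.cache[index] = self.load_fn(self.paths[index])
        return self.cache[index]

reads = []

def fake_load(path):
    """Hypothetical loader; records each 'disk read' so the caching is visible."""
    reads.append(path)
    return [0.0] * 4   # placeholder for the tile data

dataset = CachedTileDataset([f"tile_{i}.h5" for i in range(3)], fake_load)
for _epoch in range(2):
    for i in range(len(dataset)):
        sample = dataset[i]

print(len(reads), "reads for", 2 * len(dataset), "accesses")
```

Note that a DataLoader with `num_workers > 0` runs its workers as separate processes, so each worker would fill its own copy of this cache. That is worth measuring before relying on it.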
I am using RandomForestClassifier in Python to predict whether each pixel in an input image is inside or outside a cell, as a pre-processing stage to improve the image. The problem is that the training set is 8.36 GB and the test data is 8.29 GB, so whenever I run my program I get an out-of-memory error. Would extending the memory not work? Is there a way to read the CSV files that contain the data in several steps and free the memory after each step?
Hopefully you are using pandas to process this CSV file, as it would be much harder in plain Python. As for your memory problem, here is a great article explaining how to process large CSV files by chunking the data in pandas.
http://pythondata.com/working-large-csv-files-python/
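A short sketch of that chunking pattern, using the `chunksize` parameter of `pandas.read_csv`. The in-memory CSV and its column names (`intensity`, `label`) are made up to keep the example self-contained; with a real file you would pass its path instead of the `StringIO` object.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the multi-gigabyte file (made-up columns).
csv_text = "intensity,label\n" + "\n".join(
    f"{i % 256},{i % 2}" for i in range(10_000)
)

rows_seen = 0
intensity_total = 0
positives = 0
reader = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=1_000,                                 # rows per step
    dtype={"intensity": "uint8", "label": "uint8"},  # small dtypes save memory
)
for chunk in reader:
    rows_seen += len(chunk)
    intensity_total += int(chunk["intensity"].sum())
    positives += int((chunk["label"] == 1).sum())
    # `chunk` is rebound on the next iteration, so the previous one can be freed.

print(rows_seen, intensity_total / rows_seen, positives)
```

Chunking helps for steps that can be done piece by piece, such as filtering, downcasting or computing statistics. Whether it is enough for training depends on whether the model can be fitted on partial data.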