Read .PART files - python-3.x

I am building a model that uses large datasets stored in .csv files (~50 GB). My machine runs Windows 10 with 16 GB of RAM.
Since I don't have enough RAM to load the whole dataset, I used Dask to read the file and split it into smaller partitions. That worked just fine and I was able to save the output as a directory of .part files. However, when I read the files back, every cell only showed ... (as in the screenshot).
I have tried
!pip install dask
import dask.dataframe as dd
cat = dd.read_csv(paths.data + "cat.csv/*")
cat.head(5)
but it simply kept loading, even though I only requested a handful of rows.
Can anyone please help me? Thank you.

The ... symbol is expected: a Dask dataframe is lazy, so nothing is loaded into memory until you explicitly compute something, and the repr only shows the column names and dtypes. There is a detailed tutorial on Dask dataframes here: https://tutorial.dask.org/04_dataframe.html
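As a rough illustration (assuming the same cat.csv/* part files as above), the placeholders disappear once a concrete computation is requested:

import dask.dataframe as dd

# Reading only builds a lazy task graph; no data is pulled into RAM yet.
cat = dd.read_csv("cat.csv/*")

print(cat.npartitions)   # how many partitions Dask will process
print(cat.dtypes)        # metadata is known without loading the data

# head() triggers a real computation: it reads just enough of the first
# partition, which is why it can take a while on a large, unindexed csv.
print(cat.head(5))

# Aggregates need an explicit .compute(); note len() scans every
# partition, so expect it to take a while on ~50 GB of csv.
print(len(cat))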

Related

Confusion about the data location when applying Scikit-learn on cluster (Dask)

I'm currently working on moving a machine learning workflow (scikit-learn) from a single machine to a Slurm cluster via Dask. According to some tutorials (e.g. https://examples.dask.org/machine-learning/scale-scikit-learn.html), it's quite simple to do with joblib.parallel_backend('dask'). However, the location of the data that is read in confuses me, and none of the tutorials mention it. Should I use dask.dataframe to read in the data to make sure it is passed to the cluster, or does it not matter if I just read it in with pandas (in which case the data sits in the RAM of whichever machine runs the Jupyter notebook)?
Thank you very much.
If your data is small enough (which it is in the tutorial) and the preprocessing steps are rather trivial, then it is okay to read it in with pandas. This reads the data into your local session, not yet onto any of the Dask workers. Once you run the fit inside joblib.parallel_backend('dask'), the data is copied to each worker process and the scikit-learn work is done there.
If your data is large or you have intensive preprocessing steps, it's best to "load" the data with Dask, and then use Dask's built-in preprocessing and grid search where possible. In that case the data is actually loaded directly on the workers, because of Dask's lazy execution paradigm. Dask's grid search will also cache repeated steps of the cross-validation and can speed up computation immensely. More can be found here: https://ml.dask.org/hyper-parameter-search.html
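A minimal sketch of the small-data path described above, assuming a Dask scheduler is already reachable (the address, file name, and column names are placeholders):

import joblib
import pandas as pd
from dask.distributed import Client
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Connect to the running Dask cluster (address is a placeholder).
client = Client("tcp://scheduler-address:8786")

# The data is read into the local session with pandas; it is shipped to
# the workers only when the fit runs inside the dask backend.
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["label"]), df["label"]

search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=3,
)

# The cross-validation fits are dispatched to the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)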

Out of memory error because of giant input data

I am using RandomForestClassifier in Python to predict whether each pixel of an input image is inside a cell or outside it, as a pre-processing stage to improve the image. The problem is that the training set is 8.36 GB and the test set is 8.29 GB, so whenever I run my program I get an out-of-memory error. Would extending the memory not work? Is there any way to read the csv files that contain the data in more than one step and free the memory after each step?
Hopefully you are using pandas to process this csv file, as it would be nearly impossible in native Python. As for your memory problem, here is a great article explaining how to process large csv files by chunking the data in pandas:
http://pythondata.com/working-large-csv-files-python/
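A minimal sketch of the chunking idea, with hypothetical file and column names:

import pandas as pd

# Read the 8 GB file in pieces instead of all at once.
pieces = []
for chunk in pd.read_csv("training_pixels.csv", chunksize=500_000):
    # Shrink each piece before keeping it: drop unused columns and
    # downcast to float32 (column names are placeholders).
    chunk = chunk[["r", "g", "b", "inside_cell"]].astype("float32")
    pieces.append(chunk)
    # Each raw chunk is freed as soon as the loop moves on, so peak
    # memory stays near one chunk plus whatever you keep.

train = pd.concat(pieces, ignore_index=True)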

Create recordio file for linear regression

I'm using AWS Sagemaker to run linear regression on a CSV dataset. I have made some tests, and with my sample dataset that is 10% of the full dataset, the csv file ends up at 1.5 GB in size.
Now I want to run the full dataset, but I'm facing issues with the 15 GB file. When I compress the file with Gzip, it ends up at only 20 MB. However, Sagemaker only supports Gzip on "Protobuf-Recordio" files. I know I can make Recordio files with im2rec, but that seems to be intended for image classification. I'm also not sure how to generate the protobuf file.
To make things even worse(?) :) I'm generating the dataset in Node.
I would be very grateful for some pointers in the right direction on how to do this.
This link https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html has useful information if you are willing to use a Python script to transform your data.
The actual conversion code in the SDK is at https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/common.py
Basically, you could load your CSV data into a NumPy array (in batches, so that you can write multiple files), and then use the helpers in that module to convert it to Recordio-protobuf. You should be able to write the resulting buffer straight to a file.
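A hedged sketch of that batch conversion, assuming the write_numpy_to_dense_tensor helper from sagemaker.amazon.common, a hypothetical input file, and the last CSV column being the label:

import numpy as np
import pandas as pd
from sagemaker.amazon.common import write_numpy_to_dense_tensor

batch_size = 1_000_000   # rows per output file, tune to taste
reader = pd.read_csv("full_dataset.csv", chunksize=batch_size)

for i, chunk in enumerate(reader):
    features = chunk.iloc[:, :-1].to_numpy(dtype=np.float32)
    labels = chunk.iloc[:, -1].to_numpy(dtype=np.float32)

    # Each batch becomes its own recordio-protobuf file, ready to upload to S3.
    with open(f"part-{i:05d}.recordio", "wb") as f:
        write_numpy_to_dense_tensor(f, features, labels)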
Thanks

Best way to copy 20Gb csv file to cassandra

I have a huge 20 GB csv file to copy into Cassandra, and of course I need to handle errors (if the server or the transfer/load application crashes).
I need to be able to restart the processing (on the same node or another one) and continue the transfer without starting the csv file from the beginning.
What is the best and easiest way to do that?
Using the cqlsh COPY command? Using Flume or Sqoop? Or a native Java application, or Spark...?
Thanks a lot.
If it was me, I would split the file.
I would pick a preferred way to load any csv data in, ignoring the issues of huge file size and error handling. For example, I would use a Python script and the native driver and test it with a few lines of csv to see that it can insert from a tiny csv file with real data.
Then I would write a script to split the file into manageable sized chunks, however you define that. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size and loop over the chunks, logging how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully, as recorded in the log file.
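A rough sketch of that chunk-and-resume loop, assuming the DataStax Python driver, pre-split chunk files, and a hypothetical keyspace/table schema:

import csv
import glob
import os
from cassandra.cluster import Cluster

DONE_LOG = "loaded_chunks.log"            # records chunks that finished

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO my_table (id, col1, col2) VALUES (?, ?, ?)"  # hypothetical schema
)

done = set()
if os.path.exists(DONE_LOG):
    with open(DONE_LOG) as f:
        done = set(line.strip() for line in f)

for chunk_path in sorted(glob.glob("chunks/part-*.csv")):
    if chunk_path in done:
        continue                          # already loaded before the crash
    with open(chunk_path, newline="") as f:
        for row in csv.reader(f):
            # Cast row values to match the column types as needed.
            session.execute(insert, (row[0], row[1], row[2]))
    # Log the chunk only after every row in it has been inserted,
    # so a restart re-does at most one chunk.
    with open(DONE_LOG, "a") as log:
        log.write(chunk_path + "\n")

Since Cassandra inserts are upserts, reloading a partially-loaded chunk after a crash is harmless.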
Here are two considerations that I would try first, since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure to be at one of those versions or newer.
Another option is to use Brian Hess' cassandra-loader, which is an effective way of bulk loading to and from csv files.
I think cqlsh COPY doesn't handle the case of an application crash, so why not combine both of the solutions above: split the file into several manageable chunks and use the cqlsh COPY command to import each one?

copy command row size limit in cassandra

Could anyone tell me the maximum size (number of rows or file size) of a csv file we can load efficiently into Cassandra using the COPY command? Is there a limit? If so, is it a good idea to break the data down into multiple smaller files and load those, or is there a better option? Many thanks.
I've run into this issue before... At least for me there was no clear statement in any DataStax or Apache documentation of the max size. Basically, it may just be limited by your pc/server/cluster resources (e.g. CPU and memory).
However, in an article by jgong found here it is stated that you can import up to 10 MB. For me it was something around 8.5 MB. In the docs for Cassandra 1.2 here it's stated that you can import a few million rows and that you should use the bulk loader for anything heavier.
All in all, I do suggest importing via multiple csv files (just don't make them so small that you're constantly opening and closing files), so that you can keep a handle on what has been imported and find errors more easily. Otherwise you can wait an hour for a file to load, have it fail, and have to start over, whereas with multiple files you don't need to redo the ones that were already imported successfully. Not to mention duplicate-key errors.
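A small sketch of that splitting step (hypothetical file names and chunk size), after which each part can be fed to cqlsh COPY separately:

import csv

ROWS_PER_FILE = 1_000_000                 # tune so each part loads comfortably

with open("big_table.csv", newline="") as src:   # hypothetical input file
    reader = csv.reader(src)
    header = next(reader)
    part, out, writer = 0, None, None
    for i, row in enumerate(reader):
        if i % ROWS_PER_FILE == 0:
            if out:
                out.close()
            out = open(f"part-{part:04d}.csv", "w", newline="")
            writer = csv.writer(out)
            writer.writerow(header)       # every part keeps the header
            part += 1
        writer.writerow(row)
    if out:
        out.close()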
Check out CASSANDRA-9303 and CASSANDRA-9302,
and check out Brian Hess's cassandra-loader:
https://github.com/brianmhess/cassandra-loader
