I'm using AWS SageMaker to run linear regression on a CSV dataset. In my tests, a sample containing 10% of the full dataset produces a CSV file of about 1.5 GB.
Now I want to run the full dataset, but I'm running into issues with the resulting 15 GB file. When I compress the file with gzip, it ends up at only 20 MB. However, SageMaker only supports gzip on RecordIO-protobuf files. I know I can make RecordIO files with im2rec, but that tool seems to be intended for image files for image classification, and I'm also not sure how to generate the protobuf part.
To make things even worse(?) :) I'm generating the dataset in Node.
I would be very grateful for some pointers in the right direction on how to do this.
This link https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html has useful information if you are willing to use a Python script to transform your data.
The actual code from the SDK is https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/common.py
Basically, you could load your CSV data into a NumPy ndarray (in batches, so that you can write to multiple files), and then use the helpers in https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/common.py to convert it to RecordIO-protobuf. You should then be able to write the resulting buffer into a file.
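For illustration, here is a minimal sketch of that approach in Python, assuming the label is in the first CSV column and reading the file in chunks with pandas (the file names and chunk size are placeholders):

import pandas as pd
from sagemaker.amazon.common import write_numpy_to_dense_tensor

chunk_size = 1_000_000  # rows per output file; tune to your memory budget
for i, chunk in enumerate(pd.read_csv("full_dataset.csv", header=None, chunksize=chunk_size)):
    data = chunk.to_numpy(dtype="float32")
    labels = data[:, 0]        # assumption: label is the first column
    features = data[:, 1:]
    with open("train_part_%04d.recordio" % i, "wb") as f:
        write_numpy_to_dense_tensor(f, features, labels)

Each output file can then be uploaded to S3 and referenced by the training job's input channel.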
Thanks
I am building a model which uses large datasets in .csv files (~50 GB). My machine is a Windows 10 machine with 16 GB of RAM.
Since I don't have enough RAM to load the whole dataset, I used Dask to read the file and split it into smaller data sets. It worked just fine and I was able to save the result into files like these. However, when I read the files back, it only showed ... in every box, like in this image.
I have tried
!pip install dask
import dask.dataframe as dd
cat = dd.read_csv(paths.data + "cat.csv/*")
cat.head(5)
but it simply kept loading, even though the amount of data was kept to a minimum.
Can anyone please help me? Thank you.
The ... symbol is expected, since the data is not loaded into memory: Dask builds the dataframe lazily and only reads the CSV when you actually ask for a result. There is a detailed tutorial on Dask dataframes here: https://tutorial.dask.org/04_dataframe.html
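As a small illustration of that laziness (reusing the cat dataframe from the question; the column name in the groupby is just a hypothetical example):

import dask.dataframe as dd

cat = dd.read_csv("cat.csv/*")    # lazy: only reads metadata and builds a task graph
print(cat)                        # shows the schema with ... placeholders, no data yet
first_rows = cat.head(5)          # head() triggers a real computation on the first partition
counts = cat.groupby("some_column").size().compute()  # .compute() materializes a full result

Only the calls that trigger computation (head, compute, persist, and similar) will actually read data from disk.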
I have a script written in Python and it's taking a fairly long time to process (around 30 minutes) on my laptop. I was thinking I could create an EC2 instance in AWS and see if it's possible to speed up the process. I have an AWS account, so my question is:
Which EC2 instance type should I create in order to run the process faster? The process reads a CSV file, does some calculations, and then writes a CSV with the results. The script's bottleneck is in the mathematical calculations, as the CSV files are fairly small.
I can go with either a free-tier or a paid-tier instance.
I would say go with a p2.xlarge if you have to use EC2.
Try to understand what's causing the delay first. Which library are you using to read the CSV? There are various ways in Python to read and manipulate CSV files, and they differ a lot in speed.
NumPy, SciPy, joblib and HDF5 are the commonly recommended options for quickly saving and loading this kind of numerical data.
Also try changing your algorithm. In my experience pandas is not fast when it comes to CSV operations. Tweak your code first; if that doesn't work, switch to a p2.xlarge.
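To find out where the time actually goes before paying for an instance, a quick profile usually settles whether the bottleneck is I/O or the math. A minimal sketch with the standard-library cProfile, where process_file is a placeholder for your own read-calculate-write pipeline:

import cProfile
import pstats

def process_file(path):
    # placeholder for your actual read -> calculate -> write logic
    ...

cProfile.run('process_file("input.csv")', "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)  # top 20 functions by cumulative time

If most of the time shows up in your numerical functions rather than in CSV reading, a faster instance is the right lever.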
I am using RandomForestClassifier in Python to predict whether each pixel in the input image is inside a cell or outside it, as a pre-processing stage to improve the image. The problem is that the training set is 8.36 GB and the test data is another 8.29 GB, so whenever I run my program I get an out-of-memory error. Will extending the memory not work? Is there any way to read the CSV files containing the data in more than one step and then free the memory after each step?
Hopefully you are using pandas to process this CSV file, as it would be nearly impossible in native Python. As for your memory problem, here is a great article explaining how to process large CSV files by chunking the data in pandas:
http://pythondata.com/working-large-csv-files-python/
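The core idea from that article is the chunksize parameter of pandas' read_csv, which returns an iterator so that only one chunk is in memory at a time. A rough sketch (the file name, chunk size, and per-chunk processing are placeholders):

import pandas as pd

for chunk in pd.read_csv("train.csv", chunksize=100_000):  # DataFrames of 100k rows each
    # process or aggregate each chunk here, then let it be garbage collected;
    # handle_chunk is a hypothetical function standing in for your own logic
    handle_chunk(chunk)

Note that scikit-learn's RandomForestClassifier still needs the whole training matrix in memory for a single fit call, so chunking mainly helps for pre-processing, downsampling, or building features before training.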
I have a huge 20 GB CSV file to copy into Cassandra, and of course I need to handle the error cases (if the server or the transfer/load application crashes).
I need to be able to restart the processing (on another node or not) and continue the transfer without starting the CSV file from the beginning.
What is the best and easiest way to do that?
Using the cqlsh COPY command? Using Flume or Sqoop? Or a native Java application, or Spark...?
Thanks a lot.
If it were me, I would split the file.
I would pick a preferred way to load any CSV data, ignoring the issues of huge file size and error handling. For example, I would use a Python script with the native driver and test it on a few lines of CSV to see that it can insert from a tiny CSV file with real data.
Then I would write a script to split the file into manageably sized chunks, however you define them. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size, loop over the chunks, and log how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully, as recorded in the log file.
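A rough sketch of that loop in Python, assuming the DataStax driver (cassandra-driver), a hypothetical keyspace and table (ks.events with columns id and value), and chunk files already produced by a separate splitting step:

import csv
import glob
import os
from cassandra.cluster import Cluster

LOG_FILE = "loaded_chunks.log"              # records chunks that finished successfully

def already_loaded():
    if not os.path.exists(LOG_FILE):
        return set()
    with open(LOG_FILE) as f:
        return {line.strip() for line in f}

cluster = Cluster(["127.0.0.1"])            # assumption: a local Cassandra node
session = cluster.connect("ks")             # hypothetical keyspace
insert = session.prepare("INSERT INTO events (id, value) VALUES (?, ?)")  # hypothetical table

done = already_loaded()
for chunk_path in sorted(glob.glob("chunks/part_*.csv")):
    if chunk_path in done:
        continue                            # resume: skip chunks that already loaded
    with open(chunk_path, newline="") as f:
        for row in csv.reader(f):
            session.execute(insert, (int(row[0]), row[1]))  # cast values to match your schema
    with open(LOG_FILE, "a") as log:        # log only after the whole chunk succeeded
        log.write(chunk_path + "\n")

Because a chunk is only logged after it loads completely, a crash at worst replays one chunk, and since Cassandra inserts are upserts, replaying a chunk is harmless as long as the primary key comes from the data itself.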
Here are two considerations that I would try first, since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure you are on one of those versions or newer.
Another option is Brian Hess' cassandra-loader, which is an efficient tool for bulk loading to and from CSV files.
I think cqlsh COPY doesn't handle the case of an application crash, so why not combine both of the solutions described above: split the file into several manageable chunks and use the cqlsh COPY command to import the data?
I have a Stack Overflow data dump file in .xml format, nearly 27 GB, and I want to convert it to a .csv file. Can somebody please point me to tools to convert XML to CSV, or a Python program that does it?
Use one of the Python XML modules to parse the .xml file. Unless you have much more than 27 GB of RAM, you will need to do this incrementally, so limit your choices accordingly. Use the csv module to write the .csv file.
Your real problem is this: CSV files are lines of fields and represent a rectangular table, while XML files, in general, can represent more complex structures: hierarchical databases and/or multiple tables. So your real task is to understand the data dump format well enough to extract records to write to the .csv file.
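For the incremental part, xml.etree.ElementTree's iterparse is a reasonable fit, because you can clear processed elements and keep memory roughly constant. A minimal sketch, assuming the Stack Overflow dump layout where each record is a <row> element whose fields are attributes (the attribute list below is an example subset, not the full schema):

import csv
import xml.etree.ElementTree as ET

FIELDS = ["Id", "PostTypeId", "CreationDate", "Score", "Title"]  # example subset of attributes

with open("Posts.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    context = ET.iterparse("Posts.xml", events=("start", "end"))
    _, root = next(context)              # grab the document root from the first start event
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            writer.writerow([elem.get(f, "") for f in FIELDS])
            root.clear()                 # drop processed elements so memory stays bounded

This streams the 27 GB file row by row instead of building the whole tree in memory.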
I have written a PySpark function to parse the .xml into .csv. XmltoCsv_StackExchange is the GitHub repo. I used it to convert 1 GB of XML within 2-3 minutes on a minimal 2-core, 2 GB RAM Spark setup. It can convert a 27 GB file too; just increase minPartitions from 4 to around 128 in this line:
raw = (sc.textFile(fileName, 4))  # the second argument is minPartitions; raise it to ~128 for a 27 GB file