Best way to copy 20Gb csv file to cassandra - apache-spark

I have a huge 20Gb csv file to copy into cassandra, of course i need to manage the case of errors ( if the the server or the Transfer/Load application crashes ).
I need to re-start the processing(or an other node or not) and continue the transfer without starting the csv file from it begning.
what is the best and easiest way to do that ?
using the Copy CQLSH Command ? using flume or sqoop ? or using native java application, using spark... ?
thanks a lot

If it was me, I would split the file.
I would pick a preferred way to load any csv data in, ignoring the issues of huge file size and error handling. For example, I would use a python script and the native driver and test it with a few lines of csv to see that it can insert from a tiny csv file with real data.
Then I would write a script to split the file into manageable sized chunks, however you define it. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size and loop over the chunks, logging how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully as found in the log file.

Here are a two considerations that I would try first since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure to be at one of those versions or newer.
Another option is to use Brian Hess' cassandra-loader which is an effective way of bulk loading to and from csv files in an efficient manner.

I think CQLSH doesn't handle the case of application crash, so why not using both of the solution exposed above, split the file into several manageable chunks and uses the copy cqlsh command to import the data ?

Related

Unable to open a large .csv file

A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it; but excel can't seem to manage this.
Please how do I open this file?
Clearly I am missing a trick!
Please help!
There are a number of other Stackoverflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is Bigquery from Google's cloud platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is you'll have to learn SQL to do even the simplest things with the data.
Can you not tell excel to only "open" the file with the first 10 lines ...
This would allow you to inspect the format and then use some database functions on the contents.
Another thing that can impact whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file and you have to have a lot of on-disk swap space (page file in windows terms) + memory to open a file of that size. So, one thing you can do is find another computer that has more memory and resources or increase your swap space on your computer. If you have windows just google how to increase your page file.
This is a common problem. The typical solutions are
Insert your .CSV file into a SQL database such as MySQL, PostgreSQL etc.
Processing you data using Python, or R.
Find a data hub for your data. For example, Acho Studio.
The problem with solution one is that you'll have to design a table schema and find a server to host the database. Also you need to write server side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a of stress to your local computer. A data hub is much easier but its costs may vary.

How to properly load millions of files into an RDD

I have a very large set of json files (>1 million files) that I would like to work on with Spark.
But, I've never tried loading this much data into an RDD before, so I actually don't know if it can be done, or rather if it even should be done.
What is the correct pattern for dealing with this amount of data within RDD(s) in Spark?
Easiest way would be to create directory, copy all the files to the directory and pass directory as path while reading the data.
If you try to use patterns in the directory path, Spark might run into out of memory issues.

copy command row size limit in cassandra

Could anyone tell the maximum size(no. of rows or file size) of a csv file we can load efficiently in cassandra using copy command. Is there a limit for it? if so is it a good idea to breakdown the size files into multiple files and load or we have any better option to do it? Many thanks.
I've run into this issue before... At least for me there was no clear statement in any datastax or apache documentation of the max size. Basically, it may just be limited to your pc/server/cluster resources (e.g. cpu and memory).
However, in an article by jgong found here it is stated that you can import up to 10MB. For me it was something around 8.5MB. In the docs for cassandra 1.2 here its stated that you can import a few million rows and that you should use the bulk-loader for more heavy stuff.
All in all, I do suggest importing via multiple csv files (just dont make them too small so your opening/closing files constantly) so that you can keep a handle on data being imported and finding errors easier. It can happen that waiting for an hour for a file to load it fails and you start over whereas if you have multiple files you dont need to start over on the ones that already have been successfully imported. Not to mention key duplicate errors.
Check out cassandra-9303 and 9302
and check out brian's cassandra-loader
https://github.com/brianmhess/cassandra-loader

Spark: How to read & write temporary files?

I need to write a Spark app that uses temporary files.
I need to download many many large files, read them with some legacy code, do some processing, delete the file, and write the results to a database.
The files are on S3 and take a long time to download. However, I can do many at once, so I want to download a large number in parallel. The legacy code reads from the file system.
I think I can not avoid creating temporary files. What are the rules about Spark code reading and writing local files?
This must be a common issue, but I haven't found any threads or docs that talk about it. Can someone give me a pointer?
Many thanks
P

Cannot find efficient method for dumping leveldb to flat file(s)

I'm using LevelDB as part of a local process that, when all is done, has ~10-100 million JSON entries.
I need to get these into a portable format, ideally as one or more csv (or even line delimited json) files to import into a separate mongodb system.
I did a quick test in node.js to stream the db contents to a file (using node-levelup, and on my machine it took about 18.5 minutes for 10 million pairs. Seems pretty slow.
Looking for suggestions on quicker dump/export from leveldb.
I've considered using mongodb as the store for the local processing, because mongoexport is much quicker, but there is a lot more overhead in setup as I'd need several shards to get more speed on my writes.
The fastest way to retrieve all entries in leveldb is using it's iterator which may node-levelup already did for that.
Since you are still need a tool to parse the exported file, I suggest you just copy leveldb's data dir as the exported file. You can open it and iterate it in python/ruby/..., just any script which have a leveldb wrapper.

Resources