A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it; but excel can't seem to manage this.
Please how do I open this file?
Clearly I am missing a trick!
Please help!
There are a number of other Stackoverflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is Bigquery from Google's cloud platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is you'll have to learn SQL to do even the simplest things with the data.
Can you not tell excel to only "open" the file with the first 10 lines ...
This would allow you to inspect the format and then use some database functions on the contents.
Another thing that can impact whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file and you have to have a lot of on-disk swap space (page file in windows terms) + memory to open a file of that size. So, one thing you can do is find another computer that has more memory and resources or increase your swap space on your computer. If you have windows just google how to increase your page file.
This is a common problem. The typical solutions are
Insert your .CSV file into a SQL database such as MySQL, PostgreSQL etc.
Processing you data using Python, or R.
Find a data hub for your data. For example, Acho Studio.
The problem with solution one is that you'll have to design a table schema and find a server to host the database. Also you need to write server side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a of stress to your local computer. A data hub is much easier but its costs may vary.
Related
I currently have approximately 10M rows, ~50 columns in a table that I wrap up and share as a pivot. However, this also means that it takes approximately 30mins-1hour to download the csv or much longer to do a powerquery ODBC connection directly to Redshift.
So far the best solution I've found is to use Python -- Redshift_connector to run update queries and perform an unload a zipped resultset to an S3 bucket then use BOTO3/gzip to download and unzip the file, then finally performing a refresh from the CSV. This resulted in a 600MB excel file compiled in ~15-20 mins.
However, this process still feel clunky and sharing a 600MB excel file among teams isn't the best either. I've searched for several days but I'm not closer to finding an alternative: What would you use if you had to share a drillable table/pivot among a team with a 10GB datastore?
As a last note: I thought about programming a couple of PHP scripts, but my office doesn't have the infrastructure to support that.
Any help would or ideas would be most appreciated!
Call a meeting with the team and let them know about the constraints, you will get some suggestions and you can give some suggestions
Suggestions from my side:
For the file part
reduce the data, for example if it is time dependent, increase the interval time, for example an hourly data can be reduced to daily data
if the data is related to some groups you can divide the file into different parts each file belonging to each group
or send them only the final reports and numbers they require, don't send them full data.
For a fully functional app:
you can buy a desktop PC (if budget is a constraint buy a used one or use any desktop laptop from old inventory) and create a PHP/Python web application that can do all the steps automatically
create a local database and link it with the application
create the charting, pivoting etc modules on that application, and remove the excel altogether from your process
you can even use some pre build applications for charting and pivoting part, Oracle APEX is one examples that can be used.
I currently have an excel based data extraction method using power query and vba (for docs with passwords). Ideally this would be programmed to run once or twice a day.
My current solution involves setting up a spare laptop on the network that will run the extraction twice a day on its own. This works but I am keen to understand the other options. The task itself seems to be quite a struggle for our standard hardware. It is 6 network locations across 2 servers with around 30,000 rows and increasing.
Any suggestions would be greatly appreciated
Thanks
if you are going to work with increasing data, and you are going to dedicate a exclusive laptot for the process, i will think about install a database in the laptot (MySQL per example), you can use Access too... but Access file corruptions are a risk.
Download to this db all data you need for your report, based on incremental downloads (only new, modified and deleted info).
then run the Excel report extracting from this database in the same computer.
this should increase your solution performance.
probably your bigger problem can be that you query ALL data on each report generation.
I have a huge 20Gb csv file to copy into cassandra, of course i need to manage the case of errors ( if the the server or the Transfer/Load application crashes ).
I need to re-start the processing(or an other node or not) and continue the transfer without starting the csv file from it begning.
what is the best and easiest way to do that ?
using the Copy CQLSH Command ? using flume or sqoop ? or using native java application, using spark... ?
thanks a lot
If it was me, I would split the file.
I would pick a preferred way to load any csv data in, ignoring the issues of huge file size and error handling. For example, I would use a python script and the native driver and test it with a few lines of csv to see that it can insert from a tiny csv file with real data.
Then I would write a script to split the file into manageable sized chunks, however you define it. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size and loop over the chunks, logging how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully as found in the log file.
Here are a two considerations that I would try first since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure to be at one of those versions or newer.
Another option is to use Brian Hess' cassandra-loader which is an effective way of bulk loading to and from csv files in an efficient manner.
I think CQLSH doesn't handle the case of application crash, so why not using both of the solution exposed above, split the file into several manageable chunks and uses the copy cqlsh command to import the data ?
Could anyone tell the maximum size(no. of rows or file size) of a csv file we can load efficiently in cassandra using copy command. Is there a limit for it? if so is it a good idea to breakdown the size files into multiple files and load or we have any better option to do it? Many thanks.
I've run into this issue before... At least for me there was no clear statement in any datastax or apache documentation of the max size. Basically, it may just be limited to your pc/server/cluster resources (e.g. cpu and memory).
However, in an article by jgong found here it is stated that you can import up to 10MB. For me it was something around 8.5MB. In the docs for cassandra 1.2 here its stated that you can import a few million rows and that you should use the bulk-loader for more heavy stuff.
All in all, I do suggest importing via multiple csv files (just dont make them too small so your opening/closing files constantly) so that you can keep a handle on data being imported and finding errors easier. It can happen that waiting for an hour for a file to load it fails and you start over whereas if you have multiple files you dont need to start over on the ones that already have been successfully imported. Not to mention key duplicate errors.
Check out cassandra-9303 and 9302
and check out brian's cassandra-loader
https://github.com/brianmhess/cassandra-loader
I have a 5GB+ TSV file. I need to visualize the data it contains, but Excel cannot open the file (apparently is too big). Tableau does not work with TSV files and neither does Access. I tried with 010 Editor, which can open the file but no export it in a useful format. How can I open/export/transform it?
I have encountered this problem before. The trouble is that in order to open a file in Excel, you usually have to load the entire file into memory. This is fine when the file is 50 or 500k, but when it's 5GB, the system cannot load it into memory.
In order to work with that much data, you really need to load it into a database and run queries on it. Databases are optimized to work with large quantities of data (even way in excess of 5GB).
The tricky part will be loading this data into a database. You need a program which can parse your file (read line by line) and insert each TSV value into the appropriate database column. Writing an app to do this yourself may be best. If you're a windows person, you can use C# (http://www.microsoft.com/visualstudio/eng/products/visual-studio-2010-express) and MSSQL Express (http://www.microsoft.com/en-us/download/details.aspx?id=29062). Here's a helpful resource for parsing (Modify CSV Parser to work with TSV files C#). Here's a resource for inserting rows into MSSQL (How to insert data into SQL Server)
Agree with Dan, such data should be loaded into database and run queries on it. One handy tool to do that is DB Browser for SQLite. You can import csv, tsv files into this as tables and run SQL queries on it. It uses sqlite underline and supports most of the SQL functions. Works on Mac and Windows as well.