How to work with a large TSV file - excel

I have a 5 GB+ TSV file. I need to visualize the data it contains, but Excel cannot open the file (apparently it is too big). Tableau does not work with TSV files and neither does Access. I tried 010 Editor, which can open the file but cannot export it in a useful format. How can I open/export/transform it?

I have encountered this problem before. The trouble is that in order to open a file in Excel, you usually have to load the entire file into memory. This is fine when the file is 50 KB or 500 KB, but when it's 5 GB, the system cannot load it into memory.
In order to work with that much data, you really need to load it into a database and run queries on it. Databases are optimized to work with large quantities of data (even way in excess of 5GB).
The tricky part will be loading this data into a database. You need a program which can parse your file (read it line by line) and insert each TSV value into the appropriate database column. Writing an app to do this yourself may be best. If you're a Windows person, you can use C# (http://www.microsoft.com/visualstudio/eng/products/visual-studio-2010-express) and MSSQL Express (http://www.microsoft.com/en-us/download/details.aspx?id=29062). Here's a helpful resource for parsing (Modify CSV Parser to work with TSV files C#). Here's a resource for inserting rows into MSSQL (How to insert data into SQL Server).
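As a rough illustration of that parse-and-insert loop, here is a minimal Python sketch using the standard-library csv and sqlite3 modules in place of the C#/MSSQL route described above; the file name, table name, and TEXT-only column types are assumptions:

```python
import csv
import sqlite3

# Assumptions: the file is named data.tsv, has a header row,
# and every row has the same number of columns.
con = sqlite3.connect("data.db")

with open("data.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")   # TSV = CSV with a tab delimiter
    header = next(reader)

    # One TEXT column per header field (quoted in case of odd names).
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    con.execute(f"CREATE TABLE IF NOT EXISTS data ({cols})")

    insert_sql = f"INSERT INTO data VALUES ({', '.join('?' for _ in header)})"

    batch = []
    for row in reader:              # streams line by line; never loads the whole 5 GB
        batch.append(row)
        if len(batch) == 10_000:
            con.executemany(insert_sql, batch)
            con.commit()
            batch.clear()
    if batch:
        con.executemany(insert_sql, batch)
        con.commit()

con.close()
```

Once the data is in a database file you can filter and aggregate it with plain SQL instead of trying to scroll through 5 GB in a spreadsheet.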

I agree with Dan: such data should be loaded into a database so you can run queries on it. One handy tool for that is DB Browser for SQLite. You can import CSV and TSV files into it as tables and run SQL queries on them. It uses SQLite under the hood and supports most SQL functions. It works on Mac and Windows as well.

Related

Is there any way to insert CSV file data using cassandra-stress?

I have explored the cassandra-stress tool a bit using a YAML profile and it is working fine. I just wanted to know whether there is any way to specify the location of an external CSV file in the YAML profile, so that cassandra-stress inserts that data into the Cassandra table.
In other words, instead of random data, I want to see cassandra-stress test results for a specific data load on this data model.
Standard cassandra-stress doesn't have such functionality, but you can use the NoSQLBench tool that was recently open-sourced by DataStax. It also uses YAML to describe workloads, but it's much more flexible and has a number of functions for sampling data from CSV files.
P.S. There is also a separate Slack workspace for this project (to get an invite, fill in this form).

Can I get Metadata of Files, or Stats of Files, Stored on Azure Databricks

As I mentioned in the title, I'm curious to know whether I can get metadata on a bunch of files - basically all the files in a blob - that are loaded on Azure Databricks. I'm hoping there is some kind of generic script that can be run to give stats on files (mostly CSV format). I know it is pretty easy to get all kinds of stats on tables in SQL Server, which is also a Microsoft product. Or maybe there is some kind of report that can be generated to show metadata, stats, etc. of the files. Ultimately, I would like to get a list of file names, file sizes, and if possible, counts of nulls per field and a total count of nulls across all fields, for every file. Thanks.
For files, the only thing available is dbutils.fs.ls, which lists the files in a folder, including each file's size.
You cannot get stats on a CSV file without opening it and running a query over it - CSV is just a text format.
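If you do want those numbers, a minimal PySpark sketch for a Databricks notebook is below; `dbutils` and `spark` are provided by the notebook, while the mount path and the header/schema options are assumptions you would adapt:

```python
from pyspark.sql import functions as F

# Placeholder mount point - point this at your blob container.
base_path = "/mnt/my-blob-container/"

for info in dbutils.fs.ls(base_path):
    if not info.name.endswith(".csv"):
        continue

    print(f"{info.name}: {info.size} bytes")

    # Null counts require actually reading the file - CSV stores no statistics.
    df = spark.read.csv(info.path, header=True, inferSchema=True)
    nulls = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).first().asDict()

    print(f"  nulls per column: {nulls}")
    print(f"  total nulls: {sum(nulls.values())}")
```

This reads every file end to end, so expect it to take a while on a large blob.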
Formats such as Parquet do store data-distribution statistics. There are probably Python and Scala libraries available that can read them for you if you really want to.
If you are registering the files as a table in Databricks (Hive) then there can be statistics generated for query optimisation. https://docs.databricks.com/spark/latest/spark-sql/language-manual/analyze-table.html
That link includes details of the DESCRIBE command to view them.
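For completeness, a short sketch of that route, run from a notebook; the table and column names are hypothetical:

```python
# Assumes the files are registered as a table called my_table (hypothetical name).
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR ALL COLUMNS")

# Table-level statistics (size in bytes, row count) appear in the output:
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False)

# Column-level statistics (null count, min, max, distinct count) for one column:
spark.sql("DESCRIBE EXTENDED my_table some_column").show(truncate=False)
```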
Like SQL Server's, these table stats are distributions and are estimates only. They will not give you true null counts, for example. Both engines use them to improve query performance; neither intends users to consume the stats directly.
Also Databricks is not a Microsoft product.

Import big excel file using excel connection manager in SSIS

Situation:
Right now I have several Excel files that need to be loaded into databases. As usual, I first tried to import the files using the import task in SSMS, but it failed when I tried to edit the mapping and proceed to the next step. (Here I am guessing the Excel file is too big to process in the cache?) I then switched to SSIS, but the Excel connection manager still does not let me preview the data or even finish connecting to the file. (After a few minutes, it gives me an error saying the source does not exist?)
I tried breaking the file into smaller pieces, and that worked.
Here are my questions:
What is the maximum number of rows that SSIS or SSMS can pre-load from an Excel file?
Instead of breaking the big file into small pieces (I am not sure what exact size each piece needs to be), is there any other way to import a big Excel file? Splitting is not really workable when there are lots of files.
Thanks
SSIS does not limit the number of rows it can import from any particular source (unless you are using some sort of constraint). By default, the preview shows only about 200 rows, and I don't know of a way to change that.
You can change the number of rows used to query the metadata (probably up to 10,000), but for the preview the limit is pretty much standard.
What version of Excel are you importing data from, and what SSDT version are you using? What is the maximum number of rows you have encountered in your biggest Excel files?

Unable to open a large .csv file

A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it, but Excel can't seem to manage this.
Please, how do I open this file?
Clearly I am missing a trick!
Please help!
There are a number of other Stackoverflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with files of that size. The best solution I've found is BigQuery on Google Cloud Platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is you'll have to learn SQL to do even the simplest things with the data.
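If you go that route, a minimal sketch with the google-cloud-bigquery Python client looks roughly like this; the project, dataset, table, and file names are placeholders, and it assumes the CSV has a header row:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")    # placeholder project
table_id = "my-project.my_dataset.big_csv"        # placeholder dataset.table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # let BigQuery infer the schema
    skip_leading_rows=1,    # skip the header row
)

with open("big_file.csv", "rb") as f:
    job = client.load_table_from_file(f, table_id, job_config=job_config)
job.result()  # wait for the load to finish

print(client.get_table(table_id).num_rows, "rows loaded")
```

For a file this size you would more likely stage it in Cloud Storage first and call load_table_from_uri, but the flow is the same.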
Can you not tell Excel to only "open" the first 10 lines of the file?
This would allow you to inspect the format and then use some database functions on the contents.
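If Excel won't cooperate, a few lines of Python will show you the head of the file without reading all of it (the file name here is a placeholder):

```python
from itertools import islice

# Print the first 10 lines to inspect the delimiter, headers, quoting, etc.
with open("big_file.csv", encoding="utf-8", errors="replace") as f:
    for line in islice(f, 10):
        print(line.rstrip("\n"))
```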
Another thing that can affect whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file, and you need a lot of on-disk swap space (page file, in Windows terms) plus memory to open a file of that size. So one thing you can do is find another computer that has more memory and resources, or increase the swap space on your own computer. If you use Windows, just search for how to increase your page file.
This is a common problem. The typical solutions are
Insert your .CSV file into a SQL database such as MySQL or PostgreSQL.
Process your data using Python or R.
Find a data hub for your data, for example Acho Studio.
The problem with solution one is that you'll have to design a table schema and find a server to host the database. You also need to write server-side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a lot of stress on your local computer (chunked processing helps; see the sketch below). A data hub is much easier, but its costs may vary.
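For option two, the trick is to process the file in chunks so it never has to fit in memory at once; here is a small pandas sketch where the file name and the per-chunk aggregation are placeholders:

```python
import pandas as pd

total_rows = 0
null_totals = None

# Stream the file in 100,000-row chunks instead of loading all ~3.7 GB at once.
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total_rows += len(chunk)
    counts = chunk.isna().sum()
    null_totals = counts if null_totals is None else null_totals + counts

print("rows:", total_rows)
print("nulls per column:")
print(null_totals)
```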

Best way to copy 20Gb csv file to cassandra

I have a huge 20 GB CSV file to copy into Cassandra, and of course I need to handle errors (in case the server or the transfer/load application crashes).
I need to be able to restart the processing (on the same node or another one) and continue the transfer without starting the CSV file over from the beginning.
What is the best and easiest way to do that?
Using the cqlsh COPY command? Using Flume or Sqoop? Or a native Java application, or Spark...?
Thanks a lot.
If it were me, I would split the file.
I would pick a preferred way to load any CSV data, ignoring for a moment the issues of huge file size and error handling. For example, I would use a Python script and the native driver, and test it with a few lines of CSV to check that it can insert real data from a tiny file.
Then I would write a script to split the file into manageable sized chunks, however you define it. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size and loop over the chunks, logging how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully as found in the log file.
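A minimal Python sketch of that chunk-and-resume loop, using the DataStax Python driver; the contact point, keyspace, table, columns, chunk size, and file name are all assumptions, and row values may need casting to match your column types:

```python
import csv
import os
from cassandra.cluster import Cluster

CHUNK_SIZE = 50_000                  # tune so one chunk loads in about a minute
PROGRESS_FILE = "load_progress.txt"  # records the last chunk that finished

cluster = Cluster(["127.0.0.1"])                 # placeholder contact point
session = cluster.connect("my_keyspace")         # placeholder keyspace
insert = session.prepare(
    "INSERT INTO my_table (id, col1, col2) VALUES (?, ?, ?)"  # placeholder schema
)

# Resume point: index of the last chunk that completed successfully.
done = -1
if os.path.exists(PROGRESS_FILE):
    with open(PROGRESS_FILE) as p:
        done = int(p.read().strip() or -1)

with open("huge_file.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                                 # skip the header row
    chunk_index, rows_in_chunk = 0, 0
    for row in reader:
        if chunk_index > done:                   # skip chunks already loaded
            session.execute(insert, (row[0], row[1], row[2]))
        rows_in_chunk += 1
        if rows_in_chunk == CHUNK_SIZE:
            if chunk_index > done:
                with open(PROGRESS_FILE, "w") as p:
                    p.write(str(chunk_index))    # checkpoint after each full chunk
            chunk_index, rows_in_chunk = chunk_index + 1, 0

cluster.shutdown()
```

Because Cassandra inserts are upserts, re-running the partially finished chunk after a crash simply overwrites the same rows, so the restart is harmless. For real throughput you would issue the inserts with session.execute_async and throttle the in-flight requests, but the resume logic stays the same.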
Here are two considerations that I would try first, since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure to be at one of those versions or newer.
Another option is to use Brian Hess' cassandra-loader, which is an effective way of bulk loading to and from CSV files.
I think cqlsh COPY doesn't handle the case of an application crash, so why not combine both of the solutions described above: split the file into several manageable chunks and use the cqlsh COPY command to import each one?