Import big Excel file using Excel Connection Manager in SSIS

Situation:
Right now I have several Excel files that need to be loaded into databases. As usual, I first tried to import the files with the Import Wizard in SSMS, but it failed when I tried to edit the mapping and proceed to the next step (I am guessing the Excel file is too big to be processed in the cache?). I then switched to SSIS, but the Excel connection manager still does not let me preview the data or even finish connecting to the file (after a few minutes it gives me an error saying the source does not exist?).
When I broke the file into smaller pieces, it worked.
Here are my questions:
What is the maximum number of rows that SSIS or SSMS can pre-load from an Excel file?
Instead of breaking the big file into smaller pieces (I am not sure exactly how large each piece should be), is there any other way to import a big Excel file? Splitting is not really workable when there are lots of files.
Thanks

SSIS does not limit the number of rows it can import from any particular source (unless you add some sort of constraint yourself). By default, the preview only shows about 200 rows, and I don't know of a way to change that.
You can raise the number of rows used to query the metadata (to 10,000, say), but the preview itself is pretty much fixed.
Which version of Excel are you importing data from, and which version of SSDT are you using? What is the maximum number of rows you have encountered in your biggest Excel files?
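This is not part of the answer above, but as one hedged illustration of the "other way" the question asks about: instead of splitting the workbook, a script can stream the rows out of the .xlsx and bulk-insert them, bypassing the Excel connection manager entirely. A minimal sketch in Python, assuming openpyxl and pyodbc are installed; the sheet name, table name, column count, and connection string are all placeholders:

```python
# Sketch only: stream a large .xlsx into SQL Server without loading it all into memory.
# "big_file.xlsx", "Sheet1", dbo.BigImport (3 columns) and the DSN are placeholders.
import openpyxl
import pyodbc

conn = pyodbc.connect("DSN=MySqlServer;Trusted_Connection=yes")  # placeholder connection string
cursor = conn.cursor()
cursor.fast_executemany = True

wb = openpyxl.load_workbook("big_file.xlsx", read_only=True)     # read_only streams rows lazily
ws = wb["Sheet1"]

batch = []
for row in ws.iter_rows(min_row=2, values_only=True):            # skip the header row
    batch.append(row[:3])                                        # three-column table is a placeholder
    if len(batch) == 5000:                                       # insert in chunks, not row by row
        cursor.executemany("INSERT INTO dbo.BigImport VALUES (?, ?, ?)", batch)
        conn.commit()
        batch = []
if batch:
    cursor.executemany("INSERT INTO dbo.BigImport VALUES (?, ?, ?)", batch)
    conn.commit()

wb.close()
conn.close()
```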

Related

Live Connection to Database for Excel PowerQuery?

I currently have approximately 10M rows and ~50 columns in a table that I wrap up and share as a pivot. However, this also means that it takes roughly 30 minutes to 1 hour to download the CSV, or much longer to do a Power Query ODBC connection directly to Redshift.
So far the best solution I've found is to use Python: redshift_connector to run the update queries and UNLOAD a zipped result set to an S3 bucket, then boto3/gzip to download and unzip the file, and finally a refresh from the CSV. This results in a 600MB Excel file compiled in ~15-20 minutes.
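For reference, a minimal sketch of the unload-and-download part of that pipeline, assuming redshift_connector and boto3 are installed; the cluster details, query, bucket, IAM role, and file names are placeholders:

```python
# Sketch of the UNLOAD -> S3 -> download -> gunzip steps described above.
# Cluster host, credentials, query, bucket, IAM role and key names are all placeholders.
import gzip
import shutil

import boto3
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    database="analytics", user="user", password="password",
)
cur = conn.cursor()

# Unload the result set to S3 as a gzipped CSV.
cur.execute("""
    UNLOAD ('SELECT * FROM reporting.pivot_source')
    TO 's3://my-bucket/exports/pivot_source_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnload'
    CSV GZIP PARALLEL OFF ALLOWOVERWRITE
""")
conn.commit()

# Download and decompress the part produced by PARALLEL OFF (exact part name may differ).
s3 = boto3.client("s3")
s3.download_file("my-bucket", "exports/pivot_source_000.gz", "pivot_source.csv.gz")
with gzip.open("pivot_source.csv.gz", "rb") as src, open("pivot_source.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)
```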
However, this process still feels clunky, and sharing a 600MB Excel file among teams isn't great either. I've searched for several days but I'm no closer to finding an alternative: what would you use if you had to share a drillable table/pivot among a team with a 10GB datastore?
As a last note: I thought about writing a couple of PHP scripts, but my office doesn't have the infrastructure to support that.
Any help or ideas would be most appreciated!
Call a meeting with the team and let them know about the constraints; you will get some suggestions and can offer some of your own.
Suggestions from my side:
For the file part:
Reduce the data: for example, if it is time-dependent, increase the interval, so hourly data can be aggregated into daily data (a short sketch follows this list).
If the data is grouped in some way, you can split the file into separate parts, one file per group.
Or send only the final reports and numbers they need rather than the full data set.
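As a hedged illustration of the first suggestion (hourly to daily aggregation before sharing), assuming pandas is available and that the file and column names below are hypothetical:

```python
# Sketch: collapse hourly rows to daily rows before sharing the file.
# "hourly_data.csv", "ts" and the numeric columns are hypothetical names.
import pandas as pd

df = pd.read_csv("hourly_data.csv", parse_dates=["ts"])
daily = (
    df.set_index("ts")
      .resample("D")              # one row per day instead of 24
      .sum(numeric_only=True)
      .reset_index()
)
daily.to_csv("daily_data.csv", index=False)
```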
For a fully functional app:
Buy a desktop PC (if budget is a constraint, buy a used one or repurpose an old machine from inventory) and build a PHP/Python web application that runs all of the steps automatically.
Create a local database and link it to the application.
Build the charting, pivoting, etc. modules in that application and remove Excel from the process altogether (a minimal sketch follows this list).
You can even use pre-built applications for the charting and pivoting part; Oracle APEX is one example.
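A minimal sketch of the "local database plus pivoting module" idea, assuming the data has already been loaded into a local SQLite file; the database, table, and column names are placeholders:

```python
# Sketch: pivot straight from a local database instead of a 600MB workbook.
# "local_store.db", "sales" and the column names are placeholders.
import sqlite3

import pandas as pd

conn = sqlite3.connect("local_store.db")
df = pd.read_sql_query("SELECT region, month, revenue FROM sales", conn)
conn.close()

pivot = pd.pivot_table(df, index="region", columns="month",
                       values="revenue", aggfunc="sum")
print(pivot)
```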

Unable to open a large .csv file

A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it, but Excel can't seem to manage this.
Please how do I open this file?
Clearly I am missing a trick!
Please help!
There are a number of other Stack Overflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with data of that size. The best solution I've found is BigQuery on Google Cloud Platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is that you'll have to learn SQL to do even the simplest things with the data.
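For example, a hedged sketch of loading such a CSV into BigQuery with schema auto-detection using the google-cloud-bigquery client; the project, dataset, table, and file names are placeholders, and credentials are assumed to be configured:

```python
# Sketch: load a large CSV into BigQuery and let it detect the schema.
# "my-project.my_dataset.big_csv" and "big_file.csv" are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # infer column names/types from the file
    skip_leading_rows=1,    # assumes a header row
)

with open("big_file.csv", "rb") as f:
    job = client.load_table_from_file(f, "my-project.my_dataset.big_csv",
                                      job_config=job_config)
job.result()  # wait for the load job to finish
```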
Can you not tell Excel to "open" only the first 10 lines of the file?
This would allow you to inspect the format and then use some database functions on the contents.
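If Excel itself won't cooperate, a few lines of Python can pull out just the head of the file for inspection; a minimal sketch, with the file names as placeholders:

```python
# Sketch: write the first 10 lines of the huge CSV to a small file Excel can open.
from itertools import islice

with open("huge.csv", "r", encoding="utf-8") as src, \
     open("huge_head.csv", "w", encoding="utf-8") as dst:
    dst.writelines(islice(src, 10))
```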
Another thing that affects whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file, and you need a lot of on-disk swap space (the page file, in Windows terms) plus memory to open a file of that size. So one option is to find another computer with more memory and resources, or to increase the swap space on your own machine. If you're on Windows, just search for how to increase your page file.
This is a common problem. The typical solutions are:
Insert your .CSV file into a SQL database such as MySQL, PostgreSQL, etc.
Process your data using Python or R.
Find a data hub for your data, for example Acho Studio.
The problem with the first solution is that you'll have to design a table schema and find a server to host the database, and you also need to write server-side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data puts a lot of stress on your local computer. A data hub is much easier, but its costs may vary.
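Combining the first two suggestions, a hedged sketch that streams the CSV into a local SQLite database in chunks, so neither Excel nor memory becomes the bottleneck; the file and table names are placeholders:

```python
# Sketch: chunked load of a multi-GB CSV into SQLite so it never sits in memory all at once.
# "big.csv", "big_data.db" and the table name "big_table" are placeholders.
import sqlite3

import pandas as pd

conn = sqlite3.connect("big_data.db")
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    chunk.to_sql("big_table", conn, if_exists="append", index=False)
conn.close()
```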

copy command row size limit in cassandra

Could anyone tell me the maximum size (number of rows or file size) of a CSV file that can be loaded efficiently into Cassandra using the COPY command? Is there a limit? If so, is it a good idea to break a large file down into multiple smaller files and load those, or is there a better option? Many thanks.
I've run into this issue before... At least for me there was no clear statement in any DataStax or Apache documentation of the maximum size. Basically, it may just be limited by your PC/server/cluster resources (e.g. CPU and memory).
However, in an article by jgong found here it is stated that you can import up to 10MB. For me it was something around 8.5MB. In the docs for Cassandra 1.2 here it is stated that you can import a few million rows and that you should use the bulk loader for heavier work.
All in all, I do suggest importing via multiple CSV files (just don't make them so small that you are constantly opening and closing files), so that you can keep a handle on the data being imported and find errors more easily. You can wait an hour for one file to load, have it fail, and then have to start over, whereas with multiple files you don't need to re-import the ones that already loaded successfully. Not to mention duplicate-key errors.
Check out CASSANDRA-9303 and CASSANDRA-9302,
and check out Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader
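A hedged sketch of the "multiple CSV files" approach: split one large export into fixed-size pieces before running COPY on each. The input name and rows-per-file count are placeholders, and each part buffers in memory before being written:

```python
# Sketch: split a big CSV into ~1M-row pieces (header repeated in each) for separate COPY runs.
# "big_export.csv" and ROWS_PER_FILE are placeholders; tune the size to your cluster.
import csv
from itertools import islice

ROWS_PER_FILE = 1_000_000

with open("big_export.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    part = 0
    while True:
        batch = list(islice(reader, ROWS_PER_FILE))   # one part's worth of rows
        if not batch:
            break
        part += 1
        with open(f"big_export_part{part:03d}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)                   # repeat header so each part loads with HEADER = TRUE
            writer.writerows(batch)
```

Each part can then be loaded in cqlsh with something like COPY my_keyspace.my_table FROM 'big_export_part001.csv' WITH HEADER = TRUE; (keyspace and table names are placeholders).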

Creating an excel library (DLL for excel)?

I am working on a project in Excel and am starting to prepare my document for future performance-related problems. The Excel file contains large amounts of data and large numbers of images, which are all in sets, i.e. 40 images belong to one function of the program, another 50 belong to another, etc., and only one set is used at a time.
This file is only going to get bigger as the number of jobs/functions it has to handle increases. Now, I could just make multiple Excel files and let the user choose which one is appropriate for the job, but it has been requested that this all be done from one file.
Bearing this in mind, I started thinking about how to build such a large file while keeping its performance high, and had an idea which I am not sure is possible. The idea is to have multiple protected workbooks, each containing the information for one job "set", and a main workbook which accesses these files depending on the user's inputs. This would result in many Excel files which take time to download initially, but in use it should eliminate the performance issues, because the computer only has to access a subset of the files.
From what I understand this is sort of what DLLs are for, but I am not sure whether the same can be done in Excel and, if it is possible, whether the performance increase would be significant.
If anyone has any other suggestions or elegant solutions on how this can be done please let me know.
Rather than saving data such as images in the Excel file itself, write your macro to load the appropriate images from files and have your users select which routine to run. This way, you load only the files you need. If your data is text/numbers, you can store it in a CSV or, if it gets very large, in a Microsoft Access database, and retrieve it using the ADODB library.
Inserting Images: How to insert a picture into Excel at a specified cell position with VBA
More on using ADODB: http://msdn.microsoft.com/en-us/library/windows/desktop/ms677497%28v=vs.85%29.aspx

How to work with a large TSV file

I have a 5GB+ TSV file. I need to visualize the data it contains, but Excel cannot open the file (apparently it is too big). Tableau does not work with TSV files, and neither does Access. I tried 010 Editor, which can open the file but cannot export it in a useful format. How can I open/export/transform it?
I have encountered this problem before. The trouble is that in order to open a file in Excel, you usually have to load the entire file into memory. This is fine when the file is 50 or 500 KB, but when it's 5GB, the system cannot load it into memory.
In order to work with that much data, you really need to load it into a database and run queries on it. Databases are optimized to work with large quantities of data (even way in excess of 5GB).
The tricky part will be loading this data into a database. You need a program which can parse your file (reading it line by line) and insert each TSV value into the appropriate database column. Writing an app to do this yourself may be best. If you're a Windows person, you can use C# (http://www.microsoft.com/visualstudio/eng/products/visual-studio-2010-express) and MSSQL Express (http://www.microsoft.com/en-us/download/details.aspx?id=29062). Here's a helpful resource for parsing (Modify CSV Parser to work with TSV files C#) and one for inserting rows into MSSQL (How to insert data into SQL Server).
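The same parse-and-insert loop can also be sketched in a few lines of Python, with SQLite standing in for the target database; the file name, table name, and three-column layout are placeholders:

```python
# Sketch: read the TSV line by line and insert each row into a database table in batches.
# "data.tsv", "big_data.db", "big_table" and the three-column layout are placeholders.
import csv
import sqlite3

conn = sqlite3.connect("big_data.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS big_table (col_a TEXT, col_b TEXT, col_c TEXT)")

with open("data.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")   # the csv module handles TSV via delimiter="\t"
    next(reader)                             # assumes a header row
    batch = []
    for row in reader:
        batch.append(row[:3])
        if len(batch) == 50_000:             # commit in batches, not per row
            cur.executemany("INSERT INTO big_table VALUES (?, ?, ?)", batch)
            conn.commit()
            batch = []
    if batch:
        cur.executemany("INSERT INTO big_table VALUES (?, ?, ?)", batch)
        conn.commit()

conn.close()
```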
I agree with Dan: such data should be loaded into a database so you can run queries on it. One handy tool for that is DB Browser for SQLite. You can import CSV and TSV files into it as tables and run SQL queries on them. It uses SQLite under the hood and supports most SQL functions. It works on Mac and Windows as well.
