I have a use case where I need to process a huge Excel file in a fraction of a second, which is not possible if the file is loaded each time. Hence, I wish to store the selected information from the Excel file in memory so that my application can read it from memory instead of loading the Excel file every time. By the way, I am using Groovy to develop the application. My questions are as follows:
What is an in-memory data structure? How can I use one in Groovy?
What happens when multiple processes running on different nodes want to access the in-memory data structure?
Any pointer/link would be very helpful.
Just use Apache POI; it will load the workbook into memory (example)
They will each need to load a copy. Or you will need to do something clever
See above
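To illustrate the Apache POI suggestion above, here is a minimal Groovy sketch (Groovy being the asker's language) that reads selected cells into a plain Map, which then serves as the in-memory data structure for later lookups. The file name, sheet index and two-column layout are assumptions; adjust them to the real workbook:

// A minimal sketch, assuming the "selected information" is a key/value pair
// taken from columns A and B of the first sheet of a placeholder file.
@Grab('org.apache.poi:poi-ooxml:5.2.5')
import org.apache.poi.ss.usermodel.WorkbookFactory
import org.apache.poi.ss.usermodel.DataFormatter

def cache = [:]                                          // the in-memory data structure: a plain Map
def fmt   = new DataFormatter()

def wb = WorkbookFactory.create(new File('data.xlsx'))   // POI loads the workbook into memory
wb.getSheetAt(0).each { row ->
    def key = row.getCell(0)
    def val = row.getCell(1)
    if (key != null && val != null) {
        cache[fmt.formatCellValue(key)] = fmt.formatCellValue(val)
    }
}
wb.close()

// Later requests read from the Map instead of re-parsing the .xlsx file
println cache['some key']                                // 'some key' is a placeholder lookup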
I have a large JSON file (over 7 MB) and I want to iterate over it to find my matching data.
Is it a good approach to read the entire file into memory and keep it for the next call, or are there other ways that give better performance and use less memory?
Data stored in the JSON format is meant to be read in all at once. That's how the format works. It's not a format that you would generally incrementally search without first reading all the data in. While there are some modules that support streaming it in and somewhat examining it incrementally, that is not what the format was intended for, nor what it is best suited for.
So, you really have several questions to ask yourself:
Can you read the whole block of data into memory at once and parse it into Javascript?
Is the amount of memory it takes to do that OK in your environment?
Is the time to do that OK for your application?
Can you cache it in memory for a while so you can more efficiently access it the next time you need something from it?
Or, should this data really be in a database that supports efficient searches and efficient modifications with far lower memory usage and much better performance?
If you're OK with the first four questions, then just read it in, parse it and keep the resulting Javascript object in memory. If you're not OK with any of the first four questions, then you probably should put the data into a format that can more efficiently be queried without loading it all into memory (e.g. a simple database).
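The thread is about Javascript, but the read-once-and-cache approach described above is the same in any language. Purely as an illustration, here is a small Groovy sketch (matching the other examples on this page); the file path and the 'id' field are hypothetical:

import groovy.json.JsonSlurper

// Parse the file once and keep the resulting object for later calls.
// The path and the 'id' field are placeholders for whatever your data looks like.
class MatchData {
    private static Object cached
    static synchronized Object load() {
        if (cached == null) {
            cached = new JsonSlurper().parse(new File('/data/matches.json'))
        }
        return cached
    }
}

// Every lookup after the first one searches the in-memory structure only
// (assumes the parsed JSON is a list of records with an 'id' field)
def match = MatchData.load().find { it.id == 42 }
println match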
A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it, but Excel can't seem to manage this.
How do I open this file, please?
Clearly I am missing a trick!
Please help!
There are a number of other Stack Overflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is BigQuery on Google Cloud Platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is that you'll have to learn SQL to do even the simplest things with the data.
Can you not tell Excel to only "open" the first 10 lines of the file?
This would allow you to inspect the format and then use some database functions on the contents.
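If Excel itself won't open just part of the file, a two-line script gives the same kind of preview. A Groovy sketch, with a placeholder file name:

// Print only the first 10 lines so you can inspect the headers and format
// without touching the remaining gigabytes ('bigdata.csv' is a placeholder).
new File('bigdata.csv').withReader { reader ->
    10.times { println reader.readLine() }
}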
Another thing that can affect whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file, and you have to have a lot of on-disk swap space (page file, in Windows terms) plus memory to open a file of that size. So, one thing you can do is find another computer that has more memory and resources, or increase the swap space on your computer. If you have Windows, just google how to increase your page file.
This is a common problem. The typical solutions are:
Insert your .csv file into a SQL database such as MySQL, PostgreSQL, etc.
Process your data using Python or R.
Find a data hub for your data. For example, Acho Studio.
The problem with solution one is that you'll have to design a table schema and find a server to host the database. You also need to write server-side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a lot of stress on your local computer. A data hub is much easier, but its costs may vary.
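As one way to follow solution one without having to host a database server, the file can be loaded into an embedded SQLite database and then queried with plain SQL. This is only a sketch, not something the answer itself proposes: the file name, table name and two-column layout are assumptions, and the naive comma split does not handle quoted fields.

@GrabConfig(systemClassLoader = true)
@Grab('org.xerial:sqlite-jdbc:3.45.1.0')
import groovy.sql.Sql

// Load the CSV into a local SQLite file so it can be queried with plain SQL.
def db = Sql.newInstance('jdbc:sqlite:bigdata.db')
db.execute('CREATE TABLE IF NOT EXISTS rows (col_a TEXT, col_b TEXT)')

new File('bigdata.csv').withReader { reader ->
    reader.readLine()                                      // skip the header row
    db.withBatch(10_000, 'INSERT INTO rows (col_a, col_b) VALUES (?, ?)') { stmt ->
        reader.eachLine { line ->
            def cols = line.split(',', -1)                 // naive split: no quoted fields
            stmt.addBatch([cols[0], cols.size() > 1 ? cols[1] : null])
        }
    }
}
println db.firstRow('SELECT COUNT(*) AS n FROM rows').n    // sanity check
db.close()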
I want to know the maximum Excel file size that we can load into a database using a simple ETL SSIS package. If the file size limit depends on system configuration or resources, then how can we calculate it? In my case I am trying to load an Excel file of 500+ MB.
My package hangs even while trying to map columns.
Thanks.
The only real limitation is the amount of memory (RAM) on the machine where the package is running, as SSIS loads the data into memory.
Thus, if you only have 2 GB of RAM, I wouldn't try to load files bigger than 1 GB (you must leave RAM for SQL Server to operate, and don't forget about all your other applications).
Also remember if you're not pipelining your data flows properly, and you have blocking parts like Aggregate or SQL Command objects, then you are going to be loading way more into memory than you should be.
The file size is not as important if you have no blocking parts. SSIS won't load the entire object into memory, and you can specify how much it uses. But if there are blocking parts, then it will need the entire object in memory.
Note that another big memory hog can be Lookup tasks with Full Caching; these can take up large amounts of memory if you are loading big tables.
Hope this helps.
I have a huge 20 GB CSV file to copy into Cassandra, and of course I need to handle errors (if the server or the transfer/load application crashes).
I need to restart the processing (on another node or not) and continue the transfer without starting the CSV file from the beginning.
What is the best and easiest way to do that?
Using the cqlsh COPY command? Using Flume or Sqoop? Or using a native Java application, or Spark...?
Thanks a lot.
If it was me, I would split the file.
I would pick a preferred way to load any CSV data in, ignoring the issues of huge file size and error handling. For example, I would use a Python script and the native driver, and test it with a few lines of CSV to see that it can insert from a tiny CSV file with real data.
Then I would write a script to split the file into manageably sized chunks, however you define them. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size and loop over the chunks, logging how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully, as found in the log file.
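The answer above suggests a Python script; the split-and-resume loop itself is language-agnostic, so here is a sketch in Groovy to match the other examples on this page. The file names, chunk size, keyspace and table are placeholders, the input is assumed to have no header row, and cqlsh COPY is just one possible way to load each chunk.

def source   = new File('big.csv')
def progress = new File('loaded_chunks.log')            // one line per chunk that finished
def done     = progress.exists() ? (progress.readLines() as Set) : ([] as Set)
int linesPerChunk = 1_000_000                            // tune so one chunk loads in about a minute

// 1. Split the big file into numbered chunk files.
int chunkCount = 0
def writer = null
source.eachLine { line, n ->
    if ((n - 1) % linesPerChunk == 0) {
        writer?.close()
        chunkCount++
        writer = new File("chunk_${chunkCount}.csv").newWriter()
    }
    writer.writeLine(line)
}
writer?.close()

// 2. Load the chunks in order, skipping the ones already recorded in the log,
//    so a crashed run can be restarted and resume where it stopped.
(1..chunkCount).each { i ->
    def name = "chunk_${i}.csv".toString()
    if (name in done) return                             // loaded before the crash
    def proc = ['cqlsh', '-e',
                "COPY my_keyspace.my_table FROM '${name}' WITH HEADER = false"].execute()
    if (proc.waitFor() != 0) {
        throw new RuntimeException("Loading ${name} failed; fix the problem and re-run")
    }
    progress << name + '\n'                              // only logged after a successful load
}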
Here are two considerations that I would try first, since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure to be at one of those versions or newer.
Another option is to use Brian Hess's cassandra-loader, which is an effective way of bulk loading to and from CSV files.
I think cqlsh COPY doesn't handle the case of an application crash, so why not use both of the solutions described above: split the file into several manageable chunks and use the cqlsh COPY command to import the data?
I am working on a project within Excel and am starting to prepare my document for future performance-related problems. The Excel file contains large amounts of data and large numbers of images, which are all in sets, i.e. 40 images belong to one function of the program, another 50 belong to another, etc., and only one set of them is used at a time.
This file is only going to get bigger as the number of jobs/functions it has to handle increases. Now, I could just make multiple Excel files and let the user choose which one is appropriate for the job, but it is requested that this all be done from one file.
Bearing this in mind, I started thinking about methods of creating such a large file whilst keeping its performance high, and had an idea which I am not sure is possible. The idea is to have multiple protected workbooks, each containing the information for one job "set", and a main workbook which accesses these files depending on the user's inputs. This will result in many Excel files which take time to download initially, but while in use this should eliminate the performance issues, as the computer only has to access a subset of these files.
From what I understand this is sort of like what DLLs are for, but I am not sure whether the same can be done in Excel and, if it is possible, whether the performance increase would be significant.
If anyone has any other suggestions or elegant solutions on how this can be done please let me know.
Rather than saving data such as images in the Excel file itself, write your macro to load the appropriate images from files and have your users select which routine to run. This way, you load only the files you need. If your data is text/numbers, you can store it in a CSV file or, if your data gets very large, use a Microsoft Access database and retrieve the data using the ADODB library.
Inserting Images: How to insert a picture into Excel at a specified cell position with VBA
More on using ADODB: http://msdn.microsoft.com/en-us/library/windows/desktop/ms677497%28v=vs.85%29.aspx