Grails Excel import fails for huge data

I am using Grails 2.3.7 and the latest excel-import plugin (1.0.0). My requirement is to copy the contents of an Excel sheet as-is into the database. My database is MS SQL Server 2012.
I have the code working in development. It runs fine when the number of records is small, up to a few hundred.
But in production the Excel sheet will have as many as 50,000 rows and over 75 columns.
Initially I hit an out-of-memory exception. I increased the heap size to as much as 8 GB, but now the thread keeps running on and on without terminating. No errors are generated.
Please note that this is a once-in-a-while operation, and it will be carried out by someone who will make sure it does not hamper other operations running in parallel, so there is no need to worry about the heavy load of this operation. I can afford to run it.
With up to 10,000 records and the same number of columns, the data gets copied in around 5 minutes. With 50,000 rows the time taken should ideally be around 5 times that, roughly 25 minutes, but the code kept running for more than an hour without finishing.
Any idea how to go about this issue? Any help is highly appreciated.

Loading 5 times more data into memory doesn't always take just 5 times longer. My guess is that most of that 8 GB ends up in virtual memory, and virtual memory backed by disk is very slow. Try decreasing the heap, run some memory tests, and make sure the working set stays in physical RAM as much as possible.

In my experience this is a common problem with large batch operations in Grails. I think you have memory leaks that radically slow the operation down as it proceeds.
My solution has been to use an ETL tool such as Pentaho Kettle for the import, or to chunk the import into manageable pieces (a rough sketch of the chunking idea follows below). See this related question:
Insert 10,000,000+ rows in grails
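The question is about Grails/Groovy, but as a language-neutral illustration of the chunking idea, here is a minimal Python sketch that streams the sheet instead of loading it whole and commits in fixed-size batches. openpyxl, pyodbc, the table name and the connection string are all assumptions, not part of the original setup.

    # Sketch only: stream a large .xlsx file and insert it in batches
    # so the whole sheet never has to sit in memory at once.
    import pyodbc
    from openpyxl import load_workbook

    BATCH_SIZE = 1000

    conn = pyodbc.connect("DSN=mssql2012")          # placeholder connection
    cur = conn.cursor()
    cur.fast_executemany = True

    wb = load_workbook("huge_import.xlsx", read_only=True)  # read_only streams rows
    ws = wb.active

    insert_sql = "INSERT INTO import_target VALUES (" + ", ".join(["?"] * 75) + ")"
    batch = []

    for row in ws.iter_rows(values_only=True):
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            cur.executemany(insert_sql, batch)      # one round trip per chunk
            conn.commit()                           # release memory per chunk
            batch.clear()

    if batch:
        cur.executemany(insert_sql, batch)
        conn.commit()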

Not technically an answer to your problem, but have you considered just using CSV instead of Excel?
From a user's point of view, saving as CSV before importing is not a lot of work.
I am loading, validating and saving CSVs with 200,000-300,000 rows without a hitch.
Just make sure you have the logic in a service so it puts a transaction around it.
It takes a bit more code to decode the CSV, especially to translate fields to the various primitive types (a small parsing sketch follows below), but it should be orders of magnitude faster.
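Again, the question is Grails-centric, but as a rough, language-neutral sketch of "decoding a CSV to primitives"; the column names and types here are made up for illustration:

    # Sketch: read a CSV and convert each field to the primitive type the DB expects.
    import csv
    from datetime import datetime

    def parse_row(row):
        # Hypothetical columns; adjust the converters to your real schema.
        return (
            int(row["id"]),
            row["name"].strip(),
            float(row["amount"]),
            datetime.strptime(row["created"], "%Y-%m-%d"),
            row["flag"].lower() in ("1", "true", "yes"),
        )

    with open("export.csv", newline="", encoding="utf-8") as f:
        records = [parse_row(r) for r in csv.DictReader(f)]

    print(f"parsed {len(records)} records")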

Related

MariaDB got much faster, and I can't find the cause?

I have a concern about my MariaDB 10.4.12 database query execution time, which is getting much faster without any update to my database schema or data. While a speed-up is always welcome, I am concerned about the root cause of this speed-up, especially since I have not rolled out any changes in the last 24 hours. This specific query has sped up 60x overnight.
I have a NodeJS web application that filters a large dataset into "reporting" pages, which typically take 10-12 seconds to load. My main table has 3.5 million rows and the base query involves many joins, date comparisons, and text comparisons. There is room for fine-tuning the query, but it worked for what it was designed to do and I could live with 10 second load times. I noticed this morning, though, that my queries were executed in less than 1 second, without any recent changes on my part.
The most recent change to the application was pushed out five days ago, which affected the amount of data being pulled into this database. A separate application on the same server reaches out to a data set every 10 minutes and replicates these rows into the same database the "reporting" application communicates with. Up until this update, the query was collecting and inserting ~80,000 rows on average, taking about 8-10 seconds to fully replicate the data into this database. My change five days ago reduced the rows being inserted to ~20,000 on average.
Other clues:
PHPMyAdmin still takes 10-12 seconds to run the query, while the MySQL command-line tool runs it in less than 1 second
The MariaDB temp directory was changed to a larger partition 7 days ago
The query was tested to be slow (10-12 seconds) 24 hours ago
The query is still slow on a pre-production server that runs the same application with an identical MySQL instance running (same schema and data)
My current running theory is that the ~80,000 inserts were not being executed in the time range being reported by NodeJS (8-10 seconds for the inserts), and they were instead waiting in the MariaDB temp directory until they could be fully written to the database. That would suggest that the database was constantly bogged down by these writes, and reducing the number to ~20k allowed the database to insert faster, allowing the select queries to run faster this morning.
Should I be concerned about this speed up? Could MariaDB have found a faster way to index my data? Am I going crazy?
Thank you.
Don't worry. This kind of thing can be caused by contention (multiple database clients using the database concurrently) and all sorts of other things.
(Cherish this moment. Performance usually goes the other direction.)
You can test for correctness to increase your confidence level. Check a few older and a few newer records to see if they still contain good data.
Or run a full-table-scan query, something like this:
    SELECT COUNT(*), AVG(some_number_column), MIN(some_text_column) FROM mytable
That will take a while but it will hit every row in the table.
You probably don't need to do this, but it's a way to double-check (and to tell your boss, "I double-checked.").
10 seconds, then 1 second. That is "normal".
The first was run when none of the data was cached in RAM; the second was with all cached.
Run it a third time; it will be 1 second again.
Restart MariaDB and run it again; it will again take 10 seconds.
Walk away from the machine for a long time; don't touch the table. It might be back to 10 seconds. For this, look at the size of your RAM and innodb_buffer_pool_size. Also look for big table scans that bump everything out of the cache.
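If you want to see the cache effect for yourself, a minimal timing harness looks like this; it assumes mysql-connector-python and placeholder credentials, and reuses the full-scan query from above. The first run is served largely from disk, the repeat mostly from the buffer pool.

    # Run the same full-scan query twice and compare wall-clock times.
    import time
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(host="localhost", user="report",
                                   password="secret", database="mydb")  # placeholders
    cur = conn.cursor()

    query = ("SELECT COUNT(*), AVG(some_number_column), MIN(some_text_column) "
             "FROM mytable")

    for attempt in (1, 2):
        start = time.perf_counter()
        cur.execute(query)
        cur.fetchall()
        print(f"run {attempt}: {time.perf_counter() - start:.2f}s")

    cur.close()
    conn.close()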

Speeding up dataframe.to_excel operations with a GPU

I was working on extracting some data where I constantly need to manipulate part of the fetched data and then append it to another dataframe that contains the combined dataset. I constantly save the dataframe using dataframe.to_excel. Since there is a lot of data, this has become a time-consuming operation: reading the previous file, appending, and saving it again, in spite of ample CPU and RAM. I am using GCP, an N1 machine type with 8 vCPUs and 30 GB of memory. Moreover, since I am running several instances of the same script for various projects at the same time, would using a GPU speed these things up?
I never did it myself, but I think this is possible by using some Pandas alternative.
I found this thread in which users seem to provide some solutions to a similar question.
I too have not tried this, but I can offer a couple of suggestions:
Rather than to_excel, try to_csv; there will probably be at least small gains.
You can try the modin library (https://github.com/modin-project/modin); it seems to make reads and other operations faster, but I am not sure about the write operations.
Or you could move the to_excel call into a different function and perform that operation on a separate thread (a sketch follows below).
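A minimal sketch of the last two points combined, assuming the combined dataframe is called combined (a stand-in name): write CSV instead of Excel, and hand the write to a worker thread so the main loop isn't blocked.

    # Sketch: snapshot the combined dataframe in the background instead of
    # blocking on to_excel in the main loop.
    from concurrent.futures import ThreadPoolExecutor
    import pandas as pd

    writer = ThreadPoolExecutor(max_workers=1)  # a single writer keeps snapshots ordered

    def save_snapshot(frame: pd.DataFrame, path: str) -> None:
        frame.to_csv(path, index=False)         # usually much faster than to_excel

    combined = pd.DataFrame({"project": ["a", "b"], "value": [1.0, 2.0]})  # stand-in data

    # Pass a copy so the main thread can keep appending while the write runs.
    writer.submit(save_snapshot, combined.copy(), "combined_snapshot.csv")

    writer.shutdown(wait=True)                  # in a script, wait for pending writes before exiting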

How big can batches get in Flink or Spark?

I am currently working on a framework for an analysis application for a large-scale experiment. The experiment involves about 40 instruments, each generating about 1 GB/s with nanosecond timestamps. The data is intended to be analysed in time chunks.
For the implementation I would like to know how big such a "chunk", aka batch, can get before Flink or Spark stops processing the data. It goes without saying that I intend to collect the processed data afterwards.
For live data analysis
In general, there is no hard limit on how much data you can process with these systems. It all depends on how many nodes you have and what kind of query you run.
Since it sounds as though you mainly want to aggregate per instrument over a given time window, your maximum scale-out is limited to 40; that's the maximum number of machines you could throw at the problem. The question then becomes how big your time chunks are and how complex the aggregations get. Assuming your aggregation requires all data of a window to be present, the system needs to hold on to 1 GB per second, so for a one-hour window it needs to hold at least 3.6 TB of data (1 GB/s × 3,600 s).
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows processing down significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes the bottleneck.
If you instead only need to calculate small aggregates (like sums or averages), main memory shouldn't become an issue (a windowed-aggregation sketch follows below).
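As a rough sketch of "aggregate per instrument on a time window", assuming PySpark Structured Streaming, a Parquet file source, and made-up column names and paths:

    # Sketch: 1-hour windowed aggregation per instrument with Structured Streaming.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col, count, window

    spark = SparkSession.builder.appName("instrument-windows").getOrCreate()

    events = (spark.readStream
              .schema("instrument STRING, ts TIMESTAMP, value DOUBLE")  # assumed schema
              .parquet("/data/incoming"))                               # placeholder path

    hourly = (events
              .withWatermark("ts", "10 minutes")          # bound state for late data
              .groupBy(window(col("ts"), "1 hour"), col("instrument"))
              .agg(avg("value").alias("mean_value"), count("*").alias("n_samples")))

    query = (hourly.writeStream
             .outputMode("append")
             .format("parquet")
             .option("path", "/data/hourly")              # placeholder paths
             .option("checkpointLocation", "/data/hourly_chk")
             .start())

    query.awaitTermination()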
For old data analysis
When analysing old data, the system can do batch processing and has many more options for handling the volume, including spilling to local disk. Spark usually shines if you can keep all the data of one window in main memory. If you are not certain about that, or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystems and which one suits you better. Which languages do you want to use? It feels like Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.

Options for running data extraction on a daily basis

I currently have an Excel-based data extraction method using Power Query and VBA (for documents with passwords). Ideally this would be scheduled to run once or twice a day.
My current solution involves setting up a spare laptop on the network that will run the extraction twice a day on its own. This works, but I am keen to understand the other options. The task itself seems to be quite a struggle for our standard hardware: it covers 6 network locations across 2 servers with around 30,000 rows and growing.
Any suggestions would be greatly appreciated
Thanks
If you are going to work with growing data and you are going to dedicate a laptop exclusively to the process, I would think about installing a database on that laptop (MySQL, for example). You could use Access too... but Access file corruption is a risk.
Download into this database all the data you need for your report, using incremental loads (only new, modified and deleted rows); a rough sketch follows below.
Then run the Excel report by extracting from this database on the same computer.
This should improve your solution's performance.
Your bigger problem is probably that you query ALL the data on each report generation.
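As a small sketch of the incremental idea, assuming a MySQL database on the laptop and a made-up staging table; where the changed rows come from (your Power Query output, a direct pull from the network files, etc.) is up to your existing extraction:

    # Sketch: upsert only new/changed rows into a local MySQL staging table.
    import mysql.connector  # pip install mysql-connector-python

    db = mysql.connector.connect(host="localhost", user="etl",
                                 password="secret", database="reporting")  # placeholders
    cur = db.cursor()

    # Pretend these came from the existing extraction, filtered to rows whose
    # modified timestamp is newer than the last successful run.
    changed_rows = [
        (101, "Widget A", 12.50, "2024-01-05 08:00:00"),
        (102, "Widget B", 7.25, "2024-01-05 08:10:00"),
    ]

    cur.executemany(
        "REPLACE INTO staging_rows (id, name, amount, modified_at) "
        "VALUES (%s, %s, %s, %s)",        # REPLACE keeps re-runs idempotent
        changed_rows)
    db.commit()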

Cassandra performance

I have read all these articles about how fast Cassandra can be; for example, a single-row read can take about 5 ms.
So far I didn't care too much about my website's speed, but as the site grew, some pages started to require quite a few queries; for example, one page reads 5 different tables and around 50 different rows. I noticed that it takes from 0.7 to 2.0 seconds, which is really slow, so I took a closer look and found out that a single query takes about 150 ms.
The table that I'm testing against is almost empty, so size can't be the issue. I have installed APC and it did not help.
I am using phpcassa, and Thrift comes along with this library.
Are these speeds normal, or is PHP just not fast enough? What could I do to improve this situation?
Note: I understand that running so many queries is too much and that Cassandra is optimized for writes, not reads, but in some situations I can't find a way to put the data in a single table/row.
EDIT: I have just found out about the optional C extension, which should improve performance, and indeed it does: a single-row read now takes 50-100 ms. That's a major improvement, though still far from those 5 ms.
EDIT 2: Sorry for not updating my question with more information; I have been very busy. I have actually solved this problem: 10 row reads from 4 different tables now take just 0.073158 s, and the average read time is just 0.005575 s, which is better than I expected to achieve. For those facing the same problem, these are the things I would suggest:
Install the optional C extension; steps to do that can be found here.
Install APC.
Make sure that the right Java version is installed; this could be causing a slowdown.
After installing all these things, don't just restart Apache; restart the whole server. I didn't do that at first and only noticed the major speed improvement after a server restart.
This still doesn't explain why a column family that is mostly empty performs worse than others. Next time you face that issue, you should tell us how you use the table and what kind of query gave you bad performance.
Just a guess: does this column family contain some frequently deleted data? The actual removal of deleted (tombstoned) values only happens after the GC grace period, which is 10 days by default.
So you might run into issues if you perform a lot of writes, reads and deletes on a lot of columns for the same keys.
