Optimizing inserts using Core Data - core-data

I have to insert a large amount of data, around 2000 records, into an SQLite database using Core Data. It takes 1 minute 30 seconds even after using caching and batch processing, and I still want to improve the insert time. Is there any way in Core Data to avoid blocking the UI while inserting this much data?

Chaithanya,
A standard solution to your problem is to perform the inserts on a background queue so the main (UI) queue stays responsive. The relevant sections of the Core Data Programming Guide have examples.
Andrew

Related

Node JS architecture to handle a huge amount of data returned by the DB in the best possible way

We have a NodeJs application and a SQL Server database, and there are a couple of badly written queries with a lot of inner joins.
Problem and Use Case
We have a use case of generating reports (15-20 thousand of them) in PDF/Excel format. The underlying query has a lot of joins and takes almost 8-9 seconds, because there is a huge amount of data: the 2-3 tables used in the query have a few million rows each.
For report generation we don't need real-time data; day-old or week-old data is fine.
What I'm looking for: a few suggestions for handling this situation in a better way.
We have a few options on the table:
Dump the data from multiple queries into a separate table and use that (we are planning to do this periodically with the help of a scheduler or something along similar lines).
Use a time series DB to store the query results, again with the help of a scheduler, and read from it at report-generation time.
Limit report generation to at most the last year of data.
Implement sharding in SQL Server.
And yes, improving the query itself is also something we are working on, but I think there is scope to make it better, and that's the reason I'm reaching out here for a few more suggestions.
Denormalization is a tried and true method of speeding up reporting. As Preben suggested, creating an indexed view in SQL Server is an efficient way to do this with minimal plumbing. Alternatively, it may be worth thinking about whether a data warehouse implementation is needed for future queries.
If this is a one-off issue, put together your indexed view (pay attention to the requirements) and move on. If this is the first of many reports that you need to optimize, think about creating a more substantial solution.
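To make the indexed-view suggestion a bit more concrete, here is a rough sketch of what such a view could look like. It is issued through JDBC from Java only to keep the snippet self-contained; the two SQL statements are what matter and could equally be run from SSMS or a migration script. The table and column names (dbo.Orders, dbo.OrderLines, etc.) are invented for illustration and would need to match your own schema, and the usual indexed-view SET options (ANSI_NULLS, QUOTED_IDENTIFIER, etc.) are assumed to be on, as they are by default for the Microsoft JDBC driver.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Sketch: pre-aggregate report data in an indexed view.
 * All object names are placeholders for illustration.
 */
public class CreateReportingView {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=Reports;encrypt=true;trustServerCertificate=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {

            // The view must be schema-bound, use two-part table names,
            // avoid outer joins, and include COUNT_BIG(*) because it aggregates.
            st.execute(
                "CREATE VIEW dbo.vw_CustomerOrderTotals WITH SCHEMABINDING AS " +
                "SELECT o.CustomerId, " +
                "       SUM(ISNULL(l.Quantity, 0)) AS TotalQuantity, " +  // ISNULL keeps SUM non-nullable
                "       COUNT_BIG(*) AS LineCount " +
                "FROM dbo.Orders o " +
                "JOIN dbo.OrderLines l ON l.OrderId = o.OrderId " +
                "GROUP BY o.CustomerId");

            // Materialize it: the first index on the view must be unique and clustered.
            st.execute(
                "CREATE UNIQUE CLUSTERED INDEX IX_vw_CustomerOrderTotals " +
                "ON dbo.vw_CustomerOrderTotals (CustomerId)");
        }
    }
}
```

Once the clustered index exists, the aggregation is maintained incrementally as the base tables change, so the report query can read the small materialized result instead of re-joining millions of rows each time.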

How to best stage large amounts of data with Hibernate/JPA?

How can I best stage large amounts of data for migration into our database using Hibernate efficiently? Performance when dealing with >25K records of 100+ columns each is not ideal.
Let me explain:
Background
I'm working for a large company that operates around the world. I've been tasked with leading a team (at least for backend) to create a full stack application that allows for various levels of management to perform their tasks. The current tech stack for backend is Java, Spring Boot, Hibernate, and PostgreSQL. Management would like to upload Excel files to our application and have our application parse them so we can refresh the data in our database.
Unfortunately, these files range from 25K to 50K records. We're aware that these Excel files are generated using SQL queries from Excel. However, we are not permitted to access the database with this data directly. The security is very tight and will not permit us access to any APIs, DB calls, etc. to work around Excel. Due to memory constraints and scalability concerns, we're using SAX parsing to keep a low footprint. Once we parse the Excel files, we're mapping them to a Hibernate entity that represents a staging table. Then we're migrating data from it to our other tables.
Currently, staging 25K records and migrating all the data to our other tables takes 15 minutes, which is unacceptable in the eyes of management, especially since this will need to be done on a daily basis.
Things I've tried
Enabling batch processing in Hibernate by following Vlad's answer here (see the sketch after this list). This knocked maybe 20 seconds off the overall time for staging.
Rewriting criteria and other queries for fetching data.
Reducing the amount of data to process (most fields are required, so the amount can't be reduced very much).
Indexing important columns in both the staging and destination tables. I'm doing the indexing as part of schema generation.
Optimizing the parts of the code that clean parsed data of imperfections.
I cannot post code due to an NDA.
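For reference (and without revealing anything covered by the NDA), the batching setup referenced in the first item above boils down to a couple of Hibernate properties plus a periodic flush/clear loop. A minimal sketch, assuming a recent Spring Boot (older versions import javax.persistence rather than jakarta.persistence) and a hypothetical StagingRecord entity:

```java
// application.properties (Spring Boot):
// spring.jpa.properties.hibernate.jdbc.batch_size=50
// spring.jpa.properties.hibernate.order_inserts=true
// spring.jpa.properties.hibernate.order_updates=true

import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.PersistenceContext;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import java.util.List;

@Service
public class StagingService {

    private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

    @PersistenceContext
    private EntityManager em;

    /** Persists parsed rows in JDBC batches, clearing the persistence
     *  context periodically so it never holds 25K managed entities at once. */
    @Transactional
    public void stage(List<StagingRecord> rows) {
        for (int i = 0; i < rows.size(); i++) {
            em.persist(rows.get(i));
            if ((i + 1) % BATCH_SIZE == 0) {
                em.flush();  // push the current batch to the database
                em.clear();  // detach entities to keep the session small
            }
        }
        em.flush();
        em.clear();
    }
}

/** Placeholder entity standing in for the real 100+ column staging row. */
@Entity
class StagingRecord {
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    Long id;
}
```

One caveat worth repeating from that answer: Hibernate silently disables JDBC batching when ids use IDENTITY generation, so the batch_size setting only pays off with a sequence-based generator.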
Summary of Constraints
This app needs strong support for generating reports on related data (one of the reasons we went with an RDBMS; the data also fits well into a relational model).
Must maintain a complete audit history of all records (currently using Hibernate Envers).
We have to get any new dependency/library approved by the company's cybersecurity team. This can result in days of lost productivity while we wait for approval, so it's not ideal to request new dependencies for the project.
There is no way of working around the Excel files at this time. An API call or a simple database query would be nice, but that's not an option for us for security reasons.
Scalability is a growing concern. Another team under this project has to parse an Excel file of 50K rows with 100 columns. All of this is only data for the USA; the project owner has said the company eventually wants to expand this app's management capabilities abroad.
My Thoughts
Purely regarding the staging issue, I think it's best to get rid of the Hibernate entities responsible for staging. I'll rewrite the migration of staged data into our live tables as SQL stored procedures. Despite it being vendor-specific (to my knowledge, anyway), I'll use Postgres' COPY command to do the heavy lifting for the large number of rows. I can rewrite the parser to direct data to a CSV or other delimited file instead. The only issue I have then is how to migrate the data into tables that use Hibernate sequences and generators. I haven't figured out how to synchronize Hibernate's sequences after a manual update to the database like that; it likes to throw errors about duplicate primary keys until it comes across an ID in the sequence that's not used. But I feel that's another question entirely.
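As a sketch of that direction (not the author's actual code): the PostgreSQL JDBC driver exposes COPY through its CopyManager API, and the sequence can be realigned with setval afterwards. The table, column and sequence names below are placeholders.

```java
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

/**
 * Sketch: bulk-load a delimited file into the staging table with COPY,
 * then bump the Hibernate-managed sequence past the highest id written.
 */
public class StagingCopyLoader {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/appdb";
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {

            // COPY streams the file straight into the table,
            // bypassing per-row INSERT statements entirely.
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            try (Reader csv = new FileReader("/tmp/staging.csv")) {
                long rows = copy.copyIn(
                    "COPY staging_employee (id, first_name, last_name, dept) " +
                    "FROM STDIN WITH (FORMAT csv, HEADER true)", csv);
                System.out.println("Copied " + rows + " rows");
            }

            // Keep Hibernate's sequence in step with the manually inserted ids,
            // otherwise the next persist() can collide with an existing key.
            try (Statement st = conn.createStatement()) {
                st.execute("SELECT setval('staging_employee_id_seq', " +
                           "(SELECT COALESCE(MAX(id), 1) FROM staging_employee))");
            }
        }
    }
}
```

If the entity's generator uses an allocationSize greater than 1 (a pooled optimizer), the sequence may need to be advanced a bit further, to the next block boundary, rather than exactly to MAX(id).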
Edit 1:
I should clarify: the 15 minutes is the total time, covering both staging and migration. Just the staging of the 25K records takes around 1:30 (a minute and a half), which also isn't ideal. I've run session metrics a few times and get roughly the following numbers for Spring Data persisting the 25K records:
2451000 nanoseconds spent acquiring 1 JDBC connection;
0 nanoseconds spent releasing 0 JDBC connections;
96970800 nanoseconds spent preparing 24851 JDBC statements;
9534006000 nanoseconds spent executing 24849 JDBC statements;
21666942900 nanoseconds spent executing 830 JDBC batches;
23513568700 nanoseconds spent executing 2 flushes (flushing a total of 49696 entities and 0 collections)
211588700 nanoseconds spent executing 1 partial-flushes (flushing a total of 24848 entities and 24848 collections)
For this specific case, I'm staging the roughly 25K entities and then using a stored procedure to move only the employee data from staging to the live tables (a small fraction of what makes up the 15 total minutes). That procedure seems to run instantly. But there is other data we have to derive via joins, GROUP BY statements, etc., and that appears to be costly. I'm just not sure why Spring Data takes so long to persist that many records when pure SQL would take significantly less time.

Load data from one table to another every 10 mins - Cassandra

We have a stream of data coming into Table A every 10 minutes, with no history preserved. The existing data has to be flushed to a new Table B every time data is loaded into Table A. Can this be done dynamically or be automated in Cassandra?
I can think of dumping Table A to a CSV file and loading it back into Table B every time Table A is flushed, but I would like to have something done at the database level itself.
Any ideas or suggestions appreciated.
Thanks,
Arun
For smaller amounts of data you could put this into cron:
https://dba.stackexchange.com/questions/58901/what-is-a-good-way-to-copy-data-from-one-cassandra-columnfamily-to-another-on-th
If the data is larger and you are running a newer version of Cassandra (3.8+), have a look at change data capture (CDC):
http://cassandra.apache.org/doc/latest/operating/cdc.html
https://issues.apache.org/jira/browse/CASSANDRA-8844
and then replay the data to the table that you need (via some sort of outside process: a script, an app, etc.).
Basically there are already some tools around like:
https://github.com/carloscm/cassandra-commitlog-extract
You could use the samples there to cover your use-case.
But for most use cases this is handled at the application level; writes are relatively cheap with Cassandra.
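To illustrate the application-level route, here is a rough sketch using the DataStax Java driver 4.x that could be fired from cron after each load of Table A. The keyspace, table and column names are invented for the example, and a full-table scan like this is only reasonable because the table is rewritten every 10 minutes and stays small.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

/**
 * Sketch: page through table_a and re-insert every row into table_b.
 * Intended to run from cron (or any scheduler) right after table_a is reloaded.
 */
public class CopyTableAToTableB {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .withKeyspace("my_keyspace")   // placeholder keyspace; defaults to localhost:9042
                .build()) {

            PreparedStatement insert = session.prepare(
                "INSERT INTO table_b (id, event_time, payload) VALUES (?, ?, ?)");

            // The driver pages the result set transparently, so rows are
            // streamed rather than held in memory all at once.
            for (Row row : session.execute("SELECT id, event_time, payload FROM table_a")) {
                BoundStatement bound = insert.bind(
                    row.getUuid("id"),
                    row.getInstant("event_time"),
                    row.getString("payload"));
                session.execute(bound);
            }
        }
    }
}
```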

Grails Excel import fails for huge data

I am using Grails 2.3.7 and the latest excel-import plugin (1.0.0). My requirement is to copy the contents of an Excel sheet as-is into the database. My database is MS SQL Server 2012.
I have the code working for the development version. It works fine when the number of records is small, up to maybe a few hundred.
But in production the Excel sheet will have as many as 50,000 rows and over 75 columns.
Initially I got an out-of-memory exception. I increased the heap size to as much as 8 GB, but now the thread keeps running on and on without terminating. No errors are generated.
Please note that this is a once-in-a-while operation, and it will be carried out by a person who will ensure it does not hamper other operations running in parallel. So there is no need to worry about the heavy load of this operation; I can afford to run it.
With up to 10,000 records and the same number of columns, the data gets copied in around 5 minutes. With 50,000 rows the time taken should ideally be around 5 times that, i.e. around 25 minutes, but the code kept running for more than an hour without terminating.
Any idea how to go about this issue? Any help is highly appreciated.
Loading 5 times more data into memory doesn't always take just 5 times longer. I suspect most of the 8 GB ends up in virtual memory, and virtual memory is very slow on your hardware. Try decreasing the heap, run some memory tests, and try to stay within physical RAM as much as possible.
In my experience this is a common problem with large batch operations in Grails. I think you have memory leaks that radically slow down the operation as it proceeds.
My solution has been to use an ETL tool such as Pentaho Kettle for the import, or chunk the import into manageable pieces. See this related question:
Insert 10,000,000+ rows in grails
Not technically an answer to your problem, but have you considered just using CSV instead of Excel?
From a user's point of view, saving as CSV before importing is not a lot of work.
I am loading, validating and saving CSVs with 200,000-300,000 rows without a hitch.
Just make sure you have the logic in a service so that a transaction is wrapped around it.
It may take a bit more code to decode the CSV, especially to translate values to the various primitive types, but it should be orders of magnitude faster.
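For what it's worth, the chunking idea looks roughly like this in plain Java (in Grails it would live in a Groovy service, and saveBatch would be your transactional service method; the names here are placeholders):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: read a large CSV in fixed-size chunks and hand each chunk to a
 * transactional save method, so memory use stays flat regardless of file size.
 */
public class ChunkedCsvImport {

    private static final int CHUNK_SIZE = 1_000;

    public static void importFile(Path csv) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(csv)) {
            reader.readLine();                        // skip the header row
            List<String[]> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line.split(",", -1));       // naive split; use a CSV library for quoted fields
                if (chunk.size() == CHUNK_SIZE) {
                    saveBatch(chunk);                 // wrap this in a transaction in your service
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                saveBatch(chunk);
            }
        }
    }

    private static void saveBatch(List<String[]> rows) {
        // Placeholder: convert each String[] to a domain object, save it, and
        // flush/clear the Hibernate session at the end of the batch so the
        // session cache doesn't grow without bound.
    }
}
```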

SQLite DB building speed-up

I'm going to use SQLite to save a lot of data in a real-time environment.
To avoid having to find disk space (or move pages in the DB file) for new data written in real time, I want to build the tables in advance and insert into them the largest value any cell can hold (according to its type), so that during the real-time run there will only be UPDATE queries.
The building and inserting of data is done with journal_mode=WAL.
I have 6 different DB files that I have to build. Every DB has between 10 and 200 tables, and all the tables in all the DBs look the same:
ID | TimeStart | Float data | Float data | Float data
--------------------------------------------------------------------------------
The difference is that some tables have 100,000 rows and some have 500,000 rows.
These DBs are built on an SD card with an ARM9 CPU (on Linux), so building them takes a lot of time; I am talking about days.
How can I speed up the building? Are there any PRAGMAs or tricks I can use? Can I copy a ready-made table?
It is important to mention that the robustness of the DB is not important during the building process: speed matters much more to me than the possibility of corruption.
I concur with Graham Borland's answer, but also: if you have any indexes, I'd advise you not to create them until after you've added all the data to the DB. If you add them beforehand, the indexes update themselves every time you insert a new record, which slows things down immeasurably when you insert a very large number of rows in quick succession.
Read this; it is very relevant. The answers by Graham Borland and Nick Shaw are also very relevant and form part of the advice in the linked document.
Pre-generate the database on your host machine, and copy it when you install your application to the target device.
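To make the PRAGMA side of this concrete, here is a hedged sketch of a bulk build that trades durability for speed, in line with the answers above: journaling and syncing off, one big transaction per table, and secondary indexes created only at the end. It is shown through the xerial sqlite-jdbc driver purely for brevity; the PRAGMA statements and the overall ordering are exactly the same from the C API. The file path, table layout and row count follow the question, while the index is a made-up example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

/**
 * Sketch: build one table as fast as possible on slow storage,
 * accepting that a crash mid-build would corrupt the file.
 */
public class FastDbBuild {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:/mnt/sd/data1.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("PRAGMA journal_mode = OFF");   // no rollback journal during the build
                st.execute("PRAGMA synchronous = OFF");    // don't fsync after every write
                st.execute("PRAGMA temp_store = MEMORY");
                st.execute("PRAGMA cache_size = -16000");  // ~16 MB page cache; adjust to the device's RAM
                st.execute("CREATE TABLE IF NOT EXISTS t1 (" +
                           "id INTEGER PRIMARY KEY, time_start INTEGER, " +
                           "f1 REAL, f2 REAL, f3 REAL)");
            }

            conn.setAutoCommit(false);                     // one big transaction, not one per row
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO t1 (id, time_start, f1, f2, f3) VALUES (?, ?, ?, ?, ?)")) {
                for (int i = 0; i < 100_000; i++) {        // 100,000-row table per the question
                    ps.setInt(1, i);
                    ps.setLong(2, 0L);
                    ps.setDouble(3, Double.MAX_VALUE);     // pre-fill with the widest value of the type
                    ps.setDouble(4, Double.MAX_VALUE);
                    ps.setDouble(5, Double.MAX_VALUE);
                    ps.addBatch();
                    if (i % 10_000 == 0) ps.executeBatch();
                }
                ps.executeBatch();
            }
            conn.commit();

            // Create any secondary indexes only after the bulk insert, per the answer above.
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE INDEX IF NOT EXISTS idx_t1_time ON t1(time_start)");
            }
            conn.commit();
        }
    }
}
```

Combined with the other answer here, the fastest route is to run a build like this once on a desktop machine and simply copy the finished DB files onto the SD card, rather than building them on the ARM9 at all.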
