How to best stage large amounts of data with Hibernate/JPA?


How can I efficiently stage large amounts of data for migration into our database using Hibernate? Performance with more than 25K records of 100+ columns each is not ideal.
Let me explain:
Background
I'm working for a large company that operates around the world. I've been tasked with leading a team (at least for backend) to create a full stack application that allows for various levels of management to perform their tasks. The current tech stack for backend is Java, Spring Boot, Hibernate, and PostgreSQL. Management would like to upload Excel files to our application and have our application parse them so we can refresh the data in our database.
Unfortunately, these files range from 25K to 50K records. We're aware that these Excel files are generated using SQL queries from Excel. However, we are not permitted to access the database with this data directly. The security is very tight and will not permit us access to any APIs, DB calls, etc. to work around Excel. Due to memory constraints and scalability concerns, we're using SAX parsing to keep a low footprint. Once we parse the Excel files, we're mapping them to a Hibernate entity that represents a staging table. Then we're migrating data from it to our other tables.
Currently, staging 25K records and migrating all the data to our other tables takes 15 minutes, which is unacceptable in the eyes of management, especially since this will need to be done on a daily basis.
Things I've tried
Enabling batch processing in Hibernate by following Vlad's answer here. This knocked maybe 20 seconds off the overall time for staging.
Rewriting criteria and other queries for fetching data.
Reducing amount of data to process (most fields are required so the amount can't be too heavily reduced).
Indexing important columns in both the staging and destination tables. I'm doing the indexing as part of schema generation.
Optimizing parts of the code that clean parsed data of imperfections.
I cannot post code due to an NDA.
Summary of Constraints
This app needs strong support for generating reports on related data (one of the reasons we went with RDBMS. Also, the data fits well into a relational model).
Must maintain a complete audit history of all records (currently using Hibernate Envers).
We have to approve any new dependency/library through the company's cybersecurity team. This can result in days of lost production while we wait for approval. It's not ideal to request new dependencies for the project.
There are no ways of working around the Excel files at this time. An API call or simple database query would be nice, but that's not an option for us for security reasons.
Scalability is a growing concern. Another team under this project has to parse an Excel file of 50K rows with 100 columns. All of this is only data for the USA. The project owner has said the company eventually wants to expand this app's management capabilities abroad.
My Thoughts
Purely regarding the staging issue, I think it's best to get rid of the Hibernate entities responsible for staging. I'll rewrite the migration of staged data into our live tables in SQL using stored procedures. Despite it being vendor-specific (to my knowledge, anyway), I'll use Postgres' COPY command to do the heavy lifting with the large number of rows. I can rewrite the parser to direct data to a CSV or other delimited file instead. The only issue I have then is how to migrate the data to tables that use Hibernate sequences and generators. I haven't figured out how to synchronize Hibernate's sequences after a manual update to the database like that. It likes to throw errors about duplicate primary keys until it comes across an ID in the sequence that's not used. But I feel that's another question entirely.
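As a rough illustration of the "parser writes CSV, COPY loads it" idea, here is a minimal sketch of formatting one parsed row as a CSV line that COPY can read in csv format. The class name CopyCsv, the table name staging_employee, and the sample values are assumptions made up for this example; nothing here comes from the actual project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Formats parsed spreadsheet rows as CSV lines for PostgreSQL COPY in csv format. */
final class CopyCsv {

    // Hypothetical staging table name, used only for illustration.
    static final String COPY_SQL = "COPY staging_employee FROM STDIN WITH (FORMAT csv)";

    /** Quotes a field only when it is empty or contains a comma, quote, or line break. */
    static String escape(String field) {
        if (field == null) {
            return ""; // an unquoted empty field is read as NULL in csv format by default
        }
        boolean needsQuotes = field.isEmpty()
                || field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        // Embedded quotes are doubled; a quoted empty field stays an empty string, not NULL.
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    /** Joins the escaped fields of one row with commas. */
    static String toCsvLine(List<String> row) {
        return row.stream().map(CopyCsv::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // prints: 42,"O'Brien, Pat",,"said ""hi"""
        System.out.println(toCsvLine(Arrays.asList("42", "O'Brien, Pat", null, "said \"hi\"")));
    }
}
```

If the PostgreSQL JDBC driver is already on the classpath, lines like these can be streamed to the server with org.postgresql.copy.CopyManager.copyIn(String sql, Reader from), so no intermediate file is strictly required. For the sequence problem, PostgreSQL's setval function (for example SELECT setval('some_seq', (SELECT max(id) FROM some_table)), with placeholder names) moves a sequence past the loaded IDs; whether that is enough depends on how the Hibernate generator allocates IDs, which the question does not say.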
Edit 1:
I should clarify. The 15 minutes is the total time for all of staging. This includes staging and migration. Just the staging of the 25K records takes around 1:30, which also isn't ideal. I've run session metrics a few times and get around the following numbers for Spring Data persisting the 25K records:
2451000 nanoseconds spent acquiring 1 JDBC connection;
0 nanoseconds spent releasing 0 JDBC connections;
96970800 nanoseconds spent preparing 24851 JDBC statements;
9534006000 nanoseconds spent executing 24849 JDBC statements;
21666942900 nanoseconds spent executing 830 JDBC statements;
23513568700 nanoseconds spent executing 2 flushes (flushing a total of 49696 entities and 0 collections)
211588700 nanoseconds spent executing 1 partial-flushes (flushing a total of 24848 entities and 24848 collections)
For this specific case, I'm staging the roughly 25K entities and then using a stored procedure to move only employee data from staging to live tables (a small fraction of what makes up the 15 total minutes). That procedure seems to run instantly. But there's other data that we have to determine via joins, group by statements, etc., which appear to be costly. I'm just not sure why it's taking Spring Data so long to persist that many records when it would take pure SQL significantly less.
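The session metrics above are easier to read in seconds. A small sketch that converts the three largest figures (the class and method names are made up for this example; the numbers are copied from the metrics):

```java
/** Converts the Hibernate session metrics quoted above from nanoseconds into readable units. */
final class SessionMetricsSummary {

    static double toSeconds(long nanos) {
        return nanos / 1_000_000_000.0;
    }

    static double millisPerCall(long totalNanos, long calls) {
        return totalNanos / 1_000_000.0 / calls;
    }

    public static void main(String[] args) {
        // Figures copied from the metrics above.
        long firstExecuteNanos = 9_534_006_000L;   // "executing 24849 JDBC statements"
        long secondExecuteNanos = 21_666_942_900L; // "executing 830 JDBC statements"
        long flushNanos = 23_513_568_700L;         // "executing 2 flushes"

        System.out.println("first execute line, seconds:  " + toSeconds(firstExecuteNanos));
        System.out.println("  ms per call: " + millisPerCall(firstExecuteNanos, 24_849));
        System.out.println("second execute line, seconds: " + toSeconds(secondExecuteNanos));
        System.out.println("  ms per call: " + millisPerCall(secondExecuteNanos, 830));
        System.out.println("flushes, seconds:             " + toSeconds(flushNanos));
    }
}
```

That works out to roughly 9.5 s for the 24849 statements (about 0.38 ms each, and close to one statement per staged entity), roughly 21.7 s for the 830 (about 26 ms each), and roughly 23.5 s spent in the two flushes.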

Related

Node JS architecture to handle huge amount of Data returned by DB in better possible way

We have a Node.js application and a SQL Server database, and there are a couple of badly written queries with a lot of inner joins.
Problem and Use Case
Our use case is generating reports (15-20 thousand of them) in PDF/Excel format. The query behind them has a lot of joins and takes almost 8-9 seconds because of the amount of data: 2-3 of the tables used in the query have a few million rows each.
For report generation we don't need real-time data; data that is a day or a week old is fine.
What I'm looking for: a few suggestions to handle this situation in the best possible way.
We have a few options on the table:
Dump data from multiple queries into a separate table and use that (we are planning to do this periodically with the help of a scheduler or something along similar lines).
Use a time series DB to store the result of the query with the help of a scheduler, and use it at the time of report generation.
Limit report generation to at most the last 1 year of data.
Implement sharding in SQL Server.
And yes, improving the query is also something we are working on, but I think there is scope to make it better, and that's the reason I'm reaching out here to get a few more suggestions.

Denormalization is a tried and true method of speeding up reporting. As Preben suggested, creating an indexed view in SQL Server is an efficient way to do this with minimal plumbing. Alternatively, it may be worth thinking about whether a data warehouse implementation is needed for future queries.
If this is a one-off issue, put together your indexed view (pay attention to the requirements), and move on. If this is the first of many reports that you need to optimize, think about creating a more substantial solution.

Redis and Postgresql synchronization (online users status)

In a Node.js application I have to maintain a "who was online in the last N minutes" state. Since there are potentially thousands of online users, I decided, for performance reasons, not to update my PostgreSQL user table for this task.
I chose to use Redis to manage the online status. It's very easy and efficient.
But now I want to make complex queries to the user table, sorted by the online status.
I was thinking of creating an online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
After the table is filled, will the next query referencing the online table take a big hit from index creation or loading?
Does anyone know a better solution?

I had to solve almost this exact same issue, but I took a different approach because I didn't like the issues caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (ZeroMQ in my case, but any queueing system should work, as should a stream processing facility like Amazon Kinesis, the alternative I looked at). I then inserted the data in batches into a second table (not the users table). I don't delete or update that table; only inserts and queries are allowed.
Doing things this way preserved the ability to do joins between the last online data and the users table without bogging down the database or creating many updates on the users table. It has the side effect of giving us a lot of other useful data.
One thing I have thought about when considering other solutions to this problem is that your users table is transactional data (OLTP), while the latest online information is really analytics data (OLAP). So if you have a data warehouse, data lake, big data, or whatever term of the week you want to use for storing this type of data and querying against it, that may be a better solution.
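The "collect events, then insert in batches" step from this answer can be sketched roughly as below. This is only an in-memory illustration: BatchingBuffer and its sink callback are hypothetical names, and the sink stands in for whatever performs one batched INSERT into the append-only table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects events and hands them to a sink in fixed-size batches. */
final class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // hypothetical: e.g. one batched INSERT per call
    private final List<T> pending = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one event and flushes automatically once a full batch is available. */
    synchronized void add(T event) {
        pending.add(event);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; call this on a timer or at shutdown for the last partial batch. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // pass a copy so the sink can keep the list
        pending.clear();
    }

    public static void main(String[] args) {
        BatchingBuffer<String> buffer =
                new BatchingBuffer<>(2, batch -> System.out.println("insert " + batch));
        for (String user : new String[] {"u1", "u2", "u3", "u4", "u5"}) {
            buffer.add(user);
        }
        buffer.flush(); // writes the remaining partial batch
    }
}
```

Because the target table only ever receives inserts, each batch can be written independently and the users table is never touched.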

Alternative of Cassandra for storing User data with high IO

We are looking for a technology stack which will have the following criteria.
We will have around 10 million customers.
Each customer will have around 20MB+ of data.
Each user's data will be updated every day.
We need to store the data for more than six months.
We may need to query the data at any time within that six-month span.
Currently we are thinking of using Cassandra, but since the storage per node in Cassandra should be less than 3TB, we are looking for other alternatives to use with or without Cassandra.

Well, I don't know if my suggestion applies to your case. We had a similar case with one of our products. A blob field was created to store binary data, such as PDF documents, which made the database grow considerably.
Our solution was to create a second database as a repository for records older than one year. On the application server there's a service running which:
1) Copies the records older than one year, from specific tables, to this second database;
2) Deletes those records from the main database once we have a copy on the other side;
3) Directs queries that need data older than one year to this second database.
Sure, we had to change some code to adapt to this situation, but it has been running well so far.
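The routing rule behind those three steps can be written down in a few lines. This is a minimal sketch, assuming the cutoff is exactly one year before the current date; ArchiveRouter and Target are names invented for the example, not part of the system described above.

```java
import java.time.LocalDate;

/** Decides whether a record belongs in the main database or in the archive database. */
final class ArchiveRouter {

    enum Target { MAIN, ARCHIVE }

    /** Records dated strictly before (today minus one year) are treated as archived. */
    static Target targetFor(LocalDate recordDate, LocalDate today) {
        LocalDate cutoff = today.minusYears(1);
        return recordDate.isBefore(cutoff) ? Target.ARCHIVE : Target.MAIN;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2022, 6, 1);
        System.out.println(targetFor(LocalDate.of(2020, 1, 1), today)); // ARCHIVE
        System.out.println(targetFor(LocalDate.of(2022, 1, 1), today)); // MAIN
    }
}
```

The same predicate can drive both the copy-then-delete job and the choice of connection for a query, which keeps the two from disagreeing about where a record lives.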

You can try ScyllaDB. It's a C++ reimplementation of Cassandra at 10x the speed. Scylla supports 10TB/node and there are examples of larger amounts per node. Proper disclosure - I work there but am speaking from experience.
You can definitely consider storing just the metadata in the database and the blobs on separate nodes outside it, but that's complex, and Scylla can store it all together. A similar system is already in production, and we hope that user will eventually open source it.

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?

Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that DataStax has built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

Is there a way to split SSAS dimension in 2 without breaking references?

I have recently gotten involved with an old BI solution (SQL Server, SSIS, SSAS). One dimension is very bloated, with 50ish attributes, and it processes slowly. I want to break it down into at least 2-3 dimensions to reduce processing time. My concern is that all pivot tables and other front-end reporting utilising these attributes will break and need redesigning. We are a big company, and tons and tons of Excel sheets etc. currently use this dimension.
Is there a way to split the dimension while maintaining references and filters to the affected attributes?

I would rethink this approach - I would expect splitting a dimension to increase processing time, not reduce it. SQL will need to run 2-3 queries to get the data (instead of 1) and SSAS will need to build and check its dimension-fact relationships 2-3 times (instead of 1).
I would have a look at whether the time is being spent running the SQL queries to gather the info, or in SSAS's processing of that data. You can get a rough feel by watching Task Manager while that dimension is being processed - if the SQL queries are efficient then the sqlserver.exe process should only spike up in CPU briefly, before msmdsrv.exe takes over.
