Data Factory V2 Copy Data activities and Data Flow ETL - Azure

I have 5 questions about the Data Factory V2 Copy Data activity and Data Flow.
Question 1
Should I use a Parquet file or SQL Server with 500 DTU? I want to transfer data quickly into a staging table or a staging Parquet file.
Question 2
For the Copy Data activity, should I set the data integration units to Auto or to 32?
Question 3
What is the benefit of the degree of copy parallelism, and should I use Auto or 32? Again, I want to transfer everything as quickly as possible; I have around 50 million rows every day.
Question 4
For the Data Flow integration runtime, should I use General Purpose, Compute Optimized, or Memory Optimized? As mentioned, we have 50 million rows every day, so we want to process the data as quickly as possible, and as cheaply as we can, in Data Flow.
Question 5
Is a bulk insert better in the Data Factory and Data Flow sink?

I think you have too many questions about too many topics, the answers to which will depend entirely on your desired end result. Even so, I will do my best to briefly address your situation.
If you are dealing with large volume and/or frequency, Data Flow (ADFDF) would probably be better than the Copy activity. ADFDF runs on Spark via Databricks and is built from the ground up to run parallel workloads. Parquet is also built to support parallel workloads. If your SQL is an Azure Synapse (SQLDW) instance, then ADFDF will use PolyBase to manage the upload, which is very fast because it is also built for parallel workloads. I'm not sure how this differs for Azure SQL, and there is no way to tell you what DTU level will work best for your task.
If having Parquet as your end result is acceptable, then that would probably be the easiest and least expensive to configure, since it is just blob storage. ADFDF works just fine with Parquet, as either Source or Sink. For ETL workloads, Compute Optimized is the most likely IR configuration. The good news is it is the least expensive of the three. The bad news is I have no way to know what the core count should be; you'll just have to find out through trial and error. 50 million rows may sound like a lot, but it really depends on the row size (byte count and column count) and frequency. If the process runs many times a day, then you can include a "Time to live" value in the IR configuration. This will keep the cluster warm while it waits for another job, thus potentially reducing startup time (but incurring more run time cost).

Related

How to best stage large amounts of data with Hibernate/JPA?

How can I best stage large amounts of data for migration into our database using Hibernate efficiently? Performance when dealing with >25K records that have 100+ columns is not ideal.
Let me explain:
Background
I'm working for a large company that operates around the world. I've been tasked with leading a team (at least for backend) to create a full stack application that allows for various levels of management to perform their tasks. The current tech stack for backend is Java, Spring Boot, Hibernate, and PostgreSQL. Management would like to upload Excel files to our application and have our application parse them so we can refresh the data in our database.
Unfortunately, these files range from 25K to 50K records. We're aware that these Excel files are generated using SQL queries from Excel. However, we are not permitted to access the database with this data directly. The security is very tight and will not permit us access to any APIs, DB calls, etc. to work around Excel. Due to memory constraints and scalability concerns, we're using SAX parsing to keep a low footprint. Once we parse the Excel files, we're mapping them to a Hibernate entity that represents a staging table. Then we're migrating data from it to our other tables.
Currently, staging 25K records and migrating all the data to our other tables takes 15 minutes, which is unacceptable in the eyes of management, especially since this will need to be done on a daily basis.
Things I've tried
Enabling batch processing in Hibernate by following Vlad's answer here. This knocked maybe 20 seconds off the overall time for staging.
Rewriting criteria and other queries for fetching data.
Reducing the amount of data to process (most fields are required, so the amount can't be reduced very much).
Indexing important columns in both the staging and destination tables. I'm doing the indexing as part of schema generation.
Optimizing the parts of the code that clean parsed data of imperfections.
I cannot post code due to an NDA.
Summary of Constraints
This app needs strong support for generating reports on related data (one of the reasons we went with an RDBMS; the data also fits well into a relational model).
Must maintain a complete audit history of all records (currently using Hibernate Envers).
We have to approve any new dependency/library through the company's cybersecurity team. This can result in days of lost production while we wait for approval. It's not ideal to request new dependencies for the project.
There are no ways of working around the Excel files at this time. An API call or simple database query would be nice, but that's not an option for us for security reasons.
Scalability is a growing concern. Another team under this project has to parse an Excel file of 50K rows with 100 columns. All of this is only data for the USA. The project owner has said the company eventually wants to expand this app's management capabilities abroad.
My Thoughts
Purely regarding the staging issue, I think it's best to get rid of the Hibernate entities responsible for staging. I'll rewrite the migration of staged data into our live tables as SQL stored procedures. Despite it being vendor-specific (to my knowledge, anyway), I'll use Postgres' COPY command to do the heavy lifting for the large number of rows. I can rewrite the parser to direct data to a CSV or other delimited file instead. The only issue I have then is how to migrate the data to tables that use Hibernate sequences and generators. I haven't figured out how to synchronize Hibernate's sequences after a manual update to the database like that. It likes to throw errors about duplicate primary keys until it comes across an ID in the sequence that's not used. But I feel that's another question entirely.
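Not a final implementation, just a minimal sketch of that COPY-plus-sequence-sync idea in Python with psycopg2 (hypothetical connection string, table, file, and sequence names); the key points are COPY for the bulk load and setval to realign the sequence Hibernate draws IDs from:

```python
import psycopg2  # assumed driver; the same two statements work from any SQL client

# Hypothetical connection string, table, CSV file, and sequence names.
conn = psycopg2.connect("dbname=appdb user=app")
try:
    with conn.cursor() as cur:
        # Bulk-load the delimited file produced by the parser straight into the staging table.
        with open("staging_records.csv") as f:
            cur.copy_expert(
                "COPY staging_records FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )
        # Rows were inserted outside Hibernate, so realign the backing sequence;
        # otherwise Hibernate keeps handing out IDs that are already taken.
        cur.execute(
            "SELECT setval('staging_records_id_seq',"
            " (SELECT COALESCE(MAX(id), 1) FROM staging_records))"
        )
    conn.commit()
finally:
    conn.close()
```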
Edit 1:
I should clarify: the 15 minutes is the total time for the whole process, including both staging and migration. Just the staging of the 25K records takes around 1:30, which also isn't ideal. I've run session metrics a few times and get around the following numbers for Spring Data persisting the 25K records:
2451000 nanoseconds spent acquiring 1 JDBC connection;
0 nanoseconds spent releasing 0 JDBC connections;
96970800 nanoseconds spent preparing 24851 JDBC statements;
9534006000 nanoseconds spent executing 24849 JDBC statements;
21666942900 nanoseconds spent executing 830 JDBC statements;
23513568700 nanoseconds spent executing 2 flushes (flushing a total of 49696 entities and 0 collections)
211588700 nanoseconds spent executing 1 partial-flushes (flushing a total of 24848 entities and 24848 collections)
For this specific case, I'm staging the roughly 25K entities and then using a stored procedure to move only employee data from staging to live tables (a small fraction of what makes up the 15 total minutes). That procedure seems to run instantly. But there's other data that we have to determine via joins, group by statements, etc., which appear to be costly. I'm just not sure why it's taking Spring Data so long to persist that many records when it would take pure SQL significantly less.

Differences between BigQuery BQ.insert_rows_json and BQ.load_from_json?

I want to stream data into BigQuery, and I was thinking of using PubSub + Cloud Functions, since there is no transformation needed (for now, at least) and using Cloud Dataflow feels like overkill for just inserting rows into a table. Am I correct?
The data is streamed from a GCP VM using a Python script into PubSub and it has the following format:
{'SEGMENT':'datetime':'2020-12-05 11:25:05.64684','values':(2568.025,2567.03)}
The BigQuery schema is datetime:timestamp, value_A: float, value_B: float.
My questions with all this are:
a) Do I need to push this into BigQuery as a JSON/dictionary with all values as strings, or do the values have to match the data types of the table?
b) What's the difference between using BQ.insert_rows_json and BQ.load_table_from_json and which one should I use for this task?
EDIT:
What I'm trying to get is actually market data for some assets, say around 28 instruments, capturing all their ticks. On an average day there are ~60k ticks per instrument, so we are talking about ~33.6M invocations per month. What is needed (for now) is to insert them into a table for further analysis. I'm currently not sure whether real streaming should be performed or batch loads. Since the project is still only doing analysis, I don't feel that Dataflow is needed, but PubSub should be used since it makes it easier to scale up to Dataflow when the time comes. This is my first implementation of a streaming pipeline and I'm using everything I've learned through courses and reading. Please correct me if I'm taking the wrong approach :).
What I would absolutely love to do is, for example, perform another insert into another table when the price difference between one tick and the n-th tick is, for example, 10. For this, should I use Dataflow, or is the Cloud Function approach still valid? Because this is like a trigger condition. Basically, the trigger would be something like:
if price difference >= 10:
process all these ticks
insert the results in this table
But I'm unsure how to implement this trigger.
In addition to the great answer from Marton (Pentium10):
a) You can stream JSON into BigQuery, but it must be VALID JSON; your example isn't. As for the types, there is automatic coercion/conversion according to your schema. You can see this here.
b) The load job loads a file from GCS or content that you put in the request. The batch is asynchronous and can take seconds or minutes. In addition, you are limited to 1,500 loads per day and per table, so 1 per minute works (there are 1,440 minutes per day). There are several interesting aspects of the load job.
Firstly, it's free!
Your data is immediately loaded into the correct partition and is immediately queryable in that partition.
If the load fails, no data is inserted, so it's easy to replay a file without ending up with duplicated values.
By contrast, the streaming job inserts the data into BigQuery in real time. It's interesting when you have real-time constraints (especially for visualization, anomaly detection, ...). But there are some downsides:
You are limited to 500k rows per second (in EU and US), 100k rows per second in other regions, and 1 GB max per second.
The data isn't immediately in the partition; it sits in a buffer named UNPARTITIONED for a while, or until that buffer is full. So you have to take this peculiarity into account when you build and test your real-time application.
It's not free. The cheapest region is $0.05 per GB.
Now that you are aware of this, ask yourself about your use case.
If you need real time (less than 2 minutes of delay), no doubt, streaming is for you.
If you have only a few GB per month, streaming is also the easiest solution, for a few dollars.
If you have a huge volume of data (more than 1 GB per second), BigQuery isn't the right service; consider Bigtable (which you can query from BigQuery as a federated table).
If you have a significant volume of data (1 or 2 GB per minute) and your use case requires data freshness at the minute level or more, you can consider a special design:
Create a PubSub pull subscription.
Create an HTTP-triggered Cloud Function (or a Cloud Run service) that pulls the subscription for about 1 minute, submits the pulled content to BigQuery as a load job (no file needed; you can post in-memory content directly to BigQuery), and then exits gracefully. A sketch of such a function follows after this list.
Create a Cloud Scheduler job that triggers your service every minute.
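To make that design concrete, here is a minimal sketch in Python (hypothetical project, subscription, and table names; not code from the original answer): it drains the pull subscription, submits the accumulated rows to BigQuery as a single free load job with load_table_from_json, and only acknowledges the messages once the job has succeeded.

```python
import json

from google.cloud import bigquery, pubsub_v1

# Hypothetical identifiers; replace with your own subscription and table.
SUBSCRIPTION = "projects/my-project/subscriptions/ticks-pull"
TABLE = "my-project.market_data.ticks"


def drain_and_load(request):
    """HTTP-triggered entry point: pull pending messages, then batch-load them."""
    subscriber = pubsub_v1.SubscriberClient()
    bq = bigquery.Client()

    rows, ack_ids = [], []
    while True:
        response = subscriber.pull(
            subscription=SUBSCRIPTION, max_messages=1000, timeout=10.0
        )
        if not response.received_messages:
            break
        for msg in response.received_messages:
            rows.append(json.loads(msg.message.data))  # one JSON row per message
            ack_ids.append(msg.ack_id)

    if rows:
        # A load job built from in-memory rows: no GCS file needed, and it is free.
        job = bq.load_table_from_json(rows, TABLE)
        job.result()  # wait for completion before acknowledging
        subscriber.acknowledge(subscription=SUBSCRIPTION, ack_ids=ack_ids)

    return f"loaded {len(rows)} rows"
```

Scheduled every minute by Cloud Scheduler, this gives roughly minute-level freshness without paying for streaming inserts.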
Edit 1:
The cost shouldn't drive your use case.
If, for now, it's only for analytics, you can simply trigger your job once per day to pull the full subscription. With your metrics, 60k ticks * 28 instruments * 100 bytes (24 bytes of data plus overhead), you have only 168 MB. You can hold this in Cloud Functions or Cloud Run memory and perform a load job.
Streaming is really important for real time!
Dataflow, in streaming mode, will cost you at least $20 per month (1 small worker of type n1-standard-1), much more than 1.5 GB of streaming inserts into BigQuery with Cloud Functions.
Finally, about your smart trigger to choose between streaming and batch insert: it's not really possible; you would have to redesign the data ingestion if you change your logic. But above all, do that only if your use case requires it!!
To answer your questions:
a) You need to push to BigQuery using the formats the library accepts, usually a collection or a JSON document formatted to match the table's definition.
b) To add data to BigQuery you can either stream data or load a file.
For your example you need to stream data, so use the streaming API methods of the insert_rows* family.
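As a rough illustration with the Python client (hypothetical table ID; the row keys follow the schema described in the question):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.market_data.ticks"  # hypothetical table ID

rows = [
    # Keys must match the table's column names; values are coerced to the schema types.
    {"datetime": "2020-12-05 11:25:05.64684", "value_A": 2568.025, "value_B": 2567.03},
]

errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    print("Some rows were rejected:", errors)
```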

Azure Data Factory and Data Flow taking too much time to process data from staging to database

So I have a data factory which runs every day. It selects around 80M records from an on-premises Oracle database and moves them to a Parquet file, which takes around 2 hours. I want to speed up this process... as well as the data flow process which inserts and updates data in the database.
parquet file setting
The next step, from the Parquet file, calls the data flow which moves the data as an upsert to the database, but this is also taking too much time.
data flow Setting
Let me know which compute type to use for the data flow:
Memory Optimized
Compute Optimized
General Purpose
After Round Robin Update
Sink Time
Can you open the monitoring detailed execution plan for the data flow? Click on each stage in your data flow and look to see where the bulk of the time is being spent. You should see at the top of the view how much time was spent setting up the compute environment and how much time was taken to read your source, and also check the total write time on your sinks.
I have some examples of how to view and optimize this here.
Well, I would surmise that 45 min to stuff 85M rows into a SQL DB is not horrible. You can break the task down into chunks and see what's taking the longest time to complete. Do you have access to Databricks? I do a lot of pre-processing with Databricks, and I have found Spark to be super-super-fast!! If you can pre-process in Databricks and push everything into your SQL world, you may have an optimal solution there.
As per the documentation - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-sink - can you try modifying your partition settings under the Optimize tab of your Sink?
I faced a similar issue with the default partitioning setting, where the data load was taking close to 30+ minutes for 1M records; after changing the partition strategy to round robin and setting the number of partitions to 5 (in my case), the load happens in less than a minute.
Try experimenting with both the Source partition settings (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-source) and the Sink partition settings to come up with the optimum strategy. That should improve the data load time.

Real-time analytics time series database

I'm looking for a distributed time series database which is free to use in a clustered setup and production ready, and it has to fit well into the Hadoop ecosystem.
I have an IoT project with around 150k sensors which send data every 10 minutes or every hour, so I'm looking at time series databases with useful functions like aggregating metrics, downsampling, and pre-aggregation (roll-ups). I have found a comparison in this Google spreadsheet: time series database comparative.
I have tested OpenTSDB; the data model of the HBase row key really suits my use case, but the functions that still need to be developed for my use case are:
aggregating multiple metrics
doing rollups
I have also tested KairosDB, which is a fork of OpenTSDB with a richer API that uses Cassandra as backend storage. The thing is that their API does everything I'm looking for: downsampling, rollups, querying multiple metrics, and a lot more.
I have tested Warp10.io and Apache Phoenix, which I have read (Hortonworks link) will be used by Ambari Metrics, so I assume it's well suited for time series data too.
My question is: as of now, what's the best time series database for real-time analytics with request performance under 1 s for all types of requests? Example: we want the average of the aggregated data sent by 50 sensors over a period of 5 years, resampled by month.
I assume such requests can't be done in under 1 s, so I believe we need some rollup/pre-aggregation mechanism, but I'm not so sure, because there are a lot of tools out there and I can't decide which one suits my needs best.
I'm the lead for Warp 10 so my answer can be considered opinionated.
Given your projected data volume, 150k sensors sending data every 10 minutes, that is a mean of 250 datapoints per second and fewer than 40 billion datapoints over a period of 5 years. Such a volume can easily fit on a simple Warp 10 standalone instance, and if you later need a larger infrastructure you can migrate to a distributed Warp 10 deployment on Hadoop.
In terms of requests, if your data is already resampled, fetching 5 years of monthly data for 50 sensors is only 3,000 datapoints; Warp 10 can do that in far less than 1 s, and doing the automatic rollups is just a matter of scheduling WarpScript code to run monthly, nothing fancy.
Lastly, in terms of integration with the Hadoop ecosystem, Warp 10 is on top of things with integration of the WarpScript language in Pig, Spark, Flink and Storm. With the Warp10InputFormat you can fetch data from a Warp 10 platform or you can load data using any other InputFormat and then manipulate them using WarpScript.
At OVH we are heavy users of #OvhMetrics, which relies on Warp10/HBase, and we provide a protocol abstraction with OpenTSDB/WarpScript/PromQL/...
I have no vested interest in Warp10, but it has been a great success for us, both on the scaling challenge and for the use cases that WarpScript can cover.
Most of the time we don't even leverage the Hadoop/Flink integration, because our customers' needs are addressed easily with the real-time WarpScript API.
For real-time analytics, you can try Druid, an open source project maintained by Apache, or you can also check out databases specialized for IoT: GridDB and CrateDB. The best way is to test these databases yourself and see if they suit your needs. You can also connect these databases as a sink to Kafka.
When you are dealing with an IoT project, you need to forecast whether you will have to maintain a large data set in the future or whether you are happy with downsampled data. Some TSDBs, like InfluxDB, have good compression, but others may not be scalable beyond tens of terabytes, so if you think you need to scale big, also look for one with a scale-out architecture.

HIVE/HDFS for realtime storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements,
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequent)
b. Partial data for a user needs to be accessed periodically (more frequent). E.g., I need sensor data collected over the last hour/day/week/month for analysis/reporting.
I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive in such a use case? I am concerned that while the distributed storage needs would work, it seems more suited to data warehousing applications than real-time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
I think HBase can be a good option for you. In fact there's already an open-source implementation on HBase which solves a similar problem that you might want to use. Take a look at OpenTSDB, which is an open source implementation for solving similar problems. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale, and
make this data easily accessible and graphable. Thanks to HBase's
scalability, OpenTSDB allows you to collect many thousands of metrics
from thousands of hosts and applications, at a high rate (every few
seconds). OpenTSDB will never delete or downsample data and can easily
store billions of data points. As a matter of fact, StumbleUpon uses
it to keep track of hundred of thousands of time series and collects
over 600 million data points per day in their main production
datacenter.
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply be running several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would look up all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
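For illustration only, a minimal sketch of that kind of data model with the DataStax Python driver, using hypothetical contact point, keyspace, table, and column names, and a per-day bucket in the partition key so partitions stay bounded:

```python
from cassandra.cluster import Cluster

# Hypothetical contact point; assumes the "sensors" keyspace already exists.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sensors")

# One partition per (sensor, day): a time-range read is a contiguous slice of one partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        day       date,
        ts        timestamp,
        payload   blob,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# "Last hour for one sensor": one partition, one contiguous slice.
rows = session.execute(
    "SELECT ts, payload FROM readings "
    "WHERE sensor_id = %s AND day = %s AND ts >= %s",
    ("sensor-42", "2015-06-01", "2015-06-01 09:00:00+0000"),
)
for row in rows:
    print(row.ts, row.payload)
```

The "full data for a user" case then just fans the same per-sensor slice query out over all of the user's sensor IDs.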
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.
