I am trying to ingest 9,000,000 rows into an elastic pool database with 6 vCores. Data ingestion is done with Python (pyodbc).
Since the data is large, I am ingesting it in chunks.
I am getting weird behaviour after the 9th chunk of the ingestion: the process disappears and randomly reappears after an hour.
Is there any solution for this?
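For context, this is a minimal sketch of the chunked pyodbc ingestion pattern described above, assuming a hypothetical staging table and placeholder connection details; fast_executemany batches each chunk into a single round trip:

```python
import pyodbc

CHUNK_SIZE = 100_000  # rows per batch; tune to the pool's resources

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=myuser;PWD=mypassword",  # placeholder credentials
    autocommit=False,
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send each chunk as one batched round trip

def insert_in_chunks(rows):
    """rows: iterable of (col1, col2) tuples for the hypothetical staging table."""
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= CHUNK_SIZE:
            cursor.executemany(
                "INSERT INTO dbo.staging_table (col1, col2) VALUES (?, ?)", buffer
            )
            conn.commit()  # commit per chunk so a failure does not roll everything back
            buffer.clear()
    if buffer:
        cursor.executemany(
            "INSERT INTO dbo.staging_table (col1, col2) VALUES (?, ?)", buffer
        )
        conn.commit()
```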
My suggestion is to use a non-durable memory-optimized table to speed up data ingestion, while managing the In-Memory OLTP storage footprint by offloading historical data to a disk-based columnstore table. Use a job to regularly batch-offload data from the memory-optimized table to the disk-based columnstore table.
With that adjustment, sustained ingestion rates on the order of 1.4 million rows per second are achievable, though actual throughput will depend on the compute size of your elastic pool.
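A rough sketch of what the periodic offload job could look like when driven from Python with pyodbc; the table names (dbo.InMemoryStaging, dbo.HistoryColumnstore), the created_at cutoff column, and the connection string are assumptions for illustration:

```python
import pyodbc
from datetime import datetime, timedelta

# Placeholder connection string
CONNECTION_STRING = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword"

conn = pyodbc.connect(CONNECTION_STRING, autocommit=False)
cursor = conn.cursor()

# Move everything older than the cutoff from the memory-optimized staging table
# into the disk-based columnstore history table, then delete it from staging to
# keep the In-Memory OLTP footprint small.
cutoff = datetime.utcnow() - timedelta(minutes=10)

cursor.execute(
    """
    INSERT INTO dbo.HistoryColumnstore (id, payload, created_at)
    SELECT id, payload, created_at
    FROM dbo.InMemoryStaging
    WHERE created_at < ?;
    """,
    cutoff,
)
cursor.execute("DELETE FROM dbo.InMemoryStaging WHERE created_at < ?;", cutoff)
conn.commit()
```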
I have a Spark Structured Streaming query running in Databricks. While loading data from a Kafka topic into Delta Lake, the cell output displays "Compute snapshot for version : 3001". I have seen this message many times before, but this is the first time I am seeing such an abnormally large number.
What exactly does this message mean? How should one interpret what's happening under the hood?
Also, does a high version number have any impact on the performance of the task?
From the question I've inferred that you are saving the data in the Delta Lake format, which by design has a concept of Time Travel. In principle this allows you to track changes in the underlying data by saving so-called snapshots of the table:
more about Snapshots - https://books.japila.pl/delta-lake-internals/Snapshot/
more about Time Travel in Delta Lake - https://docs.delta.io/latest/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel
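To make the version/snapshot idea concrete, here is a small Time Travel sketch; the table path is a placeholder and spark is the SparkSession provided by the Databricks notebook:

```python
# Every write to a Delta table creates a new table version; a message like
# "Compute snapshot for version : 3001" typically means the table state as of
# version 3001 is being reconstructed from the transaction log.

delta_path = "/mnt/datalake/events"  # placeholder path

# Inspect the version history of the table
spark.sql(f"DESCRIBE HISTORY delta.`{delta_path}`").show(truncate=False)

# Read the table as it looked at an earlier version (Time Travel)
df_v3000 = (
    spark.read.format("delta")
    .option("versionAsOf", 3000)
    .load(delta_path)
)
df_v3000.count()
```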
We have a use case where a large volume of data (about 80 GB split across 300 files) arrives every 5 minutes in ADLS-V2, and we are using the Spark connector to write from ADLS-V2 to a Kusto table.
During the write stage, we noticed that multiple cores are used to batch the entire data set, but only one core is used to write to the Kusto table, i.e., 80 GB is written with only one core while the remaining cores sit idle.
This process takes a good 20-25 minutes, and we have a tight SLA of 10 minutes.
Azure Databricks cluster: 5 nodes, each with 28 GB RAM and 8 CPU cores.
Each file is ~260 MB uncompressed and in Parquet format. I have also seen a best-practices document which says file sizes should be between 100 MB and 1 GB uncompressed.
We are using the writeStream API in Databricks to write the data.
What is the ideal approach to write the data from ADLS to ADX in a distributed way using the Spark connector?
First, the most efficient flow from ADLS storage to ADX is Event Grid, because writing through the Spark connector means the data is translated to Spark's internal format and then to CSV, which is sent to ADX. From the conversation with you it was clear that you are using Spark to transform the data before ingestion; in that case the Spark connector is a good choice.
From version 3.1.0 the connector flow is split by default into three jobs (unless writeMode.Queued is used). The first job translates the data into CSV, writes it to storage, and queues an ingestion to ADX; this is done in a distributed fashion. The second job polls these ingestions until all of them finish successfully, to ensure transactionality; this is done with one core, since the operation is really cheap (a call to table storage) and there is no need to hold more than one worker for it. The third job seals the transaction (a metadata operation in ADX) and therefore also needs only one core.
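A hedged PySpark sketch of a batch write through the connector; the cluster, credentials, and paths are placeholders, and the exact option keys (in particular writeMode) should be checked against the azure-kusto-spark version in use, so treat them as assumptions:

```python
# `spark` is the Databricks-provided SparkSession; paths and credentials
# below are placeholders.
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/landing/")

(
    df.write.format("com.microsoft.kusto.spark.datasource")
    .option("kustoCluster", "mycluster.westeurope")
    .option("kustoDatabase", "mydb")
    .option("kustoTable", "MyTable")
    .option("kustoAadAppId", "<app-id>")
    .option("kustoAadAppSecret", "<app-secret>")
    .option("kustoAadAuthorityID", "<tenant-id>")
    # With connector >= 3.1.0 the write runs as the three jobs described above;
    # "Queued" skips the transactional polling/sealing stages.
    .option("writeMode", "Queued")
    .mode("Append")
    .save()
)
```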
Hello spark community,
Currently we have a batch pipeline in Azure Databricks that reads from a Delta table. The job runs every night, once per day. We extract its data, save it in our own location as a Delta table again, and then write to an Azure SQL Database table. Everything is partitioned on date like this: Date=2021-01-01, etc.
Things are about to change, since our source Delta table will soon be refreshed every 2-3 minutes, and the requirement is to change the ETL from a nightly batch to streaming mode while still using the same source and target tables as well as the same SQL database table.
Right now this streaming challenge raises several questions:
Our Delta source table is quite huge (30B+ rows). So far, in our nightly batch, we extracted only the changed date keys and used MERGE INTO to write the updates/inserts. Once we switch to streaming mode, is it possible to tell Spark to start streaming from a specific point in time, since we do not want to load the whole table again in streaming mode?
How do you write a stream to a target Delta table with MERGE INTO when a huge number of keys has to be checked on both sides? I suppose we can make use of foreachBatch and do the merge per micro-batch (see the sketch at the end of this post), but how can each new micro-batch check for existing/non-existing keys in our target Delta table in a time-efficient manner (30B+ rows in the current target table)?
Not sure about this one, but is it possible for a streaming Spark job to write directly to a SQL database (Azure), and would that become a bottleneck and hence not be advisable?
I am really looking forward to some good advice and design decisions, since this would be quite a big issue at the current data size. I appreciate every opinion here!
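For reference, a minimal sketch of the foreachBatch + MERGE INTO pattern mentioned above, assuming hypothetical paths, a (Date, id) merge key, and a startingTimestamp so the stream does not replay the entire table:

```python
from delta.tables import DeltaTable

source_path = "/mnt/lake/source_table"   # placeholder paths
target_path = "/mnt/lake/target_table"

def upsert_to_target(micro_batch_df, batch_id):
    """Merge one micro-batch into the target Delta table."""
    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(
            micro_batch_df.alias("s"),
            # Including the partition column (Date) in the condition lets Delta
            # prune files, which matters with 30B+ rows in the target.
            "t.Date = s.Date AND t.id = s.id",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.format("delta")
    # Start from a point in time instead of replaying the full table history.
    .option("startingTimestamp", "2021-06-01T00:00:00.000Z")
    .load(source_path)
    .writeStream
    .foreachBatch(upsert_to_target)
    .option("checkpointLocation", "/mnt/lake/checkpoints/target_table")
    .trigger(processingTime="5 minutes")
    .start()
)
```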
I have a Spark dataframe with just 2 columns, like {Key | Value}, and it has 10 million records. I am inserting this into an HBase table (with 10 pre-split regions) using the bulk load approach from Spark. This works fine and loads the data successfully. When I checked the size of the table it was about 151 GB (453 GB with 3x Hadoop replication). I ran a major compaction on that table, and the table size got reduced to 35 GB (105 GB with 3x replication).
I am trying to run the same code with the same data on a different cluster, but there I have a quota limit of 2 TB on my namespace. My process fails while loading the HFiles into HBase, saying the quota limit is exceeded.
I would like to know whether Spark creates much more data than the required 151 GB during the bulk load. If so, how do I avoid that? Or is there a better approach to load the same data?
The question is: if the actual data is around 151 GB (before major compaction), why is 2 TB not enough?
I need to run 2 million queries against a three-column table t (s, p, o) whose size is 10 billion rows. The data type of each column is string.
There are only two types of queries:
select s, p, o from t where s = param
select s, p, o from t where o = param
If I store the table in a PostgreSQL database, running the queries takes 6 hours using a Java ThreadPoolExecutor.
Do you think Spark could speed up the query processing even more?
What would be the best strategy? These are my ideas:
Load the table into a dataframe and launch the queries against the dataframe.
Save the table in Parquet format and launch the queries against the Parquet files.
Use Spark 2.4 to launch the queries against the PostgreSQL database instead of querying it directly.
Use Spark 3.0 to launch the queries against the database loaded into PG-Strom, an extension module of PostgreSQL with GPU support.
Thanks,
Using Apache Spark on top of the existing MySQL or PostgreSQL server(s) (without the need to export or even stream data to Spark or Hadoop) can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives an additional performance increase for some queries. You can also use Spark's caching to keep the whole result of a MySQL query in memory.
The idea is simple: Spark can read MySQL or PostgreSQL data via JDBC and can also execute SQL queries, so we can connect it directly to the databases and run the queries. Why is this faster? For long-running (i.e., reporting or BI) queries it can be much faster, because Spark is a massively parallel system. For example, MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes.
But I recommend NoSQL (HBase, Cassandra, ...) or NewSQL solutions for your analyses, because they have better performance as the scale of your data increases.
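A hedged PySpark sketch of the JDBC approach described above; the connection details, the hash-based bucket column, and the number of partitions are assumptions for illustration:

```python
# `spark` is an existing SparkSession. partitionColumn/numPartitions make Spark
# open several parallel JDBC connections instead of pulling the whole table
# through a single one.
jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"  # placeholder

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("user", "myuser")
    .option("password", "mypassword")
    # Derive a numeric bucket so the read can be split; hashtext() is a
    # PostgreSQL built-in used here purely for bucketing.
    .option("dbtable", "(SELECT s, p, o, abs(hashtext(s)) % 64 AS bucket FROM t) AS src")
    .option("partitionColumn", "bucket")
    .option("lowerBound", "0")
    .option("upperBound", "64")
    .option("numPartitions", "64")
    .load()
    .cache()  # cache only if the data fits in cluster memory
)

df.createOrReplaceTempView("t")
spark.sql("SELECT s, p, o FROM t WHERE s = 'some_value'").show()
```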
Static Data? Spark; Otherwise tune Postgres
If the 10 billion rows are static or rarely updated, your best bet is going to be using Spark with appropriate partitions. The magic happens with parallelization, so the more cores you have, the better. You want to aim for partitions that are about half a gigabyte in size each.
Determine the size of the data by running SELECT pg_size_pretty( pg_total_relation_size('tablename')); then pick a partition count, based on the number of cores available to Spark, that puts each partition between 1/8 and 3/4 of a gigabyte.
Save as parquet if you really have static data or if you want to recover from a failure quickly.
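A small sketch of that sizing arithmetic plus the Parquet snapshot suggested above; the table size, core count, and output path are assumed values:

```python
import math

# Example arithmetic: suppose pg_total_relation_size reported ~800 GB and the
# Spark cluster has 40 cores; aim for partitions of roughly 0.5 GB each.
table_size_gb = 800
cores = 40
target_partition_gb = 0.5

num_partitions = max(cores, math.ceil(table_size_gb / target_partition_gb))
print(num_partitions)  # -> 1600 partitions of ~0.5 GB each

# `df` is the JDBC-backed DataFrame from the earlier sketch; writing a Parquet
# copy lets you rerun the 2 million queries without touching Postgres again.
df.repartition(num_partitions).write.mode("overwrite").parquet("/data/t_parquet")
```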
If the source data are updated frequently, you're going to want to add indices in Postgres. It could be as straightforward as adding an index on each column. Partitioning in Postgres would also help.
Stick to Postgres. Newer databases are not appropriate for structured data such as yours. There are parallelization options, e.g. Aurora if you're on AWS.
PG-Strom is not going to work for you here. You have simple data with few columns. Getting them into and out of a GPU is going to slow you down too much.