The app server keeps getting the following warning logs for a few tables:
org.apache.cassandra.io.sstable.format.big.BigTableWriter.maybeLogLargePartitionWarning Writing large partition
What does this mean? How do I analyse and resolve this issue?
This means that you have some partitions that are bigger than a configured threshold (the default is 100 MB). This is typically an indicator of problems with the data model. The appropriate set of actions really depends on the Cassandra version used: for example, Cassandra 3.6 and later should handle big partitions better (earlier versions could simply crash on them), but big partitions still put additional load on the Cassandra process, especially for maintenance tasks.
You need to analyze why you have such big partitions, starting by using nodetool tablehistograms (or nodetool cfhistograms for older versions). Also analyze your schema and how you can change it to avoid such big partitions. The DataStax documentation includes a guide developed by the customer-facing team for analyzing these kinds of problems - here is the section about large partitions.
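For illustration, here is a minimal sketch of the usual data-model fix - adding a bucket to the partition key so that one logical entity is spread over many bounded partitions instead of a single ever-growing one. It uses the Python cassandra-driver; the keyspace, table, and bucketing scheme are hypothetical, so adapt them to your own schema:

```python
# Hypothetical example: splitting an unbounded partition by adding a time bucket.
# Assumes a reachable Cassandra/DSE node and the `cassandra-driver` package.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_ks")  # hypothetical keyspace

# Before: PRIMARY KEY ((sensor_id), ts) -- one partition per sensor, growing forever.
# After: a day bucket in the partition key keeps every partition bounded in size.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_day (
        sensor_id text,
        day date,
        ts timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), ts)
    )
""")

def write_reading(sensor_id, ts, value):
    # The application derives the bucket from the timestamp on both write and read.
    session.execute(
        "INSERT INTO readings_by_day (sensor_id, day, ts, value) VALUES (%s, %s, %s, %s)",
        (sensor_id, ts.date(), ts, value),
    )

write_reading("sensor-42", datetime.utcnow(), 21.5)
```

The bucket granularity (day, week, hash of an id, etc.) should be chosen so that a single partition stays well below the warning threshold.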
Related
I have a JSON export from RavenDB which is not valid JSON, as it contains duplicate keys.
So my first step is to clean the JSON and, if there are duplicates, write a separate JSON file for each of them.
I was able to do this for a sample file and it ran successfully.
Then I tried a 12 MB file and it also worked.
But when I try it on a full DB backup file, which is 10 GB in size, it gives an error.
This 10 GB file generates 3 separate JSON files, as it contains the DOCS key 3 times.
The first file is 9.6 GB and the other 2 files are small, around 120 MB and 10 KB.
When I try to load the first file into Synapse DWH, I get the error below.
Job failed due to reason: Cluster ran into out of memory issue during execution. Also, Please note that the dataflow has one or more custom partitioning schemes. The transformation(s) using custom partition schemes: Json,Select1,FlattenDocsCS,Flatten2,Filter1,ChangeDataTypesDateColumns,CstomsShipment. 1. Please retry using an integration runtime with bigger core count and/or memory optimized compute type. 2. Please retry using different partitioning schemes and/or number of partitions.
I tried publishing the pipeline so that I am not running in debug mode on a small cluster.
I changed the cluster size to 32 cores and tried all the available partitioning schemes in the Optimize tab.
But I am still getting the error.
Kindly help.
Note: as mentioned in the error message:
Please retry using an integration runtime with bigger core count and/or memory optimized compute type.
Successful execution of data flows depends on many factors, including the compute size/type, numbers of source/sinks to process, the partition specification, transformations involved, sizes of datasets, the data skewness and so on.
Increasing the cluster size:
Data flows distribute the data processing over different nodes in a Spark cluster to perform operations in parallel. A Spark cluster with more cores increases the number of nodes in the compute environment. More nodes increase the processing power of the data flow. Increasing the size of the cluster is often an easy way to reduce the processing time.
MSFT Doc- Integration Runtime Performance | Cluster Size - here
Please retry using different partitioning schemes and/or number of partitions.
Note: Manually setting the partitioning scheme reshuffles the data and can offset the benefits of the Spark optimizer. A best practice is to not manually set the partitioning unless you need to.
By default, Use current partitioning is selected, which instructs the service to keep the current output partitioning of the transformation. As repartitioning data takes time, Use current partitioning is recommended in most scenarios. Scenarios where you may want to repartition your data include after aggregates and joins that significantly skew your data, or when using Source partitioning on a SQL DB.
MSFT Data Flow Tuning Performance: here.
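Mapping data flows execute on Spark clusters, so the trade-off described above is roughly the same one you see when repartitioning a DataFrame by hand. A minimal PySpark sketch of the skewed-join scenario mentioned in the docs (file paths, column names, and partition counts are made up, not taken from your pipeline):

```python
# Illustrative only: what choosing a partitioning scheme roughly maps to in Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

orders = spark.read.json("/data/orders.json")
customers = spark.read.json("/data/customers.json")

joined = orders.join(customers, "customer_id")

# Repartitioning reshuffles all the data across the cluster; it only pays off
# when the previous step left the data heavily skewed across partitions.
balanced = joined.repartition(200, "customer_id")

balanced.write.mode("overwrite").parquet("/data/joined_balanced")
```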
Following this guidance should take your performance tuning to the next level, as the error message itself describes the relevant knobs well.
I'm interested only in the query performance reasons and the architectural differences behind them. All answers I've seen before were outdated or didn't give me enough context for WHY Impala is better for ad hoc queries.
Of the 3 considerations below, only the 2nd one explains why Impala is faster on bigger datasets. Could you please comment on the following statements?
1. Impala doesn't lose time on query pre-initialization, meaning the impalad daemons are always running and ready. On the other hand, Spark Job Server provides persistent contexts for the same purpose.
2. Impala is in-memory and can spill data to disk, with a performance penalty, when the data doesn't fit in RAM. The same is true for Spark. The main difference is that Spark is written in Scala and has JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented in C++ and has high hardware requirements: 128-256+ GB of RAM recommended. This is very significant, but should benefit Impala only on datasets that require 32-64+ GB of RAM.
3. Impala is integrated with the Hadoop infrastructure. AFAIK the main reason to use Impala over other in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. This means Impala usually uses the same storage/data/partitioning/bucketing as Spark can use, and does not gain any extra benefit from the data structure compared to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD:
Questions update:
I. Why does Impala recommend 128+ GB of RAM? What is the implementation language of each of Impala's components? The docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine." If impalad is Java, then what parts are written in C++? Is there something between impalad and the columnar data? Are 256 GB of RAM required for impalad or for some other component?
II. Impala loses all its in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanism to boost JOIN performance compared to Spark?
III. Impala uses a Multi-Level Service Tree (something like the Dremel engine, see "Execution model" here) vs Spark's Directed Acyclic Graph. What does MLST vs DAG actually mean in terms of ad hoc query performance? Or is it a better fit for a multi-user environment?
First off, I don't think a comparison of a general-purpose distributed computing framework and a distributed DBMS (SQL engine) has much meaning. But if we would still like to compare a single query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned -- Impala query coordinators have everything (table metadata from Hive MetaStore + block locations from NameNode) cached in memory, while Spark will need time to extract this data in order to perform query planning.
The second biggie would probably be the shuffle implementation, with Spark writing temp files to disk at stage boundaries versus Impala trying to keep everything in-memory. This leads to a radical difference in resilience - while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is the work distribution mechanism -- compiled whole-stage codegens sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As for specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization) -- they could be on par today or will be in the near future.
I did an evaluation of HDFS's and Cassandra's storage usage using the same input data on a single machine. Both HDFS and Cassandra have only 1 replica.
My input data is binary, 31 MB in total. It turned out that HDFS stores less data than Cassandra.
HDFS: 16.4 MB (using the COMPRESS.BLOCK strategy)
Cassandra: 50 MB (using the CQL interface, with default settings (e.g. compression))
How could that be possible, since Cassandra uses columnar storage?
Could anyone help me figure it out? Thanks very much.
My Cassandra version is 2.1.9.
You will see better C* disk usage if you use 3.x. It's a 2.1 thing that the column name is stored along with each field, so if you have 10 fields it will be a lot worse. 3.x is a lot better, as it doesn't store that redundant data.
HDFS and C* are two completely different things for solving different kinds of problems. If you're looking just for the most efficient use of disk space, then HDFS is probably what you want, as it can store bulk binary data much more efficiently. If you're looking for faster reads/writes, C* may be a better choice. C* adds to your data in order to organize it, make queries more efficient, and provide guarantees about the data (for consistency). Compression will earn some of that back, but in a lot of cases it's going to take up more space than just your raw data would.
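A rough back-of-envelope sketch of why the 2.x storage engine balloons: every cell carries its column name plus per-cell metadata (timestamp, length fields), so the overhead scales with rows x columns. The sizes below are illustrative assumptions, not exact on-disk accounting:

```python
# Rough, illustrative estimate of per-cell overhead in the Cassandra 2.x storage
# engine (column name repeated per cell + ~8-byte timestamp + a few bytes of
# length/flag fields). Numbers are assumptions chosen to show the effect, not
# measurements of any real table.

rows = 1_000_000
columns = 10
avg_value_bytes = 3           # small binary values
avg_column_name_bytes = 12    # column name stored with every cell in 2.x
per_cell_metadata_bytes = 11  # ~8-byte timestamp + length fields (approx.)

raw = rows * columns * avg_value_bytes
cassandra_2x = rows * columns * (avg_value_bytes + avg_column_name_bytes + per_cell_metadata_bytes)

print(f"raw payload:       {raw / 1e6:.1f} MB")
print(f"2.x cells (est.):  {cassandra_2x / 1e6:.1f} MB  (before compression)")
```

Compression recovers a good chunk of that repetition, which is why the measured gap is a factor of ~3 rather than ~8, but the raw payload is still the smaller number.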
I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data to Cassandra.
My requirement is to read the data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SStables
3. Copy Command
4. Java Batch
5. Data Flow ( Google product )
Are there any other options, and which one is best?
Thanks,
For flat files, the 2 most effective options are:
Use Spark - it will load data in parallel, but requires some coding.
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON, and is very effective. DataStax's Academy blog just started a series of blog posts on DSBulk, and the first post will give you enough information to start with it. Also, if you have big files, consider splitting them into smaller ones, as this will allow DSBulk to perform a parallel load using all available threads.
For loading data from an RDBMS, it depends on what you want to do - load the data once, or keep updating it as it changes in the DB. For the first option you can use Spark with a JDBC source (though it has some limitations too), and then save the data into DSE. For the 2nd, you may need to use something like Debezium, which supports streaming change data from some databases into Kafka. And then from Kafka you can use the DataStax Kafka Connector to submit data into DSE.
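For the one-off RDBMS load, a minimal PySpark sketch might look like the following; all URLs, credentials, keyspace and table names are placeholders, and it assumes the spark-cassandra-connector and a JDBC driver are on the classpath:

```python
# Sketch: one-off copy from an RDBMS table into DSE/Cassandra via Spark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdbms-to-cassandra")
         .config("spark.cassandra.connection.host", "10.0.0.1")  # placeholder host
         .getOrCreate())

source = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder URL
          .option("dbtable", "public.customers")
          .option("user", "loader")
          .option("password", "secret")
          .load())

(source.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_ks", table="customers")  # placeholder keyspace/table
 .mode("append")
 .save())
```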
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I wouldn't recommend using it.
And never use CQL batches for data loading unless you know how they work - they're very different from the RDBMS world, and if used incorrectly they will really make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)
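To illustrate that last point, here is a small sketch of loading rows with separate asynchronous prepared statements via the Python driver, rather than wrapping unrelated inserts into one big batch. The CSV layout, keyspace, table, and host are hypothetical:

```python
# Sketch: loading a CSV with concurrent asynchronous inserts instead of one big
# batch of unrelated partitions. Host, keyspace, table, and columns are made up.
import csv
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_ks")

insert = session.prepare(
    "INSERT INTO customers (id, name, email) VALUES (?, ?, ?)"
)

with open("customers.csv", newline="") as f:
    rows = [(r["id"], r["name"], r["email"]) for r in csv.DictReader(f)]

# Keeps up to `concurrency` prepared inserts in flight at once.
results = execute_concurrent_with_args(session, insert, rows, concurrency=100)
failed = [r for ok, r in results if not ok]
print(f"loaded {len(rows) - len(failed)} rows, {len(failed)} failures")
```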
My jobs often hang with this kind of message:
14/09/01 00:32:18 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark#*:37619
It would be great if someone could explain what Spark is doing when it spits out this message. What does this message mean? What could the user be doing wrong to cause this? What configurables should be tuned?
It's really hard to debug because it doesn't OOM, it doesn't give a stack trace, it just sits and sits and sits.
This has been an issue in Spark at least as far back as 1.0.0 and is still present in Spark 1.5.0.
Based on this thread, more recent versions of Spark have gotten better at shuffling (and at reporting errors if it fails anyway). Also, the following tips were mentioned:
This is very likely because the serialized map output locations buffer exceeds the akka frame size. Please try setting "spark.akka.frameSize" (default 10 MB) to some higher number, like 64 or 128.
In the newest version of Spark, this would throw a better error, for what it's worth.
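For example, in the Spark 1.x versions this question is about, the frame size can be raised via the SparkConf (or the equivalent --conf flag on spark-submit); the property was removed along with the Akka transport in later releases:

```python
# Applies to the Akka-based Spark 1.x versions discussed here.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("large-shuffle-job")
        .set("spark.akka.frameSize", "128"))  # in MB; the old default was 10

sc = SparkContext(conf=conf)
```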
A possible workaround:
If the distribution of the keys in your groupByKey is skewed (some keys appear way more often than others) you should consider modifying your job to use reduceByKey instead wherever possible.
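A minimal sketch of that workaround on a made-up RDD of (key, value) pairs: reduceByKey combines values map-side before the shuffle, so far less data for a hot key crosses the network than with groupByKey.

```python
# Illustrative RDD: counting per key. groupByKey ships every value for a hot key
# across the network; reduceByKey pre-aggregates on each mapper before the shuffle.
from pyspark import SparkContext

sc = SparkContext(appName="reduce-vs-group")
pairs = sc.parallelize([("a", 1), ("a", 1), ("b", 1), ("a", 1)])

# Skew-prone: all values for "a" are pulled to a single reducer first.
grouped = pairs.groupByKey().mapValues(sum)

# Preferred: partial sums are computed map-side, then merged.
reduced = pairs.reduceByKey(lambda x, y: x + y)

print(reduced.collect())
```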
And a side track:
The issue was fixed for me by allocating just one core per executor.
maybe your executor-memory config should be divided by executor-cores