Merge very large Hive tables (11 to be precise) using Spark - apache-spark

I am basically substituting for another programmer.
Problem Description:
There are 11 Hive tables, each with 8 to 11 columns. All of these tables have around 5 columns whose names are the same but which hold different values.
For example, Table A has mobile_no, date, and duration columns, and so does Table B, but the values are not the same. The other columns have different names in each table.
In all tables, the data types are simple ones: string, integer, and double. String values are at most 100 characters long.
Each table contains around 50 million rows. My requirement is to merge these 11 tables, keeping their columns as they are, into one big table.
Our Spark cluster has 20 physical servers, each with 36 cores (72 counting virtualization) and 512 GB of RAM. The Spark version is 2.2.x.
I have to merge them in a way that is efficient in both memory and speed.
Can you help me with this problem?
N.B.: please let me know if you have any questions.
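No answer was posted here, but a minimal sketch of one common approach could look like the following. The database and table names are hypothetical, it assumes the shared columns have the same data types in every table, and since Spark 2.2 has no unionByName, the columns are aligned explicitly before the union:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DataType

val spark = SparkSession.builder()
  .appName("merge-hive-tables")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical names; replace with the real 11 Hive tables.
val tableNames = Seq("db.table_a", "db.table_b", "db.table_c")

val dfs = tableNames.map(spark.table)

// The union of all column names across the tables, in a fixed order,
// plus the data type of each column taken from the first table that has it.
val allColumns: Seq[String] = dfs.flatMap(_.columns).distinct
val colTypes: Map[String, DataType] =
  dfs.flatMap(_.schema.fields).groupBy(_.name).mapValues(_.head.dataType)

// Add missing columns as typed nulls and tag each row with its source table,
// so all 11 schemas line up before the union.
def align(df: DataFrame, source: String): DataFrame = {
  val selected = allColumns.map { c =>
    if (df.columns.contains(c)) col(c) else lit(null).cast(colTypes(c)).as(c)
  }
  df.select(selected: _*).withColumn("source_table", lit(source))
}

val merged = tableNames.zip(dfs)
  .map { case (name, df) => align(df, name) }
  .reduce(_ union _)

merged.write.mode("overwrite").saveAsTable("db.merged_table")

Since the tables only share about 5 columns, most of the merged table will be nulls for any given source row; if the goal is instead to join on the shared keys rather than stack rows, the approach would be different.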

Related

How to partition Delta tables efficiently?

I am looking for efficient partitioning strategies for my dataframe when storing it in a Delta table.
My current dataframe has around 1,500,000 rows, and it takes 3.5 h to move the data from the dataframe to the Delta table.
Looking for a more efficient way to do this write, I decided to try different columns of my table as partitioning columns. I checked the cardinality of my columns and selected the following ones:
column1 = 3 distinct values
column2 = 7 distinct values
column3 = 26 distinct values
column4 = 73 distinct values
column5 = 143 distinct values
column6 = 246 distinct values
column7 = 543 distinct values
Cluster: 64 GB, 8 cores
I am using the following code in my notebook:
df.write.partitionBy("column_1").format("delta").mode("overwrite").save(partition_1)
..
df.write.partitionBy("column_7").format("delta").mode("overwrite").save(partition7)
Thus, I wanted to see which partitioning strategy would bring better results: a column with high cardinality, one with low cardinality, or one in between.
To my surprise, this has not had any effect: every run took practically the same time, with differences of only a few minutes, but all of them over 3 h.
Why did this fail? Is there no advantage to partitioning?
When you use Delta (either Databricks or OSS Delta 1.2.x, better 2.0), you often may not need partitioning at all, for the following reasons (which don't apply to Parquet or other file formats):
Delta supports data skipping, which allows reading only the necessary files; it is especially effective in combination with OPTIMIZE ZORDER BY, which puts related data closer together.
Bloom filters allow skipping files at an even finer granularity.
The rules of thumb for using partitioning with Delta Lake tables are the following:
use it when it benefits queries, especially when you perform MERGE into the table, because it helps avoid conflicts between parallel transactions
use it when it helps to delete old data (for example, partitioning by date)
use it when it really benefits your queries. For example, if you have data per country and most queries use country as part of the condition, or when you partition by date and query data based on time.
In all cases, don't partition on high-cardinality columns (hundreds of values) and don't use too many partition columns, because in most cases this leads to the creation of small files that are less efficient to read (each file is accessed separately), and it increases the load on the driver, which has to keep metadata for every file.
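As a concrete illustration of the first point, here is a minimal sketch of what the data-skipping route could look like. It assumes a Databricks or OSS Delta 2.0+ environment where spark is already defined, and the table and column names are made up:

// Compact the table and co-locate related data; ZORDER works best on a few
// columns that appear frequently in query filters.
spark.sql("OPTIMIZE my_delta_table ZORDER BY (column_5, column_7)")

// Subsequent filtered reads can then skip whole files based on the per-file
// min/max statistics that Delta keeps.
spark.table("my_delta_table").filter("column_5 = 42").count()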

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results into S3 using a dataframe write with the query below, where input_table_1 is 13 GB and input_table_2 is 1 MB.
input_table_1 has columns account, membership, and
input_table_2 has columns role, id, membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where the membership array contains a list of member_ids.
This dataset write using a Spark dataframe generates around 1.1 TiB of data in S3, with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count is reduced to almost a third of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused about how this can happen.
I have used the following Spark conf setting:
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to S3, but it did not work.
Please suggest whether this is expected and what can be done.
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns.map(col): _*).agg(Map.empty[String, String]).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
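You can see this translation for yourself by looking at the plans Spark produces; here is a minimal sketch with toy data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("distinct-plan-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "name")

// explain(true) prints all plans: the analyzed plan contains a Deduplicate node,
// the optimized plan shows it rewritten into an Aggregate over all columns,
// and the physical plan shows HashAggregate with a hash-partitioned Exchange.
df.distinct().explain(true)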
Now on to the second side: why does this affect the size of your Parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, meaning your data is organised in columns rather than row by row. This allows for powerful compression if the data is adequately laid out and ordered. For example, if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and note the number of repetitions (a strategy called run-length encoding). But Parquet also uses various other compression strategies.
Unfortunately, in your case the data ends up pretty randomly distributed after shuffling to remove duplicates. The original partitioning of input_table_1 was a much better fit.
Solutions
There's no single answer for how to solve this, but here are a few pointers I'd suggest looking into next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) is likely to give you even better compression (see the sketch after this list). However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using CLUSTER BY. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group/shuffle the data just once in a meaningful way.
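For the second pointer, here is a minimal sketch of what that could look like. resultDf stands for the joined dataframe produced by the SQL above, and the partition count, sort columns, and output path are made up; the right choices depend on how input_table_1 was originally laid out:

import org.apache.spark.sql.functions.col

val deduped = resultDf.distinct()

deduped
  .repartition(2000, col("account"))      // cluster rows sharing an account onto the same partitions
  .sortWithinPartitions("account", "id")  // long runs of equal values compress well in Parquet
  .write
  .mode("overwrite")
  .parquet("s3://my-bucket/deduped-output/")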

How does Parquet file size changes with the count in Spark Dataset

I came across a scenario where I had a Spark dataset with 24 columns, of which I was grouping by the first 22 columns and summing up the last two.
I removed the group by from the query, and I now have all 24 columns selected.
The initial count of the dataset was 79,304.
After I removed the group by, the count increased to 138,204, which is understandable because I removed the group by.
But I was not clear about the behaviour: the initial size of the Parquet file was 2.3 MB, but it was later reduced to 1.5 MB. Can anyone please help me understand this?
Also, the size does not decrease every time.
I had a similar scenario with 22 columns:
the count before was 35,298,226 and after removing the group by it was 59,874,208,
and here the size increased from 466.5 MB to 509.8 MB.
When dealing with Parquet sizes, it's not about the number of rows, it's about the data itself.
Parquet is a column-oriented format: it stores and compresses data column by column. So it's not about the number of rows, but rather about the diversity of the columns.
How well Parquet compresses is driven by the diversity of the most diverse column in the table. So even a single-column dataframe will compress well only if the values in that column repeat or are close to each other, and poorly if they vary widely.
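A small experiment makes this concrete; here is a minimal sketch (paths are made up) that writes the same rows once sorted and once shuffled, so you can compare the resulting Parquet sizes on disk:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, rand}

val spark = SparkSession.builder().appName("parquet-size-demo").master("local[*]").getOrCreate()

// One low-cardinality column, one million rows.
val df = spark.range(1000000L).select((col("id") % 100).as("category"))

// Sorted: long runs of identical values, so run-length/dictionary encoding shrinks the files a lot.
df.orderBy("category").write.mode("overwrite").parquet("/tmp/parquet_sorted")

// Randomly ordered: same rows, same count, but noticeably worse compression.
df.orderBy(rand()).write.mode("overwrite").parquet("/tmp/parquet_shuffled")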

What are the maximum number of columns allowed in Cassandra

Cassandra published its technical limitations but did not mention the max number of columns allowed. Is there a maximum number of columns? I have a need to store 400+ fields. Is this possible in Cassandra?
The maximum number of columns per row (or a set of rows, which is called "partition" in Cassandra's CQL) is 2 billion (but the partition must also fit on a physical node, see docs).
400+ fields is not a problem.
As per the Cassandra technical limitations page, the total number of cells (rows × columns) in a partition cannot exceed 2 billion.
You could have a partition with 1 row and 2 billion columns, and no more rows would be allowed in it; so the limit is not 2 billion columns per row, but rather on the total number of cells in a partition.
https://wiki.apache.org/cassandra/CassandraLimitations
Rajmohan's answer is technically correct. On the other hand, if you have 400 CQL columns, you most likely aren't optimizing your data model. You want to generate Cassandra wide rows using partition keys and clustering columns in CQL.
Moreover, you don't want rows that are too wide from a practical (performance) perspective. A conservative rule of thumb is to keep your partitions under hundreds of megabytes or 100,000s of cells.
Take a look at these two links to help wrap your head around this.
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
http://www.sestevez.com/sestevez/CASTableSizer/

Select count(*) unstable on wide rows - Cassandra 2.1.2

I'm running a 4-node Cassandra 2.1.2 cluster (6 cores per machine, 32 GB RAM).
I have 2 similar tables with about 650K rows each. The rows are pretty wide: 150K columns.
On the first table, running select count(*) from cqlsh gives the same result in a stable manner (the actual number of rows), but on the second table I get completely different values from run to run.
The only difference between the two tables is that the 2nd table has a column that contains a collection (list) of 3 doubles, whereas the first table contains a single double in that column.
There is no data being inserted into the tables, and there are no compactions going on.
The row cache is disabled.
Any ideas on how to fix this?

Resources