I have to create a table consisting of 30 columns in HBase. The problem is that I have 5 different queries, each running against a different one of 5 columns. In Oracle the table has indexes on multiple columns; how can I achieve the equivalent in HBase?
All 5 queries are SLA-bound.
What are the possible approaches to start with?
Thanks.
Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= cast(impressionTime as date) AND
    clickTime <= cast(impressionTime as date) + interval 1 day
  """)
)
Assume that both tables have trillions of rows covering 2 years of data. I think that joining everything from both tables is unnecessary. What I want to do is create subsets, something like this: create 365 * 2 * 2 smaller dataframes so that there is 1 dataframe per day per table for 2 years, then run 365 * 2 join queries and take the union of them. But that is inefficient, and I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables, add write.partitionBy(cast(impressionTime as date), cast(impressionTime as date)) to the stream writer, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is the proper way to do this? Does Spark analyze the query and optimize it so that entries from a single day are automatically put in the same partition? What if I am not joining all records from the same day but rather from the same hour, and there are very few records from 11pm to 1am? Does Spark know that partitioning by day is most efficient in that case, or would partitioning by hour be even more efficient?
First, let me restate what I have understood from your question: you have two tables, each with two years' worth of data and around a trillion records. You want to join them efficiently for a given timeframe, for example a specific month of a specific year or an arbitrary custom date range, and the join should read only that much data, not all of it.
Now, to answer your question, you can do something like the following:
First of all, when you are writing the data to create the table, partition the table by a day column so that each day's data lands in a separate directory/partition for both tables. Spark won't do that for you by default; you have to decide it based on your dataset. A minimal sketch of the write side follows below.
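For illustration only, here is a minimal sketch of the write side in Scala, assuming two hypothetical batch dataframes named impressions and clicks and output paths of your choosing (a streaming write would use partitionBy on the DataStreamWriter in the same way):

  import org.apache.spark.sql.functions.{col, to_date}

  // Derive an explicit day column to partition on, so each day lands in its own directory.
  val impressionsByDay = impressions.withColumn("impression_date", to_date(col("impressionTime")))
  impressionsByDay.write
    .partitionBy("impression_date")
    .mode("overwrite")
    .parquet("/data/impressions_partitioned")

  val clicksByDay = clicks.withColumn("click_date", to_date(col("clickTime")))
  clicksByDay.write
    .partitionBy("click_date")
    .mode("overwrite")
    .parquet("/data/clicks_partitioned")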
Second, when you read the data and perform the join, it should not be done on the whole table. Read only the specific partitions by applying a filter condition on the dataframe, so that Spark applies partition pruning and reads only the partitions that satisfy the condition in the filter clause.
Once you have filtered the data at read time and stored it in a dataframe, join those dataframes on the key relationship; that is the most efficient and performant way of doing it as a first pass. See the sketch below.
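Continuing the sketch above (the paths, dates, and the assumption of an existing SparkSession named spark are all illustrative), the read side filters on the partition column first so pruning kicks in, and only then joins:

  import org.apache.spark.sql.functions.{col, expr}

  // Only the partitions inside the requested date range are read from disk.
  val impressionsSubset = spark.read.parquet("/data/impressions_partitioned")
    .filter(col("impression_date").between("2023-01-01", "2023-01-31"))

  val clicksSubset = spark.read.parquet("/data/clicks_partitioned")
    .filter(col("click_date").between("2023-01-01", "2023-02-01"))

  // Join only the pruned subsets on the key relationship.
  val joined = impressionsSubset.join(
    clicksSubset,
    expr("""
      clickAdId = impressionAdId AND
      clickTime >= cast(impressionTime as date) AND
      clickTime <= cast(impressionTime as date) + interval 1 day
    """)
  )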
If it is still not fast enough, you can look at bucketing your data in addition to partitioning, but in most cases it is not required.
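If you do explore bucketing, here is a hedged sketch building on the write example above (the bucket count and table name are arbitrary choices, not recommendations):

  // Bucketing requires writing through the catalog with saveAsTable rather than a plain path.
  impressionsByDay.write
    .partitionBy("impression_date")
    .bucketBy(64, "impressionAdId")   // co-locates rows with the same join key in the same bucket
    .sortBy("impressionAdId")
    .mode("overwrite")
    .saveAsTable("impressions_bucketed")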
Currently we have to consider a use case where we join on many columns (maybe 20-30 or even more) between two dataframes in order to identify new rows to persist.
One dataframe can contain 200k rows and the other 40k, but both can keep growing.
We run the process on a cluster of roughly 40 worker nodes.
So the question is not whether Spark can do it, but how to do it without paralyzing the entire cluster.
The questions from this scenario:
How does cluster performance differ based on the number of columns in the join (reshuffling etc.)?
Is it practical to partition the dataframe across all the joining columns?
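For reference, a multi-column join in Spark is usually expressed as in the sketch below; the table names, column names, and the use of a left_anti join to find new rows are all illustrative assumptions, not a statement of how the original job works:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("multi-col-join").getOrCreate()

  // Hypothetical dataframes sharing the same key columns.
  val existing = spark.table("existing_rows")
  val incoming = spark.table("incoming_rows")

  // Every column listed here becomes part of the shuffle key; adding columns
  // changes the key (and its skew), not the number of shuffles.
  val joinCols = Seq("col1", "col2", "col3")   // extend to the 20-30 real columns

  // Rows present in incoming but not in existing, i.e. the new rows to persist.
  val newRows = incoming.join(existing, joinCols, "left_anti")
  newRows.write.mode("append").saveAsTable("persisted_rows")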
I am basically substituting for another programmer.
Problem Description:
There are 11 Hive tables, each with 8 to 11 columns. All these tables have around 5 columns whose names are the same but which hold different values.
For example, Table A has mobile_no, date, and duration columns, and so does Table B, but the values are not the same. The other columns have different names from table to table.
In all tables, the data types are string, integer, and double, i.e. simple data types. String data has a maximum of 100 characters.
Each table contains around 50 million rows. I have a requirement to merge these 11 tables, taking their columns as they are, and make one big table.
Our Spark cluster has 20 physical servers, each with 36 cores (72 if you count virtualization) and 512 GB of RAM. Spark version is 2.2.x.
I have to merge these tables efficiently in terms of both memory and speed.
Can you help me with this problem?
N.B.: please let me know if you have questions.
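A hedged sketch of one way to do this, assuming "merge" means stacking the rows of all 11 tables into one wide table over the superset of their columns (the table names below are placeholders). It avoids unionByName, which is not available in Spark 2.2, by aligning the schemas manually:

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.functions.{col, lit}

  val spark = SparkSession.builder()
    .appName("merge-hive-tables")
    .enableHiveSupport()
    .getOrCreate()

  // Placeholder names; substitute the 11 real Hive table names here.
  val tableNames = Seq("table_a", "table_b", "table_c")
  val dfs = tableNames.map(spark.table)

  // Superset of all column names, plus a data type for each column taken from a table that has it.
  val allCols = dfs.flatMap(_.columns).distinct
  val colTypes = dfs.flatMap(_.schema.fields.map(f => f.name -> f.dataType)).toMap

  // Give every dataframe the same schema by adding missing columns as typed nulls,
  // then select the columns in one fixed order so a positional union is safe.
  def align(df: DataFrame): DataFrame = {
    val filled = allCols.foldLeft(df) { (acc, c) =>
      if (acc.columns.contains(c)) acc else acc.withColumn(c, lit(null).cast(colTypes(c)))
    }
    filled.select(allCols.map(col): _*)
  }

  val merged = dfs.map(align).reduce(_ union _)
  merged.write.mode("overwrite").saveAsTable("merged_big_table")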
I have 2 tables in my database, each with 100 million rows.
Is there a way to join these 2 tables and extract the data using Apache Spark as fast as possible?
I would say the most efficient way would be to use DataFrames and call join, followed by any other criteria. The benefit is that certain filters or selections will be pushed down as far as possible to cut down your network load: only the data that is needed will be pulled.
Without more information, that is the best suggestion I can give.
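As an illustration only (the JDBC source, table names, and predicates are made up), the idea is to read through the DataFrame API, apply filters and column selections early, and let Catalyst push them toward the source before the join:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("join-pushdown").getOrCreate()

  // Filters and selected columns on a JDBC source can be pushed down to the database,
  // so only the needed rows and columns cross the network.
  val orders = spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/shop")
    .option("dbtable", "orders")
    .load()
    .filter(col("order_date") >= "2023-01-01")
    .select("order_id", "customer_id", "amount")

  val customers = spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/shop")
    .option("dbtable", "customers")
    .load()
    .select("customer_id", "country")

  // Only the pruned, filtered data participates in the join.
  val joined = orders.join(customers, Seq("customer_id"))
  joined.explain(true)   // inspect the plan to confirm the pushdown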
I'm running a 4-node Cassandra 2.1.2 cluster (6 cores per machine, 32 GB RAM).
I have 2 similar tables with about 650K rows each. The rows are pretty wide: 150K columns.
On the first table, running select count(*) from cqlsh returns the same result in a stable manner (the actual number of rows), but on the second table I get completely different values from run to run.
The only difference between the two tables is that the 2nd table has a column containing a collection (list) of 3 Doubles, whereas the first table has a single Double in that column.
There is no data being inserted into the tables, and there are no compactions going on.
The row cache is disabled.
Any ideas on how to fix this?