Filter Partition Before Reading Hive table (Spark) - apache-spark

Currently I'm trying to filter a Hive table by the latest date_processed.
The table is partitioned by.
System
date_processed
Region
The only way I've managed to filter it, is by doing a join query:
query = "select * from contracts_table as a join (select (max(date_processed) as maximum from contract_table as b) on a.date_processed = b.maximum"
This way is really time consuming, as I have to do the same procedure for 25 tables.
Any one Knows a way to read directly the latest loaded partition of a table in Spark <1.6
This is the method I'm using to read.
public static DataFrame loadAndFilter (String query)
{
return SparkContextSingleton.getHiveContext().sql(+query);
}
Many thanks!

Dataframe with all table partitions can be received by:
val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")
Values can be parsed, for get max value.

Related

How to improve performance my spark job here to load data into cassandra table?

I am using spark-sql-2.4.1 ,spark-cassandra-connector_2.11-2.4.1 with java8 and apache cassandra 3.0 version.
I have my spark-submit or spark cluster enviroment as below to load 2 billion records.
--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g
I am using Cassandra 6 node cluster with below settings :
cassandra.output.consistency.level=ANY
cassandra.concurrent.writes=1500
cassandra.output.batch.size.bytes=2056
cassandra.output.batch.grouping.key=partition
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
cassandra.connection.keep_alive_ms=30000
cassandra.read.timeout_ms=600000
I am loading using spark dataframe into cassandra tables.
After reading into spark data set I am grouping by on certain columns as below.
Dataset<Row> dataDf = //read data from source i.e. hdfs file which are already partitioned based "load_date", "fiscal_year" , "fiscal_quarter" , "id", "type","type_code"
Dataset<Row> groupedDf = dataDf.groupBy("id","type","value" ,"load_date","fiscal_year","fiscal_quarter" , "create_user_txt", "create_date")
groupedDf.write().format("org.apache.spark.sql.cassandra")
.option("table","product")
.option("keyspace", "dataload")
.mode(SaveMode.Append)
.save();
Cassandra table(
PRIMARY KEY (( id, type, value, item_code ), load_date)
) WITH CLUSTERING ORDER BY ( load_date DESC )
Basically I am groupBy "id","type","value" ,"load_date" columns. As the other columns ( "fiscal_year","fiscal_quarter" , "create_user_txt", "create_date") should be available for storing into cassandra table I have to include them also in the groupBy clause.
1) Frankly speaking I dont know how to get those columns after groupBy
into resultant dataframe i.e groupedDf to store. Any advice here
to how to tackle this please ?
2) With above process/steps , my spark job for loading is pretty slow due to lot of shuffling i.e. read shuffle and write shuffle processes.
What should I do here to improve the speed ?
While reading from source (into dataDf) do I need to do anything here to improve performance? This is already partitioned.
Should I still need to do any partitioning ? If so , what is the best way/approach given the above cassandra table?
HDFS file columns
"id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date","last_update_date","create_user_txt","update_user_txt"
Pivoting
I am using groupBy due to pivoting as below
Dataset<Row> pivot_model_vals_unpersist_df = model_vals_df.groupBy("id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date")
.pivot("type_code" )
.agg( first(//business logic)
)
)
Please advice.
Your advice/feedback are highly thankful.
So, as I got from comments your task is next:
Take 2b rows from HDFS.
Save this rows into Cassandra with some conversion.
Schema of Cassandra table is not the same as schema of HDFS dataset.
At first, you definitely don't need group by. GROUP BY doesn't group columns, it group rows invoking some aggregate function like sum, avg, max, etc. Semantic is similar to SQL "group by", so it's no your case. What you really need - make your "to save" dataset fit into desired Cassandra schema.
In Java this is a little bit trickier than in Scala. At first I suggest to define a bean that would represent a Cassandra row.
public class MyClass {
// Remember to declare no-args constructor
public MyClass() { }
private Long id;
private String type;
// another fields, getters, setters, etc
}
Your dataset is Dataset, you need to convert it into JavaRDD. So, you need a convertor.
public class MyClassFabric {
public static MyClass fromRow(Row row) {
MyClass myClass = new MyClass();
myClass.setId(row.getInt("id"));
// ....
return myClass;
}
}
In result we would have something like this:
JavaRDD<MyClass> rdd = dataDf.toJavaRDD().map(MyClassFabric::fromRow);
javaFunctions(rdd).writerBuilder("keyspace", "table",
mapToRow(MyClass.class)).saveToCassandra();
For additional info you can take a look https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md

Spark SQL returns null for a column in HIVE table while HIVE query returns non null values

I have a hive table created on top of s3 DATA in parquet format and partitioned by one column named eventdate.
1) When using HIVE QUERY, it returns data for a column named "headertime" which is in the schema of BOTH the table and the file.
select headertime from dbName.test_bug where eventdate=20180510 limit 10
2) FROM a scala NOTEBOOK , when directly loading a file from a particular partition that also works,
val session = org.apache.spark.sql.SparkSession.builder
.appName("searchRequests")
.enableHiveSupport()
.getOrCreate;
val searchRequest = session.sqlContext.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510")
searchRequest.createOrReplaceTempView("SearchRequest")
val exploreDF = session.sql("select headertime from SearchRequest where SearchRequestHeaderDate='2018-05-10' limit 100")
exploreDF.show(20)
this also displays the values for the column "headertime"
3) But, when using spark sql to query directly the HIVE table as below,
val exploreDF = session.sql("select headertime from tier3_vsreenivasan.test_bug where eventdate=20180510 limit 100")
exploreDF.show(20)
it keeps returning null always.
I opened the parquet file and see that the column headertime is present with values, but not sure why spark SQL is not able to read the values for that column.
it will be helpful if someone can point out from where the spark SQL gets the schema? I was expecting it to behave similar to the HIVE QUERY

How to stop load the whole table in spark?

The thing is, I have read right to one table,which is partition by year month and day.But I don't have right read the data from 2016/04/24.
when I execute in Hive command:
hive>select * from table where year="2016" and month="06" and day="01";
I CAN READ OTHER DAYS' DATA EXCEPT 2016/04/24
But,when I read in spark
sqlContext.sql.sql(select * from table where year="2016" and month="06" and day="01")
exceptition is throwable That I dont have the right to hdfs/.../2016/04/24
THIS SHOW SPARK SQL LOAD THE WHOLE TABLE ONCE AND THEN FILTER?
HOW CAN I AVOID LOAD THE WHOLE TABLE?
You can use JdbcRDDs directly. With it you can bypass spark sql engine therefore your queries will be directly sent to hive.
To use JdbcRDD you need to create hive driver and register it first (of course it is not registered already).
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD;
val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
sc,
() => DriverManager.getConnection(connUrl),
query,
lowerBound,
upperBound,
numOfPartitions,
(r: ResultSet) => (r.getString(1) /** get data here or with a function**/)
)
JdbcRDD query must have two ? in order to create partition your data. So you should write a better query than me. This just creates one partition to demonstrate how it works.
However, before doing this I recommend you to check HiveContext. This supports HiveQL as well. Check this.

PHOENIX SPARK - Load Table as DataFrame

I have created a DataFrame from a HBase Table (PHOENIX) which has 500 million rows. From the DataFrame I created an RDD of JavaBean and use it for joining with data from a file.
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();
JavaRDD<Row> tableRows = df.toJavaRDD();
JavaPairRDD<String, AccountModel> dbData = tableRows.mapToPair(
new PairFunction<Row, String, String>()
{
#Override
public Tuple2<String, String> call(Row row) throws Exception
{
return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
}
});
Now my question - Lets say the file has 2 unique million entries matching with the table. Is the entire table loaded into memory as RDD or only the matching 2 million records from the table will be loaded into memory as RDD ?
Your statement
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap)
.load();
will load the entire table into memory. You have not provided any filter for phoenix to push down into hbase - and thus reduce the number of rows read.
If you do a join to a non-HBase datasource - e.g a flat file - then all of the records from the hbase table would first need to be read in. The records not matching the secondary data source would not be saved in the new DataFrame - but the initial reading would have still happened.
Update A potential approach would be to pre-process the file - i.e. extracting the id's you want. Store the results into a new HBase table. Then perform the join directly in HBase via Phoenix not Spark .
The rationale for that approach is to move the computation to the data. The bulk of the data resides in HBase - so then move the small data (id's in the files) to there.
I am not familiar directly with Phoenix except that it provides a sql layer on top of hbase. Presumably then it would be capable of doing such a join and storing the result in a separate HBase table ..? That separate table could then be loaded into Spark to be used in your subsequent computations.

How to control Spark JavaRDD<MyTable> to take specific n rows?

I am having my own data structure called MyTable which is kind of columnar data store format table. Now I want to use Spark to create myTable in distributed environment as my datasets are in HDFS. I have used Spark earlier and I am familiar with it.
I am not able to figure out how can we control JavaRDD to take n rows. Here n could be 80k, 90k rows etc. If you see the following JavaRDD will always create one row MyTable, how do I create MyTable with n rows
JavaRDD<MyTable> rdd_records = sc.textFile("/path/to/hdfs").map(
new Function<String, MyTable>() {
public MyTable call(String line) throws Exception {
String[] fields = line.split(",");
Record record = create Record from above fields
MyTable table = new MyTable();
return table.append(record);
}
});
If I know how to command RDD to take certain no of rows then I can use it to create MyTable in distributed way.
when you load data using sc.textfile, spark automatically splits data on newlinesand puts them to partitions. So, what you need to do is a custom partitioning using your params (80k thing). Then you can use partitionBy on RDD. After that, you should be using mapPartitions instead of map to generate your data structures of Rows.
One advice, this seems a case to use Dataframes. If you are on 1.3, you take a look. It does converting tuples to schema in distributed way already

Resources