PHOENIX SPARK - Load Table as DataFrame - apache-spark

I have created a DataFrame from an HBase table (via Phoenix) which has 500 million rows. From the DataFrame I created an RDD of JavaBeans and use it for joining with data from a file.
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);

DataFrame df = sqlContext.read()
        .format("org.apache.phoenix.spark")
        .options(phoenixInfoMap)
        .load();

JavaRDD<Row> tableRows = df.toJavaRDD();

JavaPairRDD<String, String> dbData = tableRows.mapToPair(
    new PairFunction<Row, String, String>()
    {
        @Override
        public Tuple2<String, String> call(Row row) throws Exception
        {
            return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
        }
    });
Now my question - let's say the file has 2 million unique entries matching the table. Is the entire table loaded into memory as an RDD, or will only the matching 2 million records from the table be loaded into memory as an RDD?

Your statement
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap)
.load();
will load the entire table into memory. You have not provided any filter for Phoenix to push down into HBase, which would reduce the number of rows read.
If you do a join to a non-HBase data source, e.g. a flat file, then all of the records from the HBase table first need to be read in. The records that do not match the secondary data source will not be kept in the new DataFrame, but the initial read will still have happened.
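For illustration only, a minimal sketch of a pushed-down filter (the STATUS column is made up, and whether a given predicate is actually pushed down depends on the phoenix-spark connector version):

import static org.apache.spark.sql.functions.col;

// Sketch: applying a filter to the Phoenix-backed DataFrame before any join
// gives the connector a chance to push the predicate down to HBase, so far
// fewer of the 500 million rows are scanned. STATUS is an assumed column name.
DataFrame filtered = sqlContext.read()
        .format("org.apache.phoenix.spark")
        .options(phoenixInfoMap)
        .load()
        .filter(col("STATUS").equalTo("ACTIVE"));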
Update: a potential approach would be to pre-process the file, i.e. extract the IDs you want, and store the results in a new HBase table. Then perform the join directly in HBase via Phoenix, not Spark.
The rationale for that approach is to move the computation to the data. The bulk of the data resides in HBase, so move the small data (the IDs from the file) to it.
I am not directly familiar with Phoenix, except that it provides a SQL layer on top of HBase. Presumably it would then be capable of doing such a join and storing the result in a separate HBase table..? That separate table could then be loaded into Spark and used in your subsequent computations.
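To make the idea concrete, a purely hypothetical sketch of what that Phoenix-side join might look like over plain JDBC, assuming the file's IDs have already been loaded into a FILE_IDS table and the result should land in a MATCHED_ACCOUNTS table (all table and column names here are made up):

import java.sql.Connection;
import java.sql.DriverManager;

try (Connection conn = DriverManager.getConnection("jdbc:phoenix:" + zkURL)) {
    // Join the small ID table against the big table inside Phoenix/HBase and
    // materialise the result into a separate table (names are assumptions).
    conn.createStatement().executeUpdate(
        "UPSERT INTO MATCHED_ACCOUNTS (ID, NAME) " +
        "SELECT a.ID, a.NAME FROM ACCOUNTS a JOIN FILE_IDS f ON a.ID = f.ID");
    conn.commit();
}

The resulting MATCHED_ACCOUNTS table could then be loaded into Spark exactly as in the original code, but it would only contain the matching rows.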

Related

Spark Cassandra write Dataframe, how to find which keys already exist in database during insertion

I have written the following Java method to persist data for multiple POJOs to an Apache Cassandra database through Apache Spark.
This seems to work OK; however, Spark does not provide any information on whether the records were inserted (the keys did not exist in Cassandra) or updated (the keys already existed in the DB).
Is there a way, with minimal cost (I would like to avoid loading the contents of the table into a dataframe and checking for duplicate keys), to find out at insertion time which records already exist (have duplicate keys) in the DB?
The exact code is shown below:
@Service
public class WriteDB {

    @Autowired
    private SparkSession sparkSession;

    Logger LOG = LoggerFactory.getLogger(WriteDB.class);

    public <T> void uploadData(List<T> objects, Class<T> clazz, String keyspaceName, String tableName) {
        LOG.info("Number of records to be committed to database: " + objects.size());

        // Create dataset from entity objects
        Dataset<Row> df = sparkSession.createDataFrame(objects, clazz);

        // Write data from Spark dataframe to Cassandra schema
        df.write().mode(SaveMode.Append).format("org.apache.spark.sql.cassandra").options(new HashMap<String, String>() {{
            put("keyspace", keyspaceName);
            put("table", tableName);
        }}).save();

        LOG.info("Records committed");
    }
}
In Cassandra, everything is an upsert: there is no distinction between inserts and updates. Cassandra doesn't check whether the data already exists when inserting or updating (except for LWTs); it just adds the data, and the previous copies are removed during compaction.
The only way to achieve your task is to load the data from the table. With the Dataframe API this is done at the Spark level by reading the whole table into a Dataframe and then joining; with the RDD API you can use joinWithCassandraTable or leftJoinWithCassandraTable (see the docs).
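For the Dataframe route described above (which, as noted, is exactly the costly table scan you want to avoid), a rough sketch could look like the following; a single partition key column named id is an assumption:

// Read only the key column from Cassandra and join it with the batch being written.
Dataset<Row> existingKeys = sparkSession.read()
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", keyspaceName)
        .option("table", tableName)
        .load()
        .select("id");

// Rows whose key already exists (these will be overwritten by the upsert)...
Dataset<Row> willBeUpdated = df.join(existingKeys, "id");

// ...and rows whose key is new.
Dataset<Row> willBeInserted =
        df.join(existingKeys, df.col("id").equalTo(existingKeys.col("id")), "left_anti");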

Filter Partition Before Reading Hive table (Spark)

Currently I'm trying to filter a Hive table by the latest date_processed.
The table is partitioned by:
System
date_processed
Region
The only way I've managed to filter it is by doing a join query:
query = "select * from contracts_table as a join (select max(date_processed) as maximum from contracts_table) as b on a.date_processed = b.maximum"
This approach is really time-consuming, as I have to do the same procedure for 25 tables.
Does anyone know a way to directly read the latest loaded partition of a table in Spark < 1.6?
This is the method I'm using to read.
public static DataFrame loadAndFilter(String query)
{
    return SparkContextSingleton.getHiveContext().sql(query);
}
Many thanks!
A DataFrame with all of the table's partitions can be obtained with:
val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")
The partition values can then be parsed to find the maximum.
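A sketch of that parsing for the question's Spark < 1.6 setup; it assumes the partition strings come back as a single column formatted like system=X/date_processed=2017-01-31/region=Y and that the dates compare correctly as strings:

DataFrame partitionsDF = SparkContextSingleton.getHiveContext().sql("show partitions contracts_table");

// Walk the partition strings and keep the largest date_processed value.
String maxDate = "";
for (Row row : partitionsDF.collect()) {
    for (String part : row.getString(0).split("/")) {
        if (part.startsWith("date_processed=")) {
            String d = part.substring("date_processed=".length());
            if (d.compareTo(maxDate) > 0) {
                maxDate = d;
            }
        }
    }
}

// Read only the latest partition; the predicate prunes partitions at read time.
DataFrame latest = SparkContextSingleton.getHiveContext()
        .sql("select * from contracts_table where date_processed = '" + maxDate + "'");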

Keeping data together in spark based on cassandra table partition key

When loading data from a Cassandra table, a Spark partition represents all rows with the same partition key. However, when I create data in Spark with the same partition key and repartition the new RDD using the .repartitionByCassandraReplica(..) method, it ends up in a different Spark partition. How do I achieve consistent partitions in Spark using the partitioning scheme defined by the Spark-Cassandra connector?
Links to download the CQL and Spark job code that I tested:
CQL with the keyspace and table schema.
Spark job and other classes.
Version and other information
Spark : 1.3
Cassandra : 2.1
Connector : 1.3.1
Spark nodes (5) and Cassandra cluster nodes (4) run in different data centers
Code extract. Download the code using the links above for more details.
Step 1 : Loads data into 8 spark partitions
Map<String, String> map = new HashMap<String, String>();
CassandraTableScanJavaRDD<TestTable> tableRdd = javaFunctions(conf)
.cassandraTable("testkeyspace", "testtable", mapRowTo(TestTable.class, map));
Step 2 : Repartition data into 8 partitions
.repartitionByCassandraReplica(
"testkeyspace",
"testtable",
partitionNumPerHost,
someColumns("id"),
mapToRow(TestTable.class, map));
Step 3: Print partition id and values for both rdds
rdd.mapPartitionsWithIndex(...{
    @Override
    public Iterator<String> call(..) throws Exception {
        List<String> list = new ArrayList<String>();
        list.add("PartitionId-" + integer);
        while (itr.hasNext()) {
            TestTable value = itr.next();
            list.add(Integer.toString(value.getId()));
        }
        return list.iterator();
    }
}, true).collect();
Step 4 : Snapshot of the results printed for partition 1. The values differ between the two RDDs, but they were expected to be the same.
Load Rdd values
----------------------------
Table load - PartitionId -1
----------------------------
15
22
--------------------------------------
Repartitioned values - PartitionId -1
--------------------------------------
33
16
repartitionByCassandraReplica does not place keys deterministically. There is currently a ticket open to change that:
https://datastax-oss.atlassian.net/projects/SPARKC/issues/SPARKC-278
A workaround for now is to set the partitions-per-host parameter (partitionNumPerHost in the code above) to 1.
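A minimal sketch of that workaround, mirroring the Step 2 fragment from the question with the partitions-per-host argument fixed to 1 (dataRdd stands in for whatever RDD the original call was chained onto):

JavaRDD<TestTable> repartitioned = javaFunctions(dataRdd)
        .repartitionByCassandraReplica(
                "testkeyspace",
                "testtable",
                1,                              // partitions per host, per the workaround
                someColumns("id"),
                mapToRow(TestTable.class, map));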

How to control Spark JavaRDD<MyTable> to take specific n rows?

I have my own data structure called MyTable, which is a kind of columnar data-store format table. Now I want to use Spark to create MyTable in a distributed environment, as my datasets are in HDFS. I have used Spark before and am familiar with it.
I am not able to figure out how we can control a JavaRDD to take n rows, where n could be 80k, 90k rows, etc. As you can see below, the JavaRDD will always create a MyTable with one row; how do I create a MyTable with n rows?
JavaRDD<MyTable> rdd_records = sc.textFile("/path/to/hdfs").map(
    new Function<String, MyTable>() {
        public MyTable call(String line) throws Exception {
            String[] fields = line.split(",");
            Record record = createRecord(fields); // build a Record from the fields (pseudocode)
            MyTable table = new MyTable();
            return table.append(record);
        }
    });
If I knew how to tell the RDD to take a certain number of rows, I could use it to create MyTable in a distributed way.
When you load data using sc.textFile, Spark automatically splits the data on newlines and puts the lines into partitions. So what you need to do is custom partitioning using your parameters (the 80k-row requirement), which you can do with partitionBy on the RDD. After that, use mapPartitions instead of map to generate your data structures from the rows, as sketched below.
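A sketch of that last step with the Spark 1.x Java API; the Record constructor taking the split fields is a placeholder, and the sizing of each partition (the 80k/90k requirement) is assumed to have been handled by the custom partitioning step:

import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<MyTable> tables = sc.textFile("/path/to/hdfs")
        .mapPartitions(new FlatMapFunction<Iterator<String>, MyTable>() {
            public Iterable<MyTable> call(Iterator<String> lines) throws Exception {
                // Build one MyTable per partition instead of one per line.
                MyTable table = new MyTable();
                while (lines.hasNext()) {
                    String[] fields = lines.next().split(",");
                    table.append(new Record(fields)); // hypothetical Record constructor
                }
                return Collections.singletonList(table);
            }
        });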
One piece of advice: this seems like a case for DataFrames. If you are on 1.3, take a look; they already handle converting tuples to a schema in a distributed way.

Can data be loaded in Apache Spark RDD/Dataframe on the fly?

Can data be loaded on the fly, or does it have to be pre-loaded into the RDD/DataFrame?
Say I have a SQL database and I use the JDBC source to load 1,000,000 records into an RDD. If, for example, a new record comes into the DB, can I write a job that adds that one new record to the RDD/DataFrame to make it 1,000,001? Or does the entire RDD/DataFrame have to be rebuilt?
I guess it depends on what you mean by adding a record and by rebuilding. It is possible to use SparkContext.union or RDD.union to merge RDDs, and DataFrame.unionAll to merge DataFrames.
As long as the RDDs being merged use the same serializer there is no need for reserialization, but if the same partitioner is used for both, it will require repartitioning.
Using JDBC source as an example:
import org.apache.spark.sql.functions.{max, lit}
import sqlContext.implicits._ // needed for the $"id" column syntax
val pMap = Map("url" -> "jdbc:..", "dbtable" -> "test")
// Load first batch
val df1 = sqlContext.load("jdbc", pMap).cache
// Get max id and trigger cache
val maxId = df1.select(max($"id")).first().getInt(0)
// Some inserts here...
// Get new records
val dfDiff = sqlContext.load("jdbc", pMap).where($"id" > lit(maxId))
// Combine - only dfDiff has to be fetched
// Should be cached as before
df1.unionAll(dfDiff)
If you need an updatable data structure, IndexedRDD implements a key-value store on Spark.
