I am trying to delete HBase rows using Spark in Java. The input to the code below is tablekeys, a list of HBase row keys.
The code below gives a compile-time error:
**Unhandled exception: java.lang.Exception**
tablekeys.parallelStream().forEach(key -> {
    Delete d = new Delete(Bytes.toBytes(key));
    listOfBatchDelete.add(d);
    if (listOfBatchDelete.size() >= 10000) {
        try {
            table.delete(listOfBatchDelete);
            listOfBatchDelete.clear(); // Clear the list of rows after deleting 10000 records.
        } catch (Exception e) {
            throw new Exception("DeleteRowsFromHbase :: Delete failed " + e);
        }
    }
});
Previously the code scanned the HBase table (a huge operation, since the table has millions of rows): it built a filter from the key list passed to it, and a scanner fetched each matching row and passed it to Delete.
Now I have changed the code to read the keys directly from a file and pass only the key to Delete.
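A minimal sketch of the same batching logic, assuming table is an org.apache.hadoop.hbase.client.Table and tablekeys a List<String>: a plain sequential loop sidesteps both the checked-exception restriction inside the lambda and the mutable list shared across parallelStream threads.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteRowsFromHbase {
    // Illustration only: flush deletes in batches of 10000 from a plain loop so the
    // checked IOException thrown by Table.delete can propagate normally.
    public static void deleteKeys(Table table, List<String> tablekeys) throws IOException {
        List<Delete> batch = new ArrayList<>();
        for (String key : tablekeys) {
            batch.add(new Delete(Bytes.toBytes(key)));
            if (batch.size() >= 10000) {
                table.delete(batch); // flush a full batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            table.delete(batch); // flush the final partial batch
        }
    }
}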
Related
I have a large data file (10 GB or more) with 150 columns, and we need to validate every value (datatype/format/null/domain value/primary key ...) against different rules, finally producing two output files: one with the successful rows and one with the error rows plus error details. A row should go to the error file as soon as any column fails; there is no need to validate it further.
I am reading the file into a Spark DataFrame. Should we validate it column-wise or row-wise, and which way gives the best performance?
To answer your question:
I am reading the file into a Spark DataFrame. Should we validate it column-wise or row-wise, and which way gives the best performance?
A DataFrame is a distributed collection of data organized as a set of rows spread across the cluster, and most of the transformations defined in Spark are applied to rows, operating on the Row object. Row-wise validation therefore fits Spark's execution model best.
Pseudocode:
import spark.implicits._

// Infer the schema once, e.g. from the CSV file itself
val schema = spark.read.csv(inputFile).schema

val validated = spark.read.csv(inputFile).map(row => {
  // Walk the fields in order and keep the first validation error, if any
  val firstError: Option[String] = schema.fields.toStream.flatMap(f => {
    // f.dataType             // field type: apply custom logic per type
    // f.name                 // column name
    // row.getAs[Any](f.name) // field value: run datatype/format/null/domain checks here
    // return Some("error_info") when the value fails validation, None otherwise
    None: Option[String]
  }).headOption

  (row.mkString(","), firstError.getOrElse(""), firstError.isEmpty)
})

validated.filter(x => x._3).write.save(correctLoc)  // rows that passed every check
validated.filter(x => !x._3).write.save(errorLoc)   // rows with the first error recorded
The following function saves data in Cassandra. It calls the abstract rowToModel method, defined in the same class, to convert the data returned from Cassandra into the corresponding data model.
def saveDataToDatabase(data: M): Option[M] = { //TODOM - should this also be part of the Repository trait?
  println("inserting in table " + tablename + " with partition key " + partitionKeyColumns + " and values " + data)
  val insertQuery = insertValues(tablename, data)
  println("insert query is " + insertQuery)
  try {
    val resultSet: ResultSet = session.execute(insertQuery) // execute can take a Statement. Insert is derived from Statement so I can use Insert.
    println("resultset after insert: " + resultSet)
    println("resultset applied: " + resultSet.wasApplied())
    println(s"columns definition ${resultSet.getColumnDefinitions}")
    if (resultSet.wasApplied()) {
      println(s"saved row ${resultSet.one()}")
      val savedData = rowToModel(resultSet.one())
      Some(savedData)
    } else {
      None
    }
  } catch {
    case e: Exception => {
      println("cassandra exception " + e)
      None
    }
  }
}
The abstract rowToModel is overridden as follows:
override def rowToModel(row: Row): PracticeQuestionTag = {
PracticeQuestionTag(row.getLong("year"),row.getLong("month"),row.getLong("creation_time_hour"),
row.getLong("creation_time_minute"),row.getUUID("question_id"),row.getString("question_description"))
}
But the print statements I have defined in saveDataToDatabase are not printing the data I expect. When I print one from the ResultSet, I expect to see something like PracticeQuestionTag(2018,6,1,1,11111111-1111-1111-1111-111111111111,some description1). But what I see is
resultset after insert: ResultSet[ exhausted: false, Columns[[applied](boolean)]]
resultset applied: true
columns definition Columns[[applied](boolean)]
saved row Row[true]
row to Model called for row null
cassandra exception java.lang.NullPointerException
Why are ResultSet, one and getColumnDefinitions not showing me the values from the data model?
This is by design. The ResultSet of an INSERT only contains a single row, which tells whether the statement was applied or not.
When executing a conditional statement, the ResultSet will contain a
single Row with a column named “applied” of type boolean. This tells
whether the conditional statement was successful or not.
It also makes sense: a ResultSet is supposed to return the result of the query, and there is no reason to make the result set object heavy by returning all the inputs in it. More details can be found here.
Of course, read (SELECT) queries will return a detailed result set.
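As a hedged illustration of that point (Datastax Java driver; the table name and key columns below are assumptions, not taken from the question), the saved values have to be fetched back with a separate SELECT, because the INSERT's ResultSet only carries the [applied] column:
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReadAfterInsert {
    // Sketch only: "practice_question_tag" and the (year, month) key are hypothetical.
    public static Row readBack(Session session, long year, long month) {
        // The INSERT's ResultSet only contains the boolean [applied] column,
        // so the stored row is read back with a SELECT instead.
        ResultSet rs = session.execute(
                "SELECT * FROM practice_question_tag WHERE year = ? AND month = ?",
                year, month);
        return rs.one(); // this Row can then be passed to rowToModel
    }
}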
We have an HDInsight cluster running HBase (Ambari)
We have created a table using Phoenix:
CREATE TABLE IF NOT EXISTS Results (
    Col1 VARCHAR(255) NOT NULL,
    Col2 INTEGER NOT NULL,
    Col3 INTEGER NOT NULL,
    Destination VARCHAR(255) NOT NULL
    CONSTRAINT pk PRIMARY KEY (Col1, Col2, Col3)
) IMMUTABLE_ROWS=true
We have filled some data into this table (using some Java code).
Later, we decided we wanted to create a local index on the destination column, as follows:
CREATE LOCAL INDEX DESTINATION_IDX ON RESULTS (destination) ASYNC
We have run the index tool to fill the index as follows
hbase org.apache.phoenix.mapreduce.index.IndexTool --data-table
RESULTS --index-table DESTINATION_IDX --output-path
DESTINATION_IDX_HFILES
When we run queries that filter on the destination column, everything is OK. For example:
select /*+ NO_CACHE, SKIP_SCAN */ COL1,COL2,COL3,DESTINATION from
Results where COL1='data' AND DESTINATION='some value' ;
But if we do not use DESTINATION in the WHERE clause, we get a NullPointerException in BaseResultIterators.class (from phoenix-core-4.7.0-HBase-1.1.jar).
This exception is thrown only when we use the new local index. If we query ignoring the index, like this:
select /*+ NO_CACHE, SKIP_SCAN ,NO_INDEX */ COL1,COL2,COL3,DESTINATION from
Results where COL1='data' AND DESTINATION='some value' ;
we do not get the exception.
Showing some relevant code from the area where we get the exception:
...
catch (StaleRegionBoundaryCacheException e2) {
    // Catch only to try to recover from region boundary cache being out of date
    if (!clearedCache) { // Clear cache once so that we rejigger job based on new boundaries
        services.clearTableRegionCache(physicalTableName);
        context.getOverallQueryMetrics().cacheRefreshedDueToSplits();
    }
    // Resubmit just this portion of work again
    Scan oldScan = scanPair.getFirst();
    byte[] startKey = oldScan.getAttribute(SCAN_ACTUAL_START_ROW);
    byte[] endKey = oldScan.getStopRow();
    // ==================== Note: isLocalIndex is true here ====================
    if (isLocalIndex) {
        endKey = oldScan.getAttribute(EXPECTED_UPPER_REGION_KEY);
        // endKey is null for some reason at this point, and the next call
        // fails inside it with an NPE
    }
    List<List<Scan>> newNestedScans = this.getParallelScans(startKey, endKey);
We must use this version of the jar since we run inside Azure HDInsight and cannot select a newer jar version.
Any ideas how to solve this?
What does "recover from region boundary cache being out of date" mean? It seems to be related to the problem.
It appears that the version of phoenix-core that Azure HDInsight ships (phoenix-core-4.7.0.2.6.5.3004-13.jar) has the bug, but with a slightly newer version (phoenix-core-4.7.0.2.6.5.8-2.jar, from http://nexus-private.hortonworks.com:8081/nexus/content/repositories/hwxreleases/org/apache/phoenix/phoenix-core/4.7.0.2.6.5.8-2/) we no longer see the bug.
Note that it is not possible to take a much newer version like 4.8.0, since in that case the server throws a version mismatch error.
I need to insert records into Cassandra, so I wrote a function whose input is a CSV file. Say the CSV file's name is test.csv. In Cassandra I have a table test, and I need to store each row of the CSV file in that table. Since I am using the Spark Java API, I am also creating a POJO (DTO) class to map the fields of the POJO to the columns of Cassandra.
The problem is that test.csv has some 50 comma-separated values that have to be stored in 50 columns of the test table in Cassandra, which has a total of 400 columns. So in my TestPojo class I created a constructor with those 50 fields.
JavaRDD<String> fileRdd = ctx.textFile("home/user/test.csv");

JavaRDD<Object> fileObjectRdd = fileRdd.map(
    new Function<String, Object>() {
        @Override
        public Object call(String line) throws Exception {
            // do some transformation with the data
            switch (fileName) {
                case "test": return new TestPojo(1, 3, 4 /* ... 50 fields in total */); // calling the constructor with 50 fields
                default: return null;
            }
        }
    });

switch (fileName) {
    case "test":
        javaFunctions(fileObjectRdd).writerBuilder("testKeyspace", "test", mapToRow(TestPojo.class)).saveToCassandra();
        break;
}
So here I am always returning a TestPojo object for each row of the test.csv file, giving an RDD of objects. Once that is done, I save that RDD to the Cassandra table test using the TestPojo mapping.
My problem is: if in the future test.csv has, say, 60 columns, my code will not work, because I am invoking the constructor with only 50 fields.
My question is: how do I create a constructor with all 400 fields in TestPojo, so that no matter how many fields test.csv has, my code can handle it?
I tried to create a general constructor with all 400 fields but ended up with a compilation error saying the limit is 255 parameters for a constructor.
Or is there a better way to handle this use case?
Question 2: what if the data from test.csv goes to multiple tables in Cassandra, say 5 columns of test.csv go to the test table and 5 other columns go to the test2 table?
The problem here is that when I am doing
JavaRDD<Object> fileObjectRdd = fileRdd.map(
    new Function<String, Object>() {
        @Override
        public Object call(String line) throws Exception {
            // do some transformation with the data
            switch (fileName) {
                case "test": return new TestPojo(1, 3, 4 /* ... 50 fields in total */); // calling the constructor with 50 fields
                default: return null;
            }
        }
    });
I am returning only one object, a TestPojo. If the data from test.csv goes to both the test table and the test2 table, I will need to return two objects: one TestPojo and one Test2Pojo.
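One hedged sketch of an alternative (plain Java, no connector specifics; the header array, column names and per-table constructors below are assumptions for illustration): parse each line into a column-name-to-value map, then build whichever small per-table POJOs are needed from that map. This sidesteps the 255-parameter constructor limit and also covers question 2, since the same map can feed both TestPojo and Test2Pojo.
import java.util.HashMap;
import java.util.Map;

public class CsvRowParser {

    // Turn one CSV line into a column-name -> value map instead of calling a huge constructor.
    public static Map<String, String> toColumnMap(String[] header, String line) {
        String[] values = line.split(",", -1); // -1 keeps trailing empty fields
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < header.length && i < values.length; i++) {
            row.put(header[i], values[i]);
        }
        return row;
    }

    // Each table gets its own small POJO built from the subset of columns it needs,
    // so adding columns to test.csv does not require a bigger constructor.
    // The column names used here are hypothetical.
    public static TestPojo toTestPojo(Map<String, String> row) {
        return new TestPojo(row.get("col1"), row.get("col2") /* , ... the columns the test table needs */);
    }

    public static Test2Pojo toTest2Pojo(Map<String, String> row) {
        return new Test2Pojo(row.get("colA"), row.get("colB") /* , ... the columns the test2 table needs */);
    }
}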
I am new to CQL & composite keys (I previously used CLI)
I am looking to implement my old super-column-family with composite keys instead.
In short, my look-up model is:
blocks[file_id][position][block_id]=size
I have the following CQL table with composite keys:
CREATE TABLE blocks (
file_id text,
start_position bigint,
block_id text,
size bigint,
PRIMARY KEY (file_id, start_position,block_id)
);
I insert these sample values:
/*Example insertions*/
INSERT INTO blocks (file_id, start_position, block_id,size) VALUES ('test_schema_file', 0, 'testblock1', 500);
INSERT INTO blocks (file_id, start_position, block_id,size) VALUES ('test_schema_file', 500, '2testblock2', 501);
I query using this Astyanax code:
OperationResult<ColumnList<BlockKey>> result = m_keyspace.prepareQuery(m_BlocksTable).getKey(file).execute();
ColumnList<BlockKey> columns = result.getResult();
for (Column<BlockKey> column : columns) {
    System.out.println(StaticUtils.fieldsToString(column.getName()));
    try {
        long value = column.getLongValue();
        System.out.println(value);
    } catch (Exception e) {
        System.out.println("Can't get size");
    }
}
When I iterate over the result, I get two entries for each row I inserted: one that contains a "size" and one where the "size" column doesn't exist.
recorder.data.models.BlockKey Object {
m_StartPosition: 0
m_BlockId: testblock1
m_Extra: null
}
Can't get size
recorder.data.models.BlockKey Object {
m_StartPosition: 0
m_BlockId: testblock1
m_Extra: size
}
500
recorder.data.models.BlockKey Object {
m_StartPosition: 500
m_BlockId: 2testblock2
m_Extra: null
}
Can't get size
recorder.data.models.BlockKey Object {
m_StartPosition: 500
m_BlockId: 2testblock2
m_Extra: size
}
501
So I have two questions:
Theoretically I do not need a size column; it should be the value of the composite key: blocks[file_id][position][block_id]=size instead of blocks[file_id][position][block_id]['size']=size. How do I correctly insert this data in CQL3 without creating the redundant size column?
Why am I getting the extra column without 'size', if I never inserted such a row?
The 'duplicates' are because, with CQL, there are extra thrift columns inserted to store extra metadata. With your example, from cassandra-cli you can see what's going on:
[default@ks1] list blocks;
-------------------
RowKey: test_schema_file
=> (column=0:testblock1:, value=, timestamp=1373966136246000)
=> (column=0:testblock1:size, value=00000000000001f4, timestamp=1373966136246000)
=> (column=500:2testblock2:, value=, timestamp=1373966136756000)
=> (column=500:2testblock2:size, value=00000000000001f5, timestamp=1373966136756000)
If you insert data with CQL, you should query with CQL too. You can do this with Astyanax by using m_keyspace.prepareCqlStatement().withCql("SELECT * FROM blocks").execute();.