Azure HDInsight cluster with HBase + Phoenix using Local Index - azure

We have an HDInsight cluster running HBase (Ambari).
We have created a table using Phoenix:
CREATE TABLE IF NOT EXISTS Results (
    Col1 VARCHAR(255) NOT NULL,
    Col2 INTEGER NOT NULL,
    Col3 INTEGER NOT NULL,
    Destination VARCHAR(255) NOT NULL
    CONSTRAINT pk PRIMARY KEY (Col1, Col2, Col3)
) IMMUTABLE_ROWS=true
We have filled some data into this table (using some Java code).
Later, we decided to create a local index on the Destination column as follows:
CREATE LOCAL INDEX DESTINATION_IDX ON RESULTS (destination) ASYNC
We have run the IndexTool to populate the index as follows:
hbase org.apache.phoenix.mapreduce.index.IndexTool --data-table
RESULTS --index-table DESTINATION_IDX --output-path
DESTINATION_IDX_HFILES
When we run queries that filter on the Destination column, everything is OK. For example:
select /*+ NO_CACHE, SKIP_SCAN */ COL1,COL2,COL3,DESTINATION from
Results where COL1='data' AND DESTINATION='some value' ;
But if we do not use DESTINATION in the WHERE clause, we get a NullPointerException in BaseResultIterators.class (from phoenix-core-4.7.0-HBase-1.1.jar).
This exception is thrown only when we use the new local index. If we query ignoring the index like this:
select /*+ NO_CACHE, SKIP_SCAN, NO_INDEX */ COL1,COL2,COL3,DESTINATION from
Results where COL1='data' AND DESTINATION='some value' ;
we do not get the exception.
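For reference, a sketch of issuing such a query over the Phoenix JDBC driver from Java, keeping the NO_INDEX hint as a temporary workaround (the ZooKeeper quorum in the URL is a placeholder, not our real cluster name):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NoIndexWorkaround {
    public static void main(String[] args) throws Exception {
        // Placeholder quorum; on HDInsight this would be the cluster's ZooKeeper nodes.
        String url = "jdbc:phoenix:zk0-cluster,zk1-cluster,zk2-cluster:2181:/hbase-unsecure";
        try (Connection conn = DriverManager.getConnection(url)) {
            // NO_INDEX forces the scan over the data table and avoids the
            // local-index code path that throws the NullPointerException.
            String sql = "SELECT /*+ NO_CACHE, NO_INDEX */ COL1, COL2, COL3, DESTINATION "
                       + "FROM RESULTS WHERE COL1 = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "data");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("DESTINATION"));
                    }
                }
            }
        }
    }
}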
Here is some relevant code from the area where we get the exception:
...
catch (StaleRegionBoundaryCacheException e2) {
    // Catch only to try to recover from region boundary cache being out of date
    if (!clearedCache) { // Clear cache once so that we rejigger job based on new boundaries
        services.clearTableRegionCache(physicalTableName);
        context.getOverallQueryMetrics().cacheRefreshedDueToSplits();
    }
    // Resubmit just this portion of work again
    Scan oldScan = scanPair.getFirst();
    byte[] startKey = oldScan.getAttribute(SCAN_ACTUAL_START_ROW);
    byte[] endKey = oldScan.getStopRow();
    // ==================== Note: isLocalIndex is true here ====================
    if (isLocalIndex) {
        endKey = oldScan.getAttribute(EXPECTED_UPPER_REGION_KEY);
        // endKey is null at this point for some reason, and the next call
        // fails inside with the NPE
    }
    List<List<Scan>> newNestedScans = this.getParallelScans(startKey, endKey);
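For illustration only, a defensive guard at that spot might look like the sketch below; this is just our guess at a mitigation, not the actual fix contained in the newer Hortonworks build mentioned further down:

if (isLocalIndex) {
    byte[] expectedUpperKey = oldScan.getAttribute(EXPECTED_UPPER_REGION_KEY);
    // Hypothetical guard: only override endKey when the attribute is actually set,
    // otherwise keep the stop row taken from the old scan above.
    if (expectedUpperKey != null) {
        endKey = expectedUpperKey;
    }
}
List<List<Scan>> newNestedScans = this.getParallelScans(startKey, endKey);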
We must use this version of the jar since we run inside Azure HDInsight and we cannot select a newer jar version.
Any ideas how to solve this?
What does "recover from region boundary cache being out of date" mean? it seems to be related to the problem

It appears that the version of phoenix-core that Azure HDInsight ships (phoenix-core-4.7.0.2.6.5.3004-13.jar) has the bug, but if I use a slightly newer version (phoenix-core-4.7.0.2.6.5.8-2.jar, from http://nexus-private.hortonworks.com:8081/nexus/content/repositories/hwxreleases/org/apache/phoenix/phoenix-core/4.7.0.2.6.5.8-2/) we do not see the bug any more.
Note that it is not possible to take a much newer version like 4.8.0, since in that case the server throws a version mismatch error.

Related

DataStax Cassandra driver seems to cache PreparedStatement

When my application has been running for a long time, everything works fine. But when I change a column's type from int to text (by dropping and recreating the table), I get an exception:
com.datastax.oss.driver.api.core.type.codec.CodecNotFoundException: Codec not found for requested operation: [INT <-> java.lang.String]
at com.datastax.oss.driver.internal.core.type.codec.registry.CachingCodecRegistry.createCodec(CachingCodecRegistry.java:609)
at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry$1.load(DefaultCodecRegistry.java:95)
at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry$1.load(DefaultCodecRegistry.java:92)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2276)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache.get(LocalCache.java:3951)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache.getOrLoad(LocalCache.java:3973)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4957)
at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4963)
at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry.getCachedCodec(DefaultCodecRegistry.java:117)
at com.datastax.oss.driver.internal.core.type.codec.registry.CachingCodecRegistry.codecFor(CachingCodecRegistry.java:215)
at com.datastax.oss.driver.api.core.data.SettableByIndex.set(SettableByIndex.java:132)
at com.datastax.oss.driver.api.core.data.SettableByIndex.setString(SettableByIndex.java:338)
This exception appears occasionally. I'm using PreparedStatement to execute the query; I think it is cached by the DataStax driver.
I'm using AWS Keyspaces(Cassandra version 3.11.2), DataStax driver 4.6.
Here is my application.conf:
datastax-java-driver {
  basic.request {
    timeout = 5 seconds
    consistency = LOCAL_ONE
  }
  advanced.connection {
    max-requests-per-connection = 1024
    pool {
      local.size = 1
      remote.size = 1
    }
  }
  advanced.reconnect-on-init = true
  advanced.reconnection-policy {
    class = ExponentialReconnectionPolicy
    base-delay = 1 second
    max-delay = 60 seconds
  }
  advanced.retry-policy {
    class = DefaultRetryPolicy
  }
  advanced.protocol {
    version = V4
  }
  advanced.heartbeat {
    interval = 30 seconds
    timeout = 1 second
  }
  advanced.session-leak.threshold = 8
  advanced.metadata.token-map.enabled = false
}
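For reference, the 4.x driver picks this file up from the classpath; a minimal sketch of building a session against it might look like the following (the contact point and datacenter are placeholders, and the SSL setup that AWS Keyspaces requires is omitted):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.net.InetSocketAddress;

// Sketch only: contact point and datacenter are placeholders for an AWS Keyspaces endpoint.
CqlSession session = CqlSession.builder()
        .withConfigLoader(DriverConfigLoader.fromClasspath("application"))
        .addContactPoint(new InetSocketAddress("my-keyspaces-endpoint", 9142))
        .withLocalDatacenter("my-datacenter")
        .build();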
Yes, the Java driver 4.x caches prepared statements - this is a difference from driver 3.x. From the documentation:
the session has a built-in cache, it’s OK to prepare the same string twice.
...
Note that caching is based on: the query string exactly as you provided it: the driver does not perform any kind of trimming or sanitizing.
I'm not 100% sure about the source code, but the relevant entries in the cache may not be cleared on a table drop. I suggest opening a JIRA against the Java driver, although such type changes are often not really recommended - it's better to introduce a new field with the new type, even when it's possible to re-create the table.
That's correct. Prepared statements are cached -- it's the optimisation that makes prepared statements more efficient when they are reused, since they only need to be prepared once (the query doesn't need to be parsed again).
But I suspect that the underlying issue in your case is that your queries involve SELECT *. The best practice recommendation (regardless of the database you're using) is to explicitly enumerate the columns you are retrieving from the table.
In the prepared statement, each of the columns is bound to a data type. When you alter the schema by adding/dropping columns, the order of the columns (and their data types) no longer matches the data types of the result set, so you end up in situations where the driver gets an int when it's expecting a text, or vice versa. Cheers!
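Following that recommendation with the 4.x Java driver, a sketch might look like this (the keyspace, table and column names are purely illustrative, since the original schema isn't shown; session is an existing CqlSession):

import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

// Illustrative only: enumerate the columns instead of SELECT * so a later schema
// change cannot silently shift the positions the statement was prepared with.
PreparedStatement select = session.prepare(
        "SELECT id, status, created_at FROM my_keyspace.my_table WHERE id = ?");

ResultSet rs = session.execute(select.bind("some-id"));
for (Row row : rs) {
    String status = row.getString("status");  // read by name, with the column's declared type
    System.out.println(status);
}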

Azure Kusto Java SDK get all columns of query results

I'm using the Azure Kusto Java SDK v2.0.1 with Scala on Java 8.
I'm executing some query:
val query = " ... "
val tenantId = " ... "
val queryResponse = client.execute(tenantId, query)
val queryResponseResults = queryResponse.getPrimaryResults
I want to convert the given data structure to JSON eventually, so I want to get all the columns, but I can't seem to find anything like a getColumns method.
While debugging I see the object (KustoResultSetTable) has the fields columnsAsArray (which is exactly what I want) and columns, but they are private and I didn't find any getters.
A getter will be added in the next version.
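Until that getter ships, a fragile workaround could be plain Java reflection against the private field observed in the debugger (the field name columnsAsArray comes from the question above; the checked reflection exceptions need handling, and this can break on any SDK update):

import java.lang.reflect.Field;

// queryResponseResults is the KustoResultSetTable returned by getPrimaryResults().
Field columnsField = queryResponseResults.getClass().getDeclaredField("columnsAsArray");
columnsField.setAccessible(true);
Object[] columns = (Object[]) columnsField.get(queryResponseResults);
for (Object column : columns) {
    // Each element describes one column of the primary result table.
    System.out.println(column);
}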

Cassandra does not execute insert statement with timestamp field

--Update: it seems it was a glitch with the virtual machine. I restarted the Cassandra service and it works as expected.
--Update: It seems that the problem is not in the code. I tried to execute the insert statement in a Cassandra client and I get the same behavior: no error is displayed, and nothing is inserted.
The column that causes this behavior is of type timestamp; the insert silently does nothing when I set this column to certain values (e.g. '2015-08-25 22:15:12').
The table is:
create table player (
    msisdn varchar primary key,
    game int,
    keyword varchar,
    inserted timestamp,
    lang int,
    mo int,
    mt int,
    qid int,
    score int
)
I am new to Cassandra; I downloaded the VirtualBox snapshot to test it.
I tried the batch example code and it did nothing, so, as people suggested, I tried executing the prepared statement directly.
var addPlayer = casDemoSession.Prepare(
    "INSERT INTO player (msisdn, qid, keyword, mo, mt, game, score, lang, inserted) VALUES (?,?,?,?,?,?,?,?,?)");
for (int i = 0; i < 20; i++) {
    var bs = addPlayer.Bind(getRandMSISDN(), 1, "", 1, 0, 0, 10, 0, DateTime.Now);
    bs.EnableTracing(true);
    casDemoSession.Execute(bs);
}
The code above does not throw any exceptions, nor does it insert any data. I tried to trace the query, but it does not show the actual CQL query.
PlanetCassandra V0.1 VM running Cassandra 2.0.1
DataStax C# driver 2.6: https://github.com/datastax/csharp-driver
One thing that might be missing is the keyspace name for your player table. Usually you would have "INSERT INTO <keyspace>.<table> (...".
If you're able to run cqlsh, could you add the output of "DESCRIBE TABLE <keyspace>.player" to your question, and show what happens when you attempt to do the insert in cqlsh?

weight() in SELECT seems not allowed on distributed index

I am using Sphinx 2.1.4-release (rel21-r4421) on an index distributed over two machines. I want to rerank the results to be a mix between the weight() returned by the Sphinx ranker, as well as my own score that's included in a field. The query looks like the following:
SELECT id, name, weight()+score as rank FROM data WHERE match('test') ORDER BY rank DESC LIMIT 10;
According to the docs, this should be a valid query. And it does work on a single index. However, when I query a distributed index composed of two shards, only one returns a result set and from the other I get a warning:
index data: agent 127.0.0.1:9312: remote error: select: syntax error, unexpected SEL_WEIGHT near 'weight()+score as rank'
The configuration is set as follows:
index data
{
    type = distributed
    local = data_0
    agent = 127.0.0.1:9312:data_1
}
If I move both shards to the local server and change the configuration to this:
index data
{
    type = distributed
    local = data_0
    local = data_1
}
Everything works as it should and I get results from both shards.
This seems like a bug to me, or might it be a configuration issue?

Subsonic 3 Simple Query inner join sql syntax

I want to perform a simple join on two tables (BusinessUnit and UserBusinessUnit), so I can get a list of all BusinessUnits allocated to a given user.
The first attempt works, but there's no override of Select which allows me to restrict the columns returned (I get all columns from both tables):
var db = new KensDB();
SqlQuery query = db.Select
.From<BusinessUnit>()
.InnerJoin<UserBusinessUnit>( BusinessUnitTable.IdColumn, UserBusinessUnitTable.BusinessUnitIdColumn )
.Where( BusinessUnitTable.RecordStatusColumn ).IsEqualTo( 1 )
.And( UserBusinessUnitTable.UserIdColumn ).IsEqualTo( userId );
The second attempt allows the column restriction, but the generated SQL contains pluralised table names (?)
SqlQuery query = new Select( new string[] { BusinessUnitTable.IdColumn, BusinessUnitTable.NameColumn } )
.From<BusinessUnit>()
.InnerJoin<UserBusinessUnit>( BusinessUnitTable.IdColumn, UserBusinessUnitTable.BusinessUnitIdColumn )
.Where( BusinessUnitTable.RecordStatusColumn ).IsEqualTo( 1 )
.And( UserBusinessUnitTable.UserIdColumn ).IsEqualTo( userId );
Produces...
SELECT [BusinessUnits].[Id], [BusinessUnits].[Name]
FROM [BusinessUnits]
INNER JOIN [UserBusinessUnits]
ON [BusinessUnits].[Id] = [UserBusinessUnits].[BusinessUnitId]
WHERE [BusinessUnits].[RecordStatus] = #0
AND [UserBusinessUnits].[UserId] = #1
So, two questions:
- How do I restrict the columns returned in method 1?
- Why does method 2 pluralise the table names in the generated SQL (and can I get round this?)
I'm using 3.0.0.3...
So far my experience with 3.0.0.3 suggests that this is not possible yet with the query tool, although it is with version 2.
I think the preferred method (so far) with version 3 is to use a LINQ query with something like:
var busUnits = from b in BusinessUnit.All()
join u in UserBusinessUnit.All() on b.Id equals u.BusinessUnitId
select b;
I ran into the pluralized table names myself, but it was because I'd only re-run one template after making schema changes.
Once I re-ran all the templates, the plural table names went away.
Try re-running all 4 templates and see if that solves it for you.
