HBase | HBase column qualifier hidden using HBase shell commands but visible via hbaseRdd Spark code

I am stuck in a very odd situation related to HBase design, I would say.
HBase version: 2.1.0-cdh6.2.1
So, the problem statement is: in HBase, we have a row in our table.
We perform a new insert and then subsequent updates of the same HBase row, as we receive data from downstream.
Say we received data like below:
INSERT of {a=1,b=1,c=1,d=1,rowkey='row1'}
UPDATE of {b=1,c=1,d=1,rowkey='row1'}
and say the final row is like this in our HBase table:
hbase(main):008:0> get 'test', 'row1'
COLUMN CELL
cf:b timestamp=1288380727188, value=value1
cf:c timestamp=1288380727188, value=value1
cf:d timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds
So the column qualifier cf:a is missing from the data above when fetched via scan or get commands, but as per our ingestion flow/process it should have been there. We are triaging where it went or what happened, the analysis is still in progress, and we are fairly clueless as to where it is.
Now, to cut the story short: we have a Spark utility that reads the HBase table into an RDD via the hbasecontext.hbaseRdd API function, converts it into a DataFrame and displays the tabular data. We ran this Spark utility on the same table to help locate this row, and very surprisingly it returned 2 rows for this same rowkey 'row1': the 1st row was the same as the get/scan output above, and the 2nd row had our missing column cf:a (surprisingly, with exactly the value we expected). The output DataFrame looked something like below (a rough sketch of this kind of read follows the output).
rowkey |cf:a |cf:b|cf:c|cf:d
row1 |null | 1 | 1 | 1 >> cf:a col qualifier missing (same as in Hbase shell)
row1 | 1 | 1 | 1 | 1 >> This cf:a was expected
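For reference, a simplified PySpark sketch of this kind of read, using the HBase-Spark connector's DataFrame source; the option names and column mapping here are illustrative assumptions, not our exact utility code:
# Illustrative only: assumes the hbase-spark connector and hbase-site.xml are available to Spark
df = (spark.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "test")
      .option("hbase.columns.mapping",
              "rowkey STRING :key, a STRING cf:a, b STRING cf:b, "
              "c STRING cf:c, d STRING cf:d")
      .load())
df.filter(df.rowkey == 'row1').show()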
We checked our HBase table schema as well: we don't have multiple versions of cf:a in the describe output, and we don't do versioning on the table. The describe of the HBase table shows
VERSIONS => '1'
Anyway, I am clueless as to how hbaseRdd is able to read that row and the missing column qualifier, while the HBase shell get and scan commands do not.
Any HBase experts or suggestions, please?
FYI, I also tried the HBase shell get with VERSIONS on the row, but it only returns the get output above and not the missing cf:a.
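That is, roughly a command of this form (the exact VERSIONS value here is arbitrary):
get 'test', 'row1', {VERSIONS => 5}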
Is the column qualifier cf:a marked for deletion or something like that, which the HBase shell commands don't show?
Any help would be appreciated.
Thanks !!

This is a strange problem, which I suspect has to do with puts with the same rowkey having different column qualifiers at different times. However, I just tried to recreate this behaviour and I don't seem to be getting this problem. But I have a regular HBase 2.x build, as opposed to yours.
One option I would recommend to explore the problem more closely is to inspect the HFiles physically, outside of the HBase shell. You can use the HBase HFile utility to print the physical key-value content at the HFile level. Obviously try to do this on a small HFile! Don't forget to flush and major-compact your table before you do it, though, because HBase keeps recent writes in memory (the MemStore) until it flushes them to HFiles.
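In the HBase shell, that would be something along the lines of the following (using the table name from your question):
flush 'test'
major_compact 'test'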
You can launch the utility as below, and it will print all key-values sequentially:
hbase hfile -f hdfs://HDFS-NAMENODE:9000/hbase/data/default/test/29cfaecf083bff2f8aa2289c6a078678/f/09f569670678405a9262c8dfa7af8924 -p --printkv
In the above command, HDFS-NAMENODE is your HDFS server, default is your namespace (assuming you have none), test is your table name, and f is the column family name. You can find the exact path to your HFiles by listing the HDFS directory tree recursively:
hdfs dfs -ls -R /hbase/data

[Updated] We worked with Cloudera and found the issue was due to overlapping HBase regions. Cloudera fixed it for us; I don't have the full details of how they did it.

Related

Access "table$partitions" through Spark Sql

I figured out that running the following code will do a full scan of the table:
select max(run_id) from database.table
So I switched my code to work with the following syntax:
select max(run_id) from "database"."table$partitions"
This query works great on Athena, but when I try to execute it with Spark SQL I get the following error:
mismatched input '"database"' expecting <EOF>(line 1, pos 24)
It seems like Spark SQL identifies the quotes as the end of the query.
Any ideas how to make this query work in Spark SQL?
Thanks
My solution for this problem was:
import pyspark.sql.functions as f
sql_context.sql(f'show partitions {table_name}').agg(
    f.max(f.regexp_extract('partition', rf'{partition_name}=([^/]+)', 1))
).collect()[0][0]
The advantage: it doesn't do a full scan of the table.
The disadvantage: it scans all partition levels, and the code isn't elegant.
Anyway, that's the best I found.

What is the difference between dynamic.partition=True and dynamic.partition.mode = nonstrict?

Spark 2.0 - pyspark
I've seen the following 2 properties paired. What is the difference between them?
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=non-strict;
I know what the outcome is when they are used (you can use dynamic partitioning to load/create multiple partitions), but I don't know the difference between these two similar commands.
When I ran this code:
input_field_names = ['id', 'code', 'num']
df \
    .select(input_field_names) \
    .write \
    .mode('append') \
    .insertInto('test_insert_into_partition')
I got an error message that says Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Using spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict"), the code works, and it doesn't require me to set the other one.
Why don't I need to set hive.exec.dynamic.partition=true, and what else should I know when choosing which one to use?
Although there is much to google, here is a short answer.
If you want to insert dynamically into Hive partitions, both values need to be set, and you can then load many partitions in one go:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict
create table tblename (h string, m string, mv double, country string) partitioned by (starttime string) location '/.../...'
INSERT overwrite table tblename PARTITION(starttime) SELECT h, m, mv, country, starttime from tblename2
Otherwise you need to do it like this, setting the partition column value yourself explicitly:
INSERT into table tblename PARTITION(starttime='2017-08-09') SELECT h, m, mv, country from tblename2 where to_date(starttime)='2017-08-09'
The purpose of the default value of 'strict' for hive.exec.dynamic.partition.mode is to prevent a user from accidentally overwriting all the partitions, i.e. to avoid data loss.
So it is not so much a difference in behaviour as a matter of caution, a bit like the safety catch on a firearm, as it were.
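Put together in PySpark terms, a minimal sketch based on the code in the question, assuming spark is a Hive-enabled SparkSession and df has the columns id, code and num:
# Set both properties for the current session
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
# The dynamic-partition insert from the question then runs without the strict-mode error
df.select('id', 'code', 'num') \
    .write \
    .mode('append') \
    .insertInto('test_insert_into_partition')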

ERROR when loading files into HBase on Azure with ImportTsv

I am trying to load a TSV file into HBase running in HDInsight in the Microsoft Azure cloud, using a recommended approach: connecting through Remote Desktop and running on the command line. I am trying to load the t1.tsv file (with two tab-separated columns) from HDFS into the HBase table t1:
C:\apps\dist\hbase-0.98.0.2.1.5.0-2057-hadoop2\bin>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,num t1 t1.tsv
and get:
ERROR: One or more columns in addition to the row key and timestamp(optional) are required
Usage: importtsv -Dimporttsv.columns=a,b,c
Replacing the order of the specified columns with num,HBASE_ROW_KEY:
C:\apps\dist\hbase-0.98.0.2.1.5.0-2057-hadoop2\bin>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=num,HBASE_ROW_KEY t1 t1.tsv
I get:
ERROR: Must specify exactly one column as HBASE_ROW_KEY
Usage: importtsv -Dimporttsv.columns=a,b,c
This tells me that either the comma separator in the column list is not recognized, or the column name is incorrect. I also tried to specify the column with a qualifier, as num:v, and as 'num' - nothing helps.
Any ideas what could be wrong here? Thanks.
>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,d:c1,d:c2" testtable /example/inputfile.txt
This works for me. I think there are some differences between terminals on Linux and Windows, so on Windows you need to add quotation marks to clarify that this is a single value string; otherwise it might not be recognized.
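Applied to the original command, that would be something like the line below, where d is just a placeholder column family (ImportTsv expects family:qualifier for every column other than HBASE_ROW_KEY):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,d:num" t1 t1.tsv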

Cassandra Pig example failing with wide row input enabled

Using Cassandra 1.1.6, Pig 0.10.0 and Hadoop 1.1.0, I can successfully run the pig_cassandra example script provided with Cassandra in examples/pig.
But when I change
rows = LOAD 'cassandra://PigTest/SomeApp' USING CassandraStorage();
to:
rows = LOAD 'cassandra://PigTest/SomeApp?widerows=true' USING CassandraStorage();
I get the following error:
java.lang.IndexOutOfBoundsException: Index: 8, Size: 2
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:156)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:579)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNext(POPreCombinerLocalRearrange.java:126)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
This happens when running in both local and mapreduce mode, or if I set PIG_WIDEROW_INPUT=true.
The following Pig Latin script will fail with the "widerows=true" parameter present.
rows = LOAD 'cassandra://PigTest/SomeApp?widerows=true' USING CassandraStorage();
cols = FOREACH rows GENERATE flatten(columns.name);
DUMP cols;
I can't seem to get beyond this, nor read the static columns in the SomeApp column family when using wide-row input. The same issue is present with other column families.
I had a similar issue. It may be because of bugs in get_paged_slices which were fixed in later 1.1.x releases. The solution would be to upgrade Cassandra to 1.1.8 or 1.1.9.
See:
CASSANDRA-4919: StorageProxy.getRangeSlice sometimes returns incorrect number of columns
CASSANDRA-4816: Broken get_paged_slice
CASSANDRA-5098: CassandraStorage doesn't decode name in widerow mode

How do I delete all data in a Cassandra column family?

I'm looking for a way to delete all of the rows from a given column family in Cassandra.
This is the equivalent of TRUNCATE TABLE in SQL.
You can use the truncate thrift call, or the TRUNCATE <table> command in CQL.
http://www.datastax.com/docs/1.0/references/cql/TRUNCATE
You can also do this via Cassandra CQL.
$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh> TRUNCATE my_keyspace.my_column_family;
It's very simple in Astyanax; it's just a single-line statement:
/* the keyspace variable is of type Keyspace */
keyspace.truncateColumnFamily(ColumnFamilyName);
If you are using Hector it is easy as well:
cluster.truncate("our keyspace name here", "your column family name here");
If you are using cqlsh, then you can do it in either of two ways:
use keyspace; and then truncate column_family;
truncate keyspace.column_family;
If you want to use DataStax Java driver, you can look at -
http://www.datastax.com/drivers/java/1.0/com/datastax/driver/core/querybuilder/QueryBuilder.html
or
http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/querybuilder/Truncate.html
depending on your version.
If you are working on a cluster setup, truncate can only be used when all the nodes of the cluster are UP.
By using truncate, we lose the data (and we are not always sure how important it is).
So a safer way, as well as a trick to delete data, is to use the COPY command:
1) Back up the data using the cqlsh COPY command:
copy tablename to 'path'
2) Duplicate the file using the Linux cp command:
cp 'src path' 'dst path'
3) Edit the duplicate file in the dst path, delete all lines except the first line, and save the file.
4) Use the cqlsh COPY command to import it back:
copy tablename from 'dst path'
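For reference, the cqlsh COPY syntax for steps 1 and 4 looks roughly like this (keyspace, table and file paths here are placeholders):
cqlsh> COPY my_keyspace.my_table TO '/tmp/my_table_backup.csv';
cqlsh> COPY my_keyspace.my_table FROM '/tmp/my_table_edited.csv';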
