Apache Beam Python SDK for Windowing with SQL - python-3.x

The problem is that I want to apply windowing inside SqlTransform, like this:
SELECT f_timestamp, line, COUNT(*)
FROM PCOLLECTION
GROUP BY
line,
HOP(f_timestamp, INTERVAL '30' MINUTE, INTERVAL '1' HOUR)
My Row transformation mapping is
| "Create beam Row" >> beam.Map(lambda x: beam.Row(f_timestamp= float(x["timestamp_date"]), line = unicode(x["line"])))
And I get the following error on the Java side:
Caused by: org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorException:
Cannot apply 'HOP' to arguments of type 'HOP(<DOUBLE>, <INTERVAL MINUTE>, <INTERVAL HOUR>)'.
Supported form(s): 'HOP(<DATETIME>, <DATETIME_INTERVAL>, <DATETIME_INTERVAL>)'
The things I tried:
Making f_timestamp a UNIX timestamp, as a float.
Making f_timestamp a string timestamp, as unicode.
From what I read, the Java side uses java.util.Date for the timestamp. How can I work around this issue?

You should be able to use apache_beam.utils.timestamp.Timestamp for this.
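For illustration, a minimal sketch of that suggestion (the sample input and the simplified SELECT list are my own, and SqlTransform still needs Java available for its expansion service). Building the row field with Timestamp should give it a DATETIME-compatible schema type, so HOP() no longer sees a DOUBLE:

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform
from apache_beam.utils.timestamp import Timestamp

query = """
SELECT line, COUNT(*) AS cnt
FROM PCOLLECTION
GROUP BY
  line,
  HOP(f_timestamp, INTERVAL '30' MINUTE, INTERVAL '1' HOUR)
"""

with beam.Pipeline() as p:
    _ = (
        p
        | "Sample input" >> beam.Create([
            {"timestamp_date": 1600000000.0, "line": "A"},
            {"timestamp_date": 1600000300.0, "line": "A"},
        ])
        | "Create beam Row" >> beam.Map(lambda x: beam.Row(
            # Timestamp, not float, is what ends up as a SQL TIMESTAMP field.
            f_timestamp=Timestamp.of(float(x["timestamp_date"])),
            line=str(x["line"])))
        | "Windowed count" >> SqlTransform(query)
        | "Print" >> beam.Map(print)
    )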

Related

HBase col qualifier hidden using HBase shell cmds but visible via hbaseRdd Spark code

I am stuck in a very odd situation related to HBase design, I would say.
HBase version: 2.1.0-cdh6.2.1
So, the problem statement is: in HBase, we have a row in our table.
We perform a new insert and then subsequent updates of the same HBase row as we receive data from downstream.
Say we received data like below:
INSERT of {a=1,b=1,c=1,d=1,rowkey='row1'}
UPDATE of {b=1,c=1,d=1,rowkey='row1'}
and say the final row is like this in our HBase table:
hbase(main):008:0> get 'test', 'row1'
COLUMN CELL
cf:b timestamp=1288380727188, value=value1
cf:c timestamp=1288380727188, value=value1
cf:d timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds
So the cf:a column qualifier is missing from the data above when fetched via scan or get commands, but as per our ingestion flow/process it should have been there. We are still triaging where it went or what happened; the analysis is in progress and we are pretty much clueless as to where it is.
Now, to cut the story short, we have a Spark util to read the HBase table into an RDD via the hbasecontext.hbaseRdd API function, convert it into a dataframe, and display the tabular data. We ran this Spark util on the same table to help locate this row and, very surprisingly, it returned 2 rows for the same rowkey 'row1': the 1st row was the same as the get/scan output above, and the 2nd row had our missing column cf:a (surprisingly, with the same value that was expected). The output dataframe appeared something like below.
rowkey |cf:a |cf:b|cf:c|cf:d
row1 |null | 1 | 1 | 1 >> cf:a col qualifier missing (same as in Hbase shell)
row1 | 1 | 1 | 1 | 1 >> This cf:a was expected
We checked our HBase table schema as well; we don't keep multiple versions of cf:a and we don't do versioning on the table. The table's describe output has
VERSIONS => '1'
Anyway, I am clueless as to how hbaseRdd is able to read that row and the missing col qualifier, while the HBase shell cmds via get and scan do not.
Any HBase experts or suggestions, please.
FYI, I tried the HBase shell cmds as well, via get with VERSIONS on the row, but it only returns the above get data and not the missing cf:a.
Is the col qualifier cf:a marked for deletion or something like that, which the HBase shell cmd doesn't show?
Any help would be appreciated.
Thanks !!
This is a strange problem, which I suspect has to do with puts with the same rowkey having different column qualifiers at different times. However, I just tried to recreate this behaviour and I don't seem to get the problem, although I am on a regular HBase 2.x build, as opposed to yours.
One option I would recommend for exploring the problem more closely is to inspect the HFiles physically, outside of the hbase shell. You can use the HBase HFile utility to print the physical key-value content at the HFile level. Obviously try to do this on a small HFile! Don't forget to flush and major-compact your table before you do it, because HBase keeps recent updates in memory (the MemStore) until they are flushed out.
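For reference, that flush and major compaction step in the hbase shell (using the 'test' table name from the question) would look like:
hbase(main):001:0> flush 'test'
hbase(main):002:0> major_compact 'test'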
You can launch the utility as below, and it will print all key-values sequentially:
hbase hfile -f hdfs://HDFS-NAMENODE:9000/hbase/data/default/test/29cfaecf083bff2f8aa2289c6a078678/f/09f569670678405a9262c8dfa7af8924 -p --printkv
In the above command, HDFS-NAMENODE is your HDFS server, default is your namespace (assuming you have none), test is your table name, and f is the column family name. You can find the exact path to your HFiles by listing the HBase data directory in HDFS recursively:
hdfs dfs -ls -R /hbase/data
[Updated] We worked with Cloudera and found that the issue was due to overlapping HBase regions. Cloudera fixed it for us; I don't have the full details of how they did it.

spark handling of time zone for builtin date & time related functions

Assuming I have a timestamp like the one obtained from the current_timestamp() UDF inside Spark, how can I specify a time zone when using a function like hour(), minute(), ...?
I believe that https://issues.apache.org/jira/browse/SPARK-18350 introduced support for this, but I can't get it to work. Similar to the last comment on that page:
session.read.schema(mySchema)
.json(path)
.withColumn("year", year($"_time"))
.withColumn("month", month($"_time"))
.withColumn("day", dayofmonth($"_time"))
.withColumn("hour", hour($"_time", $"_tz"))
Having a look at the definition of the hour function, it uses an Hour expression which can be constructed with an optional timeZoneId. I have been trying to create an Hour expression, but this is a Spark-internal construct and the API forbids using it directly. I guess providing a function hour(t: Column, tz: Column) along with the existing hour(t: Column) would not be a satisfying design.
I am stuck on trying to pass a specific time zone to the default builtin time UDFs.
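Since no fix is quoted here, a minimal sketch of one common workaround (my own illustration, not from the original post): shift the timestamp into the target zone with from_utc_timestamp before applying hour(), assuming the _time column is stored in UTC.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_utc_timestamp, hour, to_timestamp

spark = SparkSession.builder.master("local[1]").appName("tz-demo").getOrCreate()

# Sample data standing in for the _time column read from JSON in the question.
df = (spark.createDataFrame([("2016-06-23 14:30:00",)], ["_time"])
      .withColumn("_time", to_timestamp(col("_time"))))

# Interpret _time as UTC, render it as New York wall-clock time, then take the hour.
df = df.withColumn("hour_ny", hour(from_utc_timestamp(col("_time"), "America/New_York")))
df.show()

# Alternatively (Spark 2.2+), the session-wide zone used by hour()/minute() can be set:
# spark.conf.set("spark.sql.session.timeZone", "America/New_York")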

Logstash: TZInfo::AmbiguousTime exception when parsing JDBC column

I am getting this exception when getting my data with the Logstash JDBC input plugin:
error:
26413962
Sequel::InvalidValue
TZInfo::AmbiguousTime: 2017-11-05T01:30:00+00:00 is an ambiguous local time.
This is probably because I am already converting my time zone in my JDBC plugin with this parameter:
jdbc_default_timezone => "America/New_York"
Therefore 1:30am happened twice on November 5th, and I suspect Logstash doesn't know what to do with it and falls into an infinite loop.
As a workaround, I removed the jdbc_default_timezone parameter and instead I convert my values in UTC in the select statement, like this:
DATEADD(hh, DATEDIFF(hh, GETDATE(), GETUTCDATE()), th.[DueDate]) as DueDate
But this workaround is annoying, since I need to modify the date columns in all of my Logstash inputs.
Is there a way to force it to pick either of the two possible times, or a more elegant way to handle this?
It seems that this is a known bug in the Logstash JDBC input plugin; it is flagged as a P2 enhancement.
https://github.com/logstash-plugins/logstash-input-jdbc/issues/121
Meanwhile, the workaround is to convert all date and timestamp columns to UTC in the SQL query, as described above in the question (MS SQL version), or like this for the Oracle version:
select from_tz(cast(<column> as timestamp), 'CET') at time zone ('EST') "#timestamp"
from <table>
where ...
We also need to remove the jdbc_default_timezone parameter in the input file and in the filter, if applicable.

select string literal in cassandra cql

I am new to Cassandra and I am trying to run a simple query in CQL:
select aggregate_name as name, 'test' as test from aggregates;
and I get an error: Line 1: no viable alternative at input ''test''
The question is: how can I select a string literal in Apache Cassandra?
I found an ugly workaround, if you really want to print a text value as a column:
cqlsh> select aggregate_name as name, blobAsText(textAsBlob('test')) as test from aggregates;
name | test
------+------
dude | test
CQL supports native Cassandra functions as a select_expression, so you can convert your string literal to a blob and back again as shown above. (source)

Presto - static date and timestamp in where clause

I'm pretty sure the following query used to work for me on Presto:
select segment, sum(count)
from modeling_trends
where segment='2557172' and date = '2016-06-23' and count_time between '2016-06-23 14:00:00.000' and '2016-06-23 14:59:59.000'
group by 1;
Now when I run it (on Presto 0.147 on EMR), I get an error about trying to assign a varchar to a date/timestamp.
I can make it work using:
select segment, sum(count)
from modeling_trends
where segment='2557172' and date = cast('2016-06-23' as date) and count_time between cast('2016-06-23 14:00:00.000' as TIMESTAMP) and cast('2016-06-23 14:59:59.000' as TIMESTAMP)
group by segment;
but it feels dirty...
is there a better way to do this?
Unlike some other databases, Presto doesn't automatically convert between varchar and other types, even for constants. The cast works, but a simpler way is to use the type constructors:
WHERE segment = '2557172'
AND date = date '2016-06-23'
AND count_time BETWEEN timestamp '2016-06-23 14:00:00.000' AND timestamp '2016-06-23 14:59:59.000'
You can see examples for various types here: https://prestosql.io/docs/current/language/types.html
Just a quick thought... have you tried omitting the dashes in your date? Try 20160623 instead of 2016-06-23.
I encountered something similar with SQL Server, but I have not used Presto.
