Hive statistics - statistics

I am trying to compute statistics on an ORC file, but I do not see any changes in PART_COL_STATS. Also, even with the following settings:
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;
getting the max value of a column still runs a full MapReduce job on that column.
What I want is to use the max value stored in the metastore, but I am unable to retrieve these statistics.
My table desc is:
load_inst_id int
src_filename string
server_date date
My analyze query is:
analyze table mytable partition(server_date='2013-11-30') compute statistics for columns load_inst_id;
I always get 0 as the load_inst_id, and I have to turn off hive.compute.query.using.stats to get the correct result (through a MapReduce max(load_inst_id)).

Related

Select and decode blob using python cassandra driver

I am trying to query the traces Cassandra table which is part of the Jaeger architecture. As you can see the refs field is a list:
cqlsh:jaeger_v1_dc1> describe traces
CREATE TABLE jaeger_v1_dc1.traces (
trace_id blob,
span_id bigint,
span_hash bigint,
duration bigint,
flags int,
logs list<frozen<log>>,
operation_name text,
parent_id bigint,
process frozen<process>,
refs list<frozen<span_ref>>,
start_time bigint,
tags list<frozen<keyvalue>>,
PRIMARY KEY (trace_id, span_id, span_hash)
)
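From the Python code:
traces = session.execute('SELECT span_id, refs FROM traces')
for t in traces:
    # refs comes back as None when the list column is empty
    if t.refs:
        # rows are named tuples by default, so use attribute access
        parentTrace = t.refs[0].trace_id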
My first question: is it possible to directly select the parent trace without iterating through the result? Is there a way to get the first element of the list, and then the fields inside it, from the SELECT statement?
From the terminal using cqlsh, I get this result: trace_id: 0x00000000000000003917678c73006f57. However, from the Python Cassandra client I get trace_id=b'\x00\x00\x00\x00\x00\x00\x00\x009\x17g\x8cs\x00oW'. Any idea what transformation happened to it? How can I decode it, since I want to use it to query the table again?
To my knowledge, there is no easy way, as there is no guarantee that the spans are stored in a specific order. It is worth noting, though, that if by parentTrace you mean the root span of the trace (the first span), you can search for spans where refs is null, because a root span has no parent. Another way to identify a root span is to check whether trace_id == span_id.
trace_id is stored as a binary blob. What you see from the Python client is a 16-byte bytes object: Python prints bytes that map to printable ASCII as characters (0x39 shows up as 9) and the rest as \xNN escapes. To get the hex string you see in cqlsh, convert the entire array to a single hex string. See the following Python example that does this:
from cassandra.cluster import Cluster
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
rows = session.execute("select * from jaeger_v1_test.traces")
trace = rows[0]
hexstr = ''.join('{:02x}'.format(x) for x in trace.trace_id)
print("hex=%s, byte_arr=%s, len(byte_arr)=%d" % (hexstr, trace.trace_id, len(trace.trace_id)))
cluster.shutdown()
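On Python 3 the same conversion is also available through the built-in bytes.hex() and bytes.fromhex(). A minimal sketch, using the value from the question as sample data (no Cassandra connection is needed to run it):

```python
# Sample value copied from the question; in real code this would be row.trace_id.
raw = b'\x00\x00\x00\x00\x00\x00\x00\x009\x17g\x8cs\x00oW'

# bytes -> hex string, as cqlsh prints it (cqlsh adds the 0x prefix).
hexstr = raw.hex()
cqlsh_form = "0x" + hexstr

# hex string -> bytes, e.g. when the id arrives as text from somewhere else.
restored = bytes.fromhex(hexstr)

print(cqlsh_form, len(restored))

# To query the table again, the bytes value itself can be passed as a
# parameter; no manual decoding should be needed, for example:
#   session.execute("SELECT * FROM traces WHERE trace_id = %s", (raw,))
```

The hex form is mostly useful for logging or for pasting into cqlsh.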

Inserting a value on a frozen set in cassandra 3

I am currently working on a Cassandra 3 database in which one of its tables has a column that is defined like this:
column_name map<int, frozen <set<int>>>
When I have to change the value of a complete set given a map key x I just have to do this:
UPDATE keyspace.table SET column_name[x] = {1,2,3,4,5} WHERE ...
The thing is that I need to insert a value into the set stored under a given key. I tried this:
UPDATE keyspace.table SET column_name[x] = column_name[x] + {1} WHERE ...
But it returns:
SyntaxException: line 1:41 no viable alternative at input '[' (... SET column_name[x] = [column_name][...)
What am I doing wrong? Does anyone know how to insert data the way I need?
Since the values of the map are frozen, you can't update them like this.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob. The entire value must be overwritten.
You have to read the full map, take the set stored under the key, add the new item, and then write the whole set back for that key.
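The read-modify-write described above could look roughly like this with the Python driver. It is only a sketch: keyspace_name.table_name, the id primary key column and both function names are made up for illustration, and the two statements are not atomic, so concurrent writers to the same key can overwrite each other.

```python
def with_added_item(current_map, key, item):
    """Return a new set: the set stored under `key` (if any) plus `item`."""
    existing = (current_map or {}).get(key) or set()
    return set(existing) | {item}


def add_to_frozen_set(session, row_id, key, item):
    """Read the map, rebuild the set for `key`, and overwrite that map entry.

    `session` is expected to be a cassandra.cluster.Session; the table,
    column and primary key names below are hypothetical.
    """
    row = session.execute(
        "SELECT column_name FROM keyspace_name.table_name WHERE id = %s",
        (row_id,),
    ).one()
    new_set = with_added_item(row.column_name if row else None, key, item)
    # The whole frozen set is replaced, as in the first UPDATE of the question.
    session.execute(
        "UPDATE keyspace_name.table_name SET column_name[%s] = %s WHERE id = %s",
        (key, new_set, row_id),
    )
    return new_set


# The pure helper can be tried without a cluster:
print(with_added_item({7: {1, 2}}, 7, 3))
```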

Polybase - maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed

[Question from customer]
I have the following data in a text file, delimited by |:
A | null , ZZ
C | D
When I run this query using HDInsight:
CREATE EXTERNAL TABLE myfiledata(
col1 string,
col2 string
)
row format delimited fields terminated by '|' STORED AS TEXTFILE LOCATION 'wasb://.....';
I get the following result as expected:
A null , ZZ
C D
But when I run the same query using SQL DW PolyBase, it throws an error:
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
How do I fix this?
Here's my script in SQL DW:
-- Creating external data source (Azure Blob Storage)
CREATE EXTERNAL DATA SOURCE azure_storage1
WITH
(
TYPE = HADOOP
, LOCATION ='wasbs://....blob.core.windows.net'
, CREDENTIAL = ASBSecret
)
;
-- Creating external file format (delimited text file)
CREATE EXTERNAL FILE FORMAT text_file_format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS (
FIELD_TERMINATOR ='|'
, USE_TYPE_DEFAULT = TRUE
)
)
;
-- Creating external table pointing to file stored in Azure Storage
CREATE EXTERNAL TABLE [Myfile]
(
Col1 varchar(5),
Col2 varchar(5)
)
WITH
(
LOCATION = '/myfile.txt'
, DATA_SOURCE = azure_storage1
, FILE_FORMAT = text_file_format
)
;
We’re currently working on a way to bubble up the reason for reject to the user.
In the meantime, here's what's happening:
The default number of rows allowed to fail schema matching is 0. This means the query fails if at least one of the rows you’re loading from /myfile.txt doesn’t match the schema. In Hive, strings can hold an arbitrary number of characters, but varchars cannot. In this case it’s failing on the varchar(5) for “null , ZZ” because that is more than 5 characters.
If you raise REJECT_VALUE in the CREATE EXTERNAL TABLE statement, the other row will be let through. More info can be found here: https://msdn.microsoft.com/library/dn935021(v=sql.130).aspx
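A rough way to see which rows would be rejected is to replay the length check locally before loading. The Python sketch below only approximates the schema match and is not PolyBase's actual logic; the function name find_rejected_rows is made up, and it assumes that whitespace around the delimiter counts toward the field length.

```python
def find_rejected_rows(lines, max_lengths, delimiter="|"):
    """Return 1-based numbers of lines that would not fit the column definitions.

    A line is flagged when it has the wrong number of fields or when any
    field is longer than the varchar length declared for its column.
    """
    rejected = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != len(max_lengths):
            rejected.append(line_no)
            continue
        if any(len(value) > limit for value, limit in zip(fields, max_lengths)):
            rejected.append(line_no)
    return rejected


# The two rows from the question, checked against Col1 varchar(5), Col2 varchar(5).
sample = ["A | null , ZZ", "C | D"]
print(find_rejected_rows(sample, [5, 5]))
```

Widening Col2 in the external table definition is the other way out if the long value is legitimate data.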
This error is caused by dirty records for the respective file format. For example, in the case of Parquet, if the column contains '' (an empty string), the load won't work and the query fails with "Query aborted-- the maximum reject threshold".
[AZURE.NOTE] A query on an external table can fail with the error "Query aborted-- the maximum reject threshold was reached while reading from an external source". This indicates that your external data contains dirty records. A data record is considered 'dirty' if the actual data types/number of columns do not match the column definitions of the external table or if the data doesn't conform to the specified external file format. To fix this, ensure that your external table and external file format definitions are correct and your external data conform to these definitions. In case a subset of external data records is dirty, you can choose to reject these records for your queries by using the reject options in CREATE EXTERNAL TABLE DDL.

Parse GeoPoint query slow and timed out using javascript sdk in node.js

I have the following parse query which times out when the number of records is large.
var query = new Parse.Query("UserLocation");
query.withinMiles("geo", geo, MAX_LOCATION_RADIUS);
query.ascending("createdAt");
if (createdAt !== undefined) {
    query.greaterThan("createdAt", createdAt);
}
query.limit(1000);
It runs OK if the UserLocation table is small, but the query times out from time to time when the table has ~100k records:
[2015-07-15 21:03:30.879] [ERROR] [default] - Error while querying for locations: [latitude=39.959064, longitude=-75.15846]: {"code":124,"message":"operation was slow and timed out"}
The UserLocation table has a (latitude, longitude) pair and a radius. Given a geo point (latitude, longitude), I'm trying to find the list of UserLocations whose circle (lat, long) + radius covers the given geo point. It doesn't seem like I can use the value from another column in the table for the distance query (something like query.withinMiles("geo", inputGeo, "radius"), where "geo" and "radius" are the column names for the GeoPoint and the radius). There is also the restriction that "limit" combined with "skip" can return a maximum of 10,000 records (1000 records at a time and skip 10 times). So I had to do an almost full table scan, using "createdAt" as a filter criterion and querying repeatedly until the query returns no more results.
Is there any way I can improve the algorithm so that it doesn't time out on a large data set?
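For reference, the coverage condition described above can be written down as a plain distance check. This is a client-side sketch in Python rather than a Parse query, so it does not by itself fix the timeout; the field names lat, lon and radius_miles and the mean Earth radius constant are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def covers(location, lat, lon):
    """True if the location's circle (centre plus its own radius) contains the point."""
    d = distance_miles(location["lat"], location["lon"], lat, lon)
    return d <= location["radius_miles"]


# Hypothetical record shaped like a UserLocation row.
loc = {"lat": 39.959064, "lon": -75.15846, "radius_miles": 5.0}
print(covers(loc, 39.96, -75.16))
```

Because each row carries its own radius, the fixed-radius query can only pre-filter candidates; the per-row comparison still has to happen after the rows are fetched.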

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, in fact the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group, and call the result sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
In order to calculate the final sum, I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way to structure the query so that the export works?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale the query down to this. Hopefully it runs OK:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
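To sanity-check the logic of the simplified query outside Access, the same shape can be run against an in-memory SQLite database from Python. This is only a sketch: SQLite is not the Jet engine, so it says nothing about the Excel import itself, the sample rows are invented, and string literals use single quotes here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Currency (ID INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE Positions (
        ID INTEGER PRIMARY KEY,
        position REAL,
        currency_id INTEGER REFERENCES Currency(ID),
        portfolio INTEGER
    );
    -- Invented sample data: currency C and portfolio 2 must be filtered out.
    INSERT INTO Currency (ID, code) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Positions (position, currency_id, portfolio) VALUES
        (10.0, 1, 1), (-30.0, 1, 1), (5.0, 2, 1), (100.0, 3, 1), (1000.0, 1, 2);
""")

# Same structure as the simplified query: a derived table aliased as A.
query = """
    SELECT Sum(Abs(A.SumOfPosition)) AS SumAbs
    FROM (SELECT C.code, Sum(P.position) AS SumOfPosition
          FROM Currency AS C INNER JOIN Positions AS P ON C.ID = P.currency_id
          WHERE P.portfolio = 1
          GROUP BY C.code
          HAVING C.code IN ('A', 'B')) AS A
"""
sum_abs = conn.execute(query).fetchone()[0]
print(sum_abs)
conn.close()
```

With this data, currency A nets to -20 and B to 5, so the sum of absolute values should come out as 25.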
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
You can click on "Create a PivotTable report...". If you define the PivotTable properly, Excel will display only the outer sum.
