I want to split the data in an Amazon Athena database into similarly sized parts based on a varchar column. If I could convert the varchar to an integer I would just use
some_hash_function(data) mod n, but Athena's hash functions return varbinary, which can't be cast to an integer.
So, is it possible to solve this problem in another way?
You can convert an 8-byte varbinary into a bigint using the from_big_endian_64 function. Here is a full example:
select from_big_endian_64(substr(sha256(to_utf8('test')), 1, 8));
Generally, Dain's answer is correct, but one remark: since substr takes a varchar as its first argument while sha256 returns a varbinary, that code will fail.
Here is a working alternative:
from_big_endian_64(xxhash64(to_utf8(user_id)))
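To tie this back to the mod-n split in the question, here is a minimal sketch (assuming a table my_table with a varchar column user_id, and n = 8; abs() is used because the resulting bigint can be negative):
-- select the rows that fall into bucket 3 of 8
select *
from my_table
where abs(from_big_endian_64(xxhash64(to_utf8(user_id)))) % 8 = 3;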
Related
I'm using a Python Azure Function to copy data from a SQL Server database to an Azure Cognitive Search index. The problem I'm seeing is that there are some nvarchar fields containing numeric data that I'm trying to put into an Edm.Int64 field in the index. The documentation states that this should work:
https://learn.microsoft.com/en-us/rest/api/searchservice/data-type-map-for-indexers-in-azure-search#bkmk_sql_search
However, I get an error – “Cannot convert a value to target type 'Edm.Int64' because of conflict between input format string/number and parameter 'IEEE754Compatible' false/true”.
It works when copying a string containing numbers into an Edm.Int32 index field.
Has anyone else encountered/solved this issue?
Thanks!
You're getting the error because you're trying to convert a varchar/nvarchar into an Edm.Int64 index field, and that conversion is not supported.
As per https://learn.microsoft.com/rest/api/searchservice/data-type-map-for-indexers-in-azure-search#bkmk_sql_search you can only convert int, smallint, and tinyint types into Edm.Int32.
In the conversion table you'll find that char, nchar, varchar, nvarchar can only be converted to Edm.String or
Collection(Edm.String).
You can make your index field an Edm.String type, and then, once the content has been indexed, have your client app code translate the string to an int when it processes the response.
I hope this helps.
I have a table in an Oracle DB that contains a column stored as a RAW type. I'm making a JDBC connection to read that column and, when I print the schema of the resulting dataframe, I notice that I have a column with a binary data type. This was what I was expecting to happen.
The thing is that I need to be able to read that column as a String so I thought that a simple data type conversion would solve it.
df.select("COLUMN").withColumn("COL_AS_STRING", col("COLUMN").cast(StringType)).show
But what I got was a bunch of random characters. As I'm dealing with a RAW type, it was possible that a string representation of this data doesn't exist, so, just to be safe, I did a simple select to get the first rows from the source (using sqoop eval), and somehow sqoop can display this column as a string.
I then thought that this could be an encoding problem so I tried this:
df.selectExpr("decode(COLUMN,'utf-8')").show
I tried that with utf-8 and a bunch of other encodings, but again all I got was random characters.
Does anyone know how can I do this data type conversion?
I have a table with 5 million records and around 25 columns, most of them String type. When I run a query, it takes around 47 seconds to fetch the results. I have 2 GB of space for each String column (because I don't know how to reduce that value), yet the longest value is only around 32k characters in one column, and the other columns are far below that (7, 18, 50).
To get better query performance, I copied the table but used varchar(1000) instead of String for all the String columns, and varchar(50000) for the long column mentioned above. I thought this would give me a faster fetch, but it takes almost double the time.
My understanding is that varchar should use much less space, but somehow that is not happening. Under the same conditions, should I get a better response time using varchar instead of string?
There should not be any performance difference between string and varchar, but string is the better option; varchar is stored internally as a string anyway.
Here are some excellent threads with a detailed comparison of both:
https://community.hortonworks.com/questions/48260/hive-string-vs-varchar-performance.html
Hive - Varchar vs String , Is there any advantage if the storage format is Parquet file format
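For reference, a minimal Hive DDL sketch of the two layouts being compared (the table and column names are made up; per the answer above, both should behave the same since varchar is stored internally as string):
-- string-typed layout
create table t_string (long_col string, short_col string) stored as parquet;
-- varchar-typed layout with explicit length limits
create table t_varchar (long_col varchar(50000), short_col varchar(50)) stored as parquet;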
If I wish to have a comparable 128-bit integer equivalent as a row key in Cassandra, what data type is the most efficient for this? ASCII using the full 8-bit range?
I need to be able to select row slices and ranges.
Row keys are not compared if you use the Random Partitioner (the piece that determines how keys get distributed around the cluster).
If you want to compare row keys, use an Order Preserving Partitioner ... but that will surely lead to an unbalanced cluster and crashes.
Column names get compared though, with other column names inside the same row.
So my advice is: bucket your columns into numeric intervals and insert your columns with LongType column names.
Probably just use the raw byte[] representation of the int and avoid any conversion; comments above from le douard notwithstanding.
Raw byte[] comparison is not going to sort columns in numerical order. If that's what you want, you should use varint (CQL) / IntegerType (Thrift).
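If you are on CQL, here is a rough sketch of that bucketing advice (table and column names are made up): keep the bucket in the partition key so the cluster stays balanced under the random partitioner, and use a varint clustering column so slices within a bucket come back in numeric order:
create table numbered_rows (
    bucket     bigint,   -- e.g. the key value divided into fixed-size intervals
    key_value  varint,   -- arbitrary-precision integer, sorted numerically within a bucket
    payload    blob,
    primary key (bucket, key_value)
);
-- range slice within one bucket
select * from numbered_rows where bucket = 42 and key_value >= 1000 and key_value < 2000;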
How do I order by a varchar column numerically in a Vertica database?
For example, in Oracle we can add +0 in the ORDER BY clause to sort a varchar column numerically.
Thanks!
Use cast as in
select x from foo order by cast(x as int);
You will get an error if not all values can be cast to an int.
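If you need the query to survive non-numeric values, one option (reusing foo and x from above, and assuming Vertica's REGEXP_LIKE function) is to filter them out before the cast:
select x
from foo
where regexp_like(x, '^[0-9]+$')  -- keep only all-digit values
order by cast(x as int);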
I haven't done this before in Vertica, but my advice is the same for this type of problem: try to figure out how PostgreSQL does it and try that, since Vertica reuses a lot of PostgreSQL functionality.
I just did a quick search and came up with this as a possible solution: http://archives.postgresql.org/pgsql-general/2002-01/msg01057.php
A more thorough search may get you better answers.
If the data is truly numeric, the '+0' will do the conversion you asked for, but if any values cannot be converted the query will return an error like the following:
ERROR: Could not convert "200 ... something" from column table_name.column_name to a number