String vs Varchar Hive Query Performance

I have this table with 5 million records and around 25 columns, most of them of type String. When I run a query it takes around 47 seconds to fetch the results. Each String column can hold up to 2 GB (because I don't know how to reduce that value), but the longest value is only around 32k characters in one column; the other columns are far shorter than that (7, 18, 50 characters).
To get better query performance, I copied that table but replaced String with Varchar(1000) in all the String columns, and Varchar(50000) for the long column mentioned above. I thought this would give me a faster fetch, but it takes almost double the time.
As I understand it, I should be using far less space with varchar, but somehow that is not happening. Under the same conditions, should I get a better response using varchar instead of string?

There should not be any performance difference between string and varchar; the best option is simply to use string, since varchar is also stored internally as a string.
Here are some excellent threads with detailed comparisons of both:
https://community.hortonworks.com/questions/48260/hive-string-vs-varchar-performance.html
Hive - Varchar vs String , Is there any advantage if the storage format is Parquet file format

Related

Spark JDBC UpperBound

jdbc(String url,
     String table,
     String columnName,
     long lowerBound,
     long upperBound,
     int numPartitions,
     java.util.Properties connectionProperties)
Hello,
I want to import a few tables from Oracle to HDFS using Spark JDBC connectivity. To ensure parallelism, I want to choose the correct upperBound for each table. I am planning to put row_number as my partition column and the count of the table as the upperBound. Is there a better way to choose the upperBound, since I have to connect to the table a first time just to get the count? Please help.
Generally, the better way to use partitioning in Spark JDBC is:
Choose a numeric or date type column.
Set the upper bound to the maximum value of that column.
Set the lower bound to the minimum value of that column.
(If there is skew there are other ways to handle it; generally this is good enough.)
Obviously the above requires some querying and handling:
Keep a mapping of table to partition column (probably in an external store).
Query and fetch the min and max (a PySpark sketch of this follows below).
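For illustration, here is a minimal PySpark sketch of those two steps; the connection URL, credentials, table name and column name are all hypothetical:

# Minimal PySpark sketch: fetch min/max of the partition column, then do a partitioned read.
# The connection details, table and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-import").getOrCreate()

url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB"   # hypothetical Oracle URL
props = {"user": "app_user", "password": "secret", "driver": "oracle.jdbc.OracleDriver"}
table = "SALES"                                   # hypothetical table
part_col = "SALE_ID"                              # hypothetical numeric column

# Step 1: a tiny bounds query over a single connection.
bounds = spark.read.jdbc(
    url,
    f"(SELECT MIN({part_col}) AS lo, MAX({part_col}) AS hi FROM {table}) b",
    properties=props,
).first()
lo, hi = bounds[0], bounds[1]

# Step 2: the parallel partitioned read using those bounds.
df = spark.read.jdbc(
    url, table,
    column=part_col,
    lowerBound=lo,
    upperBound=hi,
    numPartitions=8,
    properties=props,
)
df.write.mode("overwrite").parquet("hdfs:///data/sales")   # hypothetical HDFS target

Spark splits the range between lowerBound and upperBound into numPartitions stride queries on the partition column, so the bounds only control where the splits fall, not which rows are read.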
Another tip for skipping that query: if you can find a date-based column, you can probably use upper = today's date and lower = some date in the 2000s. But again, it is subject to your content; those values might not hold true.
From your question I believe you are looking for something generic that you can easily apply to all tables. I understand that's the desired state, but if it were as straightforward as using row_number in the database, Spark would already do that by default.
Such functions may technically work, but they will definitely be slower than the steps above, as well as putting extra load on your database.

How to store Bert embeddings in cassandra

I want to use Cassandra as a feature store to hold precomputed BERT embeddings.
Each row would consist of roughly 800 numeric values (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?
The read pattern is simple: on read we want every value in a row. I am not sure which would be better for serialization speed.
Having everything as a separate column will be quite inefficient - each value will have its own metadata (writetime, for example) that adds significant overhead (at least 8 bytes per value). Storing the data as a string will also not be very efficient, and will add complexity on the application side.
I would suggest storing the data as a frozen list of ints/bigints or doubles/floats, depending on your requirements. Something like:
create table ks.bert(
    rowid int primary key,
    data frozen<list<int>>
);
In this case, the whole list will be effectively serialized as a binary blob, occupying just one cell.
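For illustration, a minimal sketch with the Python cassandra-driver; the contact point is assumed, the keyspace ks is assumed to exist, and the element type is float here since BERT embeddings are floating point:

# Minimal cassandra-driver sketch; contact point and keyspace are assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE TABLE IF NOT EXISTS ks.bert (
        rowid int PRIMARY KEY,
        data frozen<list<float>>
    )
""")

insert = session.prepare("INSERT INTO ks.bert (rowid, data) VALUES (?, ?)")
embedding = [-0.18294132] * 800        # placeholder embedding vector
session.execute(insert, (1, embedding))

row = session.execute("SELECT data FROM ks.bert WHERE rowid = 1").one()
print(len(row.data))                   # all 800 values come back from a single cell

A list of Python floats binds directly to frozen<list<float>>, so there is no string parsing and no per-column overhead on either write or read.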

How can I make my Athena SQL query faster

I am running this on AWS Athena, which is based on PrestoDB. My original plan was to query data 3 months in the past to analyze that data. However, even a query for just 2 hours in the past takes more than 30 minutes, at which point the query times out. Is there any more efficient way for the query to be carried out?
SELECT column1, dt, column2
FROM database1
WHERE date_parse(dt, '%Y%m%d%H%i%s') > CAST(now() - interval '1' hour AS timestamp)
The date column is recorded in the form of a string YYYYmmddhhmmss
Likely, the problem is that the query applies a function to the column being filtered. This is inefficient, because the database needs to convert the entire column before it is able to filter it. Such a predicate is said to be non-SARGable.
Your primary effort should go into fixing your data model and store dates as dates rather than strings.
That said, the string format that you are using to represent dates still makes it possible to use direct filtering. The idea is to convert the filter value to the target string format (rather than converting the column value to a date):
where dt > date_format(now() - interval '1' hour, '%Y%m%d%H%i%s')
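If it helps, here is how the rewritten filter could be submitted from Python with boto3; the region, database name, and results bucket below are assumptions:

# Hedged boto3 sketch: submits the rewritten query to Athena.
import boto3

athena = boto3.client("athena", region_name="us-east-1")   # assumed region

query = """
    SELECT column1, dt, column2
    FROM database1
    WHERE dt > date_format(now() - interval '1' hour, '%Y%m%d%H%i%s')
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},                  # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)
print(resp["QueryExecutionId"])

Because dt stays untouched on the left-hand side, the comparison is a plain string comparison against the stored YYYYmmddhhmmss values, which sort chronologically.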
There are a lot of different factors that influence the time it takes Athena to execute a query. The amount of data usually dominates, but other important factors are the data format (there is a huge difference between CSV and Parquet, for example) and the number of files. In contrast to many other databases, the complexity of the query is less often an important factor, and your query is very straightforward and is not the problem. (It doesn't help that you apply functions on both sides of the WHERE condition, but it's not a big deal in Athena: the filtering is brute force, and applying a function to each row is cheap compared to the IO.)
If you provide more information about the number of files, the data format, and so on we can probably help you better, because without that kind of information it could be just about anything. My suspicion is that you have something like a single prefix with tens or hundreds of millions of files – this is the worst possible case for Athena.
When Athena plans a query it lists the table's location on S3. S3's list operation has a page size of 1000, so if there are more files than that Athena will have to list sequentially until it gets the full listing. This cannot be parallelised, and it's also not very fast.
You need to avoid, almost at all cost, having more than 1000 files in the same prefix. If you have more files than that you can add prefixes (directories), because Athena will list S3 as if it were a file system, and parallelise listings of prefixes. 1000 files each in table-data/a/, table-data/b/, table-data/c/ is much better than 3000 files in table-data/.
The reason why I suspect it's lots of small files rather than a lot of data is that if it was a lot of data you would probably have said so – and lots of data is actually something Athena is really good at. Ripping through terabytes of data is no problem unless it's a billion tiny files.

Unable to coerce to a formatted date - Cassandra timestamp type

I have values stored in a timestamp type column of a Cassandra table, in the format
2018-10-27 11:36:37.950000+0000 (GMT date).
I get "Unable to coerce '2018-10-27 11:36:37.950000+0000' to a formatted date (long)" when I run the query below to get the data.
select create_date from test_table where create_date='2018-10-27 11:36:37.950000+0000' allow filtering;
How can I get the query working if the data is already stored in the table (in the format 2018-10-27 11:36:37.950000+0000), and also perform range (>= or <=) operations on the create_date column?
I also tried create_date='2018-10-27 11:36:37.95Z' and create_date='2018-10-27 11:36:37.95'.
Is it possible to perform filtering on this kind of timestamp type data?
P.S. I am using cqlsh to run the query on the Cassandra table.
In the first case, the problem is that you specify the timestamp with microseconds, while Cassandra operates with milliseconds - try removing the last three digits: .950 instead of .950000 (see this document for details). Timestamps are stored inside Cassandra as a 64-bit number and are formatted when printing results using the format specified by the datetimeformat option of cqlshrc (see doc). Dates without an explicit timezone require that a default timezone is specified in cqlshrc.
Regarding your question about filtering the data - this query will work only for small amounts of data; on bigger data sizes it will most probably time out, as it needs to scan all the data in the cluster. Also, the data won't be sorted globally, because sorting happens only inside a single partition.
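To illustrate both points, here is a minimal sketch with the Python cassandra-driver (the contact point and keyspace name are assumptions); the same millisecond-precision literal also works directly in cqlsh:

# Minimal cassandra-driver sketch; contact point and keyspace name are assumptions.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Equality: millisecond precision (.950), not microseconds (.950000).
rows = session.execute(
    "SELECT create_date FROM test_table "
    "WHERE create_date = '2018-10-27 11:36:37.950+0000' ALLOW FILTERING"
)

# Range: bind Python datetimes; the driver converts them to Cassandra timestamps.
start = datetime(2018, 10, 27, tzinfo=timezone.utc)
end = datetime(2018, 10, 28, tzinfo=timezone.utc)
rows = session.execute(
    "SELECT create_date FROM test_table "
    "WHERE create_date >= %s AND create_date <= %s ALLOW FILTERING",
    (start, end),
)
for r in rows:
    print(r.create_date)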
If you want to perform such queries, then the Spark Cassandra Connector may be a better choice, as it can efficiently select the required data and then let you perform sorting, etc., although this will require far more resources.
I recommend taking the DS220 course from DataStax Academy to understand how to model data for Cassandra.
This works for me:
var datetime = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss");
var query = $"SET updatedat = '{datetime}' WHERE ...

Limit on the number of columns in cassandra

Is there any limit on the number of columns in cassandra? I am thinking of using a unix timestamp (converted to TimeUUID) as the column key. In the worst case, I will end up having 86400 columns per row. Is this a good idea?
Having 86,400 columns per row is a piece of cake for Cassandra, as long as your columns are not too big and you don't retrieve all of them at once.
The maximum number of columns per row is 2 billion.
See http://wiki.apache.org/cassandra/CassandraLimitations
A suggestion: for the column name you should use Integer serialization, which takes just 4 bytes for 1-second precision, instead of a UUID (16 bytes), as long as your timestamps are all unique and 1-second precision is enough.
Column names are sorted, and you can use the unix time as an Integer. With this you can have fast lookups on columns.
There is also a timestamp associated with each column, which can be useful to set in some cases. You cannot query on it, but it may provide additional information if needed.
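In modern CQL terms, this wide-row layout maps to a clustering column; here is a minimal sketch with the Python driver, using hypothetical names and assuming one partition per day with the unix timestamp (in seconds) as an int clustering key:

# Hypothetical sketch: the "one column per second" wide row expressed as a clustering column.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()   # assumed contact point; keyspace ks assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS ks.events (
        day text,
        ts int,
        payload text,
        PRIMARY KEY (day, ts)
    )
""")

# ts values are stored sorted within the partition, so slices over a time range are fast.
rows = session.execute(
    "SELECT ts, payload FROM ks.events WHERE day = %s AND ts >= %s AND ts < %s",
    ("2024-01-15", 1705276800, 1705280400),
)

With one partition per day there are at most 86,400 rows per partition, which Cassandra handles comfortably.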
Assuming you're doing that for a good reason, it's totally fine.
