spark thrift server issue: Length field is empty for varchar fields - apache-spark

I am trying to read data from Spark Thrift Server using SAS. In the table definition shown through DBeaver, I see that the Length field is empty only for fields with the VARCHAR data type. I can see the length in the Data Type field, e.g. varchar(32), but that doesn't suffice for my purpose because the SAS application taps into the Length field. Since this field is not populated, SAS defaults to the maximum size and as a result it becomes extremely slow. The Length field is populated when I connect to Hive.

Related

Double values getting to exponential values when inserting data from Azure Databricks to Azure SQL database

I'm trying to load data from Azure Databricks into an Azure SQL database table via JDBC. The data loads fine, but the double values from Azure Databricks get converted to exponential (scientific) notation when inserted into the SQL table. I have tried different data types in the SQL database such as nvarchar, varchar, and float, and the values still end up in exponential notation.
However, when I use the decimal data type in the Azure SQL database, the data loads into the column without exponential values but with extra zeros at the end.
The command I'm using in Databricks is:
%scala
import org.apache.spark.sql.SaveMode

spark.sql("select ID from customers")
  .write
  .mode(SaveMode.Append) // <--- Append to the existing table
  .jdbc(jdbcUrl, "stg.customers", connectionProperties) // the table name must be passed as a String
Some example values stored in the Azure Databricks ID column are:
ID
1900845009567889.12
2134012183812321
When using numeric(38,15) data type in Azure SQL Database it is giving me the following output:
|ID|
|:--|
|1900845009567889.1200000000000000|
|2134012183812321.0000000000000000|
I don't want the extra zeros at the end. Also, the data in the Databricks table is not precisely defined, so I cannot say whether numeric(38,15) would suffice or not.
I also tried storing the data in the Azure Databricks ID column as a String data type and then loading it into a varchar or nvarchar column in the SQL table, but it still converts the data into exponential notation.
Can anyone suggest an easy way to load this data from Azure Databricks into the Azure SQL database?
I cannot say if numeric (38,15) would suffice or not
Before SQL Server 2016, conversion of numeric values was restricted to a precision of 17 digits.
From SQL Server 2016 (13.x) onward, that restriction no longer applies.
Generic Example:
Below is a simplified example from the Microsoft documentation showing how values are inserted into decimal and numeric columns:
-- "Table" is bracketed because TABLE is a reserved keyword
CREATE TABLE dbo.[Table] (DecimalColumn DECIMAL(5,2), NumericColumn NUMERIC(10,5));
GO
INSERT INTO dbo.[Table] VALUES (123, 12345.12);
GO
SELECT DecimalColumn, NumericColumn FROM dbo.[Table];
Result of the above SQL query:
DecimalColumn = 123.00, NumericColumn = 12345.12000
I don't want the extra zeros at the end.
In SQL Server we can cast to the float data type to exclude the extra zeros at the end (i.e., the fixed scale).
The cast to float is for display purposes only.
SELECT CAST(12345.1200000 as float)
Output:
12345.12
See the reference above for more on excluding the extra zeros.
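If the extra zeros should never reach SQL Server at all, a hedged alternative is to cast the column to a fixed-scale decimal on the Spark side before the JDBC write. This is only a sketch: jdbcUrl, connectionProperties, and the stg.customers table name come from the question's notebook, and the precision/scale of (38, 2) is an assumption you would adjust to your data.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Sketch only: cast the Double ID to a fixed-scale decimal so the JDBC write
// sends a DECIMAL value instead of a Double that renders in scientific notation.
// DecimalType(38, 2) is an assumed precision/scale, not taken from the question.
val df = spark.sql("select ID from customers")
  .withColumn("ID", col("ID").cast(DecimalType(38, 2)))

df.write
  .mode(SaveMode.Append)
  .jdbc(jdbcUrl, "stg.customers", connectionProperties)

With a matching decimal column on the SQL Server side, the trailing-zero question then reduces to how the value is displayed, as discussed above.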

Unable to coerce to a formatted date - Cassandra timestamp type

I have values stored in a timestamp-type column of a Cassandra table in the format
2018-10-27 11:36:37.950000+0000 (GMT).
I get "Unable to coerce '2018-10-27 11:36:37.950000+0000' to a formatted date (long)" when I run the query below to get data.
select create_date from test_table where create_date='2018-10-27 11:36:37.950000+0000' allow filtering;
How can I get the query working if the data is already stored in the table (in the format 2018-10-27 11:36:37.950000+0000), and also perform range (>= or <=) operations on the create_date column?
I also tried create_date='2018-10-27 11:36:37.95Z' and create_date='2018-10-27 11:36:37.95'.
Is it possible to perform filtering on this kind of timestamp type data?
P.S. I'm using cqlsh to run the queries against the Cassandra table.
In the first case, the problem is that you specify the timestamp with microseconds, while Cassandra operates with milliseconds - try removing the last three digits, i.e. use '2018-10-27 11:36:37.950+0000' instead of '...37.950000+0000' (see this document for details). Timestamps are stored inside Cassandra as a 64-bit number and are formatted when printing results using the format specified by the datetimeformat option of cqlshrc (see the docs). Dates without an explicit timezone require that a default timezone is specified in cqlshrc.
Regarding your question about filtering the data - this query will work only for small amounts of data; on bigger data sets it will most probably time out, because it needs to scan all the data in the cluster. Also, the data won't be sorted correctly, because sorting happens only inside a single partition.
If you want to perform such queries, then the Spark Cassandra Connector may be a better choice, as it can select the required data efficiently and then let you perform sorting, etc., although it will require considerably more resources.
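As a rough sketch of that approach, assuming Scala, a keyspace named test_ks, a local contact point, and the connector package already on the Spark classpath (none of which are given in the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch only: read the Cassandra table through the Spark Cassandra Connector
// and apply a range filter on the timestamp column. Host and keyspace are assumptions.
val spark = SparkSession.builder()
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

val rows = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "test_table"))
  .load()

rows.filter(col("create_date") >= "2018-10-27 00:00:00").show()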
I recommend taking the DS220 course from DataStax Academy to understand how to model data for Cassandra.
This works for me:
var datetime = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss"); // "mm" is minutes; the original "MM" (month) was a bug
var query = $"SET updatedat = '{datetime}' WHERE ...

spark JDBC column size

I'm trying to get the size of a VARCHAR column. I'm using:
spark.read.jdbc(myDBconnectionString, "schema.table", connectionProperties)
to retrieve the column name and type, but for VARCHAR columns I also need the size.
With the Java JDBC DatabaseMetaData API I can get the column name, type, and size.
Is it possible with spark?
Thanks
Apache Spark uses only a single uniform type for all text columns - StringType - which is mapped to an internal unsafe UTF-8 representation. The representation is the same no matter which type is used in the external storage, so the length declared in the database is not exposed through the Spark schema.
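If you need the declared length anyway, one hedged workaround is to query the JDBC driver's metadata directly alongside the Spark read. This is only a sketch; the URL, credentials, schema, and table names below are placeholders.

import java.sql.DriverManager
import java.util.Properties

// Sketch only: read COLUMN_SIZE from the driver's metadata, since Spark's schema
// reports every text column as StringType. URL/credentials/schema/table are assumptions.
val url = "jdbc:postgresql://dbhost:5432/mydb"
val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "mypassword")

val conn = DriverManager.getConnection(url, props)
val cols = conn.getMetaData.getColumns(null, "MY_SCHEMA", "MY_TABLE", null)
while (cols.next()) {
  val name = cols.getString("COLUMN_NAME")
  val typ  = cols.getString("TYPE_NAME")
  val size = cols.getInt("COLUMN_SIZE") // declared length, e.g. 32 for varchar(32)
  println(s"$name $typ($size)")
}
cols.close()
conn.close()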

Data loss in cassandra because of frequent delete and insert of same column in a row

I have a column family posts which is used to store post details from my Facebook account. I am using Cassandra 2.0.9 and the DataStax Java driver 3.0.
CREATE TABLE posts (
    key blob,
    column1 text,
    value blob,
    PRIMARY KEY ((key), column1)
) WITH COMPACT STORAGE;
Here the rowkey is my userid, the columnkey is the postid, and the value is the post JSON. Whenever I refresh my application in the browser, it fetches data from Facebook and removes and re-adds data for existing postids. Sometimes I miss some posts in Cassandra. Can frequent delete and insert of the same column in a row cause data loss? How can I manage this?
It's not really data loss; if you're updating the same column at a very high frequency (like thousands of updates/sec), you may get unpredictable results.
Why? Because Cassandra uses the insert timestamp at read time to determine which value is the right one, by comparing the timestamps of the same column from different replicas.
Currently the resolution of the timestamp is on the order of milliseconds, so if your update rate is very high - for example, 2 updates on the same column within the same millisecond - the bigger post JSON will win.
By bigger, I mean by using postJson1.compareTo(postJson2). The ordering is determined by the type of your column, and in your case it's a String, so Cassandra breaks the tie by comparing the post JSON data lexicographically.
To avoid this, you can provide the write timestamp on the client side by generating a unique timeuuid() yourself.
There are many ways to generate such a TimeUUID, for example by using the Java driver class com.datastax.driver.core.utils.UUIDs.timeBased()
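A minimal sketch of the client-side timestamp idea, under a few assumptions not stated in the answer: Scala on top of the Java driver 3.x, a local contact point, a keyspace named myks, and placeholder key/post values. The point is simply that each write carries its own strictly increasing microsecond timestamp, so two updates in the same millisecond no longer tie.

import java.nio.ByteBuffer
import com.datastax.driver.core.{Cluster, SimpleStatement}

// Sketch only: generate strictly increasing client-side write timestamps (in
// microseconds). Host, keyspace, and the sample values are assumptions.
object MonotonicMicros {
  private var last = 0L
  def next(): Long = synchronized {
    last = math.max(last + 1, System.currentTimeMillis() * 1000L)
    last
  }
}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("myks")

val userId   = ByteBuffer.wrap("user-1".getBytes("UTF-8"))
val postId   = "post-1"
val postJson = ByteBuffer.wrap("""{"text":"hello"}""".getBytes("UTF-8"))

// USING TIMESTAMP attaches the client-supplied write timestamp in the CQL itself,
// which also works with the older protocol version spoken by Cassandra 2.0.
val ts = MonotonicMicros.next()
val stmt = new SimpleStatement(
  s"INSERT INTO posts (key, column1, value) VALUES (?, ?, ?) USING TIMESTAMP $ts",
  userId, postId, postJson)
session.execute(stmt)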

Azure Table Storage Rowkey Query not returning correct entities

I have an Azure table storage account with a lot of entities. When I query for entities with RowKey (which holds a Double value) less than 8888 using the query "RowKey le '8888'", I also get entities whose RowKey is greater than 8888.
Even though you are storing a Double value in RowKey, it gets stored as a string (both PartitionKey and RowKey are of the string data type). Thus the behavior you are seeing is correct, because in a string comparison '21086' is smaller than '8888'.
What you need to do is make these strings of equal length by pre-padding them with zeros (so your RowKey values become 000021086 and 000008888, for example); then, when you perform your query, the larger values will no longer be returned.
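A minimal sketch of the padding idea (the 9-digit width and the whole-number key are assumptions; since the question's keys hold Double values, you would also fix the number of fraction digits):

// Sketch only: fixed-width, zero-padded keys make lexicographic (string) order
// match numeric order, so range filters like "RowKey le '000008888'" behave as expected.
def paddedRowKey(value: Long, width: Int = 9): String =
  ("%0" + width + "d").format(value)

println(paddedRowKey(21086)) // 000021086
println(paddedRowKey(8888))  // 000008888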
