spark JDBC column size - apache-spark

spark JDBC column size:
I"m trying to get column (VARCHAR) size, I'm using :
spark.read.jdbc(myDBconnectionSTring,scheam.table, connectionProperties)
to retrieve column name and type but I need for varchar column the size.
In java JDBC Database Metadata I can get column name, type, and size.
Is it possible with spark?
Thanks

Apache Spark uses a single uniform type for all text columns, StringType, which maps to its internal unsafe UTF-8 representation (UTF8String). The representation is the same regardless of the type declared in the external storage, so the VARCHAR length is not exposed through the DataFrame schema.
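
If you need the declared size, one workaround is to query the database's own column metadata over the same JDBC connection. This is only a sketch: it assumes a database that exposes information_schema (e.g. MySQL, PostgreSQL, SQL Server); for Oracle you would query ALL_TAB_COLUMNS instead, and the connection variables are the placeholders from the question.
# Sketch: Spark's schema only reports StringType, so read the VARCHAR sizes
# from the database's metadata table through the same JDBC connection.
col_meta = spark.read.jdbc(
    myDBconnectionString,
    "(SELECT column_name, data_type, character_maximum_length "
    "FROM information_schema.columns "
    "WHERE table_schema = 'schema' AND table_name = 'table') AS col_meta",
    properties=connectionProperties)
col_meta.show()  # character_maximum_length holds the declared VARCHAR size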

Related

spark thrift server issue: Length field is empty for varchar fields

I am trying to read data from Spark Thrift Server using SAS. In the table definition shown in DBeaver, the Length field is empty only for fields with the VARCHAR data type. I can see the length in the Data Type field as varchar(32), but that doesn't serve my purpose, because the SAS application reads the Length field. Since this field is not populated, SAS defaults to the maximum size and as a result becomes extremely slow. The Length field is populated when I connect to Hive.

Oracle Column type Number is showing Decimal value in spark

Using the Spark read JDBC option, I am reading an Oracle table where one of the columns is of type NUMBER. After reading and writing to an S3 bucket, the DataFrame's printSchema shows decimal(38,10). I know a cast to int type can help, but the issue is that we created the Redshift table with an INTEGER type, and the decimal values (DataFrame values) are rejected by the COPY command. Is there any solution other than the cast option?
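
One option, sketched below for Spark 2.3+ where the JDBC reader supports the customSchema option (the URL, credentials, and the column name "id" are placeholders), is to override the inferred decimal(38,10) at read time so the data written to S3 already matches the Redshift INTEGER column; on older versions, casting after the read is the fallback.
# Sketch: force the Oracle NUMBER column to an integer type at read time
# instead of the default decimal(38,10). "id" is a hypothetical column name.
df = (spark.read
      .format("jdbc")
      .option("url", oracleUrl)
      .option("dbtable", "schema.my_table")
      .option("user", user)
      .option("password", password)
      .option("customSchema", "id INT")  # JDBC reader option, Spark 2.3+
      .load())

# Fallback for older Spark versions: cast after reading.
from pyspark.sql.functions import col
df = df.withColumn("id", col("id").cast("int"))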

null values in some of dataframe columns, while reading it from hbase

I am reading data from HBase using Spark SQL. One column has XML data. When the XML is small, I am able to read the data correctly, but as soon as the size grows too large, some of the columns in the DataFrame become null. The XML itself still comes through correctly.
While loading the data from SQL into HBase I used this setting in my Sqoop job:
hbase.client.keyvalue.maxsize=0

spark to hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query, or do I have to use some workaround? I'm using Spark 1.6.
I found on the Cloudera forum that Spark does not care about the length; it saves the value as a string with no size limit.
The table is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to see something like this :)
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
The table I'm inserting into is stored as ORC (the line above should probably also have .format('orc')).
I found here that if I specify a column as varchar(xx), then the input string will be cut off to xx characters.
Thx
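
Since the DataFrame API has no varchar cast, one workaround is to enforce the length yourself before inserting, which mirrors the cut-off behaviour described above. This is just a sketch using the variable names from the question, with 100 standing in for the varchar(100) column length.
from pyspark.sql.functions import substring

# Sketch: truncate to 100 characters manually, since Spark stores the value
# as an unbounded string. ext, part and int are the variables from the question.
temp_table = sqlContext.table(ext)
df = temp_table.select(substring(temp_table.day.cast('string'), 1, 100).alias('day'))
df.write.partitionBy(part).mode('overwrite').insertInto(int)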

Inserting data into a static Hive partition using Spark SQL

I have trouble figuring out how to insert data into a static partition of a Hive table using Spark SQL. I can use code like this to write into dynamic partitions:
df.write.partitionBy("key").insertInto("my_table")
However, I can't figure out how to insert the data into a static partition. That is, I want to define the partition the entire DataFrame should be written to, without having to add the partition column to the DataFrame.
I see static partitioning mentioned in the InsertIntoHiveTable class, so I guess it is supported. Is there a public API to do what I want?
You can use
DataFrame tableMeta = sqlContext.sql(String.format("DESCRIBE FORMATTED %s", tableName));
String location = tableMeta.filter("result LIKE 'Location:%'").first().getString(0);
and use a regex to extract the table location from that line. Once you have the table location, you can easily construct the partition location, e.g.
String partitionLocation = location + "/" + partitionKey
(partitionKey is something like dt=20160329/hr=21)
Then, you can write to that path
df.write.parquet(partitionLocation)
(In my case, I do not include the partition columns when I build the DataFrame. I'm not sure whether including them causes an error.)
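
Another route, sketched here with a placeholder table name and partition values, is to stay in Spark SQL, which does accept a static partition spec in an INSERT statement; register the DataFrame (without the partition columns) as a temporary table first.
# Sketch: write into a static partition through Spark SQL. The DataFrame must
# match the table's non-partition columns; names and values are placeholders.
df.registerTempTable("staging")  # Spark 1.x; on 2.x+ use df.createOrReplaceTempView("staging")
sqlContext.sql("""
    INSERT OVERWRITE TABLE my_table
    PARTITION (dt='20160329', hr='21')
    SELECT * FROM staging
""")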
