Oracle column type NUMBER is showing decimal value in Spark - apache-spark

Using Spark's JDBC read option, I am reading an Oracle table where one of the columns is of type NUMBER. After reading and writing into an S3 bucket, the DataFrame's printSchema shows decimal(38,10). I know a cast to int type could help, but the issue is that we created the Redshift table with INTEGER type, and the decimal values (from the DataFrame) are not allowing the COPY command to run. Is there any solution other than the cast option?
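One alternative (assuming Spark 2.3 or later) is the JDBC reader's customSchema option, e.g. .option("customSchema", "my_col INT"), which overrides the default decimal(38,10) mapping at read time instead of casting afterwards. Whichever route you take, the narrowing is only safe if every value is integral; a minimal pure-Python sketch of that check (the sample values are hypothetical):

```python
from decimal import Decimal

def safe_to_int(value: Decimal):
    """Return int(value) if the decimal has no fractional part, else None.

    This mirrors the guarantee a decimal(38,10) -> INTEGER narrowing
    must give before Redshift's COPY will accept the data.
    """
    if value == value.to_integral_value():
        return int(value)
    return None

# Hypothetical values as they might come back from the NUMBER column.
print(safe_to_int(Decimal("42.0000000000")))   # 42
print(safe_to_int(Decimal("42.5000000000")))   # None
```

If safe_to_int ever returns None for real data, the Redshift column type (not the Spark cast) is what needs to change.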

spark JDBC column size

I'm trying to get a column's (VARCHAR) size. I'm using:
spark.read.jdbc(myDBconnectionSTring, schema.table, connectionProperties)
to retrieve the column names and types, but for VARCHAR columns I also need the size.
In Java, JDBC's DatabaseMetaData lets me get the column name, type, and size.
Is it possible with spark?
Thanks
Apache Spark uses a single uniform type for all text columns, StringType, which is mapped to an internal unsafe UTF-8 representation. The representation is the same no matter which type is used in the external storage.
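Since Spark maps every text column to StringType, the declared size has to come from the database's own catalog (e.g. Oracle's ALL_TAB_COLUMNS, readable through the same JDBC connection). A minimal sketch using the stdlib sqlite3 module as a stand-in for the real database; the table and size here are hypothetical:

```python
import re
import sqlite3

def varchar_sizes(conn, table):
    """Return {column_name: declared_size} for VARCHAR(n) columns,
    read from the database's catalog rather than from Spark."""
    sizes = {}
    for _, name, declared_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
        match = re.search(r"VARCHAR\((\d+)\)", declared_type, re.IGNORECASE)
        if match:
            sizes[name] = int(match.group(1))
    return sizes

# Stand-in database; against Oracle you would instead query ALL_TAB_COLUMNS
# (CHAR_LENGTH / DATA_LENGTH) through spark.read.jdbc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name VARCHAR(50))")
print(varchar_sizes(conn, "users"))   # {'name': 50}
```

The same idea works from Spark itself: point spark.read.jdbc at the catalog table instead of the data table.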

Auto infer schema from parquet/ selectively convert string to float

I have a parquet file with 400+ columns. When I read it, the default datatype attached to a lot of the columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # not available for spark.read.parquet; present for spark.read.csv
I tried changing
mergeSchema=True #but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but converts all the actual string column values to null. I can't wrap this in a try/except block, as it's not throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic as Spark: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column.
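The hierarchy-and-cast idea can be sketched in plain Python; this is a simplification of what Spark's CSVInferSchema does, and the type ladder here (int, then float, then string) is a hypothetical minimal one that ignores dates and decimals:

```python
def infer_type(values):
    """Walk a preferred type hierarchy (int -> float -> str) and return
    the most selective type that parses every value in the column."""
    for candidate in (int, float):
        try:
            for v in values:
                candidate(v)  # raises ValueError if any value does not parse
        except (TypeError, ValueError):
            continue
        return candidate
    return str  # fallback: everything parses as a string

print(infer_type(["1", "2", "3"]))    # <class 'int'>
print(infer_type(["1.5", "2"]))       # <class 'float'>
print(infer_type(["1.5", "abc"]))     # <class 'str'>
```

In PySpark, the winning type could then be applied with col(c).cast(...) only to the columns where the sketch returns int or float.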
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
There's no easy way currently. There's an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; the same could be created for PySpark.

Spark: read from parquet an int column as long

I have a parquet file that is read by spark as an external table.
One of the columns is defined as int both in the parquet schema and in the spark table.
Recently, I've discovered int is too small for my needs, so I changed the column type to long in new parquet files.
I changed also the type in the spark table to bigint.
However, when I try to read an old parquet file (with int) by spark as external table (with bigint), I get the following error:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
One possible solution is altering the column type in the old parquet to long, which I asked about here: How can I change parquet column type from int to long?, but it is very expensive since I have a lot of data.
Another possible solution is to read each parquet file according to its schema to a different spark table and create a union view of the old and new tables, which is very ugly.
Is there another way to read from parquet an int column as long in spark?
Using PySpark, couldn't you just do
df = spark.read.parquet('path to parquet files')
then just cast the column type in the DataFrame (col comes from pyspark.sql.functions and LongType from pyspark.sql.types):
new_df = (df
    .withColumn('col_name', col('col_name').cast(LongType()))
)
and then save the new DataFrame to the same location with overwrite mode.

spark to hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query, or do I have to use some workaround? I'm using Spark 1.6.
I found on a Cloudera forum that Spark does not care about the length; it saves the value as a string with no size limit.
The table is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to see something like this :)))
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
The table I'm inserting into is saved as an ORC file (the line above probably should have .format('orc')).
I found here that if I specify a column as varchar(xx), the input string will be cut off to length xx.
Thx
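The cutoff behavior described above can be sketched in plain Python; the 100-character limit mirrors the hypothetical varchar(100) column:

```python
def enforce_varchar(value: str, size: int) -> str:
    """Mimic Hive's varchar(n) semantics: values longer than n are
    truncated; shorter values are stored unchanged (no padding)."""
    return value[:size]

print(enforce_varchar("a" * 150, 100) == "a" * 100)   # True
print(enforce_varchar("short", 100))                  # short
```

The data-quality worry in the question is exactly this silent truncation: nothing errors, the overlong tail is just dropped.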

Unable to read timestamp value in pyspark from Hive (spark 1.6.1)

I am trying to read a Hive table having a date column with datatype timestamp, length=9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all the other columns (datatype = varchar).
It either reads null or gives None in the date column.
I have printed df.dtypes and found that the DataFrame schema was inferred correctly and the date columns have the timestamp datatype.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone tell me what could be wrong, or what limitation causes this error, and how I could rectify it?
I have seen this problem in Spark, where it displays null when the datatype is timestamp; it's a bug. There is a way to get around it: read that date column as a string, using something like to_char(column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to timestamp. If you can tell me the source type and the tool you used to pull the data (e.g. Sqoop), or whether you are getting the data in some form of files, I can help you better.
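Assuming the exported strings come back looking like 2016-07-01 10-30-00, the cast-back step of that workaround can be sketched in plain Python (the sample value and format are hypothetical):

```python
from datetime import datetime

def parse_exported_timestamp(s: str) -> datetime:
    """Parse a timestamp that was exported as text in
    'YYYY-MM-DD HH-MM-SS' shape back into a datetime."""
    return datetime.strptime(s, "%Y-%m-%d %H-%M-%S")

print(parse_exported_timestamp("2016-07-01 10-30-00"))   # 2016-07-01 10:30:00
```

In Spark the equivalent final step would be a cast of the string column back to TimestampType, once the values round-trip cleanly as text.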
