Converting Oracle RAW types with Spark

I have a table in an Oracle DB that contains a column stored as a RAW type. I'm making a JDBC connection to read that column and, when I print the schema of the resulting dataframe, I notice that I have a column with a binary data type. This was what I was expecting to happen.
The problem is that I need to read that column as a String, so I thought a simple data type conversion would solve it.
df.select("COLUMN").withColumn("COL_AS_STRING", col("COLUMN").cast(StringType)).show
But what I got was a bunch of random characters. As I'm dealing with a RAW type, it's possible that no string representation of this data exists, so, just to be safe, I ran a simple select to get the first rows from the source (using sqoop-eval), and somehow sqoop can display this column as a string.
I then thought that this could be an encoding problem so I tried this:
df.selectExpr("decode(COLUMN,'utf-8')").show
With utf-8 and a bunch of other encodings. But again, all I got was random characters.
Does anyone know how I can do this data type conversion?
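What is likely happening can be illustrated in plain Python (the byte value below is made up): a RAW column holds arbitrary bytes, so decoding them as UTF-8 produces garbage, whereas a hex rendering, which is how Oracle tools and sqoop typically display RAW values, is always printable.

```python
# Hypothetical RAW value as raw bytes
raw = bytes([0x4F, 0x12, 0xAB, 0xFF])

# Decoding arbitrary bytes as UTF-8 yields replacement/garbage characters,
# which matches the "random characters" seen with cast(StringType):
as_utf8 = raw.decode("utf-8", errors="replace")

# A hex string rendering is always well-defined and printable:
as_hex = raw.hex().upper()
print(as_hex)  # 4F12ABFF
```

In Spark, the same idea is available through the built-in `hex` function, e.g. `df.withColumn("COL_AS_HEX", hex(col("COLUMN")))`, which should match what sqoop-eval displays if sqoop is showing the hex rendering of the RAW value.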

Related

Data type TEXT not supported

I have a huge data file, and one of its columns is text with very large values.
I tried to create a column with the TEXT data type, but it is not supported.
How do I bring TEXT-typed data over to Databricks?
Please guide.
Here's the reference: Databricks data types
For CHAR, VARCHAR, NVARCHAR, TEXT and, in general, character strings of any size, just use STRING.

Decimals stored in scientific format in Hive table while loading it from Apache Spark

I am facing a problem with a Hive table where a decimal number such as 0.00000000000 is stored as 0E-11. Even though both represent the same value 0, I do not understand why it is stored in scientific format. This is one of the percentage fields used for numeric calculation, so the scale of the decimal number should be high. Even though it is in scientific format, it does not impact our calculations in any way; we are able to do numeric operations. But the scientific representation might confuse the people who use this table. The issue happens only when the percentage is 0. For valid percentages like 0.123456789, the value is stored as-is, without an exponent.
Can you please explain why 0.00000000000 is represented in scientific format? Also, I would like to know how I can store the decimal number as-is, without the exponent, like 0.00000000000. For our purposes, the solution needs to be in terms of Hive Query Language (HQL) only, since we have a framework that takes an HQL file and writes its result to the Hive table.
To demonstrate this issue, I followed the steps below:
1. Created a temp table with a decimal column and a string column, using Parquet as the file format.
2. Inserted 0.00000000000 both as a string and as decimal(12,11).
3. Displayed both columns; both are shown in scientific format.
4. Used parquet-tools to inspect the file contents; even in the Parquet file, the value is stored in scientific format.
5. Tried plain text format as well, but the behavior is the same.
I am using Spark 2.3. I looked at various StackOverflow threads such as this, this, and this, but they use the Spark DataFrame API to preserve the plain number format, whereas I want the solution in terms of HQL.
Please let me know if there are any questions.
I reckon the format_number function should do the trick for you.
Please have a look at the post below:
How to show decimal point in hive?
Thanks to user https://stackoverflow.com/users/4681341/vk-217?tab=profile
I checked it and it is working.
select format_number(0.00000000000,11);
Note: Don't have enough reputations to comment so adding it as an answer here.
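The underlying behavior can be reproduced outside Hive with Python's decimal module, which follows similar significant-digit rules: a decimal zero with eleven fractional digits normalizes to 0E-11, while fixed-point formatting (analogous to what format_number does in HiveQL) restores the plain form.

```python
from decimal import Decimal

# A decimal zero with scale 11 prints in scientific form:
zero = Decimal("0.00000000000")
print(str(zero))              # 0E-11 -- same value as 0, different representation

# Non-zero values of the same scale keep the plain form:
nonzero = Decimal("0.12345678900")
print(str(nonzero))           # 0.12345678900

# Fixed-point formatting with 11 fractional digits, analogous to
# format_number(col, 11) in HiveQL, restores the plain representation:
print(format(zero, ".11f"))   # 0.00000000000
```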

Auto infer schema from parquet/ selectively convert string to float

I have a Parquet file with 400+ columns. When I read it, the default data type attached to a lot of the columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # available for spark.read.csv, but not for spark.read.parquet
I tried setting
mergeSchema=True  # but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but converts all the actual string column values to null. I can't wrap this in a try/except block as it's not throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic as Spark: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column.
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
There's no easy way currently. There's an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; the equivalent could be created for PySpark.
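The "attempt to cast until the most selective type parses everything" idea can be sketched in plain Python; the function and sample data below are illustrative only, and the Spark-side wiring is left out:

```python
def all_parse_as_float(values):
    """Return True if every non-null value can be read as a float."""
    for v in values:
        if v is None:
            continue
        try:
            float(v)
        except (TypeError, ValueError):
            return False
    return True

# Columns whose sampled values all parse as floats are candidates for casting;
# the rest stay as strings.
sample = {
    "price": ["1.5", "2", None],
    "city": ["Boston", "NYC", "LA"],
}
castable = [c for c, vals in sample.items() if all_parse_as_float(vals)]
print(castable)  # ['price']
```

In PySpark, one way to do the same check without collecting data would be to count rows where the cast comes back null on a non-null input, e.g. filtering on `col(c).cast("float").isNull() & col(c).isNotNull()`, and casting only the columns where that count is zero.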

Cassandra CQL: How to select encoded value from column

I have inserted string and integer values into dynamic columns in a Cassandra Column Family. When I query for the values in CQL they are displayed as hex encoded bits.
Can I somehow tell the query to decode the value into a string or integer?
I also would be happy to do this in the CLI if that's easier. There I see you can specify assume <column_family> validator as <type>;, but that applies to all columns and they have different types, so I have to run the assumption and query many times.
(Note that the columns are dynamic, so I haven't specified the validator when creating the column family).
You can use ASSUME in cqlsh like in cassandra-cli (although it only applies to printing values, not sending them, but that ought to be ok for you). You can also use it on a per-column basis, like:
ASSUME <column_family> ('anchor:cnnsi.com') VALUES ARE text;
..although (a), I just tested it, and this functionality is broken in cassandra-1.1.1 and later. I posted a fix at CASSANDRA-4352. And (b), this probably isn't a very versatile or helpful solution for more than a few one-off uses. I'd strongly recommend using CQL 3 here, as direct CQL support for wide storage-engine rows like this is deprecated. Your table here is certainly adaptable to an (easier to use) CQL 3 model, but I couldn't say exactly what it would be without knowing more about how you're using it.

SQLite 3 CSV Import to Table

I am using this as a resource to get me started - http://www.pantz.org/software/sqlite/sqlite_commands_and_general_usage.html
Currently I am working on creating an AIR program making use of the built in SQLite database. I could be considered a complete noob in making SQL queries.
table column types
I have a rather large Excel file (14K rows) that I have exported to a CSV file. It has 65 columns of varying data types (mostly ints, floats, and short strings, maybe a few bools). I have no idea about the proper way to import it so as to preserve the column structure, nor do I know the best data formats to choose per DB column. I could use some input on this.
table creation utils
Is there a util that can read an XLS file and based on the column headers, generate a quick query statement to ease the pain of making the query manually? I saw this post but it seems geared towards a preexisting CSV file and makes use of python (something I am also a noob at)
Thank you in advance for your time.
J
SQLite3's column types basically boil down to:
TEXT
NUMERIC (REAL, FLOAT)
INTEGER (the various lengths of integer; but INT will normally do)
BLOB (binary objects)
Generally in a CSV file you will encounter strings (TEXT), decimal numbers (FLOAT), and integers (INT). If performance isn't critical, those are pretty much the only three column types you need. (CHAR(80) is smaller on disk than TEXT but for a few thousand rows it's not so much of an issue.)
As far as putting data into the columns is concerned, SQLite3 uses type coercion to convert the input data type to the column type wherever the conversion makes sense. So all you have to do is specify the correct column type, and SQLite will take care of storing it in the correct way.
For example, the number -1230.00, the string "-1230.00", and the string "-1.23e3" will all coerce to the number -1230 when stored in a FLOAT column.
Note that if SQLite3 can't apply a meaningful type conversion, it will just store the original data without attempting to convert it at all. SQLite3 is quite happy to insert "Hello World!" into a FLOAT column. This is usually a Bad Thing.
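Both behaviors described above can be seen directly with Python's built-in sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x FLOAT)")

# A numeric string is coerced to the column's REAL affinity:
con.execute("INSERT INTO t VALUES ('-1230.00')")
# Text with no meaningful numeric conversion is stored as-is:
con.execute("INSERT INTO t VALUES ('Hello World!')")

for value, storage in con.execute("SELECT x, typeof(x) FROM t"):
    print(value, storage)
# -1230.0 real
# Hello World! text
```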
See the SQLite3 documentation on column types and conversion for gems such as:
Type Affinity
In order to maximize compatibility between SQLite and other database engines, SQLite supports the concept of "type affinity" on columns. The type affinity of a column is the recommended type for data stored in that column. The important idea here is that the type is recommended, not required. Any column can still store any type of data. It is just that some columns, given the choice, will prefer to use one storage class over another. The preferred storage class for a column is called its "affinity".