Spark Cassandra Write UDT With Case-Sensitive Names Fails - apache-spark

A write through the Spark Cassandra Connector fails with java.lang.IllegalArgumentException: udtId is not a field defined in this definition when the UDT uses case-sensitive field names.
I need the fields in the Cassandra table to keep their case, so I used quotes when creating them.
My Cassandra schema:
CREATE TYPE my_keyspace.my_udt (
"udtId" text,
"udtValue" text
);
CREATE TABLE my_keyspace.my_table (
"id" text PRIMARY KEY,
"someCol" text,
"udtCol" list<frozen<my_udt>>
);
My Spark DataFrame schema is
root
|-- id: string (nullable = true)
|-- someCol: string (nullable = true)
|-- udtCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- udtId: string (nullable = true)
| | |-- udtValue: string (nullable = true)
Are there any options to get this write to work other than defining my UDT with lowercase names? Making them lowercase would force me to invoke case-management code everywhere this is used, and I'd like to avoid that.
Because I couldn't write successfully, I haven't tried reading yet. Is this an issue with reads as well?
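For context on why the quotes matter: CQL folds unquoted identifiers to lowercase and preserves case only for double-quoted ones. The sketch below mimics that rule in plain Java. The class and method names (CqlIdentifierDemo, storedName) are hypothetical and are not part of any Cassandra driver or connector API.

```java
import java.util.Locale;

class CqlIdentifierDemo {
    // Returns the name Cassandra would store for an identifier as written in CQL.
    // Hypothetical helper for illustration only.
    static String storedName(String cql) {
        boolean quoted = cql.length() >= 2 && cql.startsWith("\"") && cql.endsWith("\"");
        if (quoted) {
            // Quoted identifier: case is preserved; a doubled quote stands for one quote
            return cql.substring(1, cql.length() - 1).replace("\"\"", "\"");
        }
        // Unquoted identifier: folded to lowercase
        return cql.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(storedName("udtId"));      // udtid
        System.out.println(storedName("\"udtId\""));  // udtId
    }
}
```

This is why the UDT fields in the schema above are stored as udtId and udtValue rather than udtid and udtvalue, and why any client writing to them has to match that case exactly.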

You need to upgrade to Spark Cassandra Connector (SCC) 2.5.0. I can't find the specific commit that fixes it or a Jira ticket that mentions it; I suspect it was fixed in the DataStax version first and then released as part of the merge announced here.
Here is how it works in SCC 2.5.0 + Spark 2.4.6, while it fails with SCC 2.4.2 + Spark 2.4.6:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("my_table", "test").load()
data: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> val data2 = data.withColumn("id", concat(col("id"), lit("222")))
data2: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> data2.write.cassandraFormat("my_table", "test").mode("append").save()

Related

Reading orc file of Hive managed tables in pyspark

I am trying to read the ORC files of a managed Hive table using the PySpark code below.
spark.read.format('orc').load('hive managed table path')
When I call printSchema() on the resulting DataFrame, I get the following:
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- col1: float (nullable = true)
| |-- col2: integer (nullable = true)
|-- partition_by_column: date (nullable = true)
Now I am not able to parse this data or do any manipulation on the DataFrame. When I apply an action such as show(), I get this error:
java.lang.IllegalArgumentException: Include vector the wrong length
Has anyone faced the same issue? If so, can you please suggest how to resolve it?
It's a known issue.
You get that error because you're trying to read a Hive ACID table, and Spark doesn't support this yet.
Maybe you can export your Hive table to normal ORC files and then read them with Spark, or try an alternative such as Hive JDBC, as described here.
As I am not sure about the versions, you can try other ways to load the ORC file.
Using SQLContext
val df = sqlContext.read.format("orc").load(orcfile)
OR
val df= spark.read.option("inferSchema", true).orc("filepath")
OR Spark SQL (recommended)
import spark.sql
sql("SELECT * FROM table_name").show()

Convert from long to Timestamp for Insert into DB

Goal:
Read data from a JSON file where timestamp is a long type, and insert into a table that has a Timestamp type. The problem is that I don't know how to convert the long type to a Timestamp type for the insert.
Input File Sample:
{"sensor_id":"sensor1","reading_time":1549533263587,"notes":"My Notes for Sensor1","temperature":24.11,"humidity":42.90}
I want to read this, create a Bean from it, and insert into a table. Here is my Bean Definition:
public class DummyBean {
private String sensor_id;
private String notes;
private Timestamp reading_time;
private double temperature;
private double humidity;
Here is the table I want to insert into:
create table dummy (
id serial not null primary key,
sensor_id varchar(40),
notes varchar(40),
reading_time timestamp with time zone default (current_timestamp at time zone 'UTC'),
temperature decimal(15,2),
humidity decimal(15,2)
);
Here is my Spark app to read the JSON file and do the insert (append)
SparkSession spark = SparkSession
.builder()
.appName("SparkJDBC2")
.getOrCreate();
// Java Bean used to apply schema to JSON Data
Encoder<DummyBean> dummyEncoder = Encoders.bean(DummyBean.class);
// Read JSON file to DataSet
String jsonPath = "input/dummy.json";
Dataset<DummyBean> readings = spark.read().json(jsonPath).as(dummyEncoder);
// Diagnostics and Sink
readings.printSchema();
readings.show();
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readings.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
Output and Error Message:
root
|-- humidity: double (nullable = true)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = true)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = true)
+--------+--------------------+-------------+---------+-----------+
|humidity| notes| reading_time|sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
| 42.9|My Notes for Sensor1|1549533263587| sensor1| 24.11|
+--------+--------------------+-------------+---------+-----------+
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column "reading_time" not found in schema Some(StructType(StructField(id,IntegerType,false), StructField(sensor_id,StringType,true), StructField(notes,StringType,true), StructField(temperature,DecimalType(15,2),true), StructField(humidity,DecimalType(15,2),true)));
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4.apply(JdbcUtils.scala:146)
The exception in your post says the "reading_time" column was not found, so please check that the table actually has that column on the database side. Also, the timestamp arrives in milliseconds, so you need to divide it by 1000 before applying the to_timestamp() function; otherwise you'll get a wildly wrong date.
I was able to replicate this and convert reading_time, as shown below.
scala> val readings = Seq((42.9,"My Notes for Sensor1",1549533263587L,"sensor1",24.11)).toDF("humidity","notes","reading_time","sensor_id","temperature")
readings: org.apache.spark.sql.DataFrame = [humidity: double, notes: string ... 3 more fields]
scala> readings.printSchema();
root
|-- humidity: double (nullable = false)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = false)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = false)
scala> readings.show(false)
+--------+--------------------+-------------+---------+-----------+
|humidity|notes |reading_time |sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |
+--------+--------------------+-------------+---------+-----------+
scala> readings.withColumn("ts", to_timestamp('reading_time/1000)).show(false)
+--------+--------------------+-------------+---------+-----------+-----------------------+
|humidity|notes |reading_time |sensor_id|temperature|ts |
+--------+--------------------+-------------+---------+-----------+-----------------------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |2019-02-07 04:54:23.587|
+--------+--------------------+-------------+---------+-----------+-----------------------+
Thanks for your help. Yes the table was missing the column so I fixed that.
This is what solved it (Java version)
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_timestamp;
...
Dataset<Row> readingsRow = readings.withColumn("reading_time", to_timestamp(col("reading_time").divide(1000L)));
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readingsRow.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
If your date is a String, you can use
String readtime = obj.getString("reading_time");
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ"); //Z for time zone
Date reading_time = sdf.parse(readtime);
or, if it is a long, use
new Date(json.getLong(milliseconds))
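As a minimal sketch of the same conversion outside Spark, assuming plain JDK classes only (java.time.Instant and java.sql.Timestamp), the sample value from the question can be converted as follows. The class and method names (EpochMillisDemo, fromMillis, fromSecondsAndMillis) are hypothetical and exist only for this illustration.

```java
import java.sql.Timestamp;
import java.time.Instant;

class EpochMillisDemo {
    // Interprets the value as milliseconds since 1970-01-01T00:00:00Z.
    static Instant fromMillis(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    // Splits the value into whole seconds plus the millisecond remainder.
    // This is the same idea as dividing by 1000 before calling to_timestamp.
    static Instant fromSecondsAndMillis(long epochMillis) {
        long seconds = Math.floorDiv(epochMillis, 1000L);
        long millis = Math.floorMod(epochMillis, 1000L);
        return Instant.ofEpochSecond(seconds, millis * 1_000_000L);
    }

    public static void main(String[] args) {
        long readingTime = 1549533263587L; // sample value from the question

        Instant instant = fromMillis(readingTime);
        Timestamp ts = Timestamp.from(instant);

        System.out.println(instant);      // 2019-02-07T09:54:23.587Z (UTC)
        System.out.println(ts.getTime()); // 1549533263587

        // Treating the same number as seconds yields a date far in the future,
        // which is the "weird date" the answer above warns about.
        System.out.println(Instant.ofEpochSecond(readingTime));
    }
}
```

The instant is printed in UTC. The 04:54:23.587 shown in the Scala output above presumably reflects a session time zone five hours behind UTC.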

spark read orc with specific columns

I have an ORC file. When I read it as shown below, all the columns are loaded.
val df= spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
But I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can declare any data transformation in your code without it taking effect immediately. Spark starts doing work only after an action is called, and it is smart enough not to do extra work: with a columnar format such as ORC, only the selected columns need to be read.
So you can write like this:
import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._
val inDF: DataFrame = spark.read.orc("/some/path/")
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()

JSON Struct to Map[String,String] using sqlContext

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from a Hive table using sqlContext
val a = sqlContext.sql("select * from student_record")
below is the schema.
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way to read the data using read.json(rdd) and get a Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to explicitly define your schema when calling read.json.
You can read about the details in Programmatically specifying the schema in the Spark SQL Documentation.
For example in your specific case it would be
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That would be one column legal_name being a map.
Once you have defined your schema, you can call
sqlContext.read.schema(schema).json(rdd) to create your data frame from your JSON dataset with the desired schema.

Why do Spark DataFrames not change their schema and what to do about it?

I'm using Spark 2.1's Structured Streaming to read from a Kafka topic whose contents are binary avro-encoded.
Thus, after setting up the DataFrame:
val messages = spark
.readStream
.format("kafka")
.options(kafkaConf)
.option("subscribe", config.getString("kafka.topic"))
.load()
If I print the schema of this DataFrame (messages.printSchema()), I get the following:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- timestampType: integer (nullable = true)
This question should be orthogonal to the problem of avro-decoding, but let's assume I want to somehow convert the value content from the messages DataFrame into a Dataset[BusinessObject], by a function Array[Byte] => BusinessObject. For example completeness, the function may just be (using avro4s):
case class BusinessObject(userId: String, eventId: String)
def fromAvro(bytes: Array[Byte]): BusinessObject =
AvroInputStream.binary[BusinessObject](
new ByteArrayInputStream(bytes)
).iterator.next
Of course, as miguno says in this related question I cannot just apply the transformation with a DataFrame.map(), because I need to provide an implicit Encoder for such a BusinessObject.
That can be defined as:
implicit val myEncoder : Encoder[BusinessObject] = org.apache.spark.sql.Encoders.kryo[BusinessObject]
Now, perform the map:
val transformedMessages : Dataset[BusinessObject] = messages.map(row => fromAvro(row.getAs[Array[Byte]]("value")))
But if I query the new schema, I get the following:
root
|-- value: binary (nullable = true)
And I think that does not make any sense, as the dataset should use the Product properties of the BusinessObject case-class and get the correct values.
I've seen some examples on Spark SQL using .schema(StructType) in the reader, but I cannot do that, not just because I'm using readStream, but because I actually have to transform the column before being able to operate in such fields.
I am hoping to tell the Spark SQL engine that the transformedMessages Dataset schema is a StructField with the case class' fields.
I would say you get exactly what you ask for. As I already explained today, Encoders.kryo generates a blob containing the serialized object. Its internal structure is opaque to the SQL engine and cannot be accessed without deserializing the object. So effectively, what your code does is take one serialization format and replace it with another.
Another problem is that you are trying to mix a dynamically typed DataFrame (Dataset[Row]) with a statically typed object. Excluding the UDT API, Spark SQL doesn't work like this. Either you use a statically typed Dataset, or a DataFrame with the object structure encoded as a struct hierarchy.
The good news is that simple product types like BusinessObject should work just fine without any need for the clumsy Encoders.kryo. Just skip the Kryo encoder definition and be sure to import the implicit encoders:
import spark.implicits._
