SPARK read.jdbc & custom schema - apache-spark

With spark.read.format(...) one can add a custom schema, like so:
val df = sqlContext
.read
.format("jdbc")
.option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
.option("user", "root")
.option("password", "password")
.option("dbtable", sql)
.schema(customSchema)
.load()
However, using spark.read.jdbc, I cannot do the same, nor can I find the syntax for it. What am I missing, or has this changed in Spark 2.x? I read this in the manual: "... Spark automatically reads the schema from the database table and maps its types back to Spark SQL types. ..." Presumably what I am trying to do is no longer possible in the way shown above.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
I ended up trying this:
val dataframe_mysql = spark.read.schema(openPositionsSchema).jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
and got this:
org.apache.spark.sql.AnalysisException: User specified schema not supported with `jdbc`;
This seems like a retrograde step in a way.

I do not agree with the answer.
You can supply a custom schema, either with your method or by setting a connection property:
connectionProperties.put("customSchema", schemachanges);
where schemachanges is a string in the format "fieldName newDataType, ...", for example:
"key String, value DECIMAL(20, 0)"
If key was a number in the original table, it will generate an SQL query like "key::character varying, value::numeric(20, 0)".
This is better than a cast, because a cast is a mapping operation executed after the value has been selected in its original type, whereas a custom schema is not.
I had a case where Spark could not select NaN from a Postgres numeric column, because it maps numerics to Java BigDecimal, which does not allow NaN, so the Spark job failed every time it read those values. A cast produced the same result. However, after changing the schema to either String or Double, Spark was able to read them properly.
Spark documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

You can use a custom schema and pass it in the properties parameter. You can read more at https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Create a variable:
c_schema = 'id_type INT'
Properties conf:
config = {"user":"xxx",
"password": "yyy",
"driver":"com.mysql.jdbc.Driver",
"customSchema":c_schema}
Read the table and create the DF:
df = spark.read.jdbc(url=jdbc_url,table='table_name',properties=config)
You must use the same column names as the table. Only the columns you list in the custom schema are changed.
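The customSchema value used in the answers above is a plain string of "columnName TYPE" pairs separated by commas. As a minimal sketch, it could be assembled from a mapping instead of being typed by hand. The helper name build_custom_schema is hypothetical and not part of Spark; only the resulting string would be passed on as the customSchema property.

```python
def build_custom_schema(overrides):
    """Join {column: type} overrides into a "name TYPE, name TYPE" string.

    Hypothetical helper: Spark itself only sees the final string.
    """
    parts = []
    for column, sql_type in overrides.items():
        column, sql_type = column.strip(), sql_type.strip()
        if not column or not sql_type:
            raise ValueError("column name and type must both be non-empty")
        parts.append(f"{column} {sql_type}")
    return ", ".join(parts)


# Same overrides as the example string quoted in the answer above.
schema_changes = build_custom_schema({"key": "STRING", "value": "DECIMAL(20, 0)"})
print(schema_changes)
```

The resulting string would then go into the properties, for example as config["customSchema"] = schema_changes in the PySpark snippet above.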

"What am I missing or has this changed in SPARK 2.x?"
You are not missing anything. Modifying the schema on read with JDBC sources was never supported. The input is already typed, so there is no place for a schema.
If the types are not satisfactory, just cast the results to the desired types.

Related

Concatenating hive table after adding columns breaks spark read

Given some table manipulation – create table with 2 rows and columns, add 3rd column and insert third row with 3 values
CREATE TABLE concat_test(
one string,
two string
)
STORED AS ORC;
INSERT INTO TABLE concat_test VALUES (1,1), (2,2);
ALTER TABLE concat_test ADD COLUMNS (three string);
INSERT INTO TABLE concat_test VALUES (3,3,3);
alter table concat_test concatenate;
I get the exception Caused by: java.lang.ArrayIndexOutOfBoundsException: 3 when I try to read it with Spark:
spark.sql("select * from concat_test").collect()
It is obviously connected with the number of columns. I'm investigating the problem in ORC further. I found neither a quick fix for such partitions nor a description of the bug elsewhere. Is there one?
Could anyone try this on the latest Hadoop versions? Does the bug exist there?
Hive 1.2.1, Spark 2.3.2
UPD: I fixed my tables myself via Hive. Hive queries do work after this manipulation, so I created copies of the tables and did a select-insert of the old data into them.
I have totally run into this issue before!
This is a known issue.
Hive only does schema on read, so it has no reason to detect this as a problem and will happily let you declare any definition you want. The data underlying the table does NOT get updated when you change the definition of the Hive table. Generally I have fixed the issue by rewriting the underlying ORC files to match the Hive definition. As a workaround, you could also read the ORC files directly, as that issue has now been fixed.
Here's a workaround if you know that the underlying ORC files aren't in the correct format and you want to correct it.
import org.apache.spark.sql.types._
val s = Seq(("apple","apples"),("car","cars")) // create data
val t = Seq(("apple",12),("apples", 50),("car",5),("cars",40)) // create data
val df1 = sc.parallelize(t).toDF("Sub_Cat", "Count")
val df2 = sc.parallelize(s).toDF("Main_Cat","Sub_Cat")
df1.write.format("orc").save("category_count")
df2.write.format("orc").save("categories")
// Schema to impose on read; it has one column (Main_Cat) that the category_count files lack
val schema = StructType(Array(
StructField("Main_Cat", StringType, nullable = true),
StructField("Sub_Cat", StringType, nullable = true),
StructField("Count", IntegerType, nullable = true)))
val CorrectedSchema = spark.read.schema(schema).orc("category_count")
CorrectedSchema.show()
This reads the files with the schema you intend. If you trust the Hive schema, you can use this cheat to get the schema (and reduce the typing):
val schema = spark.sql("select * from concat_test limit 0").schema

spark cassandra connector problem using catalogs

I am following the instructions found here to connect my spark program to read data from Cassandra. Here is how I have configured spark:
val configBuilder = SparkSession.builder
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.config("spark.cassandra.connection.host", cassandraUrl)
.config("spark.cassandra.connection.port", 9042)
.config("spark.sql.catalog.myCatalogName", "com.datastax.spark.connector.datasource.CassandraCatalog")
According to the documentation, once this is done I should be able to query Cassandra like this:
spark.sql("select * from myCatalogName.myKeyspace.myTable where myPartitionKey = something")
However, when I do so, I get the following error message:
mismatched input '.' expecting <EOF>(line 1, pos 43)
== SQL ==
select * from myCatalog.myKeyspace.myTable where myPartitionKey = something
----------------------------------^^^
When I try in the following format I am successful at retrieving entries from Cassandra:
val frame = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "myKeyspace", "table" -> "myTable"))
.load()
.filter(col("timestamp") > startDate && col("timestamp") < endDate)
However, this query requires a full table scan. The table contains a few million entries, and I would prefer to take advantage of the predicate pushdown functionality, which, it would seem, is only available via the SQL API.
I am using spark-core_2.11:2.4.3, spark-cassandra-connector_2.11:2.5.0 and Cassandra 3.11.6
Thanks!
The Catalogs API is available only in SCC version 3.0, which is not released yet. It will be released together with Spark 3.0, so it isn't available in SCC 2.5.0. For 2.5.0 you need to register your table explicitly, with create or replace temporary view..., as described in the docs:
spark.sql("""CREATE TEMPORARY VIEW myTable
USING org.apache.spark.sql.cassandra
OPTIONS (
table "myTable",
keyspace "myKeyspace",
pushdown "true")""")
Regarding pushdowns (they work the same for all DataFrame APIs: SQL, Scala, Python, ...): such filtering will happen when your timestamp is the first clustering column. Even in that case, a typical problem is that you may specify startDate and endDate as strings, not timestamps. You can check by executing frame.explain and verifying that the predicate is pushed down; it should have a * marker near the predicate name.
For example,
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp) AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)")
val not_filtered = data.filter("ts >= '2019-03-10T14:41:34.373+0000' AND ts <= '2019-03-10T19:01:56.316+0000'")
The first filter expression will push the predicate down, while the second (not_filtered) will require a full scan.
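To make the difference concrete, here is a minimal sketch of building the pushdown-friendly form of the filter string from datetime values, so that the bounds are always wrapped in cast(... as timestamp) rather than left as bare string literals. The function name timestamp_range_filter is hypothetical, and it is an assumption that the ISO-8601 text it emits (with a +00:00 offset) is accepted by the cast in your Spark version.

```python
from datetime import datetime, timezone


def timestamp_range_filter(column, start, end):
    """Build "col >= cast(...) AND col <= cast(...)" with timestamp-typed bounds.

    Hypothetical helper; the returned string is meant for DataFrame.filter.
    """
    def as_timestamp(value):
        # Millisecond precision, e.g. 2019-03-10T14:41:34.373+00:00
        text = value.isoformat(timespec="milliseconds")
        return "cast('" + text + "' as timestamp)"

    return f"{column} >= {as_timestamp(start)} AND {column} <= {as_timestamp(end)}"


start = datetime(2019, 3, 10, 14, 41, 34, 373000, tzinfo=timezone.utc)
end = datetime(2019, 3, 10, 19, 1, 56, 316000, tzinfo=timezone.utc)
print(timestamp_range_filter("ts", start, end))
```

The output can be passed to data.filter(...) in place of the hand-written expression. Checking the plan with explain, as described above, remains the way to confirm that the predicate is actually pushed down.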

How to cast a decimal value into String in spark while reading data from Greenplum?

I am trying to read an RDBMS table on Greenplum database using spark. I have the following columns:
val allColumnsSeq: Seq[String] = Seq("usd_exchange_rate", "usd_exchange_rate::character varying as usd_exchange_rate_text")
I am trying to read above columns in spark as:
val yearDF = spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider").option("url", connectionUrl)
.option("dbtable", "x_lines")
.option("dbschema","copydb")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","id")
.load()
.where("year=2017 and month=12")
.select(allColumnsSeq map col:_*)
.withColumn(flagCol, lit(0))
Certain columns in Greenplum are of datatype decimal and contain many digits of precision.
In the above table, that column is:
usd_exchange_rate
It contains nearly 45 digits of precision. In our project, we keep the original column (usd_exchange_rate) and create a new column from it in character datatype, with _text appended to its name. In this case:
decimal datatype: usd_exchange_rate, and the same column in char datatype: usd_exchange_rate_text
When I execute the above line, I get the exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`usd_exchange_rate::character varying as usd_exchange_rate_text`'
I see that I am casting it in the wrong format, but I don't understand how I can read the same column in both decimal and text format in one step.
Could anyone let me know if there is a way to achieve this in Spark?
I'm not sure about the error, but did you try defining a custom schema for the cast? Assuming that you already know your schema, define your own custom schema with StructType.
import org.apache.spark.sql.types._
val customSchema = StructType(Seq(
StructField("usd_exchange_rate",StringType,true),
StructField("aud_exchange_rate",StringType,true),
.
.
.
StructField("<some field>",<data type>,<Boolean for nullable>)
))
val yearDF = spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider").option("url", connectionUrl)
.option("dbtable", "x_lines")
.option("dbschema","copydb")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","id")
.schema(customSchema)
.load()
.where("year=2017 and month=12")
.select(allColumnsSeq map col:_*)
.withColumn(flagCol, lit(0))
I didn't test this in an IDE, but it should work.

spark Dataframe string to Hive varchar

I read data from Oracle via a Spark JDBC connection into a DataFrame. I have a column which is obviously StringType in the DataFrame.
Now I want to persist this in Hive, but with datatype Varchar(5). I know the string would be truncated, but that is OK.
I tried using UDFs, which didn't work since a DataFrame does not have varchar or char types. I also created a temporary view in Hive using:
df.createOrReplaceTempView("t_name") // returns Unit, so there is nothing to assign
val casted = spark.sql("select cast(col_name as varchar(5)) from t_name")
But when I call printSchema, I still see a string type.
How can I save it as a varchar column in a Hive table?
Try creating the Hive table ("dbName.tableName") with the required schema (varchar(5) in this case) and inserting into it directly from the DataFrame, as below (PySpark syntax):
df.write.insertInto("dbName.tableName", overwrite=False)

create hive external table with schema in spark

I am using Spark 1.6 and I aim to create an external Hive table, as I would in a Hive script. To do this, I first read in the partitioned Avro file and get its schema. Now I am stuck: I have no idea how to apply this schema when creating the table. I use Scala. Any help is appreciated.
Finally, I did it myself the old-fashioned way, with the code below:
val rawSchema = sqlContext.read.avro("Path").schema
val schemaString = rawSchema.fields.map(field => field.name.replaceAll("""^_""", "").concat(" ").concat(field.dataType.typeName match {
case "integer" => "int"
case smt => smt
})).mkString(",\n")
val ddl =
s"""
|Create external table $tablename ($schemaString) \n
|partitioned by (y int, m int, d int, hh int, mm int) \n
|Stored As Avro \n
|-- inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \n
| -- outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' \n
| Location 'hdfs://$path'
""".stripMargin
Take care: no column name can start with _, and Hive can't parse integer. This approach is not flexible, but it works. If anyone has a better idea, please comment.
I didn't see a way to automatically infer the schema for external tables, so I created a case for the string type. You could add cases for your data types, but I'm not sure how many columns you have. I apologize, as this might not be a clean approach.
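The two rules used when generating the column list above (strip a leading underscore from the name, and map the Spark type name integer to Hive's int) can be illustrated in isolation. This is a minimal sketch in plain Python rather than the Scala of the answer. The names HIVE_TYPE_OVERRIDES and hive_columns are hypothetical, and the mapping holds only the one type the answer mentions. Other types would need their own entries.

```python
import re

# Spark typeName -> Hive type. Only "integer" is taken from the answer above;
# anything else is passed through unchanged, which may or may not suit Hive.
HIVE_TYPE_OVERRIDES = {"integer": "int"}


def hive_columns(fields):
    """Turn (name, spark_type_name) pairs into the column list of a Hive DDL."""
    lines = []
    for name, type_name in fields:
        clean_name = re.sub(r"^_", "", name)  # drop a single leading underscore
        hive_type = HIVE_TYPE_OVERRIDES.get(type_name, type_name)
        lines.append(f"{clean_name} {hive_type}")
    return ",\n".join(lines)


print(hive_columns([("_id", "integer"), ("name", "string")]))
```

Keeping the type mapping in a dictionary means a new type is one added entry rather than another match case.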
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SaveMode};
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.read.format("com.databricks.spark.avro").load("people.avro")
val schema = results.schema.map( x => x.name.concat(" ").concat( x.dataType.toString() match { case "StringType" => "STRING"} ) ).mkString(",")
val hive_sql = "CREATE EXTERNAL TABLE people_and_age (" + schema + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/ravi/people_age'"
hiveContext.sql(hive_sql)
results.saveAsTable("people_age",SaveMode.Overwrite)
hiveContext.sql("select * from people_age").show()
Try the below code.
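import org.apache.spark.sql.hive.HiveContext
val htctx = new HiveContext(sc)
// Template only: replace every <placeholder> with your own values before running
htctx.sql("create external table <tablename> (<schema>) partitioned by (<attribute>) row format serde '<serde>' location '<path>'")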
