Spark NumberFormatException when using spark.read.jdbc to read from Hive over the JDBC protocol - apache-spark

The problem occurs when using spark.read.jdbc():
spark = SparkSession.builder \
    .[some options...] \
    .config("spark.sql.warehouse.dir", "hdfs://ip:port/path") \
    .config("hive.metastore.uris", "thrift://ip:port") \
    .enableHiveSupport() \
    .getOrCreate()

spark.read.jdbc(url="jdbc:hive2://ip:port", table="db.table", properties={...})
java.sql.SQLException: Cannot convert column 1 to integer
java.lang.NumberFormatException: For input string: "c1"
"c1" is the 1st column of table with simply values (1, 2, 3, 4)
However, if I read data with
spark.read.table("db.table")
↑ it does successfully work
Is there any suggestions that could make me resolve this problem?
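A common explanation for this symptom, offered here as an assumption rather than a confirmed diagnosis: Spark has no built-in Hive JDBC dialect, so its default dialect quotes column names with double quotes when it builds the query, and HiveServer2 treats "c1" as a string literal, meaning every fetched row carries the text "c1" and the cast to integer fails. A quick way to check from PySpark is to read the column as a string via the customSchema option of the JDBC source (available since Spark 2.3):

# Diagnostic sketch: if the quoting theory is right, this prints the literal
# text "c1" on every row instead of 1, 2, 3, 4.
probe = (spark.read.format("jdbc")
    .option("url", "jdbc:hive2://ip:port")
    .option("dbtable", "db.table")
    .option("customSchema", "c1 STRING")  # read c1 as text instead of int
    .load())
probe.select("c1").show(5)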

Related

ParseException Error using Regex in pyspark 2.4

I am trying to get only those rows where colADD contains a non-alphanumeric character.
Code:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Test") \
    .getOrCreate()

data = spark.read.csv("Customers")
data.registerTempTable("data")
spark.sql("SELECT colADD from data WHERE colADD REGEXP '^[A-Za-z0-9]+$'; ")
Error:
pyspark.sql.utils.ParseException: u"\nextraneous input ';'
expecting <EOF>(line 1, pos 56)\n\n== SQL ==\nSELECT CNME from data WHERE CNME REGEXP '^[A-Za-z0-9]+$';
Please help, am I missing something?
With Spark (in Scala) I used this:
spark.sql("SELECT col2 from test WHERE col2 REGEXP '^[A-Za-z0-9]*\\-' ").show
I have not used PySpark, but how about just removing the ';'? It does not seem to be needed.
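Untested from here, but the same statement in PySpark with the trailing ';' removed would look like the sketch below; and if the actual goal is rows that do contain a non-alphanumeric character, the pattern likely needs to be negated (an assumption about the intent):

# Original query, minus the trailing ';' inside the SQL string:
spark.sql("SELECT colADD FROM data WHERE colADD REGEXP '^[A-Za-z0-9]+$'").show()
# If rows WITH a non-alphanumeric character are wanted (assumed intent), negate the class:
spark.sql("SELECT colADD FROM data WHERE colADD REGEXP '[^A-Za-z0-9]'").show()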

Hive - Create table LIKE doesn't work in spark-sql

In my PySpark job I'm trying to create a temp table using a LIKE clause, as below.
CREATE EXTERNAL TABLE IF NOT EXISTS stg.new_table_name LIKE stg.exiting_table_name LOCATION s3://s3-bucket/warehouse/stg/existing_table_name
My job fails as below:
mismatched input 'LIKE' expecting (line 1, pos 56)

== SQL ==
CREATE EXTERNAL TABLE IF NOT EXISTS stg.new_table_name LIKE stg.exiting_table_name LOCATION s3://s3-bucket/warehouse/stg/existing_table_name
Doesn't Spark support the LIKE clause to create a new table using the metadata of an existing table?
My SparkSession config:
self.session = SparkSession \
    .builder \
    .appName(self.app_name) \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .config("hive.load.dynamic.partitions.thread", "10") \
    .config("hive.mv.files.thread", "30") \
    .config("fs.trash.interval", "0") \
    .enableHiveSupport()
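One direction worth trying, as a sketch rather than a confirmed fix: Spark's SQL parser does have a CREATE TABLE ... LIKE statement, but it does not take the EXTERNAL keyword in that position, and the LOCATION path needs to be quoted. Whether LOCATION is accepted together with LIKE depends on the Spark version, so treat the following (table names and path copied from the question) as something to verify:

spark = self.session.getOrCreate()
# Sketch: CREATE TABLE ... LIKE without EXTERNAL; drop the LOCATION clause if the
# parser in your Spark version rejects it alongside LIKE.
spark.sql("""
    CREATE TABLE IF NOT EXISTS stg.new_table_name
    LIKE stg.exiting_table_name
    LOCATION 's3://s3-bucket/warehouse/stg/existing_table_name'
""")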

Spark 2.3.1 AWS EMR not returning data for some columns yet works in Athena/Presto and Spectrum

I am using PySpark on Spark 2.3.1 on AWS EMR (Python 2.7.14)
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL data source example") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.debug.maxToStringFields", 100) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql('select `message.country` from datalake.leads_notification where `message.country` is not null').show(10)
This returns no data: 0 rows found.
Every value of this column, for each row in the above table, is returned as null.
Data is stored in PARQUET.
When I run the same SQL query on AWS Athena/Presto or on AWS Redshift Spectrum, all column data is returned correctly (most column values are not null).
This is the Athena SQL and Redshift SQL query that returns correct data:
select "message.country" from datalake.leads_notification where "message.country" is not null limit 10;
I use AWS Glue catalog in all cases.
The column above is NOT a partition column, but the table is partitioned on other columns. I tried to repair the table; it did not help:
i.e. MSCK REPAIR TABLE datalake.leads_notification
I tried schema merge = true like so:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL data source example") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("spark.sql.parquet.mergeSchema", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.debug.maxToStringFields", 200) \
    .enableHiveSupport() \
    .getOrCreate()
No difference: every value of that one column is still null, even though some values are not null.
This column was added as the last column to the table, so most of its data is indeed null, but some rows are not null. The column is listed last in the column list in the catalog, sitting just above the partition columns.
Nevertheless, Athena/Presto and Redshift Spectrum retrieve all non-null values correctly, but EMR Spark 2.3.1 PySpark shows all values for this column as "null". All other columns in Spark are retrieved correctly.
Can anyone help me to debug this problem please?
The Hive schema is hard to cut and paste here due to the output format.
CREATE TABLE datalake.leads_notification(
message.environment.siteorigin string,
dcpheader.dcploaddateutc string,
message.id int,
message.country string,
message.financepackage.id string,
message.financepackage.version string)
PARTITIONED BY (
partition_year_utc string,
partition_month_utc string,
partition_day_utc string,
job_run_guid string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://blahblah/leads_notification/leads_notification/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='weekly_datalake_crawler',
'averageRecordSize'='3136',
'classification'='parquet',
'compressionType'='none',
'objectCount'='2',
'recordCount'='897025',
'sizeKey'='1573529662',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='4',
'spark.sql.sources.schema.numParts'='3',
'spark.sql.sources.schema.partCol.0'='partition_year_utc',
'spark.sql.sources.schema.partCol.1'='partition_month_utc',
'spark.sql.sources.schema.partCol.2'='partition_day_utc',
'spark.sql.sources.schema.partCol.3'='job_run_guid',
'typeOfData'='file')
The last 3 columns all have the same problem in Spark:
message.country string,
message.financepackage.id string,
message.financepackage.version string
All return OK in Athena/Presto and Redshift Spectrum using the same catalog.
I apologize for my editing.
Thank you.
Do step 5, schema inspection:
http://www.openkb.info/2015/02/how-to-build-and-use-parquet-tools-to.html
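If parquet-tools is not at hand, a rough PySpark equivalent (a sketch; the path is the table LOCATION from the DDL above) is to read the Parquet files directly, bypassing the metastore, and compare the printed field names and their casing with what the Glue catalog reports:

# Reads the physical schema as written in the Parquet file footers,
# independent of the Glue/Hive metastore definition.
file_df = spark.read.parquet("s3://blahblah/leads_notification/leads_notification/")
file_df.printSchema()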
My bet is that these new column names are either upper case in the Parquet file definition (while the other column names are lower case), or lower case in the Parquet definition (while the other column names are upper case).
See Spark's issues reading Parquet files:
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL data source example") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("spark.sql.parquet.mergeSchema", "true") \
    .config("spark.sql.hive.convertMetastoreParquet", "false") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.debug.maxToStringFields", 200) \
    .enableHiveSupport() \
    .getOrCreate()
This is the solution: note the
.config("spark.sql.hive.convertMetastoreParquet", "false")
The schema columns are all lower case, and the schema was created by AWS Glue, not by my custom code, so I don't really know what caused the problem. Using the setting above is probably a safe default when schema creation is not directly under your control. This is a major trap, IMHO, so I hope this will help someone else in the future.
Thanks to tooptoop4 who pointed out the article:
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0

Spark Streaming 2.3.1 Type Casting: String to Timestamp

I am using Apache Spark Streaming 2.3.1, where I am receiving a stream containing timestamp values (e.g. 13:09:05.761237147) in the format "HH:mm:ss.xxxxxxxxx" as strings.
I need to cast this string to the timestamp data type.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession \
    .builder \
    .appName("abc") \
    .getOrCreate()

schema = StructType().add("timestamp", "string").add("object", "string").add("score", "double")

lines = spark \
    .readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv("/path/to/folder/")
Any suggestions on how to convert "timestamp" to the timestamp data type?
As per the description provided in the source code of the TimestampType and DateTimeUtils classes, they support timestamps up to microsecond precision only.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
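Given that microsecond ceiling, one possible approach (a sketch; the output column name is made up, and it assumes the events belong to the current day, since the incoming value has no date part) is to keep only the first six fractional digits and cast the assembled string to a timestamp:

from pyspark.sql import functions as F

# "13:09:05.761237147" -> "13:09:05.761237" (15 chars keeps microsecond precision),
# then prepend a date so the string can be cast to a full timestamp.
with_ts = lines.withColumn(
    "event_time",
    F.concat(
        F.current_date().cast("string"),   # assumption: events are from today
        F.lit(" "),
        F.substring("timestamp", 1, 15)
    ).cast("timestamp")
)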

PySpark How to read CSV into Dataframe, and manipulate it

I'm quite new to PySpark and am trying to use it to process a large dataset which is saved as a CSV file.
I'd like to read the CSV file into a Spark DataFrame, drop some columns, and add new columns.
How should I do that?
I am having trouble getting this data into a dataframe. This is a stripped down version of what I have so far:
from pyspark import SparkContext
from pyspark.sql import SQLContext

def make_dataframe(data_portion, schema, sql):
    fields = data_portion.split(",")
    return sql.createDataFrame([(fields[0], fields[1])], schema=schema)

if __name__ == "__main__":
    sc = SparkContext(appName="Test")
    sql = SQLContext(sc)
    ...
    big_frame = data.flatMap(lambda line: make_dataframe(line, schema, sql)) \
        .reduce(lambda a, b: a.union(b))

    big_frame.write \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:redshift://<...>") \
        .option("dbtable", "my_table_copy") \
        .option("tempdir", "s3n://path/for/temp/data") \
        .mode("append") \
        .save()

    sc.stop()
This produces an error TypeError: 'JavaPackage' object is not callable at the reduce step.
Is it possible to do this? The idea with reducing to a dataframe is to be able to write the resulting data to a database (Redshift, using the spark-redshift package).
I have also tried using unionAll() and map() with partial(), but I can't get it to work.
I am running this on Amazon's EMR, with spark-redshift_2.10:2.0.0, and Amazon's JDBC driver RedshiftJDBC41-1.1.17.1017.jar.
Update, also answering your question in the comments:
Read data from CSV to dataframe:
It seems that you are just trying to read a CSV file into a Spark DataFrame.
If so, my answer here covers this: https://stackoverflow.com/a/37640154/5088142
The following code should read a CSV file into a Spark DataFrame:
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sql = SQLContext(sc)

df = (sql.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/path/to_csv.csv"))

# These lines are equivalent in Spark 2.0, using SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

spark.read.format("csv").option("header", "true").load("/path/to_csv.csv")
spark.read.option("header", "true").csv("/path/to_csv.csv")
drop column
You can drop a column using drop(col):
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
drop(col)
Returns a new DataFrame that drops the specified column.
Parameters: col – a string name of the column to drop, or a Column to drop.
>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]
add column
You can use "withColumn"
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
withColumn(colName, col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
Parameters:
colName – string, name of the new column.
col – a Column expression for the new column.
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
Note: Spark has a lot of other functions which can be used (e.g. you can use select instead of drop).
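Putting the pieces together, a minimal end-to-end sketch (the column names unwanted_col and amount are invented for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv-example").getOrCreate()

df = spark.read.option("header", "true").csv("/path/to_csv.csv")
df = (df.drop("unwanted_col")                                          # drop a column
        .withColumn("amount_x2", F.col("amount").cast("double") * 2))  # add a new column
df.show()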
