Setting datatypes when writing parquet files with Spark - apache-spark

I’m reading some data from an Oracle DB and storing it in parquet files with spark-3.0.1/hadoop3.2 through the Python API.
The process works, but the datatypes aren’t maintained, and the resulting types in the parquet file are all “object” except for date fields.
This is how the datatypes look after creating the dataframe:
conn_properties = {"driver": "oracle.jdbc.driver.OracleDriver",
"url": cfg.ORACLE_URL,
"query": "select ..... from ...",
}
df = spark.read.format("jdbc") \
.options(**conn_properties) \
.load()
df.dtypes:
[('idest', 'string'),
('date', 'timestamp'),
('temp', 'decimal(18,3)'),
('hum', 'decimal(18,3)'),
('prec', 'decimal(18,3)'),
('wspeed', 'decimal(18,3)'),
('wdir', 'decimal(18,3)'),
('radiation', 'decimal(18,3)')]
They all match the original database schema.
But opening the parquet file with pandas I get:
df.dtypes:
idest object
date datetime64[ns]
temp object
hum object
prec object
wspeed object
wdir object
radiation object
dtype: object
I tried to change the datatype for one column using the customSchema option; the doc says Spark SQL types can be used:
Users can specify the corresponding data types of Spark SQL instead of using the defaults.
And FloatType is included in Spark SQL datatypes.
So I expect something like this should work:
custom_schema = "radiation FloatType"
conn_properties = {"driver": "oracle.jdbc.driver.OracleDriver",
"url": cfg.ORACLE_URL,
"query": 'select ... from .... ",
"customSchema": custom_schema
}
But I get this error:
pyspark.sql.utils.ParseException:
DataType floattype is not supported.(line 1, pos 10)
== SQL ==
radiation FloatType
I can’t find an option in the corresponding “write” method to specify the schema/types for the parquet DataFrameWriter.
Any idea how I can set the datatype mapping?
Thank you
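
For reference, the customSchema value is parsed as a DDL string, so it expects SQL type names such as FLOAT or DOUBLE rather than Spark class names like FloatType. A minimal sketch, reusing the placeholders from the question (untested against the Oracle source):
# Hedged sketch: DDL-style type name instead of the FloatType class name
custom_schema = "radiation FLOAT"
conn_properties = {"driver": "oracle.jdbc.driver.OracleDriver",
                   "url": cfg.ORACLE_URL,           # placeholder from the question
                   "query": "select ... from ...",  # placeholder from the question
                   "customSchema": custom_schema}
df = spark.read.format("jdbc") \
    .options(**conn_properties) \
    .load()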

Related

Date type is saved as long type when pyspark write data to elasticsearch

Nice to meet you.
I currently use Elasticsearch for Apache Hadoop to work with an Elasticsearch index.
(https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html)
However, I have a problem when PySpark writes data with a date type field to Elasticsearch.
Original Field:
created: timestamp (nullable = true)
However, when I save the data to Elasticsearch like below:
result.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "server") \
    .option("es.mapping.date.rich", "true") \
    .option("timestampFormat", "YYYY-MM-DD'T'hh:mm:ss.sss") \
    .option("es.mapping.id", "id") \
    .mode("append") \
    .option("es.resource", "index").save()
Fields with date type are converted to long type (Unix timestamp).
However, I want to save the data as a date type (like ISO 8601 format).
How can I save the type as it is?
Please help me.
The code I used:
# Import PySpark modules
from pyspark import SparkContext, SparkConf, SQLContext
# Spark Config
conf = SparkConf().setAppName("es_app")
conf.set("es.scroll.size", "1000")
sc = SparkContext(conf=conf)
# sqlContext
sqlContext = SQLContext(sc)
# Load data from elasticsearch
df = sqlContext.read.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "server") \
    .option("es.nodes.discovery", "true") \
    .option("es.mapping.date.rich", "false") \
    .load("index")
# Make view
df.registerTempTable("test")
result = sqlContext.sql("SELECT * from test")
# Write back to elasticsearch
result.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "server") \
    .option("es.mapping.date.rich", "true") \
    .option("timestampFormat", "YYYY-MM-DD'T'hh:mm:ss.sss") \
    .option("es.mapping.id", "id") \
    .mode("append") \
    .option("es.resource", "index").save()
How can I fix the problem?
Please define a mapping for your date field and use the Elasticsearch date field, which supports multiple date formats. Note that date fields in Elasticsearch are internally stored as long values anyway.
Ref: date datatype in elasticsearch
Define the date field in the mapping with various formats.
Also please read this note about how date fields are internally stored and displayed:
Internally, dates are converted to UTC (if the time-zone is specified)
and stored as a long number representing milliseconds-since-the-epoch.
Dates will always be rendered as strings, even if they were initially
supplied as a long in the JSON document.
Example
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
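As a complementary, hypothetical sketch on the Spark side (reusing the created column and the es.* options from the question, and assuming created is a timestamp as in the schema shown above), the timestamp can also be rendered as an ISO 8601 string before writing, so that it matches a string-based date format in the mapping:
from pyspark.sql import functions as F

# Hedged sketch: format the timestamp column as an ISO 8601 string before the write
result_iso = result.withColumn(
    "created", F.date_format("created", "yyyy-MM-dd'T'HH:mm:ss.SSS"))

result_iso.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "server") \
    .option("es.mapping.id", "id") \
    .mode("append") \
    .option("es.resource", "index").save()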

Spark reading parquet compressed data

I have nested JSON converted to Parquet (snappy) without any flattening. The structure, for example, has the following:
{"a":{"b":{"c":"abcd","d":[1,2,3]},"e":["asdf","pqrs"]}}
df = spark.read.parquet('<File on AWS S3>')
df.createOrReplaceTempView("test")
query = """select a.b.c from test"""
df = spark.sql(query)
df.show()
When the query is executed, does Spark read only the lowest-level attribute column referenced in the query, or does it read the whole top-level attribute that contains the referenced attribute in its hierarchy?
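One way to check for a specific query is to look at the ReadSchema entry in the physical plan, which shows the fields the parquet scan actually requests. A small sketch (the path is a placeholder):
# Hedged sketch: explain() prints the physical plan; the parquet scan's ReadSchema
# lists the (possibly nested) fields Spark asks the reader for.
df = spark.read.parquet('<File on AWS S3>')
df.createOrReplaceTempView("test")
spark.sql("select a.b.c from test").explain()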

SPARK read.jdbc & custom schema

With spark.read.format ... one can add the custom schema non-programmatically, like so:
val df = sqlContext
  .read()
  .format("jdbc")
  .option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
  .option("user", "root")
  .option("password", "password")
  .option("dbtable", sql)
  .schema(customSchema)
  .load();
However, using spark.read.jdbc, I cannot seem to do the same or find the syntax to do the same as above. What am I missing, or has this changed in Spark 2.x? I read this in the manual: "... Spark automatically reads the schema from the database table and maps its types back to Spark SQL types. ..." Presumably what I am trying to do is no longer possible as in the above example.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
I ended up trying this:
val dataframe_mysql = spark.read.schema(openPositionsSchema).jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
and got this:
org.apache.spark.sql.AnalysisException: User specified schema not supported with `jdbc`;
Seems a retrograde step in a certain way.
I do not agree with the answer.
You can supply a custom schema using your method or by setting properties:
connectionProperties.put("customSchema", schemachanges);
where the schema changes are given in the format "fieldName newDataType, ...", e.g.:
"key String, value DECIMAL(20, 0)"
If key was a number in the original table, it will generate an SQL query like "key::character varying, value::numeric(20, 0)".
It is better than a cast, because a cast is a mapping operation executed after the column is selected in its original type; a custom schema is not.
I had a case where Spark could not select NaN from a Postgres numeric column, because it maps numerics to Java BigDecimal, which does not allow NaN, so the Spark job failed every time it read those values. A cast produced the same result. However, after changing the schema to either String or Double, it was able to read the values properly.
Spark documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You can use a custom schema and put it in the properties parameters. You can read more at https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Create a variable:
c_schema = 'id_type INT'
Properties conf:
config = {"user":"xxx",
"password": "yyy",
"driver":"com.mysql.jdbc.Driver",
"customSchema":c_schema}
Read the table and create the DF:
df = spark.read.jdbc(url=jdbc_url,table='table_name',properties=config)
You must use the same column names, and it will change only the columns you put inside the custom schema.
What am I missing or has this changed in SPARK 2.x?
You aren't missing anything. Modifying the schema on read with JDBC sources was never supported. The input is already typed, so there is no place for a schema.
If the types are not satisfying, just cast the results to the desired types.
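For example, a minimal PySpark sketch of casting after the read (reusing names from this question; the target type is only an illustration):
from pyspark.sql import functions as F

# Hedged sketch: read with the inferred types, then cast to the desired ones
df = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e", properties=connectionProperties)
df_casted = df.withColumn("v", F.col("v").cast("double"))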

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temp table like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like the min_id and max_id of a column id:
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a data frame but a str (string):
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI I am using Spark 1.6 and don't have access to Databricks spark-csv package.
Here you go, you'll just need to concat your data with concat_ws and write it as text:
query = """select concat_ws(',', date, nvl(min(id), 0), nvl(max(id), 0))
from mytempTable"""
sqlContext.sql(query).write("text").mode("append").save("/tmp/fooo")
Or an even better alternative:
from pyspark.sql import functions as f
(sqlContext
    .table("myTempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append").save("/tmp/fooo"))

Auto Cast parquet to Hive

I have a scenario where Spark infers the schema from the input file and writes parquet files with integer data types.
But we have tables in Hive where the fields are defined as BigInt. Right now there is no conversion from int to long, and Hive throws errors that it cannot cast Integer to Long. I cannot change the Hive DDL to integer data types, as it is a business requirement to have those fields as Long.
I have looked up the option of casting the data types before saving. This can be done, except that I have hundreds of columns and explicit casts make the code very messy.
Is there a way to tell Spark to auto cast data types?
Since Spark version 1.4 you can apply the cast method with a DataType on the column.
Suppose the dataframe df has a column year: Long.
import org.apache.spark.sql.types.IntegerType

val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")
If you are using sql expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
  "make",
  "model",
  "comment",
  "blank")
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
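
For hundreds of columns, a hedged PySpark sketch of doing the cast programmatically (walking the schema instead of listing each column; the names here are illustrative):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, LongType

# Hedged sketch: cast every integer column to long so it matches a Hive BIGINT schema
casted = df.select([
    F.col(f.name).cast(LongType()) if isinstance(f.dataType, IntegerType)
    else F.col(f.name)
    for f in df.schema.fields
])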