How to run LOAD DATA INPATH hive command with wildcard from spark? - apache-spark

I am creating a dataframe as below:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
empdf = spark.read.csv("/home/hdfs/sparkwork/hiveproj/Datasets/empinfo/emp.csv", schema=schema)
empdf.show()
I am saving the dataframe as a parquet file.
empdf.write.parquet(path="/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/")
If I use a specific file name in the LOAD DATA INPATH command, it works fine.
spark.sql("LOAD DATA INPATH '/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/part-00000-6cdfcba5-49ab-499c-8d7f-831c9ec314de-c000.snappy.parquet' INTO TABLE EMPINFO.EMPLOYEE")
But if I use a wildcard instead of the file name (* or *.parquet), it gives me an error.
spark.sql("LOAD DATA INPATH '/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/*.parquet' INTO TABLE EMPINFO.EMPLOYEE")
Is there a way to push all the contents of a folder using a wildcard in a Hive command from Spark?
Please help with the same.

Instead of
spark.sql("LOAD DATA INPATH '/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/*.parquet' INTO TABLE EMPINFO.EMPLOYEE")
try using this:
empdf.write.partitionBy("year","month","day").insertInto("EMPINFO.EMPLOYEE")
Note that I have used year, month, and day as partition columns. You may need to change them as per your requirements.
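A minimal PySpark sketch of this approach for the employee table from the question might look like the lines below; EMPINFO.EMPLOYEE is assumed to already exist with a matching column layout, and the dynamic-partition settings are only needed if the table is actually partitioned.
# Hedged sketch: write the dataframe straight into the Hive table instead of LOAD DATA INPATH.
spark.conf.set("hive.exec.dynamic.partition", "true")            # only needed for partitioned tables
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
empdf.write.mode("append").insertInto("EMPINFO.EMPLOYEE")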

Related

Delta Lake Create Table with structure like another

I have a bronze-level Delta Lake table (events_bronze) at location "/mnt/events-bronze" to which data is streamed from Kafka. Now I want to be able to stream from this table and update a silver table (events_silver) using "foreachBatch". This can be achieved by using the bronze table as a source. However, during the initial run, since events_silver doesn't exist, I keep getting an error saying the Delta table doesn't exist, which is obvious. So how do I go about creating events_silver with the same structure as events_bronze? I couldn't find a DDL to do that.
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {
  DeltaTable.forPath(spark, "/mnt/events-silver").as("silver")
    .merge(
      microBatchOutputDF.as("bronze"),
      "silver.id=bronze.id")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}

events_bronze
  .writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .format("delta")
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .start()
During initial run, the problem is that there is no delta lake table defined for path "/mnt/events-silver". I'm not sure how to create it having the same structure as "/mnt/events-bronze" for the first run.
Before starting the stream write/merge, check whether the table already exists. If not, create one using an empty dataframe and the schema of events_bronze:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{Row, SaveMode}

val exists = DeltaTable.isDeltaTable("/mnt/events-silver")
if (!exists) {
  val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], <schema of events_bronze>)
  emptyDF
    .write
    .format("delta")
    .mode(SaveMode.Overwrite)
    .save("/mnt/events-silver")
}
The table (Delta Lake metadata) will get created only once, at the start, and only if it doesn't already exist. In case of a job restart it will already be present, and the table creation will be skipped.
As of release 1.0.0 of Delta Lake, the method DeltaTable.createIfNotExists() was added (Evolving API).
In your example DeltaTable.forPath(spark, "/mnt/events-silver") can be replaced with:
DeltaTable.createIfNotExists(spark)
  .location("/mnt/events-silver")
  .addColumns(microBatchOutputDF.schema)
  .execute()
You have to be careful not to supply an .option("checkpointLocation", "/mnt/events-silver/_checkpoint") where the checkpointLocation is a subdirectory within your DeltaTable's location. This will cause the _checkpoint directory to be created before the DeltaTable and an exception will be thrown when trying to create the DeltaTable.
Here's a pyspark example:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from delta.tables import DeltaTable

basePath = 'abfss://stage2@your_storage_account_name.dfs.core.windows.net'
schema = StructType([StructField('SignalType', StringType()), StructField('StartTime', TimestampType())])

if not DeltaTable.isDeltaTable(spark, basePath + '/tutorial_01/test1'):
    emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
    emptyDF.write.format('delta').mode('overwrite').save(basePath + '/tutorial_01/test1')
And here's an updated pyspark example, using the newer createIfNotExists:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from delta.tables import DeltaTable

schema = StructType([StructField('SignalType', StringType()), StructField('StartTime', TimestampType())])
(DeltaTable.createIfNotExists(spark)
    .location('abfss://stage2@your_storage_account_name.dfs.core.windows.net/tutorial_01/test1')
    .addColumns(schema)
    .execute())
You can check the table using Spark SQL. First run the statement below, which will give the table definition of the bronze table:
spark.sql("show create table event_bronze").show
After getting the DDL, just change the location to the silver table's path and run that statement in Spark SQL.
Note: Use "create table if not exists..." as it will not fail in concurrent runs.

Trouble when writing the data to Delta Lake in Azure databricks (Incompatible format detected)

I need to read a dataset into a DataFrame and then write the data to Delta Lake. But I get the following exception:
AnalysisException: 'Incompatible format detected.\n\nYou are trying to write to `dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/` using Databricks Delta, but there is no\ntransaction log present. Check the upstream job to make sure that it is writing\nusing format("delta") and that you are trying to write to the table base path.\n\nTo disable this check, SET spark.databricks.delta.formatCheck.enabled=false\nTo learn more about Delta, see https://docs.azuredatabricks.net/delta/index.html\n;
Here is the code preceding the exception :
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType
inputSchema = StructType([
    StructField("InvoiceNo", IntegerType(), True),
    StructField("StockCode", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("InvoiceDate", StringType(), True),
    StructField("UnitPrice", DoubleType(), True),
    StructField("CustomerID", IntegerType(), True),
    StructField("Country", StringType(), True)
])

rawDataDF = (spark.read
    .option("header", "true")
    .schema(inputSchema)
    .csv(inputPath)
)
# write to Delta Lake
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(DataPath)
This error message is telling you that there is already data at the destination path (in this case dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/), and that that data is not in the Delta format (i.e. there is no transaction log). You can either choose a new path (which based on the comments above, it seems like you did) or delete that directory and try again.
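If you go with the second option, a hedged sketch on Databricks might look like this; dbutils is assumed to be available in the notebook, and the path below is a placeholder for the destination from the error message.
# Hedged sketch of the "delete that directory and try again" option.
delta_path = "dbfs:/user/<your-user>/delta/customer-data/"   # placeholder path
dbutils.fs.rm(delta_path, True)                              # remove the old non-Delta files
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(delta_path)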
I found this Question with this search: "You are trying to write to *** using Databricks Delta, but there is no transaction log present."
In case someone searches for the same:
For me the solution was to explicitly code
.write.format("parquet")
because
.format("delta")
is the default on Databricks Runtime 8.0 and above, and I need "parquet" for legacy reasons.
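Applied to the write from the question, that workaround would look roughly like the line below; only the format string changes, and the path and partitioning follow the original code.
# Hedged sketch: force parquet explicitly instead of relying on the Runtime 8.0+ delta default.
rawDataDF.write.mode("overwrite").format("parquet").partitionBy("Country").save(DataPath)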
You can also get this error if you try to read the data in a format that is not supported by spark.read (or if you do not specify the format).
The file format should be specified as one of the supported formats: csv, txt, json, parquet, or avro.
dataframe = spark.read.format('csv').load(path)

Specify pyspark dataframe schema with string longer than 256

I'm reading a source that has descriptions longer than 256 chars. I want to write them to Redshift.
According to: https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns it is only possible in Scala.
According to this: https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691
there should be a workaround: specify the schema when creating the dataframe. I'm not able to get it to work.
How can I specify the schema with varchar(max)?
df = ...from source
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])
df = sqlContext.createDataFrame(df.rdd, schema)
Redshift maxlength annotations are passed in the format
{"maxlength":2048}
so this is the structure you should pass to the StructField constructor:
from pyspark.sql.types import StructField, StringType
StructField("description", StringType(), metadata={"maxlength":2048})
or use the alias method:
from pyspark.sql.functions import col
col("description").alias("description", metadata={"maxlength":2048})
If you use PySpark 2.2 or earlier please check How to change column metadata in pyspark? for workaround.
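Putting the two pieces together, a minimal sketch for the dataframe from the question could look like the following; the 65535 value is an assumption based on Redshift's VARCHAR size limit, and df is assumed to come from the original source.
from pyspark.sql.types import StructType, StructField, StringType

# Hedged sketch: attach the maxlength metadata in the schema, rebuild the
# dataframe from its RDD, and check that the metadata survived.
schema = StructType([
    StructField("field1", StringType(), True),
    StructField("description", StringType(), True, metadata={"maxlength": 65535}),
])
df = sqlContext.createDataFrame(df.rdd, schema)
print(df.schema["description"].metadata)   # {'maxlength': 65535}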

withColumn in spark dataframe inserts NULL in SaveMode.Append

I have a Spark application that creates a Hive external table, which works fine the first time, that is, while creating the table in Hive with partitions. I have three partition columns, namely event, centerCode, and ExamDate.
var sqlContext = spark.sqlContext
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
import org.apache.spark.sql.functions._
val candidateList = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").option("nullValue", "null")
  .option("quote", "\"").option("dateFormat", "dd/MM/yyyy")
  .schema(StructType(Array(
    StructField("RollNo/SeatNo", StringType, true), StructField("LabName", StringType, true),
    StructField("Student_Name", StringType, true), StructField("ExamName", StringType, true),
    StructField("ExamDate", DateType, true), StructField("ExamTime", StringType, true),
    StructField("CenterCode", StringType, true), StructField("Center", StringType, true))))
  .option("multiLine", "true").option("mode", "DROPMALFORMED").load(filePath(0))
val nef = candidateList.withColumn("event", lit(eventsId))
The partition column event is not present in the input csv file, so I'm adding that column to the dataframe candidateList using withColumn("event", lit(eventsId)).
While I'm writing it to the Hive table, it works fine: the column added with withColumn is written to the table with the event, say "ABCD", and the partitions are created as expected.
nef.repartition(1).write.mode(SaveMode.Overwrite).option("path", candidatePath).partitionBy("event", "CenterCode", "ExamDate").saveAsTable("sify_cvs_output.candidatelist")
candidateList.show() gives:
+-------------+--------------------+-------------------+----------+----------+--------+----------+--------------------+-----+
|RollNo/SeatNo| LabName| Student_Name| ExamName| ExamDate|ExamTime|CenterCode| Center|event|
+-------------+--------------------+-------------------+----------+----------+--------+----------+--------------------+-----+
| 80000077|BUILDING-MAIN FLO...| ABBAS MOHAMMAD|PGECETICET|2018-07-30|10:00 AM| 500098A|500098A-SURYA TEC...| ABCD|
| 80000056|BUILDING-MAIN FLO...| ABDUL YASARARFATH|PGECETICET|2018-07-30|10:00 AM| 500098A|500098A-SURYA TEC...| ABCD|
But the second time, when I try to append data to the already-created Hive table with a new event "EFGH", the column added using withColumn is inserted as NULL.
nef.write.mode(SaveMode.Append).insertInto("sify_cvs_output.candidatelist")
The partitions also haven't come out properly, as one of the partition columns becomes NULL. So I tried adding one more new column to the dataframe, .withColumn("sample", lit("sample")). Again, the first time it writes all the extra added columns to the table, but the next time, on SaveMode.Append, both the event column and the sample column are inserted into the table as NULL.
The output of show create table is below:
CREATE EXTERNAL TABLE `candidatelist`(
`rollno/seatno` string,
`labname` string,
`student_name` string,
`examname` string,
`examtime` string,
`center` string,
`sample` string)
PARTITIONED BY (
`event` string,
`centercode` string,
`examdate` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='hdfs://172.16.2.191:8020/biometric/sify/cvs/output/candidate/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://172.16.2.191:8020/biometric/sify/cvs/output/candidate'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='parquet',
'spark.sql.sources.schema.numPartCols'='3',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"RollNo/SeatNo\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"LabName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Student_Name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamTime\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Center\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sample\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"event\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CenterCode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamDate\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}',
'spark.sql.sources.schema.partCol.0'='event',
'spark.sql.sources.schema.partCol.1'='CenterCode',
'spark.sql.sources.schema.partCol.2'='ExamDate',
'transient_lastDdlTime'='1536040545')
What am I doing wrong here?
UPDATE
@pasha701, below is my SparkSession:
val Spark = SparkSession.builder().appName("splitInput").master("local")
  .config("spark.hadoop.fs.defaultFS", "hdfs://" + hdfsIp)
  .config("hive.metastore.uris", "thrift://172.16.2.191:9083")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()
and if I add partitionBy to insertInto
nef.write.mode(SaveMode.Append).partitionBy("event", "CenterCode", "ExamDate").option("path", candidatePath).insertInto("sify_cvs_output.candidatelist")
it throws an exception: org.apache.spark.sql.AnalysisException: insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().;
Second time "partitionBy" also have to be used. Also maybe option "hive.exec.dynamic.partition.mode" will be required.

.csv not a SequenceFile Failed with exception java.io.IOException:java.io.IOException

While creating an external table with partitions in Hive using Spark in csv format (com.databricks.spark.csv), it works fine, but I'm not able to open the table created in Hive, which is in .csv format, from the Hive shell.
ERROR
hive> select * from output.candidatelist;
Failed with exception java.io.IOException:java.io.IOException: hdfs://10.19.2.190:8020/biometric/event=ABCD/LabName=500098A/part-00000-de39bb3d-0548-4db6-b8b7-bb57739327b4.c000.csv not a SequenceFile
Code:
val sparkDf = spark.read.format("com.databricks.spark.csv")
  .option("header", "true").option("nullValue", "null")
  .schema(StructType(Array(
    StructField("RollNo/SeatNo", StringType, true), StructField("LabName", StringType, true))))
  .option("multiLine", "true").option("mode", "DROPMALFORMED")
  .load("hdfs://10.19.2.190:8020/biometric/SheduleData_3007_2018.csv")

sparkDf.write.mode(SaveMode.Overwrite)
  .option("path", "hdfs://10.19.2.190:8020/biometric/event=ABCD/")
  .partitionBy("LabName").format("com.databricks.spark.csv")
  .saveAsTable("output.candidateList")
How do I access the table in the Hive shell when the format of the table is csv?
SHOW CREATE TABLE candidatelist;
CREATE EXTERNAL TABLE `candidatelist`(
`col` array<string> COMMENT 'from deserializer')
PARTITIONED BY (
`centercode` string,
`examdate` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('path'='hdfs://10.19.2.190:8020/biometric/output')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
'hdfs://nnuat.iot.com:8020/apps/hive/warehouse/sify_cvs_output.db/candidatelist-__PLACEHOLDER__'
TBLPROPERTIES (
'spark.sql.create.version'='2.3.0.2.6.5.0-292',
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='com.databricks.spark.csv',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"RollNo/SeatNo\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"LabName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Student_Name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamTime\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Center\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CenterCode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamDate\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}',
'spark.sql.sources.schema.partCol.0'='CenterCode',
'spark.sql.sources.schema.partCol.1'='ExamDate',
'transient_lastDdlTime'='1535692379')
