Field delimiter of Hive table not recognized by Spark HiveContext

I have created a Hive external table stored as textfile, partitioned by event_date Date.
How do we specify the CSV format (field delimiter) of the table when reading it in Spark from Hive?
The environment is:
1. Spark 1.5.0 - CDH 5.5.1, using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
2. Hive 1.1, CDH 5.5.1
Scala script:
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
val distData_1 = distData.withColumn("event_date", current_date())
distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]
scala> distData_1.show
+---+---+---+----------+
| _1| _2| _3|event_date|
+---+---+---+----------+
|  1|  1|  1|2016-03-25|
|  2|  2|  2|2016-03-25|
|  3|  3|  3|2016-03-25|
+---+---+---+----------+
distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")
scala> sqlContext.sql("select * from part_table").show
+-----+----+----+----------+
|    a|   b|   c|event_date|
+-----+----+----+----------+
|1,1,1|null|null|2016-03-25|
|2,2,2|null|null|2016-03-25|
|3,3,3|null|null|2016-03-25|
+-----+----+----+----------+
Hive table
create external table part_table (a String, b int, c bigint)
partitioned by (event_date Date)
row format delimited fields terminated by ','
stored as textfile LOCATION "/user/hdfs/hive/part_table";
select * from part_table shows:
|part_table.a|part_table.b|part_table.c|part_table.event_date|
|1           |1           |1           |2016-03-25           |
|2           |2           |2           |2016-03-25           |
|3           |3           |3           |2016-03-25           |
Looking at HDFS, the path /user/hdfs/hive/part_table/event_date=2016-03-25 has 2 part files:
part-00000
part-00001
part-00000 content
1,1,1
part-00001 content
2,2,2
3,3,3
P.S. If we store the table as ORC, it writes and reads the data as expected.
If 'fields terminated by' is left at its default, Spark can also read the data as expected, hence I guess this is a bug.
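A possible workaround (a sketch of my own, not from the original question): instead of saveAsTable, which creates a Spark-managed table with its own storage format, insert into the pre-created Hive table so the write goes through the table's declared SerDe and comma delimiter. The casts below assume the column types from the DDL above, and the behaviour should be verified on this CDH build.
// Sketch: reuse distData_1 from the question and match the Hive column types
val toInsert = distData_1.selectExpr(
  "cast(_1 as string) as a",
  "cast(_2 as int) as b",
  "cast(_3 as bigint) as c",
  "event_date")
// insertInto resolves columns by position and writes via the table's own format
toInsert.write.mode("append").insertInto("part_table")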

Related

Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING

I have a need to generate DDL statements for Hive tables and views programmatically. I tried using Spark and Beeline for this task. Beeline takes around 5-10 seconds for each of the statements, whereas Spark completes the same thing in a few milliseconds. I am planning to use Spark since it is faster compared to Beeline. One downside of using Spark for getting DDL statements from Hive is that it treats CHAR and VARCHAR columns as STRING, and it doesn't preserve the length information that goes with the CHAR and VARCHAR data types. At the same time, Beeline preserves the data type and the length information for CHAR and VARCHAR. I am using Spark 2.4.1 and Beeline 2.1.1.
Given below are the sample create table command and its show create table output.
Beeline Output:
Spark-Shell:
I wanted to know if there is any configuration on the Spark side to preserve the data type and length information for CHAR and VARCHAR data types. If there are other ways to get the DDL from Hive quickly, I will be fine with that also.
This is in
Hive 3.1.1
Spark 3.1.1
Create a simple table in Hive in test database
hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name data_type comment
id bigint
col1 varchar(30)
col2 string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Now let's go to spark-shell
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647023374')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can see that Spark shows the columns correctly.
Now let us go and create the same table in Hive through Beeline.
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | |
| col1 | varchar(30) | |
| col2 | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| OwnerType: | USER | NULL |
| Owner: | hduser | NULL |
| CreateTime: | Fri Mar 11 18:51:00 GMT 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1647024660 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
33 rows selected (0.159 seconds)
Now check that in spark-shell again
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647024660')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
It shows OK. So, in summary, you get the column definitions in Spark exactly as you defined them in Hive.
Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline 2.1.1", refers to older versions of Spark and Hive which may have had such issues.
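As a quick cross-check (a sketch of my own, assuming the test.etc table created above), the catalog's view of the column types can also be inspected from the same spark-shell session:
// How the metastore definition looks to Spark; varchar(30) should appear as declared
spark.sql("DESCRIBE TABLE test.etc").show(false)
// The schema Spark will use when reading the table into a DataFrame
spark.table("test.etc").printSchema()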

What does select("device") do in Spark query?

In Spark documentation there is an example
df = ... # streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: DateType }
# Select the devices which have signal more than 10
df.select("device").where("signal > 10")
What does the select("device") part do?
If it is a selection by the signal field value, then why mention the device field?
Why not just write
df.where("signal > 10")
or
df.select("time").where("signal > 10")
?
select("device")
this only select the Column "device"
df.show
+----------+-------------------+
|signal | B | C | D | E | F |
+----------+---+---+---+---+---+
|10 | 4 | 1 | 0 | 3 | 1 |
|15 | 6 | 4 | 3 | 2 | 0 |
+----------+---+---+---+---+---+
df.select("device").show
+----------+
|signal |
+----------+
|10 |
|15 |
+----------+
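A minimal sketch of my own (not from the original answer) that mirrors the question's intent in Scala, assuming a batch DataFrame with the same device/signal columns as the docs example; it keeps the rows whose signal exceeds 10 and then projects only the device column:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("select-where-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data standing in for the IoT example from the docs
val df = Seq(
  ("dev-1", "sensor", 5.0),
  ("dev-2", "sensor", 12.0),
  ("dev-3", "camera", 42.0)
).toDF("device", "deviceType", "signal")

// Keep rows with signal > 10, then return only the device column
df.where($"signal" > 10).select("device").show()
// +------+
// |device|
// +------+
// | dev-2|
// | dev-3|
// +------+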

Updated dataframe column value failed to overwrite in Hive

Consider a Hive table tbl with columns aid and bid:
| aid | bid |
---------------
| | 12 |
| 24 | 13 |
| 18 | 3 |
| | 7 |
---------------
The requirement is: when aid is null or an empty string, aid should be overwritten with the value of bid:
| aid | bid |
---------------
| 12 | 12 |
| 24 | 13 |
| 18 | 3 |
| 7 | 7 |
---------------
The code is simple:
val df01 = spark.sql("select * from db.tbl")
val df02 = df01.withColumn("aid", when(col("aid").isNull || col("aid") <=> "", col("bid")) otherwise(col("aid")))
When running in spark-shell, df02.show displayed the correct data, just like the table above.
The problem is when writing the data back to Hive:
df02.write
.format("orc")
.mode("Overwrite")
.option("header", "false")
.option("orc.compress", "snappy")
.insertInto(tbl)
There is no error, but when I validate the data with
select * from db.tbl where aid is null or aid= '' limit 10;
I can still see multiple rows returned from the query with aid being null.
How do I write the data back to Hive so that the updated column values, as in the example above, are persisted?
I would try this:
import org.apache.spark.sql.SaveMode
df02.write
.format("orc")
.mode(SaveMode.Overwrite)
.option("compression", "snappy")
.insertInto(tbl)
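As an additional sketch (not part of the original answer), the same overwrite can be expressed in Spark SQL through a temporary view. The db.tbl name and the aid/bid columns come from the question; whether your Spark version allows overwriting a table that the job also reads from should be verified, and persisting df02 first (e.g. df02.cache(); df02.count()) can help in that case.
// Sketch: register the corrected rows and overwrite the Hive table via SQL
df02.createOrReplaceTempView("tbl_fixed")
spark.sql("""
  INSERT OVERWRITE TABLE db.tbl
  SELECT aid, bid FROM tbl_fixed
""")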

How to concatenate spark dataframe columns using Spark sql in databricks

I have two columns called "FirstName" and "LastName" in my DataFrame. How can I concatenate these two columns into one?
|Id |FirstName|LastName|
| 1 | A | B |
| | | |
| | | |
I want to make it like this
|Id |FullName |
| 1 | AB |
| | |
| | |
My query looks like this, but it raises an error:
val kgt=spark.sql("""
Select Id,FirstName+' '+ContactLastName AS FullName from tblAA """)
kgt.createOrReplaceTempView("NameTable")
Here we go with the Spark SQL solution:
spark.sql("select Id, CONCAT(FirstName,' ',LastName) as FullName from NameTable").show(false)
OR
spark.sql( " select Id, FirstName || ' ' ||LastName as FullName from NameTable ").show(false)
Or with the PySpark DataFrame API:
from pyspark.sql import functions as F
df = df.withColumn('FullName', F.concat(F.col('FirstName'), F.col('LastName')))
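A further sketch of my own, in Scala: concat returns null if any of its inputs is null, so concat_ws is often a safer choice when either name can be missing. The column names follow the question's DataFrame.
import org.apache.spark.sql.functions.{col, concat_ws}
// concat_ws skips null inputs instead of propagating them,
// so a missing LastName still yields the FirstName alone
val withFullName = df.withColumn("FullName", concat_ws(" ", col("FirstName"), col("LastName")))
withFullName.select("Id", "FullName").show(false)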

Pyspark : forward fill with last observation for a DataFrame

Using Spark 1.5.1,
I've been trying to forward fill null values with the last known observation for one column of my DataFrame.
It is possible to start with a null value, and in that case I would like to backward fill this null value with the first known observation. However, if that complicates the code too much, this point can be skipped.
In this post, a solution in Scala was provided for a very similar problem by zero323.
But I don't know Scala and I haven't managed to "translate" it into PySpark API code. Is it possible to do this with PySpark?
Thanks for your help.
Below is a simple sample input:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | null
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | null
| 1 | 2015-12-05 | null
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | null
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | null
| 2 | 2015-12-03 | null
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | null
| 2 | 2015-12-06 | U4
And the expected output:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | U1
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | U1
| 1 | 2015-12-05 | U1
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | U2
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | U1
| 2 | 2015-12-03 | U3
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | U3
| 2 | 2015-12-06 | U4
Another workaround is to try something like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
window = (
Window
.partitionBy('cookie_id')
.orderBy('Time')
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
final = (
    df  # the input DataFrame with cookie_ID, Time and User_ID columns
    .withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window))
)
So what this does is construct your window based on the partition key and the order column. It also tells the window to look back over all rows within the window up to the current row. Finally, at each row, you return the last value that is not null (which, remember, according to your window definition includes the current row).
The partitioned example code from Spark / Scala: forward fill with last observation, adapted to PySpark, is shown below. This only works for data that can be partitioned.
Load the data
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
rdd = sc.parallelize(values)
df = rdd.toDF(["cookie_id", "c_date", "user_id"])
df = df.withColumn("c_date", df.c_date.cast("date"))
df.show()
The DataFrame is
+---------+----------+-------+
|cookie_id| c_date|user_id|
+---------+----------+-------+
| 1|2015-12-01| null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| null|
| 1|2015-12-05| null|
| 2|2015-12-04| null|
| 2|2015-12-03| null|
| 2|2015-12-02| U3|
| 2|2015-12-05| null|
+---------+----------+-------+
Column used to sort the partitions
# get the sort key
def getKey(item):
return item.c_date
The fill function. Can be used to fill in multiple columns if necessary.
# fill function
def fill(x):
out = []
last_val = None
for v in x:
if v["user_id"] is None:
data = [v["cookie_id"], v["c_date"], last_val]
else:
data = [v["cookie_id"], v["c_date"], v["user_id"]]
last_val = v["user_id"]
out.append(data)
return out
Convert to rdd, partition, sort and fill the missing values
# Partition the data
rdd = df.rdd.groupBy(lambda x: x.cookie_id).mapValues(list)
# Sort the data by date
rdd = rdd.mapValues(lambda x: sorted(x, key=getKey))
# fill missing value and flatten
rdd = rdd.mapValues(fill).flatMapValues(lambda x: x)
# discard the key
rdd = rdd.map(lambda v: v[1])
Convert back to DataFrame
df_out = sqlContext.createDataFrame(rdd)
df_out.show()
The output is
+---+----------+----+
| _1| _2| _3|
+---+----------+----+
| 1|2015-12-01|null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| U2|
| 1|2015-12-05| U2|
| 2|2015-12-02| U3|
| 2|2015-12-03| U3|
| 2|2015-12-04| U3|
| 2|2015-12-05| U3|
+---+----------+----+
Hope you find this forward fill function useful. It is written using native PySpark functions; neither UDFs nor RDDs are used (both of them are very slow, especially UDFs!).
Let's use the example provided by Sid.
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
df = spark.createDataFrame(values, ['cookie_ID', 'Time', 'User_ID'])
Functions:
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, lit, sum, when
from pyspark.sql.types import StringType
def cum_sum(df, sum_col , order_col, cum_sum_col_nm='cum_sum'):
'''Find cumulative sum of a column.
Parameters
-----------
sum_col : String
Column to perform cumulative sum.
order_col : List
Column/columns to sort for cumulative sum.
cum_sum_col_nm : String
The name of the resulting cum_sum column.
Return
-------
df : DataFrame
Dataframe with additional "cum_sum_col_nm".
'''
df = df.withColumn('tmp', lit('tmp'))
windowval = (Window.partitionBy('tmp')
.orderBy(order_col)
.rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn(cum_sum_col_nm, sum(sum_col).over(windowval).cast(StringType()))
df = df.drop('tmp')
return df
def forward_fill(df, order_col, fill_col, fill_col_name=None):
'''Forward fill a column by a column/set of columns (order_col).
Parameters:
------------
df: Dataframe
order_col: String or List of string
fill_col: String (Only work for a column for this version.)
Return:
---------
df: Dataframe
Return df with the filled_cols.
'''
# "value" and "constant" are tmp columns created ton enable forward fill.
df = df.withColumn('value', when(col(fill_col).isNull(), 0).otherwise(1))
df = cum_sum(df, 'value', order_col).drop('value')
df = df.withColumn(fill_col,
when(col(fill_col).isNull(), 'constant').otherwise(col(fill_col)))
win = (Window.partitionBy('cum_sum')
.orderBy(order_col))
if not fill_col_name:
fill_col_name = 'ffill_{}'.format(fill_col)
df = df.withColumn(fill_col_name, collect_list(fill_col).over(win)[0])
df = df.drop('cum_sum')
df = df.withColumn(fill_col_name, when(col(fill_col_name)=='constant', None).otherwise(col(fill_col_name)))
df = df.withColumn(fill_col, when(col(fill_col)=='constant', None).otherwise(col(fill_col)))
return df
Let's see the results.
ffilled_df = forward_fill(df,
order_col=['cookie_ID', 'Time'],
fill_col='User_ID',
fill_col_name = 'User_ID_ffil')
ffilled_df.sort(['cookie_ID', 'Time']).show()
A combined forward fill and backward fill can also be done with coalesce over two windows (using the df from Sid's example above):
# Forward filling
w1 = Window.partitionBy('cookie_id').orderBy('c_date').rowsBetween(Window.unboundedPreceding, 0)
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
# Backward filling
final_df = df.withColumn('UserIDFilled', F.coalesce(F.last('user_id', True).over(w1),
F.first('user_id',True).over(w2)))
final_df.orderBy('cookie_id', 'c_date').show(truncate=False)
+---------+----------+-------+------------+
|cookie_id|c_date |user_id|UserIDFilled|
+---------+----------+-------+------------+
|1 |2015-12-01|null |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-03|U2 |U2 |
|1 |2015-12-04|null |U2 |
|1 |2015-12-05|null |U2 |
|2 |2015-12-02|U3 |U3 |
|2 |2015-12-03|null |U3 |
|2 |2015-12-04|null |U3 |
|2 |2015-12-05|null |U3 |
+---------+----------+-------+------------+
Cloudera has released a library called spark-ts that offers a suite of useful methods for processing time series and sequential data in Spark. This library supports a number of time-windowed methods for imputing data points based on other data in the sequence.
http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/
