How to read the ROW FORMAT DELIMITED with SEQUENCEFILE in Spark SQL - apache-spark

I have the following Hive table definition:
CREATE EXTERNAL TABLE english_1grams (
gram string,
year int,
occurrences bigint,
pages bigint,
books bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
location 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/';
From: http://aws.amazon.com/articles/Elastic-MapReduce/5249664154115844
It works just fine in Hive. However, when I try to use it with Spark, it gives an error:
Operation not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not 'sequencefile'(line 1, pos 0)
How can I read this table in Spark SQL? I've removed the ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' from the definition but that returns just gibberish instead of the actual data.
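One possible approach, offered as a sketch rather than a confirmed fix, is to read the SequenceFile directly and split each value on tabs yourself. The sketch assumes each record's value is a tab-separated text line matching the five columns of the table. `OneGram` and `parse_1gram` are hypothetical names, and the commented-out PySpark calls are shown for orientation only and are not exercised here.

```python
from typing import NamedTuple


class OneGram(NamedTuple):
    """One row of the english_1grams table (hypothetical helper type)."""
    gram: str
    year: int
    occurrences: int
    pages: int
    books: int


def parse_1gram(value: str) -> OneGram:
    """Split one tab-delimited SequenceFile value into typed fields."""
    gram, year, occurrences, pages, books = value.split("\t")
    return OneGram(gram, int(year), int(occurrences), int(pages), int(books))


# Assumed PySpark usage (needs a running session named `spark`):
#   rdd = spark.sparkContext.sequenceFile(
#       "s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/")
#   df = rdd.map(lambda kv: parse_1gram(kv[1])).toDF()

row = parse_1gram("example\t1999\t10\t5\t2")
print(row.gram, row.year, row.occurrences)
```

The SequenceFile key is ignored here; only the value carries the delimited row.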

Related

Getting null when trying to read a column with value '-' from aws glue catalog table

I am reading an Athena table that has a column named br_book_gl1 with the values '-' and '+'.
I get the '+' values when reading it as a Glue catalog table, but for the '-' values I get null.
The datatype is String in the Athena table, and I am using the code below:
gluecontext.getCatalogSource(database = database, tableName = tableName).getDynamicFrame().toDF()
.select(col("br_book_gl1").as("GainLossSign"))
Athena can only read table names with alphanumeric or underscore characters.
See this page for more info: https://docs.aws.amazon.com/glue/latest/dg/console-tables.html?icmpid=docs_glue_console.

org.apache.spark.sql.AnalysisException: while saving Spark Dataframe

I have two tables in my database. I am trying to save data from the first table to the second table using insertInto.
CREATE TABLE if not exists dbname.tablename_csv ( id STRING, location STRING, city STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE ;
CREATE TABLE if not exists dbname.tablename_orc ( id String,location STRING, country String) PARTITIONED BY (city string) CLUSTERED BY (country) into 4 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORCFILE tblproperties("orc.compress"="SNAPPY");
var query=spark.sql("select id,location,city,country from dbname.tablename_csv")
query.write.insertInto("dbname.tablename_orc")
but it fails with the following error:
"org.apache.spark.sql.AnalysisException: `dbname`.`tablename_orc` requires that the data to be inserted have the same number of columns as the target table: target table has 3 column(s) but the inserted data has 4 column(s), including 0 partition column(s) having constant value(s).;"
Could someone please give me a hint about what else I need to add? I also tried adding partitionBy, but I got the same error, along with a message that partitionBy is not required.
query.write.partitionBy("city").insertInto("dbname.tablename_orc")
Use saveAsTable(...) with mode = "append" instead.
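For context on why the column layout matters: `insertInto` matches columns by position rather than by name, and a partitioned Hive table lists its partition columns after its data columns. Below is a minimal sketch of putting a select list into that order. The helper name `order_for_insert_into` is hypothetical, and column order may not be the only cause of the error above.

```python
def order_for_insert_into(select_columns, data_columns, partition_columns):
    """Return the order a position-based insert expects:
    data columns first, then partition columns."""
    target = list(data_columns) + list(partition_columns)
    if sorted(target) != sorted(select_columns):
        raise ValueError(
            f"column mismatch: selected {sorted(select_columns)}, "
            f"table expects {sorted(target)}"
        )
    return target


ordered = order_for_insert_into(
    select_columns=["id", "location", "city", "country"],
    data_columns=["id", "location", "country"],
    partition_columns=["city"],
)
print(ordered)  # ['id', 'location', 'country', 'city']
```

With a DataFrame, this order would be applied with something like `df.select(*ordered)` before calling `write.insertInto(...)`.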

HIVE rendered timestamp column data as NULL

I am trying to create an external table using Hive. Below is the Hive query I ran:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/user/taxi_trips/';
When I looked at the output from the 'trips_raw' table created by the query above, I saw that both the 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' columns are 'NULL' in all rows. Other threads give the reason that the '1/1/2018 11:13:00 AM' timestamp format is not accepted by Hive, but that is the timestamp format I have in my CSV source data.
I could specify those 2 timestamp columns as 'string' and Hive will be able to render them correctly, but I still would want those 2 columns to be 'timestamp' type so specifying those 2 columns as 'string' type is not a viable option here.
I had also tried the following technique using recommendation from this site (https://community.hortonworks.com/questions/55266/hive-date-time-problem.html) but had no success:
Create the 'trips_raw' table using 'string' as type for the 2 timestamp columns. This allows the resulting table to render the timestamps correctly, albeit in 'string' type. The Hive command I used is shown below:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime string,
tpep_dropoff_datetime string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location
'/user/taxi_trips/';
When I look at the resulting table, the dates are shown as strings.
But as I had mentioned earlier, I want the time columns to be in timestamp type and not string type. Therefore in the next 2 steps I tried to create a blank table and then insert the data from the table created from Step 1 but converting the string to timestamp this time.
Create an external blank table called 'trips_not_raw' using the following Hive commands:
create external table trips_not_raw
(VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
);
Insert data from 'trips_raw' table (which was mentioned earlier in this question), using the Hive commands below:
insert into table trips_not_raw select vendorid,
from_unixtime(unix_timestamp(tpep_pickup_datetime, 'MM/dd/yyyy HH:mm:ss aa')) as tpep_pickup_datetime,
from_unixtime(unix_timestamp(tpep_dropoff_datetime, 'MM/dd/yyyy HH:mm:ss aa')) as tpep_dropoff_datetime
from trips_raw;
Doing this inserts the rows into the blank table 'trips_not_raw', but the 2 timestamp columns still show as 'Null'.
Is there a simple way to store the 2 time columns as 'timestamp' type and not 'string', but still be able to render them correctly in the output without seeing 'Null/None'?
I'm afraid you need to parse the timestamp column and then cast the resulting string as a timestamp. For example:
select cast(regexp_replace('1/1/2018 11:13:00 AM', '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp)
You can create and use a macro function for convenience, e.g.,
create temporary macro parse_date (ts string)
cast(regexp_replace(ts, '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp);
then use it as follows
select parse_date('1/1/2018 11:13:00 AM');
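Note that the regular expression above discards the AM/PM marker, so afternoon times would come out twelve hours early. As an illustration outside Hive, here is a small Python sketch that parses the same 12-hour format and emits the `yyyy-MM-dd HH:mm:ss` layout. `to_hive_timestamp` is a hypothetical helper name, and the AM/PM parsing assumes an English locale.

```python
from datetime import datetime

# 12-hour clock with AM/PM marker, as in '1/1/2018 11:13:00 AM'.
SOURCE_FORMAT = "%m/%d/%Y %I:%M:%S %p"
# 24-hour layout with zero-padded fields.
HIVE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_hive_timestamp(text: str) -> str:
    """Convert a 12-hour 'M/d/yyyy h:mm:ss AM' string to 24-hour form."""
    return datetime.strptime(text, SOURCE_FORMAT).strftime(HIVE_FORMAT)


print(to_hive_timestamp("1/1/2018 11:13:00 AM"))  # 2018-01-01 11:13:00
print(to_hive_timestamp("1/1/2018 1:05:30 PM"))   # 2018-01-01 13:05:30
```

In Hive itself, the equivalent `unix_timestamp` pattern would likely use `hh` and `a` rather than `HH` and `aa`, though that is not verified here.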

spark-hive - Upsert into dynamic partition hive table throws an error - Partition spec contains non-partition columns

I am using Spark 2.2.1 and Hive 2.1. I am trying to insert overwrite multiple partitions into an existing partitioned Hive/Parquet table.
Table was created using sparkSession.
I have a table 'mytable' with partitions P1 and P2.
I have following set on sparkSession object:
"hive.exec.dynamic.partition"=true
"hive.exec.dynamic.partition.mode"="nonstrict"
Code:
val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable") //here 'df' may contains data from multiple partitions. i.e. multiple values for P1 and P2 in data.
spark.sql("insert overwrite table mytable PARTITION(P1, P2) select c1, c2,..cn, P1, P2 from updateTable") // I made sure that partition columns P1 and P2 are at the end of projection list.
I am getting following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {p1=, p2=, P1=1085, P2=164590861} contains non-partition columns;
The DataFrame 'df' has records for P1=1085, P2=164590861. It looks like an issue with casing (lower vs upper). I tried both cases in my query but it's still not working.
EDIT:
Insert statement works with static partitioning but that is not what I am looking for:
e.g. following works
spark.sql("insert overwrite table mytable PARTITION(P1=1085, P2=164590861) select c1, c2,..cn, P1, P2 from updateTable where P1=1085 and P2=164590861")
Create table stmt:
CREATE TABLE `my_table`(
`c1` int,
`c2` int,
`c3` string,
`p1` int,
`p2` int)
PARTITIONED BY (
`p1` int,
`p2` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/mds/hive/warehouse/my.db/xc_bonus'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{.spark struct metadata here.......}',
'spark.sql.sources.schema.partCol.0'='P1', -- Spark uses uppercase names for the partitions, while Hive uses lowercase
'spark.sql.sources.schema.partCol.1'='P2',
'transient_lastDdlTime'='1533665272')
In the above, spark.sql.sources.schema.partCol.0 uses all uppercase, while the PARTITIONED BY clause uses all lowercase for the partition columns.
Based on the exception, and assuming that the table 'mytable' was created as a partitioned table with P1 and P2 as partitions, one way to overcome this exception would be to force a dummy partition manually before executing the command. Try doing
spark.sql("alter table mytable add partition (p1=default, p2=default)").
Once successful, execute your insert overwrite statement. Hope this helps?
As I mentioned in the EDIT section, the issue was in fact the difference in partition column casing (lower vs upper) between Hive and Spark. I created the Hive table with all uppercase partition columns; Hive still stored them internally as lowercase, but the Spark metadata kept them as uppercase, as I had intended. Changing the create statement to use all-lowercase partition columns fixed the issue for subsequent updates.
If you are using Hive 2.1 and Spark 2.2, make sure the following properties in the create statement have the same casing:
PARTITIONED BY (
p1 int,
p2 int)
'spark.sql.sources.schema.partCol.0'='p1',
'spark.sql.sources.schema.partCol.1'='p2',
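A quick way to spot this kind of mismatch is to compare the Hive partition column names against the `spark.sql.sources.schema.partCol.N` table properties. The sketch below is plain Python over values copied by hand from the DDL. `find_case_mismatches` is a hypothetical helper, not part of Spark or Hive.

```python
PART_COL_PREFIX = "spark.sql.sources.schema.partCol."


def find_case_mismatches(hive_partition_columns, table_properties):
    """Return (hive_name, spark_name) pairs that differ only by letter case."""
    # Collect Spark's partition column names, ordered by their numeric index.
    indexed = {
        int(key[len(PART_COL_PREFIX):]): value
        for key, value in table_properties.items()
        if key.startswith(PART_COL_PREFIX)
    }
    spark_columns = [indexed[i] for i in sorted(indexed)]
    return [
        (hive_name, spark_name)
        for hive_name, spark_name in zip(hive_partition_columns, spark_columns)
        if hive_name != spark_name and hive_name.lower() == spark_name.lower()
    ]


properties = {
    "spark.sql.sources.schema.numPartCols": "2",
    "spark.sql.sources.schema.partCol.0": "P1",
    "spark.sql.sources.schema.partCol.1": "P2",
}
mismatches = find_case_mismatches(["p1", "p2"], properties)
print(mismatches)  # [('p1', 'P1'), ('p2', 'P2')]
```

An empty list would indicate that the two sides agree on casing.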

How to create hive table with date format 'dd-MMM-yyyy'?

I'm trying to create a Hive table for importing CSV data where the date format in the CSV file is 'dd-MMM-yyyy' (for example 20-Mar-2018). When I created the table in Hive, the entire date column turned into null values. Can anyone suggest how to fix this?
My Query:
create external table new_stock (Symbol String,Series String,Dat date,Prev_Close float,Open_Price float,High_Price float,Low_Price float,Last_Price float,Close_Price float,Avg_Price float,Volume int,Turn_Over float,Trades int,Del_Qty int,DQPQ_Per float) row format delimited fields terminated by ',' stored as textfile LOCATION '/stock_details/'
Finally, with some help from @leftjoin, I solved the problem by converting the string date from the format dd-MMM-yyyy to dd-MM-yyyy in a select query, which works fine:
select from_unixtime(unix_timestamp(columnname ,'dd-MMM-yyyy'), 'dd-MM-yyyy') from tablename;
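The same conversion can be checked outside Hive. This Python sketch parses the `dd-MMM-yyyy` text and also shows the `yyyy-MM-dd` layout, which is the form Hive's `date` type reads from text files. `convert_date` is a hypothetical helper, and the month abbreviations assume an English locale.

```python
from datetime import datetime


def convert_date(text: str, target_format: str = "%d-%m-%Y") -> str:
    """Parse a 'dd-MMM-yyyy' string such as '20-Mar-2018' and reformat it."""
    return datetime.strptime(text, "%d-%b-%Y").strftime(target_format)


print(convert_date("20-Mar-2018"))              # 20-03-2018
print(convert_date("20-Mar-2018", "%Y-%m-%d"))  # 2018-03-20
```

If the column must stay typed as `date`, emitting the second layout is probably the safer choice.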
