How to automatically sync a Hive external table with a MySQL table without using Sqoop? - apache-spark

I'm already having a MySQL table in my local machine (Linux) itself, and I have a Hive external table with the same schema as the MySQL table.
I want to sync my hive external table whenever a new record is inserted or updated. Batch update is ok with me to say hourly.
What is the best possible approach to achieve the same without using sqoop?
Thanks,
Sumit

Without scoop, you can create table STORED BY JdbcStorageHandler. Project repository: https://github.com/qubole/Hive-JDBC-Storage-Handler It will work as usual hive table, but query will run on MySQL. Predicate pushdown will work.
DROP TABLE HiveTable;
CREATE EXTERNAL TABLE HiveTable(
id INT,
id_double DOUBLE,
names STRING,
test INT
)
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
TBLPROPERTIES (
"mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
"mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
"mapred.jdbc.username"="root",
"mapred.jdbc.input.table.name"="JDBCTable",
"mapred.jdbc.output.table.name"="JDBCTable",
"mapred.jdbc.password"="",
"mapred.jdbc.hive.lazy.split"= "false"
);

Related

How to structure schemas better in Synapse SQL dedicated pool

Using Azure Synapse , Dedicated SQL pool.
How can I structure my tables underneath a button that represents the schema?
This is a small issue that really makes a big impact when many tables and schemas will be used in the database, and users will need to navigate to the correct schema quickly.
I tried dragging the schema under over the tables section, but nothing worked.
to structure the table first we need to create schema in sql dedicated pool using below command
CREATE SCHEMA <schemaName>
we need to create table using above created schema with required columns with suitable data types to that column.
Table creation:
CREATE TABLE <tableName>(col1 dataType,col2 dataType)
I created external table in dedicated sql pool in synapse following below steps:
Schema creation:
Created external data source:
CREATE EXTERNAL DATA SOURCE [DATASOURCE] WITH
(
LOCATION = '<location>',
)
Image for reference:
created external file format:
CREATE EXTERNAL FILE FORMAT [FileFormat1] WITH
(
FORMAT_TYPE = DELIMITEDTEXT
)
Image for reference:
I created external table using above data source and file format using below code:
CREATE EXTERNAL TABLE [wwi].[information2]
(
[Id] INT
)
WITH
(
LOCATION = '<folder/file>',
DATA_SOURCE = [DATASOURCE1],
FILE_FORMAT = [FileFormat1]
)
Image for reference:
In this we can structure the table in synapse dedicated pool.

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in spark sql that gets data from an underlying jdbc database connection. It works fine when I use a temporary (session-scoped) view as shown below but fails when trying to create a regular (persistent and global) view.
I don't understand why the latter should not work but couldn't find any docs/hints as all examples are always done with temporary views. Technically, I cannot see why it shouldn't work as the data is properly retrieved from jdbc source in the temporary view and thus it should not matter if I wanted to "store" the query in a persistent view so that whenever calling the view it would retrieve data directly from jdbc source.
Config.
tbl_in = myjdbctable
tbl_out = myview
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)
TL;DR: A spark sql table over jdbc source behaves like a view and so can be used like one.
It seems my assumptions about jdbc tables in spark sql were flawed. It turns out that a sql table with a jdbc source (i.e. created via using jdbc) is actually a live query against the jdbc source (and not a one-off jdbc query during table creation as I assumed). In my mind it actually behaves like a view then. That means if the underlying jdbc source changes (e.g. new entries in a column) this is reflected in the spark sql table on read (e.g. select from) without having to re-create the table.
It follows that the spark sql table over jdbc source satisfies my requirements of having an always up2date reflection of the underlying table/sql object in the jdbc source. Usually, I would use a view for that. Maybe this is the reason why there is no persistent view over a jdbc source but only temporary views (which of course still make sense as they are session-scoped). It should be noted that the spark sql jdbc table behaves like a view which may be surprising, in particular:
if you add a column in underlying jdbc table, it will not show up in spark sql table
if you remove a column from underlying jdbc table, an error will occur when spark sql table is accessed (assuming the removed column was present during spark sql table creation)
if you remove the underlying jdbc table, an error will occur when spark sql table is accessed
The input of spark.sql should be DML (Data Manipulation Language). Its output is a dataframe.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements may work, that's not meant to be used this way.
If you want to use DDL, simply connect to your DB using python packages.
If you want to create a temp view in spark, do it using spark syntaxe createTempView

ADF copy data activity - check for duplicate records before inserting into SQL db

I have a very simple ADF pipeline to copy data from local mongoDB (self-hosted integration environment) to Azure SQL database.
My pipleline is able to copy the data from mongoDB and insert into SQL db.
Currently if I run the pipeline it inserts duplicate data if run multiple times.
I have made _id column as unique in SQL database and now running pipeline throws and error because of SQL constraint wont letting it insert the record.
How do I check for duplicate _id before inserting into SQL db?
should I use Pre-copy script / stored procedure?
Some guidance / directions would be helpful on where to add extra steps. Thanks
Azure Data Factory Data Flow can help you achieve that:
You can follow these steps:
Add two sources: Cosmos db table(source1) and SQL database table(source2).
Using Join active to get all the data from two tables(left join/full join/right join) on Cosmos table.id= SQL table.id.
AlterRow expression to filter the duplicate _id, it not duplicate then insert it.
Then mapping the no-duplicate column to the Sink SQL database table.
Hope this helps.
You Should implement your SQL Logic to eliminate duplicate at the Pre-Copy Script
Currently I got the solution using a Stored Procedure which look like a lot less work as far this requirement is concerned.
I have followed this article:
https://www.cathrinewilhelmsen.net/2019/12/16/copy-sql-server-data-azure-data-factory/
I created table type and used in stored procedure to check for duplicate.
my sproc is very simple as shown below:
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[spInsertIntoDb]
(#sresults dbo.targetSensingResults READONLY)
AS
BEGIN
MERGE dbo.sensingresults AS target
USING #sresults AS source
ON (target._id = source._id)
WHEN NOT MATCHED THEN
INSERT (_id, sensorNumber, applicationType, place, spaceType, floorCode, zoneCountNumber, presenceStatus, sensingTime, createdAt, updatedAt, _v)
VALUES (source._id, source.sensorNumber, source.applicationType, source.place, source.spaceType, source.floorCode,
source.zoneCountNumber, source.presenceStatus, source.sensingTime, source.createdAt, source.updatedAt, source.updatedAt);
END
I think using stored proc should do for and also will help in future if I need to do more transformation.
Please let me know if using sproc in this case has potential risk in future ?
To remove the duplicates you can use the pre-copy script. OR what you can do is you can store the incremental or new data into a temp table using copy activity and use a store procedure to delete only those Ids from the main table which are in temp table after deletion insert the temp table data into the main table. and then drop the temp table.

Is there any alternative of CREATE TYPE in SQL as CREATE TYPE is Not supported in Azure SQL data warehouse

I am trying to execute this query but as userdefined(Create type) types are not supportable in azure data warehouse. and i want to use it in stored procedure.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO
CREATE PROCEDURE usp_upsert_customer_table #customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING #customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
CREATE TYPE DataTypeforProjectTable AS TABLE(
Project varchar(255),
Creationtime datetime
);
GO
CREATE PROCEDURE usp_upsert_project_table #project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING #project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
Is there any alternative way to do this.
You've got a few challenges there, because most of what you're trying to convert is not the way to do things on ASDW.
First, as you point out, CREATE TYPE is not supported, and there is no equivalent alternative.
Next, the code appears to be doing single inserts to a table. That's really bad on ASDW, performance will be dreadful.
Next, there's no MERGE statement (yet) for ASDW. That's because UPDATE is not the best way to handle changing data.
And last, stored procedures work a little differently on ASDW, they're not compiled, but interpreted each time the procedure is called. Stored procedures are great for big chunks of table-level logic, but not recommended for high volume calls with single-row operations.
I'd need to know more about the use case to make specific recommendations, but in general you need to think in tables rather than rows. In particular, focus on the CREATE TABLE AS (CTAS) way of handling your ELT.
Here's a good link, it shows how the equivalent of a Merge/Upsert can be handled using a CTAS:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-ctas#replace-merge-statements
As you'll see, it processes two tables at a time, rather than one row. This means you'll need to review the logic that called your stored procedure example.
If you get your head around doing everything in CTAS, and separately around Distribution, you're well on your way to having a high performance data warehouse.
Temp tables in Azure SQL Data Warehouse have a slightly different behaviour to box product SQL Server or Azure SQL Database - they exist at the session level. So all you have to do is convert your CREATE TYPE statements to temp tables and split the MERGE out into separate INSERT / UPDATE / DELETE statements as required.
Example:
CREATE TABLE #DataTypeforCustomerTable (
PersonID INT,
Name VARCHAR(255),
LastModifytime DATETIME
)
WITH
(
DISTRIBUTION = HASH( PersonID ),
HEAP
)
GO
CREATE PROCEDURE usp_upsert_customer_table
AS
BEGIN
-- Add records which do not already exist
INSERT INTO customer_table ( PersonID, Name, LastModifytime )
SELECT PersonID, Name, LastModifytime
FROM #DataTypeforCustomerTable AS source
WHERE NOT EXISTS
(
SELECT *
FROM customer_table target
WHERE source.PersonID = target.PersonID
)
...
Simply load the temp table and execute the stored proc. See here for more details on temp table scope.
If you are altering a large portion of the table then you should consider the CTAS approach to create a new table, then rename it as suggested by Ron.

Composite key in Cassandra with Pig

We have a CQL table that looks something like this:
CREATE table data (
occurday text,
seqnumber int,
occurtimems bigint,
unique bigint,
fields map<text, text>,
primary key ((occurday, seqnumber), occurtimems, unique)
)
I can query this table from cqlsh like this:
select * from data where seqnumber = 10 AND occurday = '2013-10-01';
This query works and returns the expected data.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=seqnumber%3D10%20AND%20occurday%3D%272013-10-01%27' USING CqlStorage();
gives
InvalidRequestException(why:seqnumber cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:39567)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1625)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1611)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:591)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:621)
Shouldn't these behave the same? Why is the version through Pig failing where the straight cqlsh command works?
Hadoop is using CqlPagingRecordReader to try to load your data. This is leading to queries that are not identical to what you have entered. The paging record reader is trying to obtain small slices of Cassandra data at a time to avoid timeouts.
This means that your query is executed as
SELECT * FROM "data" WHERE token("occurday","seqnumber") > ? AND
token("occurday","seqnumber") <= ? AND occurday='A Great Day'
AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
And this is why you are seeing your repeated key error. I'll submit a bug to the Cassandra Project.
Jira:
https://issues.apache.org/jira/browse/CASSANDRA-6151

Resources