How to define MySQL types in Glue Job, which saves data to RDS? - apache-spark

I have a Glue job that moves data from DynamoDB to an RDS MySQL database. The code is Spark in Scala. I manage the infrastructure as code with Ansible and CloudFormation, so I need a code-only solution.
When I save data to the RDS table, string columns are created as TEXT. I want to find a way to save them as VARCHAR(N).
I thought this was done via a mapping like this:
def getMapping(frame: DynamicFrame): DynamicFrame = {
  frame.applyMapping(
    mappings = Seq(
      ("stringColumn", "string", "stringColumn", "string"),
      ("timestampColumn", "timestamp", "timestampColumn", "timestamp")
    ),
    caseSensitive = true,
    transformationContext = "applyMapping"
  )
}
But that doesn't seem to be the case, because Spark has no string type restricted by number of characters.
Another option would be the parameters of glueContext.getJDBCSink(...), which has a lot of parameters for Redshift, but none for RDS.
What would be a good way to go here?
I tried defining mappings as in the function above, but with no luck.
Nothing else comes to mind, and I found nothing on the Internet.
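One direction that may be worth trying, offered as a sketch and not as a confirmed Glue solution: Spark's own JDBC writer (Spark 2.2 or later) accepts a createTableColumnTypes option, which takes DDL-style column types that are used when Spark itself creates the target table. The helper below only builds that option string; the object name, function name and column lengths are hypothetical.

```scala
object VarcharColumnTypes {
  // Hypothetical helper: turns (column name, max length) pairs into the
  // DDL fragment expected by Spark's "createTableColumnTypes" JDBC option,
  // e.g. "stringColumn VARCHAR(255), otherColumn VARCHAR(64)".
  def columnTypes(lengths: Seq[(String, Int)]): String =
    lengths
      .map { case (name, n) => s"$name VARCHAR($n)" }
      .mkString(", ")
}
```

Assuming the DynamicFrame is first converted with frame.toDF(), the result could then be passed along the lines of .write.option("createTableColumnTypes", VarcharColumnTypes.columnTypes(Seq("stringColumn" -> 255))).jdbc(url, table, props). The option only takes effect when Spark creates the table; if the table already exists, its existing column types are kept.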

Related

How to use Spark cosmos.oltp datasource/format. Is that configuration a datasource or a format?

I am new to Spark and CosmosDB.
Reviewing the quickstart documentation here: https://learn.microsoft.com/en-us/azure/cosmos-db/create-sql-api-spark
I found they use "cosmos.oltp" like:
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')".format(cosmosDatabaseName, cosmosContainerName))
The documentation refers to it as a data source, but the API suggests it is a format.
What is the purpose of cosmos.oltp? Is it a data source or a format?
cosmos.oltp is used as format when defining Read or Write flows, like:
val df = spark.read.format("cosmos.oltp").options(cfg).load()
df.printSchema()
That reads data using the Cosmos DB Connector.
or
spark.createDataFrame(Seq(("cat-alive", "Schrodinger cat", 2, true), ("cat-dead", "Schrodinger cat", 2, false)))
  .toDF("id","name","age","isAlive")
  .write
  .format("cosmos.oltp")
  .options(cfg)
  .mode("APPEND")
  .save()
That writes data using the Cosmos DB Connector (and you can combine read and write too).
In both cases, they are Datasources, but used through the format call with the cosmos.oltp name.
The Catalog API support on the other hand is used to create or manage the databases/collections.
The format call in your code example merely replaces the {} placeholders with the values. You could remove it and put the names directly in the string, or use any other string formatting you choose.
Here are more Spark samples: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/Samples
And here are docs for the Catalog API: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/catalog-api.md
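To illustrate the point about string formatting, here is a minimal Scala sketch that builds the same CREATE TABLE statement with string interpolation instead of Python's format. The object and function names are hypothetical, and the statement assumes that cosmosCatalog has already been configured as in the quickstart.

```scala
object CosmosCatalogSql {
  // Hypothetical helper: builds the CREATE TABLE statement from the question
  // using Scala's s-interpolator in place of Python's str.format.
  def createTableSql(database: String, container: String): String =
    s"CREATE TABLE IF NOT EXISTS cosmosCatalog.$database.$container " +
      "using cosmos.oltp " +
      "TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')"
}
```

The resulting string would be passed to spark.sql(...) exactly as in the question; cosmos.oltp still names the data source, and only the way the names are substituted changes.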

Spark DataFrame Filter using Binary (Array[Bytes]) data

I have a DataFrame from a JDBC table hitting MySQL and I need to filter it using a UUID. The data is stored in MySQL as binary(16) and, when queried in Spark, is converted to Array[Byte] as expected.
I'm new to Spark and have been trying various ways to pass a variable of type UUID into the DataFrame's filter method.
I've tried statements like:
val id: UUID = // other logic that looks this up
df.filter(s"id = $id")
df.filter("id = " convertToByteArray(id))
df.filter("id = " convertToHexString(id))
All of these error with different messages.
I just need to somehow pass in Binary types but can't seem to put my finger on how to do so properly.
Any help is greatly appreciated.
After reviewing even more sources online, I found a way to accomplish this without using the filter method.
When I'm reading from my sparkSession, I just use an ad hoc subquery instead of a table name, as follows:
sparkSession.read.jdbc(connectionString, s"(SELECT id, {other cols omitted} FROM MyTable WHERE id = 0x$id) AS MyTable", props)
This pre-filters the results for me and then I just work with the data frame as I need.
If anyone knows of a solution using filter, I'd still love to know it as that would be useful in some cases.
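For the filter-based variant, a possible starting point is to convert the UUID to its 16 raw bytes (or to a hex string for the 0x... literal) first. The sketch below assumes that the binary(16) column stores the UUID in plain big-endian order, most significant bits first; the object and method names are hypothetical.

```scala
import java.nio.ByteBuffer
import java.util.UUID

object UuidBinary {
  // 16 raw bytes: most significant 8 bytes first, then least significant 8.
  // Assumption: the binary(16) column uses this same byte order.
  def toBytes(id: UUID): Array[Byte] =
    ByteBuffer.allocate(16)
      .putLong(id.getMostSignificantBits)
      .putLong(id.getLeastSignificantBits)
      .array()

  // 32 lowercase hex characters without dashes, suitable after a "0x" prefix.
  def toHex(id: UUID): String =
    toBytes(id).map(b => "%02x".format(b)).mkString
}
```

With org.apache.spark.sql.functions.{col, lit} imported, a filter such as df.filter(col("id") === lit(UuidBinary.toBytes(id))) should compare against a binary literal, and UuidBinary.toHex(id) can be interpolated after 0x in the pre-filtering subquery. If the UUIDs were written with swapped time fields, the byte order would need to be adjusted accordingly.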

Query Spark SQL from Node.js server

I'm currently using npm's cassandra-driver to query my Cassandra database from a Node.js server. Since I want to be able to write more complex queries, I'd like to use Spark SQL instead of CQL. Is there any way to create a RESTful API (or something else) so that I can use Spark SQL the same way that I currently use CQL?
In other words, I want to be able to send a Spark SQL query from my Node.js server to another server and get a result back.
Is there any way to do this? I've been searching for solutions to this problem for a while and haven't found anything yet.
Edit: I'm able to query my database with Scala and Spark SQL from the Spark shell, so that bit is working. I just need to connect Spark and my Node.js server somehow.
I had a similar problem, and I solved it by using Spark-JobServer.
The main approach with Spark-Jobserver (SJS) usually is to create a special job that extends their SparkSQLJob such as in the following example:
object ExecuteQuery extends SparkSQLJob {
  override def validate(sqlContext: SQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }
  override def runJob(sqlContext: SQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = sqlContext.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
However, for this approach to work well with Cassandra, you have to use the spark-cassandra-connector, and you then have two options for loading the data:
1) Before calling this ExecuteQuery via REST, you have to transfer the full data you want to query from Cassandra to Spark. For that, you would do something like (code adapted from the spark-cassandra-connector documentation):
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()
Then register it as a table so that Spark SQL can access it (use one of the two):
df.registerTempTable("myTable") // As a temporary table
df.write.saveAsTable("myTable") // As a persistent Hive table
Only after that will you be able to use ExecuteQuery to query myTable.
2) As the first option can be inefficient in some use cases, there is another option.
The spark-cassandra-connector has a special CassandraSQLContext that can be used to query C* tables directly from Spark. It can be used like:
val cc = new CassandraSQLContext(sc)
val df = cc.sql("SELECT * FROM keyspace.table ...")
However, to use a different type of context with Spark-JobServer, you need to extend SparkContextFactory and use it at context creation time (which can be done with a POST request to /contexts). An example of a special context factory can be seen on the SJS GitHub. You also have to create a SparkCassandraJob, extending SparkJob (but this part is very easy).
Finally, the ExecuteQuery job has to be adapted to use the new classes. It would be something like:
object ExecuteQuery extends SparkCassandraJob {
  override def validate(cc: CassandraSQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }
  override def runJob(cc: CassandraSQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = cc.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
After that, the ExecuteQuery job can be executed via REST with a POST request.
Conclusion
Here I use the first option because I need the advanced queries available in the HiveContext (window functions, for example), which are not available in the CassandraSQLContext. However, if you don't need those kinds of operations, I recommend the second approach, even if it needs some extra coding to create a new ContextFactory for SJS.

Populate Azure Data Factory dataset from query

I cannot find an answer via Google, MSDN (and other Microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using a copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic that alters the data, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "tablename" property.
If there is a way, what would that method be?
The input is an on-premises SQL Server. The output is an Azure SQL database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
There's a way to move data from on-premises SQL Server to Azure SQL using Data Factory.
You can use Copy Activity; for your case specifically, see the code sample in the ADF Activity source on GitHub.
Basically, you need to create a Copy Activity whose typeProperties contain SqlSource and SqlSink sections that look like this:
"typeProperties": {
  "source": {
    "type": "SqlSource",
    "SqlReaderQuery": "select * from [Source]"
  },
  "sink": {
    "type": "SqlSink",
    "WriteBatchSize": 1000000,
    "WriteBatchTimeout": "00:05:00"
  }
},
Also worth mentioning: you can select not only from tables or views; table-valued functions work as well.

Saving / exporting transformed DataFrame back to JDBC / MySQL

I'm trying to figure out how to use the new DataFrameWriter to write data back to a JDBC database. I can't seem to find any documentation for this, although looking at the source code it seems like it should be possible.
A trivial example of what I'm trying looks like this:
sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar")
).load()
  .select("some_column", "another_column")
  .write.format("jdbc").options(Map(
    "url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar2")
  ).save("foo.bar2")
This doesn't work — I end up with this error:
java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
I'm not sure if I'm doing something wrong (why is it resolving to DefaultSource instead of JDBCRDD for example?) or if writing to an existing MySQL database just isn't possible using Spark's DataFrames API.
Update
Current Spark versions (2.0 or later) support table creation on write.
The original answer
It is possible to write to an existing table, but it looks like at this moment (Spark 1.5.0) creating a table using the JDBC data source is not supported yet*. You can check SPARK-7646 for reference.
If the table already exists, you can simply use the DataFrameWriter.jdbc method:
val prop: java.util.Properties = ???
df.write.jdbc("jdbc:mysql://localhost/foo", "foo.bar2", prop)
* Interestingly, PySpark seems to support table creation using the jdbc method.
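The ??? placeholder for the connection properties could be filled in roughly as follows. This is a minimal sketch: the object and function names are hypothetical, and the driver class name is an assumption that depends on the MySQL Connector/J version on the classpath (com.mysql.jdbc.Driver for 5.x, com.mysql.cj.jdbc.Driver for 8.x).

```scala
import java.util.Properties

object JdbcProps {
  // Hypothetical helper: builds the connection properties that
  // DataFrameWriter.jdbc expects as its third argument.
  def mysqlProps(user: String, password: String): Properties = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    // Assumption: Connector/J 5.x; adjust for other driver versions.
    props.setProperty("driver", "com.mysql.jdbc.Driver")
    props
  }
}
```

With that in place, the write would look like df.write.mode("append").jdbc("jdbc:mysql://localhost/foo", "foo.bar2", JdbcProps.mysqlProps(user, password)), where append mode adds rows to the existing table.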
