Ingest JDBC/ODBC data to Snowflake - netsuite

Does Snowflake support JDBC data sources, and if so how? I'm using NetSuite Analytics as a data source and would like to load it into a Snowflake warehouse. The examples I'm finding for Snowflake are file readers. I realise I can convert my NetSuite data to a file and then ingest that, but I'd rather remove that additional step.

Snowflake has both ODBC and JDBC drivers that you can use. However, if you are loading a lot of data from Netsuite Analytics, most of the Snowflake drivers will actually generate files, PUT them to S3, and execute a COPY INTO statement to get the data into Snowflake for you. While it is more seamless, it is still executing that "additional step". The reason is...that's the most efficient way to get data into Snowflake, and it's not even close.
https://docs.snowflake.com/en/user-guide/odbc.html
https://docs.snowflake.com/en/user-guide/jdbc.html

No, Snowflake doesn't offer tools for loading data from JDBC or ODBC data sources. This is because Snowflake is a database platform and the functionality you're describing is that of a data integration or ETL tool. There are plenty of third party tools available that can handle this such as Matillion or Talend. Snowflake has a list of recommended technology partners on their website.
If you don't have access to an ETL tool then, as you mentioned, you can create a process yourself to export data from NetSuite to files that are uploaded to cloud storage such as AWS S3. You can then set up this storage area as an "external stage" and use Snowflake's COPY statement to load the data into Snowflake.
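As a rough sketch of that last step, the statements for the external stage and the COPY could be built like this. The table, stage, bucket, and storage integration names are all hypothetical, and the CSV file format is only an assumption about how the NetSuite export was written. Executing the statements needs a Snowflake session (for example through the Snowflake Python connector), which is left out here.

```python
def build_load_statements(table, stage, s3_url, integration):
    """Return the SQL that registers an S3 location as an external stage
    and then loads its files into a table.

    Every identifier passed in is a hypothetical placeholder.
    """
    # Assumed export format: CSV with one header row and optional double quotes.
    file_format = "(TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '\"')"
    create_stage = (
        f"CREATE STAGE IF NOT EXISTS {stage} "
        f"URL = '{s3_url}' "
        f"STORAGE_INTEGRATION = {integration} "
        f"FILE_FORMAT = {file_format}"
    )
    # COPY INTO picks up the file format defined on the stage.
    copy_into = f"COPY INTO {table} FROM @{stage}"
    return [create_stage, copy_into]


statements = build_load_statements(
    table="netsuite_transactions",          # hypothetical target table
    stage="netsuite_stage",                 # hypothetical stage name
    s3_url="s3://my-bucket/netsuite/",      # hypothetical bucket and prefix
    integration="my_s3_integration",        # hypothetical storage integration
)
for statement in statements:
    print(statement + ";")
```

Each string can then be passed to whatever client you use to run SQL against Snowflake.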

Related

Is there a better approach to sync data from BigQuery to SingleStore through pipelines?

I have data in a BigQuery table and want to sync it to a SingleStore table. I can see the SingleStore pipeline documentation here: https://docs.singlestore.com/db/v7.8/en/reference/sql-reference/pipelines-commands/create-pipeline.html. It has an option to load data from GCS, and it seems to expect files in Google Cloud Storage. I am new to SingleStore; can somebody suggest a better approach? Should I use pipelines or not? I have created a query stream from BigQuery and now want to insert the data into the SingleStore DB from Node.js. Can we use a write stream to SingleStore? Can we use a pipeline to insert records from the BigQuery stream above?
The most efficient way to perform batch data movement from BigQuery to SingleStoreDB would be to perform exports of the data to GCS and use Pipelines to pull the data into SingleStoreDB. Pipelines are optimized for loading data into SingleStoreDB in parallel. If you export the data in Avro format, it will be even more efficient on both sides. It will likely be less complex and more efficient than trying to build the same workflow in Node.js.
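The export half of that workflow might look roughly like the sketch below, which uses the google-cloud-bigquery client to write a table to GCS as Avro. The project, dataset, table, bucket, and prefix names are hypothetical, and the SingleStoreDB pipeline that reads the exported files is not shown.

```python
def avro_export_uri(bucket, prefix):
    """Build a wildcard GCS URI so BigQuery can split the export into several files."""
    return f"gs://{bucket}/{prefix}-*.avro"


def export_table_to_gcs(project, dataset, table, bucket, prefix):
    """Export one BigQuery table to GCS as Avro and return the destination URI."""
    # Third-party package (google-cloud-bigquery); imported lazily so the
    # URI helper above can be used without it.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.AVRO

    uri = avro_export_uri(bucket, prefix)
    job = client.extract_table(
        f"{project}.{dataset}.{table}", uri, job_config=job_config
    )
    job.result()  # block until the export job finishes
    return uri


# Hypothetical bucket and prefix, only to show the shape of the URI.
print(avro_export_uri("my-bucket", "exports/orders"))
```

The pipeline would then be pointed at the same bucket and prefix.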

Read a Databricks table via Databricks api in Python?

Using Python-3, I am trying to compare an Excel (xlsx) sheet to an identical spark table in Databricks. I want to avoid doing the compare in Databricks. So I am looking for a way to read the spark table via the Databricks api. Is this possible? How can I go on to read a table: DB.TableName?
There is no way to read the table from the DB API as far as I am aware unless you run it as a job as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up you can use pyodbc to connect to databricks and run a query. At this time the ODBC driver will only allow you to run Spark-SQL commands.
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
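A minimal sketch of the pyodbc route, assuming a DSN for the Databricks ODBC driver has already been configured. The DSN name is hypothetical; DB.TableName is the placeholder from the question.

```python
def build_query(table, limit=None):
    """Build a Spark SQL SELECT for the table, optionally capped with LIMIT."""
    query = f"SELECT * FROM {table}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


def read_table(dsn, table, limit=None):
    """Run the query through an ODBC DSN and return (column names, rows)."""
    # Third-party package; imported lazily so build_query works without it.
    import pyodbc

    conn = pyodbc.connect(f"DSN={dsn}", autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(build_query(table, limit))
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
        return columns, rows
    finally:
        conn.close()


# The statement that would be sent to the cluster:
print(build_query("DB.TableName", limit=10))
```

Something like read_table("Databricks", "DB.TableName") would then return the column names and rows, which can be compared against the Excel sheet locally.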
I recommend that you write PySpark code in a notebook, call the notebook from a previously defined job, and establish a connection between your local machine and the Databricks workspace.
You could perform the comparison directly in Spark or convert the data frames to pandas if you wish. When the notebook finishes the comparison, it can return the result from that job run. I think sending whole Databricks tables through the API is probably not feasible because of API limits: the Spark cluster is there to perform complex operations, and the API should be used to send small messages.
Official documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task
returns a value through the dbutils.notebook.exit() call, you can use
this endpoint to retrieve that value. Azure Databricks restricts this
API to return the first 5 MB of the output. For returning a larger
result, you can store job results in a cloud storage service.

Read Azure Synapse table with Spark

I'm looking for, with no success, how to read an Azure Synapse table from Scala Spark. I found connectors on https://learn.microsoft.com for other Azure databases with Spark, but nothing for the new Azure Data Warehouse.
Does anyone know if it is possible?
It is now directly possible, and with trivial effort (there is even a right-click option added in the UI for this), to read data from a DEDICATED SQL pool in Azure Synapse (the new Analytics workspace, not just the DWH) for Scala (and unfortunately, ONLY Scala right now).
Within Synapse workspace (there is of course a write API as well):
val df = spark.read.sqlanalytics("<DBName>.<Schema>.<TableName>")
Outside of the integrated notebook experience, you need to add these imports:
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._
It sounds like they are working on expanding to SERVERLESS SQL pool, as well as other SDKs (e.g. Python).
Read top portion of this article as reference: https://learn.microsoft.com/en-us/learn/modules/integrate-sql-apache-spark-pools-azure-synapse-analytics/5-transfer-data-between-sql-spark-pool
Maybe I misunderstood your question, but normally you would use a JDBC connection in Spark to read data from a remote database.
Check this doc:
https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
Keep in mind that Spark would have to ingest the data from the Synapse tables into memory and perform the transformations there, so it is not going to push operations down into Synapse.
Normally, you want to run a SQL query against the source database and only bring the results of that query into a Spark dataframe.
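One way to bring only the query results into Spark is the generic JDBC data source with its "query" option, sketched here in PySpark (the same options apply from Scala). The server, database, credentials, and query are hypothetical placeholders, and the Microsoft SQL Server JDBC driver is assumed to be available on the cluster.

```python
def synapse_jdbc_options(server, database, user, password, query):
    """Collect options for Spark's JDBC reader; the query runs on the database side."""
    return {
        "url": f"jdbc:sqlserver://{server}:1433;database={database};encrypt=true",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "query": query,
        "user": user,
        "password": password,
    }


def read_query(spark, options):
    """Load the result of the query into a DataFrame using an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, only to show the shape of the options.
options = synapse_jdbc_options(
    server="myworkspace.sql.azuresynapse.net",
    database="mydb",
    user="loader",
    password="<secret>",
    query="SELECT region, SUM(amount) AS total FROM dbo.sales GROUP BY region",
)
print(options["url"])
```

In a notebook, read_query(spark, options) would return a DataFrame holding only the aggregated rows rather than the whole table.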

AWS Data Lake Ingest

Do you need to ingest Excel and other proprietary formats using Glue, or can you let Glue crawl your S3 bucket so that these data formats can be used within your data lake?
I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a data provider with a large set of data stored on their system as Excel and Access files.
Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.
Would using these files require running Glue on the data that is submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?
Thank you for any light you can shed on this topic.
-Guido
I'm not seeing anything that can take Excel data directly into the data lake. You might need to convert it into CSV, TSV, JSON or another format before loading it into the data lake.
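A conversion step along those lines could be sketched with pandas, writing one CSV per worksheet before the files are uploaded to S3. The file and directory names are hypothetical, and reading .xlsx files with pandas also requires the openpyxl package.

```python
from pathlib import Path


def csv_path_for(workbook_path, sheet_name, out_dir):
    """Name the CSV for one sheet, replacing non-alphanumeric characters with '_'."""
    stem = Path(workbook_path).stem
    safe_sheet = "".join(c if c.isalnum() else "_" for c in sheet_name)
    return Path(out_dir) / f"{stem}__{safe_sheet}.csv"


def workbook_to_csv(workbook_path, out_dir):
    """Write every sheet of an Excel workbook to its own CSV file."""
    # Third-party package; imported lazily so the naming helper works without it.
    import pandas as pd

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # sheet_name=None returns a dict mapping sheet name -> DataFrame.
    sheets = pd.read_excel(workbook_path, sheet_name=None)
    written = []
    for sheet_name, frame in sheets.items():
        target = csv_path_for(workbook_path, sheet_name, out_dir)
        frame.to_csv(target, index=False)
        written.append(target)
    return written


# Hypothetical workbook and sheet, only to show the naming scheme.
print(csv_path_for("data/sales.xlsx", "Q1 2017", "out").name)
```

The resulting CSV files are in a format that both Athena and Redshift Spectrum list as supported.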
Formats Supported by Redshift Spectrum:
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- Again I don't see Excel as of now.
Athena Supported File Formats:
http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- Excel is not supported here either.
You need to upload the files to S3 to use Athena, Redshift Spectrum, or even Redshift storage itself.
Uploading Files to S3:
If you have bigger files, use S3 multipart upload to upload them faster. If you want more speed, use S3 Transfer Acceleration to upload your files.
Querying Big Data with Athena:
You can create external tables in Athena from S3 locations. Once you have created the external tables, use the Athena SQL reference to query your data.
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
Querying Big Data with Redshift Spectrum:
Similar to Athena, you can create external tables with Redshift. Start querying those tables and get the results on Redshift.
Redshift works with a lot of commercial tools; I use SQL Workbench. It is free, open source and rock solid, and supported by AWS.
SQL WorkBench: http://www.sql-workbench.net/
Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
Copying data to Redshift:
Also, if you want to move the data storage to Redshift, you can use the COPY command to pull the data from S3 and load it into Redshift.
Copy Command Examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
Redshift Cluster Size and Number of Nodes:
Before creating a Redshift cluster, check the required size and the number of nodes needed. More nodes let queries run with more parallelism. One more important factor is how well your data is distributed (distribution key and sort keys).
I have had a very good experience with Redshift, though getting up to speed might take some time.
Hope it helps.

Data transfer from Hive to Google Storage/Big Query

I have some Hive tables in an on-premise hadoop cluster.
I need to transfer the tables to BigQuery in google cloud.
Can you suggest any google tools or any open source tools for the data transfer?
Thanks in advance
BigQuery can import Avro files.
This means you can do something like INSERT overwrite table target_avro_hive_table SELECT * FROM source_hive_table;
You can then load the underlying .avro files into BigQuery via the bq command line tool or using the console UI:
bq load --source_format=AVRO your_dataset.something something.avro
Using the BigQuery migration assessment feature, we can migrate from an existing data warehouse to BigQuery.
https://cloud.google.com/bigquery/docs/migration-assessment
