Pyspark: isDeltaTable running forever - apache-spark

I want to check whether a Delta table in an S3 bucket is actually a Delta table. I am trying to do this with:
from delta import *
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

if DeltaTable.isDeltaTable(spark, "s3a://landing-zone/table_name/year=2022/month=2/part-0000-xyz.snappy.parquet"):
    print("bla")
else:
    print("blabla")
This code runs forever without returning any result. I tested it with a local Delta table and there it works. When I trim the path URL so it stops after the actual table name, the code shows the same behavior. I also created a boto3 client and I can see the bucket list when calling s3.list_buckets(). Do I need to pass the client into the if statement somehow?
Thanks a lot in advance!

My mistake: I forgot that it is not enough to just create a boto3 client, I also have to set up the actual connection to S3 for Spark via
spark._jsc.hadoopConfiguration().set(...)
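For reference, a minimal sketch of what those configuration calls can look like for S3A; the credential values are placeholders and the exact keys depend on your setup:

# Minimal sketch: point Spark's S3A filesystem at your credentials (placeholder values).
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
# Often not required, depending on the Hadoop distribution:
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# isDeltaTable checks for the _delta_log under the given path, so point it at the
# table's root directory rather than at an individual parquet file.
print(DeltaTable.isDeltaTable(spark, "s3a://landing-zone/table_name/"))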

Related

How to get the file/files created by Spark df.write?

I have a requirement to capture the parquet files created as the outcome of a df.write.parquet("s3://bkt/folder", mode="append") command.
I am running this on AWS EMR with PySpark.
I can achieve this using awswrangler with wr.s3.to_parquet(), but it is not really a fit for my EMR Spark use case.
Is there such functionality?
I want the list of files from s3://bkt/folder which Spark wrote.
Thanks all.
If you want a list of files that Spark wrote to a particular S3 path, you can use either of the approaches below.
Use input_file_name, which gives the file path each record originates from, then select the distinct filenames:
from pyspark.sql.functions import input_file_name

df = spark.read.parquet("s3://bkt/folder")
files_df = df.withColumn("filename", input_file_name()).select("filename").distinct()
files_df.show(truncate=False)
Or you can use boto3 to list the files :
from boto3 import client
conn = client('s3') # again assumes boto.cfg setup, assume AWS S3
for key in conn.list_objects(Bucket='bucket_name')['Contents']:
    print(key['Key'])
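If the folder holds more than 1000 objects, list_objects truncates the response, so a paginator is safer. A sketch, assuming the same bucket name and a folder/ prefix as in the question:

import boto3

# Paginate so listings with more than 1000 keys are returned completely.
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='bkt', Prefix='folder/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])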

Writing data into Snowflake using Python

Can we write data directly into a Snowflake table without using a Snowflake internal stage, using Python?
It seems like an auxiliary task to write to a stage first, then transform it, and then load it into the table. Can it be done in one step, just like a JDBC connection in an RDBMS?
The absolute fastest way to load data into Snowflake is from a file on either internal or external stage. Period. All connectors have the ability to insert the data with standard insert commands, but this will not perform as well. That said, many of the Snowflake drivers are now transparently using PUT/COPY commands to load large data to Snowflake via internal stage. If this is what you are after, then you can leverage the pandas write_pandas command to load data from a pandas dataframe to Snowflake in a single command. Behind the scenes, it will execute the PUT and COPY INTO for you.
https://docs.snowflake.com/en/user-guide/python-connector-api.html#label-python-connector-api-write-pandas
I highly recommend this pattern over INSERT commands in any driver. And I would also recommend transforms be done AFTER loading to Snowflake, not before.
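For illustration, a minimal sketch of that write_pandas pattern; the connection parameters and table name are placeholders, and it assumes the target table already exists with matching columns:

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Placeholder credentials and object names, for illustration only.
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
)

df = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["a", "b", "c"]})

# write_pandas stages the dataframe (PUT) and runs COPY INTO behind the scenes.
# It assumes MY_TABLE already exists with matching columns.
success, nchunks, nrows, _ = write_pandas(conn, df, table_name="MY_TABLE")
print(success, nchunks, nrows)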
If someone is having issues with large datasets, try using dask instead and generate your dataframe partitioned into chunks. Then you can use dask.delayed with SQLAlchemy. Here we are using Snowflake's native connector method, i.e. pd_writer, which under the hood uses write_pandas and eventually PUT and COPY with a compressed parquet file. In the end it comes down to your I/O bandwidth, to be honest: the more throughput you have, the faster it gets loaded into the Snowflake table. But this snippet provides a decent amount of parallelism overall.
import functools

import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from snowflake.connector.pandas_tools import pd_writer

df = dd.read_csv(csv_file_path, blocksize='64MB')
ddf_delayed = df.to_sql(
    table_name.lower(),
    uri=str(engine.url),
    schema=schema_name,
    if_exists=if_exists,
    index=False,
    method=functools.partial(pd_writer, quote_identifiers=False),
    compute=False,
    parallel=True
)

with ProgressBar():
    dask.compute(ddf_delayed, scheduler='threads', retries=3)
Java:
Load Driver Class:
Class.forName("net.snowflake.client.jdbc.SnowflakeDriver")
Maven:
Add following code block as a dependency
<dependency>
    <groupId>net.snowflake</groupId>
    <artifactId>snowflake-jdbc</artifactId>
    <version>{version}</version>
</dependency>
Spring :
application.yml:
spring:
  datasource:
    hikari:
      maximumPoolSize: 4 # Specify maximum pool size
      minimumIdle: 1 # Specify minimum pool size
      driver-class-name: net.snowflake.client.jdbc.SnowflakeDriver
Python :
import pyodbc

# pyodbc connection string
conn = pyodbc.connect("Driver={SnowflakeDSIIDriver}; Server=XXX.us-east-2.snowflakecomputing.com; Database=VAQUARKHAN_DB; schema=public; UID=username; PWD=password")

# Cursor
cus = conn.cursor()

# Execute a SQL statement to get the current date and store the result in the cursor
cus.execute("select current_date;")

# Display the content of the cursor
row = cus.fetchone()
print(row)
Apache Spark:
<dependency>
<groupId>net.snowflake</groupId>
<artifactId>spark-snowflake_2.11</artifactId>
<version>2.5.9-spark_2.4</version>
</dependency>
Code
import org.apache.spark.sql.DataFrame

// Use the secrets DBUtil to get Snowflake credentials.
val user = dbutils.secrets.get("data-warehouse", "<snowflake-user>")
val password = dbutils.secrets.get("data-warehouse", "<snowflake-password>")
val options = Map(
  "sfUrl" -> "<snowflake-url>",
  "sfUser" -> user,
  "sfPassword" -> password,
  "sfDatabase" -> "<snowflake-database>",
  "sfSchema" -> "<snowflake-schema>",
  "sfWarehouse" -> "<snowflake-cluster>"
)

// Generate a simple dataset containing five values and write the dataset to Snowflake.
spark.range(5).write
  .format("snowflake")
  .options(options)
  .option("dbtable", "<snowflake-table>")
  .save()

// Read the data written by the previous cell back.
val df: DataFrame = spark.read
  .format("snowflake")
  .options(options)
  .option("dbtable", "<snowflake-table>")
  .load()
display(df)
Fastest way to load data into Snowflake is from a file
https://community.snowflake.com/s/article/How-to-Load-Terabytes-Into-Snowflake-Speeds-Feeds-and-Techniques
https://bryteflow.com/how-to-load-terabytes-of-data-to-snowflake-fast/
https://www.snowflake.com/blog/ability-to-connect-to-snowflake-with-jdbc/
https://docs.snowflake.com/en/user-guide/jdbc-using.html
https://www.persistent.com/blogs/json-processing-in-spark-snowflake-a-comparison/

merge multiple small json files using pyspark in s3 [duplicate]

This question already has answers here:
Spark - Reading JSON from Partitioned Folders using Firehose
(2 answers)
Closed 3 years ago.
I am a newbie in Spark.
I have multiple small JSON files (~1 KB each) in subdirectories of my S3 bucket. I want to merge all the files present in a single directory. Is there an optimized way of doing this using PySpark?
Directory structure:
region/year/month/day/hour/multiple_json_files
I have many directories as indicated above and want to merge all files within a single directory.
P.S.: I have tried using plain Python but it takes too long, and I tried s3distcp with the same result.
Can anyone please help me with this?
You can achieve this with the code below.
First we need to make sure the hadoop aws package is available when we load spark:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
Next we need to make pyspark available in the jupyter notebook:
import findspark
findspark.init()
from pyspark.sql import SparkSession
We need the aws credentials in order to be able to access the s3 bucket. We can use the configparser package to read the credentials from the standard aws file.
import configparser

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))

aws_profile = "default"  # name of the profile in ~/.aws/credentials
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
We can start the spark session and pass the aws credentials to the hadoop configuration:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)
Finally we can read the data and display it:
df=spark.read.json("s3n://path_of_location/*.json")
df.show()
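Since the question is about merging, a minimal sketch of writing the small files from one directory back out as a single file; the paths are placeholders and coalesce(1) assumes the merged data fits comfortably in one partition:

# Read every small JSON file under one hourly directory and rewrite it as a single file.
df = spark.read.json("s3n://bucket/region/year/month/day/hour/*.json")
df.coalesce(1).write.mode("overwrite").json("s3n://bucket/merged/region/year/month/day/hour/")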

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in Databricks in order to get a Spark session and use it to open a CSV file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError:name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow this example (it will make more sense if you watch it from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it to work by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the PySpark code, as I found that reading CSV was working in the interactive shell.
Please note that the example code you are using is for Spark version 2.x.
"spark" and "SparkSession" are not available in Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.
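If you are on Spark 2.x and spark still is not defined, a minimal sketch of creating the session explicitly and checking the version, as suggested above:

from pyspark.sql import SparkSession

# Create (or reuse) the session explicitly instead of relying on a predefined `spark`.
spark = SparkSession.builder.appName("csv-example").getOrCreate()
print(spark.version)  # confirm you are actually on Spark 2.x

fireServiceCallsDF = spark.read.csv(
    '/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv',
    header=True,
    inferSchema=True
)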

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a PySpark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and PySpark should be able to use this interface to create an RDD via the method "newAPIHadoopRDD".
http://spark.apache.org/docs/latest/api/python/pyspark.html
Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Is there anybody who has figured out how to do this?
Google now has an example of how to use the BigQuery connector with Spark.
There does seem to be a problem using the GsonBigQueryInputFormat, but I got a simple Shakespeare word-counting example working:
import json
import pyspark

sc = pyspark.SparkContext()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")

conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata", "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare"}

tableData = (sc.newAPIHadoopRDD("com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
                                "org.apache.hadoop.io.LongWritable",
                                "com.google.gson.JsonObject",
                                conf=conf)
             .map(lambda k: json.loads(k[1]))
             .map(lambda x: (x["word"], int(x["word_count"])))
             .reduceByKey(lambda x, y: x + y))

print(tableData.take(10))

Resources