Is it possible to use PySpark to insert data into couchbase? - apache-spark

Is it possible to use PySpark to load CSV data into Couchbase? I was on the Couchbase website, but I didn't see support for PySpark, just Python.
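The Couchbase Spark Connector is a JVM library, but Spark data sources are addressed by name, so PySpark can usually drive it through the DataFrame writer once the connector jar is on the classpath. A minimal sketch under that assumption; the data source name, config keys, bucket and paths below are assumptions to verify against the connector documentation for your version:

from pyspark.sql import SparkSession

# Assumes the Couchbase Spark Connector jar is supplied via --packages or --jars.
spark = (
    SparkSession.builder
    .appName("CsvToCouchbase")
    # Hypothetical connection settings; the exact config keys depend on the connector version.
    .config("spark.couchbase.connectionString", "couchbase://localhost")
    .config("spark.couchbase.username", "user")
    .config("spark.couchbase.password", "password")
    .getOrCreate()
)

# Read the CSV into a DataFrame (path and options are placeholders).
df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")

# Write through the connector's key-value data source (format name and option key are assumptions).
df.write.format("couchbase.kv").option("bucket", "my-bucket").save()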

Related

Automatically update Glue Data Catalog after writing parquet to S3 using Pyspark

Is there a way to automatically update the Glue Data Catalog (without running a separate crawler) after writing Parquet files to S3 from PySpark?
The awswrangler Python package can do it by specifying database and table in s3.to_parquet(). Is there an equivalent way to do it in PySpark?
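One common pattern, assuming the Spark session is configured to use the Glue Data Catalog as its Hive metastore (the EMR "Use Glue Data Catalog for table metadata" setting), is to let saveAsTable register the table while writing; database, table and path names below are placeholders:

# Sketch: relies on the cluster using the AWS Glue Data Catalog as the Hive metastore.
# df is the dataframe you want to persist.
(
    df.write
    .mode("overwrite")
    .format("parquet")
    .option("path", "s3://my-bucket/warehouse/my_table/")  # placeholder S3 location
    .saveAsTable("my_database.my_table")                   # registers/updates the table in Glue
)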

Pyspark on EMR and external hive/glue - can drop but not create tables via sqlContext

I'm writing a dataframe to an external Hive table from PySpark running on EMR. The work involves dropping/truncating data from an external Hive table, writing the contents of a dataframe into the aforementioned table, then writing the data from Hive to DynamoDB. I am looking to write to an internal table on the EMR cluster, but for now I would like the Hive data to be available to subsequent clusters. I could write to the Glue catalog directly and force it to be registered, but that is a step further than I need to go.
All components work fine individually on a given EMR cluster: I can create an external Hive table on EMR, either using a script or ssh and the Hive shell. This table can be queried by Athena and read by PySpark. I can create a dataframe and INSERT OVERWRITE the data into the aforementioned table in PySpark.
I can then use the Hive shell to copy the data from the Hive table into a DynamoDB table.
I'd like to wrap all of the work into one PySpark script instead of having to submit multiple distinct steps.
I am able to drop tables using
sqlContext.sql("drop table if exists default.my_table")
When I try to create a table using sqlContext.sql("create table default.mytable(id string,val string) STORED AS ORC") I get the following error:
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-xx-xxx-xx-xxx/xx.xxx.xx.xx to ip-xxx-xx-xx-xx:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ip-xxx-xx-xx-xx:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
I can't figure out why I can create an external Hive table in Glue using the Hive shell on the cluster, and drop the table using the Hive shell or the PySpark sqlContext, but I can't create a table using sqlContext. I have checked around and the solutions offered (copying hive-site.xml) don't make sense in this context, as I can clearly write to the required addresses without hassle, just not in PySpark. And it is doubly strange that I can drop the tables, and they are definitely dropped when I check in Athena.
Running on:
emr-5.28.0,
Hadoop distribution Amazon 2.8.5
Spark 2.4.4
Hive 2.3.6
Livy 0.6.0 (for notebooks but my experimentation is via ssh and pyspark shell)
It turns out I could create tables via a spark.sql() call as long as I provided a location for the tables. It seems the Hive shell doesn't require a location, yet spark.sql() does. Unexpected, but not entirely surprising.
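For example, the failing CREATE TABLE from the question goes through once an explicit LOCATION is supplied (the S3 path below is a placeholder):

# Sketch: the original CREATE TABLE plus an explicit LOCATION, which is what made it work here.
sqlContext.sql("""
    create table default.mytable (id string, val string)
    stored as orc
    location 's3://my-bucket/tables/mytable/'
""")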
Complementing @Zeathor's answer: after configuring the EMR and Glue connection and permissions (you can learn more here: https://www.youtube.com/watch?v=w20tapeW1ME), you will just need to write Spark SQL commands:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TestSession').getOrCreate()
spark.sql("create database if not exists test")
You can then create your tables from dataframes:
df.createOrReplaceTempView("first_table")
spark.sql("create table test.table_name as select * from first_table")
All the database and table metadata will then be stored in the AWS Glue Data Catalog.

Reading data across clusters in Apache Spark

I have a huge table which I am loading from Redshift into a CSV file on S3 using a Databricks (DBx) notebook, notebookA. This notebook is running on clusterA.
I have another notebook, notebookB, which reads the data from the CSV files in S3 into a dataframe. NotebookB is running on clusterB.
Now I want to access this dataframe in a third notebook, notebookC, which is on clusterC.
How can I do this?
registerTempTable is session-specific.
createGlobalTempView can be accessed across notebooks, but not across notebooks on different clusters.
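Since both kinds of view live inside a single Spark application, the usual workaround is to persist the dataframe to shared storage (or a metastore-backed table) from clusterB and read it back on clusterC; a sketch with placeholder paths and names:

# On clusterB: persist the dataframe to shared storage (path is a placeholder).
df.write.mode("overwrite").parquet("s3://shared-bucket/exchange/my_dataset/")

# Optionally register it as a metastore table so other clusters can query it by name:
# df.write.mode("overwrite") \
#     .option("path", "s3://shared-bucket/exchange/my_dataset/") \
#     .saveAsTable("shared_db.my_dataset")

# On clusterC: read it back into a dataframe.
df_c = spark.read.parquet("s3://shared-bucket/exchange/my_dataset/")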

How to save Spark Dataframe to Hana Vora table?

We have a file that we want to split into three, and we need to perform some data cleanup on it before it can be imported into Hana Vora; otherwise everything has to be typed as String, which is not ideal.
We can import and prepare the DataFrames in Spark just fine, but when I try to write either to the HDFS filesystem or, better, to save as a table in the "com.sap.spark.vora" data source, I get errors.
Can anyone advise on a reliable way to import the Spark-prepared datasets into Hana Vora? Thanks!
Vora currently only officially supports appending data to an existing table (using the APPEND statement). For details see the SAP HANA Vora Developer Guide, chapter "3.5 Appending Data to Existing Tables".
This means you would have to create an intermediate file. Vora supports reading from CSV, ORC and Parquet files. A dataframe can be saved to ORC or Parquet files directly from Spark (see https://spark.apache.org/docs/1.6.1/sql-programming-guide.html). To write to CSV files from Spark, see https://github.com/databricks/spark-csv
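A sketch of that intermediate-file step in PySpark; the staging path is a placeholder, and in Spark 1.x the ORC writer needs a HiveContext-backed dataframe:

# Write the prepared dataframe to a staging location that Vora can read from.
df.write.format("orc").mode("overwrite").save("hdfs:///tmp/vora_staging/my_table/")
# Then append the staged files into the existing Vora table with Vora's APPEND
# statement, as described in the Developer Guide chapter cited above.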

Using spark dataFrame to load data from HDFS

Can we use DataFrames while reading data from HDFS?
I have tab-separated data in HDFS.
I googled, but only saw examples using it with NoSQL data sources.
DataFrames are certainly not limited to NoSQL data sources. Parquet, ORC and JSON support is provided natively in Spark 1.4 through 1.6.1; text-delimited files are supported using the spark-csv package.
If you have your TSV file in HDFS at /demo/data, then the following code will read the file into a DataFrame:
sqlContext.read.
format("com.databricks.spark.csv").
option("delimiter","\t").
option("header","true").
load("hdfs:///demo/data/tsvtest.tsv").show
To run the code from spark-shell, launch it with the following packages option:
--packages com.databricks:spark-csv_2.10:1.4.0
In Spark 2.0, CSV is natively supported, so you should be able to do something like this:
spark.read.
option("delimiter","\t").
option("header","true").
csv("hdfs:///demo/data/tsvtest.tsv").show
If I am understanding correctly, you essentially want to read data from HDFS and have it converted to a DataFrame automatically.
If that is the case, I would recommend the spark-csv library mentioned above; it has very good documentation.
