I am trying to insert a DataFrame into Cassandra.
When I write
rdd.saveToCassandra("keyspace", "table")
there is no problem, but I can't write with this function:
myDataFrame.saveToCassandra("keyspace", "table")
I also tried the following, but it didn't save:
myDataFrame.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="mytable", keyspace="mykeyspace").save()
Do you have any idea, other than the new API for Spark 2.0?
Thanks
For Python there is currently no streaming Sink for Cassandra in the Spark Cassandra Connector; you will have to implement your own.
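In later Spark versions (2.4+), a common workaround is foreachBatch, which lets you reuse the batch DataFrame writer for each streaming micro-batch. A minimal PySpark sketch, assuming the spark-cassandra-connector package is on the classpath, spark.cassandra.connection.host is configured, and streaming_df, mykeyspace and mytable are placeholders:
def write_to_cassandra(batch_df, batch_id):
    # reuse the non-streaming Cassandra writer for each micro-batch
    batch_df.write.format("org.apache.spark.sql.cassandra").mode("append").options(table="mytable", keyspace="mykeyspace").save()

query = streaming_df.writeStream.foreachBatch(write_to_cassandra).outputMode("append").start()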
Related
How can I create a custom write format for a Spark DataFrame so I can use it like df.write.format("com.mycompany.mydb").save()? I've tried reading through the Datastax Cassandra connector code but still couldn't figure it out.
Spark 3.0 completely changes the API. Some new interfaces, e.g. TableProvider and SupportsWrite, have been added.
You might find this guide on using Spark's DataSourceV2 helpful.
If you are using a Spark version < 2.3, then you can use the Spark Data Source API V1.
I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Example:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?
I am just wondering how the SQL is being processed: in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, better parser), but this is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running DF.count() and tracking the job via the Spark UI at http://localhost:4040.
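As a quick illustration of the RDD-to-DataFrame path mentioned above, here is a minimal PySpark sketch, assuming an active SparkSession named spark; the data and column names are made up:
rdd = spark.sparkContext.parallelize([(1, "alice"), (2, "bob")])
df = rdd.toDF(["id", "name"])                     # convert the RDD to a DataFrame
df.createOrReplaceTempView("people")              # make it queryable from SparkSQL
spark.sql("select count(*) from people").show()   # this job runs on Spark and is visible at http://localhost:4040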
I am working with HDP 2.6.4; to be more specific, Hive 1.2.1 with Tez 0.7.0 and Spark 2.2.0.
My task is simple: store data in the ORC file format, then use Spark to process the data. To achieve this, I am doing the following:
Create a Hive table through HiveQL
Use spark.sql("select ... from ...") to load the data into a DataFrame
Process the DataFrame
My questions are:
1. What is Hive's role behind the scenes?
2. Is it possible to skip Hive?
You can skip Hive and use SparkSQL to run the command in step 1.
In your case, Hive is defining a schema over your data and providing a query layer through which Spark and external clients can communicate.
Otherwise, spark.read.orc and df.write.orc exist for reading and writing DataFrames directly on the filesystem, as in the sketch below.
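A minimal sketch of that Hive-free path, with illustrative HDFS paths and column names:
df = spark.read.orc("/data/events_orc")                     # read ORC files directly, no Hive metastore involved
processed = df.filter(df["status"] == "ok")                 # any DataFrame processing
processed.write.mode("overwrite").orc("/data/events_out")   # write the result back as ORC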
I am trying to write a Spark job in Java that would open a JDBC connection with Impala and let me load a table and perform other operations.
How do I do this? Any example would be of great help. Thank you!
If using JDBC is a must, what you might want to try is executing the query in the Spark driver.
E.g. using impyla for Python, you'd get the result from Impala as a plain list of tuples. Later you can convert this result to a Spark RDD with parallelize().
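A minimal PySpark sketch of that approach, assuming impyla is installed and that the host, port, table and column names below are placeholders:
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)      # placeholder connection details
cur = conn.cursor()
cur.execute("SELECT id, name FROM my_table")
rows = cur.fetchall()                                # a plain list of tuples on the driver

rdd = spark.sparkContext.parallelize(rows)           # distribute the result as an RDD
df = spark.createDataFrame(rdd, ["id", "name"])      # or rdd.toDF(["id", "name"])
Note that the whole result set passes through the driver, so this only suits reasonably small results.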
I use Spark 1.3.1.
How do I store/save DataFrame data to the Hive metastore?
In Hive, if I run show tables, the DataFrame does not appear as a table in the Hive databases. I have copied hive-site.xml to $SPARK_HOME/conf, but it didn't help (and the DataFrame does not appear in the Hive metastore either).
I am following this document, using Spark 1.4.
dataframe.registerTempTable("people")
How do I analyze the Spark table in Hive?
You can use the insertInto method to save a DataFrame into Hive.
Example:
dataframe.insertInto("table_name", false);
// the 2nd arg specifies whether the table should be overwritten
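Note that insertInto expects the target table to already exist. In Spark 1.4+ the DataFrameWriter API can also create a table in the metastore; a minimal PySpark sketch with illustrative table names:
dataframe.write.mode("overwrite").saveAsTable("people")          # creates a managed table visible to show tables in Hive
dataframe.write.insertInto("existing_table", overwrite=False)    # appends into a table that already exists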
Got the solution. I was using Spark 1.3.1, which does not support all of this. After upgrading to Spark 1.5.1 the problem was resolved.
I have noticed DataFrames working fully only after 1.4.0; many commands are deprecated.