RegisterTempTable using dataset Spark Java - apache-spark

I have been using DataFrames in my Java Spark project (Spark version 1.6.1).
Now I am refactoring, trying to use Datasets in order to exploit the strong typing that comes with them.
In some parts of the project I was using the following code:
dataframe.registerTempTable("table")
in order to run pure SQL queries.
This feature does not seem to be present for Datasets; I cannot find any similar method offered by them.
Can you confirm that?

I can confirm that there is no method available in Spark 1.6 for registering a temp table or view from a Dataset:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Dataset.html
These methods were introduced in Spark 2.0:
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html

Use createOrReplaceTempView:
public void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Parameters:
viewName - (undocumented)
Since:
2.0.0
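
For example, here is a minimal Java sketch of the Spark 2.0+ replacement for registerTempTable; the input path, view name, and query are just placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TempViewExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("temp-view-example").master("local[*]").getOrCreate();

        Dataset<Row> people = spark.read().json("people.json"); // placeholder input

        // Equivalent of the old dataframe.registerTempTable("table")
        people.createOrReplaceTempView("people");

        // Pure SQL queries now run against the temporary view
        spark.sql("SELECT name FROM people WHERE age >= 18").show();

        spark.stop();
    }
}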

Related

Apache Spark + Cassandra + Java + Spark session to display all records

I am working on a Spring Java project and integrating Apache Spark and Cassandra using the DataStax connector.
I have autowired SparkSession and the lines of code below seem to work.
Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", tableName.toLowerCase());
Dataset<Row> ds = sparkSession.sqlContext().read().format("org.apache.spark.sql.cassandra").options(configMap)
.load();
ds.show();
But this always gives me 20 records. I want to select all the records of the table. Can someone tell me how to do this?
Thanks in advance.
show always outputs 20 records by default, although you can pass an argument to specify how many rows you need. But show is usually used just to briefly examine the data, especially when working interactively.
In your case, everything really depends on what you want to do with the data. You have already successfully loaded it with the load function; after that you can just start using normal Spark operations - select, filter, groupBy, etc.
P.S. You can find here more examples of using the Spark Cassandra Connector (SCC) from Java, although it's more cumbersome than using Scala... And I recommend making sure that you're using SCC 2.5.0 or higher because of the many new features there.
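
For illustration, a small sketch that continues the snippet above and goes beyond show()'s default of 20 rows; the column name is a placeholder:

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Show more rows explicitly; the second argument disables truncation of long values.
ds.show((int) ds.count(), false);

// Or keep working with the full Dataset using normal transformations and actions.
Dataset<Row> filtered = ds.filter(col("some_column").isNotNull()); // placeholder column
long total = filtered.count();

// Collecting every row to the driver is only advisable for small tables.
List<Row> allRows = ds.collectAsList();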

Spark SQL encapsulation of data sources

I have a Dataset where 98% of the data (older than one day) is in Parquet files and 2% (the current day's real-time feed) is in HBase. I always need to union them to get the final data set for that particular table or entity.
So I would like my clients to consume the data seamlessly, like below, in any language they use to access Spark, via the spark shell, or from any BI tool:
spark.read.format("my.datasource").load("entity1")
Internally I will read entity1's data from Parquet and HBase, union them, and return the result.
I googled and found a few examples of extending DataSourceV2. Most of them say you need to develop a reader, but here I do not need a new reader; I need to make use of the existing ones (Parquet and HBase).
Since I am not introducing any new data source as such, do I need to create a new data source? Or is there a higher-level abstraction/hook available?
You have to implement a new data source per se ("parquet+hbase"); in the implementation you will make use of the existing Parquet and HBase readers, perhaps by composing them and unioning their results.
For your reference, here are some links that can help you implement a new data source.
Spark "bigquery" data source implementation:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
Implementing a custom data source:
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
After going through various resources, below is what I found and implemented; it might help someone, so I am adding it as an answer.
A custom data source is required only if we introduce a new data source. For combining existing data sources we have to extend SparkSession and DataFrameReader. In the extended DataFrameReader we can invoke the Spark Parquet read method and the HBase reader, get the corresponding Datasets, then combine them and return the combined Dataset.
In Scala we can use implicits to add the custom logic to SparkSession and DataFrame.
In Java we need to extend SparkSession and DataFrameReader, and then import the extended classes wherever they are used. A rough sketch of the combining step is shown below.
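
A minimal Java sketch of that combining step, assuming both sources expose the same schema; the paths, the HBase connector format name, and the catalog option are assumptions you would replace with your own setup:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EntityReader {

    private final SparkSession spark;

    public EntityReader(SparkSession spark) {
        this.spark = spark;
    }

    // Reads historical data from Parquet and current-day data from HBase, then unions them.
    public Dataset<Row> loadEntity(String parquetPath, String hbaseCatalog) {
        Dataset<Row> historical = spark.read().parquet(parquetPath);

        Dataset<Row> current = spark.read()
                .format("org.apache.spark.sql.execution.datasources.hbase") // example HBase connector, adjust to yours
                .option("catalog", hbaseCatalog) // assumed option name for the connector's table mapping
                .load();

        // unionByName (Spark 2.3+) matches columns by name; both sources must share a schema.
        return historical.unionByName(current);
    }
}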

What is the difference between a Java object (how it is represented in memory) and a Spark SQL object?

I have read in many articles and in the book "Spark: The Definitive Guide" that the Spark SQL structured data representation (at a low level) is not the same as Java objects.
A line in the book reads:
"Beginning with Spark 1.0, the project added Spark SQL, a new API for working with structured data - tables with structured data format that is not tied to Java's in-memory representation."
If the low-level representation is different from the default representation used by the JRE, then how can the JRE correctly read/write those objects?
Can someone please help me understand this.
Thanks!
I am unable to find any articles related to this.
"Beginning with Spark 1.0, the project added Spark SQL, a new API for working with structured data - tables with structured data format that is not tied to Java's in-memory representation." -> This is Dataset or Row objects
https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/Dataset.html
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html
It mean its not custom java object,rather its a spark type object.
You can also create custom java objects in spark using rdd and spark-sql as well.
Also i advice to go through https://spark.apache.org/docs/latest/rdd-programming-guide.html and not look at spark 1.0
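
To make the distinction concrete, here is a small Java sketch; the Person bean and the data are made up for illustration. Spark keeps the rows in its own binary (Tungsten) format and uses an Encoder to convert to and from JVM objects only when your code actually touches them:

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EncoderExample {

    // A plain Java bean: this is how *your* code sees a record.
    public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("encoder-example").master("local[*]").getOrCreate();

        // The Encoder describes how to serialize Person into Spark's internal row format.
        Dataset<Person> people = spark.createDataset(
                Arrays.asList(new Person("Alice", 30), new Person("Bob", 25)),
                Encoders.bean(Person.class));

        // Spark stores and processes the data as binary rows, not as Person instances;
        // it deserializes back into Person objects only when they are handed to JVM code.
        people.printSchema();
        people.show();

        spark.stop();
    }
}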

How to create custom writer for Spark Dataframe?

How can I create a custom write format for a Spark DataFrame so I can use it like df.write.format("com.mycompany.mydb").save()? I've tried reading through the DataStax Cassandra connector code, but I still couldn't figure it out.
Spark 3.0 completely changes the API. Some new interfaces, e.g. TableProvider and SupportsWrite, have been added.
You might find this guide helpful: Using Spark's DataSourceV2.
If you are using a Spark version < 2.3, then you can use Data Source API V1. A rough sketch of the V1 write path is shown below.
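
Here is a minimal, hedged sketch of the Data Source API V1 write path (CreatableRelationProvider). The package name follows the df.write.format("com.mycompany.mydb") example above, since Spark resolves that string by loading com.mycompany.mydb.DefaultSource; the actual write logic is left as a TODO:

package com.mycompany.mydb;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.CreatableRelationProvider;
import org.apache.spark.sql.types.StructType;

public class DefaultSource implements CreatableRelationProvider {

    @Override
    public BaseRelation createRelation(SQLContext sqlContext, SaveMode mode,
                                       scala.collection.immutable.Map<String, String> parameters,
                                       Dataset<Row> data) {
        // Write each partition on the executors; connection handling would live here.
        data.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            while (rows.hasNext()) {
                Row row = rows.next();
                // TODO: send `row` to your database.
            }
        });

        // Return a relation describing what was written.
        return new BaseRelation() {
            @Override
            public SQLContext sqlContext() { return sqlContext; }
            @Override
            public StructType schema() { return data.schema(); }
        };
    }
}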

Creating a UDF in spark

I am trying to create a permanent function in Spark using geomesa-spark-jts.
geomesa-spark-jts has huge potential in the larger LocationTech community.
I started by downloading geomesa-spark-jts, which contains the following:
After that I launched Spark like this (I made sure that the jar is on the path):
Now when I use ST_Translate, which comes with that package, it does give me a result:
But the problem is that when I try to define ST_Translate as a UDF, I get the following error:
The functions you mentioned are already supported in GeoMesa 2.0.0 for Spark 2.2.0. http://www.geomesa.org/documentation/user/spark/sparksql_functions.html
The geomesa-accumulo-spark-runtime jar is a shaded jar that includes the code from geomesa-spark-jts. You might be hitting issues with having the classes defined in two different jars.
In order to use st_translate with hive, I believe that you would have to implement a new class that extends org.apache.hadoop.hive.ql.exec.UDF and invokes the GeoMesa function.
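
For what it's worth, here is a rough sketch of that Hive UDF wrapper. It uses JTS's AffineTransformation directly instead of calling into GeoMesa, and the class name, the WKT-in/WKT-out signature, and the JTS package (org.locationtech vs. the older com.vividsolutions) are assumptions to adapt to your setup:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.util.AffineTransformation;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;
import org.locationtech.jts.io.WKTWriter;

public class HiveSTTranslate extends UDF {

    // Hive resolves evaluate() by reflection.
    public String evaluate(String wkt, double dx, double dy) throws ParseException {
        if (wkt == null) {
            return null;
        }
        Geometry geom = new WKTReader().read(wkt);
        Geometry moved = AffineTransformation.translationInstance(dx, dy).transform(geom);
        return new WKTWriter().write(moved);
    }
}

The class could then be registered in Hive with something like CREATE FUNCTION st_translate AS 'HiveSTTranslate' USING JAR '<path-to-your-udf-jar>'; (the path and function name here are placeholders).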
