Apache Spark + Cassandra + Java + SparkSession to display all records - apache-spark

I am working on a Spring Java project and integrating Apache Spark and Cassandra using the DataStax connector.
I have autowired SparkSession, and the lines of code below seem to work.
Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", tableName.toLowerCase());
Dataset<Row> ds = sparkSession.sqlContext().read()
        .format("org.apache.spark.sql.cassandra")
        .options(configMap)
        .load();
ds.show();
But this always gives me 20 records. I want to select all the records of the table. Can someone tell me how to do this?
Thanks in advance.

show always outputs 20 records by default, although you can pass an argument to specify how many rows you need. But show is usually used only for briefly examining the data, especially when working interactively.
In your case, everything really depends on what you want to do with the data - you have already successfully loaded it with the load function - after that you can just start to use the normal Spark operations: select, filter, groupBy, etc.
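For example, a minimal sketch in Java, continuing from the ds loaded above (the "status" and "id" columns are just illustrative placeholders):

long total = ds.count();                      // number of records in the table
ds.show((int) total, false);                  // print every row without truncating columns

// Only safe for small tables: bring everything back to the driver
java.util.List<Row> allRows = ds.collectAsList();

// More typically, keep it distributed and use the normal Spark operations
ds.filter("status = 'ACTIVE'")                // hypothetical column, for illustration
  .select("id", "status")
  .show(100, false);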
P.S. You can find more examples of using the Spark Cassandra Connector (SCC) from Java here, although it is more cumbersome than using Scala. I also recommend making sure that you are using SCC 2.5.0 or higher, because of the many new features added there.

Related

Spark SQL encapsulation of data sources

I have a Dataset where 98% of its data (older than one day) would be in a Parquet file and 2% (the current day - a real-time feed) would be in HBase; I always need to union them to get the final data set for that particular table or entity.
So I would like my clients to use the data seamlessly, like below, in any language they use for accessing Spark, via the spark-shell, or via any BI tool:
spark.read.format("my.datasource").load("entity1")
Internally I will read entity1's data from Parquet and HBase, union them, and return the result.
I googled and found a few examples on extending DataSourceV2. Most of them say you need to develop a reader, but here I do not need a new reader; I need to make use of the existing ones (Parquet and HBase).
As I am not introducing any new data source as such, do I need to create a new data source? Or is there any higher-level abstraction/hook available?
You have to implement a new data source, say "parquet+hbase"; in the implementation you will make use of the existing Parquet and HBase readers, perhaps extending your classes with both of them and unioning the results, etc.
For your reference, here are some links which can help you implement a new data source:
spark "bigquery" datasource implementation
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
Implementing custom datasource
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
After going through various resources, below is what I found and implemented; it might help someone, so I am adding it as an answer.
A custom data source is required only if we introduce a new data source. For combining existing data sources we have to extend SparkSession and DataFrameReader. In the extended DataFrameReader we can invoke the Spark Parquet read method and the HBase reader, get the corresponding Datasets, then combine them and return the combined Dataset.
In Scala we can use implicits to add the custom logic to the SparkSession and DataFrame.
In Java we need to extend SparkSession and DataFrame; then, when using them, import the extended classes.
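A rough Java sketch of that idea, shown here as a thin wrapper rather than a full SparkSession/DataFrameReader extension; the HBase format name and option ("org.apache.hadoop.hbase.spark", "hbase.table") and the Parquet path layout are assumptions that depend on the connector and storage layout you actually use:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CombinedEntityReader {
    private final SparkSession spark;

    public CombinedEntityReader(SparkSession spark) {
        this.spark = spark;
    }

    public Dataset<Row> load(String entity) {
        // ~98% of the data: historical records stored as Parquet
        Dataset<Row> history = spark.read().parquet("/data/" + entity);   // hypothetical path layout

        // ~2% of the data: the current day, read through an HBase connector
        // (format name and option are placeholders for your connector)
        Dataset<Row> realtime = spark.read()
                .format("org.apache.hadoop.hbase.spark")
                .option("hbase.table", entity)
                .load();

        // Assumes both sides expose the same schema (Spark 2.3+; use union()
        // if the columns already line up positionally)
        return history.unionByName(realtime);
    }
}

Clients then call new CombinedEntityReader(spark).load("entity1") and get the unioned view without caring where the data lives.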

Spark predicate pushdown is not working with a Phoenix HBase table

I am working on Spark-Hive-HBase integration. Here I am using a Phoenix HBase table for the integration.
Phoenix: apache-phoenix-4.14
HBase: hbase-1.4
Spark: spark-2.3
Hive: 1.2.1
I am using the Spark Thrift Server and accessing the table using JDBC.
Almost all the basic features I tested work fine, but when I submit a query from Spark with a WHERE condition, it is submitted to Phoenix without the WHERE condition and all the filtering happens on the Spark side.
If the table has billions of rows we can't go with this.
example:
Input-query: select * from hive_hbase where rowid=0;
Query-submitted: PhoenixQueryBuilder: Input query : select /*+ NO_CACHE */ "rowid","load_date","cluster_id","status" from hive_hbase
Is it a bug?
Please suggest if there is any way to force the query to be submitted with the WHERE condition (filter), using JDBC only.
Thanks & Regards
Rahul
The above-mentioned behavior is not a bug but rather a feature of Spark, which makes sure that the filtering does not happen on the DB side but is instead done at Spark's end, which ensures performance for a non-rowkey filter and lets execution finish quickly. If you still want to push the predicates down, you can use phoenix-spark (see the sketch after the links below) or edit Spark's predicate pushdown code on your own. Below are some links for your reference:
https://community.hortonworks.com/questions/87551/predicate-pushdown-support-in-hortonworks-hbase-co.html
http://www.waitingforcode.com/apache-spark-sql/predicate-pushdown-spark-sql/read
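For illustration, a hedged Java sketch of reading the same table through the phoenix-spark connector instead of the Hive storage handler, so the rowid filter gets pushed down to Phoenix; the table name and ZooKeeper URL are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PhoenixPushdownExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("phoenix-pushdown")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .format("org.apache.phoenix.spark")
                .option("table", "HIVE_HBASE")       // Phoenix table name (placeholder)
                .option("zkUrl", "zk-host:2181")     // ZooKeeper quorum (placeholder)
                .load();

        // The connector translates this filter into the WHERE clause of the
        // Phoenix query instead of filtering on the Spark side.
        df.filter("rowid = 0").show();
    }
}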

Structured streaming: first n rows

Recently I ran into the 'first n rows' problem in Structured Streaming while working with real-time data. I need to output the 50 newest event-time records, but Structured Streaming gives me either a whole unbounded table or several updated results. I searched a lot online and found the following approaches:
(1) Using a TTL, but I think that is based on ingestion time, which is not the event time I want;
(2) Using Flink to catch the newest event-time records. It is somewhat messy to use Flink and Structured Streaming at the same time. As shown below, I have tried Flink 1.6; statics is a table, but I don't know how to proceed because nothing is output.
val source: KafkaTableSource = Kafka010JsonTableSource.builder()
  .forTopic("BINANCE_BTCUSDT_RESULT")
  .withKafkaProperties(properties)
  .withSchema(TableSchema.builder()
    .field("timestamp", Types.SQL_TIMESTAMP)
    .field("future_max", Types.DOUBLE)
    .field("future_min", Types.DOUBLE)
    .field("close", Types.DOUBLE)
    .field("quantities", Types.DOUBLE)
    .build())
  .fromEarliest()
  .build()

tableEnv.registerTableSource("statics", source)
val statics = tableEnv.scan("statics")
statics.?
Can anybody tell me more about how to solve the first-n-rows problem? And once it is solved, how do I post the dataframe to a URL?
I recommend you use Flink 1.5, as 1.6 isn't stable yet (in fact, 1.5 was just released).
When using event time with Flink, Flink needs to be aware of your timestamps, and it needs watermarks, which indicate the flow of event time. To do this with a Kafka010JsonTableSource, you should configure a rowtime attribute.
Note that fetch() is only available when using Flink SQL in batch mode.
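As a hedged Java sketch against the 1.5/1.6-era Kafka table source shown in the question: the withRowtimeAttribute builder method and the ExistingField/BoundedOutOfOrderTimestamps helpers are what the flink-table connectors of that generation expose, but the exact names may differ between versions, and the topic, schema and 30-second bound are taken from the question or are placeholders:

import java.util.Properties;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.connectors.kafka.Kafka010JsonTableSource;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.sources.tsextractors.ExistingField;
import org.apache.flink.table.sources.wmstrategies.BoundedOutOfOrderTimestamps;

public class RowtimeSourceSketch {

    // Builds the same Kafka source as in the question, but declares the
    // "timestamp" field as a rowtime (event-time) attribute with a
    // 30-second out-of-orderness bound.
    public static Kafka010JsonTableSource buildSource(Properties kafkaProps) {
        return Kafka010JsonTableSource.builder()
                .forTopic("BINANCE_BTCUSDT_RESULT")
                .withKafkaProperties(kafkaProps)
                .withSchema(TableSchema.builder()
                        .field("timestamp", Types.SQL_TIMESTAMP)
                        .field("future_max", Types.DOUBLE)
                        .field("future_min", Types.DOUBLE)
                        .field("close", Types.DOUBLE)
                        .field("quantities", Types.DOUBLE)
                        .build())
                // Event time comes from the existing "timestamp" field;
                // watermarks trail the highest seen timestamp by 30 seconds.
                .withRowtimeAttribute(
                        "timestamp",
                        new ExistingField("timestamp"),
                        new BoundedOutOfOrderTimestamps(30_000L))
                .fromEarliest()
                .build();
    }
}

Once the rowtime attribute is declared, the registered table can be queried with event-time windows (or an ORDER BY on the rowtime attribute) instead of processing time.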

RegisterTempTable using dataset Spark Java

I have been using DataFrames in my Java Spark project (Spark version 1.6.1).
Now I am refactoring, trying to use Datasets in order to exploit the strongly typed feature that comes with them.
In some parts of the project I was using the following code:
dataframe.registerTempTable("table")
in order to run plain SQL queries.
This kind of feature does not appear to be present for Datasets; I cannot find any similar method offered by them.
Can you confirm that?
I confirm that no method is available in Spark 1.6 for registering a temp table or view using a Dataset.
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Dataset.html
These methods were introduced in Spark 2.0.
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html
Use createOrReplaceTempView:
public void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Parameters:
viewName - (undocumented)
Since:
2.0.0
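A minimal Java sketch of the Spark 2.x replacement (the input file and query are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TempViewExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("temp-view-example")
                .getOrCreate();

        Dataset<Row> ds = spark.read().json("people.json");   // placeholder input

        // Spark 2.0+ replacement for registerTempTable("table")
        ds.createOrReplaceTempView("table");

        // The view can now be queried with plain SQL
        spark.sql("SELECT * FROM table WHERE age > 21").show();
    }
}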

How to use Apache Spark as a query engine?

I am using Apache Spark for big data processing. The data is loaded into DataFrames from a flat-file source or a JDBC source. The job is to search for specific records in the DataFrame using Spark SQL.
So I have to run the job again and again for new search terms, and every time I have to submit the JAR files using spark-submit to get the results. As the size of the data is 40.5 GB, it becomes tedious to reload the same data into a DataFrame every time just to get the results for different queries.
So what I need is:
a way to load the data into a DataFrame once and query it multiple times without submitting the JAR multiple times;
whether we can use Spark as a search engine/query engine;
whether we can load the data into a DataFrame once and query it remotely using a REST API.
The current configuration of my Spark deployment is:
5-node cluster
runs on YARN RM
I have tried to use spark-jobserver, but it also runs the job every time.
You might be interested in the HiveThriftServer and Spark integration.
Basically you start a Hive Thrift Server and inject a HiveContext built from the SparkContext:
...
val sql = new HiveContext(sc)
sql.setConf("hive.server2.thrift.port", "10001")
...
dataFrame.registerTempTable("myTable")
HiveThriftServer2.startWithContext(sql)
...
There are several client libraries and tools to query the server:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
including the CLI tool beeline.
Reference:
https://medium.com/@anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.3ntbhdxvr
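For illustration, a small Java JDBC client that queries the Thrift server started above over the HiveServer2 protocol (the same one beeline speaks); host, port, credentials and query are placeholders, and the hive-jdbc driver must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftJdbcClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10001", "user", "");
             Statement stmt = conn.createStatement();
             // Query the table registered via registerTempTable above
             ResultSet rs = stmt.executeQuery("SELECT * FROM myTable LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}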
You can also use the Spark + Kafka streaming integration. You just have to send your queries over Kafka for the streaming APIs to pick them up. That's one design pattern that is picking up quickly in the market because of its simplicity.
Create Datasets over your lookup data.
Start a Spark streaming query over Kafka.
Get the SQL from your Kafka topic.
Execute the query over the already created Datasets (see the sketch after this list).
This should take care of your use case.
Hope this helps!
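Here is a hedged Java sketch of that pattern, assuming Spark 2.4+ (for foreachBatch) and the spark-sql-kafka-0-10 package on the classpath; the broker address, topic name, data path and view name are placeholders:

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaQueryRunner {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-query-runner")
                .getOrCreate();

        // Step 1: load the lookup data once and register it for SQL access
        spark.read().parquet("/data/entity1").createOrReplaceTempView("entity1");

        // Steps 2-3: stream the query text from a Kafka topic; each record's
        // value is assumed to be a complete SQL statement
        Dataset<Row> queries = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "sql-queries")
                .load()
                .selectExpr("CAST(value AS STRING) AS sql_text");

        // Step 4: execute each incoming query against the registered views
        queries.writeStream()
                .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
                    for (Row row : batch.collectAsList()) {
                        spark.sql(row.getString(0)).show();
                    }
                })
                .start()
                .awaitTermination();
    }
}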
For the Spark search engine part: if you require full-text search capabilities and/or document-level scoring, and you do not have an Elasticsearch infrastructure, you can give Spark Search a try - it brings Apache Lucene support to Spark.
df.rdd.searchRDD().save("/tmp/hdfs-pathname")
val restoredSearchRDD: SearchRDD[Person] = SearchRDD.load[Person](sc, "/tmp/hdfs-pathname")
restoredSearchRDD.searchList("(firstName:Mikey~0.8) OR (lastName:Wiliam~0.4) OR (lastName:jonh~0.2)",
    topKByPartition = 10)
  .map(doc => s"${doc.source.firstName}=${doc.score}")
  .foreach(println)
