How to save a DataFrame into a Cassandra table using the Spark Java API - apache-spark

I want to save a DataFrame into a Cassandra table using the Spark Java API, and I want to add that saving step to the following code. Specifically, I want to save the people DataFrame into a Cassandra table and then run queries against that Cassandra table.
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import com.datastax.spark.connector.cql.CassandraConnector;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
public class SimpleApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        conf.setMaster("local");
        conf.set("spark.cassandra.connection.host", "localhost");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
        DataFrame people = sqlContext.read().json("/root/people.json");
        people.printSchema();
        people.registerTempTable("people");
        // I want to save this temp table or the people DataFrame into a Cassandra table
        // and run the teenagers SQL query against that Cassandra table
        DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
        teenagers.show();
    }
}
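A minimal sketch of the missing save-and-query step, using the spark-cassandra-connector DataFrame source; the lines below would go where the comment in main() is. It assumes the connector is on the classpath and that a keyspace test_ks and a table people already exist in Cassandra with columns matching the DataFrame schema; the keyspace and table names are placeholders:
// Requires: import org.apache.spark.sql.SaveMode;

// Save the people DataFrame into the (pre-existing) Cassandra table test_ks.people.
people.write()
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "test_ks")
    .option("table", "people")
    .mode(SaveMode.Append)
    .save();

// Read the Cassandra table back through the same data source and query it with Spark SQL.
DataFrame peopleFromCassandra = sqlContext.read()
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "test_ks")
    .option("table", "people")
    .load();
peopleFromCassandra.registerTempTable("people_cassandra");
DataFrame teenagersFromCassandra =
    sqlContext.sql("SELECT name FROM people_cassandra WHERE age >= 13 AND age <= 19");
teenagersFromCassandra.show();
The query against the temp table still runs in Spark SQL; Cassandra here acts as the storage layer for the saved DataFrame.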

Related

Not able to import Spark SQL in Maven

I am trying to import Spark SQL but I am not able to, and I am not sure what mistake I am making. I am just starting to learn.
package MySource

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.sql.SparkSession
import java.util.Properties

object MyCalc {
  def main(args: Array[String]): Unit = {
    println("This is my first Spark")
    //val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val spark = SparkSession
      .builder()
      .appName("SparkSQL")
      //.master("YARN")
      .master("local[*]")
      //.enableHiveSupport()
      //.config("spark.sql.warehouse.dir","file:///c:/temp")
      .getOrCreate()
    import spark.sqlContext.implicits._
  }
}
Error:(3, 8) object SparkSession is not a member of package org.apache.spark.sql
import org.apache.spark.sql.SparkSession
Error:(15, 17) not found: value SparkSession
val spark = SparkSession

Loading a file from HDFS in Spark

I'm trying to run this Spark program against HDFS because when I run it locally I don't have enough memory on my PC to handle it. Can someone tell me how to load the CSV file from HDFS instead of loading it locally? Here is my code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
public class VideoGamesSale {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("Video Games Spark")
            .config("spark.master", "local")
            .getOrCreate();
You can use the code below to create a Dataset/DataFrame from a CSV file:
Dataset<Row> csvDS = spark.read().csv("/path/of/csv/file.csv");
If you want to read multiple files or directories, you can pass a sequence of paths:
Seq<String> paths = scala.collection.JavaConversions.asScalaBuffer(Arrays.asList("path1", "path2"));
Dataset<Row> csvsDS = spark.read().csv(paths);
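To read the file from HDFS instead of the local filesystem, only the path changes: pass an hdfs:// URI to the same reader. A sketch, where the namenode host, port, and file path are placeholders for your cluster:
// Placeholders: point the URI at your own namenode and file location.
Dataset<Row> csvFromHdfs = spark.read()
    .option("header", "true") // only if the CSV has a header row
    .csv("hdfs://namenode-host:8020/user/yourname/video_games_sales.csv");
csvFromHdfs.show();
If fs.defaultFS in your Hadoop configuration already points at the cluster, a plain path such as /user/yourname/video_games_sales.csv will resolve against HDFS as well.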

How can Spark SQL update a document in Elasticsearch?

I want to use Spark SQL to update only one of the fields of a document in Elasticsearch, but my write overwrites the whole document.
Can anyone please explain how this can be done?
Below is my code:
import com.bjvca.utils.PVConstant
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext._
import org.elasticsearch.spark.sql._
object DF2ESTest {
  def main(args: Array[String]): Unit = {
    val conf = conf...
    val sc = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    import sqlContext.implicits._
    val person = sc.textFile("F:\\mrdata\\person\\input\\person.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(0), p(1).trim.toInt)) // first write
      // .map(p => Person(p(0), p(0))) // want to update name
      .toDF()
    person.saveToEs("person/person", Map("es.mapping.id" -> "id"))
  }
}
case class Person(id: String, name: String, age: Int)
//case class Person(id: String, name: String)
//case class Person(id:String,name: String)
First, I write the data to Elasticsearch. Second, I want to update only the name field in Elasticsearch, but the write overwrites the old document, so the age field disappears.
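One way to get a partial update is elasticsearch-hadoop's write-operation setting: with es.write.operation set to upsert (or update), the connector sends partial documents, so only the fields present in the rows you write are modified and the rest (such as age) stay in place. A sketch with the Java DataFrame writer follows; personDF stands for a DataFrame holding only the id and name columns and is illustrative, not code from the question:
// Partial-update sketch: write only the columns to change, keyed by the document id.
DataFrame nameOnly = personDF.select("id", "name"); // leave out 'age' so it is not overwritten
nameOnly.write()
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "id")           // field that identifies the document
    .option("es.write.operation", "upsert")  // partial update; insert if the document is missing
    .mode(org.apache.spark.sql.SaveMode.Append)
    .save("person/person");
In the Scala code above, the equivalent is to add "es.write.operation" -> "upsert" to the options map passed to saveToEs.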

How to save a Dataset<Row> into MySQL in Spark?

I am using a Spark standalone cluster in my scenario. I want to read a JSON file from Azure Data Lake, run some queries over it with Spark SQL, and save the result into a MySQL database. I don't know how to do it; a little help would be great.
package com.biz.Read_from_ADL;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class App {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();
        Dataset<Row> df = spark.read().json("adl://pare.azuredatalakestore.net/EXCHANGE_DATA/BITFINEX/ETHBTC/MIDPOINT/BITFINEX_ETHBTC_MIDPOINT_2017-06-25.json");
        //df.show();
        df.createOrReplaceTempView("trade");
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM trade");
        sqlDF.show();
    }
}
You first need to define the connection properties and the JDBC URL.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", "USER_NAME")
connectionProperties.put("password", "PASSWORD")
val jdbc_url = ... // <- use your MySQL JDBC URL
import org.apache.spark.sql.SaveMode
spark.sql("select * from diamonds limit 10").withColumnRenamed("table", "table_number")
  .write
  .mode(SaveMode.Append) // <--- Append to the existing table
  .jdbc(jdbc_url, "diamonds_mysql", connectionProperties)
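Since the question is in Java, the same write with the Java API, continuing from the sqlDF Dataset above, would look roughly like this (a sketch: the JDBC URL, credentials, and target table name are placeholders, and the MySQL JDBC driver must be on the classpath):
// Placeholders: adjust the URL, credentials, and target table for your MySQL instance.
java.util.Properties connectionProperties = new java.util.Properties();
connectionProperties.put("user", "USER_NAME");
connectionProperties.put("password", "PASSWORD");
String jdbcUrl = "jdbc:mysql://your-mysql-host:3306/your_database";

sqlDF.write()
    .mode(org.apache.spark.sql.SaveMode.Append) // append to the existing table
    .jdbc(jdbcUrl, "trade_results", connectionProperties);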
Refer to the Spark JDBC data source documentation for more detail.

Spark, ADAM and Zeppelin

I'm attempting genomic analysis using ADAM and Zeppelin. I'm not sure if I'm doing this right, but I'm running into the issue below.
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")
z.load("mysql:mysql-connector-java:5.1.35")
z.load("org.bdgenomics.adam:adam-core_2.10:0.20.0")
z.load("org.bdgenomics.adam:adam-cli_2.10:0.20.0")
z.load("org.bdgenomics.adam:adam-apis_2.10:0.20.0")
%spark
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.projections.{ AlignmentRecordField, Projection }
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.projections.Projection
import org.bdgenomics.adam.projections.AlignmentRecordField
import scala.io.Source
import org.apache.spark.rdd.RDD
import org.bdgenomics.formats.avro.Genotype
import scala.collection.JavaConverters._
import org.bdgenomics.formats.avro._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.{ Vector => MLVector, Vectors }
import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel }
val ac = new ADAMContext(sc)
and I get the following output with an error:
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.projections.Projection
import org.bdgenomics.adam.projections.AlignmentRecordField
import scala.io.Source
import org.apache.spark.rdd.RDD
import org.bdgenomics.formats.avro.Genotype
import scala.collection.JavaConverters._
import org.bdgenomics.formats.avro._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.{Vector=>MLVector, Vectors}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
res7: org.apache.spark.SparkContext = org.apache.spark.SparkContext@62ec8142
<console>:188: error: constructor ADAMContext in class ADAMContext cannot be accessed in class $iwC
new ADAMContext(sc)
^
Any idea where to look? Am I missing any dependencies?
According to ADAMContext.scala in the version you are using, the constructor is private:
class ADAMContext private (@transient val sc: SparkContext)
  extends Serializable with Logging {
  ...
}
You can instead use it like this:
import org.bdgenomics.adam.rdd.ADAMContext._
val adamContext: ADAMContext = z.sc
It will use the implicit conversion defined in the ADAMContext companion object:
object ADAMContext {
  implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext =
    new ADAMContext(sc)
}
It did work without using the z reference!
val ac:ADAMContext = sc
val genotypes: RDD[Genotype] = ac.loadGenotypes("/tmp/ADAM2").rdd
Output
ac: org.bdgenomics.adam.rdd.ADAMContext = org.bdgenomics.adam.rdd.ADAMContext@2c60ef7e
genotypes: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.Genotype] = MapPartitionsRDD[3] at map at ADAMContext.scala:207
I had tried doing this at the adam-shell prompt and I don't recall having to use the implicit conversion; that was with version 0.19 of ADAM, though.
