Which of the following approaches is better if I have billions of records in the Hive table?
Direct:
SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("DCA_HIVE_HDFS");
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table(tableName);
df.write().orc(outputHdfsFile);
Using JDBC:
SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("DCA_HIVE_HDFS");
SparkContext sc = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
Properties props = new Properties();
props.setProperty("user", userName);
props.setProperty("password", password);
props.setProperty("driver", driverName);
DataFrame df = sqlContext.read().jdbc(connectionUri, tableName, props);
df.write().orc(outputHdfsFile);
Related
I am working on a Spark module where I need to load collections from multiple sources (databases), but I can't get the collection from the second database.
Databases:
DB1: L_coll1
DB2: L_coll2
Logic code
String mst ="local[*]";
String host= "localhost";
String port = "27017";
String DB1 = "DB1";
String DB2 = "DB2";
SparkConf conf = new SparkConf().setAppName("cust data").setMaster(mst);
SparkSession spark = SparkSession
.builder()
.config(conf)
.config("spark.mongodb.input.uri", "mongodb://"+host+":"+port+"/")
.config("spark.mongodb.input.database",DB1)
.config("spark.mongodb.input.collection","coll1")
.getOrCreate();
SparkSession spark1 = SparkSession
.builder()
.config(conf)
.config("spark.mongodb.input.uri", "mongodb://"+host+":"+port+"/")
.config("spark.mongodb.input.database",DB2)
.config("spark.mongodb.input.collection","coll2")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaSparkContext jsc1 = new JavaSparkContext(spark1.sparkContext());
Reading configurations
ReadConfig readConfig = ReadConfig.create(spark);
Dataset<Row> MongoDatset = MongoSpark.load(jsc,readConfig).toDF();
MongoDatset.show();
ReadConfig readConfig1 = ReadConfig.create(spark1);
Dataset<Row> MongoDatset1 = MongoSpark.load(jsc1,readConfig1).toDF();
MongoDatset1.show();
After running the above code, I am getting the first dataset multiple times. If I comment out the first SparkSession spark instance, then I only get the collection from the second database, DB2.
Instead of using multiple Spark sessions, you can use ReadConfig's override options to read from multiple databases and collections.
Creating the Spark session
String DB = "DB1";
String DB1 = "DB2";
String Coll1 ="Coll1";
String Coll2 ="Coll2";
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
.getOrCreate();
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
Get database function
private static Dataset<Row> getDB(JavaSparkContext jsc_, String DB, String Coll1) {
// Create a custom ReadConfig
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("database",DB );
readOverrides.put("collection", Coll1);
readOverrides.put("readPreference.name", "secondaryPreferred");
System.out.println(readOverrides);
ReadConfig readConfig = ReadConfig.create(jsc_).withOptions(readOverrides);
return MongoSpark.load(jsc_,readConfig).toDF();
}
Using getDB to load from multiple databases
Dataset<Row> MongoDatset1 = getDB(jsc, DB, Coll1);
Dataset<Row> MongoDatset2 = getDB(jsc, DB1, Coll2);
MongoDatset1.show(1);
MongoDatset2.show(1);
I have only been learning Spark for a short while. I found the API saveAsNewAPIHadoopDataset for use with HBase; my code is below. As far as I know, this code inserts one row at a time. How can I change it to a batch put? I am a rookie, please help. Thanks.
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkContext, SparkConf}
object HbaseTest2 {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("HBaseTest").setMaster("local")
val sc = new SparkContext(sparkConf)
val tablename = "account"
sc.hadoopConfiguration.set("hbase.zookeeper.quorum","slave1,slave2,slave3")
sc.hadoopConfiguration.set("hbase.zookeeper.property.clientPort", "2181")
sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)
val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Result])
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
val indataRDD = sc.makeRDD(Array("1,jack,15","2,Lily,16","3,mike,16"))
val rdd = indataRDD.map(_.split(',')).map{arr=>{
val put = new Put(Bytes.toBytes(arr(0)))
put.addColumn(Bytes.toBytes("cf"),Bytes.toBytes("name"),Bytes.toBytes(arr(1)))
put.addColumn(Bytes.toBytes("cf"),Bytes.toBytes("age"),Bytes.toBytes(arr(2).toInt))
(new ImmutableBytesWritable, put)
}}
rdd.saveAsNewAPIHadoopDataset(job.getConfiguration())
sc.stop()
}
}
Actually you don't need to worry about this - under the hood, put(Put) and put(List<Put>) are identical. They both buffer messages and flush them in batches. There should be no noticeable performance difference.
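If you do want explicit control over client-side buffering, the classic HTable API exposes it directly. A minimal sketch, assuming the same sc and tablename as in the question above and an HBase 1.x-style client; the 4 MB buffer size is an arbitrary example:
import org.apache.hadoop.hbase.client.HTable

// Sketch only: reuses sc and tablename from the question; the buffer size is an arbitrary example.
val bufferedTable = new HTable(sc.hadoopConfiguration, tablename)
bufferedTable.setAutoFlush(false)                 // buffer puts client-side instead of sending one RPC per put
bufferedTable.setWriteBufferSize(4 * 1024 * 1024) // flush once roughly 4 MB of puts have accumulated
// ... issue bufferedTable.put(put) calls here as usual ...
bufferedTable.flushCommits()                      // send whatever is still sitting in the buffer
bufferedTable.close()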
I'm afraid the other answer is misguided.
saveAsNewAPIHadoopDataset performs single puts.
To perform a bulk put to an HBase table, you can use the hbase-spark connector.
The connector executes bulkPutFunc2 within mapPartitions(), so it is efficient.
Your source code would change as below:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
object HBaseTest {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("HBaseTest").setMaster("local")
val sc = new SparkContext(sparkConf)
val tablename = "account"
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "slave1,slave2,slave3")
hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
hbaseConf.set("zookeeper.znode.parent", "/hbase")
val hbaseContext = new HBaseContext(sc, hbaseConf)
val indataRDD = sc.makeRDD(Array("1,jack,15", "2,Lily,16", "3,mike,16"))
hbaseContext.bulkPut(indataRDD, TableName.valueOf(tablename), bulkPutFunc2)
sc.stop()
}
def bulkPutFunc2(arrayRec : String): Put = {
val rec = arrayRec.split(",")
val put = new Put(Bytes.toBytes(rec(0).toInt))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(rec(1)))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes(rec(2).toInt))
put
}
}
pom.xml would have the following entry:
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-spark</artifactId>
  <version>1.2.0-cdh5.12.1</version>
</dependency>
I was trying to write data to HBase using Spark but got the exception Exception in thread "main" org.apache.spark.SparkException: Task not serializable. I was trying to open a connection on each worker node using the following code snippet:
val conf = HBaseConfiguration.create()
val tableName = args(1)
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
val tableDesc = new HTableDescriptor(tableName)
val columnDesc = new HColumnDescriptor("cf".getBytes()).setBloomFilterType(BloomType.ROWCOL).setMaxVersions(5)
tableDesc.addFamily(columnDesc)
admin.createTable(tableDesc)
rddData.foreachPartition( part => {
val table = new HTable(conf, tableName)
part.foreach( elem => {
var put = new Put(Bytes.toBytes(elem._1))
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
table.put(put)
})
table.flushCommits()
})
How can I make the task serializable while writing to HBase using Spark?
If I am not mistaken, conf (an instance of Hadoop Configuration) is not serializable.
Write your code in such a way that all the non-serializable parts are inside the foreachPartition block (so that they are executed on the worker nodes). Here is an example where I create a second conf, etc.:
rddData.foreachPartition( part => {
val conf2 = HBaseConfiguration.create()
val tableName2 = args(1)
conf2.set(TableInputFormat.INPUT_TABLE, tableName2)
val table2 = new HTable(conf2, tableName2)
part.foreach( elem => {
var put = new Put(Bytes.toBytes(elem._1))
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
table2.put(put)
})
table2.flushCommits()
})
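One detail worth adding to the example above: the HTable opened inside foreachPartition should also be closed when the partition is done, otherwise each partition leaks a connection. A sketch of the same loop wrapped in try/finally, reusing the names from the snippet above:
rddData.foreachPartition( part => {
  val conf2 = HBaseConfiguration.create()
  val tableName2 = args(1)
  val table2 = new HTable(conf2, tableName2)
  try {
    part.foreach( elem => {
      val put = new Put(Bytes.toBytes(elem._1))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
      table2.put(put)
    })
    table2.flushCommits()   // push buffered puts before closing
  } finally {
    table2.close()          // release the connection this partition opened
  }
})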
I am processing logs using Spark Streaming. I parse each log line and convert it into a Java Map; the code is below.
Now I want to convert this Map into a DataFrame.
Any suggestion on how to achieve this?
val sparkConf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")
sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
sqlContext= new SQLContext(sc)
val lines = ssc.textFileStream("hdfs://localhost:9000/test")
process(lines)
def process(lines: DStream[String]) {
val maptorow = lines.foreachRDD(rdd=>{
rdd.map(line => getMap(line))
.map(p =>
Row(p.get("column1"),
p.get("column2")))
}) // how to get dataframe after this?
def getMap(logs: String): java.util.Map[String, Object] = {
val k : java.util.Map[String, String] = parseLog(logs)
}
}
Thanks
foreachRDD has no return type, so you shouldn't be saving maptorow. To convert, you need to do the conversion inside the foreachRDD and then deal with each RDD by itself as a separate set of data:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)  // reuse the SparkContext created above

// Explicit schema for the two columns pulled out of each log line
val schema = StructType(Seq(
  StructField("column1", StringType, nullable = true),
  StructField("column2", StringType, nullable = true)))

lines.foreachRDD(rdd => {
  val rowRDD = rdd.map(line => getMap(line))
    .map(p => Row(String.valueOf(p.get("column1")), String.valueOf(p.get("column2"))))
  val myDataFrame = sqlContext.createDataFrame(rowRDD, schema)
  // process myDataFrame as a DataFrame
})
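An alternative, if the columns are known up front, is to map into a case class instead of Row; toDF() then works through sqlContext.implicits._ without an explicit schema. A sketch under that assumption (LogEntry and the column handling are illustrative, and in Spark 1.x the case class must be defined at the top level rather than inside the method):
case class LogEntry(column1: String, column2: String)  // illustrative; define outside the streaming method

lines.foreachRDD(rdd => {
  import sqlContext.implicits._
  val df = rdd.map(line => getMap(line))
    .map(p => LogEntry(String.valueOf(p.get("column1")), String.valueOf(p.get("column2"))))
    .toDF()
  // work with df as a regular DataFrame here
})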
I wrote a demo to write data to HBase, but there is no response, no error, and no log output.
My HBase is 0.98, Hadoop 2.3, Spark 1.4.
And I run in yarn-client mode. Any idea? Thanks.
object SparkConnectHbase2 extends Serializable {
def main(args: Array[String]) {
new SparkConnectHbase2().toHbase();
}
}
class SparkConnectHbase2 extends Serializable {
def toHbase() {
val conf = new SparkConf().setAppName("ljh_ml3");
val sc = new SparkContext(conf)
val tmp = sc.parallelize(Array(601, 701, 801, 901)).foreachPartition({ a =>
val configuration = HBaseConfiguration.create();
configuration.set("hbase.zookeeper.property.clientPort", "2181");
configuration.set("hbase.zookeeper.quorum", “192.168.1.66");
configuration.set("hbase.master", “192.168.1.66:60000");
val table = new HTable(configuration, "ljh_test4");
var put = new Put(Bytes.toBytes(a+""));
put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(a + "value"));
table.put(put);
table.flushCommits();
})
}
}
Write to the HBase table:
import org.apache.hadoop.hbase.client.{HBaseAdmin, HTable, Put}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, HColumnDescriptor, TableName}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark._
val hconf = HBaseConfiguration.create()
hconf.set(TableInputFormat.INPUT_TABLE, tablename)
val admin = new HBaseAdmin(hconf)
if(!admin.isTableAvailable(tablename)) {
val tabledesc= new HTableDescriptor(tablename)
tabledesc.addFamily(new HColumnDescriptor("cf1".getBytes()));
admin.createTable(tabledesc)
}
val newtable= new HTable(hconf, tablename);
val put = new Put(new String("row").getBytes());
put.add("cf1".getBytes(), "col1".getBytes(), new String("data").getBytes());
newtable.put(put);
newtable.flushCommits();
val hbaserdd = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
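To sanity-check the write afterwards, one option is to map each Result to plain strings on the executors before bringing a sample back to the driver (Result itself is not serializable in these HBase versions); a minimal sketch using the same column family and qualifier as the put above:
import org.apache.hadoop.hbase.util.Bytes

// Map each (rowkey, Result) pair to plain strings, then pull a small sample back to the driver.
val rows = hbaserdd.map { case (_, result) =>
  (Bytes.toString(result.getRow),
   Bytes.toString(result.getValue("cf1".getBytes(), "col1".getBytes())))
}
rows.take(5).foreach { case (row, value) => println(s"row=$row col1=$value") }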