I'm new to Zeppelin, but it looks interesting.
I'd like to visualize Cassandra data read through Spark within Zeppelin, but I haven't managed to do it yet.
This is my code:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql
val createDDL = """CREATE TEMPORARY VIEW keyspaces9
USING org.apache.spark.sql.cassandra
OPTIONS (
table "foehis",
keyspace "tfm",
pushdown "true")"""
spark.sql(createDDL)
spark.sql("SELECT hoclic,hodtac,hohrac,hotpac FROM keyspaces").show
And I get:
res41: org.apache.spark.sql.DataFrame = []
+------+--------+------+------+
|hoclic| hodtac|hohrac|hotpac|
+------+--------+------+------+
| 1011|10180619| 510| ENPR|
| 1011|20140427| 800| ANDE|
| 1011|20140427| 800| ANDE|
| 1011|20170522| 1100| ANDE|
| 1011|20170522| 1100| ANDE|
....
But I don't get any option to create a visualization.
How do I convert that data into a table for Zeppelin?
Register the DataFrame as a table using df.registerTempTable (deprecated since Spark 2.0 in favour of df.createOrReplaceTempView).
In your case, register the 'keyspaces' DataFrame as a table; you can then run SQL queries against the table and create visualizations from the results.
Sample code:
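As a complementary, minimal sketch (not the answerer's own sample): Zeppelin's display system renders any paragraph output that starts with %table as a table, with tab-separated cells and one row per line, and the chart buttons then become available. The helper name to_zeppelin_table is hypothetical, and the rows are copied from the output in the question.

```python
def to_zeppelin_table(columns, rows):
    """Format rows as Zeppelin %table output: tab-separated cells, one row per line."""
    header = "\t".join(columns)
    body = "\n".join("\t".join(str(cell) for cell in row) for row in rows)
    return "%table " + header + "\n" + body + "\n"


# Sample rows taken from the query output shown in the question.
columns = ["hoclic", "hodtac", "hohrac", "hotpac"]
rows = [(1011, 10180619, 510, "ENPR"), (1011, 20140427, 800, "ANDE")]

output = to_zeppelin_table(columns, rows)
print(output)
```

In a Python paragraph the rows could come from something like df.limit(n).collect(); that part is an assumption about your setup and is left out so the sketch runs without Spark.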
I stored data in a table:
spark.table("default.student").show()
(1) Spark Jobs
+---+----+---+
| id|name|age|
+---+----+---+
| 1| bob| 34|
+---+----+---+
I would like to make a read stream using that table as source. I tried
newDF=spark.read.table("default.student")
newDF.isStreaming
Which returns False.
Is there a way to use a table as Streaming Source?
You need to use a Delta table. For example, in a Databricks notebook:
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").saveAsTable("T1")
stream = spark.readStream.format("delta").table("T1").writeStream.format("console").start()
# In another cell, execute:
data = spark.range(6, 10)
data.write.format("delta").mode("append").saveAsTable("T1")
The driver logs then show two sets of data.
When trying to use Spark 2.3 on HDP 3.1 to write to a Hive table directly into Hive's schema, without the warehouse connector, using:
spark-shell --driver-memory 16g --master local[3] --conf spark.hadoop.metastore.catalog.default=hive
val df = Seq(1,2,3,4).toDF
spark.sql("create database foo")
df.write.saveAsTable("foo.my_table_01")
fails with:
Table foo.my_table_01 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional
but the following write succeeds:
val df = Seq(1,2,3,4).toDF.withColumn("part", col("value"))
df.write.partitionBy("part").option("compression", "zlib").mode(SaveMode.Overwrite).format("orc").saveAsTable("foo.my_table_02")
Reading it back in Spark with spark.sql("select * from foo.my_table_02").show works just fine.
Now going to Hive / beeline:
0: jdbc:hive2://hostname:2181/> select * from my_table_02;
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
Running
describe extended my_table_02;
returns
+-----------------------------+----------------------------------------------------+----------+
| col_name | data_type | comment |
+-----------------------------+----------------------------------------------------+----------+
| value | int | |
| part | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| part | int | |
| | NULL | NULL |
| Detailed Table Information | Table(tableName:my_table_02, dbName:foo, owner:hive/bd-sandbox.t-mobile.at#SANDBOX.MAGENTA.COM, createTime:1571201905, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:value, type:int, comment:null), FieldSchema(name:part, type:int, comment:null)], location:hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{path=hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, compression=zlib, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:part, type:int, comment:null)], parameters:{numRows=0, rawDataSize=0, spark.sql.sources.schema.partCol.0=part, transient_lastDdlTime=1571201906, bucketing_version=2, spark.sql.create.version=2.3.2.3.1.0.0-78, totalSize=740, spark.sql.sources.schema.numPartCols=1, spark.sql.sources.schema.part.0={\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}, numFiles=4, numPartitions=4, spark.sql.partitionProvider=catalog, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=orc, transactional=true}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER, writeId:-1) |
How can I use Spark to write to Hive without using the warehouse connector, while still writing to the same metastore so that the data can later be read by Hive?
To the best of my knowledge, external tables should make this possible (they are not managed, not ACID, and not transactional), but I am not sure how to tell saveAsTable to create one.
edit
related issues:
https://community.cloudera.com/t5/Support-Questions/In-hdp-3-0-can-t-create-hive-table-in-spark-failed/td-p/202647
Table loaded through Spark not accessible in Hive
Setting the properties proposed in the answers there does not solve my issue.
This also seems to be a bug: https://issues.apache.org/jira/browse/HIVE-20593
A workaround might be something like https://github.com/qubole/spark-acid or the Hive Warehouse Connector (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html), but I do not like the idea of using more duct tape for which I have not yet seen any large-scale performance tests. It would also mean changing all existing Spark jobs.
In fact, "Cant save table to hive metastore, HDP 3.0" reports issues with large data frames and the warehouse connector.
edit
I just found https://community.cloudera.com/t5/Support-Questions/Spark-hive-warehouse-connector-not-loading-data-when-using/td-p/243613
And:
execute() vs executeQuery()
ExecuteQuery() will always use the Hiveserver2-interactive/LLAP as it
uses the fast ARROW protocol. Using it when the jdbc URL point to the
non-LLAP Hiveserver2 will yield an error.
Execute() uses JDBC and does not have this dependency on LLAP, but has
a built-in restriction to only return 1.000 records max. But for most
queries (INSERT INTO ... SELECT, count, sum, average) that is not a
problem.
But doesn't this kill any high-performance interoperability between Hive and Spark, especially if there are not enough LLAP nodes available for large-scale ETL?
In fact, this is true. This setting can be configured at https://github.com/hortonworks-spark/spark-llap/blob/26d164e62b45cfa1420d5d43cdef13d1d29bb877/src/main/java/com/hortonworks/spark/sql/hive/llap/HWConf.java#L39, though I am not sure of the performance impact of increasing this value.
Did you try the following?
data.write \
.mode("append") \
.insertInto("tableName")
In Ambari, simply disabling the option to create transactional tables by default solves my problem.
Set the following to false in both places (Tez and LLAP):
hive.strict.managed.tables = false
Then enable transactional behaviour manually in the table properties of each table that should be transactional.
Creating an external table (as a workaround) seems to be the best option for me.
This still involves HWC to register the column metadata or update the partition information.
Something along these lines:
import org.apache.spark.sql.{DataFrame, SaveMode}
import com.hortonworks.hwc.HiveWarehouseSession
val df: DataFrame = ...
val tableName = "my_db.my_table"
val externalPath = "/warehouse/tablespace/external/hive/my_db.db/my_table"
val hive = HiveWarehouseSession.session(spark).build()
// Write the ORC files straight to the external location
df.write.partitionBy("part_col").option("compression", "zlib").mode(SaveMode.Overwrite).orc(externalPath)
// Column list for the DDL; the partition column is declared separately
val columns = df.drop("part_col").schema.fields.map(field => s"${field.name} ${field.dataType.simpleString}").mkString(", ")
val ddl =
s"""
|CREATE EXTERNAL TABLE $tableName ($columns)
|PARTITIONED BY (part_col string)
|STORED AS ORC
|LOCATION '$externalPath'
""".stripMargin
// Register the table, then pick up the partitions written above
hive.execute(ddl)
hive.execute(s"MSCK REPAIR TABLE $tableName SYNC PARTITIONS")
Unfortunately, this throws a:
java.sql.SQLException: The query did not generate a result set!
from HWC
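The DDL-building step of this workaround can be sketched on its own, without Spark or HWC. This is only an illustration under stated assumptions: build_external_table_ddl is a hypothetical helper, the column list is given by hand rather than derived from a DataFrame schema, and the resulting string would still have to be submitted to Hive by whatever client you use.

```python
def build_external_table_ddl(table, columns, partition_columns, location):
    """Build a CREATE EXTERNAL TABLE statement for ORC data stored at `location`."""
    def column_list(cols):
        # Render [(name, type), ...] as "name type, name type".
        return ", ".join(f"{name} {dtype}" for name, dtype in cols)

    return (
        f"CREATE EXTERNAL TABLE {table} ({column_list(columns)})\n"
        f"PARTITIONED BY ({column_list(partition_columns)})\n"
        "STORED AS ORC\n"
        f"LOCATION '{location}'"
    )


# Example values mirror the names used in the Scala snippet above.
ddl = build_external_table_ddl(
    table="my_db.my_table",
    columns=[("value", "int")],
    partition_columns=[("part_col", "string")],
    location="/warehouse/tablespace/external/hive/my_db.db/my_table",
)
print(ddl)
```

Keeping the partition columns out of the main column list matters: Hive rejects a column that appears both in the table definition and in PARTITIONED BY.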
"How can I use spark to write to hive without using the warehouse connector but still writing to the same metastore which can later on be read by hive?"
We are working with the same setup (HDP 3.1 with Spark 2.3). Using the code below, we got the same error message as you did, "bucketId out of range: -1". The solution was to run set hive.fetch.task.conversion=none; in the Hive shell before querying the table.
The code to write data into Hive without the HWC:
import java.io.File
import org.apache.spark.sql.{SaveMode, SparkSession}
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
case class Record(key: Int, value: String)
val spark = SparkSession.builder()
.master("yarn")
.appName("SparkHiveExample")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
spark.sql("USE databaseName")
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("sparkhive_records")
[Example from https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html]
I am using PySpark (Spark 2.3.1) with HBase 1.2.1, and I am wondering what the best way of accessing HBase from PySpark would be.
I did some initial searching and found a few options, such as shc-core:1.1.1-2.1-s_2.11.jar, but wherever I look for examples, most of the code is written in Scala. I tried implementing basic code in PySpark:
from pyspark import SparkContext
from pyspark.sql import SQLContext
def main():
    sc = SparkContext()
    sqlc = SQLContext(sc)
    data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
    catalog = ''.join("""{
        "table":{"namespace":"default", "name":"firsttable"},
        "rowkey":"key",
        "columns":{
            "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
            "secondcol":{"cf":"d", "col":"colname", "type":"string"}
        }
    }""".split())
    df = sqlc.read.options(catalog=catalog).format(data_source_format).load()
    df.select("secondcol").show()
# entry point for PySpark application
if __name__ == '__main__':
    main()
and running it using:
spark-submit --master yarn-client --files /opt/hbase-1.1.2/conf/hbase-site.xml --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --jars /home/ubuntu/hbase-spark-2.0.0-alpha4.jar HbaseMain2.py
It returns empty output:
+---------+
|secondcol|
+---------+
+---------+
I am not sure what I am doing wrong, or what the best approach would be.
Any references would be appreciated.
Regards
Finally, using SHC, I was able to connect to HBase 1.2.1 from Spark 2.3.1 using PySpark code. Here is what I did:
All my Hadoop [NameNode, DataNode, NodeManager, ResourceManager] and HBase [HMaster, HRegionServer, HQuorumPeer] daemons were up and running on my EC2 instance.
I placed the emp.csv file at the HDFS location /test/emp.csv, with this data:
key,empId,empName,empWeight
1,"E007","Bhupesh",115.10
2,"E008","Chauhan",110.23
3,"E009","Prithvi",90.0
4,"E0010","Raj",80.0
5,"E0011","Chauhan",100.0
I created a readwriteHBase.py file with the following code [it reads emp.csv from HDFS, creates tblEmployee in HBase, pushes the data into tblEmployee, then reads some data back from the same table and displays it on the console]:
from pyspark.sql import SparkSession
def main():
    spark = SparkSession.builder.master("yarn-client").appName("HelloSpark").getOrCreate()
    dataSourceFormat = "org.apache.spark.sql.execution.datasources.hbase"
    # Catalog used for writing: maps DataFrame columns to HBase column families
    writeCatalog = ''.join("""{
        "table":{"namespace":"default", "name":"tblEmployee", "tableCoder":"PrimitiveType"},
        "rowkey":"key",
        "columns":{
            "key":{"cf":"rowkey", "col":"key", "type":"int"},
            "empId":{"cf":"personal","col":"empId","type":"string"},
            "empName":{"cf":"personal", "col":"empName", "type":"string"},
            "empWeight":{"cf":"personal", "col":"empWeight", "type":"double"}
        }
    }""".split())
    writeDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/test/emp.csv")
    # show() prints the rows itself and returns None, so call it separately
    print("csv file read")
    writeDF.show()
    writeDF.write.options(catalog=writeCatalog, newtable=5).format(dataSourceFormat).save()
    print("csv file written to HBase")
    # Catalog used for reading: only the columns we want back
    readCatalog = ''.join("""{
        "table":{"namespace":"default", "name":"tblEmployee"},
        "rowkey":"key",
        "columns":{
            "key":{"cf":"rowkey", "col":"key", "type":"int"},
            "empId":{"cf":"personal","col":"empId","type":"string"},
            "empName":{"cf":"personal", "col":"empName", "type":"string"}
        }
    }""".split())
    print("going to read data from Hbase table")
    readDF = spark.read.options(catalog=readCatalog).format(dataSourceFormat).load()
    print("data read from HBase table")
    readDF.select("empId", "empName").show()
    readDF.show()
# entry point for PySpark application
if __name__ == '__main__':
    main()
I ran this script on the VM console using the command:
spark-submit --master yarn-client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://nexus-private.hortonworks.com/nexus/content/repositories/IN-QA/ readwriteHBase.py
Intermediate Result: After reading CSV file:
+---+-----+-------+---------+
|key|empId|empName|empWeight|
+---+-----+-------+---------+
| 1| E007|Bhupesh| 115.1|
| 2| E008|Chauhan| 110.23|
| 3| E009|Prithvi| 90.0|
| 4|E0010| Raj| 80.0|
| 5|E0011|Chauhan| 100.0|
+---+-----+-------+---------+
Final Output : after reading data from HBase table:
+-----+-------+
|empId|empName|
+-----+-------+
| E007|Bhupesh|
| E008|Chauhan|
| E009|Prithvi|
|E0010| Raj|
|E0011|Chauhan|
+-----+-------+
Note: When the HBase table is created on write, SHC expects the number of regions (the newtable option) to be greater than 3, hence I added options(catalog=writeCatalog, newtable=5) when writing the data to HBase.
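The catalogs above are built by stripping all whitespace from a JSON literal with ''.join(...split()), which would also strip whitespace inside a value if one ever contained any. A minimal alternative sketch, assuming the same catalog layout as in the script, builds the string with json.dumps instead; build_catalog is a hypothetical helper and not part of SHC.

```python
import json


def build_catalog(table, rowkey, columns, namespace="default"):
    """Serialize an SHC-style catalog dict to a compact JSON string."""
    catalog = {
        "table": {"namespace": namespace, "name": table},
        "rowkey": rowkey,
        "columns": columns,
    }
    # Compact separators give the same whitespace-free form as the script above.
    return json.dumps(catalog, separators=(",", ":"))


read_catalog = build_catalog(
    table="tblEmployee",
    rowkey="key",
    columns={
        "key": {"cf": "rowkey", "col": "key", "type": "int"},
        "empId": {"cf": "personal", "col": "empId", "type": "string"},
        "empName": {"cf": "personal", "col": "empName", "type": "string"},
    },
)
print(read_catalog)
```

The resulting string can be passed as the catalog option in the same way as readCatalog in the script.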
I am trying to create a data pipeline where the incoming data is stored as Parquet, and I create an external Hive table so that users can query the Hive table and retrieve data. I am able to save the Parquet data and read it back directly, but when I query the Hive table it does not return any rows. I used the following test setup:
--CREATE EXTERNAL HIVE TABLE
create external table emp (
id double,
hire_dt timestamp,
user string
)
stored as parquet
location '/test/emp';
Next, I created a DataFrame from some data and saved it to Parquet.
---Create dataframe and insert DATA
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val schema = List(("id", "double"), ("hire_dt", "date"), ("user", "string"))
val newCols= schema.map ( x => col(x._1).cast(x._2))
val newDf = employeeDf.select(newCols:_*)
newDf.write.mode("append").parquet("/test/emp")
newDf.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+-------+----+
|id |hire_dt|user|
+---+-------+----+
+---+-------+----+
As shown above, I can see the data if I read from Parquet directly, but not from Hive. What am I doing wrong that Hive isn't getting the data? I thought a missing msck repair might be the reason, but running msck repair table fails with an error saying the table is not partitioned.
Based on your create table statement, the table location is /test/emp, but you are writing the data to /tenants/gwm/idr/emp, so there will be no data at /test/emp.
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/test/emp';
Please re-create the external table as:
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/tenants/gwm/idr/emp';
Apart from the answer given by Ramdev below, you also need to be careful to use the correct data type for date/timestamp columns, as the 'date' type is not supported by Parquet when creating a Hive table.
To handle that, change the type of the column 'hire_dt' from 'date' to 'timestamp'.
Otherwise there will be a mismatch between the data you persist through Spark and what you try to read in Hive (or Hive SQL). Keeping it as 'timestamp' in both places will resolve the issue. I hope it helps.
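A quick way to spot this kind of mismatch before writing is to compare the declared table types with the types of the DataFrame being written. The sketch below is plain Python with both schemas entered by hand from the question; schema_mismatches is a hypothetical helper.

```python
def schema_mismatches(declared, written):
    """Return {column: (declared_type, written_type)} for columns whose types differ."""
    return {
        name: (declared_type, written.get(name))
        for name, declared_type in declared.items()
        if written.get(name) != declared_type
    }


# Types from the CREATE EXTERNAL TABLE statement in the question.
hive_schema = {"id": "double", "hire_dt": "timestamp", "user": "string"}
# Types produced by the casts in the question's Spark code.
spark_schema = {"id": "double", "hire_dt": "date", "user": "string"}

mismatches = schema_mismatches(hive_schema, spark_schema)
print(mismatches)  # {'hire_dt': ('timestamp', 'date')}
```

In PySpark the second dictionary could be obtained with dict(df.dtypes); that step is left out here so the sketch runs without Spark.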
Do you have enableHiveSupport() in your SparkSession builder() statement? Are you able to connect to the Hive metastore? Try running show tables/databases in your code to see whether you can list the tables present at your Hive location.
I got this working with the change below.
import org.apache.spark.sql.types.{DoubleType, TimestampType}
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
So basically the issue was a data type mismatch, and somehow the cast in the original code doesn't seem to work. With an explicit cast, the write goes fine and I am able to query the data back as well. Logically both do the same thing, so I am not sure why the original code was not working.
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
dfTransformed.write.mode("append").parquet("/test/emp")
dfTransformed.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
Below is the code block and the error received.
Creating the temporary views:
sqlcontext.sql("""CREATE TEMPORARY VIEW temp_pay_txn_stage
USING org.apache.spark.sql.cassandra
OPTIONS (
table "t_pay_txn_stage",
keyspace "ks_pay",
cluster "Test Cluster",
pushdown "true"
)""".stripMargin)
sqlcontext.sql("""CREATE TEMPORARY VIEW temp_pay_txn_source
USING org.apache.spark.sql.cassandra
OPTIONS (
table "t_pay_txn_source",
keyspace "ks_pay",
cluster "Test Cluster",
pushdown "true"
)""".stripMargin)
Querying the views as below to get the new records from stage that are not present in source:
Scala> val df_newrecords = sqlcontext.sql("""Select UUID(),
| |stage.order_id,
| |stage.order_description,
| |stage.transaction_id,
| |stage.pre_transaction_freeze_balance,
| |stage.post_transaction_freeze_balance,
| |toTimestamp(now()),
| |NULL,
| |1 from temp_pay_txn_stage stage left join temp_pay_txn_source source on stage.order_id=source.order_id and stage.transaction_id=source.transaction_id where
| |source.order_id is null and source.transaction_id is null""")
org.apache.spark.sql.AnalysisException: Undefined function: 'uuid()'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
I am trying to generate UUIDs, but I am getting this error.
Here is a simple example of how you can generate a timeuuid:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
val sqlcontext = new SQLContext(sc)
import sqlcontext.implicits._
// Import UUIDs, which contains the method timeBased()
import com.datastax.driver.core.utils.UUIDs
// User-defined function timeUUID, which returns a time-based UUID
val timeUUID = udf(() => UUIDs.timeBased().toString)
//sample query to test, you can change it to yours
val df_newrecords = sqlcontext.sql("SELECT 1 as data UNION SELECT 2 as data").withColumn("time_uuid", timeUUID())
//print all the rows
df_newrecords.collect().foreach(println)
Output :
[1,9a81b3c0-170b-11e7-98bf-9bb55f3128dd]
[2,9a831350-170b-11e7-98bf-9bb55f3128dd]
Source : https://stackoverflow.com/a/37232099/2320144
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/utils/UUIDs.html#timeBased--
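For PySpark users, a rough equivalent of the generator inside the UDF can be sketched with the standard library: uuid.uuid1() produces a time-based (version 1) UUID, the same UUID version as a Cassandra timeuuid. The function name time_uuid is hypothetical, and whether the values are acceptable for your Cassandra column is an assumption to verify.

```python
import uuid


def time_uuid():
    """Return a time-based (version 1) UUID as a string."""
    return str(uuid.uuid1())


first = time_uuid()
second = time_uuid()
print(first)
print(second)
```

To use it per row, the function could be wrapped with pyspark.sql.functions.udf and applied with withColumn, mirroring the Scala example above.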