I have a DataSourceV2Relation object and I would like to get the name of its table from the Spark catalog. spark.catalog.listTables() will list all the tables, but is there a way to get the specific table directly from the object?
PySpark
from pyspark.sql.catalog import Table

def get_t_object_by_name(t_name: str, db_name: str = 'default') -> Table:
    # Assumes an active SparkSession available as `spark`
    catalog_t_list = spark.catalog.listTables(db_name)
    return next((t for t in catalog_t_list if t.name == t_name), None)
Call it as: get_t_object_by_name('my_table_name')
Example result: Table(name='my_table_name', database='default', description=None, tableType='MANAGED', isTemporary=False)
Table class definition: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Table.html
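On newer Spark versions, spark.catalog.getTable('my_table_name') should return this Table object directly without listing everything first, but check the Catalog API docs for the release you are on.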
I am trying to read a Hive table in Spark. Below is the Hive table's storage format:
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
serialization.format \u0001
When I try to read it using Spark SQL with the command below:
val c = hiveContext.sql("""select
a
from c_db.c cs
where dt >= '2016-05-12' """)
c.show
I get the warning below:
18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0,
_col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,
The read starts, but it is very slow and eventually hits a network timeout.
When I try to read the Hive table's directory directly, I get the error below.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31,
_col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I could convert the Hive table to TextInputFormat, but that would be my last resort, as I would like to keep the benefit of OrcInputFormat's compression.
I would really appreciate any suggestions.
I found a workaround: read the table's schema from the metastore and apply it when loading the ORC files directly:
val schema = spark.table("db.name").schema
spark.read.schema(schema).orc("/path/to/table")
The issue generally occurs with large tables, where the read fails at the maximum field length. I enabled the metastore-based read (set spark.sql.hive.convertMetastoreOrc=true;) and it worked for me.
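If you want to set that property from the Spark side rather than in the Hive CLI, a minimal sketch (assuming the same hiveContext as in the question) could look like this:
// Ask Spark to convert metastore ORC tables to its native ORC data source,
// so the query is planned against the schema stored in the metastore
hiveContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")
// Equivalent form through SQL:
hiveContext.sql("SET spark.sql.hive.convertMetastoreOrc=true")

val c = hiveContext.sql("select a from c_db.c cs where dt >= '2016-05-12'")
c.show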
I think the table doesn't have named columns, or if it does, Spark probably isn't able to read the names.
You can use the default column names that Spark assigned, as shown in the error, or set the column names yourself in the Spark code.
Use printSchema to inspect the schema and the toDF method to rename the columns, but yes, you will need the name mappings. This might require selecting and showing columns individually; a sketch of the renaming follows.
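A rough sketch of that renaming approach, assuming the real column names can still be read from the Hive metastore and that the file columns come back in the same order (partition columns may need separate handling):
// Column names as the metastore knows them
val realNames = hiveContext.table("c_db.c").schema.fieldNames

// Reading the ORC files directly yields _col0, _col1, ... column names
val raw = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")

// Rename positionally; this requires the name count to match the file's column count
val renamed = raw.toDF(realNames: _*)
renamed.printSchema()
renamed.select("a").show()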
Setting the conf (set spark.sql.hive.convertMetastoreOrc=true;) works, but it seems to modify the Hive table's metadata. Can you please explain what it is going to modify, and does it affect the table?
Thanks
What are the limitations on PySpark DataFrame column names? I have an issue with the following code:
%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})
It gives ...
u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'
The column name is evidently typed correctly. I got the DataFrame by converting a pandas DataFrame. Is there any issue with the dot in the column name string?
Dots are used for nested fields inside a struct type. So if you had a column called "address" of type StructType, with fields street1, street2, etc. inside it, you would access the individual fields like this:
df.select("address.street1", "address.street2", ..)
Because of that, if you want to use a dot in your field name, you need to quote the field (with backticks) whenever you refer to it. For example:
from pyspark.sql.types import *
schema = StructType([StructField("my.field", StringType())])
rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)
# Using backticks to quote the field name
df.select("`my.field`").show()
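The same should apply to the agg call in the question: quote the key the same way, e.g. df_context_spark.agg({'`spatialElementLabel.value`': 'count'}), and it should resolve.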
I have an "iplRDD" which is a json, and I do below steps and query through hivecontext. I get the results but without columns headers. Is there is a way to get the columns names along with the values?
val teamRDD = hiveContext.jsonRDD(iplRDD)
teamRDD.registerTempTable("teams")
hiveContext.cacheTable("teams")
val result = hiveContext.sql("select * from teams where team_name = 'KKR'")
result.collect.foreach(println)
Any thoughts, please?
teamRDD.schema.fieldNames should contain the header names.
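For example, a small sketch (reusing the result DataFrame from the question) that prints each row together with its column names:
val names = result.schema.fieldNames

// Print each row as columnName=value pairs rather than bare values
result.collect().foreach { row =>
  println(names.zip(row.toSeq).map { case (n, v) => s"$n=$v" }.mkString(", "))
}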
You can get it by using:
result.schema.fields
You can also save your DataFrame 'result' as a CSV file with a header, like this:
result.write.format("com.databricks.spark.csv").option("header", "true").save(outputPath)