Get the Dataframe from temp view name - apache-spark

I can register a Dataframe in the catalog using
df.createOrReplaceTempView("my_name")
I can check if the dataframe is registered
next(filter(lambda table:table.name=='my_name',spark.catalog.listTables()))
But how I can get the DataFrame associated with the name?
a_df = spark.catalog.getTempView()
assert id(a_df)==id(df)

spark.table() method will return the dataframe for the given table/view name
a_df = spark.table('my_name')

You may also use:
df=spark.sql("select * from my_name")
if you want to do some SQL before loading your df.

Related

Get table name from spark catalog

I have a DataSourceV2Relation object and I would like to get the name of its table from spark catalog. spark.catalog.listTables() will list all the tables, but is there a way to get the specific table directly from the object?
PySpark
from pyspark.sql.catalog import Table
def get_t_object_by_name(t_name:str, db_name='default') -> Table:
catalog_t_list = spark.catalog.listTables('default')
return next((t for t in catalog_t_list if t.name == t_name), None)
Call as : get_t_object_by_name('my_table_name')
Result example : Table(name='my_table_name', database='default', description=None, tableType='MANAGED', isTemporary=False)
Table class definition : https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Table.html

Return a dataframe to a dataframe

I am trying to take a list of "ID"s from my dataframe dfrep and pass the column with the ID into a function I created in order to pass the values into a query to return back to dfrep.
My function is returning a dataframe, but the results of the dataframe are including the header and when I print dfrep there are two lines. I also cannot write the dataframe to excel using xlwings because I get TypeError: must be a pywintypes time object (got DataFrame).
def overrides(id):
sql = f"select name from sales..rep where id in({id})"
mydf = pd.read_sql(sql, conn)
return mydf
overrides = np.vectorize(overrides)
dfrep['name'] = overrides(dfrep['ID'])
wsData.range('A1').options(pd.DataFrame,index=False).value = dfrep
My goal is to load the column(s) in my function's dataframe into my main dataframe dfrep and then write to excel via xlwings. Any help is appreciated.

Spark DataFrame ORC Hive table reading issue

I am trying to read a Hive table in Spark. Below is the Hive Table format:
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
serialization.format \u0001
When I am trying to read it using the Spark SQL with the below command:
val c = hiveContext.sql("""select
a
from c_db.c cs
where dt >= '2016-05-12' """)
c. show
I am getting the below warning:-
18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0,
_col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,
The read starts but it is very slow and getting network time out.
When i am trying to read the Hive table directory directly i am getting the below error.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31,
_col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I can convert the Hive table to TextInputFormat but that should be my last option as i would like to get the benefit of OrcInputFormat to compress the table size.
Really appreciate your suggestion.
I found workaround with reading table such way:
val schema = spark.table("db.name").schema
spark.read.schema(schema).orc("/path/to/table")
The issue occurs generally with large tables, as it fails to read to max field length. I added meta-store read as true (set spark.sql.hive.convertMetastoreOrc=true;) and it worked for me.
I think the table doesnt have named columns or if it has, Spark isnt able to read the names probably.
You can use the default column names that Spark has given as mentioned in the Error. Or also set column names in the Spark code.
Use printSchema and toDF method to rename the columns. But yes, you will need the mappings. This might require selecting and showing columns individually.
Setting (set spark.sql.hive.convertMetastoreOrc=true;) conf is working. But its trying to modify metadata of hive table. Can you please explain me, What is going to modify and does it effect the table.
Thanks

pyspark dataframe column name

what is limitation for pyspark dataframe column names. I have issue with following code ..
%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})
It gives ...
u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'
The column name is evidently typed correctly. I got the dataframe by transformation from pandas dataframe. It there any issue with dot in the column name string?
Dots are used for nested fields inside a structure type. So if you had a column that was called "address" of type StructType, and inside that you had street1, street2, etc you would access it the individual fields like this:
df.select("address.street1", "address.street2", ..)
Because of that, if you want to used a dot in your field name you need to quote the field whenever you refer to it. For example:
from pyspark.sql.types import *
schema = StructType([StructField("my.field", StringType())])
rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)
# Using backticks to quote the field name
df.select("`my.field`").show()

Is there a way to get column names using hiveContext?

I have an "iplRDD" which is a json, and I do below steps and query through hivecontext. I get the results but without columns headers. Is there is a way to get the columns names along with the values?
val teamRDD = hiveContext.jsonRDD(iplRDD)
teamRDD.registerTempTable("teams")
hiveContext.cacheTable("teams")
val result = hiveContext.sql("select * from teams where team_name = "KKR" )
result.collect.foreach(println)
Any thoughts please ?
teamRDD.schema.fieldNames should contain the header names.
You can get it by using:
result.schema().fields();
you can save your dataframe 'result' like this with header as csv file:
result.write().format("com.databricks.spark.csv").option("header", "true").save(outputPath);

Resources