How to join a hive table with a pandas dataframe in pyspark? - apache-spark

I have a hive table db.hive_table and a pandas dataframe df. I wish to use pyspark.SparkSession.builder.enableHiveSupport().getOrCreate().sql to join them. How can I do this?

Convert both to PySpark DataFrames and then join them.
from pyspark.sql.functions import col
# Pandas DataFrame to Spark DataFrame
df_sp = spark_session.createDataFrame(df_pd)
# Load the Hive table as a DataFrame - sqlContext is of type HiveContext
df_hive = sqlContext.table(tablename)
Join the two DataFrames. Alias them first so that 'df_sp.*' and 'df_hive.*' resolve in the select:
joined_df = df_sp.alias('df_sp').join(df_hive.alias('df_hive'), col('df_sp.id') == col('df_hive.id')).select('df_sp.*', 'df_hive.*')
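Since the question specifically asks about going through .sql, here is a minimal sketch of the same join expressed as a SQL query; it assumes spark_session was built with enableHiveSupport(), that both sides have an id column, and the view name pandas_df is just a placeholder:
# Expose the pandas DataFrame to Spark SQL as a temporary view
df_sp = spark_session.createDataFrame(df_pd)
df_sp.createOrReplaceTempView('pandas_df')
# Join the Hive table and the temporary view in plain SQL
joined_df = spark_session.sql("""
    SELECT h.*, p.*
    FROM db.hive_table h
    JOIN pandas_df p
      ON h.id = p.id
""")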

Related

how to append a dataframe to existing partitioned table to specific partition

I have an existing table, created like below:
create_table=""" create table tbl1 (tran int,count int) partitioned by (year string) """
spark.sql(create_table)
insert_query="insert into tbl1 partition(year='2022') values (101,500)"
spark.sql(insert_query)
and I create a DataFrame like below:
from pyspark.sql.functions import *
from datetime import datetime
rows = [
    (1, 501),
    (2, 502),
    (3, 503)
]
from pyspark.sql.types import *
myschema = StructType([
    StructField("id", LongType(), True),
    StructField("count", LongType(), True)
])
df = spark.createDataFrame(rows, myschema)
Now I want to append this DataFrame to the above table, adding the values to the existing partition year='2022'.
How can I do that?
When you create the dataframe, you could include the year as well, then partitionBy and write into the table:
from pyspark.sql.types import StructType, StructField, LongType, StringType
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
rows = [
    (1, 501, '2022'),
    (2, 502, '2022'),
    (3, 503, '2022')
]
myschema = StructType([
    StructField("id", LongType(), True),
    StructField("count", LongType(), True),
    StructField("year", StringType(), True)
])
df = spark.createDataFrame(rows, myschema)
df.write.mode('append').partitionBy('year').saveAsTable('tbl1')
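Since tbl1 already exists with year as its partition column, a possible alternative sketch is a positional insertInto, which appends into the table's existing layout; it assumes the DataFrame columns are ordered to match the table, with the partition column last, as in the schema above:
# insertInto resolves columns by position; the partition column ('year') comes last
df.write.mode('append').insertInto('tbl1')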

Access temporary spark dataframe in spark.read teradata sql query

Suppose you have a dataframe df, registered as a temporary view in spark.
Is it possible to access the dataframe in a spark.read teradata sql command?
I.e.
df = spark.createDataFrame([(1, 5), (2, 6),], ['id', 'val'])
df.createOrReplaceTempView('df')
td_query = """(
SELECT * FROM EX_SCHEMA.EX_TABLE AS EX_TABLE LEFT JOIN df ON (EX_TABLE.ID = df.ID)
) qry
"""
df_td = spark.read.format('jdbc').option('driver', 'com.teradata.jdbc.TeraDriver')\
.option('url', url).option('user', user_td).option('password', password_td)\
.option('dbtable', td_query).load()
I know that the left join could easily be done in Spark afterwards; this is just an example to illustrate the underlying issue.
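As the question already notes, the join can be done on the Spark side instead. A minimal sketch under that assumption, pushing only the Teradata-side query down over JDBC and joining with the temporary view's DataFrame afterwards (url, user_td and password_td as above):
# Push down only the query Teradata can resolve on its own
td_query = "(SELECT * FROM EX_SCHEMA.EX_TABLE) qry"
df_td = spark.read.format('jdbc').option('driver', 'com.teradata.jdbc.TeraDriver')\
    .option('url', url).option('user', user_td).option('password', password_td)\
    .option('dbtable', td_query).load()
# Do the left join in Spark
df_joined = df_td.join(df, df_td.ID == df.id, 'left')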

How to convert a rdd of pandas DataFrame to Spark DataFrame

I create an RDD of pandas DataFrames as an intermediate result. I want to convert it into a Spark DataFrame and eventually save it to a parquet file.
I want to know what the efficient way to do this is.
Thanks
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
# Pseudocode - TO_DATAFRAME() stands for the conversion step I am looking for
sc.parallelize(range(5)).map(create_df).\
    TO_DATAFRAME().write.format("parquet").save("parquet_file")
I have tried pd.concat to reduce the RDD to one big DataFrame, but that does not seem right.
Talking of efficiency: since Spark 2.3, Apache Arrow is integrated with Spark, and it is supposed to transfer data efficiently between the JVM and Python processes, which speeds up the conversion from a pandas DataFrame to a Spark DataFrame. You can enable it with
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
If your Spark distribution doesn't have Arrow integrated, this should not throw an error; the setting will just be ignored.
Sample code to run in the pyspark shell could look like this:
import numpy as np
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = pd.DataFrame(np.random.rand(100, 3))
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
Your create_df method returns a pandas DataFrame, and from that you can create a Spark DataFrame directly - I am not sure why you need sc.parallelize(range(5)).map(create_df).
So your full code could look like this:
import pandas as pd
import numpy as np
def create_df(x):
return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
pdf = create_df(10)
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
If you do want to keep the RDD of pandas DataFrames, you can flatten each one into rows and build the Spark DataFrame from the RDD:
import pandas as pd
import numpy as np
def create_df(x):
    df = pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
    return df.values.tolist()
sc.parallelize(range(5)).flatMap(create_df).toDF()\
    .write.format("parquet").save("parquet_file")

Pyspark And Cassandra - Extracting Data Into RDD as Fields from Map Field

I have a Cassandra table with a map field whose data looks as follows:
test_id   test_map
1         {tran_id=99, tran_type=sample}
I am attempting to add these map entries to the existing RDD that I am pulling this data into, as new fields on the exact same key, so the result would look as follows:
test_id   test_map                         tran_id   tran_type
1         {tran_id=99, tran_type=sample}   99        sample
I'm able to pull the fields fine using the Spark context, but I can't find a good method to transform this map field into the new columns shown above.
Sample Code:
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx pyspark-shell'
sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)
def test_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table_name, keyspace=keys_space_name)\
        .load()
    return table_df
df_test = test_df("test", "test")
Then to query the data I use Spark SQL like this:
df_test.registerTempTable("dftest")
df = sqlContext.sql(
    """
    select * from dftest
    """
)

How to get the numeric columns from DataFrame of Pyspark and calculating the zscore

sparkSession = SparkSession.builder.appName("example").getOrCreate()
df = sparkSession.read.json('hdfs://localhost/abc/zscore/')
I am able to read the data from HDFS, and I want to calculate the zscore for the numeric columns only.
You can convert df to pandas and calculate the zscore:
from pyspark.sql import SparkSession
from scipy.stats import zscore
sparkSession = SparkSession.builder.appName("example").getOrCreate()
df = sparkSession.read.json('hdfs://localhost/SmartRegression/zscore/').toPandas()
num_cols = df._get_numeric_data().columns
results = df[num_cols].apply(zscore)
print(results)
Note that toPandas() does not work well for big datasets, as it tries to load the whole dataset into driver memory.
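If the dataset is too big for toPandas(), here is a rough sketch of doing the same thing directly in Spark, assuming df is the Spark DataFrame read from HDFS (before any toPandas() call); numeric columns are picked from df.dtypes and each is standardised as (x - mean) / stddev:
from pyspark.sql import functions as F
# Pick out the numeric columns from the inferred schema
num_cols = [name for name, dtype in df.dtypes
            if dtype in ('int', 'bigint', 'float', 'double') or dtype.startswith('decimal')]
# Compute mean and stddev for every numeric column in one pass
stats = df.select(
    [F.mean(c).alias(f'{c}_mean') for c in num_cols] +
    [F.stddev(c).alias(f'{c}_std') for c in num_cols]
).first()
# Standardise each column: (x - mean) / stddev
zscored = df.select(
    [((F.col(c) - stats[f'{c}_mean']) / stats[f'{c}_std']).alias(f'{c}_zscore') for c in num_cols]
)
zscored.show()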
