How to append a dataframe to a specific partition of an existing partitioned table - apache-spark

I have an existing table like below
create_table=""" create table tbl1 (tran int,count int) partitioned by (year string) """
spark.sql(create_table)
insert_query="insert into tbl1 partition(year='2022') values (101,500)"
spark.sql(insert_query)
and I create a dataframe like below
from pyspark.sql.functions import *
from datetime import datetime
rows=[
(1,501),
(2,502),
(3,503)
]
from pyspark.sql.types import *
myschema =StructType([
StructField("id",LongType(),True),\
StructField("count",LongType(),True)
])
df=spark.createDataFrame(rows,myschema)
Now I want to append this dataframe to the above table, adding the values to the existing partition 2022.
How can I do that?

When you create the dataframe, you could include the year as well, then partitionBy and write into the table:
from pyspark.sql.types import StructType, StructField, LongType, StringType
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
rows=[
(1,501,'2022'),
(2,502,'2022'),
(3,503,'2022')
]
myschema =StructType([
StructField("id",LongType(),True),\
StructField("count",LongType(),True),\
StructField("year",StringType(),True)
])
df=spark.createDataFrame(rows,myschema)
df.write.mode('append').partitionBy('year').saveAsTable('tbl1')
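An alternative (a sketch, not part of the original answer) is to keep the partition value in the DataFrame and use insertInto, which resolves columns by position, so the columns must line up with the table definition and the partition column must come last:
# Sketch: insertInto matches columns by position (tran, count, year for tbl1)
df_with_year = df.selectExpr("id as tran", "count", "'2022' as year")
df_with_year.write.mode('append').insertInto('tbl1')
Depending on the Hive settings, dynamic partition inserts may also require hive.exec.dynamic.partition.mode=nonstrict.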

Related

Create dataframe with timestamp field

On Databricks, the following code snippet
%python
from pyspark.sql.types import StructType, StructField, TimestampType
from pyspark.sql import functions as F
data = [F.current_timestamp()]
schema = StructType([StructField("current_timestamp", TimestampType(), True)])
df = spark.createDataFrame(data, schema)
display(df)
displays a table with value "null". I would expect to see the current timestamp there. Why is this not the case?
createDataFrame does not accept PySpark expressions (such as F.current_timestamp()) as data values.
You could pass Python's datetime.datetime.now() instead:
import datetime
df = spark.createDataFrame([(datetime.datetime.now(),)], ['ts'])
Or, defining the schema beforehand:
from pyspark.sql.types import *
import datetime
data = [(datetime.datetime.now(),)]
schema = StructType([StructField("current_timestamp", TimestampType(), True)])
df = spark.createDataFrame(data, schema)
Or add the timestamp column afterwards:
from pyspark.sql import functions as F
df = spark.range(3)
df1 = df.select(
    F.current_timestamp().alias('ts')
)
df2 = df.withColumn('ts', F.current_timestamp())
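Another option (a sketch, not from the original answer) is to let Spark SQL evaluate the expression and build the DataFrame for you:
# Sketch: evaluate current_timestamp() in Spark SQL instead of passing it as data
df3 = spark.sql("SELECT current_timestamp() AS ts")
df3.show(truncate=False)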

Transform a column in a sparksql dataframe using python

Hi, I have a Spark SQL dataframe with a whole bunch of columns. One of the columns ("date") is a date field. I want to apply the following transformation to every row in that column.
This is what I would do if it were a pandas dataframe; I can't seem to figure out the Spark equivalent:
df["date"] = df["date"].map(lambda x: x.isoformat() + "Z")
The column has values of the form
2020-12-07 01:01:48
I want the values to be of the form:
2020-12-07T01:01:48Z
Try something like this:
from pyspark.sql.types import StructType, StructField, DateType, StringType, IntegerType
from pyspark.sql.functions import col
from pyspark.sql import functions as F
schema = StructType([
    StructField("date", StringType(), True),
    StructField("age", IntegerType(), True)])
df = spark.createDataFrame([(None, 22), (None, 25)], schema=schema)
Z = F.lit("Z").cast(StringType())
datetime = F.current_date().cast(StringType())
datetimeZ = F.concat(datetime,Z)
df = df.withColumn("date", datetimeZ)
df.show(5)
+-----------+---+
| date|age|
+-----------+---+
|2021-06-15Z| 22|
|2021-06-15Z| 25|
+-----------+---+
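If the goal is to reformat an existing timestamp column rather than generate the current date, a more direct sketch (assuming the "date" column is a timestamp or a 'yyyy-MM-dd HH:mm:ss' string) uses date_format:
from pyspark.sql import functions as F
# Sketch: render an existing timestamp column as ISO-8601 with a trailing "Z"
df = df.withColumn("date", F.date_format("date", "yyyy-MM-dd'T'HH:mm:ss'Z'"))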

Appending column name to column value using Spark

I have data in a comma-separated file, which I have loaded into a Spark data frame.
The data looks like:
A B C
1 2 3
4 5 6
7 8 9
I want to transform the above data frame in spark using pyspark as:
A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------
Then convert it to a list of lists using pyspark as:
[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]
And then run the FP-Growth algorithm using pyspark on the above data set.
The code that I have tried is below:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")
names=df.schema.names
Then I thought of doing something inside a for loop:
for name in names:
-----
------
After this I will be using fpgrowth:
df = spark.createDataFrame([
    (0, ["A_1", "B_2", "C_3"]),
    (1, ["A_4", "B_5", "C_6"])], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
There are a number of concepts here for those who normally use Scala, showing how to do the same with pyspark. The approach is somewhat different, but instructive; I certainly learnt a point about zipWithIndex in pyspark myself. Anyway.
The first part gets the data into the desired format; there are probably too many imports, but I am leaving them as is:
from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as f
source_df = spark.createDataFrame(
    [
        (1, 11, 111),
        (2, 22, 222)
    ],
    ["colA", "colB", "colC"]
)
intermediate_df = reduce(
    lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
    source_df.columns,
    source_df
)
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))
result_df = result_df.select(split(col("CONCAT_COLS"), r",\s*").alias("ARRAY_COLS"))
# Add 0,1,2,3, ... with zipWithIndex; we add it at the back, but that does not matter, you can move it around.
# Get the new structure: the existing fields (one in this case, but done flexibly) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])
# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
    lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
)
final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)
returns:
+---------------------------+-----+
|ARRAY_COLS |index|
+---------------------------+-----+
|[colA_1, colB_11, colC_111]|0 |
|[colA_2, colB_22, colC_222]|1 |
+---------------------------+-----+
The second part is the old zipWithIndex trick in pyspark, for when you need the 0, 1, ... index; it is painful compared to Scala, and in general this is easier to solve in Scala.
I am not sure about performance since it is not a foldLeft, but I think it is actually OK.
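From there, the ARRAY_COLS column can be passed to FPGrowth directly as the items column (a sketch following the question's own parameters, not part of the original answer):
from pyspark.ml.fpm import FPGrowth
# Sketch: run FP-Growth on the items column built above
fpGrowth = FPGrowth(itemsCol="ARRAY_COLS", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(final_result_df)
model.freqItemsets.show(truncate=False)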

Pyspark And Cassandra - Extracting Data Into RDD as Fields from Map Field

I have a Cassandra table with a map field whose data looks as follows:
test_id test_map
1 {tran_id=99, tran_type=sample}
I am attempting to add these fields to the existing RDD that I am pulling this data from, as new fields on the exact same key, which would look as follows:
test_id test_map tran_id tran_type
1 {tran_id=99, tran_type=sample} 99 sample
I'm able to pull the fields fine using spark context but I can't find a good method to transform this field into the RDD as expected above.
Sample Code:
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx pyspark-shell'
sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)
def test_df(keys_space_name, table_name):
    table_df = sqlContext.read \
        .format("org.apache.spark.sql.cassandra") \
        .options(table=table_name, keyspace=keys_space_name) \
        .load()
    return table_df
df_test = test_df("test", "test")
Then, to query the data, I use Spark SQL in the following format:
df_test.registerTempTable("dftest")
df = sqlContext.sql("""
    select * from dftest
""")

Adding a Vectors Column to a pyspark DataFrame

How do I add a Vectors.dense column to a pyspark dataframe?
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector
py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})
sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))
Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py, line 1848:
AssertionError: col should be Column
It doesn't like the DenseVector type as a column. Essentially, I have a pandas dataframe that I'd like to transform to a pyspark dataframe and add a column of the type Vectors.dense. Is there another way of doing this?
Constant Vectors cannot be added as a literal. You have to use a udf:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT
one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()
But I am not sure why you need that at all. If you want to transform existing columns into Vectors, use the appropriate pyspark.ml tools, such as VectorAssembler (see Encode and assemble multiple features in PySpark):
from pyspark.ml.feature import VectorAssembler
VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)
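For completeness, a small usage sketch (variable names assumed, not from the original answer) that keeps the transformed DataFrame and shows the assembled vector column alongside the originals:
from pyspark.ml.feature import VectorAssembler
# Sketch: keep the result of the transform and inspect the new vector column
assembler = VectorAssembler(inputCols=["time"], outputCol="features")
sdf_features = assembler.transform(sdf)
sdf_features.select("time", "event", "features").show()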
