Adding missing columns to a dataframe pyspark - python-3.x

When reading data from a text file with PySpark using the following code,
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.option("sep", "|").option("header", "false").csv('D:\\DATA-2021-12-03.txt')
My data text file looks like this:
col1|cpl2|col3|col4
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
But the output I got was,
col1|cpl2|col3|col4
112 |4344|fn1 | home_a
Is there a way to add those missing columns for the dataframe?
Expecting,
col1|cpl2|col3|col4|col5|col6|col7|col8
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>

You can explicitly specify the schema instead of inferring it.
from pyspark.sql.types import StructType, StringType
schema = StructType() \
    .add("col1", StringType(), True) \
    .add("col2", StringType(), True) \
    .add("col3", StringType(), True) \
    .add("col4", StringType(), True) \
    .add("col5", StringType(), True) \
    .add("col6", StringType(), True) \
    .add("col7", StringType(), True) \
    .add("col8", StringType(), True)
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('70475571_data.txt')
Output
+----+----+----+-------+-------+---------+-------+--------+
|col1|col2|col3| col4| col5| col6| col7| col8|
+----+----+----+-------+-------+---------+-------+--------+
|112 |4344|fn1 | home_a| extras| applied | <null>| <empty>|
+----+----+----+-------+-------+---------+-------+--------+
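If the column count is known but you would rather not spell out every field, the same schema can be built in a loop; a minimal sketch, equivalent to the explicit version above and reusing the question's original file path:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField(f"col{i}", StringType(), True) for i in range(1, 9)])
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('D:\\DATA-2021-12-03.txt')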

Related

How to override default timestamp format while reading csv in pyspark?

Suppose I have the following data in a CSV format,
ID|TIMESTAMP_COL
1|03-02-2003 08:37:55.671 PM
2|2003-02-03 08:37:55.671 AM
and my code for reading the above CSV is,
from pyspark.sql.types import *
sch = StructType([
    StructField("ID", StringType(), False),
    StructField("TIMESTAMP_COL", StringType(), True)
])
df = spark.read \
    .format("csv") \
    .option("encoding", "utf-8") \
    .option("mode", "PERMISSIVE") \
    .option("header", "true") \
    .option("dateFormat", "dd-MM-yyyy") \
    .option("timestampFormat", "dd-MM-yyyy HH:mm:ss.SSS a") \
    .option("delimiter", "|") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(sch) \
    .load("data.csv")
So, according to the given timestamp format, the record with id '2' should be rejected, since it is in a different format; instead it gets parsed, but the resulting value is wrong.
The output I am getting is,
df.show(truncate=False)
+-------------+-----------------------+-------------------+
| ID| TIMESTAMP_COL| _corrupt_record|
+-------------+-----------------------+-------------------+
| 1|2003-02-03 08:37:55.671| null|
| 2|0008-07-26 08:37:55.671| null|
+-------------+-----------------------+-------------------+
Why is this happening?
Not sure if it helps, but here is what I found:
In your schema the second field is declared as StringType; shouldn't it be TimestampType()?
I was able to reproduce your results with spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY"). I also ran tests with the other possible options for this parameter:
object LegacyBehaviorPolicy extends Enumeration {
  val EXCEPTION, LEGACY, CORRECTED = Value
}
and here is the doc for this parameter:
.doc("When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing " +
"dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. " +
"When set to CORRECTED, classes from java.time.* packages are used for the same purpose. " +
"The default value is EXCEPTION, RuntimeException is thrown when we will get different " +
"results.")
So with LEGACY I am getting the same results as you.
With EXCEPTION, Spark throws an exception:
org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
With CORRECTED, Spark returns nulls for both records.
It does, however, correctly parse the record with id 1 when I change the pattern to hh instead of HH,
so with something like this:
from pyspark.sql.types import *
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
sch = StructType([
    StructField("ID", StringType(), False),
    StructField("TIMESTAMP_COL", TimestampType(), True),
    StructField("_corrupt_record", StringType(), True)
])
df = spark.read \
    .format("csv") \
    .option("encoding", "utf-8") \
    .option("mode", "PERMISSIVE") \
    .option("header", "true") \
    .option("dateFormat", "dd-MM-yyyy") \
    .option("timestampFormat", "dd-MM-yyyy hh:mm:ss.SSS a") \
    .option("delimiter", "|") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(sch) \
    .load("dbfs:/FileStore/tables/stack.csv")
df.show(truncate=False)
I am able to get this on output:
+---+-----------------------+----------------------------+
|ID |TIMESTAMP_COL |_corrupt_record |
+---+-----------------------+----------------------------+
|1 |2003-02-03 20:37:55.671|null |
|2 |null |2|2003-02-03 08:37:55.671 AM|
+---+-----------------------+----------------------------+
I am getting null here because that is how the Spark parser works: when the pattern does not match, it assigns null. Your value is not going to be moved to _corrupt_record, I think, so if you want to remove non-matching timestamps you may filter out the nulls.
Edit: As mentioned in the comments, I was missing the _corrupt_record column in the schema; it is added now, and you can get the corrupted value if you need it.
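A small follow-up sketch, assuming the df from the snippet above: keep only the rows whose timestamp parsed, and inspect the rejected raw lines through the _corrupt_record column.
from pyspark.sql.functions import col
parsed = df.filter(col("TIMESTAMP_COL").isNotNull())  # rows whose timestamp matched the pattern
rejected = df.filter(col("_corrupt_record").isNotNull()).select("_corrupt_record")  # raw text of rows that did not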

Pyspark: Long to wide Format and format based on Column Value

I want to bring a pyspark DataFrame from Long to Wide Format and cast the resulting columns based on the DataType of a given column.
Example:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.functions import *
data = [
    ("BMK", "++FT001+TL001-MA11", "String", "2021-06-07"),
    ("RPM", "0", "Int16", "2021-06-07"),
    ("ACT_CURRENT", "-1330", "Int16", "2021-06-07")
]
schema = StructType([
    StructField("key", StringType(), True),
    StructField("value", StringType(), True),
    StructField("dataType", StringType(), True),
    StructField("timestamp", StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+-----------+------------------+--------+----------+
| key| value|dataType| timestamp|
+-----------+------------------+--------+----------+
| BMK|++FT001+TL001-MA11| String|2021-06-07|
| RPM| 0| Int16|2021-06-07|
|ACT_CURRENT| -1330| Int16|2021-06-07|
+-----------+------------------+--------+----------+
Column dataType holds the desired datatype.
Outcome should look like this:
+----------+-----------+------------------+---+
| timestamp|ACT_CURRENT| BMK|RPM|
+----------+-----------+------------------+---+
|2021-06-07| -1330|++FT001+TL001-MA11| 0|
+----------+-----------+------------------+---+
The fields "ACT_CURRENT", "BMK" and "RPM" should have the right datatypes (Int16/String/Int16).
There is only one entry per timestamp.
What I have so far only widens the DF; it does not cast to the right datatypes:
df_wide = (df.groupBy("timestamp").pivot("key").agg(first('value')))
Help is much appreciated!
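One possible approach, sketched under stated assumptions rather than a verified answer: collect the key-to-dataType mapping from the long frame, pivot as above, then cast each pivoted column. Mapping "Int16" to ShortType (and anything unrecognised to StringType) is my assumption, not something given in the question.
from pyspark.sql.functions import first, col
from pyspark.sql.types import ShortType, StringType
type_map = {"Int16": ShortType(), "String": StringType()}  # assumed label-to-type mapping
key_types = {r["key"]: r["dataType"] for r in df.select("key", "dataType").distinct().collect()}
df_wide = df.groupBy("timestamp").pivot("key").agg(first("value"))
for k, t in key_types.items():
    df_wide = df_wide.withColumn(k, col(k).cast(type_map.get(t, StringType())))
df_wide.show()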

Parsing a Type 4 Nested Parquet and flattening/Explode JSON values in a column in pyspark

I am relatively new to PySpark, and for orchestration I use Databricks.
[Just FYI: my source Parquet holds an SCD Type 4 dataset where the current snapshot and its history are maintained in a single row; the current snapshot sits in individual Parquet columns while the history snapshot sits in one column as a JSON array.]
I believe my solution could be the one used in the link below and I just want to expand that solution to work for me (I am not able to comment on that post, and although my problem looks similar, I believe it is different):
https://stackoverflow.com/questions/56409454/casting-a-column-to-json-dict-and-flattening-json-values-in-a-column-in-pyspark/56409889#56409889
Reference courtesies: @Gingerbread, @Kafels
I tried to use the resolution from that one, but I am getting some errors.
Here's what my dataframe looks like:
|HISTORY|
|-------|
|[{"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"FSDF CC 10 ml ","DATE":"2021-06-11"}, {"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"BBB CC ","DATE":"2021-03-11"}, {"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"BBB DD ","DATE":"2021-02-27"}]|
|[{"HASH_KEY":"BK08ZMe/1UTHsenUAOMUwQ==","SOURCE_KEY":"BBBBB","ATTR1":"JAMES 50 ml ","DATE":"2021-03-02"}, {"HASH_KEY":"BK08ZMe/1UTHsenUAOMUwQ==","SOURCE_KEY":"BBBBB","ATTR1":"JAS 50 ml ","DATE":"2021-02-02"}]|
|null|
The DataFrame Schema is
root
|-- HISTORY: array (nullable = true)
| |-- element: string (containsNull = true)
The desired output is just the flattened JSON values as columns:
|HASH_KEY |SOURCE_KEY|DATE |ATTR1 |
|:-----------------------|:--------:|:--------:|---------------:|
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-06-11|FSDF CC 10 ml |
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-03-11|BBB CC |
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-02-27|BBB DD |
|BK08ZMe/1UTHsenUAOMUwQ==|BBBBB |2021-03-02|JAMES 50 ml |
|BK08ZMe/1UTHsenUAOMUwQ==|BBBBB |2021-02-02|JAS 50 ml |
|CAsaZMe/1UTHsenUasasaW==|BBBBB |2021-09-11|null |
The code snippet I tried:
import re
import json
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, TimestampType
schema = ArrayType(
    StructType(
        [
            StructField("HASH_KEY1", StringType()),
            StructField("SOURCE_KEY1", StringType()),
            StructField("ATTR1X", StringType()),
            StructField("DATE1", TimestampType())
        ]
    )
)
@f.udf(returnType=schema)
def parse_col(column):
    updated_values = []
    for it in re.finditer(r'[.*?]', column):
        parse = json.loads(it.group())
        for key, values in parse.items():
            for value in values:
                value['HASH_KEY1'] = key
                updated_values.append(value)
    return updated_values
df = df \
    .withColumn('tmp', parse_col(f.col('HISTORY'))) \
    .withColumn('tmp', f.explode(f.col('tmp'))) \
    .select(f.col('HASH_KEY'),
            f.col('tmp').HASH_KEY1.alias('HASH_KEY1'),
            f.col('tmp').SOURCE_KEY1.alias('SOURCE_KEY1'),
            f.col('tmp').ATTR1X.alias('ATTR1X'),
            f.col('tmp').DATE1.alias('DATE1'))
df.show()
The following is the result I got:
|HASH_KEY1|SOURCE_KEY1|ATTR1X|DATE1|
|:-------:|:---------:|:----:|----:|
| | | | |
| | | | |
|:-------:|:---------:|:----:|----:|
I am having trouble getting the expected output.
Any help would be greatly appreciated. I am using Spark 2.0+.
Thank you!
I understood the usage of json_tuple and simplified my approach: I can directly explode the array into a string and then use the json_tuple function to convert it into flattened columns.
So the answer snippet now looks as follows:
from pyspark.sql import functions as f
from pyspark.sql.functions import col, json_tuple
DF_EXPLODE = df \
    .withColumn('Expand', f.explode(f.col('HISTORY'))) \
    .select(f.col('Expand'))
DF_FLATTEN = DF_EXPLODE \
    .select("*", json_tuple("Expand", "HASH_KEY").alias("HASH_KEY")) \
    .select("*", json_tuple("Expand", "SOURCE_KEY").alias("SOURCE_KEY")) \
    .select("*", json_tuple("Expand", "DATE").alias("DATE")) \
    .select("*", json_tuple("Expand", "ATTR1").alias("ATTR1"))
I also worked on my initial PySpark looping approach, and the following is the code.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
DF_DIM2 = DF_DIM.withColumn("sizer",size(col('HISTORY'))).sort("sizer",ascending=False)
max_len = DF_DIM2.select('sizer').take(1)[0][0]
print(max_len)
expanded_df = DF_DIM.select(['*'] + [col('HISTORY')[i].alias(f'HISTORY_{i}') for i in range(max_len)])
original_cols = [i for i in expanded_df.columns if 'HISTORY_' not in i ]
cols_exp = [i for i in expanded_df.columns if 'HISTORY_' in i]
schema = StructType([
    StructField("HASH_KEY", StringType(), True),
    StructField("SOURCE_KEY", StringType(), True),
    StructField("DATE", StringType(), True),
    StructField("ATTR1", StringType(), True)
])
final_df = expanded_df.select([from_json(i, schema).alias(i) for i in cols_exp])
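The original_cols list computed above is not used in the final select; if the current-snapshot columns should survive alongside the parsed history structs, a small variant (a sketch, not part of the tested code) would be:
final_df = expanded_df.select(original_cols + [from_json(i, schema).alias(i) for i in cols_exp])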
I did some use-case testing, joining a 3.72 billion row fact Parquet with a 390k row Type 4 nested dimension Parquet; the looping approach took 2.5 minutes while the explode option took over 4 minutes.
The explode option multiplies each Type 4 record by the number of changes the dimension recorded in the HISTORY column. So if, on average, every dimension changed 10 times, then 390k * 10 = 3.9M records are held in memory to join with the fact, leading to longer processing times.

Spark logs to dataframe, create dataframe from the log files

#Spark #Python
Objective:
Read the location of the log files, extract the CSV text table data from the logs, and print the JSON of the table data (table columns = CSV-retrieved table columns + serial number + timestamp).
Read serial_no, time, s3_path from the database.
s3_path contains CSV files.
The output is needed as a dataframe of the columns in the tables + primary_key + timestamp.
Current code (pseudo):
df_context = sparkSession.read \
    .format("com.databricks.spark.redshift") \
    .option("url",
            "some url with id{}&password={}".format(redshift_user, redshift_pass)) \
    .option("query", query) \
    .option("tempdir", s3_redshift_temp_dir) \
    .option("forward_spark_s3_credentials", True)
df = df_context.load()
+-------------+-------------------+--------------------+
|serial_number| test_date| s3_path|
+-------------+-------------------+--------------------+
| A0123456|2019-07-10 04:11:52|s3://test-bucket-...|
| A0123456|2019-07-24 23:48:03|s3://test-bucket-...|
| A0123456|2019-07-22 20:56:57|s3://test-bucket-...|
| A0123456|2019-07-22 20:56:57|s3://test-bucket-...|
| A0123456|2019-07-22 20:58:36|s3://test-bucket-...|
+-------------+-------------------+--------------------+
Since we cannot pass the Spark context to the worker nodes, I used boto3 to read the text file and processed the text to fetch the CSV table structure.
I am not sharing the proprietary code for retrieving the table from the log here.
spark.udf.register("read_s3_file", read_s3_file)
df_with_string_csv = df.withColumn('table_csv_data', read_s3_file(drive_event_tab.s3_path))
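spark.udf.register mainly exposes the function to SQL expressions; a sketch of the DataFrame-API wiring instead (read_s3_file is the proprietary function mentioned above, and the assumption here is that it returns the extracted CSV text as a plain string):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
read_s3_file_udf = udf(read_s3_file, StringType())  # assumed return type: the CSV text as one string
df_with_string_csv = df.withColumn('table_csv_data', read_s3_file_udf(col('s3_path')))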
df_with_string_csv now contains the sample below:
+-------------+-------------------+--------------------+----------------------+
|serial_number| test_date| s3_path| table_csv_data |
+-------------+-------------------+--------------------+----------------------+
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1BE|2019-05-08 09:26:55|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 06:54:28|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 00:19:52|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-24 22:24:40|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-09-12 22:15:19|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:27:56|s3://test-bucket-...|col1,col2,col3,col4...|
+-------------+-------------------+--------------------+----------------------+
sample table_csv_data column contains:
timestamp,partition,offset,key,value
1625218801350,97,33009,2CKXTKAT_20210701193302_6400_UCMP,458969040
1625218801349,41,33018,3FGW9S6T_20210701193210_6400_UCMP,17569160
I am trying to achieve the final dataframe as below; please help.
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
|serial_number| test_date| timestamp| partition | offset | key | value |
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
| 1050D1B0|2019-05-07 15:41:11| 1625218801350 | 97 | 33009 | 2CKXTKAT_20210701193302_6400_UCMP | 458969040 |
| 1050D1B0|2019-05-07 15:41:11| 1625218801349 | 41 | 33018 | 3FGW9S6T_20210701193210_6400_UCMP | 17569160 |
..
..
..
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
For Spark 2.4.0+ you just need a combination of split, explode and array_except.
Please use repartition to optimize, as explode might create a lot of rows (see the sketch after the output below).
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import split, explode, col, array_except, array, trim
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
df = df \
    .withColumn('table_csv_data', split(col('table_csv_data'), '\n')) \
    .withColumn('table_csv_data', array_except(col('table_csv_data'), array([col('table_csv_data')[0]]))) \
    .withColumn('table_csv_data', explode(col('table_csv_data'))) \
    .withColumn('table_csv_data', split(col('table_csv_data'), ',')) \
    .withColumn('timestamp', trim(col('table_csv_data')[0])) \
    .withColumn('partition', trim(col('table_csv_data')[1])) \
    .withColumn('offset', trim(col('table_csv_data')[2])) \
    .withColumn('key', trim(col('table_csv_data')[3])) \
    .withColumn('value', trim(col('table_csv_data')[4])) \
    .drop('table_csv_data')
df.show(truncate=False)
+-------------+-------------------+-----------------+-------------+---------+------+---------------------------------+---------+
|serial_number|test_date |s3_path |timestamp |partition|offset|key |value |
+-------------+-------------------+-----------------+-------------+---------+------+---------------------------------+---------+
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801350|97 |33009 |2CKXTKAT_20210701193302_6400_UCMP|458969040|
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801349|41 |33018 |3FGW9S6T_20210701193210_6400_UCMP|17569160 |
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801350|97 |33009 |2CKXTKAT_20210701193302_6400_UCMP|458969040|
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801349|41 |33018 |3FGW9S6T_20210701193210_6400_UCMP|17569160 |
+-------------+-------------------+-----------------+-------------+---------+------+---------------------------------+---------+
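As a hedged sketch of the repartition hint above, applied before the explode-heavy chain, plus casting the numeric fields with the otherwise unused IntegerType import (the partition count is an arbitrary illustrative value):
df = df.repartition(200, 'serial_number')  # tune the count to the cluster and data volume
df = df \
    .withColumn('partition', col('partition').cast(IntegerType())) \
    .withColumn('offset', col('offset').cast(IntegerType()))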

multiple criteria for aggregation on pySpark Dataframe

I have a pySpark dataframe that looks like this:
+-------------+----------+
| sku| date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+
I want to group by sku, and then calculate the min and max dates. If I do this:
df_testing.groupBy('sku') \
.agg({'date': 'min', 'date':'max'}) \
.limit(10) \
.show()
the behavior is the same as Pandas, where I only get the sku and max(date) columns. In Pandas I would normally do the following to get the results I want:
df_testing.groupBy('sku') \
.agg({'date': ['min','max']}) \
.limit(10) \
.show()
However on pySpark this does not work, and I get a java.util.ArrayList cannot be cast to java.lang.String error. Could anyone please point me to the correct syntax?
Thanks.
You cannot express multiple aggregations on one column with a dict. Use:
>>> from pyspark.sql import functions as F
>>>
>>> df_testing.groupBy('sku').agg(F.min('date'), F.max('date'))
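If friendlier column names are wanted, aliases can be added along the same lines:
>>> df_testing.groupBy('sku').agg(F.min('date').alias('min_date'), F.max('date').alias('max_date'))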
