Suppose I have the following data in a CSV format,
ID|TIMESTAMP_COL
1|03-02-2003 08:37:55.671 PM
2|2003-02-03 08:37:55.671 AM
and my code for reading the above CSV is,
from pyspark.sql.types import *
sch = StructType([StructField("ID",StringType(),False),StructField("TIMESTAMP_COL",StringType(),True)])
df = spark.read \
.format("csv") \
.option("encoding", "utf-8") \
.option("mode", "PERMISSIVE") \
.option("header", "true") \
.option("dateFormat", "dd-MM-yyyy") \
.option("timestampFormat", "dd-MM-yyyy HH:mm:ss.SSS a") \
.option("delimiter", "|") \
.option("columnNameOfCorruptRecord", "_corrupt_record") \
.schema(sch) \
.load("data.csv")
So, according to the given timestamp format, the record with ID '2' should be rejected because it is in a different format; instead it gets parsed, but the resulting value is wrong.
The output I am getting is,
df.show(truncate=False)
+-------------+-----------------------+-------------------+
| ID| TIMESTAMP_COL| _corrupt_record|
+-------------+-----------------------+-------------------+
| 1|2003-02-03 08:37:55.671| null|
| 2|0008-07-26 08:37:55.671| null|
+-------------+-----------------------+-------------------+
Why is this happening?
Not sure if it helps, but here is what I found:
In your schema the second field is declared as StringType; shouldn't it be TimestampType()?
I was able to reproduce your results with spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY"). I also ran tests with the other possible values of this parameter:
object LegacyBehaviorPolicy extends Enumeration {
val EXCEPTION, LEGACY, CORRECTED = Value
}
and here is the documentation for this parameter:
.doc("When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing " +
"dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. " +
"When set to CORRECTED, classes from java.time.* packages are used for the same purpose. " +
"The default value is EXCEPTION, RuntimeException is thrown when we will get different " +
"results.")
So with LEGACY I am getting the same results as you.
With EXCEPTION, Spark throws an exception:
org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
With CORRECTED, Spark returns nulls for both records.
It does, however, parse the record with id 1 correctly when I change the pattern to hh (12-hour clock, which matches the AM/PM marker) instead of HH.
So with something like this:
from pyspark.sql.types import *
spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED")
sch = StructType([StructField("ID",StringType(),False),StructField("TIMESTAMP_COL",TimestampType(),True), StructField("_corrupt_record", StringType(),True)])
df = spark.read \
.format("csv") \
.option("encoding", "utf-8") \
.option("mode", "PERMISSIVE") \
.option("header", "true") \
.option("dateFormat", "dd-MM-yyyy") \
.option("timestampFormat", "dd-MM-yyyy hh:mm:ss.SSS a") \
.option("delimiter", "|") \
.option("columnNameOfCorruptRecord", "_corrupt_record") \
.schema(sch) \
.load("dbfs:/FileStore/tables/stack.csv") \
df.show(truncate = False)
I am able to get this on output:
+---+-----------------------+----------------------------+
|ID |TIMESTAMP_COL |_corrupt_record |
+---+-----------------------+----------------------------+
|1 |2003-02-03 20:37:55.671|null |
|2 |null |2|2003-02-03 08:37:55.671 AM|
+---+-----------------------+----------------------------+
I am getting null here because that is how the Spark parser works: when a value does not match the pattern, it assigns null (and, with the _corrupt_record column in the schema, the raw line is kept there). So if you want to drop the timestamps that do not match, you can simply filter out the nulls, as in the sketch below.
Edit: As mentioned in the comment, I was missing the _corrupt_record column in the schema; it is added now, so you can get the corrupted value if you need it.
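A minimal sketch of that filtering, assuming the DataFrame produced by the snippet above:
# Keep only the rows whose timestamp parsed successfully
clean_df = df.filter(df["TIMESTAMP_COL"].isNotNull())
# Or, since _corrupt_record is now part of the schema, inspect the rejected rows
bad_df = df.filter(df["_corrupt_record"].isNotNull())
bad_df.show(truncate=False)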
When reading data from a text file with PySpark using the following code,
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.option("sep", "|").option("header", "false").csv('D:\\DATA-2021-12-03.txt')
My data text file looks like,
col1|cpl2|col3|col4
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
But the output I got was,
col1|cpl2|col3|col4
112 |4344|fn1 | home_a
Is there a way to add those missing columns for the dataframe?
Expecting,
col1|cpl2|col3|col4|col5|col6|col7|col8
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
You can explicitly specify the schema instead of inferring it.
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
schema = StructType() \
.add("col1",StringType(),True) \
.add("col2",StringType(),True) \
.add("col3",StringType(),True) \
.add("col4",StringType(),True) \
.add("col5",StringType(),True) \
.add("col6",StringType(),True) \
.add("col7",StringType(),True) \
.add("col8",StringType(),True)
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('70475571_data.txt')
Output
+----+----+----+-------+-------+---------+-------+--------+
|col1|col2|col3| col4| col5| col6| col7| col8|
+----+----+----+-------+-------+---------+-------+--------+
|112 |4344|fn1 | home_a| extras| applied | <null>| <empty>|
+----+----+----+-------+-------+---------+-------+--------+
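If you prefer not to type out the fields by hand, a hedged alternative is to build the same 8-column string schema programmatically (the names col1..col8 match the ones used above):
from pyspark.sql.types import StructType, StructField, StringType

# Generate col1..col8 as nullable string fields
schema = StructType([StructField(f"col{i}", StringType(), True) for i in range(1, 9)])
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('70475571_data.txt')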
I am new to PySpark. I am using the Impala JDBC driver ImpalaJDBC41.jar. In my PySpark code, I use the following.
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:impala://<instance>:21051") \
.option("query", "select dst_val,node_name,trunc(starttime,'SS') as starttime from def.tbl_dst where node_name is not null and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')") \
.option("user", "") \
.option("password", "") \
.load()
But the above does not work: the "node_name is not null" filter is not applied, and trunc(starttime,'SS') is not working either. Any help would be appreciated.
sample input data :
dst_val,node_name,starttime
BCD098,,2021-03-26 15:42:06.890000000
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
Expected output :
dst_val,node_name,starttime
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
For debugging, I am trying to print the DataFrame with df.show(), but it still shows the record with the null node_name. The datatype of node_name is STRING.
Can you please try this?
select dst_val,node_name,cast( from_timestamp(starttime,'SSS') as bigint) as starttime from def.tbl_dst where (node_name is not null and node_name<>'' ) and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')
I think node_name has an empty string in it, and the SQL above (I added and node_name<>'') will take care of it.
Now, if there is some non-printable character in it, then we may have to check for that accordingly.
EDIT: Since the NOT NULL filter works when the query is run in Impala directly, I think this may be a Spark issue.
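If the pushed-down predicate keeps misbehaving, one workaround (a sketch, not specific to Impala) is to filter the blanks and nulls again on the Spark side after loading:
from pyspark.sql import functions as F

# Keep only rows where node_name is neither null nor blank after trimming
filtered_df = df.filter(
    F.col("node_name").isNotNull() & (F.trim(F.col("node_name")) != "")
)
filtered_df.show(truncate=False)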
I found that the documentation for the jdbc function in PySpark 3.0.1 at
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader says:
column – the name of a column of numeric, date, or timestamp type that
will be used for partitioning.
I thought it accepts a datetime column to partition the query.
So I tried this on EMR-6.2.0 (PySpark 3.0.1):
sql_conn_params = get_spark_conn_params() # my function
sql_conn_params['column'] ='EVENT_CAPTURED'
sql_conn_params['numPartitions'] = 8
# sql_conn_params['upperBound'] = datetime.strptime('2016-01-01', '%Y-%m-%d') # another trial
# sql_conn_params['lowerBound'] = datetime.strptime('2016-01-10', '%Y-%m-%d')
sql_conn_params['upperBound'] = '2016-01-01 00:00:00'
sql_conn_params['lowerBound'] = '2016-01-10 00:00:00'
df = (spark.read.jdbc(
table=tablize(sql),
**sql_conn_params
))
df.show()
I got this error:
invalid literal for int() with base 10: '2016-01-01 00:00:00'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 625, in jdbc
return self._df(self._jreader.jdbc(url, table, column, int(lowerBound), int(upperBound),
ValueError: invalid literal for int() with base 10: '2016-01-01 00:00:00'
I looked at the source code here
https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L865
and found that it does not support the datetime type, contrary to what the documentation says.
My question is:
The code shows that a datetime-typed partition column is not supported in PySpark, so why does the documentation say it is supported?
Thanks,
Yan
It does support it.
The issue here is that the spark.read.jdbc method only accepts integral values for the upperBound and lowerBound parameters.
But you can use the load method with DataFrameReader.option to specify upperBound and lowerBound for other column types such as date/timestamp:
df = spark.read.format("jdbc") \
.option("url", "jdbc:mysql://server/db") \
.option("dbtable", "table_name") \
.option("user", "user") \
.option("password", "xxxx") \
.option("partitionColumn", "EVENT_CAPTURED") \
.option("lowerBound", "2016-01-01 00:00:00") \
.option("upperBound", "2016-01-10 00:00:00") \
.option("numPartitions", "8") \
.load()
Or by passing a dict of options:
df = spark.read.format("jdbc") \
.options(**sql_conn_params) \
.load()
You can see all available options and examples here: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
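For example, the dict of options could look like this (a sketch; all connection values are placeholders):
# Option keys match the .option() calls shown above
sql_conn_params = {
    "url": "jdbc:mysql://server/db",
    "dbtable": "table_name",
    "user": "user",
    "password": "xxxx",
    "partitionColumn": "EVENT_CAPTURED",
    "lowerBound": "2016-01-01 00:00:00",
    "upperBound": "2016-01-10 00:00:00",
    "numPartitions": "8",
}

df = spark.read.format("jdbc").options(**sql_conn_params).load()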
I'm quite new to spark and there is one thing that I don't understand: how to manipulate column content.
I have a set of csv files organised as follows: each dsX is a table, and I would like to load the data at once for each table.
So far no problems:
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*")
But there is one piece of information missing: the client_id, and this client id is the first part of the csv file name: clientId_table_category.csv
So I tried to do this:
from pyspark.sql import functions as fn

def extract_path(patht):
    print(patht)
    return patht
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*") \
.withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))
But the print returns:
Column<b'input_file_name()'>
And I can't do much with this.
I'm quite stuck here, how do you manipulate data in this configuration?
Another solution for me is to load each csv one by one and parse the clientId from the file name manually, but I was wondering if there wouldn't be a more powerful solution with spark.
You are going a little too far:
df = spark.read.csv(
table+"/*",
header=True,
escape='"'
).withColumn("clientId", fn.input_file_name())
This will create a column with the full path. Then you just need some extra string manipulation, which is easy using a UDF. You can also do it with built-in functions, but it is a bit trickier (see the sketch after the UDF example below).
from pyspark.sql.types import StringType

# Extract the clientId: take the file name (last path segment) and keep the part before the first underscore
@fn.udf(StringType())
def get_id(in_string):
    return in_string.split("/")[-1].split("_")[0]

df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
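For reference, a sketch of the built-in-function alternative mentioned above (no UDF): it reads the path again via input_file_name() and uses regexp_extract, assuming file names of the form clientId_table_category.csv:
from pyspark.sql import functions as fn

# Capture the part of the file name (last path segment) before the first underscore
df = df.withColumn(
    "clientId",
    fn.regexp_extract(fn.input_file_name(), r"([^/_]+)_[^/]*$", 1)
)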
I have a csv file in following format -
id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends. á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
I want to read it in pyspark. My code is -
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
StructField("Id", StringType()),
StructField("Sentence", StringType()),
])
df = sqlContext.read.format("com.databricks.spark.csv") \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", "\"") \
.schema(schema) \
.load("mycsv.csv")
But the result I am getting is -
+--------------------------------------------------------------+-------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+-------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to á |
|the periods of my life when I think that I did not use this á |null |
|short time. |" |
...
I want to read it into 2 columns, one containing the Id and the other the Sentence.
The sentence parts should be joined at the á character; as you can see, the reader moves on to the next line without finding the closing delimiter.
My output should look like this -
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
I have considered only one id in the example.
What modification is needed in my code?
Just update Spark to 2.2 or later, if you haven't done so already, and use the multiLine option:
df = spark.read \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", "\"") \
.schema(schema) \
.csv("mycsv.csv", multiLine=True)
If you do that, you can remove á with regexp_replace:
df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
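For completeness, a minimal end-to-end sketch that chains both steps (assuming the schema defined in the question and Spark >= 2.2; regexp_replace lives in pyspark.sql.functions):
from pyspark.sql.functions import regexp_replace

# Read the multi-line CSV and strip the á continuation marker in one go
df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .csv("mycsv.csv", multiLine=True) \
    .withColumn("Sentence", regexp_replace("Sentence", "á", ""))
df.show(truncate=False)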