Read CSV and join lines on an ASCII character - pyspark, apache-spark

I have a CSV file in the following format -
id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends. á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
I want to read it in PySpark. My code is -
schema = StructType([
StructField("Id", StringType()),
StructField("Sentence", StringType()),
])
df = sqlContext.read.format("com.databricks.spark.csv") \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", "\"") \
.schema(schema) \
.load("mycsv.csv")
But the result I am getting is -
+--------------------------------------------------------------+-------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+-------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to á |
|the periods of my life when I think that I did not use this á |null |
|short time. |" |
...
I want to read it into 2 columns, one containing the Id and the other the Sentence.
The wrapped lines of each sentence should be joined at the ASCII character á, because right now the reader moves on to the next line without finding the closing delimiter.
My output should look like this -
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
I have shown only one id in the example.
What modification is needed in my code?

Just update Spark to 2.2 or later, if you haven't done so already, and use the multiLine option:
# note: keep the default ',' delimiter - '"' is the quote character, not the field separator
df = spark.read \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", ",") \
.schema(schema) \
.csv("mycsv.csv", multiLine=True)
If you do that, you can remove á with regexp_replace:
from pyspark.sql.functions import regexp_replace
df = df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
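Putting the two steps together, here is a minimal end-to-end sketch (assuming Spark 2.2+, the schema defined in the question, and the default ',' delimiter with '"' as the quote character):
from pyspark.sql.functions import regexp_replace

df = (spark.read
      .option("header", "false")
      .schema(schema)
      .csv("mycsv.csv", multiLine=True)   # quoted fields may span several physical lines
      .withColumn("Sentence", regexp_replace("Sentence", "á", "")))  # drop the wrap marker
df.show(truncate=False)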

Related

Can I exclude the column used for partitioning when writing to parquet?

I need to create parquet files, reading from JDBC. The table is quite big and all columns are varchars, so I created a new column with a random int to use for partitioning.
My JDBC read looks something like this:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
and my write to parquet looks something like this:
data_df.write.mode("overwrite").partitionBy('random_number').parquet("parquetfile.parquet")
The generated parquet also contains the 'random_number' column, but I only made that column for partitioning. Is there a way to exclude that column from the written parquet files?
Thanks for any help, I'm new to Spark :)
I'm expecting to exclude the random_number column, but I don't know whether that is possible when I need the column for partitioning.
If you want to repartition in memory using a column without writing it, you can call .repartition(col("random_number")) before writing, drop the column, and then write your data:
from pyspark.sql.functions import col

data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load() \
.repartition(col("random_number")).drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")

How to override default timestamp format while reading csv in pyspark?

Suppose I have the following data in a CSV format,
ID|TIMESTAMP_COL
1|03-02-2003 08:37:55.671 PM
2|2003-02-03 08:37:55.671 AM
and my code for reading the above CSV is,
from pyspark.sql.types import *
sch = StructType([StructField("ID",StringType(),False),StructField("TIMESTAMP_COL",StringType(),True)])
df = spark.read \
.format("csv") \
.option("encoding", "utf-8") \
.option("mode", "PERMISSIVE") \
.option("header", "true") \
.option("dateFormat", "dd-MM-yyyy") \
.option("timestampFormat", "dd-MM-yyyy HH:mm:ss.SSS a") \
.option("delimiter", "|") \
.option("columnNameOfCorruptRecord", "_corrupt_record") \
.schema(sch) \
.load("data.csv")
So, according to the given timestamp format, the record with id '2' should be rejected because it has a different format, but instead it gets parsed and the value comes out wrong.
The output I am getting is,
df.show(truncate=False)
+-------------+-----------------------+-------------------+
| ID| TIMESTAMP_COL| _corrupt_record|
+-------------+-----------------------+-------------------+
| 1|2003-02-03 08:37:55.671| null|
| 2|0008-07-26 08:37:55.671| null|
+-------------+-----------------------+-------------------+
Why is this happening?
Not sure if it helps, but here is what I found:
In your schema the second field is declared as StringType; shouldn't it be TimestampType()?
I was able to reproduce your results with spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY"). I also ran tests with the other possible options for this parameter:
object LegacyBehaviorPolicy extends Enumeration {
val EXCEPTION, LEGACY, CORRECTED = Value
}
and here is the doc for this parameter:
.doc("When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing " +
"dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. " +
"When set to CORRECTED, classes from java.time.* packages are used for the same purpose. " +
"The default value is EXCEPTION, RuntimeException is thrown when we will get different " +
"results.")
So with LEGACY I am getting the same results as you.
With EXCEPTION Spark throws an exception:
org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
With CORRECTED Spark returns nulls for both records.
It does, however, parse the record with id 1 correctly when I change the pattern to hh instead of HH,
so with something like this:
from pyspark.sql.types import *
spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED")
sch = StructType([StructField("ID",StringType(),False),StructField("TIMESTAMP_COL",TimestampType(),True), StructField("_corrupt_record", StringType(),True)])
df = spark.read \
.format("csv") \
.option("encoding", "utf-8") \
.option("mode", "PERMISSIVE") \
.option("header", "true") \
.option("dateFormat", "dd-MM-yyyy") \
.option("timestampFormat", "dd-MM-yyyy hh:mm:ss.SSS a") \
.option("delimiter", "|") \
.option("columnNameOfCorruptRecord", "_corrupt_record") \
.schema(sch) \
.load("dbfs:/FileStore/tables/stack.csv") \
df.show(truncate = False)
I am able to get this output:
+---+-----------------------+----------------------------+
|ID |TIMESTAMP_COL |_corrupt_record |
+---+-----------------------+----------------------------+
|1 |2003-02-03 20:37:55.671|null |
|2 |null |2|2003-02-03 08:37:55.671 AM|
+---+-----------------------+----------------------------+
I am getting null here because that is how the Spark parser works: when the pattern does not match, it assigns null and (I thought) the value would not be moved to _corrupt_record. So if you want to remove the non-matching timestamps, you can filter out the nulls, as in the sketch below.
Edit: As mentioned in the comments, I was missing the _corrupt_record column in the schema; it is added now, so you can get the corrupt value if you need it.
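A minimal sketch of that filtering step, assuming the df from the snippet above:
from pyspark.sql.functions import col

# keep only the rows whose timestamp matched the expected pattern
clean_df = df.filter(col("TIMESTAMP_COL").isNotNull())
clean_df.show(truncate=False)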

Add leading zero to PySpark time components

I have this code, which writes data partitioned by date and time:
df = df.withColumn("year", F.year(col(date_column))) \
.withColumn("month", F.month(col(date_column))) \
.withColumn("day", F.dayofmonth(col(date_column))) \
.withColumn("hour", F.hour(col(date_column)))
df.write.partitionBy("year","month","day","hour").mode("append").format("csv").save(destination)
The output gets written to month=9. How can I make it month=09? The same goes for hours, e.g. hour=04.
You could try
.withColumn("month", F.date_format(col(date_column), "MM"))
and
.withColumn("hour", F.date_format(col(date_column), "HH"))

pyspark parse filename on load

I'm quite new to Spark and there is one thing that I don't understand: how to manipulate column content.
I have a set of CSV files laid out as follows: each dsX is a table, and I would like to load the data at once for each table.
So far no problems:
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*")
But there is one piece of information missing: the client_id, and this client id is the first part of the CSV file name: clientId_table_category.csv
So I tried to do this:
def extract_path(patht):
print(patht)
return patht
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*") \
.withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))
But the print returns:
Column<b'input_file_name()'>
And I can't do much with this.
I'm quite stuck here, how do you manipulate data in this configuration?
Another solution for me is to load each csv one by one and parse the clientId from the file name manually, but I was wondering if there wouldn't be a more powerful solution with spark.
You are going a little too far:
df = spark.read.csv(
table+"/*",
header=True,
sep='\\'
).withColumn("clientId", fn.input_file_name())
This will create a column with the full path. Then you just need some extra string manipulation - easy using a UDF. You can also do that with built-in functions (see the sketch after the UDF example below), but it is trickier.
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType

@fn.udf(StringType())
def get_id(in_string):
    # take the first underscore-separated token of the file's base name
    return in_string.split("/")[-1].split("_")[0]

df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
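For reference, a sketch of the built-in-function route mentioned above, assuming file names follow the clientId_table_category.csv pattern:
# extract the first underscore-separated token of the file's base name
df = df.withColumn(
    "clientId",
    fn.regexp_extract(fn.input_file_name(), r"([^/_]+)_[^/]*$", 1)
)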

How to skip lines while reading a CSV file as a dataFrame using PySpark?

I have a CSV file that is structured this way:
Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"
I have two problems in reading this file:
I want to ignore the header and ignore the blank row.
The commas within the values are not separators.
Here is what I tried:
df = sc.textFile("myFile.csv")\
.map(lambda line: line.split(","))\ #Split By comma
.filter(lambda line: len(line) == 2).collect() #This helped me ignore the first two rows
However, this did not work, because the commas within the values were being read as separators and len(line) was returning 4 instead of 2.
I tried an alternate approach:
data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped
The idea was to then use filter and not read the headers. But, when I tried to print the headers, I got encoded values.
[\x00A\x00Y\x00 \x00J\x00u\x00l\x00y\x00 \x002\x000\x001\x006\x00]
What is the correct way to read a CSV file and skip the first two rows?
Try to use csv.reader with the 'quotechar' parameter. It will split the lines correctly.
After that you can add filters as you like.
import csv
from pyspark.sql.types import StringType
df = sc.textFile("test2.csv")\
.mapPartitions(lambda line: csv.reader(line,delimiter=',', quotechar='"')).filter(lambda line: len(line)>=2 and line[0]!= 'Col1')\
.toDF(['Col1','Col2'])
For your first problem, just zip the lines in the RDD with zipWithIndex and filter the lines you don't want.
For the second problem, you could try to strip the first and the last double quote characters from the lines and then split the line on ",".
rdd = sc.textFile("myfile.csv")
df = rdd.zipWithIndex() \
.filter(lambda x: x[1] > 2) \
.map(lambda x: x[0]) \
.map(lambda x: x.strip('"').split('","')) \
.toDF(["Col1", "Col2"])
Although, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from databricks.
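For example, a minimal PySpark sketch with spark-csv (the Scala answer below shows the same approach together with the row-skipping step; on Spark 2.x the built-in spark.read.csv accepts the same options):
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("quote", "\"") \
    .option("delimiter", ",") \
    .load("myFile.csv")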
Answer by Zlidime had the right idea. The working solution is this:
import csv
from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([ \
StructField("Col1", StringType(), True), \
StructField("Col2", StringType(), True)])
df = sc.textFile("file.csv")\
.mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition],delimiter=',', quotechar='"')).filter(lambda line: len(line) >= 2 and line[0] != 'Col1')\
.toDF(customSchema)
If the CSV file structure always has two columns, it can be implemented in Scala like this:
val struct = StructType(
StructField("firstCol", StringType, nullable = true) ::
StructField("secondCol", StringType, nullable = true) :: Nil)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.option("inferSchema", "false")
.option("delimiter", ",")
.option("quote", "\"")
.schema(struct)
.load("myFile.csv")
df.show(false)
val indexed = df.withColumn("index", monotonicallyIncreasingId())
val filtered = indexed.filter(col("index") > 2).drop("index")
filtered.show(false)
Result is:
+---------+---------+
|firstCol |secondCol|
+---------+---------+
|Header |null |
|Blank Row|null |
|Col1 |Col2 |
|1,200 |1,456 |
|2,000 |3,450 |
+---------+---------+
+--------+---------+
|firstCol|secondCol|
+--------+---------+
|1,200 |1,456 |
|2,000 |3,450 |
+--------+---------+
Why don't you just try the DataFrameReader API from pyspark.sql? It is pretty easy. For this problem, I guess this single line would be good enough.
df = spark.read.csv("myFile.csv") # By default, quote char is " and separator is ','
With this API, you can also play around with a few other parameters, such as header lines and ignoring leading and trailing whitespace. Here is the link: DataFrameReader API
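For instance, a minimal sketch illustrating a few of those parameters for this file:
df = spark.read.csv(
    "myFile.csv",
    sep=",",                        # field separator (the default)
    quote='"',                      # so "1,200" is read as a single value
    header=False,
    ignoreLeadingWhiteSpace=True,
    ignoreTrailingWhiteSpace=True,
)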
