Read json files in permissive mode - PySpark 2.3 - apache-spark

I have a data job that reads a bunch of JSON files, and there is a possibility that a few JSON lines in some files are corrupt (invalid JSON).
Below is the code:
df = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .json("hdfs://someLocation/")
The problem is that if I read a completely clean file (no corrupt records) with the above code, this column is not added at all.
What I want is for the "_corrupt_record" column to be added regardless of whether the JSON file has corrupt records or not. If a file doesn't have any corrupt records, all values for this field should be null.

You can just check whether the _corrupt_record column exists in df, and add it manually if it doesn't.
import pyspark.sql.functions as F
if '_corrupt_record' not in df.columns:
    # Cast so the new column is a nullable string rather than NullType
    df = df.withColumn('_corrupt_record', F.lit(None).cast('string'))
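To make the desired behaviour concrete, here is a minimal plain-Python sketch of "permissive" JSON-lines parsing in which the corrupt-record field is always present. It is not Spark's implementation; the function name parse_permissive and the sample lines are made up for illustration.

```python
import json

CORRUPT_COLUMN = "_corrupt_record"  # same column name as in the question


def parse_permissive(lines, corrupt_column=CORRUPT_COLUMN):
    """Parse JSON lines; keep malformed lines in `corrupt_column` instead of failing."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            row = dict(record)
            row[corrupt_column] = None  # always present, null for valid lines
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            row = {corrupt_column: line}
        rows.append(row)
    return rows


# Hypothetical sample input: one fully clean "file" and one with a broken line.
clean = parse_permissive(['{"id": 1}', '{"id": 2}'])
mixed = parse_permissive(['{"id": 1}', '{"id": 2'])
```

In both results every row carries the _corrupt_record key, which is the shape the question asks for.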

Related

Column names appearing as record data in Pyspark databricks

I'm working with PySpark in Python. I downloaded a sample CSV file from Kaggle (Covid Live.csv), and the data looks as follows when opened in Visual Studio Code
(Raw CSV data only partial data)
#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........
The problem I'm facing is that the column names are also displayed as records in the PySpark Databricks console when I run the code below
from pyspark.sql.types import *
df1 = spark.read.format("csv") \
    .option("inferschema", "true") \
    .option("header", "true") \
    .load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv") \
    .select("*")
Spark Jobs -->
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string
As can be seen above, Spark detects only two columns, # and Country, and is not aware that 'Total Cases', 'Total Deaths', ... are also columns.
How do I deal with this malformed header?
There are a few ways to go about this:
1. Fix the header in the CSV before reading (it should be on a single line). Also pay attention to the quoting and escape settings.
2. Read the file in PySpark with a manually provided schema and filter out the bad lines.
3. Read the file using pandas and skip the first 12 lines. Add proper column names, then convert to a PySpark DataFrame.
So, the solution is pretty simple and does not require you to edit the data manually or anything of that sort.
I just had to add .option("multiLine", "true") to the reader, and the data now displays as desired!
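As a rough illustration of why a multi-line-aware parser helps here, the sketch below uses Python's built-in csv module (not Spark) on a shortened, made-up sample in the same shape as the Kaggle file. A quote-aware parser keeps the newlines inside the quoted header fields, whereas splitting on physical lines breaks the header apart.

```python
import csv
import io

# Shortened, made-up sample: quoted header fields contain embedded newlines.
raw = (
    '#,"Country,\nOther","Total\nCases",Population\n'
    '1,USA,"98,166,904","334,805,269"\n'
)

# Splitting on physical lines sees the header spread over several "records".
physical_lines = raw.splitlines()

# A quote-aware parser treats the embedded newlines as part of the field values.
rows = list(csv.reader(io.StringIO(raw, newline="")))

# Flatten the newlines so the column names are easier to work with.
header = [name.replace("\n", " ") for name in rows[0]]
data = rows[1:]
```

The sample has four physical lines but only two logical records (one header, one data row), which is the difference that the multiLine option accounts for.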

Reading from CSV file but mostly None values

I have a csv file with data in most fields. I can read this csv file in Pandas with no problem. However, when I try and read it in with Apache Spark, I get mostly Null values as shown in the screenshot. I have no idea why. This file is actually 400,000+ rows, which is why I am using Apache Spark, but I have the same problem when I take only 20 rows.
df = spark.read.csv('drive/My Drive/inc-20.csv', header=True)
df.show()
Apache Spark output
Here is the original CSV file
Any input would be very welcome!
Found the problem. The last column wasn't being parsed properly. Oddly, this seemed to have an impact on other columns. I dropped the last column, and this worked. Hope that helps anyone running into a similar problem in the future.
Try reading the file with schema inference enabled, as below:
df = (spark.read
    .option("header", True)
    .option("inferSchema", True)  # <-- here
    .csv("/home/filepath/Book1.csv"))

Pulling log file directory name into the Pyspark dataframe

I have a bit of a strange one. I have loads of logs that I need to trawl. I have done that successfully in Spark & I am happy with it.
However, I need to add one more field to the dataframe, which is the data center.
The only place that the datacenter name can be derived is from the directory path.
For example:
/feedname/date/datacenter/another/logfile.txt
What would be the way to extract the log file path and inject it into the dataframe? From there, I can do some string splits & extract the bit I need.
My current code:
mpe_data = my_spark.read \
    .option("header", "false") \
    .option("delimiter", "\t") \
    .withColumn("Datacenter", input_file_name()) \
    .csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data.printSchema()
mpe_data.createOrReplaceTempView("mpe")
You can get the file path using the input_file_name function in Spark 2.0+:
from pyspark.sql.functions import input_file_name
df.withColumn("Datacenter", input_file_name())
Using your code as an example: once you have read your files, use withColumn to add the file name.
mpe_data = my_spark.read \
    .option("header", "false") \
    .option("delimiter", "\t") \
    .csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
# withColumn returns a new DataFrame, so the result has to be assigned
mpe_data = mpe_data.withColumn("Datacenter", input_file_name())
mpe_data.printSchema()
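For the string-splitting step mentioned in the question, here is a small plain-Python sketch of pulling the datacenter segment out of a full file path. It assumes the datacenter directory sits directly under the dt=... directory, as the example path and glob suggest; the function name datacenter_from_path and the value "dc1" are made up.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def datacenter_from_path(file_uri, date_prefix="dt="):
    """Return the directory that follows the date directory, or None if absent."""
    # Drop the scheme and host (e.g. hdfs://nameservice) and split into segments.
    parts = PurePosixPath(urlparse(file_uri).path).parts
    for index, part in enumerate(parts[:-1]):
        if part.startswith(date_prefix):
            # The next segment must be a directory, not the file name itself.
            return parts[index + 1] if index + 1 < len(parts) - 1 else None
    return None


# Hypothetical path in the layout /feed/date/datacenter/another/logfile.txt
example = "hdfs://nameservice/data/feed/mpe/dt=20191013/dc1/another/logfile.txt"
datacenter = datacenter_from_path(example)
```

The same idea can be applied to the Datacenter column after it has been filled with the file path.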

Merge multiple files into one data frame and push to Azure SQL Server

I did some research on this during the past couple days and I think I'm close to getting this working, but there are still some issues that I can't quite figure out.
I believe this should work in a Scala environment
// Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "false").load("../Downloads/*.csv")
spark.read.option("header", "false").csv("../Downloads/*.csv")
That gives me this error: org.apache.spark.sql.AnalysisException: Path does not exist:
I think this should work in a SQL environment:
df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .load("../Downloads/*.csv") // <-- note the star (*)
df.show()
This gives me a parse exception error.
The thing is, these are all .gz zipped text files and there is really no schema in all these files. Well, there is a vertical list of field names, and the real data sets always start down on something like row 26, 52, 99, 113, 149, and all kinds of random things. All data is pipe-delimited. I have the field names and I created structured tables in Azure SQL Server, which is where I want to store all data. I'm really stuck on how to iterate through folders and sub-folders, look for file names that match certain patterns, and merge all of these into a dataframe, then push that object into my SQL Server tables. It seems like a pretty straightforward thing, but I can't seem to get this darn thing working!!
I came across the idea here:
https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load
You can find all the files with pure Scala and then pass them to Spark:
import java.io.File
val file = new File(yourDirectory)
val files: List[String] = file.listFiles
  .filter(_.isFile)
  .filter(_.getName.startsWith("yourCondition"))
  .map(_.getPath).toList
val df = spark.read.csv(files: _*)
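Since the question also asks about sub-folders, here is a rough Python sketch of the same idea that walks a directory tree and keeps file names matching a pattern. It only covers a local (or locally mounted) file system, and the folder layout and file names in the demo are made up.

```python
import fnmatch
import os
import tempfile


def find_files(root, pattern):
    """Return sorted paths under `root` (recursively) whose file name matches `pattern`."""
    matches = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            if fnmatch.fnmatchcase(name, pattern):
                matches.append(os.path.join(directory, name))
    return sorted(matches)


# Self-contained demo with hypothetical folders and file names.
with tempfile.TemporaryDirectory() as root:
    for relative in ("2019/01/01/ABC_1.gz", "2019/01/02/ABC_2.gz", "2019/01/02/XYZ_1.gz"):
        path = os.path.join(root, *relative.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()  # create an empty placeholder file
    found = [
        os.path.relpath(p, root).replace(os.sep, "/")
        for p in find_files(root, "ABC*.gz")
    ]
```

The resulting list of paths could then be handed to the reader in one call, in the same way as the Scala list above.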
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
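The question mentions that each file starts with a vertical list of field names before the real pipe-delimited data begins. One possible way to drop that preamble, sketched in plain Python rather than Spark, is to keep only the rows with the expected number of fields. This assumes the preamble lines never happen to have that many pipe-separated fields; the function name, file name and sample contents are made up.

```python
import csv
import gzip
import os
import tempfile


def read_pipe_rows(path, expected_fields):
    """Yield rows from a gzipped, pipe-delimited file, skipping rows of any other width."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="|"):
            if len(row) == expected_fields:
                yield row


# Self-contained demo: write a tiny hypothetical file, then read it back.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "ABC_sample.gz")
    with gzip.open(sample, "wt", newline="") as handle:
        handle.write("field_a\nfield_b\nfield_c\n")  # vertical list of field names
        handle.write("1|x|2019-01-01\n2|y|2019-01-02\n")  # the real data
    rows = list(read_pipe_rows(sample, expected_fields=3))
```

A similar width-based filter can be applied after loading if the reader is given an explicit schema.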

Spark: reading files with PERMISSIVE and provided schema - issues with corrupted records column

I am reading a CSV file with Spark. I provide a schema for the file and read it in PERMISSIVE mode. I would like to keep all corrupted records in columnNameOfCorruptRecord (in my case corrupted_records).
I went through hell to set this up and I still get warnings that I cannot suppress. Is there something I am missing?
So first, in order to have the corrupted_records column, I needed to add it to the schema as StringType. This is documented, so okay.
But whenever I read a file I get a warning that the schema doesn't match because the number of columns is different. It's just a warning, but it's filling my logs.
Also, when there is a field that is not nullable and there is a corrupted record, the corrupted record goes to the corrupted_records column and all its fields are set to null, so I get an exception because I have a non-nullable field. The only way to solve this is to change the non-nullable columns to nullable, which is quite strange.
Am I missing something?
Recap:
1. Is there a way to ignore the warning when I've added the corrupted_records column to the schema?
2. Is there a way to use PERMISSIVE mode and the corrupted_records column with a schema that has non-nullable fields?
Thanks!
The following documentation might help. It would be great if you at least provided the code you've written.
https://docs.databricks.com/spark/latest/data-sources/read-csv.html
Here is a demo snippet that reads JSON:
df= self.spark.read.option("mode", "PERMISSIVE").option("primitivesAsString", True).json(self.src_path)
To answer your point 2, you first need to look more closely at point 1.
Point 1: analyze your file and make your schema cover all the fields in the file. After importing the CSV file into a DataFrame, select the fields you are interested in and continue with what you were doing.
Point 2: you can solve your problem by defining your schema as follows (shown here in PySpark):
import pyspark.sql.types as types
yourSchema = (types.StructType()
    .add('field0', types.IntegerType(), True)
    # add the remaining fields as .add(fieldName, fieldType, True); True makes the field nullable
    .add('corrupted_records', types.StringType(), True)  # corrupted rows end up here
)
With this having been defined, you can import your csv file into a DataFrame as follows:
df = (spark.read.format("csv")
    .schema(yourSchema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "corrupted_records")
    .load(your_csv_files)
)
There are also other ways to do the same operation, and different modes for handling bad data; have a look at this insightful article: https://python.plainenglish.io/how-to-handle-bad-data-in-spark-sql-5e0276d37ca1
