I'm working with PySpark (Python). I downloaded a sample CSV file from Kaggle (Covid Live.csv), and the data looks as follows when opened in VS Code
(raw CSV data, only partial data shown):
#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........
The problem I'm facing is that the column names are also being displayed as records in the PySpark Databricks console when executed with the code below:
from pyspark.sql.types import *

df1 = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv") \
    .select("*")
Spark Jobs -->
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string
As can be observed above, Spark is detecting only two columns, # and Country, and is not aware that 'Total Cases', 'Total Deaths', etc. are also columns.
How do I tackle this malformation?
There are a few ways to go about this:
1. Fix the header in the CSV before reading (it should be on a single line). Also pay attention to the quoting and escape settings.
2. Read in PySpark with a manually provided schema and filter out the bad lines (see the sketch after this list).
3. Read using pandas, skip the first 12 lines, add proper column names, and convert to a PySpark dataframe.
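A rough sketch of option 2, assuming the spark session already exists and using sanitized versions of the column names visible in the header fragments above; header parsing is turned off, so the broken header rows come in as data and are then filtered out by keeping only rows whose first field is a plain number:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("rank", StringType(), True),
    StructField("country_other", StringType(), True),
    StructField("total_cases", StringType(), True),
    StructField("total_deaths", StringType(), True),
    StructField("new_deaths", StringType(), True),
    StructField("total_recovered", StringType(), True),
    StructField("active_cases", StringType(), True),
    StructField("serious_critical", StringType(), True),
    StructField("tot_cases_per_1m", StringType(), True),
    StructField("deaths_per_1m", StringType(), True),
    StructField("total_tests", StringType(), True),
    StructField("tests_per_1m", StringType(), True),
    StructField("population", StringType(), True),
])

raw = (spark.read.format("csv")
       .schema(schema)
       .option("header", "false")
       .load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv"))

# keep only real data rows (the first field of a data row is the numeric rank)
clean = raw.filter(raw["rank"].rlike(r"^\d+$"))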
So, the solution is pretty simple and does not require you to 'edit' the data manually or anything of that sort.
I just had to add .option("multiLine", "true") \ and the data is displayed as desired!
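For reference, the full read with the extra option, using the same path and options as the original snippet; multiLine lets Spark parse the quoted header fields that span several physical lines as a single header row:

df1 = (spark.read.format("csv")
       .option("inferSchema", "true")
       .option("header", "true")
       .option("multiLine", "true")
       .load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv"))

df1.printSchema()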
I have the following code and result. Here, I am using Databricks' Auto Loader.
The result I am getting is not correct, because if I don't drop the columns (df2), I get the following result.
Note that I notice similar behavior with select. What mistake am I making here?
I have found the problem. I need to explicitly specify that the first line is a header, so I changed the relevant line to this:
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")
      .schema(schema)
      .load("/FileStore/tables/movies7"))
I have a data job to read a bunch of JSON files, where there is a possibility that a few JSON lines in some files might be corrupt (invalid JSON).
Below is the code:
df = spark.read \
.option("mode", "PERMISSIVE")\
.option("columnNameOfCorruptRecord", "_corrupt_record")\
.json("hdfs://someLocation/")
What is happening for me is that if I try to read a completely clean file (no corrupt records) with the above code, this column is not added at all.
My ask here is to add this "_corrupt_record" column regardless of whether the JSON file has corrupt records or not. If a file doesn't have any corrupt records, all values for this field should be null.
You can just check whether the _corrupt_record column exists in df, and add it manually if it doesn't.
import pyspark.sql.functions as F
if '_corrupt_record' not in df.columns:
    df = df.withColumn('_corrupt_record', F.lit(None))
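Putting the two pieces together, a minimal end-to-end sketch using the same path and options as in the question; the cast to string matches the type Spark uses for the corrupt-record column:

import pyspark.sql.functions as F

# read permissively, routing unparsable lines into _corrupt_record when present
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("hdfs://someLocation/"))

# guarantee the column exists even when every file parses cleanly
if "_corrupt_record" not in df.columns:
    df = df.withColumn("_corrupt_record", F.lit(None).cast("string"))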
I did some research on this during the past couple days and I think I'm close to getting this working, but there are still some issues that I can't quite figure out.
I believe this should work in a Scala environment
// Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "false").load("../Downloads/*.csv")
spark.read.option("header", "false").csv("../Downloads/*.csv")
That gives me this error: org.apache.spark.sql.AnalysisException: Path does not exist:
I think this should work in a SQL environment:
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.load("../Downloads/*.csv") // <-- note the star (*)
df.show()
This gives me a parse exception error.
The thing is, these are all .gz zipped text files, and there is really no schema in these files. Well, there is a vertical list of field names, and the real data sets always start down on something like row 26, 52, 99, 113, 149, and all kinds of random rows. All data is pipe-delimited.
I have the field names, and I created structured tables in Azure SQL Server, which is where I want to store all the data. I'm really stuck on how to iterate through folders and sub-folders, look for file names that match certain patterns, merge all of these into a dataframe, and then push that object into my SQL Server tables. It seems like a pretty straightforward thing, but I can't seem to get this darn thing working!
I came across the idea here:
https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load
You can find all the files with pure Scala and then pass them to Spark:
import java.io.File

val file = new File(yourDirectory)
val files: List[String] = file.listFiles
  .filter(_.isFile)
  .filter(_.getName.startsWith("yourCondition"))
  .map(_.getPath).toList

val df = spark.read.csv(files: _*)
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
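The question also asks about pushing the merged dataframe into the Azure SQL Server tables, which the snippet above stops short of; a rough PySpark sketch using Spark's built-in JDBC writer might look like the following (the server, database, table, and credential values are placeholders, and merged_df stands for whatever dataframe you end up with):

# placeholder connection details; replace with your Azure SQL Server values
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"

(merged_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.<target_table>")
    .option("user", "<username>")
    .option("password", "<password>")
    # requires the Microsoft SQL Server JDBC driver jar on the cluster
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .mode("append")
    .save())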
While loading a CSV file, there is an option to drop malformed records. Can we do the same for an XLS file load?
I have tried loading an XLS file (almost 1T in size) and it shows this error:
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@339370e
java.lang.IllegalArgumentException: MALFORMED
at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:300)
Please advise. Thank you very much.
I think this is what you are looking for (as done in Java for CSV):
spark.read().format("csv")
    .option("sep", ",")
    .option("header", true)
    .option("mode", "DROPMALFORMED")
    .schema(schema)
    .load(filePath);
Here, the mode option takes care of malformed records and drops them when they are encountered.
Similarly, header is set to true so that the first row is treated as column names rather than as data, and the separator is defined as a comma for the CSV file format.
spark is the SparkSession, which must be created before the above snippet.
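For reference, the same approach in PySpark would be a sketch along these lines (the schema and file path here are hypothetical, just to make the snippet self-contained):

from pyspark.sql.types import StructType, StructField, StringType

# hypothetical schema and path, purely for illustration
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
])
file_path = "/FileStore/tables/example.csv"

# DROPMALFORMED silently discards rows that do not match the schema
df = (spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .schema(schema)
      .load(file_path))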