Column names appearing as record data in PySpark on Databricks

I'm working with PySpark in Python. I downloaded a sample CSV file from Kaggle (Covid Live.csv), and when opened in Visual Studio Code the data looks as follows
(raw CSV data, only partially shown):
#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........
The problem I'm facing is that the column names are also displayed as records in the PySpark Databricks console when I run the code below:
from pyspark.sql.types import *
df1 = spark.read.format("csv") \
.option("inferschema", "true") \
.option("header", "true") \
.load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv") \
.select("*")
Spark Jobs -->
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string
As can be seen above, Spark detects only two columns, # and "Country,", and is not aware that 'Total Cases', 'Total Deaths', ... are also columns.
How do I handle this malformed header?

There are a few ways to go about this:
Fix the header in the CSV before reading (it should be on a single line). Also pay attention to the quoting and escape settings.
Read in PySpark with a manually provided schema and filter out the bad lines.
Read using pandas, skip the first 12 lines, add proper column names, and convert to a PySpark dataframe.

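So, the solution is pretty simple and does not require editing the data manually or anything of the sort.
I just had to add .option("multiLine", "true") \ to the reader, and the data is displayed as desired! With this option, Spark lets a quoted field span several physical lines, so the whole header is parsed as one record.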
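To see why the header breaks without multi-line handling, here is a minimal sketch using Python's standard csv module rather than Spark. It parses the first lines of the sample from the question in a quote-aware way, so newlines inside quoted names stay part of the field. The variable names are illustrative, and replacing the embedded newlines with spaces is an optional cleanup, not something the original answer does.

```python
import csv
import io

# First lines of the sample file from the question. The header spans several
# physical lines because the quoted column names contain newlines.
raw = (
    '#,"Country,\nOther","Total\nCases","Total\nDeaths","New\nDeaths",'
    '"Total\nRecovered","Active\nCases","Serious,\nCritical",'
    '"Tot Cases/\n1M pop","Deaths/\n1M pop","Total\nTests",'
    '"Tests/\n1M pop",Population\n'
    '1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970",'
    '"293,206","3,239","1,118,158,870","3,339,729","334,805,269"\n'
)

# Line-by-line view: the first physical line stops inside a quoted field.
first_physical_line = raw.split("\n")[0]

# Quote-aware view: each record may span several physical lines.
rows = list(csv.reader(io.StringIO(raw)))

# Optional cleanup: flatten the newlines embedded in the column names.
header = [name.replace("\n", " ") for name in rows[0]]
data_row = rows[1]
```

Here the first physical line is only `#,"Country,`, while the quote-aware parse yields a header and a data row with the same number of fields.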

Related

Read json files in permissive mode - PySpark 2.3

I have a data job that reads a bunch of JSON files, where a few JSON lines in some files might be corrupt (invalid JSON).
Below is the code:
df = spark.read \
.option("mode", "PERMISSIVE")\
.option("columnNameOfCorruptRecord", "_corrupt_record")\
.json("hdfs://someLocation/")
What happens is that if I read a completely valid file (no corrupt records) with the code above, this column is not added at all.
What I want is for the "_corrupt_record" column to be added regardless of whether the JSON file has corrupt records. If a file doesn't have any corrupt records, all values for this field should be null.
You can just check whether the _corrupt_record column exists in df, and add it manually if it doesn't.
import pyspark.sql.functions as F
if '_corrupt_record' not in df.columns:
    # Cast so the column is a string (as Spark's own _corrupt_record is), not NullType
    df = df.withColumn('_corrupt_record', F.lit(None).cast('string'))
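As an illustration of the behaviour the question asks for, here is a minimal pure-Python sketch. It is not Spark code and does not claim to mirror how Spark implements PERMISSIVE mode: every input line yields a record that carries a _corrupt_record key, which is None for valid JSON objects and holds the raw text otherwise. The helper name parse_permissive and the sample lines are made up for the example.

```python
import json


def parse_permissive(lines, corrupt_col="_corrupt_record"):
    """Parse JSON lines and always emit the corrupt-record key."""
    records = []
    for line in lines:
        try:
            parsed = json.loads(line)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            record = dict(parsed)
            record[corrupt_col] = None  # valid line: key present, value null
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            record = {corrupt_col: line}  # invalid line: keep the raw text
        records.append(record)
    return records


# Hypothetical inputs: one fully valid "file" and one with a broken line.
clean = parse_permissive(['{"id": 1}', '{"id": 2}'])
mixed = parse_permissive(['{"id": 1}', '{"id": '])
```

The point is the same as in the answer above: the column exists in both cases, and it is simply null everywhere when nothing is corrupt.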

Reading from CSV file but mostly None values

I have a csv file with data in most fields. I can read this csv file in Pandas with no problem. However, when I try and read it in with Apache Spark, I get mostly Null values as shown in the screenshot. I have no idea why. This file is actually 400,000+ rows, which is why I am using Apache Spark, but I have the same problem when I take only 20 rows.
df = spark.read.csv('drive/My Drive/inc-20.csv', header=True)
df.show()
Apache Spark output
Here is the original CSV file
Any input would be very welcome!
Found the problem. The last column wasn't being parsed properly. Oddly, this seemed to have an impact on other columns. I dropped the last column, and this worked. Hope that helps anyone running into a similar problem in the future.
Try reading the file with schema inference enabled, as below:
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)  # <-- HERE
    .csv("/home/filepath/Book1.csv")
)

Pulling log file directory name into the Pyspark dataframe

I have a bit of a strange one. I have loads of logs that I need to trawl. I have done that successfully in Spark & I am happy with it.
However, I need to add one more field to the dataframe, which is the data center.
The only place that the datacenter name can be derived is from the directory path.
For example:
/feedname/date/datacenter/another/logfile.txt
What would be the way to extract the log file path and inject it into the dataframe? From there, I can do some string splits & extract the bit I need.
My current code:
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.withColumn("Datacenter", input_file_name())\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data.printSchema()
mpe_data.createOrReplaceTempView("mpe")
You can get the file path using input_file_name in Spark 2.0+:
from pyspark.sql.functions import input_file_name
df.withColumn("Datacenter", input_file_name())
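Adding your piece of code as an example: once you have read your files, use withColumn to add the file name. withColumn is a DataFrame method, so it has to come after .csv(...), and its result must be assigned back because DataFrames are immutable.
from pyspark.sql.functions import input_file_name
mpe_data = my_spark.read\
    .option("header","false")\
    .option("delimiter", "\t")\
    .csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
# Assign the result; withColumn returns a new DataFrame
mpe_data = mpe_data.withColumn("Datacenter", input_file_name())
mpe_data.printSchema()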
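The string split mentioned in the question can be sketched in plain Python as below. The helper name datacenter_from_path, the sample data center name dc1 and the assumption that the data center is always the third segment from the end are illustrative, based only on the example path in the question. In Spark the same idea could probably be expressed on the Datacenter column with split and element_at from pyspark.sql.functions (element_at requires Spark 2.4+), but that variant is not shown or tested here.

```python
def datacenter_from_path(path, index_from_end=3):
    """Return the path segment that sits `index_from_end` places from the end."""
    segments = path.rstrip("/").split("/")
    return segments[-index_from_end]


# Example path layout from the question: /feedname/date/datacenter/another/logfile.txt
example = "/feedname/date/datacenter/another/logfile.txt"
datacenter = datacenter_from_path(example)

# Hypothetical full HDFS path following the same layout.
hdfs_example = "hdfs://nameservice/data/feed/mpe/dt=20191013/dc1/another/log.txt"
hdfs_datacenter = datacenter_from_path(hdfs_example)
```

Counting from the end rather than the start keeps the helper independent of the scheme and host prefix of the path.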

How can I use the saveAsTable function when I have two Spark streams running in parallel in the same notebook?

I have two Spark streams set up in a notebook to run in parallel like so.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df1 = spark \
    .readStream.format("delta") \
    .table("test_db.table1") \
    .select('foo', 'bar')
writer_df1 = df1.writeStream.option("checkpointLocation", checkpoint_location_1) \
    .foreachBatch(
        lambda batch_df, batch_epoch:
            process_batch(batch_df, batch_epoch)
    ) \
    .start()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df2 = spark \
    .readStream.format("delta") \
    .table("test_db.table2") \
    .select('foo', 'bar')
writer_df2 = df2.writeStream.option("checkpointLocation", checkpoint_location_2) \
    .foreachBatch(
        lambda batch_df, batch_epoch:
            process_batch(batch_df, batch_epoch)
    ) \
    .start()
These dataframes then get processed row by row, with each row being sent to an API. If the API call reports an error, I then convert the row into JSON and append this row to a common failures table in databricks.
import json
from datetime import datetime
columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
error_df = spark.createDataFrame(vals, columns)
error_df.select('table_name','record','time_of_failure', 'error_or_status_code').write.format('delta').mode('Append').saveAsTable("failures_db.failures_db")
When attempting to add the row to this table, the saveAsTable() call here throws the following exception.
py4j.protocol.Py4JJavaError: An error occurred while calling o3578.saveAsTable.
: java.lang.IllegalStateException: Cannot find the REPL id in Spark local properties. Spark-submit and R doesn't support transactional writes from different clusters. If you are using R, please switch to Scala or Python. If you are using spark-submit , please convert it to Databricks JAR job. Or you can disable multi-cluster writes by setting 'spark.databricks.delta.multiClusterWrites.enabled' to 'false'. If this is disabled, writes to a single table must originate from a single cluster. Please check https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions-faq for more details.
If I comment out one of the streams and re-run the notebook, any errors from the API calls get inserted into the table with no issues. I feel like there's some configuration I need to add but am not sure of where to go from here.
Not sure if this is the best solution, but I believe the problem comes from each stream writing to the table at the same time. I split this table into separate tables for each stream and it worked after that.

Unable to append "Quotes" in write for dataframe

I am trying to save a dataframe as .csv in Spark. All fields are required to be enclosed in quotes. Currently, the fields in the output file are not quoted.
I am using Spark 2.1.0
Code :
DataOutputResult.write.format("com.databricks.spark.csv").
option("header", true).
option("inferSchema", false).
option("quoteMode", "ALL").
mode("overwrite").
save(Dataoutputfolder)
Output format(actual) :
Name, Id,Age,Gender
XXX,1,23,Male
Output format (Required) :
"Name", "Id" ," Age" ,"Gender"
"XXX","1","23","Male"
Options I tried so far :
I tried quoteMode and quote in the options while writing the file, but with no success.
Use ("quote", "all"), that is, replace quoteMode with quote,
or play with concat or concat_ws directly on the df columns and save without a quote mode:
import org.apache.spark.sql.functions.{concat, lit}
val newDF = df.select(concat($"Name", lit("\""), $"Age"))
or create your own UDF to add the desired behaviour; please find more examples in "Concatenate columns in Apache Spark DataFrame".
Unable to add as a comment to the above answer, so posting as an answer.
In Spark 2.3.1, use quoteAll
df1.write.format("csv")
.option("header", true)
.option("quoteAll","true")
.save(Dataoutputfolder)
Also, to add to the comment of @Karol Sudol (great answer btw), .option("quote","\u0000") will work only if one is using PySpark with Python 3, whose default encoding is 'utf-8'. A few people reported that the option did not work, probably because they were using PySpark with Python 2, whose default encoding is 'ascii'. Hence the error "java.lang.RuntimeException: quote cannot be more than one character".
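For reference, this is what "quote every field" output looks like, sketched with Python's standard csv module rather than Spark. It can be handy for checking the expected format on a small sample before comparing it with what the Spark writer produces. The rows come from the question; the variable names are illustrative.

```python
import csv
import io

# Sample rows from the question (header plus one record).
rows = [["Name", "Id", "Age", "Gender"], ["XXX", 1, 23, "Male"]]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in quotes, including numeric ones.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)

quoted_output = buffer.getvalue()
```

Each field, numeric or not, comes out wrapped in double quotes, which matches the required output format shown in the question apart from its stray spaces.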
