read local csv file in pySpark (2.3) - apache-spark

I'm using pySpark 2.3, trying to read a csv file that looks like this:
0,0.000476517230863068,0.0008178378961061477
1,0.0008506156837329876,0.0008467260987257776
But it doesn't work:
from pyspark import sql, SparkConf, SparkContext
print (sc.applicationId)
>> <property at 0x7f47583a5548>
data_rdd = spark.textFile(name=tsv_data_path).filter(x.split(",")[0] != 1)
And I get an error:
AttributeError: 'SparkSession' object has no attribute 'textFile'
Any idea how I should read it in pySpark 2.3?

First, textFile exists on the SparkContext (called sc in the repl), not on the SparkSession object (called spark in the repl).
Second, for CSV data, I would recommend using the CSV DataFrame loading code, like this:
df = spark.read.format("csv").load("file:///path/to/file.csv")
You mentioned in comments needing the data as an RDD. You are going to have significantly better performance if you can keep all of your operations on DataFrames instead of RDDs. However, if you need to fall back to RDDs for some reason you can do it like the following:
rdd = df.rdd.map(lambda row: row.asDict())
This approach is better than trying to load the file with textFile and parsing the CSV data yourself. The DataFrame CSV loader properly handles all the CSV edge cases for you, like quoted fields. Also, if you only need some of the columns, you can select and filter on the DataFrame before converting it to an RDD, to avoid bringing all that extra data over into the Python interpreter.
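Putting that together for your sample file, a minimal sketch might look like the following; _c0, _c1 and _c2 are Spark's default column names for a headerless CSV, and the filter mirrors the one in your question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Load the CSV as a DataFrame; inferSchema turns the numeric columns into numbers.
df = (spark.read
      .format("csv")
      .option("inferSchema", "true")
      .load("file:///path/to/file.csv"))

# Select and filter on the DataFrame first, then drop down to an RDD only if needed.
filtered = df.where(col("_c0") != 1).select("_c0", "_c1", "_c2")
rdd = filtered.rdd.map(lambda row: row.asDict())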

Related

Read multiple text files into a spark dataframe

I am trying to read multiple text files into a single spark data frame. I have used the following code for a single file:
df =spark.read.text('C:/User/Alex/Directory/Subdirectory/Filename.txt.pgp.decr')
df.count()
and I get the correct result, then I try and read in all of the files in that directory as follows:
df = spark.read.text('C:/User/Alex/Directory/Subdirectory/*')
df.count()
and the notebook just hangs and produces no result. I have also tried reading the data into an RDD using the sparkContext with textFile and wholeTextFiles, but that didn't work either. Please can you help?
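(For reference, spark.read.text also accepts an explicit list of paths, which can help isolate whether one particular file is causing the hang; the file names below are placeholders.)
paths = [
    "C:/User/Alex/Directory/Subdirectory/file1.txt",
    "C:/User/Alex/Directory/Subdirectory/file2.txt",
]
df = spark.read.text(paths)
df.count()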

Xml parsing on spark Structured Streaming

I'm trying to analyze data using Kinesis source in PySpark Structured Streaming on Databricks.
I created a Dataframe as shown below.
kinDF = spark.readStream.format("kinesis").option("streamName", "test-stream-1").load()
Later I converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data", expr("CAST(data as string)"))
Now, I need to extract a few fields from the df.xml_data column using XPath. Can you please suggest any possible solution?
If I create a dataframe directly for these xml files as xml_df = spark.read.format("xml").options(rowTag='Consumers').load("s3a://bkt/xmldata"), I'm able to query using xpath:
xml_df.select("Analytics.Amount1").show()
But I'm not sure how to extract elements similarly from a Spark Streaming dataframe where the data is in text format.
Are there any XML functions to convert text data using a schema? I saw an example for JSON data using from_json.
Is it possible to use spark.read on a dataframe column?
I need to find aggregated "Amount1" for every 5 minutes window.
Thanks for your help
You can use com.databricks.spark.xml.XmlReader to read XML data from a column, but it requires an RDD, which means you need to transform your df to an RDD using df.rdd, which may impact performance.
Below is untested Scala-like code:
import com.databricks.spark.xml.XmlReader

val xmlRdd = df.select("xml_data").rdd.map(r => r.getString(0))
val xmlDf = new XmlReader().xmlRdd(spark, xmlRdd)
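If you want to stay in PySpark, an untested alternative sketch uses Spark SQL's built-in xpath functions; the XPath expression and the approximateArrivalTimestamp column are assumptions about your stream:
from pyspark.sql.functions import expr, col, window

# Cast the Kinesis payload to text, then pull the field out with xpath_string.
parsed = (kinDF
          .withColumn("xml_data", expr("CAST(data AS STRING)"))
          .withColumn("amount1",
                      expr("xpath_string(xml_data, '/Consumers/Analytics/Amount1')").cast("double")))

# Aggregate Amount1 over 5-minute windows on the (assumed) arrival timestamp column.
agg = (parsed
       .groupBy(window(col("approximateArrivalTimestamp"), "5 minutes"))
       .sum("amount1"))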

How to save pandas DataFrame to Hive within `foreachPartition` (PySpark)

I have a PySpark DataFrame. I want to perform some function in foreachPartition and then save each result to Hive. The result is a pandas dataframe (within each partition). What is the best way to do this?
I have tried the following without success (gives a serialization error):
def processData(x):
    # do something
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.write.mode("append").format("parquet").saveAsTable(db.table_name)

original_spark_df.rdd.foreachPartition(processData)
I guess one solution would be to turn the pandas DataFrame into an RDD and return it (using mapPartitions instead of foreachPartition), and then use rdd.toDF() and saveAsTable().
Is there some way to save the pandas DataFrame to Hive within foreachPartition?
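A minimal sketch of that mapPartitions route, assuming the per-partition work produces a pandas DataFrame whose columns match the target table (the table name is a placeholder):
from pyspark.sql import Row

def process_partition(rows):
    import pandas as pd
    pandas_df = pd.DataFrame([r.asDict() for r in rows])
    # ... do something with pandas_df ...
    # Yield plain Rows so the write happens on the driver, not inside the executors.
    for record in pandas_df.to_dict("records"):
        yield Row(**record)

result_df = original_spark_df.rdd.mapPartitions(process_partition).toDF()
result_df.write.mode("append").format("parquet").saveAsTable("db.table_name")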

Spark Parse Text File to DataFrame

Currently, I can parse a text file to a Spark DataFrame by way of the RDD API with the following code:
def row_parse_function(raw_string_input):
    # Do parse logic...
    return pyspark.sql.Row(...)
raw_rdd = spark_context.textFile(full_source_path)
# Convert RDD of strings to RDD of pyspark.sql.Row
row_rdd = raw_rdd.map(row_parse_function).filter(bool)
# Convert RDD of pyspark.sql.Row to Spark DataFrame.
data_frame = spark_sql_context.createDataFrame(row_rdd, schema)
Is this current approach ideal?
Or is there a better way to do this without using the older RDD API?
FYI, Spark 2.0.
Clay,
This is a good approach for loading a file that doesn't have a specific format such as CSV, JSON, ORC or Parquet, or that doesn't come from a database.
If you have any kind of custom logic to apply to the data, this is the best way to do it. The RDD API exists for exactly this kind of situation, when you need to run non-trivial logic over your data.
You can read here about the uses of Spark's different APIs, and yours is a situation where the RDD approach is the best fit.
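As a self-contained illustration of that pattern, here is a sketch that assumes a made-up two-column comma-delimited format and a placeholder path:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

def row_parse_function(raw_string_input):
    # Returning None for unparseable lines lets filter(bool) drop them.
    parts = raw_string_input.split(",")
    if len(parts) != 2:
        return None
    return Row(id=parts[0], value=float(parts[1]))

schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

raw_rdd = spark.sparkContext.textFile("/path/to/input.txt")
row_rdd = raw_rdd.map(row_parse_function).filter(bool)
data_frame = spark.createDataFrame(row_rdd, schema)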

How to make spark write a _SUCCESS file for empty parquet output?

One of my spark jobs is currently running over empty input and so produces no output. That's fine for now, but I still need to know that the spark job ran even if it produced no parquet output.
Is there a way of forcing spark to write a _SUCCESS file even if there was no output at all? Currently it doesn't write anything to the directory where the output would go if there were input, so I have no way of determining whether there was a failure (this is part of a larger automated pipeline, and it keeps rescheduling the job because there's no indication it already ran).
The _SUCCESS file is written by Hadoop code. So if your spark app doesn't generate any output, you can use the Hadoop API to create the _SUCCESS file yourself.
If you are using PySpark - look into https://github.com/spotify/snakebite
If you are using Scala or Java - look into the Hadoop API.
An alternative would be to ask Spark to write an empty dataset to the output. But this might not be what you need, because there will be a part-00000 file as well as the _SUCCESS file, which downstream consumers might not like.
Here is how to save an empty dataset in pyspark (in Scala the code should be about the same):
$ pyspark
>>> sc.parallelize([], 1).saveAsTextFile("/path/on/hdfs")
>>> exit()
$ hadoop fs -ls /path/on/hdfs
Found 2 items
-rw-r--r-- 2 user user 0 2016-02-25 12:54 /path/on/hdfs/_SUCCESS
-rw-r--r-- 2 user user 0 2016-02-25 12:54 /path/on/hdfs/part-00000
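A DataFrame-based variant of the same trick, writing an empty parquet output so that the _SUCCESS file still appears (the one-column schema is a placeholder):
from pyspark.sql.types import StructType, StructField, StringType

empty_schema = StructType([StructField("value", StringType(), True)])
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), empty_schema)
empty_df.write.mode("overwrite").parquet("/path/on/hdfs")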
With Spark 1.6:
When writing a DataFrame with a forced schema using the Avro writer, zero rows still produce at least one part-r-{part number}.avro file (containing essentially a schema without rows) and a _SUCCESS file. Here is a pseudocode example:
resultData.persist(/* optional storage value */)
if (resultData.count == 0) {
  resultData
    .coalesce(1)
    .write
    .avro(memberRelationshipMapOutputDir)
} else {
  doSomething()
}
resultData.unpersist()
It's possible to tweak avro to parquet and figure out the row count's relationship to the coalesce factor (and to switch to approximate counts). The above example brings up that a schema may need to be forced on the input data before writing, so something like this may be required:
case class Member(club: String, username: String)

hiveContext
  .read
  .schema(ScalaReflection.schemaFor[Member].dataType.asInstanceOf[StructType])
  .avro(memberRelationshipMapInputDir)
Some useful imports / code may be:
import com.databricks.spark.avro._
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sparkContext)
import hiveContext.implicits._
Disclaimer: Some of this may be changed for Spark 2.x and all the above is 'scala-like pseudocode'.
In order to convert an RDD of MyRow to a DataFrame, it's possible to use the read above to get the data, or to convert the RDD to an appropriate DataFrame with createDataFrame or toDF.
You can use emptyRDD for writing just the _SUCCESS flag:
spark.sparkContext.emptyRDD[MyRow].saveAsTextFile(outputPath)
