Read multiple text files into a spark dataframe - apache-spark

I am trying to read multiple text files into a single Spark DataFrame. I have used the following code for a single file:
df = spark.read.text('C:/User/Alex/Directory/Subdirectory/Filename.txt.pgp.decr')
df.count()
and I get the correct result. Then I try to read in all of the files in that directory as follows:
df = spark.read.text('C:/User/Alex/Directory/Subdirectory/*')
df.count()
and the notebook just hangs and produces no result. I have also tried reading the data into an RDD using the sparkContext with textFile and wholeTextFiles, but that didn't work either. Please can you help?
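For reference, a minimal sketch of the usual ways to point spark.read.text at more than one file (the file names below are placeholders, not the real ones): it accepts a single directory, a glob pattern, or an explicit list of paths, and on Spark 3.0+ a pathGlobFilter option can restrict which files a directory read picks up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every file directly under the directory (same effect as the wildcard above)
df_all = spark.read.text('C:/User/Alex/Directory/Subdirectory/')

# Or pass an explicit list of paths (placeholder file names)
paths = [
    'C:/User/Alex/Directory/Subdirectory/File1.txt.pgp.decr',
    'C:/User/Alex/Directory/Subdirectory/File2.txt.pgp.decr',
]
df_list = spark.read.text(paths)

# Spark 3.0+: only pick up files whose names match a pattern
df_decr = spark.read.option('pathGlobFilter', '*.decr').text('C:/User/Alex/Directory/Subdirectory/')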

Related

Apache Spark Dataframes Not Being Created with Databricks

When reading in data from SQL in one notebook a Spark DataFrame is created, but when I read in the same data from a different notebook I don't see a DataFrame.
When I run the following in one notebook I get a DataFrame:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword}"
my_sales = spark.read.jdbc(jdbcUrl, 'AZ_FH_ELLIPSE.AZ_FND_MSF620')
I get the expected DataFrame output.
However, when I run the same code in a different notebook, I only get how long it took to run the code, but no DataFrame.
Any thoughts?
I should mention that the DataFrame isn't appearing on the Community Edition of Databricks. However, I don't think that should be the reason why I'm not seeing a DataFrame or schema appearing.
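As a side note on how notebooks render results, a DataFrame is only drawn when something explicitly displays it; a minimal sketch of forcing output, reusing the jdbcUrl and table name from above (display is the Databricks notebook helper, show works in any Spark environment):
my_sales = spark.read.jdbc(jdbcUrl, 'AZ_FH_ELLIPSE.AZ_FND_MSF620')
my_sales.printSchema()   # prints the inferred schema to the cell output
my_sales.show(10)        # prints the first 10 rows as text
display(my_sales)        # Databricks-only rich table rendering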

How to process multiple parquet files in parallel using Pyspark?

I am using Azure Databricks and I'm new to PySpark and big data.
Here is my problem:
I have several parquet files in a directory on Azure Databricks.
I want to read these files into a PySpark DataFrame and use the dropDuplicates method to remove duplicate rows - a QA check.
I then want to overwrite these files in the same directory after dropping the duplicates.
Currently, I am using a for loop to loop over each parquet file in the directory. However, this is an inefficient way of doing things. I was wondering if there is a way to process these parquet files in parallel to save computational time. If so, how do I need to change my code?
Here is the code:
for parquet_file_name in dir:
    df = spark.read.option("header", "true").option("inferschema", "false").parquet('{}/{}'.format(dir, parquet_file_name))
    df.dropDuplicates().write.mode('overwrite').parquet('{}/{}'.format(dir, parquet_file_name))
Any help here would be much appreciated.
Many thanks.
Rather than reading in one file at a time in a for loop, just read in the entire directory, like so:
df = spark.read \
    .option("header", "true") \
    .option("inferschema", "false") \
    .parquet(dir)
df.dropDuplicates().write.mode('overwrite').parquet(dir)
The data will now be read all at once, as intended. If you want to change the number of files written out, use coalesce before the .write call, like so: df.dropDuplicates().coalesce(4).write.mode('overwrite').parquet(dir).

Xml parsing on spark Structured Streaming

I'm trying to analyze data using Kinesis source in PySpark Structured Streaming on Databricks.
I created a Dataframe as shown below.
kinDF = spark.readStream.format("kinesis").option("streamName", "test-stream-1").load()
Later I converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data", expr("CAST(data as string)"))
Now, I need to extract a few fields from the df.xml_data column using XPath. Can you please suggest any possible solution?
If I create a dataframe directly for these xml files as xml_df = spark.read.format("xml").options(rowTag='Consumers').load("s3a://bkt/xmldata"), I'm able to query using xpath:
xml_df.select("Analytics.Amount1").show()
But I am not sure how to extract elements in a similar way on a Spark Structured Streaming DataFrame where the data is in text format.
Are there any XML functions to convert text data using a schema? I saw an example for JSON data using from_json.
Is it possible to use spark.read on a dataframe column?
I need to find aggregated "Amount1" for every 5 minutes window.
Thanks for your help
You can use com.databricks.spark.xml.XmlReader to read XML data from a column, but it requires an RDD, which means that you need to transform your df to an RDD using df.rdd, which may impact performance.
Below is an untested Scala-style sketch:
import com.databricks.spark.xml.XmlReader

val xmlRdd = kinDF.select("xml_data").rdd.map(r => r.getString(0))
val parsedDf = new XmlReader().xmlRdd(spark, xmlRdd)
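Alternatively, a rough PySpark sketch that avoids the RDD conversion by using Spark SQL's built-in xpath functions directly on the streaming DataFrame; the XPath expression and the approximateArrivalTimestamp column are assumptions based on the rowTag and fields described above:
from pyspark.sql.functions import expr, window

# Cast the Kinesis payload to a string, as in the question
df = kinDF.withColumn("xml_data", expr("CAST(data AS string)"))

# Pull fields out of the XML text with Spark SQL's built-in xpath functions
parsed = df.select(
    expr("xpath_double(xml_data, '/Consumers/Analytics/Amount1') AS Amount1"),
    "approximateArrivalTimestamp",  # assumed timestamp column from the Kinesis source
)

# Aggregate Amount1 over 5-minute windows
agg = parsed.groupBy(window("approximateArrivalTimestamp", "5 minutes")).sum("Amount1")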

How to convert multiple parquet files into TFrecord files using SPARK?

I would like to produce stratified TFrecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found any way other than trying to work in two steps:
1. Repartition the dataframe according to my condition, using partitionBy(), and write the resulting partitions to parquet files.
2. Read those parquet files to convert them into TFrecord files with the tensorflow-connector plugin.
It is the second step that I'm unable to do efficiently. My idea was to read in the individual parquet files on the executors and immediately write them into TFrecord files. But this needs access to the SQLContext, which can only be done in the driver (discussed here), so not in parallel. I would like to do something like this:
# List all parquet files to be converted
import glob, os
from pyspark.sql import SparkSession

files = glob.glob('/path/*.parquet')

sc = SparkSession.builder.getOrCreate().sparkContext
sc.parallelize(files, 2).foreach(lambda parquetFile: convert_parquet_to_tfrecord(parquetFile))
Could I construct the function convert_parquet_to_tfrecord that would be able to do this on the executors?
I've also tried just using the wildcard when reading all the parquet files:
SQLContext(sc).read.parquet('/path/*.parquet')
This indeed reads all parquet files, but unfortunately not into individual partitions. It appears that the original structure gets lost, so it doesn't help me if I want the exact contents of the individual parquet files converted into TFrecord files.
Any other suggestions?
Try spark-tfrecord.
Spark-TFRecord is a tool similar to spark-tensorflow-connector, but it supports partitionBy. The following example shows how to partition a dataset.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// create a dataframe
val df = Seq((8, "bat"), (8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
val tf_output_dir = "/tmp/tfrecord-test"

// dump the tfrecords to files
df.repartition(3, col("number"))
  .write.mode(SaveMode.Overwrite)
  .partitionBy("number")
  .format("tfrecord")
  .option("recordType", "Example")
  .save(tf_output_dir)
More information can be found in the GitHub repo: https://github.com/linkedin/spark-tfrecord
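If the rest of the pipeline is in PySpark, a rough equivalent of the Scala snippet above (assuming the spark-tfrecord package is attached to the cluster, and reusing the column name and output path from the example):
# Repartition by the stratification column, then let spark-tfrecord write one
# TFRecord sub-directory per value of "number" via partitionBy.
(df.repartition(3, "number")
   .write.mode("overwrite")
   .partitionBy("number")
   .format("tfrecord")
   .option("recordType", "Example")
   .save("/tmp/tfrecord-test"))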
If I understood your question correctly, you want to write the partitions locally on the workers' disk.
If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.
This is the code that you are looking for (as stated in the documentation linked above):
myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
On a side note, if you are worried about efficiency, why are you using PySpark? It would be better to use Scala instead.

read local csv file in pySpark (2.3)

I'm using pySpark 2.3 and trying to read a CSV file that looks like this:
0,0.000476517230863068,0.0008178378961061477
1,0.0008506156837329876,0.0008467260987257776
But it doesn't work:
from pyspark import sql, SparkConf, SparkContext
print (sc.applicationId)
>> <property at 0x7f47583a5548>
data_rdd = spark.textFile(name=tsv_data_path).filter(x.split(",")[0] != 1)
And I get an error:
AttributeError: 'SparkSession' object has no attribute 'textFile'
Any idea how I should read it in pySpark 2.3?
First, textFile exists on the SparkContext (called sc in the repl), not on the SparkSession object (called spark in the repl).
Second, for CSV data, I would recommend using the CSV DataFrame loading code, like this:
df = spark.read.format("csv").load("file:///path/to/file.csv")
You mentioned in comments needing the data as an RDD. You are going to have significantly better performance if you can keep all of your operations on DataFrames instead of RDDs. However, if you need to fall back to RDDs for some reason you can do it like the following:
rdd = df.rdd.map(lambda row: row.asDict())
This approach is better than trying to load the file with textFile and parsing the CSV data yourself. If you use the DataFrame CSV loading, it will properly handle all the CSV edge cases for you, like quoted fields. Also, if you only need some of the columns, you can filter on the DataFrame before converting it to an RDD, to avoid bringing all that extra data over into the Python interpreter.
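For instance, a small sketch of that approach applied to the data in the question (assuming no header row, so Spark assigns the default column names _c0, _c1, _c2):
df = spark.read.format("csv").load("file:///path/to/file.csv")

# Filter rows and trim columns while everything is still a DataFrame...
filtered = df.filter(df["_c0"] != "1").select("_c0", "_c1")

# ...and only fall back to an RDD at the end, if it is really needed
rdd = filtered.rdd.map(lambda row: row.asDict())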
