Cannot stream files in subfolders with wildcards in pySpark streaming - apache-spark

This code works only if I make directory="s3://bucket/folder/2022/10/18/4/*"
from pyspark.sql.functions import from_json
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 30)
directory = "s3://bucket/folder/*/*/*/*/*"
stream_data = ssc.textFileStream(directory)
def readMyStream(rdd):
if not rdd.isEmpty():
df = spark.read.option("multiline","true").json(rdd)
print('Started the Process')
print('Selection of Columns')
df = df.select("c1","c2","c3","c4","c5")
df.show()
stream_data.foreachRDD(lambda rdd: readMyStream(rdd))
ssc.start()
ssc.awaitTermination()
In the docs it says it supports POSIX glob pattern. Any help is appreciated. Thank you

The issue is the final * should not be there. In the docs it says "it is a pattern of directories, not of files in directories". I didnt understand it the first time I read it.
A POSIX glob pattern can be supplied, such as "hdfs://namenode:8040/logs/2017/*". Here, the DStream will consist of all files in the directories matching the pattern. That is: it is a pattern of directories, not of files in directories.
directory = "s3://bucket/folder/*/*/*/*/*" should be directory = "s3://bucket/folder/*/*/*/*/"

Related

how to copy data from variable list of parquet files using pyspark

I have saved the list of parquet files(to be read) in a variable list, say listOffilteredFiles()
Now I want to read all the files from this list and write all the data into a single parquet file in another path. How can I do this. I have written the below code and I'm stuck here. Any help would be appreciated
import time
import datetime
from datetime import datetime
import pandas as pd
import glob
import pyspark
from pyspark.sql import SQLContext
dirName = 'dbfs:/mnt/abc/def/efg'
now = datetime.utcnow()
# Get the list of all files in directory tree at given path
listOfFiles = list()
listOffilteredFiles = list()
for (dirpath, dirnames, filenames) in os.walk(dirName):
listOfFiles += [os.path.join(dirpath, file) for file in filenames]
listOffilteredFiles = filter(lambda x: datetime.utcfromtimestamp(os.path.getmtime(x)) < now, listOfFiles)
Let's assume that the files you're trying to read are parquet files.
You can read all the parquet files from a directory using * syntax.
Suppose you have a directory like this:
/abc/def/[file1.parquet, file2.parquet, file3.parquet]
/abc/ghi/[file1.parquet, file2.parquet, file3.parquet]
and you wanna read all the parquet files under /abc directory. The spark read statement would be:
df = spark.read.parquet('/abc/*/*')
In another scenario, when you need to read some files after filtering them, you can do:
listOffilteredFiles = list(filter(lambda x: datetime.utcfromtimestamp(os.path.getmtime(x)) < now, listOfFiles))
df = spark.read.parquet(*listOffilteredFiles)

Is there a way to list all files in all folders and sub-folders in data lake?

I am trying to list all files in all folders and sub folders. I'm trying to get everything into a RDD or a dataframe (I don't think it matters because it's just a list of file names and paths). I found some code online that looks promising, but it doesn't seem to do anything. I'm pretty new to Scala though, so maybe I just missed something simple.
First code sample:
import org.apache.spark.sql.functions.input_file_name
val inputPath: String = "mnt/rawdata/2019/01/01/corp/*.gz"
val df = spark.read.text(inputPath)
.select(input_file_name, $"value")
.as[(String, String)] // Optionally convert to Dataset
.rdd // or RDD
Second code sample:
import java.io.File
def getListOfFiles(dir: String):List[File] = {
val d = new File(dir)
if (d.exists && d.isDirectory) {
d.listFiles.filter(_.isFile).toList
} else {
List[File]()
}
}
val files = getListOfFiles("mnt/rawdata/2019/01/01/corp/")
There's useful Files.walk method for recursive tree traversal in nio package.
import java.nio.file._
import scala.collection.JavaConverters._
val files = Files.walk(FileSystems.getDefault.getPath("mnt/rawdata/2019/01/01/corp")).iterator.asScala.toList
Just note it returns both files and directories, so you need to filter in case you only need files.

how to check if rdd is empty using spark streaming?

I have following pyspark code which I am using to read log files from logs/ directory and then saving results to a text file only when it has the data in it ... in other words when RDD is not empty. But I am having issues implementing it. I have tried both take(1) and notempty. As this is dstream rdd we can't apply rdd methods to it. Please let me know if I am missing anything.
conf = SparkConf().setMaster("local").setAppName("PysparkStreaming")
sc = SparkContext.getOrCreate(conf = conf)
ssc = StreamingContext(sc, 3) #Streaming will execute in each 3 seconds
lines = ssc.textFileStream('/Users/rocket/Downloads/logs/') #'logs/ mean directory name
audit = lines.map(lambda x: x.split('|')[3])
result = audit.countByValue()
#result.pprint()
#result.foreachRDD(lambda rdd: rdd.foreach(sendRecord))
# Print the first ten elements of each RDD generated in this DStream to the console
if result.foreachRDD(lambda rdd: rdd.take(1)):
result.pprint()
result.saveAsTextFiles("/Users/rocket/Downloads/output","txt")
else:
result.pprint()
print("empty")
The correct structure would be
import uuid
def process_batch(rdd):
if not rdd.isEmpty():
result.saveAsTextFiles("/Users/rocket/Downloads/output-{}".format(
str(uuid.uuid4())
) ,"txt")
result.foreachRDD(process_batch)
That however, as you see above, requires a separate directory for each batch, as RDD API doesn't have append mode.
And alternative could be:
def process_batch(rdd):
if not rdd.isEmpty():
lines = rdd.map(str)
spark.createDataFrame(lines, "string").save.mode("append").format("text").save("/Users/rocket/Downloads/output")

How to import multiple csv files in a single load?

Consider I have a defined schema for loading 10 csv files in a folder. Is there a way to automatically load tables using Spark SQL. I know this can be performed by using an individual dataframe for each file [given below], but can it be automated with a single command rather than pointing a file can I point a folder?
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downloads/2008.csv")
Use wildcard, e.g. replace 2008 with *:
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downloads/*.csv") // <-- note the star (*)
Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
spark.read.option("header", "true").csv("../Downloads/*.csv")
Notes:
Replace format("com.databricks.spark.csv") by using format("csv") or csv method instead. com.databricks.spark.csv format has been integrated to 2.0.
Use spark not sqlContext
Ex1:
Reading a single CSV file. Provide complete file path:
val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\tmp\\cars1.csv")
Ex2:
Reading multiple CSV files passing names:
val df=spark.read.option("header","true").csv("C:spark\\sample_data\\tmp\\cars1.csv", "C:spark\\sample_data\\tmp\\cars2.csv")
Ex3:
Reading multiple CSV files passing list of names:
val paths = List("C:spark\\sample_data\\tmp\\cars1.csv", "C:spark\\sample_data\\tmp\\cars2.csv")
val df = spark.read.option("header", "true").csv(paths: _*)
Ex4:
Reading multiple CSV files in a folder ignoring other files:
val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\tmp\\*.csv")
Ex5:
Reading multiple CSV files from multiple folders:
val folders = List("C:spark\\sample_data\\tmp", "C:spark\\sample_data\\tmp1")
val df = spark.read.option("header", "true").csv(folders: _*)
Note that you can use other tricks like :
-- One or more wildcard:
.../Downloads20*/*.csv
-- braces and brackets
.../Downloads201[1-5]/book.csv
.../Downloads201{11,15,19,99}/book.csv
Reader's Digest: (Spark 2.x)
For Example, if you have 3 directories holding csv files:
dir1, dir2, dir3
You then define paths as a string of comma delimited list of paths as follows:
paths = "dir1/,dir2/,dir3/*"
Then use the following function and pass it this paths variable
def get_df_from_csv_paths(paths):
df = spark.read.format("csv").option("header", "false").\
schema(custom_schema).\
option('delimiter', '\t').\
option('mode', 'DROPMALFORMED').\
load(paths.split(','))
return df
By then running:
df = get_df_from_csv_paths(paths)
You will obtain in df a single spark dataframe containing the data from all the csvs found in these 3 directories.
===========================================================================
Full Version:
In case you want to ingest multiple CSVs from multiple directories you simply need to pass a list and use wildcards.
For Example:
if your data_path looks like this:
's3://bucket_name/subbucket_name/2016-09-*/184/*,
s3://bucket_name/subbucket_name/2016-10-*/184/*,
s3://bucket_name/subbucket_name/2016-11-*/184/*,
s3://bucket_name/subbucket_name/2016-12-*/184/*, ... '
you can use the above function to ingest all the csvs in all these directories and subdirectories at once:
This would ingest all directories in s3 bucket_name/subbucket_name/ according to the wildcard patterns specified. e.g. the first pattern would look in
bucket_name/subbucket_name/
for all directories with names starting with
2016-09-
and for each of those take only the directory named
184
and within that subdirectory look for all csv files.
And this would be executed for each of the patterns in the comma delimited list.
This works way better than union..
Using Spark 2.0+, we can load multiple CSV files from different directories using
df = spark.read.csv(['directory_1','directory_2','directory_3'.....], header=True). For more information, refer the documentation
here
val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\*.csv)
will consider files tmp, tmp1, tmp2, ....

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument everytime I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the link above.
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv")
.map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
Employee_df.show()
for Pyspark, assuming that the first row of the csv file contains a header
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df=spark.read.csv('fileNameWithPath', mode="DROPMALFORMED",inferSchema=True, header = True)
Read the csv file in to a RDD and then generate a RowRDD from the original RDD.
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
try:
Spark_full_rdd += Spark_temp_rdd
except NameError:
Spark_full_rdd = Spark_temp_rdd
del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
fields = line.split(',')
return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
I ran into similar problem. The solution is to add an environment variable named as "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
Based on the answer by Aravind, but much shorter, e.g. :
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation(spark 2.X) you dont need to add the packages argument, You can use the inbuilt csv implementation
Additionally as the accepted answer you dont need to create an rdd then enforce schema that has 1 potential problem
When you read the csv as then it will mark all the fields as string and when you enforce the schema with an integer column you will get exception.
A better way to do the above would be
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()

Resources