Spark save dataframe metadata and reuse it - apache-spark

When I read a dataset with a lot of files (in my case from google cloud storage), spark.read works a lot of time before the first manipulation.
I'm not sure what it does but I guess it maps the files and sample them to infer the schema.
My question is, is there an option to save this metadata collected about the dataframe and reuse it in other work on the dataset.
-- UPDATE --
The data is arranged like this:
gs://bucket-name/table_name/day=yyyymmdd/many_json_files
When I run: df = spark.read.json("gs://bucket-name/table_name") That's take a lot of time. I wish I could do the following:
df = spark.read.json("gs://bucket-name/table_name")
df.saveMetadata("gs://bucket-name/table_name_metadata")
And in another session:
df = spark.read.metadata("gs://bucket-name/table_name_metadata").‌​json("gs://bucket-na‌​me/table_name")
...
<some df manipulation>
...

We just need infer the schema once and reuse it for the later files, if we have a lot of file which has the same schema. like this.
val df0 = spark.read.json("first_file_we_wanna_spark_to_info.json")
val schema = df0.schema
// for other files
val df = spark.read.schema(schema).json("donnot_info_schema.json")

Related

How to iterate in Databricks to read hundreds of files stored in different subdirectories in a Data Lake?

I have to read hundreds of avro files in Databricks from an Azure Data Lake Gen2, extract data from the Body field inside every file, and concatenate all the extracted data in a unique dataframe. The point is that all avro files to read are stored in different subdirectories in the lake, following the pattern:
root/YYYY/MM/DD/HH/mm/ss.avro
This forces me to loop the ingestion and selection of data. I'm using this Python code, in which list_avro_files is the list of paths to all files:
list_data = []
for file_avro in list_avro_files:
df = spark.read.format('avro').load(file_avro)
data1 = spark.read.json(df.select(df.Body.cast('string')).rdd.map(lambda x: x[0]))
list_data.append(data1)
data = reduce(DataFrame.unionAll, list_data)
Is there any way to do this more efficiently? How can I parallelize/speed up this process?
As long as your list_avro_files can be expressed through standard wildcard syntax, you can probably use Spark's own ability to parallelize read operation. All you'd need is to specify a basepath and a filename pattern for your avro files:
scala> var df = spark.read
.option("basepath","/user/hive/warehouse/root")
.format("avro")
.load("/user/hive/warehouse/root/*/*/*/*.avro")
And, in case you find that you need to know exactly which file any given row came from, use input_file_name() built-in function to enrich your dataframe:
scala> df = df.withColumn("source",input_file_name())

Spark infer schema with limit during a read.csv

I'd like to infer a Spark.DataFrame schema from a directory of CSV files using a small subset of the rows (say limit(100)).
However, setting inferSchema to True means that the Input Size / Records for the FileScanRDD seems to always be equal to the number of rows in all the CSV files.
Is there a way to make the FileScan more selective, such that Spark looks at fewer rows when inferring a schema?
Note: setting the samplingRatio option to be < 1.0 does not have the desired behaviour, though it is clear that inferSchema uses only the sampled subset of rows.
You could read a subset of your input data into a dataSet of String.
The CSV method allows you to pass this as a parameter.
Here is a simple example (I'll leave reading the sample of rows from the input file to you):
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
val df = spark.read.option("inferSchema","true").csv(csvRDD.toDS)
df.schema
When run in spark-shell, the final line from the above prints (I reformatted it for readability):
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_c0,IntegerType,true),
StructField(_c1,IntegerType,true),
StructField(_c2,StringType,true)
)
Which is the correct Schema for my limited input data set.
Assuming you are only interested in the schema, here is a possible approach based on cipri.l's post in this link
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}
def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
// Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
import sparkSession.implicits._
val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
// Provide information about the CSV files' structure
val firstLine = dataSample.head
val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
// Infer the CSV schema based on the sample data
val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
schema
}
Unlike GMc's answer from above, this approach tries to directly infer the schema the same way the DataFrameReader.csv() does in the background (but without going through the effort of building an additional Dataset with that schema, that we would then only use to retrieve the schema back from it)
The schema is inferred based on a Dataset[String] containing only the first sampleSize lines from the input files as plain text strings.
When trying to retrieve samples from data, Spark has only 2 types of methods:
Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it could read a single partition (if the required row count is low enough)
Since you mentioned you want to use a specific small number of rows and since you want to avoid touching all the data, I provided a solution based on option 2
PS: The DataFrameReader.textFile method accepts paths to files, folders and it also has a varargs variant, so you could pass in one or more files or folders.

Call inferSchema directly after the load is done with spark-csv

Is there a way that I can directly call inferSchema after load is done?
Ex:
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").load(location)
df.schema
I want to call some thing like below:
val newdf = df.inferSchema()
newdf.printSchema()
Regards
It's not possible unless you define a new schema and apply it to the new DataFrame on creation.
You can also read the schema from using the csv source and store it to use afterwards but this will scan the data either way.
You haven't inferred a schema, spark-csv considers every column as a string.

How to read only n rows of large CSV file on HDFS using spark-csv package?

I have a big distributed file on HDFS and each time I use sqlContext with spark-csv package, it first loads the entire file which takes quite some time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")
now as I just want to do some quick check at times, all I need is few/ any n rows of the entire file.
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)
but all these run after the file load is done. Can't I just restrict the number of rows while reading the file itself ? I am referring to n_rows equivalent of pandas in spark-csv, like:
pd_df = pandas.read_csv("file_path", nrows=20)
Or it might be the case that spark does not actually load the file, the first step, but in this case, why is my file load step taking too much time then?
I want
df.count()
to give me only n and not all rows, is it possible ?
You can use limit(n).
sqlContext.format('com.databricks.spark.csv') \
.options(header='true', inferschema='true').load("file_path").limit(20)
This will just load 20 rows.
My understanding is that reading just a few lines is not supported by spark-csv module directly, and as a workaround you could just read the file as a text file, take as many lines as you want and save it to some temporary location. With the lines saved, you could use spark-csv to read the lines, including inferSchema option (that you may want to use given you are in exploration mode).
val numberOfLines = ...
spark.
read.
text("myfile.csv").
limit(numberOfLines).
write.
text(s"myfile-$numberOfLines.csv")
val justFewLines = spark.
read.
option("inferSchema", true). // <-- you are in exploration mode, aren't you?
csv(s"myfile-$numberOfLines.csv")
Not inferring schema and using limit(n) worked for me, in all aspects.
f_schema = StructType([
StructField("col1",LongType(),True),
StructField("col2",IntegerType(),True),
StructField("col3",DoubleType(),True)
...
])
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true').schema(f_schema).load(data_path).limit(10)
Note: If we use inferschema='true', its again the same time, and maybe hence the same old thing.
But if we dun have idea of the schema, Jacek Laskowski solutions works well too. :)
The solution given by Jacek Laskowski works well. Presenting an in-memory variation below.
I recently ran into this problem. I was using databricks and had a huge csv directory (200 files of 200MB each)
I originally had
val df = spark.read.format("csv")
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.load("dbfs:/huge/csv/files/in/this/directory/")
display(df)
which took a lot of time (10+ minutes), but then I change it to below and it ran instantly (2 seconds)
val lines = spark.read.text("dbfs:/huge/csv/files/in/this/directory/").as[String].take(1000)
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(spark.createDataset(lines))
display(df)
Inferring schema for text formats is hard and it can be done this way for the csv and json (but not if it's a multi-line json) formats.
Since PySpark 2.3 you can simply load data as text, limit, and apply csv reader on the result:
(spark
.read
.options(inferSchema="true", header="true")
.csv(
spark.read.text("/path/to/file")
.limit(20) # Apply limit
.rdd.flatMap(lambda x: x))) # Convert to RDD[str]
Scala counterpart is available since Spark 2.2:
spark
.read
.options(Map("inferSchema" -> "true", "header" -> "true"))
.csv(spark.read.text("/path/to/file").limit(20).as[String])
In Spark 3.0.0 or later one can also apply limit and use from_csv function, but it requires a schema, so it probably won't fit your requirements.
Since I didn't see that solution in the answers, the pure SQL-approach is working for me:
df = spark.sql("SELECT * FROM csv.`/path/to/file` LIMIT 10000")
If there is no header the columns will be named _c0, _c1, etc. No schema required.
May be this would be helpful who is working in java.
Applying limit will not help to reduce the time. You have to collect the n rows from the file.
DataFrameReader frameReader = spark
.read()
.format("csv")
.option("inferSchema", "true");
//set framereader options, delimiters etc
List<String> dataset = spark.read().textFile(filePath).limit(MAX_FILE_READ_SIZE).collectAsList();
return frameReader.csv(spark.createDataset(dataset, Encoders.STRING()));

How to check if a DataFrame was already cached/persisted before?

For spark's RDD object this is quite trivial as it exposes a getStorageLevel method, but DF does not seem to expose anything similar. anyone?
You can check weather a DataFrame is cached or not using Catalog (org.apache.spark.sql.catalog.Catalog) which comes in Spark 2.
Code example :
val sparkSession = SparkSession.builder.
master("local")
.appName("example")
.getOrCreate()
val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")
//interacting with catalog
val catalog = sparkSession.catalog
//print the databases
catalog.listDatabases().select("name").show()
// print all the tables
catalog.listTables().select("name").show()
// is cached
println(catalog.isCached("sales"))
df.cache()
println(catalog.isCached("sales"))
Using the above code you can list all the tables and check weather a table is cached or not.
You can check the working code example here

Resources