How to read only n rows of large CSV file on HDFS using spark-csv package? - apache-spark

I have a big distributed file on HDFS and each time I use sqlContext with spark-csv package, it first loads the entire file which takes quite some time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")
Now, as I just want to do some quick checks at times, all I need is a few (any) n rows of the entire file.
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)
But all of these run after the file load is done. Can't I just restrict the number of rows while reading the file itself? I am referring to the nrows equivalent of pandas in spark-csv, like:
pd_df = pandas.read_csv("file_path", nrows=20)
Or it might be the case that Spark does not actually load the file in the first step, but in that case, why is my file-load step taking so much time?
I want
df.count()
to give me only n and not all rows. Is that possible?

You can use limit(n).
sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true').load("file_path").limit(20)
This will just load 20 rows.
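A quick usage sketch with the same reader options as in the question:
df_20 = (sqlContext.read.format('com.databricks.spark.csv')
    .options(header='true', inferschema='true')
    .load("file_path")
    .limit(20))
print(df_20.count())  # at most 20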

My understanding is that reading just a few lines is not supported by the spark-csv module directly. As a workaround, you could read the file as a text file, take as many lines as you want, and save them to some temporary location. With the lines saved, you could use spark-csv to read them, including the inferSchema option (which you may want to use, given that you are in exploration mode).
val numberOfLines = ...
spark.
read.
text("myfile.csv").
limit(numberOfLines).
write.
text(s"myfile-$numberOfLines.csv")
val justFewLines = spark.
read.
option("inferSchema", true). // <-- you are in exploration mode, aren't you?
csv(s"myfile-$numberOfLines.csv")
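A rough PySpark equivalent of the same workaround (file names and the number of lines are illustrative):
n = 1000
(spark.read.text("myfile.csv")
    .limit(n)
    .write.mode("overwrite")
    .text("myfile-sample"))

# read only the sampled lines with the csv reader, letting it infer the schema
just_few_lines = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("myfile-sample"))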

Not inferring schema and using limit(n) worked for me, in all aspects.
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, DoubleType

f_schema = StructType([
    StructField("col1", LongType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", DoubleType(), True)
    ...
])
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true').schema(f_schema).load(data_path).limit(10)
Note: if we use inferschema='true', it takes the same amount of time again, probably because the whole file still has to be scanned.
But if we don't have an idea of the schema, Jacek Laskowski's solution works well too. :)
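If you do want inferred types without paying for a full scan, the built-in csv reader in newer Spark versions (2.3+, if I remember correctly) also takes a samplingRatio option that limits the fraction of rows used for inference; a hedged sketch:
df_sampled = (spark.read
    .options(header='true', inferSchema='true', samplingRatio='0.01')
    .csv(data_path)
    .limit(10))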

The solution given by Jacek Laskowski works well. Presenting an in-memory variation below.
I recently ran into this problem. I was using Databricks and had a huge CSV directory (200 files of 200 MB each).
I originally had
val df = spark.read.format("csv")
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.load("dbfs:/huge/csv/files/in/this/directory/")
display(df)
which took a lot of time (10+ minutes), but then I changed it to the following and it ran instantly (2 seconds):
val lines = spark.read.text("dbfs:/huge/csv/files/in/this/directory/").as[String].take(1000)
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(spark.createDataset(lines))
display(df)
Inferring a schema for text formats is hard, and it can be done this way for the csv and json formats (but not for multi-line json).
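The same take-a-sample trick can be written for single-line JSON too; a minimal PySpark sketch (paths are illustrative):
sample_lines = spark.read.text("/path/to/data.json").limit(1000).rdd.map(lambda r: r[0])
sample_df = spark.read.json(sample_lines)  # schema is inferred from the sampled lines only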

Since PySpark 2.3 you can simply load the data as text, limit it, and apply the csv reader on the result:
(spark
.read
.options(inferSchema="true", header="true")
.csv(
spark.read.text("/path/to/file")
.limit(20) # Apply limit
.rdd.flatMap(lambda x: x))) # Convert to RDD[str]
Scala counterpart is available since Spark 2.2:
spark
.read
.options(Map("inferSchema" -> "true", "header" -> "true"))
.csv(spark.read.text("/path/to/file").limit(20).as[String])
In Spark 3.0.0 or later one can also apply limit and use from_csv function, but it requires a schema, so it probably won't fit your requirements.
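For reference, a minimal sketch of that from_csv route; the DDL schema string here is hypothetical, since from_csv needs a schema up front:
from pyspark.sql.functions import from_csv, col

schema_ddl = "col1 BIGINT, col2 INT, col3 DOUBLE"  # hypothetical
parsed = (spark.read.text("/path/to/file")
    .limit(20)
    .select(from_csv(col("value"), schema_ddl).alias("row"))
    .select("row.*"))
# note: a header line, if present, is parsed like any other row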

Since I didn't see this solution among the answers, the pure SQL approach works for me:
df = spark.sql("SELECT * FROM csv.`/path/to/file` LIMIT 10000")
If there is no header the columns will be named _c0, _c1, etc. No schema required.
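A small usage sketch for the headerless case (the column names assigned here are hypothetical):
df = spark.sql("SELECT * FROM csv.`/path/to/file` LIMIT 10000")
df = df.toDF("id", "name", "value")  # rename _c0, _c1, _c2 to something meaningful
df.show(5)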

Maybe this will be helpful for anyone working in Java.
Applying limit alone will not help to reduce the time; you have to collect the n rows from the file.
DataFrameReader frameReader = spark
.read()
.format("csv")
.option("inferSchema", "true");
//set framereader options, delimiters etc
List<String> dataset = spark.read().textFile(filePath).limit(MAX_FILE_READ_SIZE).collectAsList();
return frameReader.csv(spark.createDataset(dataset, Encoders.STRING()));

Related

Read Excel files from S3 using Scala, Spark and org.apache.poi

I'm looking for the way to open and process an Excel file (*.xlsx) in Spark job.
I'm quite new to the Scala/Spark stack, so I'm trying to complete it in a pythonic way :)
Without Spark it's simple:
val f = new File("src/worksheets.xlsx")
val workbook = WorkbookFactory.create(f)
val sheet = workbook.getSheetAt(0)
But Spark needs some streaming input. I've configured Hadoop for S3 (in my case - MinIO)
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set(
"fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
)
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.access.key", params.minioAccessKey.get)
hadoopConf.set("fs.s3a.secret.key", params.minioSecretKey.get)
hadoopConf.set(
"fs.s3a.connection.ssl.enabled",
params.minioSSL.get.toString
)
hadoopConf.set("fs.s3a.endpoint", params.minioUrl.get)
val FilterDF = sparkSession.read
.format("com.crealytics.spark.excel")
.option("recursiveFileLookup", "true")
.option("modifiedBefore", "2020-07-01T05:30:00")
.option("modifiedAfter", "2020-06-01T05:30:00")
.option("header", "true")
.load("s3a://first/");
println(FilterDF)
So the question is: how do I configure the DataFrame (or maybe some other solution) to filter and gather files in some time range from the S3 bucket and make them suitable to work with Apache POI? Its Workbook can process regular file objects as well as an InputStream (so this might be the point of conversion).
Thanks in advance

orderBy is not giving correct results in Spark SQL

I have a dataset of around 60 columns and 3000 rows.
I am using orderBy for sorting rows in the dataset and writing to a file,
but it's not giving correct results as expected.
dataset.orderBy(new Column(col_name).desc())
.coalesce(4)
.write()
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "false")
.mode(SaveMode.Overwrite)
.save("hdfs://" + filePath);
Please let me know what I am missing here
I also found the solution below, but I don't think it is the correct one:
Row[] rows = dataset.take(3000);
for ( Row row : rows){
// here i am writing in a file row by row
System.out.println(row);
}
The problem is that coalesce will merge your existing partitions in an unsorted way (and no, coalesce will not cause a shuffle).
If you want 4 files and sorting within the files, you need to change spark.sql.shuffle.partitions before the orderBy; this will make your shuffle produce 4 partitions.
spark.sql("set spark.sql.shuffle.partitions=4")
dataset.orderBy(new Column(col_name).desc())
.write()
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "false")
.mode(SaveMode.Overwrite)
.save("hdfs://" + filePath);
if you only care about the sorting within the files, you could also use sortWithinPartitions(new Column(col_name).desc())
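For completeness, a PySpark sketch of the sortWithinPartitions variant (dataset, col_name and the HDFS destination are the ones from the question):
from pyspark.sql.functions import col

(dataset
    .coalesce(4)
    .sortWithinPartitions(col(col_name).desc())
    .write
    .option("delimiter", ",")
    .option("header", "false")
    .mode("overwrite")
    .csv("hdfs://" + file_path))  # same HDFS destination as in the question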
Because your .coalesce(4) shuffles your dataframe's order, coalesce first and then sort:
dataset
.coalesce(4)
.orderBy(new Column(col_name).desc())
.write()
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "false")
.mode(SaveMode.Overwrite)
.save("hdfs://" + filePath);
You should also set spark.sql.shuffle.partitions to 4 in your Spark context, because orderBy also provokes a shuffle.
As per your clarification in the comments, you need your ordered output to be contained in a single file.
With only spark, that's possible only with spark.sql("set spark.sql.shuffle.partitions=1") followed by orderBy and write. But the drawback is it won't scale for big data as it will not be parallelized.
A workaround is:
Make Spark do the orderBy with maximum parallelism (i.e. don't coalesce or set spark.sql.shuffle.partitions=1) and end up with n files.
Add some extra logic to your file-merging code, as sketched below:
List all files, fetch the value of col_name from each, and maintain a map of [(col_name value), filepath].
Sort the map by key (the value of col_name).
Then perform your merge.
This will maintain your ordering.
The idea is that while the merging part will be mostly single-threaded, at least the sorting is done in a distributed way :)
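A rough sketch of those merge steps in Python, assuming the ordered part files have already been copied somewhere the driver can read them directly (e.g. with hdfs dfs -get); paths, col_index, and the descending order are illustrative:
import glob

part_files = glob.glob("/local/copy/of/output/part-*")

def first_key(path):
    # fetch the value of col_name from the first row of this part file
    with open(path) as f:
        return float(f.readline().split(",")[col_index])  # col_index: position of col_name

# sort the [(col_name value), filepath] map by key; descending to match orderBy(...).desc()
part_files.sort(key=first_key, reverse=True)

# then perform the merge, which preserves the global ordering
with open("/local/copy/of/output/merged.csv", "w") as out:
    for path in part_files:
        with open(path) as f:
            out.write(f.read())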

Spark save dataframe metadata and reuse it

When I read a dataset with a lot of files (in my case from Google Cloud Storage), spark.read works for a long time before the first manipulation.
I'm not sure what it does, but I guess it maps the files and samples them to infer the schema.
My question is: is there an option to save this metadata collected about the dataframe and reuse it in other work on the dataset?
-- UPDATE --
The data is arranged like this:
gs://bucket-name/table_name/day=yyyymmdd/many_json_files
When I run df = spark.read.json("gs://bucket-name/table_name"), it takes a lot of time. I wish I could do the following:
df = spark.read.json("gs://bucket-name/table_name")
df.saveMetadata("gs://bucket-name/table_name_metadata")
And in another session:
df = spark.read.metadata("gs://bucket-name/table_name_metadata").‌​json("gs://bucket-na‌​me/table_name")
...
<some df manipulation>
...
We just need to infer the schema once and reuse it for the later files, if we have a lot of files with the same schema, like this:
val df0 = spark.read.json("first_file_we_want_spark_to_infer_from.json")
val schema = df0.schema
// for other files
val df = spark.read.schema(schema).json("do_not_infer_schema.json")
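The same idea in PySpark, going one step further and persisting the inferred schema so later sessions can skip inference entirely; the example day partition and local schema path are illustrative:
import json
from pyspark.sql.types import StructType

# infer once on a small, representative slice, then serialize the schema
df0 = spark.read.json("gs://bucket-name/table_name/day=20200101")
with open("/tmp/table_name_schema.json", "w") as f:
    f.write(df0.schema.json())

# in another session: load the schema and read without inference
with open("/tmp/table_name_schema.json") as f:
    schema = StructType.fromJson(json.load(f))
df = spark.read.schema(schema).json("gs://bucket-name/table_name")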

Call inferSchema directly after the load is done with spark-csv

Is there a way that I can directly call inferSchema after load is done?
Ex:
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").load(location)
df.schema
I want to call something like this:
val newdf = df.inferSchema()
newdf.printSchema()
Regards
It's not possible unless you define a new schema and apply it to the new DataFrame on creation.
You can also read the schema using the csv source and store it to use afterwards, but this will scan the data either way.
You haven't inferred a schema; spark-csv considers every column to be a string.
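Another option, since everything is loaded as strings, is to cast the columns you care about after the load; a PySpark sketch with hypothetical column names and types:
from pyspark.sql.functions import col

newdf = (df
    .withColumn("age", col("age").cast("int"))
    .withColumn("salary", col("salary").cast("double")))
newdf.printSchema()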

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than redownloading it each time?
Even if I add the package, df = sqlContext.load(source="com.databricks.spark.csv", header="true", path="cars.csv") won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark (as of Spark 2.0), this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available as well.
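For example (header and inferSchema are optional keyword arguments of .csv()):
df = sqlContext.read.csv("/path/to/your.csv", header=True, inferSchema=True)
df.printSchema()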
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv") \
    .map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID', 'Employee_name'])
Employee_df.show()
For PySpark, assuming that the first row of the CSV file contains a header:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)
1. Read the csv file into an RDD and then generate a RowRDD from the original RDD.
2. Create the schema represented by a StructType matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
I ran into a similar problem. The solution is to add an environment variable named "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
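If you are working from a plain Python session or a notebook, one way to set it is from Python itself, before the SparkContext is created (a sketch; the package version is the one mentioned above):
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
)
# now create the SparkContext / SQLContext as usual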
Based on the answer by Aravind, but much shorter, e.g. :
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in csv implementation.
Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, which has one potential problem:
when you read the csv that way, it will mark all the fields as strings, and when you then enforce a schema with an integer column you will get an exception.
A better way to do the above would be:
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
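The schema referenced above still needs to be defined; a hypothetical example:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),   # hypothetical columns
    StructField("age", IntegerType(), True),
])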
