How to specify Hadoop Configuration when reading CSV - apache-spark

I am using Spark 2.0.2. How can I specify the Hadoop configuration item textinputformat.record.delimiter for the TextInputFormat class when reading a CSV file into a Dataset?
In Java I can write spark.read().csv(<path>); however, there doesn't seem to be a way to provide a Hadoop configuration specific to that read.
It is possible to set the item using spark.sparkContext().hadoopConfiguration(), but that is global.
Thanks,

You cannot. The Data Source API uses its own configuration which, as of 2.0, is not even compatible with the Hadoop configuration.
If you want to use a custom input format or other Hadoop configuration, use SparkContext.hadoopFile, SparkContext.newAPIHadoopRDD, or related methods.
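For example, a minimal Scala sketch of that approach, where the record delimiter is set on a per-read copy of the Hadoop configuration rather than on the global one (the path and delimiter value are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the global Hadoop configuration so the delimiter only applies to this read.
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\u0001") // placeholder delimiter

val lines = spark.sparkContext.newAPIHadoopFile(
  "/path/to/file.csv", // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf
).map { case (_, text) => text.toString }
// The result is an RDD[String] split on the custom delimiter; it can be converted
// to a Dataset with spark.createDataset and parsed from there.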

The delimiter can be set using option() in Spark 2.0:
val df = spark.read.option("header", "true").option("delimiter", "\t").csv("/hdfs/file/location")

Related

How to get the Hadoop path with Java/Scala API in Code Repositories

I need to read other formats (JSON, binary, XML) and infer the schema dynamically within a transform in Code Repositories, using the Spark data source API.
Example:
val df = spark.read.json(<hadoop_path>)
For that, I need an accessor to the Foundry file system path, which is something like:
foundry://...#url:port/datasets/ri.foundry.main.dataset.../views/ri.foundry.main.transaction.../startTransactionRid/ri.foundry.main.transaction...
This is possible with PySpark API (Python):
filesystem = input_transform.filesystem()
hadoop_path = filesystem.hadoop_path
However, for Java/Scala I didn’t find a way to do it properly.
A getter for the Hadoop path has recently been added to the Foundry Java API. After upgrading the version of the Java transform (transformsJavaVersion >= 1.188.0), you can get it:
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
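As a rough sketch of how that path could then be fed to the Spark data source API from Scala (only the hadoopPath() getter above comes from the Foundry API; myInput and the JSON example are illustrative):
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
// Read whatever format is needed and let Spark infer the schema dynamically.
val df = spark.read.json(hadoopPath)
df.printSchema()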

Are Apache Spark 2.0 parquet files incompatible with Apache Arrow?

The problem
I have written an Apache Spark DataFrame as a parquet file for a deep learning application in a Python environment; I am currently experiencing issues reading that file in basic examples of both the petastorm (following this notebook) and horovod frameworks. The DataFrame has the following type: DataFrame[features: array<float>, next: int, weight: int] (much like in Databricks' notebook, I had features be a VectorUDT, which I converted to an array).
In both cases, Apache Arrow throws an ArrowIOError: Invalid parquet file. Corrupt footer.
What I found until now
I discovered in this question and in this PR that as of version 2.0, Spark doesn't write _metadata or _common_metadata files unless spark.hadoop.parquet.enable.summary-metadata is set to true in Spark's configuration; those files are indeed missing.
I thus tried rewriting my DataFrame with this setting, but there is still no _common_metadata file. What does work is to explicitly pass a schema to petastorm when constructing a reader (passing schema_fields to make_batch_reader, for instance), but that is a problem with horovod, as there is no such parameter in horovod.spark.keras.KerasEstimator's constructor.
How could I, if at all possible, either make Spark output those files, or have Arrow infer the schema, just as Spark seems to do?
Minimal example with horovod
# Saving df
print(spark.conf.get('spark.hadoop.parquet.enable.summary-metadata'))  # outputs 'true'
df.repartition(10).write.mode('overwrite').parquet(path)
# ...
# Training
import horovod.spark.keras as hvd
from horovod.spark.common.store import Store
from tensorflow.keras.optimizers import Adadelta  # assuming tf.keras optimizers

model = build_model()
opti = Adadelta(learning_rate=0.015)
loss = 'sparse_categorical_crossentropy'
store = Store().create(prefix_path=prefix_path,
                       train_path=train_path,
                       val_path=val_path)
keras_estimator = hvd.KerasEstimator(
    num_proc=16,
    store=store,
    model=model,
    optimizer=opti,
    loss=loss,
    feature_cols=['features'],
    label_cols=['next'],
    batch_size=auto_steps_per_epoch,
    epochs=auto_nb_epochs,
    sample_weight_col='weight'
)
keras_model = keras_estimator.fit_on_parquet()  # Fails here with ArrowIOError
The problem is solved in pyarrow 0.14+ (issues.apache.org/jira/browse/ARROW-4723); be sure to install the updated version with pip (up to Databricks Runtime 6.5, the included version is 0.13).
Thanks to @joris' comment for pointing this out.

How to save files in same directory using saveAsNewAPIHadoopFile spark scala

I am using Spark Streaming and I want to save each batch locally in Avro format. I have used saveAsNewAPIHadoopFile to save the data in Avro format, and it works well, but it overwrites the existing files: each new batch overwrites the previous batch's data. Is there any way to save the Avro files in a common directory? I tried adding some Hadoop job conf properties to add a prefix to the file names, but none of the properties worked.
dstream.foreachRDD { rdd =>
  rdd.saveAsNewAPIHadoopFile(
    path,
    classOf[AvroKey[T]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[T]],
    job.getConfiguration()
  )
}
Try this: you can split your process into two steps:
Step 1: write the Avro files using saveAsNewAPIHadoopFile to <temp-path>
Step 2: move the files from <temp-path> to <actual-target-path>
This will solve your problem for now. I will share my thoughts if I find a way to do this in one step instead of two.
Hope this is helpful.
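Here is a minimal Scala sketch of that idea, reusing the code from the question; tempDir and targetDir stand in for <temp-path> and <actual-target-path>, and the batch time is used to give each micro-batch distinct file names (just one possible naming convention):
import org.apache.hadoop.fs.{FileSystem, Path}

dstream.foreachRDD { (rdd, batchTime) =>
  // Step 1: write this batch's Avro output to a temporary directory.
  rdd.saveAsNewAPIHadoopFile(
    tempDir,                              // hypothetical <temp-path>
    classOf[AvroKey[T]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[T]],
    job.getConfiguration()
  )
  // Step 2: move the part files into the common target directory, prefixing
  // them with the batch time so earlier batches are not overwritten.
  val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(tempDir))
    .filter(_.getPath.getName.startsWith("part-"))
    .foreach { status =>
      val dest = new Path(targetDir, s"${batchTime.milliseconds}-${status.getPath.getName}")
      fs.rename(status.getPath, dest)     // hypothetical <actual-target-path>
    }
}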

Read CSV with linebreaks in pyspark

I want to read with pyspark a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) inside some of the rows. The following sample shows how it looks when opened with Notepad++:
I tried to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any option or alternative way of reading the CSV that lets me read these two lines as one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 there was added new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multiline CSV records.
There is no such option in Spark 2.1. You can read the file with sparkContext.wholeTextFiles or just use a newer version.
The wholeFile option does not exist (anymore?) in the Spark API documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
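For completeness, here is the same read written out in Scala with the quote handling an RFC 4180 file usually needs (a sketch; the file name is a placeholder). The escape character is set to a double quote because RFC 4180 escapes quotes by doubling them, while Spark's default escape character is a backslash:
val df = spark.read
  .option("header", "true")     // assuming the file has a header row
  .option("multiLine", "true")  // allow quoted fields to span CRLF line breaks
  .option("escape", "\"")       // RFC 4180 doubles quotes instead of backslash-escaping them
  .csv("file.csv")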

Spark not using spark.sql.parquet.compression.codec

I'm comparing Spark's parquet files with Apache Drill's.
Drill's parquet files are far more lightweight than Spark's. Spark uses GZIP as the default compression codec; to experiment, I tried to change it to:
snappy: same size
uncompressed: same size
lzo: exception
I tried both ways:
sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")
sqlContext.setConf("spark.sql.parquet.compression.codec.", "uncompressed")
But it seems like it doesn't change the settings.
Worked for me in 2.1.1
df.write.option("compression","snappy").parquet(filename)
For Spark 1.3, the spark.sql.parquet.compression.codec parameter did not compress the output, but the one below did work:
sqlContext.sql("SET parquet.compression=SNAPPY")
Try this. Seems to work for me in 1.6.0
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
For Spark 1.6:
You can use different compression codecs. Try:
sqlContext.setConf("spark.sql.parquet.compression.codec","gzip")
sqlContext.setConf("spark.sql.parquet.compression.codec","lzo")
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
sqlContext.setConf("spark.sql.parquet.compression.codec","uncompressed")
Try:
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
I see you already did this, but I'm unable to delete my answer on mobile. Try setting this before creating the SQLContext, as suggested in the comment.
When facing issues while storing into Hive via the Hive context, use:
hc.sql("set parquet.compression=snappy")

Resources