Reading multiple avro files into RDD from a nested directory structure - apache-spark

Suppose I have a directory that contains a bunch of Avro files and I want to read them all in one shot. This code works fine:
val path = "hdfs:///path/to/your/avro/folder"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
However, if the folder contains subfolders and the Avro files are inside those subfolders, then I get an error:
5/10/30 14:57:47 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 6,
hadoop1): java.io.FileNotFoundException: Path is not a file: /folder/subfolder
Is there any way I can read all the Avro files (even those in subdirectories) into an RDD?
All the Avro files have the same schema, and I am on Spark 1.3.0.
Edit:
Based on the suggestion below, I executed this line in my spark shell:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
and this solved the problem... but now my code is very slow, and I don't understand what a MapReduce setting has to do with Spark.
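For reference, here is a minimal spark-shell sketch of both approaches: the recursive-listing flag from the edit above, and, as an alternative, a glob pattern (assuming the files end in .avro and sit exactly one directory level down). The imports are the standard avro-mapred classes already implied by the snippet above:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

// Option 1: sc.hadoopFile delegates input listing to Hadoop's FileInputFormat,
// which is why this MapReduce-era flag makes it descend into subdirectories.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val recursiveRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
  AvroInputFormat[GenericRecord]]("hdfs:///path/to/your/avro/folder")

// Option 2: if the files sit exactly one level down, a glob pattern lists them
// directly, without the recursive scan (one wildcard per directory level).
val globRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
  AvroInputFormat[GenericRecord]]("hdfs:///path/to/your/avro/folder/*/*.avro")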

Related

PySpark: how to clear readStream cache?

I am reading a directory with Spark's readStream. Earlier I gave the local path, but got a FileNotFoundException. I have changed the path to an HDFS path, but the execution log still shows it referring to the old setting (the local path).
22/06/01 10:30:32 WARN scheduler.TaskSetManager: Lost task 0.2 in stage 1.0 (TID 3, my.nodes.com, executor 3): java.io.FileNotFoundException: File file:/home/myuser/testing_aiman/data/fix_rates.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:129)
In fact, I have hardcoded the path variable, but it still refers to the previously set local path.
df = spark.readStream.csv("hdfs:///user/myname/streaming_test_dir",sep=sep,schema=df_schema,inferSchema=True,header=True)
I also ran spark.sql("CLEAR CACHE").collect(), but it didn't help either.
Before running spark.readStream, I ran the following code:
spark.sql("REFRESH \"file:///home/myuser/testing_aiman/data/fix_rates.csv\"").collect
spark.sql("CLEAR CACHE").collect
REFRESH <file:///path/that/showed/FileNotFoundException> actually did the trick.
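A rough Scala spark-shell equivalent of that workaround, for reference (the question itself is PySpark, where the analogous call spark.catalog.refreshByPath also exists; Spark 2.2+ is assumed, and dfSchema stands in for the schema used in the question):

// Invalidate Spark's cached file listing / metadata for the stale path,
// which is what the SQL statement REFRESH "<path>" does as well.
spark.catalog.refreshByPath("file:///home/myuser/testing_aiman/data/fix_rates.csv")

// Then define the streaming source against the intended HDFS location.
val streamingDF = spark.readStream
  .schema(dfSchema)               // streaming sources need an explicit schema
  .option("header", "true")
  .option("sep", ",")
  .csv("hdfs:///user/myname/streaming_test_dir")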

Spark empty _metadata file in parquet output

I am using an Oozie workflow to generate a Parquet file. Occasionally, when I try to read the file using Spark, I get the following exception:
java.io.IOException: Could not read footer: java.lang.RuntimeException:
hdfs://ip-10-1-2-243.ec2.internal:8020/path/to/file/_metadata is not a Parquet file (too small)
After deleting the metadata file, I can read in the rest of the files normally. I would like to know what causes Spark to output an empty _metadata file, and how I can avoid it in the future.
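As for avoiding the file altogether: the summary files (_metadata and _common_metadata) are written by the Parquet output committer at job commit, and parquet-hadoop exposes a switch to turn them off. A minimal sketch, assuming a Spark 1.4+ DataFrame writer and that losing the summary files is acceptable:

// Stop the Parquet output committer from writing the _metadata / _common_metadata
// summary files at job commit; readers then rely on the footers of the
// individual part files instead.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Subsequent writes produce only the part-* data files (plus _SUCCESS).
df.write.parquet("hdfs:///path/to/output")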

Read multiple files with SparkSession in Spark 2.0

In Spark 1.6, to read multiple files, I used:
JavaSparkContext ctx;
ctx.textFile(filePaths);
where filePaths is a comma-separated list of paths to the files, for example:
/home/user/folderA/0.log,/home/user/folderB/0.log
But when I upgraded to Spark 2.0, the method
SparkSession sparkSession;
sparkSession.read().textFile(filePaths);
doesn't work. The code throws an exception: Path does not exist.
Question: Is there any way to read multiple files from multiple paths in Spark 2.0, just like in Spark 1.6?
Edit: I tried calling the method as in Spark 1.6, using:
sparkSession.sparkContext().textFile(filePaths, 1).toJavaRDD();
That solved the problem. But is there another solution?
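One possible alternative, sketched in Scala (in Java the analogous call would be sparkSession.read().textFile(filePaths.split(","))): the Spark 2.0 reader takes its paths as varargs rather than as a single comma-separated string, so splitting the string first should be enough.

// DataFrameReader.textFile(paths: String*) is a varargs method; a comma-separated
// string is treated as one literal (non-existent) path, which would explain the
// "Path does not exist" error. Split it into individual paths instead.
val filePaths = "/home/user/folderA/0.log,/home/user/folderB/0.log"
val lines = spark.read.textFile(filePaths.split(","): _*)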

Spark in docker parquet error No predefined schema found

I have a local Spark test cluster (including R) based on https://github.com/gettyimages/docker-spark. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
When I try to read a Parquet file with SparkR, this exception occurs. Reading the Parquet file works without any problems on a local Spark installation.
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
Strangely, the error is exactly the same even for files that do not exist.
However in the terminal I can see that the files are there:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....
My initial Parquet file seems to have been corrupted during my test runs of the dockerized Spark.
To solve: re-create the Parquet files from the original sources.
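One quick way to rule out a SparkR-binding problem when this assertion appears is to try the same read from the Scala spark-shell inside the container; if that also fails, the files or the mounted folder are the likely culprit. A minimal sketch, assuming the Spark 1.x sqlContext seen in the stack trace above:

// Try the same path from the Scala shell; a corrupted or empty Parquet
// directory fails here as well, independent of the R bindings.
val df = sqlContext.read.parquet("file:/mappedFolder/myFile.parquet")
df.printSchema()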

Spark Sql 1.3.0 + parquet

USING SPARK-SQL:
I've created a table without Parquet in HDFS and everything is OK.
I've created the same table structure but with "stored as parquet"; I've also created the Parquet files, uploaded them to HDFS, and run "load inpath 'hdfs://servever/parquet_files'".
But when I try to execute "select * from table_name",
I get this exception:
Exception in thread "main" java.sql.SQLException: java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/user/hive/warehouse/table_name, expected: file:///
Any tips?
Fixed by including the Hadoop configuration files (core-site.xml and hdfs-site.xml) in Spark's configuration.
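For reference, the usual way to do that is to point HADOOP_CONF_DIR (in spark-env.sh or the environment) at the directory containing those files; a programmatic alternative is to load them into the Hadoop configuration by hand. A sketch with illustrative paths:

import org.apache.hadoop.fs.Path

// Load the cluster's core-site.xml / hdfs-site.xml explicitly so that
// fs.defaultFS resolves to hdfs://server:8020 instead of file:///,
// which is what triggers the "Wrong FS ... expected: file:///" error.
sc.hadoopConfiguration.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
sc.hadoopConfiguration.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))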
