How to manually load spark-redshift AVRO files into Redshift? - apache-spark

I have a Spark job that failed at the COPY portion of the write. I have all the output already processed in S3, but am having trouble figuring out how to manually load it.
COPY table
FROM 's3://bucket/a7da09eb-4220-4ebe-8794-e71bd53b11bd/part-'
CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
format as AVRO 'auto'
In my folder there are _SUCCESS, _committedxxx and _startedxxx files, and then 99 files all starting with the prefix part-. When I run this I get an stl_load_error -> Invalid AVRO file found. Unexpected end of AVRO file. If I take that prefix off, then I get:
[XX000] ERROR: Invalid AVRO file Detail: ----------------------------------------------- error: Invalid AVRO file code: 8001 context: Cannot init avro reader from s3 file Incorrect Avro container file magic number query: 10882709 location: avropath_request.cpp:432 process: query23_27 [pid=10653] -----------------------------------------------
Is this possible to do? It would be nice to save the processing.

I had the same error from Redshift.
The COPY worked after I deleted the _committedxxx and _startedxxx files (the _SUCCESS file is not a problem).
If you have many directories in s3, you can use the aws cli to clean them of these files:
aws s3 rm s3://my_bucket/my/dir/ --include "_comm*" --exclude "*.avro" --exclude "*_SUCCESS" --recursive
Note that --include "_comm*" on its own did not work for me; the command attempted to delete all files. (The cli includes everything by default, so an --include filter only has an effect after a preceding --exclude such as --exclude "*".) Excluding "*.avro" and "*_SUCCESS" does the trick instead. Be careful and run the command with --dryrun first!
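If you prefer to do the cleanup from Python, the following is a minimal boto3 sketch of the same idea (the bucket and prefix below are placeholders; it deletes only the Spark marker files and leaves the part- AVRO files alone):
# Sketch: remove Spark's _committed*/_started* marker objects under a prefix.
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my_bucket")   # placeholder bucket name
prefix = "my/dir/"                # placeholder prefix

for obj in bucket.objects.filter(Prefix=prefix):
    name = obj.key.rsplit("/", 1)[-1]
    if name.startswith("_committed") or name.startswith("_started"):
        print("deleting", obj.key)
        obj.delete()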

Related

Hadoop getmerge fails when trying to merge files to local directory

I am trying to merge two files from my HDFS to a folder on my local machine's desktop. The command that I am using is:
hadoop fs -getmerge -nl /user/hadoop/folder_name/ /Desktop/test_files/finalfile.csv
But that returns the following error:
getmerge: Mkdirs failed to create file:/Desktop/test_files (exists=false, cwd=file:/home/hadoop)
Does anyone know why this might be? I couldn't find much of anything else in my search.
You need to create the local folder /Desktop/test_files/
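For example (a sketch that wraps the same steps in Python; the paths are the ones from the question):
# Sketch: create the local destination directory first, then re-run getmerge.
import os
import subprocess

os.makedirs('/Desktop/test_files', exist_ok=True)   # local target from the question
subprocess.run(
    ['hadoop', 'fs', '-getmerge', '-nl',
     '/user/hadoop/folder_name/',                    # HDFS source directory
     '/Desktop/test_files/finalfile.csv'],           # merged local output file
    check=True,
)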

How to read a csv file from an FTP using PySpark in Databricks Community

I am trying to fetch a file over FTP (hosted on Hostinger) using PySpark in Databricks Community Edition.
Everything works fine until I try to read that file using spark.read.csv('MyFile.csv').
Following is the code and the error:
PySpark Code:
res=spark.sparkContext.addFile('ftp://USERNAME:PASSWORD#URL/<folder_name>/MyFile.csv')
myfile=SparkFiles.get("MyFile.csv")
spark.read.csv(path=myfile) # Errors out here
print(myfile, sc.getRootDirectory()) # Outputs almost same path (except dbfs://)
Error:
AnalysisException: Path does not exist: dbfs:/local_disk0/spark-ce4b313e-00cf-4f52-80d8-dff98fc3eea5/userFiles-90cd99b4-00df-4d59-8cc2-2e18050d395/MyFile.csv
Because spark.sparkContext.addFile downloads the file to the driver's local disk, while Databricks uses DBFS as its default storage, spark.read.csv looks for the path on DBFS and fails. Please try the code below to see if it fixes your issue:
from pyspark import SparkFiles

res=spark.sparkContext.addFile('ftp://USERNAME:PASSWORD#URL/<folder_name>/MyFile.csv')
myfile=SparkFiles.get("MyFile.csv")
# Prefixing with 'file:' makes Spark read from the driver's local filesystem instead of DBFS
spark.read.csv(path='file:'+myfile)
print(myfile, sc.getRootDirectory())

Load/access/mount directory to aws sagemaker from S3

I am a newbie to AWS S3/SageMaker. I am struggling to access my data (meaning folders/directories, not any specific file or files) from an S3 bucket in a SageMaker Jupyter notebook.
Say, my URI is:
s3://data/sub/dir/, where dir may contain multiple directories with files. I need to access the directory (e.g., dir) in such a way that I can access any subdirectories/files from it. I tried-
!aws s3 cp s3://data/sub/dir tempdata --recursive but it did not work; I get an error like-
fatal error: An error occurred (404) when calling the HeadObject operation: Key "sub/dir" does not exist.
Please advise: how can I access the directories from S3 buckets in my AWS SageMaker JupyterLab?
Or how can I mount S3 buckets to SageMaker? I also tried this link and installed it with no errors, but s3fs won't show up when I run df -h, so that did not work either. Thanks in advance.
Your cp syntax is correct.
S3 Sync could be an alternative way to get the same result, and the error response, if you got something wrong, could be more informative: !aws s3 sync s3://data/sub/dir tempdata
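If you want to pull the data from Python inside the notebook instead of shelling out, here is a minimal boto3 sketch (the bucket name and prefix are taken from the question's URI and may need adjusting):
# Sketch: download everything under an S3 prefix into ./tempdata.
import os
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("data")   # bucket name from the question's URI
prefix = "sub/dir/"          # "directory" to download

for obj in bucket.objects.filter(Prefix=prefix):
    relative = obj.key[len(prefix):]
    if not relative or relative.endswith("/"):
        continue             # skip the prefix itself and "folder" placeholder keys
    target = os.path.join("tempdata", relative)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    bucket.download_file(obj.key, target)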

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using the SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application with Spark Streaming does not work. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So the method expects the path of a directory as its parameter, and I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name, not the file name. Below is a sample command. Here, I am passing the directory name as the first parameter through the spark-submit command; you can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt
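For reference, here is a minimal PySpark sketch of the same pattern (the monitored path is the one from the question); start it first, then move files into the directory:
# Sketch: monitor a local directory with textFileStream and print batch counts.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DirectoryStream")
ssc = StreamingContext(sc, batchDuration=5)

# Pass the directory, not a single file; only files that appear after the
# stream starts (e.g. moved in with mv) will be picked up.
lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
lines.count().pprint()

ssc.start()
ssc.awaitTermination()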

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at a loss here (Spark newbie). I spun up an EC2 cluster and submitted a Spark job that saves a text file in its last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the python file I'm submitting is /root. I cannot find the directory called september_2015, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The ec2 address is the master node that I'm ssh'ing into, but I don't have a folder /user/root.
It seems like Spark is creating the september_2015 directory somewhere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist on the master node's filesystem?
You're not saving it in the local file system; you're saving it in the HDFS cluster. Try eph*-hdfs/bin/hadoop fs -ls /, and you should see your file. See eph*-hdfs/bin/hadoop help for more commands, e.g. -copyToLocal.
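If you then want the output on the master node's local disk, here is a sketch that pulls the HDFS directory down (the paths come from the question and the error message; on an EC2 cluster the hadoop binary may live under eph*-hdfs/bin/):
# Sketch: copy the HDFS output directory to the master node's local filesystem.
import subprocess

subprocess.run(
    ['hadoop', 'fs', '-copyToLocal',
     '/user/root/september_2015',   # HDFS path from the error message
     '/root/september_2015'],       # local destination on the master node
    check=True,
)

# Or merge all part files into a single local file instead:
subprocess.run(
    ['hadoop', 'fs', '-getmerge',
     '/user/root/september_2015', '/root/september_2015.txt'],
    check=True,
)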
