ParseExceptions when using HQL file on HDInsight - azure

I'm following this tutorial http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-hive/ but have become stuck when changing the source of the query to use a file.
It all works happily when using New-AzureHDInsightHiveJobDefinition -Query $queryString but when I try New-AzureHDInsightHiveJobDefinition -File "/example.hql" with example.hql stored in the "root" of the blob container I get ExitCode 40000 and the following in standarderror:
Logging initialized using configuration in file:/C:/apps/dist/hive-0.11.0.1.3.7.1-01293/conf/hive-log4j.properties
FAILED: ParseException line 1:0 character 'Ã?' not supported here
line 1:1 character '»' not supported here
line 1:2 character '¿' not supported here
Even when I deliberately misspell the hql filename the above error is still generated along with the expected file not found error so it's not the content of the hql that's causing the error.
I have not been able to find the hive-log4j.properties in the blob store to see if it's corrupt. I have also torn down the HDInsight cluster, deleted the associated blob store and started again, but ended up with the same result.
Would really appreciate some help!

I can induce a similar error by putting a UTF-8 or Unicode encoded .hql file into blob storage and attempting to run it. The three unsupported characters reported at line 1:0 through 1:2 are consistent with a byte order mark at the very start of the file. Try saving your example.hql file as 'ANSI' in Notepad (Open, then Save As; the encoding option is at the bottom of the dialog), then copy it to blob storage and try again.
If the file is not found on Start-AzureHDInsightJob, that cmdlet errors out and does not return a new AzureHDInsightJob object. If you had a previous instance of the result saved, the subsequent Wait-AzureHDInsightJob and Get-AzureHDInsightJobOutput calls would refer to the previous run, giving the illusion of the same error in the not-found case. The ParseException itself definitely indicates a problem reading a UTF-8 or Unicode file when one is not expected.
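As an alternative to re-saving in Notepad, the byte order mark can be removed programmatically before the file is uploaded. The following is a minimal sketch, not part of the original answer: the function name to_bomless_utf8 is hypothetical, and it assumes the file is UTF-8 or UTF-16 (UTF-32 is not handled). For a query that contains only ASCII characters, the result is byte-for-byte what Notepad's 'ANSI' option would produce.

```python
import codecs

# Byte order marks this sketch recognises, paired with the codec used to
# decode the bytes that follow the mark. UTF-32 is deliberately left out.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def to_bomless_utf8(raw: bytes) -> bytes:
    """Return the content as UTF-8 without a leading byte order mark."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Drop the mark, decode the rest, and re-encode as plain UTF-8.
            return raw[len(bom):].decode(encoding).encode("utf-8")
    # No recognised mark: leave the bytes untouched.
    return raw
```

To use it, read example.hql in binary mode, pass the bytes through to_bomless_utf8, write the result back in binary mode, and then copy the file to blob storage as before.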

Related

Receiving Unterminated CSV Quoted Field Error while inserting into Postgres using Copy

I am trying to insert CSV files into Postgres using the COPY command, and I sometimes receive an "Unterminated CSV Quoted Field" error (out of 100 CSV files, I get this error for 2-5 files).
I usually identify the failing file, open it in Microsoft Excel and simply save it without making any changes. When I then copy the same file into Postgres again, it works and the data gets inserted. Can anyone please explain how simply opening and saving the file in Excel can resolve this error?
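The error message means that a quoted field was opened and never closed. A rough way to find the affected files before loading them is to count quote characters. This is only a diagnostic sketch under stated assumptions: the function name has_unbalanced_quotes is hypothetical, and it assumes the default CSV convention in which the quote character is the double quote and an embedded quote is written by doubling it. Under that convention a well-formed file contains an even number of quote characters, so an odd count points to an unterminated field. An even count does not prove the file is well formed.

```python
def has_unbalanced_quotes(text: str, quote: str = '"') -> bool:
    """Return True when the text contains an odd number of quote characters.

    With doubled-quote escaping, every quoted field contributes an even
    number of quote characters, so an odd total means at least one quoted
    field is never closed.
    """
    return text.count(quote) % 2 == 1
```

Running this over the contents of each of the 100 files would narrow the search to the few that need a closer look.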

Snowpipe doesn't load the files after error has been rectified

I am using Snowpipe to load files from an S3 bucket. It worked well for 2 files.
To check how Snowpipe behaves when an error occurs during file loading, I then intentionally changed the file format (I changed the delimiter to '|' although the file is a CSV) so that the COPY command would not work, and uploaded a 3rd CSV file to S3. As expected, it was not loaded because of the file format error.
Later I recreated the file format with the correct delimiter, i.e. ',', but since the notification had already been sent for the 3rd file, it was not loaded into the table. I then uploaded a 4th CSV file and it was loaded successfully. So my question is: how do I get the 3rd file loaded, given that its event notification was generated while the file format was wrong?
Let me know if any more details are required.

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using databricks to find points in polygons, as laid out in the databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why were we previously having to use "/dbfs/" when reading in GeoJSON but not csv files, pre-changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
GeoPandas, like pandas, uses the local file API to access files, and the /dbfs path is what exposes DBFS through that local file API. In your specific case, the problem is that dbutils.fs.cp was not told to copy the file to the local disk, so by default it copied the file onto DBFS with the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights). As a result, the local file API doesn't see it at /dbfs/tmp/temp_nights: you would need to read /dbfs/dbfs/tmp/temp_nights instead, or copy the file to /tmp/temp_nights.
The better way is to copy the file to the local disk. You just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
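The path mapping described in this answer can be written down as a small pure-Python helper. This is an illustrative sketch only: local_view_of is a hypothetical name, not a Databricks API, and it encodes just the behaviour stated above (a destination without a scheme lands on DBFS, DBFS is visible to local file APIs under /dbfs, and file: destinations are on the local disk).

```python
def local_view_of(path: str) -> str:
    """Map a dbutils.fs destination to the path a local file API would use.

    Assumes only three forms are passed in: "file:..." (local disk),
    "dbfs:..." (explicit DBFS) and a bare path (treated as DBFS).
    """
    if path.startswith("file:"):
        # Already local: drop the scheme and normalise the leading slashes.
        return "/" + path[len("file:"):].lstrip("/")
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if ":" in path:
        # Other schemes (for example wasbs) have no local view in this sketch.
        raise ValueError(f"unsupported scheme in {path!r}")
    # DBFS content is exposed to local file APIs under the /dbfs prefix.
    return "/dbfs/" + path.lstrip("/")
```

For the destination used in the question, local_view_of("/dbfs/tmp/temp_nights") gives "/dbfs/dbfs/tmp/temp_nights", which is why reading "/dbfs/tmp/temp_nights" fails, while local_view_of("file:///tmp/temp_nights") gives "/tmp/temp_nights".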

saving an image to bytes and uploading to boto3 returning content-MD5 mismatch

I'm trying to pull an image from s3, quantize it/manipulate it, and then store it back into s3 without saving anything to disk (entirely in-memory). I was able to do it once, but upon returning to the code and trying it again it did not work. The code is as follows:
import boto3
import io
from PIL import Image
client = boto3.client('s3',aws_access_key_id='',
aws_secret_access_key='')
cur_image = client.get_object(Bucket='mybucket',Key='2016-03-19 19.15.40.jpg')['Body'].read()
loaded_image = Image.open(io.BytesIO(cur_image))
quantized_image = loaded_image.quantize(colors=50)
saved_quantized_image = io.BytesIO()
quantized_image.save(saved_quantized_image,'PNG')
client.put_object(ACL='public-read',Body=saved_quantized_image,Key='testimage.png',Bucket='mybucket')
The error I received is:
botocore.exceptions.ClientError: An error occurred (BadDigest) when calling the PutObject operation: The Content-MD5 you specified did not match what we received.
It works fine if I just pull an image, and then put it right back without manipulating it. I'm not quite sure what's going on here.
I had this same problem, and the solution was to seek to the beginning of the saved in-memory file:
out_img = BytesIO()
image.save(out_img, img_type)
out_img.seek(0) # Without this line it fails
self.bucket.put_object(Bucket=self.bucket_name,
                       Key=key,
                       Body=out_img)
The file may need to be saved and reloaded before you send it off to S3. The file pointer also needs to be at position 0.
My problem was sending a file after having read the first few bytes of it. Opening the file cleanly did the trick.
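The file-pointer behaviour behind these answers can be seen with the standard library alone. The sketch below uses a made-up byte string in place of real image data and involves neither boto3 nor PIL. It only shows that, after writing, an in-memory buffer's position is at the end, so a reader that starts from the current position gets nothing until seek(0) is called.

```python
import io

payload = b"stand-in for encoded image bytes"  # hypothetical data

buf = io.BytesIO()
buf.write(payload)      # like image.save(buf, ...): the position is now at the end

at_end = buf.read()     # reads from the current position, so nothing is left
buf.seek(0)             # rewind to the start of the buffer
from_start = buf.read() # now the full content is returned

print(len(at_end), len(from_start))
```

The same applies to a file object from which a few bytes have already been read: whatever consumes it next starts at the current position, not at the beginning.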
I found this question getting the same error trying to upload files -- two scripts clashed, one creating, the other uploading. My answer was to create using ".filename" then:
os.rename(".filename", "filename")
The upload script then needs to ignore . files. This ensured the file was done being created.
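A minimal sketch of that write-then-rename pattern follows. The function names write_then_publish and uploadable are hypothetical, and os.replace is used in place of os.rename because it also overwrites an existing destination on every platform. The rename is atomic only when both paths are on the same filesystem, which holds here because both are in the same directory.

```python
import os


def write_then_publish(directory: str, name: str, data: bytes) -> str:
    """Write data under a dot-prefixed name, then rename it into place."""
    hidden = os.path.join(directory, "." + name)
    final = os.path.join(directory, name)
    with open(hidden, "wb") as fh:
        fh.write(data)
    # The final name only appears once the content is completely written.
    os.replace(hidden, final)
    return final


def uploadable(directory: str) -> list:
    """List the files an upload script should consider, skipping dot files."""
    return sorted(n for n in os.listdir(directory) if not n.startswith("."))
```

The creating script calls write_then_publish, and the upload script iterates over uploadable(directory), so it never picks up a file that is still being written.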
To anyone else facing similar errors: this usually happens when the content of the file is modified during the upload, possibly because another process or thread is writing to it.
A classic example is two scripts modifying the same file at the same time, which triggers the bad digest error because the content no longer matches the MD5 that was computed for it. In the example below, the data file is uploaded to S3; if another process overwrites it while the upload is in progress, you will end up with this exception:
random_uuid=$(uuidgen)
cat data
aws s3api put-object --acl bucket-owner-full-control --bucket $s3_bucket --key $random_uuid --body data

How to resolve Vsam File status error code 93?

When I try to access a VSAM sequential dataset (which is also open in CICS) from batch, I use EXTEND mode to open the file and append some data to it.
Earlier it was working fine. All of a sudden it has stopped working, and I am getting file status 93, which means "Resource not available".
OPEN EXTEND <filename>
For KSDS datasets I have used EXCI (External CICS Interface) calls to access them from batch even though they were open online.
But I do not know how to do the same for ESDS.
Could someone help me resolve this error?
