Reading parquet files with Spark using wildcards - python-3.x

I have many parquet files in an S3 directory. The directory structure varies based on vid, something like this:
bucketname/vid=123/year=2020/month=9/date=12/hf1hfw2he.parquet
bucketname/vid=456/year=2020/month=8/date=13/34jbj.parquet
bucketname/vid=876/year=2020/month=9/date=15/ghg76.parquet
I have a list which contains all the vids, something like this:
vid_list = ['123','456','876']
How can I read all the files for month=9 at once without running into performance issues?
current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3a://bucketname' + 'vid={}/year=2020/month={}/*/*.parquet'.format(*vid_list,current_month))
This gives me the error Path does not exist: file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;. Is there any way to achieve this efficiently?

Try the following code:
vid_list = '(' + '|'.join(['123','456','876']) + ')'
current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3://bucketname/' + 'vid={}/year=2020/month={}/*/*.parquet'.format(vid_list,current_month))
# The resulting path: s3://bucketname/vid=(123|456|876)/year=2020/month=9/*/*.parquet
The error in your code: the month value is 456 when it should be 9, because unpacking *vid_list in .format() fills the two placeholders with '123' and '456':
file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;
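If the alternation pattern gives you trouble with S3 globbing, another option (just a sketch, not part of the original answer) is to build one concrete path per vid and pass them all to a single parquet() call, which accepts multiple paths:
current_month = 9
vid_list = ['123', '456', '876']
# One concrete path per vid; DataFrameReader.parquet accepts several paths at once.
paths = ['s3a://bucketname/vid={}/year=2020/month={}/*/*.parquet'.format(vid, current_month)
         for vid in vid_list]
temp_df = sqlContext.read.option("mergeSchema", "false").parquet(*paths)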

Related

Python Glob - Get Full Filenames, but no directory-only names

This code works, but it's returning directory names and filenames. I haven't found a parameter that tells it to return only files or only directories.
Can glob.glob do this, or do I have to call os.something to test whether I have a directory or a file? In my case my files all end with .csv, but I would like to know the general answer as well.
In the loop I'm reading each file, so it currently bombs when it tries to open a directory name as a filename.
files = sorted(glob.glob(input_watch_directory + "/**", recursive=True))
for loop_full_filename in files:
    print(loop_full_filename)
Results:
c:\Demo\WatchDir\
c:\Demo\WatchDir\2202
c:\Demo\WatchDir\2202\07
c:\Demo\WatchDir\2202\07\01
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_51.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_52.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_53.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_54.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_55.csv
c:\Demo\WatchDir\2202\07\05
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_00.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_01.csv
Results needed:
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_51.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_52.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_53.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_54.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_55.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_00.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_01.csv
For this specific program I can just check whether the file name contains .csv, but I would like to know the general answer for future reference.
Replace the line:
files = sorted(glob.glob(input_watch_directory + "/**", recursive=True))
with:
files = sorted(glob.glob(input_watch_directory + "/**/*.*", recursive=True))
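Note that the /**/*.* pattern only matches names containing a dot. For the general case the question asks about, a sketch (reusing the input_watch_directory variable from the question) is to keep the broad glob and drop directories with os.path.isfile:
import glob
import os

# Recursive glob as before, then keep only regular files (removes the directory entries).
all_entries = sorted(glob.glob(input_watch_directory + "/**", recursive=True))
files = [entry for entry in all_entries if os.path.isfile(entry)]
for loop_full_filename in files:
    print(loop_full_filename)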

How to write .dsv file into dynamic folders in U-SQL script

I am new to U-SQL and am trying to read multiple .dsv files from a folder and write the output as .dsv files into dynamic folders/subfolders. So far I have successfully read multiple .dsv files and can write them to a single, pre-defined location, but I am not able to write the .dsv files into multiple folders/subfolders. The folders and subfolders are defined based on the file name. As an example, if the file name is ABC_20200421#015814.dsv, the script should read the data from the .dsv file and write it into the folder structure test/ABC/2020/04/21/File_Data.dsv.
The code I have so far is:
DECLARE #ImportFile string = "*.dsv";
// To read all the .dsv files and move them to folder locations
DECLARE #Code string = #ImportFile.Substring(0, 3);
DECLARE #CommanName string = #ImportFile.Substring(4, 14);
DECLARE #Year string = #ImportFile.Substring(18, 4);
DECLARE #Month string = #ImportFile.Substring(22, 2);
DECLARE #File_Date string = #ImportFile.Substring(24, 2);
#result =
    SELECT col1, col2, col3
    FROM Table1;
// Writing to the dsv file:
OUTPUT #result
TO "test/" + #Code + "/" + #Year + "/" + #Month + "/" + #File_Date + "/File_Data.dsv"
USING Outputters.Text(delimiter : '|', quoting : false);
On running the code, I get an error:
E_CSC_USER_EXPRESSIONNOTCONSTANTFOLDABLE: Expression cannot be constant folded.
Details:
at token '"/test/"', near the ###:
OUTPUT #result TO ### "/test/"+#Code+"/"+#Year+"/"+#Month+"/"+#File_Date+"/STAY_REVENUES.dsv" USING Outputters.Text(delimiter : '|', quoting : false)
Any help would be appreciated. Thanks in advance.

How to concatenate variables plus os.enviorn path plus yaml in Python

1) TOPDIR = /../../
2) /ABC/path/
3) name = os.os.environ['ABC'].lower()
4) .yaml
I want to concatenate all four things into one path.
I tried the solution below:
[TOPDIR + "/ABC/path/" + name .yaml], appending this in a generator
It's not working. Can anyone give a solution for this?
import os

TOPDIR = '/../../'
name = os.environ['ABC'].lower()
print([TOPDIR + "/ABC/path/" + name + '.yaml'])
This should work.
What's the error it actually throws?
Perhaps you have a typo: "os.enviorn" should be "os.environ".
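As a small sketch (assuming the TOPDIR value and the ABC environment variable from the question), os.path.join keeps the separators straight when building the final path:
import os

TOPDIR = '/../../'                  # from the question
name = os.environ['ABC'].lower()    # requires the ABC environment variable to be set
# os.path.join handles the slashes; the extension is appended to the name.
config_path = os.path.join(TOPDIR, 'ABC', 'path', name + '.yaml')
print(config_path)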

How do you process many files from a blob storage with long paths in databricks?

I've enabled logging for an API Management service and the logs are being stored in a storage account. Now I'm trying to process them in an Azure Databricks workspace but I'm struggling with accessing the files.
The issue seems to be that the automatically generated virtual folder structure looks like this:
/insights-logs-gatewaylogs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json
I've mounted the insights-logs-gatewaylogs container under /mnt/diags, and dbutils.fs.ls('/mnt/diags') correctly lists the resourceId= folder, but dbutils.fs.ls('/mnt/diags/resourceId=') claims the file is not found.
If I create empty marker blobs along the virtual folder structure I can list each subsequent level but that strategy obviously falls down since the last part of the path is dynamically organized by year/month/day/hour.
For example, a
spark.read.format('json').load("dbfs:/mnt/diags/logs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json")
yields this error:
java.io.FileNotFoundException: File/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=2019 does not exist.
So clearly the wild-card has found the first year folder but is refusing to go further down.
I set up a copy job in Azure Data Factory that successfully copies all the JSON blobs within the same blob storage account and removes the resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service> prefix (so the root folder starts with the year component); that copy can be accessed all the way down without having to create empty marker blobs.
So the problem seems to be related to the long virtual folder structure, which is mostly empty.
Is there another way to process this kind of folder structure in Databricks?
Update: I've also tried providing the path as part of the source when mounting, but that doesn't help either.
I think I may have found the root cause of this. I should have tried this earlier: I provided the exact path to an existing blob like this:
spark.read.format('json').load("dbfs:/mnt/diags/logs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=2019/m=08/d=20/h=06/m=00/PT1H.json")
And I got a more meaningful error back:
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
It turns out the out-of-the-box logging creates append blobs (and there doesn't seem to be a way to change this), and support for append blobs is still WIP by the looks of this ticket: https://issues.apache.org/jira/browse/HADOOP-13475
The FileNotFoundException could be a red herring, possibly caused by the inner exception being swallowed while trying to expand the wildcards and hitting an unsupported blob type.
Update
Finally found a reasonable work-around. I installed the azure-storage Python package in my workspace (if you're at home with Scala it's already installed) and did the blob loading myself. Most of the code below is there to add globbing support; you don't need it if you're happy to just match on a path prefix:
%python
import re
import json
from azure.storage.blob import AppendBlobService
abs = AppendBlobService(account_name='<account>', account_key="<access_key>")
base_path = 'resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>'
pattern = base_path + '/*/*/*/*/m=00/*.json'
filter = glob2re(pattern)
spark.sparkContext \
    .parallelize([blob.name for blob in abs.list_blobs('insights-logs-gatewaylogs', prefix=base_path) if re.match(filter, blob.name)]) \
    .map(lambda blob_name: abs.get_blob_to_bytes('insights-logs-gatewaylogs', blob_name).content.decode('utf-8').splitlines()) \
    .flatMap(lambda lines: [json.loads(l) for l in lines]) \
    .collect()
glob2re is courtesy of https://stackoverflow.com/a/29820981/220986:
def glob2re(pat):
    """Translate a shell PATTERN to a regular expression.

    There is no way to quote meta-characters.
    """
    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i + 1
        if c == '*':
            # res = res + '.*'
            res = res + '[^/]*'
        elif c == '?':
            # res = res + '.'
            res = res + '[^/]'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j + 1
            if j < n and pat[j] == ']':
                j = j + 1
            while j < n and pat[j] != ']':
                j = j + 1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\', '\\\\')
                i = j + 1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    return res + r'\Z(?ms)'
Not pretty but avoids the copying around of data and can be wrapped up in a little utility class.
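If you'd rather end up with a DataFrame than collect everything on the driver, here is a small follow-up sketch (reusing abs, base_path and filter from the snippet above) that keeps the JSON lines in an RDD and hands them to spark.read.json, which also accepts an RDD of JSON strings:
# Reuses abs, base_path and filter from the snippet above.
lines_rdd = spark.sparkContext \
    .parallelize([blob.name
                  for blob in abs.list_blobs('insights-logs-gatewaylogs', prefix=base_path)
                  if re.match(filter, blob.name)]) \
    .flatMap(lambda blob_name: abs.get_blob_to_bytes('insights-logs-gatewaylogs', blob_name)
                                  .content.decode('utf-8').splitlines())
df = spark.read.json(lines_rdd)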
Try reading directly from the blob, not through the mount.
You need to set up either an access key or SAS for this, but I assume you know that.
SAS
spark.conf.set(
    "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
    "<complete-query-string-of-sas-for-the-container>")
or Access key
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<storage-account-access-key>")
then
val df = spark.read.json("wasbs://<container>@<account-name>.blob.core.windows.net/<path>")
For now this operation is not supported. It is crap that Microsoft provides technologies that do not work with each other (for example BYOML -> Log Analytics with an export rule to a storage account, and then reading the data from Databricks).
There is a workaround for that. You can create your own custom class and read it. Please take a look at the example of reading am-securityevent data on the BYOML git:
https://github.com/Azure/Azure-Sentinel-BYOML

Corrupted Excel File & 7zip

I have a problem with a corrupted Excel file. So far I have used 7zip to open it as an archive and extract most of the data, but some important sheets cannot be extracted.
Using the l command of 7zip I get the following output:
7z.exe l -slt "C:\Users\corrupted1.xlsm" xl/worksheets/sheet3.xml
Output:
Listing archive: C:\Users\corrupted1.xlsm
--
Path = C:\Users\corrupted1.xlsm
Type = zip
Physical Size = 11931916
----------
Path = xl\worksheets\sheet3.xml
Folder = -
Size = 57217
Packed Size = 12375
Modified = 1980-01-01 00:00:00
Created =
Accessed =
Attributes = .....
Encrypted = -
Comment =
CRC = 553C3C52
Method = Deflate
Host OS = FAT
Version = 20
However, when trying to extract it (or test it, for that matter) I get:
7z.exe t -slt "C:\Users\corrupted1.xlsm" xl/worksheets/sheet3.xml
Output:
Processing archive: C:\Users\corrupted1.xlsm
Testing xl\worksheets\sheet3.xml Unsupported Method
Sub items Errors: 1
The method listed above says Deflate, which is the same for all the worksheets.
Is there anything I can do? What kind of corruption is this? Is it the CRC? Can I ignore it somehow or something?
Please help!
Edit:
The following is the error when trying to extract or edit the xml file through 7zip:
Edit 2:
I tried with WinZip as well, getting:
Extracting to "C:\Users\axpavl\AppData\Local\Temp\wzf0b9\"
Use Path: yes Overlay Files: yes
Extracting xl\worksheets\sheet2.xml
Unable to find the local header for xl\worksheets\sheet2.xml.
Severe Error: Cannot find a local header.
This might help:
https://superuser.com/questions/145479/excel-edit-the-xml-inside-an-xlsx-file
and this one too: http://www.techrepublic.com/blog/tr-dojo/recover-data-from-a-damaged-office-file-with-the-help-of-7-zip/
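If the archive tools keep failing on that one entry, another thing worth trying (a sketch only, using the file path from the question) is Python's zipfile module, which sometimes reports a more specific error and lets you salvage whichever entries are still readable:
import zipfile

path = r"C:\Users\corrupted1.xlsm"  # path from the question

with zipfile.ZipFile(path) as archive:
    for info in archive.infolist():
        try:
            data = archive.read(info.filename)
            # Write out whatever still extracts cleanly, flattening the entry path.
            with open(info.filename.replace("/", "_"), "wb") as out:
                out.write(data)
        except (zipfile.BadZipFile, NotImplementedError) as exc:
            print("Could not extract", info.filename, "-", exc)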
