YugabyteDB YCQL unable to source using COPY command - yugabytedb

Unable to source a large CSV file containing a blob column in YCQL. The import fails with:
global name 'BlobType' is not defined, given up without retries

Related

passing the parameters from adf to notebook

I'm ingesting data with API calls and would like to use widgets to parameterize the notebook. In Azure I have the following setup:
I have a list of attribute_codes, which I read with a Lookup activity and pass as parameters into the Databricks notebook. Code inside the Databricks notebook:
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Removing the folder in Data Lake
dbutils.fs.rm(f'/mnt/bronze/attribute_code/{day}',True)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
My problem is that, since the notebook runs inside a ForEach loop, every time a new parameter is passed it deletes the entire folder along with the previously loaded data. Someone might suggest removing the lines where I drop and recreate the daily folder, but the pipeline should run multiple times a day, and I need to drop the data previously loaded that day and load the new data.
My goal is to iterate over the entire list of attribute_code values and load them all into one folder, with files named "data_{count}.json".
Instead of using dbutils.fs.rm in your notebook, you can use a Delete activity before the ForEach activity to get the desired result.
With dbutils.fs.rm, the folder is deleted each time the notebook is triggered inside the ForEach loop, which also deletes the previously created files.
With a Delete activity placed before the ForEach loop, the folder is deleted only once per pipeline run (and only if it exists), so you can load the data as required.
For the path, I used the following dynamic content:
attribute/@{formatDateTime(utcNow(),'yyyy-MM-dd')}
And using the following code in my databricks notebook:
#I used similar code
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
Let's say the Lookup activity returns a list of attribute codes.
When I run the pipeline, it runs successfully, and only the data from the latest lookup is loaded.
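One thing to watch in the notebook code as shown: count is always 0, so every ForEach iteration writes data_0.json and later iterations would overwrite earlier ones. A minimal sketch of one way around that, assuming the attribute codes are unique, is to derive the file name from attribute_code. The helper name build_output_path is hypothetical, not part of the original notebook.

```python
def build_output_path(day: str, attribute_code: str) -> str:
    """Return a per-attribute file path inside the daily folder.

    The folder layout matches the notebook above; using attribute_code in the
    file name (instead of a constant count) is an assumption of this sketch.
    """
    # Replace path separators so a code can never escape the daily folder.
    safe_code = attribute_code.replace("/", "_")
    return f"/mnt/bronze/attribute_code/{day}/data_{safe_code}.json"


# In the notebook the call would look like this (dbutils exists only on Databricks):
# dbutils.fs.put(build_output_path(day, attribute_code), response.text, True)
```

With the Delete activity clearing the folder once per run, each iteration then adds its own file instead of replacing a shared one.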

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using databricks to find points in polygons, as laid out in the databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the recent changes removed the mount point, and we now read files using a path such as:
"wasbs://Z@Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z@Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why were we previously having to use "/dbfs/" when reading in GeoJSON but not csv files, pre-changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
GeoPandas, like pandas, uses the local file API for accessing files, and you previously accessed files on DBFS via /dbfs, which exposes DBFS through that local file API. In your specific case, the problem is that dbutils.fs.cp was not told to copy the file to the local disk, so by default the file was copied onto DBFS at the path /dbfs/tmp/temp_nights (that is, dbfs:/dbfs/tmp/temp_nights). As a result, the local file API doesn't see it at /dbfs/tmp/temp_nights: you would need to read /dbfs/dbfs/tmp/temp_nights instead, or copy the file to the DBFS path /tmp/temp_nights.
The better way is to copy the file to the local disk. You just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z@Y.blob.core.windows.net/...Nights_new.geojson", "file:///tmp/temp_nights")
and then read file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
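To make the path mapping described above concrete, here is a small sketch of how a path given to dbutils.fs corresponds to the path that local file APIs (such as gpd.read_file) have to use. The function name dbfs_to_local_path is hypothetical; it only encodes the rules stated in this answer and does not call Databricks.

```python
def dbfs_to_local_path(path: str) -> str:
    """Map a path as dbutils.fs sees it to the path the local file API sees."""
    if path.startswith("file:"):
        # Already on the driver's local disk: strip the scheme.
        return "/" + path[len("file:"):].lstrip("/")
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    # A scheme-less path given to dbutils.fs is treated as a DBFS path,
    # and DBFS is exposed to local file APIs under /dbfs.
    return "/dbfs/" + path.lstrip("/")


print(dbfs_to_local_path("/dbfs/tmp/temp_nights"))    # /dbfs/dbfs/tmp/temp_nights
print(dbfs_to_local_path("dbfs:/tmp/temp_nights"))    # /dbfs/tmp/temp_nights
print(dbfs_to_local_path("file:///tmp/temp_nights"))  # /tmp/temp_nights
```

The first line shows why the copy in the question was not found: the destination "/dbfs/tmp/temp_nights" ended up one level deeper than the path passed to gpd.read_file.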

Data Factory Data Flow sink file name

I have a data flow that merges multiple pipe delimited files into one file and stores it in Azure Blob Container. I'm using a file pattern for the output file name concat('myFile' + toString(currentDate('PST')), '.txt').
How can I get the file name that's generated after the data flow completes? I have other activities that log the file name into a database, but I can't figure out how to get the file name.
I tried @{activity('Data flow1').output.filePattern} but it didn't help.
Thank you
You can use a Get Metadata activity after the data flow to get the name of the generated file.

BigQuery howto load data from local file as content

I have a requirement where I will receive file content that I need to load into BigQuery tables. The standard API shows how to load data from a local file, but I don't see any variant of the load method that accepts file content as a string rather than a file path. Any idea how I can achieve this?
As the source code and official documentation show, the load function loads data only from a local file or a Storage File. Allowed formats are:
AVRO,
CSV,
JSON,
ORC,
PARQUET
A load job is created, and it runs your data load asynchronously. If you would like immediate access to your data, insert it using the Table insert function, where you provide the rows to insert into the table:
// Insert a single row
table.insert({
INSTNM: 'Motion Picture Institute of Michigan',
CITY: 'Troy',
STABBR: 'MI'
}, insertHandler);
If you want to load, e.g., a CSV file, you first need to save the data to a CSV file manually in Node.js. Then load it as a single-column CSV using the load() method, which loads the whole string as a single column.
Additionally, I recommend looking at Dataflow templates, e.g. Cloud Storage Text to BigQuery, which read text files stored in Cloud Storage, transform them using a JavaScript user-defined function (UDF), and write the result to BigQuery. Note that the data to load must be stored in Cloud Storage.
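Since the insert function takes rows rather than a file, string content has to be parsed into row objects first. Below is a minimal sketch of that step, written in Python for illustration (the example above is Node.js) and assuming the content is CSV text whose first line is the header. The function name csv_content_to_rows is hypothetical.

```python
import csv
import io
from typing import Dict, List


def csv_content_to_rows(content: str) -> List[Dict[str, str]]:
    """Parse CSV text (first line is the header) into a list of row dicts."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    reader = csv.DictReader(io.StringIO(content, newline=""))
    return [dict(row) for row in reader]


content = "INSTNM,CITY,STABBR\nMotion Picture Institute of Michigan,Troy,MI\n"
rows = csv_content_to_rows(content)
print(rows)
```

The resulting list of dicts has the same shape as the row object passed to table.insert above. All values are strings at this point, so any numeric or date columns would still need converting to match the table schema.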

Azure Storage Explorer: Properties of type '' are not supported error

I inherited a project that uses an Azure table storage database. I'm using Microsoft Azure Storage Explorer as a tool to query and manage data. I'm attempting to migrate data from my Dev database to my QA database. To do this, I'm exporting a CSV from a Dev database table and then trying to import into the QA database table. For a small number of tables, I get the following error when I try to import the CSV:
Failed: Properties of type '' are not supported.
When I ran into this before, since I exported a "typed" CSV from Dev, I checked to make sure all "@type" columns had values. They did. Then I split the CSV (with thousands of records) up into smaller files to try to determine which record was the issue. When I did this and started importing them, I was ultimately able to import all of the records successfully as individual files, which is peculiar. It is almost like a constraint violation issue.
I'm also seeing errors with different types. Eg:
Properties of type 'Double' are not supported.
In this case, there is already a column in the particular table of type "Double".
Anyway, now that I'm seeing it again, I'm having trouble resolving it. Any thoughts?
UPDATE
I was able to track a few of these errors to "bad" data in the CSV: a JSON string in an Edm.String field that, for some reason, the import didn't like. After I minified the JSON using an online tool, it imported fine. There is one data set, though, with over 7,000 records that I'm trying to import (the one I mentioned splitting up earlier in this post). I ended up breaking it into separate files and was able to import them individually. When I try to import the entire file after loading all the data through the individual files, though, I again get an error.
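The minification step mentioned above does not need an online tool. A minimal sketch using only Python's standard library, assuming the field value is valid JSON (the function name minify_json is hypothetical):

```python
import json


def minify_json(text: str) -> str:
    """Re-serialize JSON text without insignificant whitespace or line breaks."""
    # ensure_ascii=False keeps non-ASCII characters as-is instead of escaping them.
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


pretty = '{\n  "name": "widget",\n  "tags": ["a", "b"]\n}'
print(minify_json(pretty))  # {"name":"widget","tags":["a","b"]}
```

Whether the line breaks or something else in the original JSON triggered the import error is not established here; this only reproduces what the online minifier did.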
I split the CSV (with thousands of records) up into smaller files to try to determine which record was the issue. When I did this and started importing them, I was ultimately able to import all of the records successfully by individual files which is peculiar.
Based on your test, the format and data of the source CSV file seem fine. It is difficult to determine why Azure Storage Explorer returns those unexpected errors when importing a large CSV file. You can try upgrading Azure Storage Explorer and check whether the export and import succeed with the latest version.
Besides, you can try to use AzCopy (designed for copying data to and from Microsoft Azure Blob, File, and Table storage using simple commands with optimal performance) to export/import table.
Export table:
AzCopy /Source:https://myaccount.table.core.windows.net/myTable/ /Dest:C:\myfolder\ /SourceKey:key /Manifest:abc.manifest
Import table:
AzCopy /Source:C:\myfolder\ /Dest:https://myaccount.table.core.windows.net/mytable1/ /DestKey:key /Manifest:"abc.manifest" /EntityOperation:InsertOrReplace
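If importing smaller files remains the only approach that works, as described in the question, the splitting can be scripted rather than done by hand. This is a minimal sketch, assuming the exported CSV has a single header row that every chunk must repeat (including any "@type" columns). The names split_csv and rows_per_chunk are hypothetical.

```python
import csv
import io
from typing import Iterator, List


def _to_csv(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a header plus data rows back into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def split_csv(content: str, rows_per_chunk: int) -> Iterator[str]:
    """Yield CSV texts of at most rows_per_chunk data rows, each with the header."""
    reader = csv.reader(io.StringIO(content, newline=""))
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    chunk: List[List[str]] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == rows_per_chunk:
            yield _to_csv(header, chunk)
            chunk = []
    if chunk:
        yield _to_csv(header, chunk)


sample = "PartitionKey,RowKey,Value\np,1,a\np,2,b\np,3,c\n"
for i, part in enumerate(split_csv(sample, 2)):
    print(f"--- chunk {i} ---")
    print(part, end="")
```

Parsing with the csv module rather than splitting on line breaks keeps quoted fields that contain commas or newlines (such as JSON strings) intact within a single record.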
