COPY command for AWS Kinesis Firehose manifest not working in Amazon Redshift

I am using Kinesis Firehose, which loads data into S3 and creates a manifest file. I am copying the data from the manifest into Redshift, but the COPY command throws an error.
My manifest file (mydeliverystream-2018-07-22-08-01-06-bd895f6a-4fad-485):
{
"entries" : [ {
"url" : "s3://testings3/2018/07/22/07/mydeliverystream-3-2018-07-22-07-56-04-89605e0e-26bf-4017-a102-338ceb15481d",
"mandatory" : true
} ]
}
The COPY command I am using for the Firehose data:
COPY redshiftproduct FROM 's3://testings3/mydeliverystream' CREDENTIALS 'aws_iam_role=arn:aws:iam::329723704744:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift' delimiter ',' MANIFEST region 'us-east-1'
Please note: I have also tried adding the delimiter after the MANIFEST keyword.
The error I receive says that the COPY command contains a syntax error.
Data file:
delhi,ac,4000,2011
Haryana,TV,5000,2001
Channai,TV,3000,2011
Mumbai,Laptop,4000,2012
new delhi,ac,5000,2012
Kolkatta,fridge,1000,2012
Kanpur,TV,2000,2013
Haryana,ac,2000,2019
Kanpur,ac,2000,2019
What am I doing wrong and how can I fix it?

1- You need to add FORMAT AS JSON 's3://yourbucketname/aJsonPathFile.txt'. AWS does not make this clear in the documentation. Please note that this only works when your data is in JSON form, like:
{"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"}
2- You also need to verify that the column order in Kinesis Firehose matches the column order in the CSV file, and try adding:
TRUNCATECOLUMNS blanksasnull emptyasnull
3- An example:
COPY testrbl3 ( eventId,serverTime,pageName,action,ip,userAgent,location,plateform,language,campaign,content,source,medium,productID,colorCode,scrolltoppercentage) FROM 's3://bucketname/' CREDENTIALS 'aws_iam_role=arn:aws:iam:::role/' MANIFEST json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull;
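As a sketch only (assuming psycopg2 and placeholder connection details; the cluster endpoint, credentials, IAM role ARN, and manifest key below are not the poster's real values), the manifest COPY can be run and debugged from Python like this:
import psycopg2

copy_sql = """
COPY redshiftproduct
-- FROM must point at the full S3 key of the manifest object (placeholder below)
FROM 's3://testings3/path/to/manifest-file'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-redshift-copy-role'
DELIMITER ','
MANIFEST
REGION 'us-east-1';
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)
conn.autocommit = True  # each statement runs in its own transaction
try:
    with conn.cursor() as cur:
        cur.execute(copy_sql)
except psycopg2.Error as exc:
    print("COPY failed:", exc)
    # If the COPY was accepted but rows failed to load, Redshift records the
    # details in STL_LOAD_ERRORS.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT err_reason, raw_line FROM stl_load_errors "
            "ORDER BY starttime DESC LIMIT 5;"
        )
        for reason, line in cur.fetchall():
            print(reason.strip(), "|", line.strip())
finally:
    conn.close()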

Related

How to format the file path in an MLTable for Azure Machine Learning uploaded during a pipeline job?

How is the path to a .csv file to be expressed in an MLTable file that is created in a local folder but then uploaded as part of a pipeline job?
I'm following the Jupyter notebook automl-forecasting-task-energy-demand-advance from the azureml-examples repo (article and notebook). This example has an MLTable file, shown further below, referencing a .csv file with a relative path. Then in the pipeline the MLTable is uploaded to be accessible to a remote compute (a few things are omitted for brevity):
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)
compute = AmlCompute(
    name=compute_name, size="STANDARD_D2_V2", min_instances=0, max_instances=4
)
forecasting_job = automl.forecasting(
    compute=compute_name,  # name of the compute target we created above
    # name="dpv2-forecasting-job-02",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    # validation_data = my_validation_data_input,
    target_column_name="demand",
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations="auto",
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)
returned_job = ml_client.jobs.create_or_update(
    forecasting_job
)
ml_client.jobs.stream(returned_job.name)
But running this gives the error below.
Error message:
Encountered user error while fetching data from Dataset. Error: UserErrorException:
Message: MLTable yaml schema is invalid:
Error Code: Validation
Validation Error Code: Invalid MLTable
Validation Target: MLTableToDataflow
Error Message: Failed to convert a MLTable to dataflow
uri path is not a valid datastore uri path
| session_id=857bd9a1-097b-4df6-aa1c-8871f89580d8
InnerException None
ErrorResponse
{
"error": {
"code": "UserError",
"message": "MLTable yaml schema is invalid: \nError Code: Validation\nValidation Error Code: Invalid MLTable\nValidation Target: MLTableToDataflow\nError Message: Failed to convert a MLTable to dataflow\nuri path is not a valid datastore uri path\n| session_id=857bd9a1-097b-4df6-aa1c-8871f89580d8"
}
}
The MLTable file looks like this:
paths:
  - file: ./nyc_energy_training_clean.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
  - convert_column_types:
      - columns: demand
        column_type: float
      - columns: precip
        column_type: float
      - columns: temp
        column_type: float
How am I supposed to run this? Thanks in advance!
For a remote path you can use the approach below (see the Azure Machine Learning documentation on creating data assets).
It's important to note that the path specified in the MLTable file must be a valid path in the cloud, not just a valid path on your local machine.
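A rough sketch with the Python SDK v2 (the asset name is a placeholder; the local folder is the one from the notebook):
# Hypothetical sketch: register the local MLTable folder as a data asset so the
# pipeline input resolves to a cloud URI rather than a local path.
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Upload ./data/training-mltable-folder (the MLTable file plus the CSV it
# references) to the workspace datastore and register it as a versioned asset.
training_data_asset = ml_client.data.create_or_update(
    Data(
        name="energy-demand-training",  # placeholder asset name
        path="./data/training-mltable-folder",
        type=AssetTypes.MLTABLE,
    )
)

# The job input now points at the registered asset (an azureml:// URI in the
# cloud) instead of a local folder.
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=training_data_asset.id)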

Split seems to throw an error when I attempt to pass in a deeper reference for an array

I have some simple code:
{ "bustime-response": { "vehicle": [ {"vid": "foo"}] }}
And when I attempt to run the conf file I set up, it crashes. I think it has to do with the split field requirements, but I'm not sure what is going on.
split {
field => "[bustime-response][vehicle]"
}
When removing the split, the system will log the full JSON object, but I'm trying to create separate events for each "vehicle", where each vehicle has a "vid" as the primary key, since there could be multiple vehicles on a route.
Am I missing something here when dealing with JSON? I was looking at the docs for split, and the info for field is:
The field which value is split by the terminator. Can be a multiline message or the ID of an array. Nested arrays are referenced like: "[object_id][array_id]"
The logs from within my Docker container are:
[ERROR] 2020-10-12 19:14:14.450 [Converge PipelineAction::Create<main>] agent - Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:main, :exception=>"LogStash::ConfigurationError", :message=>"Expected one of [ \\t\\r\\n], \"#\", \"input\", \"filter\", \"output\" at line 12, column 1 (byte 270) after ", :backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:32:in `compile_imperative'", "org/logstash/execution/AbstractPipelineExt.java:183:in `initialize'", "org/logstash/execution/JavaBasePipelineExt.java:69:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:44:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:52:in `execute'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:357:in `block in converge_state'"]}
[INFO ] 2020-10-12 19:14:14.554 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}
[INFO ] 2020-10-12 19:14:19.590 [LogStash::Runner] runner - Logstash shut down.
split, like mutate, needs to exist inside the filter block; they are not standalone options.
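For example, a minimal pipeline sketch with split nested inside filter (the input and output sections here are placeholders, not taken from the question):
# Hypothetical pipeline layout; the point is only that split sits inside filter.
input {
  stdin { codec => json }
}
filter {
  split {
    field => "[bustime-response][vehicle]"
  }
}
output {
  stdout { codec => rubydebug }
}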

Python SnowflakeOperator setup snowflake_default

Good day. I cannot find how to do the basic setup for airflow.contrib.operators.snowflake_operator.SnowflakeOperator to connect to Snowflake. snowflake.connector.connect works fine.
When I do it with SnowflakeOperator:
op = snowflake_operator.SnowflakeOperator(sql = "create table test(*****)", task_id = '123')
I get the error:
airflow.exceptions.AirflowException: The conn_id `snowflake_default` isn't defined
I tried to insert a connection row into the backend SQLite DB:
INSERT INTO connection(
conn_id, conn_type, host
, schema, login, password
, port, is_encrypted, is_extra_encrypted
) VALUES (*****)
But after that I get an error:
snowflake.connector.errors.ProgrammingError: 251001: None: Account must be specified.
Passing an account kwarg into the SnowflakeOperator constructor does not help. It seems I cannot pass the account into the DB or into the constructor, but it's required.
Please help me: what data should I insert into the local backend DB to be able to connect via SnowflakeOperator?
Go to Admin -> Connections and update the snowflake_default connection.
Based on the source code (airflow/contrib/hooks/snowflake_hook.py:53), we need to add extras like this:
{
"schema": "schema",
"database": "database",
"account": "account",
"warehouse": "warehouse"
}
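If you prefer to create the connection from code instead of the UI, something like this should also work (a sketch with placeholder credentials; the extras mirror what the hook reads):
# Hypothetical sketch: register snowflake_default programmatically with the
# same extras the Snowflake hook expects.
import json

from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id="snowflake_default",
    conn_type="snowflake",
    host="my_account.snowflakecomputing.com",  # placeholder
    login="my_user",
    password="my_password",
    schema="my_schema",
    extra=json.dumps(
        {
            "schema": "my_schema",
            "database": "my_database",
            "account": "my_account",
            "warehouse": "my_warehouse",
        }
    ),
)

session = settings.Session()
session.add(conn)
session.commit()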
With this context:
$ airflow version
2.2.3
$ pip install snowflake-connector-python==2.4.1
$ pip install apache-airflow-providers-snowflake==2.5.0
You have to specify the Snowflake Account and Snowflake Region twice like this:
airflow connections add 'my_snowflake_db' \
--conn-type 'snowflake' \
--conn-login 'my_user' \
--conn-password 'my_password' \
--conn-port 443 \
--conn-schema 'public' \
--conn-host 'my_account_xyz.my_region_abc.snowflakecomputing.com' \
--conn-extra '{ "account": "my_account_xyz", "warehouse": "my_warehouse", "region": "my_region_abc" }'
Otherwise it doesn't work, throwing the Python exception:
snowflake.connector.errors.ProgrammingError: 251001: 251001: Account must be specified
I think this might be due to the airflow command parameter --conn-host expecting a full domain with a subdomain (the my_account_xyz.my_region_abc part), whereas for Snowflake these values are usually specified as query parameters in a way similar to this template (although I did not check all the combinations of the airflow connections add command and the DAG execution):
"snowflake://{user}:{password}#{account}{region}{cloud}/{database}/{schema}?role={role}&warehouse={warehouse}&timezone={timezone}"
Then a dummy Snowflake DAG like the one below, running SELECT 1;, will find its own way to the Snowflake cloud service and will work:
import datetime
from datetime import timedelta

from airflow.models import DAG
# https://airflow.apache.org/docs/apache-airflow-providers-snowflake/stable/operators/snowflake.html
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

my_dag = DAG(
    "example_snowflake",
    start_date=datetime.datetime.utcnow(),
    default_args={"snowflake_conn_id": "my_snowflake_db"},
    schedule_interval="0 0 1 * *",
    tags=["example"],
    catchup=False,
    dagrun_timeout=timedelta(minutes=10),
)
sf_task_1 = SnowflakeOperator(
    task_id="sf_task_1",
    dag=my_dag,
    sql="SELECT 1;",
)

How to submit PySpark and Python jobs to Livy

I am trying to submit a PySpark job to Livy using the /batches endpoint, but I haven't found any good documentation. Life has been easy so far because we have been submitting Scala-compiled JAR files to Livy and specifying the job with className.
For the JAR file, we use:
data = {
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        "oof",
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
I am unsure how to submit a PySpark job in a similar manner, where the package also has some relative imports...any thoughts?
For reference, the folder structure is below:
bar2
__init__.py
foo2.py
bar3
__init__.py
foo3.py
I would want to then run:
from foo2 import ClassFoo
class_foo = ClassFoo(arg1, arg2)
class_foo.auto_run()
You can try passing pyFiles:
data = {
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    "pyFiles": ["s3://<bucket>/<folder>/foo2.py", "s3://<bucket>/<folder>/foo3.py"],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        "oof",
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
In the above example, the new part is:
"pyFiles": ["s3://<bucket>/<folder>/foo2.py", "s3://<bucket>/<folder>/foo3.py"]
I have tried saving the files on the master node via bootstrapping, but noticed that Livy would send the request randomly to the slave nodes, where the files might not be present.
Also, you may pass the files as a .zip, although I haven't tried it.
You need to submit with file being the main Python executable, and pyFiles being the additional internal libraries that are being used. My advice would be to provision the server with a bootstrap action which copies your own libraries over, and installs the pip-installable libraries on the master and nodes.
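A hedged sketch of that HTTP call (the Livy host, S3 keys, and file names below are placeholders, not taken from the question):
# Hypothetical sketch: submit a PySpark batch to Livy with `file` as the main
# script and `pyFiles` carrying the package modules it imports.
import json
import requests

LIVY_URL = "http://livy-host:8998/batches"  # placeholder Livy endpoint

payload = {
    # Main Python executable that Livy/Spark will run.
    "file": "s3://foo-bucket/main.py",
    # Additional modules/packages the main script imports
    # (individual .py files or a .zip of the package).
    "pyFiles": ["s3://foo-bucket/bar2.zip"],
    "args": ["arg1", "arg2"],
    "conf": {"spark.submit.deployMode": "cluster"},
}

resp = requests.post(
    LIVY_URL,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # returns the batch id and its current state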

paperclip content type for xls and xlsx

I am struggling with the Paperclip content type; I need to upload an xls/xlsx file.
has_attached_file :sheet
validates_attachment_content_type :sheet, content_type: [
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/zip',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'application/vnd.ms-excel',
    'application/xls',
    'application/xlsx',
    'application/octet-stream'
  ],
  message: ' Only EXCEL files are allowed.'
NOTE: The sheet was created from Google Drive.
I tried the content types above, but every time I got the same error.
Output:
Command :: file -b --mime '/var/folders/zy/khy_wsfn7jbd40bsdps7qwqc0000gt/T/5a76e813d6a0a40548b91acc11557bd220160328-13642-1meqjap.xlsx'
(0.2ms) BEGIN
Command :: file -b --mime '/var/folders/zy/khy_wsfn7jbd40bsdps7qwqc0000gt/T/5a76e813d6a0a40548b91acc11557bd220160328-13642-114d8t6.xlsx'
(0.3ms) ROLLBACK
{:sheet_content_type=>[" Only EXCEL files are allowed."], :sheet=>[" Only EXCEL files are allowed."]}
I missed the path. :(
Fixed it by using:
has_attached_file :sheet,
  :path => ":rails_root/public/system/:attachment/:id/:filename"

validates_attachment :sheet, presence: true,
  content_type: {
    content_type: [
      "application/vnd.ms-excel",
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    ]
  },
  message: ' Only EXCEL files are allowed.'
