Include base64 code of an image in a CSV file using NiFi

I have a JSON array response from InvokeHTTP. I am using the flow below to convert some of the JSON info to CSV. One of the JSON fields is an id, which is used to fetch an image that is then converted to base64. I need to add this base64 code to my CSV. I don't understand how to save it in an attribute so that it can be picked up by AttributesToCSV.
Also, I was reading here https://community.cloudera.com/t5/Support-Questions/Nifi-attribute-containing-large-text-value/td-p/190513
that it is not recommended to store large values in attributes due to memory concerns. What would be an optimal approach in this scenario?
JSON response from the first call:
[ {
  "fileNumber" : "1",
  "uuid" : "abc",
  "attachedFiles" : [ {
    "id" : "bkjdbkjdsf",
    "name" : "image1.png"
  }, {
    "id" : "xzcv",
    "name" : "image2.png"
  } ],
  "date" : null
}, {
  "fileNumber" : "2",
  "uuid" : "def",
  "attachedFiles" : [ ],
  "date" : null
} ]
Final CSV (after merge, i.e. the expected output):
Id,File Name,File Data(base64 code)
bkjdbkjdsf,image1.png,iVBORw0KGgo...ji
xzcv,image2.png,ZEStWRGau..74
My approach (will change as per suggestions):
After splitting the JSON response, I use EvaluateJsonPath to get "attachedFiles".
I find the length of the "attachedFiles" array and then decide whether to split further: if there are 2 or more files I split again; if there are 0 I do nothing. In a second EvaluateJsonPath I add the properties Id and File Name and set their values from the JSON using $.id etc. I use the Id to invoke another URL, whose response I encode to Base64.
Current output: a CSV file which needs to be updated with the third column, File Data(base64 code), and its values:
Id,File Name
bkjdbkjdsf,image1.png
xzcv,image2.png

As a variant, use ExecuteGroovyScript:
def ff = session.get()
if (!ff) return
ff.write{ sin, sout ->
    sout.withWriter('UTF-8'){ w ->
        // write the attribute values for the names 'Id' and 'filename', delimited with a comma
        w << ff.attributes.with{ a -> [a.'Id', a.'filename'] }.join(',')
        w << ','  // write a comma
        //sin.withReader('UTF-8'){ r -> w << r }  // write the current content of the file after the last comma
        w << sin.bytes.encodeBase64()  // write the content as a one-line base64 string
        w << '\n'
    }
}
REL_SUCCESS << ff
UPD: I used sin.bytes.encodeBase64() instead of copying the flow file content; it produces a one-line base64 string from the input file. If you use this option, you should remove Base64EncodeContent to prevent double base64 encoding.
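Note that this also addresses the memory concern from the question: the base64 string ends up in the flow file content rather than in an attribute, so only the small 'Id' and 'filename' values need to live in attributes, and the per-file CSV lines can then be combined with the merge step already present in the flow.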

Related

Python - Reading a YAML file with escape characters and escaping them

I have a YAML file with LaTeX strings in its entries, in particular with many unescaped escape signs \. The file could look like this:
content:
- "explanation" : "\text{Explanation 1} "
"formula" : "\exp({{a}}^2) = {{d}}^2 - {{b}}^2"
- "explanation" : "\text{Explanation 2}"
"formula" : "{{b}}^2 = {{d}}^2 - \exp({{a}}^2) "
The desired output form (in Python) looks like this:
config = {
"content" : [
{"explanation" : "\\text{Now} ",
"formula" : "\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"},
{"explanation" : "\\text{With}",
"formula" : "{{a}}^2 = {{d}}^2 + 3 ++ {{b}}^2"}
]
}
where the \ have been escaped, but not the "{" and "}", as you would get when using re.escape(string).
path = "config.yml"
with open(path, "r",encoding = 'latin1') as stream:
config1 = yaml.safe_load(stream)
with open(path, "r",encoding = 'utf-8') as stream:
config2 = yaml.safe_load(stream)
# Codecs
import codecs
with codecs.open(path, "r",encoding='unicode_escape') as stream:
config3 = yaml.safe_load(stream)
with codecs.open(path, "r",encoding='latin1') as stream:
config4 = yaml.safe_load(stream)
with codecs.open(path, 'r', encoding='utf-8') as stream:
config5 = yaml.safe_load(stream)
#
with open(path, "r", encoding = 'utf-8') as stream:
stream = stream.read()
config6 = yaml.safe_load(stream)
with open(path, "r", encoding = 'utf-8') as stream:
config7 = yaml.load(stream,Loader = Loader)
None of these attempts works; e.g. the unicode_escape option still reads in
\x1bxp({{a}}^2) instead of \exp({{a}}^2).
What can I do? (The dictionary entries are later passed to a LaTeX parser, but I can't escape all the \ signs by hand.)
\n, \e and \t are all special characters when double-quoted in YAML, and if you're going to treat them literally you're basically asking the YAML parser to blindly treat double-quoted text as plain text, which means you would have to write your own non-conforming YAML parser.
Instead of writing a parser from the ground up, however, an easier approach would be to customize an existing YAML parser by monkey-patching the method that scans double-quoted text and making it the same as the method that scans plain text. In the case of PyYAML, that can be done with a simple override:
yaml.scanner.Scanner.fetch_double = yaml.scanner.Scanner.fetch_plain
If you want to avoid affecting other parts of the code that may parse YAML normally, you can use unittest.mock.patch as a context manager to patch the fetch_double method temporarily just for the loader call:
import yaml
from unittest.mock import patch

with patch('yaml.scanner.Scanner.fetch_double', yaml.scanner.Scanner.fetch_plain):
    with open('config.yml') as stream:
        config = yaml.safe_load(stream)
With your sample input, config would become:
{
    'content': [
        {'"explanation"': '"\\text{Explanation 1} "',
         '"formula"': '"\\exp({{a}}^2) = {{d}}^2 - {{b}}^2"'},
        {'"explanation"': '"\\text{Explanation 2}"',
         '"formula"': '"{{b}}^2 = {{d}}^2 - \\exp({{a}}^2) "'}
    ]
}
Demo: https://replit.com/#blhsing/WaryDirectWorkplaces
Note that this approach comes with the obvious consequence that you lose all the capabilities of double quotes within the same call. If the configuration file has other double-quoted texts that need proper escaping, those will not be parsed correctly. But if the file contains only the kind of input you posted in your question, this will parse it the way you prefer without having to modify the code that generates such an (improper) YAML file (presumably you're asking this question because you don't have the authorization to modify that code).
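As the output above shows, the patched scanner also leaves the literal double quotes inside the keys and values. A minimal post-processing sketch (my addition, not part of the original answer) that strips them recursively:
def strip_quotes(obj):
    # recursively remove the surrounding literal quotes left by the patch
    if isinstance(obj, dict):
        return {strip_quotes(k): strip_quotes(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [strip_quotes(v) for v in obj]
    if isinstance(obj, str):
        return obj.strip('"')
    return obj

config = strip_quotes(config)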

How can I read and print the values from a text file

I have data in a text file:
M10 M2GBXR100A.PGM 8.00000000 3.0000000 3.00000000 2545.07500000sec 0.0
I am trying to read and print the file's data, but how can I get just the individual values?
I have used:
File file = new File("C:/File/stat_l15.txt")
println file.text
String Name = file.text.substring(0, file.text.indexOf(' '))
With this I am able to retrieve M10, but how can I get M2GBXR100A?
Finally I need the output as:
Name: M10
pg_name: M2GBXR100A.PGM
right: 8.00000000
left: 3.0000000
I am saving these values in a table.
Since your file is delimited by spaces, you can use split():
File file = new File("C:/File/stat_l15.txt")
println file.text
List values = file.text.split(' ')
println "Name: ${values[0]}"
println "pg_name: ${values[1]}"
println "right: ${values[2]}"
println "left : ${values[3]}"

Storing string datasets in HDF5 with unicode

I am trying to store variable string expressions from a file which contains special characters, like ø, æ, and å. Here is my code:
import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"
However, the text is not stored properly; the stored data contains the text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8
Edit:
The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.
With:
import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py does see/interpret the strings as unicode, both writing and reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
   DATASET "text" {
      DATATYPE H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
         (0): "\37777777703\37777777670", "\37777777703\37777777646",
         (2): "\37777777703\37777777645"
      }
      ATTRIBUTE "1" {
         DATATYPE H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE SCALAR
         DATA {
            (0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
         }
      }
   }
}
}
Note that in both cases the datatype is marked UTF-8:
DATATYPE H5T_STRING {
   STRSIZE H5T_VARIABLE;
   STRPAD H5T_STR_NULLTERM;
   CSET H5T_CSET_UTF8;
   CTYPE H5T_C_S1;
}
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
Let h5py (or other reader) worry about interpreting \37777777703\37777777670 as the proper unicode character.
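For the curious (my addition, not part of the original answer): h5dump prints each non-ASCII byte as a sign-extended octal value, so \37777777703\37777777670 is just the byte pair 0xC3 0xB8, which is the UTF-8 encoding of 'ø'. A quick check in Python:
# each octal escape is a sign-extended byte; mask to 8 bits and decode
print(bytes([0o37777777703 & 0xFF, 0o37777777670 & 0xFF]).decode('utf-8'))  # -> ø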
You should try storing your data in UTF-8 format by doing the following:
To encode in utf-8 format (before storing with h5py) do:
u"æ".encode("utf-8")
which returns:
'\xc3\xa6'
Then to decode you could use the string decode like this:
'\xc3\xa6'.decode("utf-8")
which would return:
æ
Hope it helps!
EDIT
When you open files and you want them read as utf-8, you can use the encoding parameter of open():
f = open(fname, encoding="utf-8")
This should ensure the original file is read with the proper encoding.
Source: python-notes
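A minimal round-trip sketch of that advice in the h5py context (my assumption of how it fits together, not code from the original answer; an attribute stored as pre-encoded bytes comes back as bytes and must be decoded):
import h5py as h5

with h5.File('deleteme.hdf5', 'w') as f:
    dset = f.create_dataset("text", (1,), dtype=h5.special_dtype(vlen=str))
    dset.attrs["1"] = "some text with ø, æ, å".encode("utf-8")  # store pre-encoded bytes

with h5.File('deleteme.hdf5', 'r') as f:
    raw = f['text'].attrs["1"]
    # depending on the h5py version this may come back as bytes or str
    print(raw.decode("utf-8") if isinstance(raw, bytes) else raw)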

AvroTypeException: When writing in python3

My avsc file is as follows:
{"type":"record",
"namespace":"testing.avro",
"name":"product",
"aliases":["items","services","plans","deliverables"],
"fields":
[
{"name":"id", "type":"string" ,"aliases":["productid","itemid","item","product"]},
{"name":"brand", "type":"string","doc":"The brand associated", "default":"-1"},
{"name":"category","type":{"type":"map","values":"string"},"doc":"the list of categoryId, categoryName associated, send Id as key, name as value" },
{"name":"keywords", "type":{"type":"array","items":"string"},"doc":"this helps in long run in long run analysis, send the search keywords used for product"},
{"name":"groupid", "type":["string","null"],"doc":"Use this to represent or flag value of group to which it belong, e.g. it may be variation of same product"},
{"name":"price", "type":"double","aliases":["cost","unitprice"]},
{"name":"unit", "type":"string", "default":"Each"},
{"name":"unittype", "type":"string","aliases":["UOM"], "default":"Each"},
{"name":"url", "type":["string","null"],"doc":"URL of the product to return for more details on product, this will be used for event analysis. Provide full url"},
{"name":"imageurl","type":["string","null"],"doc":"Image url to display for return values"},
{"name":"updatedtime", "type":"string"},
{"name":"currency","type":"string", "default":"INR"},
{"name":"image", "type":["bytes","null"] , "doc":"fallback in case we cant provide the image url, use this judiciously and limit size"},
{"name":"features","type":{"type":"map","values":"string"},"doc":"Pass your classification attributes as features in key-value pair"}
]}
I am able to parse this, but when I try to write with it as follows, I keep getting an error. What am I missing? This is in Python 3. I verified that the schema is well-formatted JSON, too.
from avro import schema as sc
from avro import datafile as df
from avro import io as avio
import os

_prodschema = 'product.avsc'
_namespace = 'testing.avro'
dirname = os.path.dirname(__file__)
avroschemaname = os.path.join(os.path.dirname(__file__), _prodschema)
sch = {}
with open(avroschemaname, 'r') as f:
    sch = f.read().encode(encoding='utf-8')
    f.close()
proschema = sc.Parse(sch)
print("Schema processed")
writer = df.DataFileWriter(open(os.path.join(dirname, "products.json"), 'wb'),
                           avio.DatumWriter(), proschema)
print("Just about to append the json")
writer.append({
    "id": "23232",
    "brand": "Relaxo",
    "category": [{"123": "shoe", "122": "accessories"}],
    "keywords": ["relaxo", "shoe"],
    "groupid": "",
    "price": "799.99",
    "unit": "Each",
    "unittype": "Each",
    "url": "",
    "imageurl": "",
    "updatedtime": "03/23/2017",
    "currency": "INR",
    "image": "",
    "features": [{"color": "black", "size": "10", "style": "contemperory"}]
})
writer.close()
What am I missing here?
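A likely cause, reading the schema against the datum (my analysis; no traceback was posted): Avro map fields take Python dicts, not lists of dicts; a double needs a float, not a string; and a ["bytes","null"] union needs bytes or None, not "". A sketch of the append call adjusted accordingly:
writer.append({
    "id": "23232",
    "brand": "Relaxo",
    "category": {"123": "shoe", "122": "accessories"},  # map -> dict, not a list of dicts
    "keywords": ["relaxo", "shoe"],
    "groupid": "",
    "price": 799.99,  # double -> float, not a string
    "unit": "Each",
    "unittype": "Each",
    "url": "",
    "imageurl": "",
    "updatedtime": "03/23/2017",
    "currency": "INR",
    "image": None,  # ["bytes","null"] -> bytes or None, not ""
    "features": {"color": "black", "size": "10", "style": "contemperory"}  # map -> dict
})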

How to compress warc records with lzma (*.warc.xz) in python3?

I have a list of warc records. Every single item in the list is created like this:
header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
data = "Some string"
record = warc.WARCRecord(header, data.encode('utf-8', 'replace'))
Now, I am using *.warc.gz to store my records like this:
output_file = warc.open("my_file.warc.gz", 'wb')
And write records like this:
output_file.write_record(record) # type of record is WARCRecord
But how can I compress with lzma, as *.warc.xz? I have tried replacing gz with xz when calling warc.open, but warc in Python 3 does not support this format. I found this attempt, but I was not able to save a WARCRecord with it:
output_file = lzma.open("my_file.warc.xz", 'ab', preset=9)
header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
data = "Some string"
record = warc.WARCRecord(header, data.encode('utf-8', 'replace'))
output_file.write(record)
The error message is:
TypeError: a bytes-like object is required, not 'WARCRecord'
Thanks for any help.
The WARCRecord class has a write_to method for writing records to a file object.
You could use that to write records to a file created with lzma.open().
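Putting the two together, a minimal sketch (assuming write_to takes a file-like object, as described above):
import lzma
import warc

output_file = lzma.open("my_file.warc.xz", 'ab', preset=9)
header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
record = warc.WARCRecord(header, "Some string".encode('utf-8', 'replace'))
record.write_to(output_file)  # serialize the record to the lzma file object
output_file.close()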
