From the documentation, it is not clear to me how I can use this option.
Is it for telling arangoimport: "Hey, please use this field as the _from/_to field when you import"?
define string… Define key=value for a #key# entry in config file
This has nothing to do with data import. arangod, arangosh, etc. all support --define to set environment variables, which can be referenced in configuration files via placeholders like #FOO# and set like --define FOO=something on the command line.
This is briefly explained here: https://www.arangodb.com/docs/stable/administration-configuration.html#environment-variables-as-parameters
Example configuration file example.conf:
[server]
endpoint = tcp://127.0.0.1:#PORT#
Example invocation:
arangosh --config example.conf --define PORT=8529
For delimited source files (CSV, TSV) you can use the option --translate to map columns to different attributes, e.g. --translate "child=_from" --translate "parent=_to".
https://www.arangodb.com/docs/stable/programs-arangoimport-examples-csv.html#attribute-name-translation
If the references are just keys, then you may use --from-collection-prefix and --to-collection-prefix to prepend the respective collection names.
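A combined invocation might look like this (the file, collection, and prefix names are made up for illustration, not taken from the question):
arangoimport --file edges.csv --type csv --collection edges --translate "child=_from" --translate "parent=_to" --from-collection-prefix children --to-collection-prefix parents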
--translate is not supported for JSON input. You can do the translation and import using a driver, edit the source file somehow, or import into a collection first and then use AQL to adjust the fields, roughly as sketched below.
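For example, assuming the JSON documents were imported into a plain collection first, an AQL data-modification query along these lines (all collection and attribute names here are hypothetical) could copy them into an edge collection with proper _from/_to values:
FOR doc IN importedDocs
  INSERT {
    _from: CONCAT("children/", doc.child),
    _to: CONCAT("parents/", doc.parent)
  } INTO edges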
I have a CSV file in the following format that I want to copy from an external share to my data lake:
Test; Text
"1"; "This is a text
which goes on on a second line
and on on a third line"
"2"; "Another Test"
I now want to load it with a Copy Data task in an Azure Synapse pipeline. The result is the following:
Test; Text
"1";" \"This is a text"
"which goes on on a second line";
"and on on a third line\"";
"2";" \"Another Test\""
So, as you can see, it is not handling the multi-line text correctly. I also do not see an option to handle multi-line text within a Copy Data task. Unfortunately, I am not able to use a Data Flow task, because it does not allow running with an external Azure runtime, which I am forced to use for security reasons.
Of course, I am not talking about just this single test file; I have many thousands of files.
My settings for the CSV file are as follows:
CSV Connector Settings
Can someone tell me how to handle this kind of multiline data correctly?
Do I have any other options within Synapse (apart from the Dataflows)?
Thanks a lot for your help
Well, it turns out this is not possible with a CSV file.
The pragmatic solution is to transfer the CSV files as "binary" files instead, and only load and transform them later with a Python notebook in Synapse.
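As a rough sketch of that notebook step (the path and the semicolon delimiter are assumptions based on the sample above), pandas reads quoted multi-line fields correctly out of the box:
import pandas as pd

# Placeholder path - in Synapse this would point at the file transferred to the data lake.
df = pd.read_csv(
    "test.csv",
    sep=";",
    quotechar='"',
    skipinitialspace=True,  # the sample rows have a space between the delimiter and the opening quote
)
print(df)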
You can achieve this in Azure Data Factory by iterating through all lines and checking for the delimiter in each line, and then using string manipulation functions with Set Variable activities to convert the multi-line data to a single line.
Look at the following example. I have a Set Variable activity with an empty value (taken from a parameter) for the req variable.
In the Lookup activity, create a dataset with the following configuration for the multi-line CSV:
In the ForEach activity, I iterate over each row by setting the items value to @range(0,sub(activity('Lookup1').output.count,1)). Inside the ForEach, I have an If activity with the following condition:
@contains(activity('Lookup1').output.value[item()]['Prop_0'],';')
If this is true, then I concatenate the current row to the req variable using 2 Set Variable activities.
temp: @if(contains(activity('Lookup1').output.value[add(item(),1)]['Prop_0'],';'),concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],decodeUriComponent('%0D%0A')),concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],' '))
actual (req variable): @variables('val')
For the false branch, I have handled the concatenation in the following way:
temp1: @concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],' ')
actual1 (req variable): @variables('val2')
Now, I have used a final variable to handle the last line of the file, with the following dynamic content:
@if(contains(last(activity('Lookup1').output.value)['Prop_0'],';'),concat(variables('req'),decodeUriComponent('%0D%0A'),last(activity('Lookup1').output.value)['Prop_0']),concat(variables('req'),last(activity('Lookup1').output.value)['Prop_0']))
Finally, I have added a Copy Data activity with a sample source file that has 1 column and 1 row (using this to copy our actual data).
Now, configure the source file as shown below:
Create an additional column whose value is the final variable:
Create a sink with the following configuration and select mapping only for the column created above:
When I run the pipeline, I get the data as required. The following is an output image for reference.
I've downloaded some JSON data from Shodan and only want to retain some fields from it. To explore what I want, I'm running the following, which works:
shodan parse --fields ip,port --separator , "data.json.gz"
However, I now want to output/export the data, so I'm trying to run the following:
shodan parse --fields ip,port -O "data_processed.json.gz" "data.json.gz"
It's requiring me to specify a filter parameter, which I don't need. If I do add an empty filter like so, it tells me data_processed.json.gz doesn't exist.
shodan parse --fields ip,port -f -O "data_processed.json.gz" "data.json.gz"
I'm a bit stumped on how to export only certain fields of my data; how do I go about doing so?
If you only want to output those 2 properties then you can simply pipe them to a file:
shodan parse --fields ip,port --separator , data.json.gz > data_processed.csv
A few things to keep in mind:
You probably want to export the ip_str property, as it's a more user-friendly version of the IP address; the ip property is a numeric version of the IP address aimed at users storing the information in a database (see the example after this list).
You can convert your data file into Excel or CSV format using the shodan convert command, for example: shodan convert data.json.gz csv. See here for a quick guide: https://help.shodan.io/guides/how-to-convert-to-excel
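Regarding the first point, swapping the field name in the earlier command is enough; for example:
shodan parse --fields ip_str,port --separator , data.json.gz > data_processed.csv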
I wanted to add things such as Size, BuildHost, BuildDate, etc. to the rpm query output, but adding them to the spec file results in an "unknown tag" error. How can I do this so that these things are reflected when I run the rpm query command?
These tags are determined when the package is built; they cannot be forced to specific values.
For example, BuildHost is hardcoded in rpmbuild and cannot be changed. There is an RFE (https://bugzilla.redhat.com/show_bug.cgi?id=1309367) to allow modifying it from the command line, but right now you cannot change it with any tag in the spec file, nor by passing an option to rpmbuild on the command line.
I assume it is very similar for the other values you mentioned.
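To illustrate, these build-time values can still be shown at query time with --queryformat; the package file name here is just a placeholder:
rpm -qp --queryformat '%{NAME}-%{VERSION}: size=%{SIZE} bytes, built on %{BUILDHOST} at %{BUILDTIME:date}\n' mypackage.rpm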
RPM5 permits arbitrary unique tag names to be added to header metadata.
The tag names are configured in a colon-separated list in a macro. The new tags can then be used in spec files and extracted using --queryformat.
All arbitrary tags are string (or string array) valued.
If I have a line like below in my spark-env.sh file
export MY_JARS=$(jars=(/my/lib/dir/*.jar); IFS=,; echo "${jars[*]}")
which gives me a comma delimited list of jars in /my/lib/dir, is there a way I can specify
spark.jars $MY_JARS
in the spark-defaults.conf?
tl;dr No, it cannot, but there is a solution.
Spark reads the conf file as a properties file without any additional env var substitution.
What you could do, however, is write the computed value of MY_JARS from spark-env.sh straight into spark-defaults.conf using >> (append). The last entry wins, so don't worry if there end up being several similar entries.
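A minimal sketch of that idea, assuming a standard $SPARK_HOME/conf layout (adjust the paths to your installation):
# in spark-env.sh: compute the comma-separated jar list
MY_JARS=$(jars=(/my/lib/dir/*.jar); IFS=,; echo "${jars[*]}")
# append the property; the last occurrence of a key in spark-defaults.conf wins
echo "spark.jars $MY_JARS" >> "$SPARK_HOME/conf/spark-defaults.conf"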
I tried this with Spark 1.4 and it did not work.
spark-defaults.conf is a key/value file, and looking at the code it seems the values are not evaluated.
At least in Spark 3+, there is a way to do this: ${env:VAR_NAME}.
For instance, if you want to add the current username to the Spark metrics namespace, add this to your spark-defaults.conf file:
spark.metrics.namespace=${env:USER}
The generated metrics will show the username instead of the default namespace:
testuser.driver.BlockManager.disk.diskSpaceUsed_MB.csv
testuser.driver.BlockManager.memory.maxMem_MB.csv
testuser.driver.BlockManager.memory.maxOffHeapMem_MB.csv
testuser.driver.BlockManager.memory.maxOnHeapMem_MB.csv
... etc ...
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/internal/VariableSubstitution.html
A helper class that enables substitution using syntax like ${var}, ${system:var} and ${env:var}.
I'm using JMeter 2.6, and have the following setup for my test:
-
|-test.jmx
|-myschema.xsd
I've set up an XML Schema Assertion, and typed "myschema.xsd" in the File Name field. Unfortunately, this doesn't work:
HTTP Request
Output schema : error: line=1 col=114 schema_reference.4:
Failed to read schema document 'myschema.xsd', because
1) could not find the document;
2) the document could not be read;
3) the root element of the document is not <xsd:schema>.
I've tried adding several things to the path, including ${__P(user.dir)} (points to the home dir of the user) and ${__BeanShell(pwd())} (doesn't return anything). I got it working by giving the absolute path, but the script is supposed to be used by others, so that's no good.
I could make it use a property value defined in the command line, but I'd like to avoid it as well, for the same reason.
How can I correctly point the Assertion to the schema under these circumstances?
Looks like in this situation you have to:
validate your XML against the XSD manually: simply use the corresponding Java code in e.g. a BeanShell Assertion or BeanShell PostProcessor (see the sketch after this list);
here is a pretty nice solution: https://stackoverflow.com/a/16054/993246 (you can use any other approach you like for this);
dig into JMeter's sources and amend how the XML Schema file is obtained so that it supports variables in the path (File Name field) - like CSV Data Set Config does;
but the previous way seems to be much easier;
run your JMeter test scenario from a shell script or Ant task which first copies your XSD into JMeter's /bin dir before execution - then at least the XML Schema Assertion can be used "as is".
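For the first option, a rough, untested sketch of such a BeanShell Assertion is shown below; it assumes the XSD sits next to the .jmx file and uses JMeter's FileServer to resolve that directory:
import java.io.File;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.apache.jmeter.services.FileServer;

// Resolve myschema.xsd relative to the directory of the test plan (.jmx)
String xsdPath = FileServer.getFileServer().getBaseDir() + File.separator + "myschema.xsd";

try {
    SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = factory.newSchema(new File(xsdPath));
    Validator validator = schema.newValidator();
    // "prev" is the previous SampleResult exposed to the BeanShell Assertion
    validator.validate(new StreamSource(new StringReader(prev.getResponseDataAsString())));
} catch (Exception e) {
    // "Failure" and "FailureMessage" are the assertion's result variables
    Failure = true;
    FailureMessage = "Schema validation failed: " + e.getMessage();
}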
If you find any other/better way, please share it.
Hope this helps.
Summary: in the end I've used http://path.to.schema/myschema.xsd as the File Name parameter in the Assertion.
Explanation: following Alies Belik's advice, I've found that the code for setting up the schema looks something like this:
DocumentBuilderFactory parserFactory = DocumentBuilderFactory.newInstance();
...
parserFactory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaSource", xsdFileName);
where xsdFileName is a string (the attribute string is actually a constant, I inlined it for readability).
According to e.g. this page, the attribute, when given as a String, is interpreted as a URI - which includes HTTP URLs. Since I already have the schema accessible through HTTP, I've opted for this solution.
Add 'myschema.xsd' to the \bin directory of your apache-jmeter installation, next to 'ApacheJMeter.jar', or set the 'File Name' of the 'XML Schema Assertion' to the path of your 'myschema.xsd' relative to that starting point.
E.g.
JMeter: C:\Users\username\programs\apache-jmeter-2.13\bin\ApacheJMeter.jar
Schema: C:\Users\username\workspace\yourTest\schema\myschema.xsd
File Name: ..\..\..\workspace\yourTest\schema\myschema.xsd