I want to create streams using Spring XD. According to the basic definition of a stream, I have:
Source: http
Transform script: a Groovy script I wrote, stored at /xd/modules/processor/script/transform.groovy
Sink: cassandra
I want to store unstructured JSON data, POSTed over http, in a Cassandra table.
I run Spring XD in single-node mode and then start xd-shell. To create the stream, I use:
stream create --name test2 --definition "http --port=9000 | transform --script=insert_transform.groovy | cassandra --contactPoints=127.0.0.1 --keyspace=db1 --ingestQuery='insert into table1 (emp_id,emp_name,amount,time) values (?,?,?,?)'" --deploy
I got the message:
Created and deployed new stream 'test2'
After that, when I POST data over http, I get the following error:
500 INTERNAL_SERVER_ERROR
On the xd-singlenode console, the following error appears in the log:
Caused by: java.io.FileNotFoundException: class path resource [insert_transform.groovy] cannot be opened because it does not exist.
Which version of Spring XD? I just tested it with 1.3.0.RELEASE and it worked fine...
xd:>stream create foo --definition "time | transform --script=test.groovy | log" --deploy
$ cat xd/modules/processor/scripts/test.groovy
'Time = ' + payload
2016-01-11T09:45:19-0500 1.3.0.RELEASE INFO xdbus.foo.1-1 sink.foo - Time = 2016-01-11 09:45:19
2016-01-11T09:45:20-0500 1.3.0.RELEASE INFO xdbus.foo.1-1 sink.foo - Time = 2016-01-11 09:45:20
2016-01-11T09:45:21-0500 1.3.0.RELEASE INFO xdbus.foo.1-1 sink.foo - Time = 2016-01-11 09:45:21
Related
I am trying to run a COPY INTO statement for about 10 files of roughly 100 MB each that are stored in the data lake, and it keeps throwing this error:
Msg 110806, Level 11, State 0, Line 7
110806;A distributed query failed: A severe error occurred on the current command. The results, if any, should be discarded.
Operation cancelled by user.
The command I used is:
COPY INTO AAA.AAAA
FROM 'https://AAAA.blob.core.windows.net/data_*.csv'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FIELDQUOTE = N'"',
    FIELDTERMINATOR = N',',
    FIRSTROW = 2
);
Where did I go wrong? Please advise.
I tried using py4j, as described in Connecting and testing a JDBC driver from Python:
from py4j.JavaGateway import java_gateway
# Open JVM interface with the JDBC Jar
jdbc_jar_path = 'C:\Program Files\CData\CData JDBC Driver for MongoDB 2019\lib\cdata.jdbc.mongodb.jar'
gateway = java_gateway(classpath=jdbc_jar_path)
# Load the JDBC Jar
jdbc_class = "cdata.jdbc.mongodb.MongoDBDriver"
gateway.jvm.class.forName(jdbc_class)
# Initiate connection
jdbc_uri = "jdbc:mongodb:Server=127.0.0.1;Port=27017;Database=EmployeeDB;"
con = gateway.jvm.DriverManager.getConnection(jdbc_uri)
# Run a query
sql = "select * from Employees"
stmt = con.createStatement(sql)
rs = stmt.executeQuery()
while rs.next():
    rs.getInt(1)
    rs.getFloat(2)
    .
    .
rs.close()
stmt.close()
I get this error:
File "assignment.py", line 9
gateway.jvm.class.forName(jdbc_class)
^
SyntaxError: invalid syntax
Try replacing
gateway.jvm.class.forName(jdbc_class)
with
gateway.jvm.Class.forName(jdbc_class)
(i.e. capitalise the c in class.)
Class.forName is the Java method you want to call here. (Note also how the D in DriverManager is capitalised in gateway.jvm.DriverManager.getConnection(...).) However, the syntax error is caused because class is a Python keyword. You can't have a local variable, or a function or method, named class.
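As a quick illustration of why the lowercase version can never work: `class` is a reserved word in Python, so it cannot appear after a dot, although an attribute of that name can still be reached via `getattr`. A standalone sketch (plain Python, no py4j involved):

```python
class Jvm:
    pass

jvm = Jvm()
# An attribute literally named "class" can exist on a Python object...
setattr(jvm, "class", "java.lang.Class stand-in")
# ...but `jvm.class` is a SyntaxError at parse time,
# so the attribute must be read via getattr:
print(getattr(jvm, "class"))
```

py4j sidesteps this entirely because the Java class you need is spelled `Class`, which is not a Python keyword.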
I'm using the following script to output the results of a Spark SQL query to a file in Azure Data Lake Store. However, instead of creating a file called myresults.json and publishing the results to it, the script writes the results to a randomly named file like part-0000-tid:
The code is as follows:
example1 = spark.sql("""SELECT
CF.CountryName AS CountryCarsSold
,COUNT(CF.CountryName) AS NumberCountry
,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput3/myresults.json")
Can someone let me know how to save the output as a single file that is overwritten each time it is saved?
Thanks
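For what it's worth, Spark's `json()` sink always writes a directory of `part-*` files rather than a single named file. One common workaround (a sketch only, demonstrated against a local temp directory instead of the `adl://` path) is to write with `mode("overwrite")` and then rename the lone part file afterwards:

```python
import glob
import json
import os
import tempfile

# Stand-in for the directory Spark produces from coalesce(1).write.json(path):
# one part-* file plus a _SUCCESS marker.
outdir = tempfile.mkdtemp()
with open(os.path.join(outdir, "part-00000-tid-4321.json"), "w") as f:
    f.write(json.dumps({"CountryCarsSold": "PL", "NumberCountry": 1}))
open(os.path.join(outdir, "_SUCCESS"), "w").close()

# After the Spark job finishes, rename the single part file to the target name.
part_file = glob.glob(os.path.join(outdir, "part-*"))[0]
target = os.path.join(outdir, "myresults.json")
os.rename(part_file, target)
print(sorted(os.listdir(outdir)))
```

On ADLS the same rename would need the Hadoop FileSystem API (from the driver) or an Azure SDK call after the job completes, since `os.rename` only works on local paths.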
Using a very simple-minded approach to read data, select a subset of it, and write it out, I'm getting a 'DataFrameWriter' object is not callable error.
I'm surely missing something basic.
Using an AWS EMR:
$ pyspark
> dx = spark.read.parquet("s3://my_folder/my_date*/*.gz.parquet")
> dx_sold = dx.filter("keywords like '%sold%'")
# select agent ids
> dc = dx_sold.select("agent_id")
Question
The goal is to now save the values of dc ... e.g. to s3 as a line-separated text file.
What's a best-practice to do so?
Attempts
I tried
dc.write("s3://my_folder/results/")
but received
TypeError: 'DataFrameWriter' object is not callable
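That TypeError is expected: `DataFrame.write` is a property that returns a `DataFrameWriter`, and the writer object itself is not callable; you call one of its methods (e.g. `dc.write.csv(...)`) instead. A minimal, Spark-free sketch of the same failure mode:

```python
class FakeDataFrame:
    @property
    def write(self):
        # Returns a plain writer-like object, as PySpark's DataFrame.write does
        return object()

df = FakeDataFrame()
try:
    df.write("s3://my_folder/results/")  # calls the returned object -> TypeError
except TypeError as e:
    print(type(e).__name__)
```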
Also tried
X = dc.collect()
but eventually received a TimeOut error message.
Also tried
dc.write.format("csv").options(delimiter=",").save("s3://my_folder/results/")
But eventually received messages of the form
TaskSetManager: Lost task 4323.0 in stage 9.0 (TID 88327, ip-<hidden>.internal, executor 96): TaskKilled (killed intentionally)
The first comment was correct: it was a filesystem problem.
The ad-hoc solution was to convert the desired results to a list and then serialize that list. E.g.
import pickle

dc = dx_sold.select("agent_id").distinct()
result_list = [str(c[0]) for c in dc.collect()]  # each c is a Row; c[0] is the agent_id
pickle.dump(result_list, open(result_path, "wb"))
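To sanity-check that approach, the list survives a pickle round trip. A sketch with stand-in values in place of the collected agent ids:

```python
import os
import pickle
import tempfile

result_list = ["agent-001", "agent-002"]  # stand-in for the collected ids
result_path = os.path.join(tempfile.mkdtemp(), "agents.pkl")

# Serialize on the driver...
with open(result_path, "wb") as f:
    pickle.dump(result_list, f)

# ...and read it back to confirm nothing was lost.
with open(result_path, "rb") as f:
    restored = pickle.load(f)

print(restored == result_list)  # True
```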
I'm trying to extract data from the Data Lake Store, which is saved as JSON. When I try to submit the script, I get the error:
Vertex failure triggered quick job abort. Vertex failed:
SV1_Extract[0] with error: Vertex user code error.
I will add that I used:
https://github.com/Azure/usql/tree/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats
REFERENCE ASSEMBLY [jsonextr];
REFERENCE ASSEMBLY [Newtonsoft.Json];
#searchlog =
EXTRACT city string
FROM "/weather/logs/2016/09/23/08_0_a26cf4d21dd24f53b7903a9206195c58.json"
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
#res =
SELECT *
FROM #searchlog;
OUTPUT #res
TO "/datastreamanalitics/SearchLog-from-Data-Lake.json"
USING new Microsoft.Analytics.Samples.Formats.Json.JsonOutputter();
The JSON document:
{"city":{"id":7532702,"name":"Brodnica","coord":{"lon":19.406401,"lat":53.2579},"country":"PL","population":0},"cod":"200","message":0.0067,"cnt":1,"list":[{"dt":1474624800,"temp":{"day":15.97,"min":10,"max":17.14,"night":10.01,"eve":17.02,"morn":10},"pressure":1025.32,"humidity":79,"weather":[{"id":802,"main":"Clouds","description":"scattered clouds","icon":"03d"}],"speed":3.22,"deg":271,"clouds":32}],"EventProcessedUtcTime":"2016-09-23T08:04:06.9372695Z","PartitionId":0,"EventEnqueuedUtcTime":"2016-09-23T08:04:05.2300000Z"}