Spark %sql get the partition index - apache-spark

This type of logic:
.mapPartitionsWithIndex((index, iter) => {
  iter.map(x => (index, x))
})
How can I do this in just SQL (a %sql cell) in Spark? I do not think this is possible.
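As an aside, Spark does ship a built-in spark_partition_id() function that covers this case. A minimal sketch, assuming a hypothetical temp view named my_table:

import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical data, registered as a temp view so a %sql cell could query it.
val df = spark.range(0, 100)
df.createOrReplaceTempView("my_table")

// DataFrame API: tag each row with the partition it lives in.
df.withColumn("partition_index", spark_partition_id()).show()

// The same thing in plain SQL (what a %sql cell would run):
spark.sql("SELECT spark_partition_id() AS partition_index, id FROM my_table").show()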

Related

Enforcing LogStash events to be outputted in order

I have a pipeline like this:
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/logstash-core/lib/jars/postgresql-42.2.6.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "${JDBC_CONNECTION_STRING}"
    jdbc_user => "${JDBC_USER}"
    jdbc_password => "${JDBC_PASSWORD}"
    statement =>
      # Select everything from my_table and
      # add a fake record to indicate the end of the results.
      "SELECT * FROM my_table
       UNION ALL
       SELECT 'END_OF_QUERY_RESULTS' AS some_key;
      "
  }
}
filter {
  ruby {
    init => "
      require 'net/http'
      require 'json'
    "
    code => '
      if event.get("some_key") == "END_OF_QUERY_RESULTS"
        uri = URI.parse(ENV["MY_URL"])
        response = Net::HTTP.get_response(uri)
        result = JSON.parse(response.body)
        if response.code == "202"
          puts "Success!"
        else
          puts "ERROR: Couldn\'t start processing."
        end
        event.cancel()
      end
    '
  }
}
output {
  mongodb {
    bulk => true
    bulk_interval => 2
    collection => "${MONGO_DB_COLLECTION}"
    database => "${MONGO_DB_NAME}"
    generateId => true
    uri => "mongodb://${MONGO_DB_HOST}:${MONGO_DB_PORT}/${MONGO_DB_NAME}"
  }
}
I simply grab all the data from a PostgreSQL table into a MongoDB collection.
What I'm trying to achieve: I want to call an API after ALL the data has been loaded into the MongoDB collection.
What I tried:
I tried the approach above of adding a fake record at the end of the SQL query results to serve as a flag marking the last event. The problem with this approach is that Logstash does not maintain the order of events, so the event carrying the 'END_OF_QUERY_RESULTS' string can reach the filter before it is actually the last one.
Setting pipeline.workers: 1 and pipeline.ordered: true did not help either.
I also tried sleeping for a while in the Ruby filter, which works, but I don't really know how long I should sleep.

How to return GenericInternalRow from spark udf

I have a Spark UDF written in Scala that takes a couple of columns, applies some logic, and outputs an InternalRow. A Spark schema of StructType is also present.
But when I try to return the InternalRow from the UDF, I get this exception:
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.catalyst.GenericInternalRow is not supported
val getData = (hash: String, `type`: String) => {
  val schema = hash match {
    case "people" => peopleSchema
    case "empl" => emplSchema
  }
  getGenericInternalRow(schema)
}
val data = udf(getData)
Spark Version : 2.4.5
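For illustration, a common workaround is to return an external org.apache.spark.sql.Row instead of a Catalyst GenericInternalRow, and pass the schema explicitly as the UDF's return type. A minimal sketch, not from the thread: it assumes a single fixed schema (the original switches between peopleSchema and emplSchema) and fills in placeholder values:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// Hypothetical fixed schema standing in for peopleSchema.
val peopleSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

// Return an external Row rather than a GenericInternalRow; values are placeholders.
val getData = (hash: String, kind: String) => Row("example-name", 42)

// Passing the StructType as the UDF's return DataType lets Spark convert the Row.
val data = udf(getData, peopleSchema)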

SQL Statement on Logstash JDBC | How do I filter only today's inserted rows?

In my Oracle database I have a column insert_date. In Elasticsearch I want to index only the events that were inserted today.
Here is my conf:
jdbc {
  type => "D"
  jdbc_connection_string => "jdbc:oracle:thin:@//xxx.xx.xx.xx:1521/xx"
  jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
  jdbc_user => "xxx"
  jdbc_password => "xx"
  statement => "select * from mytable where insert_date = TRUNC(SYSDATE) order by insert_date desc"
  schedule => "0 * * * * *"
  clean_run => true
  last_run_metadata_path => "/data/application/.logstash_jdbc_last_run"
}
I run this but keep getting an error in Logstash.
Can you try the query below?
Select * from mytable where insert_date > sysdate - 24/24 order by insert_date

Elasticsearch/Logstash duplicating output when using schedule

I'm using Elasticsearch with Logstash.
I want to update the indexes when the database changes, so I decided to use the Logstash schedule option. But every minute the output is appended with the database table's records again.
Example: the contract table has 2 rows.
After the first minute the total is 2; one minute later the total output is 4.
How can I solve this?
Here is my config file. The command is bin/logstash -f contract.conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/resource"
    jdbc_user => "postgres"
    jdbc_validate_connection => true
    jdbc_driver_library => "/var/www/html/iltodgeree/logstash/postgres.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * FROM contracts;"
    schedule => "* * * * *"
    codec => "json"
  }
}
output {
  elasticsearch {
    index => "resource_contracts"
    document_type => "metadata"
    hosts => "localhost:9200"
  }
}
You need to modify your output by specifying the document_id setting, using the ID field from your contracts table. That way, you'll never get duplicates.
output {
  elasticsearch {
    index => "resource_contracts"
    document_type => "metadata"
    document_id => "%{ID_FIELD}"
    hosts => "localhost:9200"
  }
}
Also if you have an update timestamp in your contracts table, you can modify the SQL statement in your input like below in order to only copy the records that changed recently:
statement => "SELECT * FROM contracts WHERE timestamp > :sql_last_value;"

Joining two RDD[String] - Spark Scala

I have two RDDs:
rdd1 [String,String,String]: Name, Address, Zipcode
rdd2 [String,String,String]: Name, Address, Landmark
I am trying to join these 2 RDDs using the function: rdd1.join(rdd2)
But I am getting an error: error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]
The join should combine the two RDD[String]s and the output RDD should be something like:
rddOutput : Name,Address,Zipcode,Landmark
And I want to save the output as a JSON file in the end.
Can someone help me with this?
As said in the comments, you have to convert your RDDs to PairRDDs before joining, which means that each RDD must be of type RDD[(key, value)]. Only then can you perform the join by the key. In your case, the key is composed of (Name, Address), so you would have to do something like:
// First, we create the first PairRDD, with (name, address) as key and zipcode as value:
val pairRDD1 = rdd1.map { case (name, address, zipcode) => ((name, address), zipcode) }
// Then, we create the second PairRDD, with (name, address) as key and landmark as value:
val pairRDD2 = rdd2.map { case (name, address, landmark) => ((name, address), landmark) }
// Now we can join them.
// fullOuterJoin wraps each side in an Option (a key may be missing from one of the RDDs),
// so the result is an RDD of ((name, address), (Option[zipcode], Option[landmark])).
// We then map it to the desired flat format, defaulting missing values to "":
val joined = pairRDD1.fullOuterJoin(pairRDD2).map {
  case ((name, address), (zipcodeOpt, landmarkOpt)) =>
    (name, address, zipcodeOpt.getOrElse(""), landmarkOpt.getOrElse(""))
}
More info about PairRDD functions can be found in Spark's Scala API documentation.
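The question also asks about saving the result as JSON. A minimal sketch for that last step, assuming a SparkSession named spark is in scope and using a hypothetical output path:

import spark.implicits._

// Turn the joined RDD of tuples into a DataFrame with named columns,
// then write it out as JSON (one JSON object per line, one file per partition).
joined.toDF("Name", "Address", "Zipcode", "Landmark")
  .write
  .json("/tmp/joined-output") // hypothetical path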
