I have two RDDs:
rdd1 [String,String,String]: Name, Address, Zipcode
rdd2 [String,String,String]: Name, Address, Landmark
I am trying to join these two RDDs using the function rdd1.join(rdd2), but I am getting an error:
error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]
The join should combine the two RDD[String]s, and the output RDD should look something like:
rddOutput : Name,Address,Zipcode,Landmark
In the end I want to save the output as a JSON file.
Can someone help me with this?
As said in the comments, you have to convert your RDDs to PairRDDs before joining, which means that each RDD must be of type RDD[(key, value)]. Only then can you perform the join by key. In your case, the key is composed of (Name, Address), so you would have to do something like:
// First, we create the first PairRDD, with (name, address) as key and zipcode as value:
val pairRDD1 = rdd1.map { case (name, address, zipcode) => ((name, address), zipcode) }
// Then, we create the second PairRDD, with (name, address) as key and landmark as value:
val pairRDD2 = rdd2.map { case (name, address, landmark) => ((name, address), landmark) }
// Now we can join them. fullOuterJoin keeps unmatched keys from both sides,
// so zipcode and landmark come back as Options. The result is an RDD of
// ((name, address), (Option[zipcode], Option[landmark])), which we map to the
// desired format (defaulting missing values to empty strings here):
val joined = pairRDD1.fullOuterJoin(pairRDD2).map {
  case ((name, address), (zipcode, landmark)) =>
    (name, address, zipcode.getOrElse(""), landmark.getOrElse(""))
}
More info about PairRDD functions can be found in Spark's Scala API documentation.
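Since you also want to save the result as a JSON file: a minimal sketch, assuming Spark's DataFrame API is available through a SparkSession named spark (the column names and the output path here are illustrative):
import spark.implicits._

// Convert the joined RDD of tuples to a DataFrame and write it out as
// line-delimited JSON (one JSON object per record).
joined.toDF("name", "address", "zipcode", "landmark")
  .write
  .json("/path/to/rddOutput")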
A single row can be inserted like this:
client.query("insert into tableName (name, email) values ($1, $2) ", ['john', 'john#gmail.com'], callBack)
This approach automatically escapes any special characters.
How do I insert multiple rows at once?
I need to implement this:
"insert into tableName (name, email) values ('john', 'john@gmail.com'), ('jane', 'jane@gmail.com')"
I can just use JS string operators to compile such rows manually, but then I need to add special-character escaping somehow.
Use pg-format like below.
var format = require('pg-format');
var values = [
[7, 'john22', 'john22@gmail.com', '9999999922'],
[6, 'testvk', 'testvk@gmail.com', '88888888888']
];
client.query(format('INSERT INTO users (id, name, email, phone) VALUES %L', values),[], (err, result)=>{
console.log(err);
console.log(result);
});
One other way using PostgreSQL json functions:
client.query('INSERT INTO table (columns) ' +
'SELECT m.* FROM json_populate_recordset(null::your_custom_type, $1) AS m',
[JSON.stringify(your_json_object_array)], function(err, result) {
if (err) {
console.log(err);
} else {
console.log(result);
}
});
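For instance, assuming the users table from the pg-format example above (every Postgres table has an implicit composite row type of the same name, so null::users can stand in for null::your_custom_type), this could look like:
var users = [
  { id: 7, name: 'john22', email: 'john22@gmail.com', phone: '9999999922' },
  { id: 6, name: 'testvk', email: 'testvk@gmail.com', phone: '88888888888' }
];

client.query(
  'INSERT INTO users (id, name, email, phone) ' +
  'SELECT m.* FROM json_populate_recordset(null::users, $1) AS m',
  [JSON.stringify(users)],
  function(err, result) {
    if (err) {
      console.log(err);
    } else {
      console.log(result);
    }
  }
);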
Following the article Performance Boost from the pg-promise library, and its suggested approach:
// Concatenates an array of objects or arrays of values, according to the template,
// to use with insert queries. Can be used either as a class type or as a function.
//
// template = formatting template string
// data = array of either objects or arrays of values
function Inserts(template, data) {
if (!(this instanceof Inserts)) {
return new Inserts(template, data);
}
this.rawType = true;
this.toPostgres = function () {
return data.map(d=>'(' + pgp.as.format(template, d) + ')').join(',');
};
}
An example of using it, exactly as in your case:
var users = [['John', 23], ['Mike', 30], ['David', 18]];
db.none('INSERT INTO Users(name, age) VALUES $1', Inserts('$1, $2', users))
.then(data=> {
// OK, all records have been inserted
})
.catch(error=> {
// Error, no records inserted
});
And it will work with an array of objects as well:
var users = [{name: 'John', age: 23}, {name: 'Mike', age: 30}, {name: 'David', age: 18}];
db.none('INSERT INTO Users(name, age) VALUES $1', Inserts('${name}, ${age}', users))
.then(data=> {
// OK, all records have been inserted
})
.catch(error=> {
// Error, no records inserted
});
UPDATE-1
For a high-performance approach via a single INSERT query see Multi-row insert with pg-promise.
UPDATE-2
The information here is quite old now; see the latest syntax for Custom Type Formatting. What used to be _rawDBType is now rawType, and formatDBType was renamed to toPostgres.
You are going to have to generate the query dynamically. Although possible, this is risky and could easily lead to SQL injection vulnerabilities if you do it wrong. It's also easy to end up with off-by-one errors between the indexes of the parameters in your query and the parameters you're passing in.
That being said, here is an example of how you could write this, assuming you have an array of users that looks like {name: string, email: string} (note that node-postgres uses numbered $1-style placeholders, so they have to be generated per row):
client.query(
  `INSERT INTO table_name (name, email) VALUES ${users
    .map((_, i) => `($${i * 2 + 1}, $${i * 2 + 2})`)
    .join(',')}`,
  users.reduce((params, u) => params.concat([u.name, u.email]), []),
  callBack,
)
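With two users, this generates INSERT INTO table_name (name, email) VALUES ($1, $2),($3, $4) together with a flat parameter array, keeping the placeholder indexes aligned with the values.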
An alternative approach is to use a library like @databases/pg (which I wrote):
await db.query(sql`
INSERT INTO table_name (name, email)
VALUES ${sql.join(users.map(u => sql`(${u.name}, ${u.email})`), ',')}
`)
@databases requires the query to be tagged with sql and uses that to ensure any user data you pass is always automatically escaped. This also lets you write the parameters inline, which I think makes the code much more readable.
Using the npm module postgres (porsager/postgres), which has tagged template strings at its core:
https://github.com/porsager/postgres#multiple-inserts-in-one-query
const users = [{
name: 'Murray',
age: 68,
garbage: 'ignore'
},
{
name: 'Walter',
age: 80,
garbage: 'ignore'
}]
sql`insert into users ${ sql(users, 'name', 'age') }`
// Is translated to:
insert into users ("name", "age") values ($1, $2), ($3, $4)
// Here you can also omit column names which will use all object keys as columns
sql`insert into users ${ sql(users) }`
// Which results in:
insert into users ("name", "age", "garbage") values ($1, $2, $3), ($4, $5, $6)
I just thought I'd post this since it's brand new out of beta, and I've found it to have a better philosophy than the other Postgres/Node libraries posted in the other answers, IMHO.
I know I am late to the party, but what worked for me was a simple map. I hope this helps someone looking for the same thing. Note that this interpolates values directly into the SQL string, so it is only safe for trusted input; an escaped variant follows the snippet.
let sampleQuery = array.map(myRow =>
`('${myRow.column_a}','${myRow.column_b}') `
)
let res = await pool.query(`INSERT INTO public.table(column_a, column_b) VALUES ${sampleQuery} `)
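A sketch of the same map idea with escaping, assuming the pg-format module shown earlier:
const format = require('pg-format');

// Build the rows as arrays of values and let %L quote and escape each one.
const values = array.map(myRow => [myRow.column_a, myRow.column_b]);
const res = await pool.query(
  format('INSERT INTO public.table (column_a, column_b) VALUES %L', values)
);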
client.query("insert into tableName (name, email) values ($1, $2),($3, $4) ", ['john', 'john#gmail.com','john', 'john#gmail.com'], callBack)
doesn't help?
Futher more, you can manually generate a string for query:
insert into tableName (name, email) values (" +var1 + "," + var2 + "),(" +var3 + ", " +var4+ ") "
if you read here, https://github.com/brianc/node-postgres/issues/530 , you can see the same implementation.
I have an SQLite3 table with BLOB primary key (id):
CREATE TABLE item (
id BLOB PRIMARY KEY,
title VARCHAR(100)
);
In JavaScript models, the primary key (id) is represented as a hex string (two characters per byte):
var item = {
id: "2202D1B511604790922E5A090C81E169",
title: "foo"
}
When I run the query below, the id parameter gets bound as a string. But I need it to be bound as a BLOB.
db.run('INSERT INTO item (id, title) VALUES ($id, $title)', {
$id: item.id,
$title: item.title
});
To illustrate, the above code generates the following SQL:
INSERT INTO item (id, title) VALUES ("2202D1B511604790922E5A090C81E169", "foo");
What I need is this:
INSERT INTO item (id, title) VALUES (X'2202D1B511604790922E5A090C81E169', "foo");
Apparently, the string needs to be converted to a buffer:
db.run('INSERT INTO item (id, title) VALUES ($id, $title)', {
$id: Buffer.from(item.id, 'hex'),
$title: item.title
});
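To read the row back, the BLOB column comes out as a Buffer, which can be converted back to the hex string form; a small sketch using the same node-sqlite3 API:
db.get('SELECT id, title FROM item WHERE id = $id', {
  $id: Buffer.from(item.id, 'hex')
}, function (err, row) {
  if (err) throw err;
  // row.id is a Buffer; convert it back to the uppercase hex representation.
  console.log(row.id.toString('hex').toUpperCase(), row.title);
});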
Try casting the string as a blob:
INSERT INTO item(id, title) VALUES(CAST(id_string AS BLOB), 'foo');
Note also that the right way to quote strings in SQL is to use single quotes.
I'm using the Node.js BigQuery client library and need to get a list of tables from a dataset without the partition suffix.
For example, I have a number of partitioned tables:
table1_20170101
table1_20170102
...
table1_20170131
table2_20170101
table2_20170102
...
table2_20170131
I need to get [table1,table2] as a result, but using the getTables method I get [table1_20170101,table1_20170102...]
Script example below:
dataset.getTables(function (err, tables) {
let result = [];
for (let key in tables) {
result.push(tables[key].id);
}
console.log(result);
res.send(result);
});
Is there any available method to get the "unpartitioned" table names?
Fetching all the tables with their _date suffixes, splitting, and de-duplicating seems very slow when there are a lot of partitioned tables.
You could perform a query against the __TABLES_SUMMARY__ table, instead of using the getTables method.
The sample below gets all the tables in a dataset, splits each name on the _ character, takes the first part, and then creates a distinct list.
bigquery.query({
query: [
'SELECT DISTINCT SPLIT(table_id,"_")[ORDINAL(1)] as tableName',
'FROM `DATASETNAME.__TABLES_SUMMARY__`;'
].join(' '),
params: []
}, function(err, rows) {
let result = [];
for (const row of rows) {
result.push(row.tableName);
}
console.log(result);
});
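If the table names can themselves contain underscores, splitting on _ and taking the first part breaks; a variant (a sketch, assuming standard SQL and the same client call shape as above) strips only a trailing 8-digit date suffix:
bigquery.query({
  query: [
    "SELECT DISTINCT REGEXP_REPLACE(table_id, r'_\\d{8}$', '') AS tableName",
    'FROM `DATASETNAME.__TABLES_SUMMARY__`;'
  ].join(' ')
}, function (err, rows) {
  if (err) throw err;
  console.log(rows.map(row => row.tableName));
});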
You could use a meta query:
select * from `wr_live.__TABLES_SUMMARY__`
I have data like this and I want to create following JSON Document.
How can I achieve this in Spark, and what is the most efficient way to do it?
name|contact |type
jack|123-123-1234 |phone
jack|jack.reach@xyz.com |email
jack|123 main street |address
jack|34545544445 |mobile
{
"name" : "jack",
"contacts":[
{
"contact" : "123-123-1234",
"type" : "phone"
},
{
"contact" : "jack.reach#xyz.com",
"type" : "email"
},
{
"contact" : "123 main street",
"type" : "address"
},
{
"contact" : "34545544445",
"type" : "mobile"
}
]
}
This is just a sample use case. I have a large data set where I have to collapse multi-column rows into one row with some grouping logic.
My current approach is to write a UDAF that reads each row, stores it in a buffer, and merges it. So the code would be:
val mergeUDAF = new ColumnUDAF
val tempTable = inputTable.withColumn("contacts", struct($"contact", $"type"))
val outputTable = tempTable.groupBy($"name").agg(mergeUDAF($"contacts").alias("contacts"))
I am trying to figure out what other approaches there might be. I am trying to achieve this using Spark-SQL.
I think you should just create an RDD from your CSV data, group by "name", then map to a JSON string:
val data = sc.parallelize(Seq("jack|123-123-1234|phone", "jack|jack.reach@xyz.com|email", "david|123 main street|address", "david|34545544445|mobile")) // change this to load your data as an RDD
val result = data.map(_.split('|')).groupBy(a => a(0)).map(a => {
val contact = a._2.map(c => s"""{"contact": "${c(1)}", "type": "${c(2)}" }""" ).mkString(",")
s"""{"name": "${a._1}", "contacts":[ ${contact}] }"""
}).collect.mkString(",")
val json = s"""[ ${result} ] """
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Contact(contact: String, contactType: String)
case class Person(name: String, contacts: Seq[Contact])

object SparkTestGrouping {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LocalTest").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val inputData = Seq("jack|123-123-1234|phone", "jack|jack.reach@xyz.com|email", "jack|123 main street|address", "jack|34545544445|mobile")
    val finalData = sc.parallelize(inputData)

    // Turn each line into (name, Seq("contact|contactType")) and merge the Seqs by key.
    val convertData = finalData.map(_.split('|'))
      .map(line => (line(0), Seq(line(1) + "|" + line(2))))
      .reduceByKey((x, y) => x ++: y)

    val output = convertData.map(line => (line._1, line._2.map(_.split('|')).map(obj => Contact(obj(0), obj(1)))))
    val finalOutput = output.map(line => Person(line._1, line._2))
    finalOutput.toDF().toJSON.foreach(println)

    sc.stop()
  }
}
You can create tuples from the data with the key field and use reduceByKey to group the data. In the example above, I created a tuple (name, Seq("contact|contactType")) and used reduceByKey to group the data by name. Once the data is grouped, you can use case classes to convert it to DataFrames and Datasets if you need to do further joins on it, or simply to create the JSON document.