How to extract URLs from HYPERLINK formulas in an Excel file when reading into a Scala Spark DataFrame

I have an Excel file with Column A containing HYPERLINKS like this:
=HYPERLINK("https://google.com","View Link")
I can load the Excel file into a Scala Spark DataFrame using the com.crealytics.spark.excel library, but I only get the 'View Link' text, which does NOT contain the URL:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
object Tut {
def main(args: Array[String]): Unit = {
println("started")
val spark = SparkSession
.builder()
.appName("MySpark")
.config("spark.master", "local")
.getOrCreate()
val customSchema = StructType(Array(
StructField("A", StringType, nullable = false),
StructField("B", IntegerType, nullable = false)))
val df = spark.read.format("com.crealytics.spark.excel")
.option("useHeader", "true").schema(customSchema)
.option("dataAddress", "A1")
.load("/MY_PATH/src/main/resources/SampFile.xlsx")
df.printSchema()
df.show()
}
}
My goal is to load the entire content of the HYPERLINK as a string:
=HYPERLINK("https://google.com","View Link")
and then extract the url
https://google.com.
Do you know if there is a way to do this using com.crealytics.spark.excel library or any other spark library? Thanks in advance!

Regarding the other question linked in the comments: it tries to read the column as BinaryType and cast it directly to StringType. That is not possible out of the box (not even in plain Scala), because you need to know how to decode the bytes into a human-readable string, for instance which encoding to use.
So a custom approach is needed. I used a sample in-code DataFrame, and this approach worked:
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("ddd".getBytes, 1)
| ).toDF("A", "B")
df: org.apache.spark.sql.DataFrame = [A: binary, B: int]
scala> val btos: Array[Byte] => String = bytes => new String(bytes) // short for "bytes to string"
btos: Array[Byte] => String = $Lambda$2322/665683021@738f6e44
scala> spark.udf.register("btos", btos)
res0: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2322/665683021@738f6e44,StringType,List(Some(class[value[0]: binary])),Some(btos),true,true)
scala> df.withColumn("C", expr("btos(A)")).show
+----------+---+---+
| A| B| C|
+----------+---+---+
|[64 64 64]| 1|ddd|
+----------+---+---+
Hope this works for you.
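For the URL-extraction half of the question, here is a minimal sketch in plain Scala. It assumes the formula text (for example =HYPERLINK("https://google.com","View Link")) has already been obtained as a string; how to get that text out of the workbook is not covered here. The object name HyperlinkUrl and the method extractUrl are hypothetical names chosen for illustration.

```scala
object HyperlinkUrl {
  // Captures the first quoted argument of HYPERLINK(...).
  // The leading '=' is optional, the function name is matched case-insensitively,
  // and the URL may be followed by ',' or ';' (argument separator) or ')'.
  private val Formula = """(?i)^\s*=?\s*HYPERLINK\(\s*"([^"]*)"\s*[,;)]""".r

  // Returns Some(url) for a HYPERLINK formula and None for any other text.
  def extractUrl(formula: String): Option[String] =
    Formula.findFirstMatchIn(formula).map(_.group(1))
}
```

Once a DataFrame column holds the formula text, the same pattern could be applied through a UDF wrapping extractUrl, or through Spark's built-in regexp_extract function.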

Related

Spark SQL: adding column comments with withComment does not work

I want to add comments to a DataFrame's columns and then write it to a Hive table, but it does not work: the comments are not added to the table.
I tried Spark 2.4 and Spark 3, and neither works, although an older version seems to work. I don't know why. I read the source code but found nothing. If you know why, please tell me. Thank you.
The code is as follows:
val personRDD: RDD[Row] = GetTestRDD.map((line: String) => {
val arr: Array[String] = line.split(" ")
Row(arr(0).toInt, arr(1), arr(2).toInt)
})
val schema: StructType = StructType(List(
StructField("id", IntegerType, nullable = false),
StructField("name", StringType, nullable = false),
StructField("age", IntegerType, nullable = false)
))
val frame: DataFrame = sparkSession.createDataFrame(personRDD, schema)
println("Output original info")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
// Process after adding comments
val commentMap: Map[String, String] = Map("id" -> "unique identifier", "name" -> "name", "age" -> "age")
val newSchema: Seq[StructField] = frame.schema.map((s: StructField) => {
println(commentMap(s.name))
s.withComment(commentMap(s.name))
})
sparkSession.createDataFrame(frame.rdd, StructType(newSchema)).repartition(10)
println("Output info after processing")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
The output:
Output original info
(id,{})
(name,{})
(age,{})
Output info after processing
(id,{})
(name,{})
(age,{})
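One thing visible in the snippet above: the DataFrame returned by createDataFrame with the new schema is never assigned to anything, and the final loop prints frame.schema, which is the original schema. StructField and DataFrame are immutable, so withComment cannot change frame in place. The sketch below shows this on the schema alone, with no SparkSession needed; the object name CommentedSchema is hypothetical, and whether the comments then reach the Hive table depends on the write path, which this sketch does not cover.

```scala
import org.apache.spark.sql.types._

object CommentedSchema {
  // Same columns as the schema in the question.
  val original: StructType = StructType(List(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)
  ))

  val comments: Map[String, String] =
    Map("id" -> "unique identifier", "name" -> "name", "age" -> "age")

  // withComment returns a new StructField; `original` is left untouched.
  val commented: StructType =
    StructType(original.map(f => f.withComment(comments(f.name))))
}
```

To use it with the question's data, the new DataFrame would have to be kept and inspected, for example val commentedFrame = sparkSession.createDataFrame(frame.rdd, CommentedSchema.commented), followed by printing commentedFrame.schema rather than frame.schema.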

How to group row values into a column based on an identifier?

Please see this example; I am trying to achieve this using spark sql/spark scala, but did not find any direct solution. Please let me know if it's not possible using Spark SQL / Spark Scala, in that case I can write a java/python program by writing a file out of As-Is.
github: https://github.com/mvasyliv/LearningSpark/blob/master/src/main/scala/spark/GroupListValueToColumn.scala
source code
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object GroupListValueToColumn extends App {
val spark = SparkSession.builder()
.master("local")
.appName("Mapper")
.getOrCreate()
case class Customer(
cust_id: Int,
addresstype: String
)
import spark.implicits._
val source = Seq(
Customer(300312008, "credit_card"),
Customer(300312008, "to"),
Customer(300312008, "from"),
Customer(300312009, "to"),
Customer(300312009, "from"),
Customer(300312010, "to"),
Customer(300312010, "credit_card"),
Customer(300312010, "from")
).toDF()
val res = source.groupBy("cust_id").agg(collect_list("addresstype"))
res.show(false)
// +---------+-------------------------+
// |cust_id |collect_list(addresstype)|
// +---------+-------------------------+
// |300312010|[to, credit_card, from] |
// |300312008|[credit_card, to, from] |
// |300312009|[to, from] |
// +---------+-------------------------+
val res1 = source.groupBy("cust_id").agg(collect_set("addresstype"))
res1.show(false)
// +---------+------------------------+
// |cust_id |collect_set(addresstype)|
// +---------+------------------------+
// |300312010|[from, to, credit_card] |
// |300312008|[from, to, credit_card] |
// |300312009|[from, to] |
// +---------+------------------------+
}
Since answers are being given as opposed to good googling:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "a"),
(1, "c"),
(2, "e")
).toDF("k", "v")
val df1 = df.groupBy("k").agg(collect_list("v"))
df1.show

Create an empty DataFrame with specified schema without SparkContext with SparkSession [duplicate]

I want to create a DataFrame with a specified schema in Scala. I have tried reading an empty file as JSON, but I don't think that's the best practice.
Let's assume you want a DataFrame with the following schema:
root
|-- k: string (nullable = true)
|-- v: integer (nullable = false)
You simply define schema for a data frame and use empty RDD[Row]:
import org.apache.spark.sql.types.{
StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
val schema = StructType(
StructField("k", StringType, true) ::
StructField("v", IntegerType, false) :: Nil)
// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
spark.createDataFrame(sc.emptyRDD[Row], schema)
PySpark equivalent is almost identical:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])
# or df = sc.parallelize([]).toDF(schema)
# Spark < 2.0
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)
Using implicit encoders (Scala only) with Product types like Tuple:
import spark.implicits._
Seq.empty[(String, Int)].toDF("k", "v")
or case class:
case class KV(k: String, v: Int)
Seq.empty[KV].toDF
or
spark.emptyDataset[KV].toDF
As of Spark 2.0.0, you can do the following.
Case Class
Let's define a Person case class:
scala> case class Person(id: Int, name: String)
defined class Person
Import spark SparkSession implicit Encoders:
scala> import spark.implicits._
import spark.implicits._
And use SparkSession to create an empty Dataset[Person]:
scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
Schema DSL
You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).
scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)
scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)
scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType
scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> emptyDF.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
Java version to create empty DataSet:
public Dataset<Row> emptyDataSet(){
SparkSession spark = SparkSession.builder().appName("Simple Application")
.config("spark.master", "local").getOrCreate();
Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<>(), getSchema());
return emptyDataSet;
}
public StructType getSchema() {
String schemaString = "column1 column2 column3 column4 column5";
List<StructField> fields = new ArrayList<>();
StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
fields.add(indexField);
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
return schema;
}
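A generic Scala helper that derives the schema from a case class through reflection:
import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
// Assumes hiveContext and sc are already in scope.
def createEmptyDataFrame[T: ru.TypeTag] =
hiveContext.createDataFrame(sc.emptyRDD[Row],
ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
)
case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]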
Here you can create a schema using StructType in Scala and pass it an empty RDD, which lets you create an empty table.
The following code does that.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.BooleanType
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.types.StringType
//import org.apache.hadoop.hive.serde2.objectinspector.StructField
object EmptyTable extends App {
val conf = new SparkConf;
val sc = new SparkContext(conf)
//create sparksession object
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
//Created schema for three columns
val schema = StructType(
StructField("Emp_ID", LongType, true) ::
StructField("Emp_Name", StringType, false) ::
StructField("Emp_Salary", LongType, false) :: Nil)
//Created Empty RDD
var dataRDD = sc.emptyRDD[Row]
//pass rdd and schema to create dataframe
val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)
newDFSchema.createOrReplaceTempView("tempSchema")
sparkSession.sql("create table Finaltable AS select * from tempSchema")
}
This is helpful for testing purposes.
Seq.empty[String].toDF()
Here is a solution that creates an empty DataFrame in PySpark 2.0.0 or later.
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
I had a special requirement wherein I already had a dataframe but given a certain condition I had to return an empty dataframe so I returned df.limit(0) instead.
I'd like to add the following syntax which was not yet mentioned:
Seq[(String, Integer)]().toDF("k", "v")
It makes it clear that the () part is for values. It's empty, so the dataframe is empty.
This syntax is also beneficial for adding null values manually. It just works, while other options either don't or are overly verbose.
As of Spark 2.4.3
val df = SparkSession.builder().getOrCreate().emptyDataFrame

Spark DataFrame: How to specify schema when writing as Avro

I want to write a DataFrame in Avro format using a provided Avro schema rather than Spark's auto-generated schema. How can I tell Spark to use my custom schema on write?
After applying the patch in https://github.com/databricks/spark-avro/pull/222/, I was able to specify a schema on write as follows:
df.write.option("forceSchema", myCustomSchemaString).avro("/path/to/outputDir")
I hope the method below helps.
import org.apache.spark.sql.types._
val schema = StructType( StructField("title", StringType, true) ::StructField("averageRating", DoubleType, false) ::StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")
Example:
Download data from below site. https://datasets.imdbws.com/
Download the movies data title.ratings.tsv.gz
Copy to below location. /home/cloudera/workspace/movies/title.ratings.tsv.gz
Start Spark-shell and type below command.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val title = sqlContext.read.text("file:///home/cloudera/Downloads/movies/title.ratings.tsv.gz")
scala> title.limit(5).show
+--------------------+
| value|
+--------------------+
|tconst averageRat...|
| tt0000001 5.8 1350|
| tt0000002 6.5 157|
| tt0000003 6.6 933|
| tt0000004 6.4 93|
+--------------------+
val titlerdd = title.rdd
case class Title(titleId:String, averageRating:Float, numVotes:Int)
val titlefirst = titlerdd.first
val titleMapped = titlerdd.filter(e=> e!=titlefirst).map(e=> {
val rowStr = e.getString(0)
val splitted = rowStr.split("\t")
val titleId = splitted(0).trim
val averageRating = scala.util.Try(splitted(1).trim.toFloat) getOrElse(0.0f)
val numVotes = scala.util.Try(splitted(2).trim.toInt) getOrElse(0)
Title(titleId, averageRating, numVotes)
})
val titleMappedDF = titleMapped.toDF
scala> titleMappedDF.limit(2).show
+---------+-------------+--------+
| titleId|averageRating|numVotes|
+---------+-------------+--------+
|tt0000001| 5.8| 1350|
|tt0000002| 6.5| 157|
+---------+-------------+--------+
import org.apache.spark.sql.types._
val schema = StructType( StructField("title", StringType, true) ::StructField("averageRating", DoubleType, false) ::StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")
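For comparison, the spark-avro module that ships with Spark 2.4 and later documents an avroSchema option that takes an Avro schema in JSON format. The following is only a sketch under that assumption: it requires the spark-avro module on the classpath, the object name AvroWithSchema and the record name Title are hypothetical, and the field list simply mirrors the titleMappedDF columns shown above. It has not been checked against the databricks spark-avro package used in the answers above.

```scala
import org.apache.spark.sql.DataFrame

object AvroWithSchema {
  // Avro schema as JSON; field names and types mirror titleMappedDF
  // (titleId: String, averageRating: Float, numVotes: Int).
  // titleId is declared as a union with null because string columns are nullable in Spark.
  val titleSchemaJson: String =
    """{
      |  "type": "record",
      |  "name": "Title",
      |  "fields": [
      |    {"name": "titleId", "type": ["null", "string"]},
      |    {"name": "averageRating", "type": "float"},
      |    {"name": "numVotes", "type": "int"}
      |  ]
      |}""".stripMargin

  // Writes df as Avro, asking the writer to use the schema above
  // instead of the one Spark would generate.
  def write(df: DataFrame, path: String): Unit =
    df.write.format("avro").option("avroSchema", titleSchemaJson).save(path)
}
```

A call would then look like AvroWithSchema.write(titleMappedDF, outputDir), where outputDir is whatever output path you choose.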

Spark group by - Pig conversion

I am trying to achieve something like this in Spark. The following code snippet is from Pig Latin. Is there any way I can do the same thing with Spark?
A = load 'student' AS (name:chararray,age:int,gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
B = GROUP A BY age;
Result:
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Thanks.
It's easy to collect a list of names by age, as the code here does. To keep complete rows in each group the way Pig does, the columns would have to be wrapped in a struct before collecting.
// Input data
val df = {
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import java.time.LocalDate
val simpleSchema = StructType(
StructField("name", StringType) ::
StructField("age", IntegerType) ::
StructField("gpa", FloatType) :: Nil)
val data = List(
Row("John", 18, 4.0f),
Row("Mary", 19, 3.8f),
Row("Bill", 20, 3.9f),
Row("Joe", 18, 3.8f)
)
spark.createDataFrame(data.asJava, simpleSchema)
}
df.show()
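// The imports inside the block above are scoped to that block, so col and collect_list are imported here.
import org.apache.spark.sql.functions.{col, collect_list}
val df2 = df.groupBy(col("age")).agg(collect_list(col("name")))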
df2.show()
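To get closer to Pig's GROUP, where each group holds complete tuples, one option is to wrap the columns in a struct before collecting them. This is a minimal sketch rather than an exact Pig equivalent: the object name GroupRows and the output column name students are hypothetical, and the input is assumed to have the name, age and gpa columns from the example above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, struct}

object GroupRows {
  // Groups by age and collects each complete (name, age, gpa) row of the group
  // into an array-of-structs column called "students".
  def groupByAge(df: DataFrame): DataFrame =
    df.groupBy(col("age"))
      .agg(collect_list(struct(col("name"), col("age"), col("gpa"))).as("students"))
}
```

Calling GroupRows.groupByAge(df).show(false) on the DataFrame built above should give one row per age. The order of elements inside each collected list is not guaranteed.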
