We receive two files (a data file and a metadata file) from vendors for data ingestion.
Vendor 1
data file format
user_id has_insurance postal_code city
101 Y 20001 Newyork
102 N 40001 Boston
metadata file format
user_id,String
has_insurance,Boolean
postal_code,String
city, String
We will receive the same data fields from another vendor, but the field order in the data file might differ, as below:
Vendor 2
data file format
user_id postal_code city has_insurance
101 20001 Newyork Y
102 40001 Boston N
metadata file format
user_id,String
postal_code,String
city, String
has_insurance,Boolean
The metadata file will contain the fields order. Is it possible to assign schema dynamically based on the metadata file while reading CSV file?
// Function to derive the Spark DataType for a given field type name
def strToDataType(str: String): DataType = {
  if (str == "String") StringType
  else if (str == "Boolean") BooleanType
  else StringType
}
val metadataDf = spark.sparkContext.textFile("metadata_folder")
val headerSchema = StructType(metadataDf.map(_.split(",")).map(x => StructField(x(0),strToDataType(x(1)),true)))
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.schema(headerSchema) // defining based on the custom schema
.load("data_file.csv")
val headerSchema =
  StructType(metadataDf.map(_.split(",")).map(x => StructField(x(0), strToDataType(x(1)), true)))
When I try to create the schema dynamically with the command above, I get the error below. Could you please advise?
<console>:34: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.types.StructField])
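For reference, this compile error comes from passing an RDD[StructField] to StructType, which only accepts a local Seq, Array, or java.util.List of fields. A minimal sketch of a fix, assuming the metadata file is small enough to collect to the driver; it resolves that particular compile error, though it does not address the extra header/format lines in the data files themselves:
// Collect the metadata lines locally so StructType receives an Array[StructField]
val headerSchema = StructType(
  spark.sparkContext.textFile("metadata_folder")
    .map(_.split(",").map(_.trim))
    .map(x => StructField(x(0), strToDataType(x(1)), nullable = true))
    .collect()
)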
You can't use the CSV data source in this case, but you can parse the files yourself without too much difficulty. The basic steps are: read the metadata file as one text blob, read the data file while separating data lines from header lines, and then apply the schema.
I made some assumptions about your separators. One thing to remember is that you shouldn't assume any ordering of the lines in your DataFrame once it is loaded, because you don't know which data is being loaded on which worker. The code below tries to identify data vs. metadata lines based on their content, so you may need to adjust those rules.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

def loadVendor(dataPath: String, metaPath: String): DataFrame = {
  val df = spark.read.text(dataPath)

  // First read the metadata file; wholeTextFiles lets us get it all
  // as a single string so we can parse it locally
  val metaText = sc.wholeTextFiles(metaPath).first._2
  val metaLines = metaText.
    split("\n").
    map(_.split(",").map(_.trim))

  // Identify the header as the line that contains all the field names
  val fields = Seq("user_id", "has_insurance", "postal_code", "city")
  val headerDf = fields.foldLeft(df)((df2, fname) => df2.filter($"value".contains(fname)))
  val headerLine = headerDf.first.getString(0)
  val header = headerLine.split(" ")

  // Identify data rows as the ones that aren't special lines or the header
  val data = df.
    filter($"value" =!= headerLine).
    filter(!$"value".startsWith("Vendor 1")).
    filter(!$"value".startsWith("data file format"))

  // Split the data fields on the separator and assign column names.
  // Assumes any whitespace is your separator.
  val rows = data.select(split($"value", raw"\s+") as "fields")
  val named = header.zipWithIndex.map { case (f, idx) => $"fields".getItem(idx).alias(f) }
  val table = rows.select(named: _*)

  // Cast to the right types
  val castCols = metaLines.map { case Array(cname, ctype) => col(cname).cast(ctype) }
  val typed = table.select(castCols: _*)

  // Return the columns sorted by name, so union'ing multiple DFs will line up
  typed.select(table.columns.sorted.map(col): _*)
}
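A hypothetical call site (the paths are placeholders); as noted above, the line-filtering rules, e.g. the "Vendor 1" prefix, may need adjusting for the second vendor before the two frames are combined:
// Paths are placeholders for wherever the vendor files land
val vendor1Df = loadVendor("vendor1/data_file.csv", "vendor1/metadata_file.csv")
val vendor2Df = loadVendor("vendor2/data_file.csv", "vendor2/metadata_file.csv")
// Columns come back sorted by name, so the union lines up regardless of field order
val allVendorsDf = vendor1Df.union(vendor2Df)
allVendorsDf.show()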
Here is the data printed
scala> df.show
+-------+-------------+-----------+-------+
| city|has_insurance|postal_code|user_id|
+-------+-------------+-----------+-------+
|Newyork| true| 20001| 101|
| Boston| false| 40001| 102|
+-------+-------------+-----------+-------+
And here is the schema printed
scala> df.printSchema
root
|-- city: string (nullable = true)
|-- has_insurance: boolean (nullable = true)
|-- postal_code: string (nullable = true)
|-- user_id: string (nullable = true)
Not able to split the column into multiple columns in Spark Data-frame and through RDD.
I tried some other code, but it works only with fixed columns.
Ex: the data types are name: string, city: list(string)
I have a text file and the input data is like below:
Name, city
A, (hyd,che,pune)
B, (che,bang,del)
C, (hyd)
Required output is:
A,hyd
A,che
A,pune
B,che
B,bang
B,del
C,hyd
After reading the text file and converting it to a DataFrame, the DataFrame will look like below:
scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+
You can use the explode function on your DataFrame:
val explodeDF = inputDF.withColumn("city", explode($"city")).show()
http://sqlandhadoop.com/spark-dataframe-explode/
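Note that explode applies only once city is an array-typed column. A minimal sketch, assuming a spark-shell session (so spark and its implicits are available) and a made-up inputDF where city is already an array:
import org.apache.spark.sql.functions.explode
import spark.implicits._

// Hypothetical input with city already typed as an array column
val inputDF = Seq(
  ("A", Seq("hyd", "che", "pune")),
  ("B", Seq("che", "bang", "del")),
  ("C", Seq("hyd"))
).toDF("Name", "city")

// One (Name, city) row per array element
val explodeDF = inputDF.withColumn("city", explode($"city"))
explodeDF.show()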
Now that I understand you're loading the full line as a string, here is how to achieve your output.
I have defined two user-defined functions:
val split_to_two_strings: String => Array[String] = _.split(",", 2) // split the input into two elements, which become the two columns (name, city)
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",") // strip "(" and ")", then convert to a list of cities
import org.apache.spark.sql.functions.{explode, trim, udf}
val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)
val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
  .select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
  .withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
  .drop($"city_list")
  .withColumn("city", explode($"city_array"))
  .drop($"city_array")
outputDF.show()
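For what it's worth, a similar result can be had with built-in functions only (regexp_extract plus split plus explode), which avoids the UDF round-trip. A sketch under the same assumptions about the line format (like the UDF version, it does not filter out the "Name, city" header line); noUdfDF is just an illustrative name:
import org.apache.spark.sql.functions.{explode, regexp_extract, split, trim}

val noUdfDF = inputDF
  .select(
    trim(regexp_extract($"value", "^([^,]+),", 1)).as("Name"), // text before the first comma
    regexp_extract($"value", "\\((.*)\\)", 1).as("cities"))    // text between the parentheses
  .withColumn("city", explode(split($"cities", ",")))
  .drop("cities")
noUdfDF.show()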
Hope this helps
I have a collection of files specified with a comma separator, like:
hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
and I'm loading the files with Apache Spark, all at once, with:
val input = sc.textFile(files)
Also, I have additional information associated with each file - the unique ID, for example:
File ID
--------------------------------------------------
hdfs://user/cloudera/date=2018-01-15 | 12345
hdfs://user/cloudera/date=2018-01-16 | 09245
hdfs://user/cloudera/date=2018-01-17 | 345hqw4
and so on
As the output, I need to receive a DataFrame in which each row carries the ID of the file from which that line was read.
Is it possible to pass this information to Spark in some way, so that it can be associated with the lines?
Spark SQL approach with a UDF (you can achieve the same thing with a join if you represent the File -> ID mapping as a DataFrame):
import org.apache.spark.sql.functions
val inputDf = sparkSession.read.text(".../src/test/resources/test")
.withColumn("fileName", functions.input_file_name())
def withId(mapping: Map[String, String]) = functions.udf(
(file: String) => mapping.get(file)
)
val mapping = Map(
"file:///.../src/test/resources/test/test1.txt" -> "id1",
"file:///.../src/test/resources/test/test2.txt" -> "id2"
)
val resultDf = inputDf.withColumn("id", withId(mapping)(inputDf("fileName")))
resultDf.show(false)
Result:
+-----+---------------------------------------------+---+
|value|fileName |id |
+-----+---------------------------------------------+---+
|row1 |file:///.../src/test/resources/test/test1.txt|id1|
|row11|file:///.../src/test/resources/test/test1.txt|id1|
|row2 |file:///.../src/test/resources/test/test2.txt|id2|
|row22|file:///.../src/test/resources/test/test2.txt|id2|
+-----+---------------------------------------------+---+
test1.txt:
row1
row11
test2.txt:
row2
row22
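The join variant mentioned above could look roughly like this (a sketch; the mapping DataFrame and its column names are made up for illustration, and the elided paths are kept as in the mapping above):
import org.apache.spark.sql.functions.input_file_name
import sparkSession.implicits._

// Represent the File -> ID mapping as a DataFrame and join it on the captured file name
val mappingDf = Seq(
  ("file:///.../src/test/resources/test/test1.txt", "id1"),
  ("file:///.../src/test/resources/test/test2.txt", "id2")
).toDF("fileName", "id")

val withFileName = sparkSession.read.text(".../src/test/resources/test")
  .withColumn("fileName", input_file_name())

val joinedDf = withFileName.join(mappingDf, Seq("fileName"), "left")
joinedDf.show(false)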
This could help (not tested)
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// read a single text file into a DataFrame and add an 'id' column
def readOneFile(filePath: String, fileId: String)(implicit spark: SparkSession): DataFrame = {
  val dfOriginal: DataFrame = spark.read.text(filePath)
  val dfWithIdColumn: DataFrame = dfOriginal.withColumn("id", lit(fileId))
  dfWithIdColumn
}
// read all text files into a single DataFrame
def readAllFiles(filePathIdsSeq: Seq[(String, String)])(implicit spark: SparkSession): DataFrame = {
  // create an empty DataFrame with the expected schema
  val emptyDfSchema: StructType = StructType(List(
    StructField("value", StringType, false),
    StructField("id", StringType, false)
  ))
  val emptyDf: DataFrame = spark.createDataFrame(
    rowRDD = spark.sparkContext.emptyRDD[Row],
    schema = emptyDfSchema
  )

  val unionDf: DataFrame = filePathIdsSeq.foldLeft(emptyDf) {
    (intermediateDf: DataFrame, filePathIdTuple: (String, String)) =>
      intermediateDf.union(readOneFile(filePathIdTuple._1, filePathIdTuple._2))
  }
  unionDf
}
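A hypothetical call, reusing the paths and IDs listed in the question:
implicit val spark: SparkSession = SparkSession.builder().getOrCreate()

// Paths and IDs taken from the question's table
val filePathIds = Seq(
  "hdfs://user/cloudera/date=2018-01-15" -> "12345",
  "hdfs://user/cloudera/date=2018-01-16" -> "09245",
  "hdfs://user/cloudera/date=2018-01-17" -> "345hqw4"
)
val allFilesDf = readAllFiles(filePathIds)
allFilesDf.show(false)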
References
spark.read.text(..) method
Create empty DataFrame
I am trying to load a file into Spark.
If I load a normal textFile into Spark like below:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
The outcome is:
partFile: org.apache.spark.sql.Dataset[String] = [value: string]
I can see a Dataset in the output. But if I load a JSON file:
val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")
The outcome is a DataFrame with a ready-made schema:
pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]
JSON/Parquet/ORC files carry a schema, so I understand this is a Spark 2.x feature that makes things easier: we directly get a DataFrame in this case, whereas for a normal textFile we get a Dataset with no schema, which makes sense.
What I'd like to know is how I can add a schema to a Dataset that results from loading a textFile into Spark. For an RDD, there is the case class/StructType option to add a schema and convert it to a DataFrame.
Could anyone let me know how I can do it?
When you use textFile, each line of the file will be a string row in your Dataset. To convert to a DataFrame with a schema, you can use toDF:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import sqlContext.implicits._
val df = partFile.toDF("string_column")
In this case, the DataFrame will have a schema of a single column of type StringType.
If your file contains a more complex schema, you can either use the csv reader (if the file is in a structured csv format):
val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")
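The csv reader can also take an explicit schema instead of inferring one, which ties back to the schema question. A small sketch; the column names and types here are made up for illustration:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical two-column schema, just to show .schema(...) on the csv reader
val partSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val partDf = spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .schema(partSchema) // use the supplied schema instead of inference
  .csv("hdfs://quickstart:8020/user/cloudera/partfile")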
Or you can process your Dataset using map, then use toDF to convert it to a DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and the other column to be the fourth character (also as an Int):
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import sqlContext.implicits._
val processedDataset: Dataset[(Int, Int)] = partFile.map {
  line: String => (line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF("value0", "value3")
Also, you can define a case class, which will represent the final schema for your DataFrame:
case class MyRow(value0: Int, value3: Int)
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import sqlContext.implicits._
val processedDataset: Dataset[MyRow] = partFile.map {
  line: String => MyRow(line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF
In both cases above, calling df.printSchema would show:
root
|-- value0: integer (nullable = true)
|-- value3: integer (nullable = true)
I have a Hive table with the structure:
I need to read the string field, break out the keys, and turn them into Hive table columns; the final table should look like this:
Very important: the number of keys in the string is dynamic, and the names of the keys are also dynamic.
An attempt would be to read the string with Spark SQL, create a DataFrame with a schema based on all the strings, and use the saveAsTable() function to turn the DataFrame into the final Hive table, but I do not know how to do this.
Any suggestions?
A naive solution (assuming unique (code, date) combinations and no embedded = or ; in the string) can look like this:
import org.apache.spark.sql.functions.{explode, first, split}
val df = Seq(
(1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
(1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
(2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")
val bits = split($"string", ";")
val kv = split($"pair", "=")
df
.withColumn("bits", bits) // Split column by `;`
.withColumn("pair", explode($"bits")) // Explode into multiple rows
.withColumn("key", kv(0)) // Extract key
.withColumn("val", kv(1)) // Extract value
// Pivot to wide format
.groupBy("code", "date")
.pivot("key")
.agg(first("val"))
// +----+----+-------+-------+-------+-------+-------+
// |code|date| key1| key2| key3| key4| key5|
// +----+----+-------+-------+-------+-------+-------+
// | 1| 2|value21|value22|value23|value24| null|
// | 1| 1|value11|value12|value13|value14| null|
// | 2| 4| null| null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can easily be adjusted to handle the case where (code, date) pairs are not unique, and you can process more complex string patterns using a UDF.
Depending on the language you use and the number of columns, you may be better off using an RDD or Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF.
import org.apache.spark.sql.functions.udf

val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
  case Array(k, v) => (k, v)
}.toMap)

val getKeys = udf((pairs: Map[String, String]) => pairs.keys.toList)

// Parse the strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))

val keys = withKVs
  .select(explode(getKeys($"kvs"))) // Explode the keys, one per row
  .distinct                         // Keep unique keys
  .as[String]
  .collect.sorted.toList            // Collect and sort

// Build a list of expressions for the subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))

withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
.flatMap(_.getAs[Map[String, String]]("kvs").keys)
.distinct
.collect.sorted.toList
Spark SQL: how to execute a SQL command in a loop for every record in an input DataFrame
I have a DataFrame with the following schema:
%> input.printSchema
root
|-- _c0: string (nullable = true)
|-- id: string (nullable = true)
I have another DataFrame on which I need to execute a SQL command:
testDf.registerTempTable("mytable")
%>testDf.printSchema
root
|-- _1: integer (nullable = true)
sqlContext.sql(s"SELECT * from mytable WHERE _1=$id").show()
$id should come from the input DataFrame, and the SQL command should execute for all ids in the input table.
Assuming you can work with a single new DataFrame containing all the rows in testDf that match the values in the id column of input, you can do an inner join operation, as stated by Alberto:
val result = input.join(testDf, input("id") === testDf("_1"))
result.show()
Now, if you want a new, different DataFrame for each distinct value present in testDf, the problem is considerably harder. If this is the case, I would suggest you make sure the data in your lookup table can be collected as a local list, so you can loop through its values and create a new DataFrame for each one, as you already intended (this is not recommended):
val localArray: Array[Int] = input.rdd.map { case Row(_, id: String) => id.toInt }.collect
val result: Array[DataFrame] = localArray.map {
  i => testDf.where(testDf("_1") === i)
}
Anyway, unless the lookup table is very small, I suggest that you adapt your logic to work with the single joined DataFrame of my first example.
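If the lookup side really is small, another way to stay on the single-DataFrame path is a broadcast join hint; a sketch, assuming input is the small side (joined is an illustrative name):
import org.apache.spark.sql.functions.broadcast

// Broadcasting the small lookup side keeps the join cheap without
// collecting anything to the driver
val joined = testDf.join(broadcast(input), testDf("_1") === input("id"))
joined.show()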