Here is the spark-shell script I am using to convert CSV data into Parquet:
import org.apache.spark.sql.types._;
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").load("/uploads/01ff5191-27c4-42db-a8e0-0d6594de3a5d/Worker_Snapshot_50_100000.csv");
val schema = StructType(Array(
StructField("CF_EmplID", StringType,true),
StructField("Worker", StringType,true),
StructField("CF_Sup Org Level 1", StringType,true),
StructField("CF_Sup Org Level 2", StringType,true),
StructField("CF_Sup Org Level 3", StringType,true),
StructField("CF_Sup Org Level 4", StringType,true),
StructField("Business Process Transaction", StringType,true),
StructField("Record Date", StringType,true),
StructField("CF_Fiscal Period", StringType,true),
StructField("Business Process Type", StringType,true),
StructField("Business Process Reason", StringType,true),
StructField("Active Status", BooleanType,true),
StructField("Age Group", StringType,true),
StructField("Annual Base Pay", StringType,true),
StructField("Base Pay Segment", StringType,true),
StructField("Compa-Ratio", StringType,true),
StructField("Company", StringType,true),
StructField("Compensation Grade", BooleanType,true),
StructField("Contingent Worker Type", StringType,true),
StructField("Cost Center", StringType,true),
StructField("Current Rating", StringType,true),
StructField("Employee Type", StringType,true),
StructField("Ending Headcount", IntegerType,true),
StructField("Ethnicity", StringType,true),
StructField("Exempt", BooleanType,true),
StructField("FTE", StringType,true),
StructField("Gender", StringType,true),
StructField("Highest Degree", StringType,true),
StructField("Hire Count", IntegerType,true),
StructField("Hire Year Text", IntegerType,true),
StructField("Hiring Source", StringType,true),
StructField("Involuntary Termination", StringType,true),
StructField("Involuntary Termination Count", IntegerType,true),
StructField("Is Critical Job", BooleanType,true),
StructField("Is High Loss Impact Risk", BooleanType,true),
StructField("Is High Potential", BooleanType,true),
StructField("Is Manager", BooleanType,true),
StructField("Is Retention Risk", BooleanType,true),
StructField("Job Category", StringType,true),
StructField("Job Code", IntegerType,true),
StructField("Job Family", IntegerType,true),
StructField("Job Family Group", StringType,true),
StructField("Job Profile", StringType,true),
StructField("Length of Service in Years including Partial Year", StringType,true),
StructField("Location", StringType,true),
StructField("Location - Country", StringType,true),
StructField("Loss Impact", StringType,true),
StructField("Management Level", StringType,true),
StructField("Manager", StringType,true),
StructField("Manager Count", IntegerType,true)
));
val dataFrame = spark.createDataFrame(df.rdd, schema)
var newDf = dataFrame
for (col <- dataFrame.columns) {
  newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
}
newDf.write.parquet("/output_dir/parquet")
Seems pretty straightforward so far, but I am running into an exception that seems to be about trying to parse a non-int value into an int field.
Here is the exception I am getting:
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
... 8 more
Caused by: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr22$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_9$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 20 more
Am I doing something wrong when applying the schema to the DataFrame? I tried using the "inferSchema" option in sqlContext.read, but that seems to guess the types incorrectly.
Instead of
val dataFrame = spark.createDataFrame(df.rdd, schema)
use:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load(...);
Alternatively, compare the schema of df with your custom schema and cast each column so that the data types in the two schemas match before calling createDataFrame.
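For example, here is a minimal sketch of that cast-based approach, assuming df was read with header "true" so every column starts out as a string (the casted variable name is just for illustration):
import org.apache.spark.sql.functions.col

// Cast each all-string column of df to the type declared in the custom schema,
// so the row values match the schema before writing to Parquet.
val casted = schema.fields.foldLeft(df) { (acc, field) =>
  acc.withColumn(field.name, col(field.name).cast(field.dataType))
}
casted.printSchema() // verify the resulting types against the custom schema
casted.write.parquet("/output_dir/parquet")
This avoids createDataFrame(df.rdd, schema) entirely; that said, the .schema(schema) read shown above is usually the simpler fix, since the CSV reader then parses each field directly into the declared type.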
I have a table with 10 columns, StepDet 1 - StepDet 10, and each one contains an embedded table. I want to expand each embedded table, but not with all of its columns: I only want to expand Step Name and Step 1 - 10.
I tried the code below, but I get an error: Expression.Error: The name 'prevTable' wasn't recognized. Make sure it's spelled correctly.
I would appreciate any help to move forward.
let
// Load the query
Source = Source_name,
// Generate the list of columns to expand
columnsToExpand = List.Generate(
() => [i = 1],
each [i] <= 10,
each [i = [i] + 1],
each "StepDet " & Text.From([i])
),
// Expand the columns
expandedTable = List.Last(List.Generate(
() => [i = 0, prevTable = Source],
each [i] <= List.Count(columnsToExpand)-1,
each [
i = [i] + 1,
prevTable = Table.ExpandTableColumn(
prevTable,
columnsToExpand{[i]},
{"Step Name", "ActionDet 1", "ActionDet 2", "ActionDet 3", "ActionDet 4", "ActionDet 5", "ActionDet 6", "ActionDet 7", "ActionDet 8", "ActionDet 9", "ActionDet 10"},
List.Transform(
{"Step Name", "ActionDet 1", "ActionDet 2", "ActionDet 3", "ActionDet 4", "ActionDet 5", "ActionDet 6", "ActionDet 7", "ActionDet 8", "ActionDet 9", "ActionDet 10"},
each columnsToExpand{[i]} & "." & _
)
)
],
each [prevTable]
))
in
expandedTable
Does something like this get you closer?
let Source = < >,
columnsToExpand = List.Transform({1 .. 10}, each "StepDet "&Text.From(_)) & {"Step Name"},
#"Unpivoted Columns" = Table.UnpivotOtherColumns(Source, List.Difference(Table.ColumnNames(Source),columnsToExpand), "Attribute", "Value"),
ColumnsToExpand = List.Distinct(List.Combine(List.Transform(Table.Column(#"Unpivoted Columns", "Value"), each if _ is table then Table.ColumnNames(_) else {}))),
#"Expanded" = Table.ExpandTableColumn(#"Unpivoted Columns", "Value",ColumnsToExpand ,ColumnsToExpand )
in #"Expanded"
I am able to read data from a Kafka topic and print it to the console using Spark Structured Streaming.
I want the data to be in DataFrame format.
Here is my code:
from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.appName("StructuredSocketRead") \
.getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
lines = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","********") \
.option("subscribe","******") \
.option("startingOffsets", "earliest") \
.load()
readable = lines.selectExpr("CAST(value AS STRING)")
query = readable \
.writeStream \
.outputMode("append") \
.format("console") \
.option("truncate", "False") \
.start()
query.awaitTermination()
The output is in JSON format. How do I convert this into a DataFrame? Please find the output below:
{"items": [{"SKU": "23565", "title": "EGG CUP MILKMAID HELGA ", "unit_price": 2.46, "quantity": 2}], "type": "ORDER", "country": "United Kingdom", "invoice_no": 154132541847735, "timestamp": "2020-11-02 20:56:01"}
IIUC, you can use explode() and getItem() to create a DataFrame out of the JSON.
Create the DataFrame:
a_json = {"items": [{"SKU": "23565", "title": "EGG CUP MILKMAID HELGA ", "unit_price": 2.46, "quantity": 2}], "type": "ORDER", "country": "United Kingdom", "invoice_no": 154132541847735, "timestamp": "2020-11-02 20:56:01"}
df = spark.createDataFrame([(a_json)])
df.show(truncate=False)
+--------------+---------------+-------------------------------------------------------------------------------------+-------------------+-----+
|country       |invoice_no     |items                                                                                |timestamp          |type |
+--------------+---------------+-------------------------------------------------------------------------------------+-------------------+-----+
|United Kingdom|154132541847735|[[quantity -> 2, unit_price -> 2.46, title -> EGG CUP MILKMAID HELGA , SKU -> 23565]]|2020-11-02 20:56:01|ORDER|
+--------------+---------------+-------------------------------------------------------------------------------------+-------------------+-----+
Logic Here
df = df.withColumn("items_array", F.explode("items"))
df = df.withColumn("quantity", df.items_array.getItem("quantity")).withColumn("unit_price", df.items_array.getItem("unit_price")).withColumn("title", df.items_array.getItem("title")).withColumn("SKU", df.items_array.getItem("SKU"))
df.select("country", "invoice_no", "quantity","unit_price", "title", "SKU", "timestamp", "timestamp").show(truncate=False)
+--------------+---------------+--------+----------+-----------------------+-----+-------------------+-------------------+
|country |invoice_no |quantity|unit_price|title |SKU |timestamp |timestamp |
+--------------+---------------+--------+----------+-----------------------+-----+-------------------+-------------------+
|United Kingdom|154132541847735|2 |2.46 |EGG CUP MILKMAID HELGA |23565|2020-11-02 20:56:01|2020-11-02 20:56:01|
+--------------+---------------+--------+----------+-----------------------+-----+-------------------+-------------------+
I'm trying to teach myself Power Query and can't get my head around this particular process.
I understand there may be other, perhaps more efficient methods of attaining the desired result, but my purpose here is to understand the process of performing multiple operations on each table in a list of tables, where the number of entries in the list is unknown and/or large.
My original data has pairs of information [Credit, Name] in adjacent columns with an unknown number of pairs.
Since it is a table, the different column/pairs have different Names.
Credit|Name|Credit1|Name1|...
If I demote the headers and transpose the table, the column headers will wind up in Column 1, and I can strip off the differentiating digit.
Using Table.Split, I can then create a number of tables where each pair of columns has identical headers.
I can then combine these tables to create a single, two column table, where I can Group and aggregate to get the results.
My problem is that I have not been able to figure out how to do the:
Table.PromoteHeaders(Table.Transpose(table))
operation on each table.
This M-code produces the desired result for the four pairs of columns in the provided data, but is clearly not scalable since the number of tables needs to be known in advance.
let
//Create the table
Tbl1= Table.FromRecords({
[Credit = 1, Name = "Bob", Credit2 = 2, Name2 = "Jim", Credit3 = 1, Name3 = "George", Credit4 = 1.75, Name4="Phil"],
[Credit = 2, Name = "Phil", Credit2 = 4, Name2="George", Credit3 = 2.5, Name3 = "Stephen",Credit4 = 6, Name4="Bob"]
}),
//Demote headers and transpose
transpose1 = Table.Transpose( Table.DemoteHeaders(Tbl1)),
//Create matching names for what will eventually be the final Column Headers
#"Split Column by Character Transition" = Table.SplitColumn(transpose1, "Column1", Splitter.SplitTextByCharacterTransition((c) => not List.Contains({"0".."9"}, c), {"0".."9"}), {"Column1.1", "Column1.2"}),
#"Removed Columns" = Table.RemoveColumns(#"Split Column by Character Transition",{"Column1.2"}),
//Create multiple tables from above
multTables = Table.Split(#"Removed Columns",2),
/*transpose and promote the headers for each table
HOW can I do this in a single step when I don't know how many tables there might be???
*/
tbl0 = Table.PromoteHeaders(Table.Transpose(multTables{0}),[PromoteAllScalars=true]),
tbl1 = Table.PromoteHeaders(Table.Transpose(multTables{1}),[PromoteAllScalars=true]),
tbl2 = Table.PromoteHeaders(Table.Transpose(multTables{2}),[PromoteAllScalars=true]),
tbl3 = Table.PromoteHeaders(Table.Transpose(multTables{3}),[PromoteAllScalars=true]),
combTable = Table.Combine({tbl0,tbl1,tbl2,tbl3})
in
combTable
Original Table
Demoted headers / Transposed table
Desired Result
Any help would be appreciated.
You could also try replacing this part of your code:
tbl0 = Table.PromoteHeaders(Table.Transpose(multTables{0}),[PromoteAllScalars=true]),
tbl1 = Table.PromoteHeaders(Table.Transpose(multTables{1}),[PromoteAllScalars=true]),
tbl2 = Table.PromoteHeaders(Table.Transpose(multTables{2}),[PromoteAllScalars=true]),
tbl3 = Table.PromoteHeaders(Table.Transpose(multTables{3}),[PromoteAllScalars=true]),
combTable = Table.Combine({tbl0,tbl1,tbl2,tbl3})
in
combTable
with this:
Custom1 = List.Transform(multTables, each Table.PromoteHeaders( Table.Transpose(_),[PromoteAllScalars=true])),
#"Converted to Table" = Table.FromList(Custom1, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Expanded Column1" = Table.ExpandTableColumn(#"Converted to Table", "Column1", {"Credit", "Name"}, {"Credit", "Name"})
in
#"Expanded Column1"
A bit clunky, but it seems to work with any number of rows and any number of paired two-column groups.
First, a bunch of indexes modified in different ways, then a filter into two tables, an unpivot, and a merge:
let Tbl1= Table.FromRecords({
[Credit = 1, Name = "Bob", Credit2 = 2, Name2 = "Jim", Credit3 = 1, Name3 = "George", Credit4 = 1.75, Name4="Phil"],
[Credit = 2, Name = "Phil", Credit2 = 4, Name2="George", Credit3 = 2.5, Name3 = "Stephen",Credit4 = 6, Name4="Bob"],
[Credit = 3, Name = "Sam", Credit2 = 5, Name2="Allen", Credit3 = 3.5, Name3 = "Ralph",Credit4 = 7, Name4="Nance"]
}),
#"Transposed Table" = Table.Transpose(Tbl1),
#"Added Index" = Table.AddIndexColumn(#"Transposed Table", "Index", 0, .5),
#"Added Custom" = Table.AddColumn(#"Added Index", "Index2", each Number.RoundDown([Index])),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Index"}),
#"Added Index1" = Table.AddIndexColumn(#"Removed Columns", "Index", 0, 1),
#"Added Custom1" = Table.AddColumn(#"Added Index1", "Custom", each Number.Mod([Index],2)),
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"Index"}),
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Removed Columns1", {"Index2", "Custom"}, "Attribute", "Value"),
#"Filtered Rows" = Table.SelectRows(#"Unpivoted Other Columns", each ([Custom] = 0)),
#"Filtered Rows2" = Table.SelectRows(#"Unpivoted Other Columns", each ([Custom] = 1)),
#"Merged Queries" = Table.NestedJoin(#"Filtered Rows2",{"Index2", "Attribute"},#"Filtered Rows",{"Index2", "Attribute"},"Filtered Rows",JoinKind.LeftOuter),
#"Expanded Filtered Rows" = Table.ExpandTableColumn(#"Merged Queries", "Filtered Rows", {"Value"}, {"Value.1"}),
#"Removed Columns2" = Table.RemoveColumns(#"Expanded Filtered Rows",{"Index2", "Custom", "Attribute"})
in #"Removed Columns2"
Another way of doing it is to create two tables by selecting and unpivoting groups of columns based on their column names, then combining them using a custom column referring to the row index in each table:
let Tbl1= Table.FromRecords({
[Credit = 1, Name = "Bob", Credit2 = 2, Name2 = "Jim", Credit3 = 1, Name3 = "George", Credit4 = 1.75, Name4="Phil"],
[Credit = 2, Name = "Phil", Credit2 = 4, Name2="George", Credit3 = 2.5, Name3 = "Stephen",Credit4 = 6, Name4="Bob"],
[Credit = 3, Name = "Sam", Credit2 = 5, Name2="Allen", Credit3 = 3.5, Name3 = "Ralph",Credit4 = 7, Name4="Nance"]
}),
Credit = List.Select(Table.ColumnNames(Tbl1), each Text.Contains(_, "Credit")),
Name = List.Select(Table.ColumnNames(Tbl1), each Text.Contains(_, "Name")),
// create table of just Names with index
#"Removed Columns1" = Table.RemoveColumns(Tbl1,Credit),
#"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1),
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"),
// create table of just Credits with index
#"Removed Columns2" = Table.RemoveColumns(Tbl1,Name),
#"Added Index2" = Table.AddIndexColumn(#"Removed Columns2", "Index", 0, 1),
#"Unpivoted Other Columns2" = Table.UnpivotOtherColumns(#"Added Index2", {"Index"}, "Attribute", "Value"),
#"Added Index1" = Table.AddIndexColumn(#"Unpivoted Other Columns2", "Index.1", 0, 1),
//merge two table together and remove excess columns
#"Added Custom" = Table.AddColumn(#"Added Index1", "Custom", each #"Unpivoted Other Columns"{[Index.1]}[Value]),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Index", "Attribute", "Index.1"})
in #"Removed Columns"
If I have the following source:
#"My Source" = Table.FromRecords({
[Name="Jared Smith", Age=24],
[Name = "Tom Brady", Age=44],
[Name="Hello Tom", Age = null],
[Name = "asdf", Age = "abc"]
}),
How would I add a new column from a list of values, for example:
Table.AddColumn(#"My Source", "New Col", {'x', 'y', 'z', null})
Now my table would have three columns. How could this be done?
Here's another way. It starts similarly to the approach used by Ron, by adding an index, but then instead of using merge it just uses the index as a reference to the appropriate list item.
let
Source1 = Table.FromRecords({
[Name="Jared Smith", Age=24],
[Name = "Tom Brady", Age=44],
[Name="Hello Tom", Age = null],
[Name = "asdf", Age = "abc"]
}),
#"Added Index" = Table.AddIndexColumn(Source1, "Index", 0, 1),
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each {"x", "y", "z", null}{[Index]}),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Index"})
in
#"Removed Columns"
I'm a PQ beginner, so there may be more efficient methods, but here's one:
Add an Index column to each of the tables
Merge the two tables, using the Index column as the key
Delete the Index column
let
Source1 = Table.FromRecords({
[Name="Jared Smith", Age=24],
[Name = "Tom Brady", Age=44],
[Name="Hello Tom", Age = null],
[Name = "asdf", Age = "abc"]
}),
#"Added Index" = Table.AddIndexColumn(Source1, "Index", 0, 1),
Source2 = Table.FromRecords({
[New="x"],
[New = "y"],
[New = "z"],
[New = null]
}),
#"Added Index2" = Table.AddIndexColumn(Source2, "Index", 0, 1),
Merge = Table.Join(#"Added Index", "Index",#"Added Index2", "Index"),
#"Removed Columns" = Table.RemoveColumns(Merge,{"Index"})
in
#"Removed Columns"
Reading this post, I wonder how we can group a Dataset but with multiple columns.
Like:
val test = Seq(("New York", "Jack", "jdhj"),
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
I would like to get
Chicago, [("David", "ff"), ("Andrew", "ddd")]
Create a case class as below
case class TestData (location: String, name: String, value: String)
Dummy Data
val test = Seq(("New York", "Jack", "jdhj"),
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
//change each row to TestData object
.map(x => TestData(x._1, x._2, x._3))
.toDS() // create dataset from above data
Output as you require
test.groupBy($"location")
.agg(collect_list(struct("name", "value")).as("data"))
.show(false)
Output:
+-----------+--------------------------------------------+
|location   |data                                        |
+-----------+--------------------------------------------+
|Los Angeles|[[Tom,ff]]                                  |
|Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Chicago    |[[David,ff], [Andrew,ddd]]                  |
|Houston    |[[John,dd]]                                 |
|New York   |[[Jack,jdhj]]                               |
+-----------+--------------------------------------------+
I have suggested a case class way in the link that you have provided in the question. Here's something different.
RDD way
You can simply do the following
val rdd = sc.parallelize(test) // creating an RDD from the test Seq of tuples in the question
val resultRdd = rdd.groupBy(x => x._1) // grouping by the first element (the location)
  .mapValues(x => x.map(y => (y._2, y._3))) // collecting the second and third elements in each group
resultRdd.foreach(println) should give you
(New York,List((Jack,jdhj)))
(Houston,List((John,dd)))
(Chicago,List((David,ff), (Andrew,ddd)))
(Detroit,List((Michael,fff), (Peter,dd), (George,dkdjkd)))
(Los Angeles,List((Tom,ff)))
Converting rdd to dataframe
If you require output in table format, you can just call .toDF() after some manipulation:
val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF()
df.show(false) should give you
+-----------+--------------------------------------------+
|_1         |_2                                          |
+-----------+--------------------------------------------+
|New York   |[[Jack,jdhj]]                               |
|Houston    |[[John,dd]]                                 |
|Chicago    |[[David,ff], [Andrew,ddd]]                  |
|Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Los Angeles|[[Tom,ff]]                                  |
+-----------+--------------------------------------------+
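As a small optional tweak (not part of the original answer), you can pass column names to toDF so the headers are not the default _1 and _2:
val named = resultRdd.map(x => (x._1, x._2.toArray)).toDF("location", "data")
named.show(false)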