Reading a .txt file with a colon (:) in Spark 2.4 - apache-spark

I am trying to read a .txt file in Spark 2.4 and load it into a DataFrame.
The file data looks like this (under a single manager there are many employees):
Manager_21: Employee_575,Employee_2703,
Manager_11: Employee_454,Employee_158,
Manager_4: Employee_1545,Employee_1312
Code I have written in Scala (Spark 2.4):
val df = spark.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.load("D:/path/myfile.txt")
df.printSchema()
Unfortunately, when printing the schema, every employee shows up as its own column instead of being grouped under its manager:
root
|-- Manager_21: Employee_575: string (nullable = true)
|-- Employee_454: string (nullable = true)
|-- Employee_1312: string (nullable = true)
... etc.
I am not sure if this is possible in Spark/Scala.
Expected output:
All employees of a manager in the same column.
For example, Manager_21 has 2 employees and both are in the same column.
Or: how can we see which employees are under a particular manager?
Manager_21   |Manager_11  |Manager_4
Employee_575 |Employee_454|Employee_1545
Employee_2703|Employee_158|Employee_1312
Is it possible to do this some other way? Please suggest.
Thanks

Try using spark.read.text, then groupBy and pivot to get the desired result.
Example:
val df=spark.read.text("<path>")
df.show(10,false)
//+--------------------------------------+
//|value |
//+--------------------------------------+
//|Manager_21: Employee_575,Employee_2703|
//|Manager_11: Employee_454,Employee_158 |
//|Manager_4: Employee_1545,Employee_1312|
//+--------------------------------------+
import org.apache.spark.sql.functions._
df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))").
select("col.*").
show()
//+-------------+-------------+--------------+
//| Manager_11| Manager_21| Manager_4|
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//| Employee_158|Employee_2703| Employee_1312|
//+-------------+-------------+--------------+
UPDATE:
val df=spark.read.text("<path>")
val df1=df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))")
//create temp table
df1.createOrReplaceTempView("tmp_table")
sql("select col.* from tmp_table").show(10,false)
//+-------------+-------------+--------------+
//|Manager_11 |Manager_21 |Manager_4 |
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//|Employee_158 |Employee_2703|Employee_1312 |
//+-------------+-------------+--------------+
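If the goal is simply to see which employees fall under a particular manager (rather than pivoting managers into columns), a simpler sketch is to split and explode each line; the manager/employee column names here are my own:
import org.apache.spark.sql.functions._
spark.read.text("<path>").
withColumn("manager", trim(split(col("value"), ":")(0))).
withColumn("employee", explode(split(trim(split(col("value"), ":")(1)), ","))).
filter(col("employee") =!= ""). //drop empty entries caused by trailing commas
select("manager", "employee").
show(false)
//expected: one (manager, employee) row per pair, e.g. Manager_21 | Employee_575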

Related

Spark: load csv file with a different schema

I have a csv file like this:
product price,product origin,phone number
20,US,200200
I would like to load the csv file using a new schema, so that my dataset looks like this:
|price | origin | number |
|20    | US     | 200200 |
I tried to create a schema using StructField:
sparkSession.read().format("csv")
.option("header", "false")
.option("delimiter", ",")
.schema(myScheme).load(csv)
but what I got was this:
|price  | origin | number |
|200200 | US     | 20     |
What is the correct way to load the csv with a new schema and the correct column order?
Using a csv file with the exact contents that you posted in your question:
product price,product origin,phone number
20,US,200200
You should be able to create a schema by using types from org.apache.spark.sql.types._. You could do something like this:
import org.apache.spark.sql.types._
val mySchema = new StructType()
.add("product price", IntegerType)
.add("product origin", StringType)
.add("phone number", StringType)
val df = spark
.read
.option("header", "true")
.schema(mySchema)
.csv("./simpleCSV.csv")
df.show
+-------------+--------------+------------+
|product price|product origin|phone number|
+-------------+--------------+------------+
| 20| US| 200200|
+-------------+--------------+------------+
df.printSchema
root
|-- product price: integer (nullable = true)
|-- product origin: string (nullable = true)
|-- phone number: string (nullable = true)
Hope this helps!
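If you also want the shorter column names from your expected output (price, origin, number), one simple option is to rename the columns after loading; a minimal sketch based on the df above:
val renamed = df.toDF("price", "origin", "number") //positional rename of all three columns
renamed.show
+-----+------+------+
|price|origin|number|
+-----+------+------+
|   20|    US|200200|
+-----+------+------+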

Is there any way to handle time in pyspark?

I have a string with 6 characters which should be loaded into SQL Server as the TIME data type.
But Spark doesn't have a time data type. I have tried a few ways, but I cannot get the data back as a time type.
I am reading the data as a string, converting it to a timestamp, and then trying to extract the time values, but the value is returned as a string again.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).printSchema()
root
|-- time_col: timestamp (nullable = true)
|-- tim2: string (nullable = true)
And the data looks like this, but with different data types.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).show(5)
+-------------------+------+
| time_col| tim2|
+-------------------+------+
|1970-01-01 14:44:51|144451|
|1970-01-01 14:48:37|144837|
|1970-01-01 14:46:10|144610|
|1970-01-01 11:46:39|114639|
|1970-01-01 17:44:33|174433|
+-------------------+------+
Is there any way I can get the tim2 column as a timestamp column, or as a column equivalent to SQL Server's TIME data type?
I think you won't get what you are trying to do; there is no type in PySpark that handles "HH:mm:ss" (see: What data type should be used for a time column).
I'd suggest you keep it as a string.
In my case, I converted it into a timestamp in Spark, and just before sending it to SQL Server I turned it back into a string; that worked fine for me.
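A minimal sketch of that round trip (Scala here, the PySpark calls are the same; time_col is the column name from the question):
import org.apache.spark.sql.functions._
val prepared = df.
withColumn("ts", to_timestamp(col("time_col"), "HHmmss")). //string "144451" -> timestamp
withColumn("time_str", date_format(col("ts"), "HH:mm:ss")) //back to a "14:44:51" string just before writing
//write time_str to the SQL Server TIME column; per the answer above, sending it as a string worked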
Maybe this will help you, but it seems to me that this turns the column into a string:
df.withColumn('TIME', date_format('datetime', 'HH:mm:ss'))
In Scala (Python will be similar):
scala> val df = Seq("144451","144837").toDF("c").select('c.cast("INT").cast("TIMESTAMP"))
df: org.apache.spark.sql.DataFrame = [c: timestamp]
scala> df.show()
+-------------------+
| c|
+-------------------+
|1970-01-02 17:07:31|
|1970-01-02 17:13:57|
+-------------------+
scala> df.printSchema()
root
|-- c: timestamp (nullable = true)

Convert spark dataframe with string column to StructType column

I have a CSV file with the header "message" and rows like:
{"a":1,"b":"hello 1","c":"1234"}
{"a":2,"b":"hello 2","c":"2345"}
I want to convert them into different columns a, b, c.
I tried the following code:
df1 = spark.read.format("csv").option("header","true")
.option("delimiter","^")
.option("inferSchema","false")
.load("testing.csv")
But it is taking it as a string column.
df1.printSchema() --> string
Your file is in JSON format, with the first line being "message".
The first line can be ignored by using the "DROPMALFORMED" mode while reading with Spark's DataFrameReader.
File: json-test.txt
message
{"a":1,"b":"hello 1","c":"1234"}
{"a":2,"b":"hello 2","c":"2345"}
Reading the JSON file while ignoring bad records [the initial record]:
val jsondf = spark.read
.option("multiLine", false)
.option("mode", "DROPMALFORMED")
.json("files/file-reader-test/json-test.txt")
jsondf.show()
output:
+---+-------+----+
| a| b| c|
+---+-------+----+
| 1|hello 1|1234|
| 2|hello 2|2345|
+---+-------+----+
schema :
jsondf.printSchema()
root
|-- a: long (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
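Alternatively, if you want to keep the CSV read from your question and convert the string column into a StructType column afterwards (as the title asks), a from_json sketch could look like this (Scala; the message column name and the field types are assumptions based on the sample rows):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
val jsonSchema = new StructType().add("a", LongType).add("b", StringType).add("c", StringType)
val parsed = df1.
withColumn("parsed", from_json(col("message"), jsonSchema)). //string column -> struct column
select("parsed.*") //flatten into columns a, b, c
parsed.show()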

Writing Spark SQL query on data without header or schema

I want to write a generic script that can run SQL queries on a file that doesn't have a header or pre-defined schema. For example, a file could look like:
Bob,32
Alice, 24
Jane,65
Doug,33
Peter,19
And the SQL query might be:
SELECT COUNT(DISTINCT ??)
FROM temp_table
WHERE ?? > 32
I am wondering what to put in the ??.
You can define a custom schema while reading, like this:
import org.apache.spark.sql.types._
val schema = StructType(
StructField("field1", StringType, true) ::
StructField("field2", IntegerType, true) :: Nil
)
val df = spark.read.format("csv")
.option("sep", ",")
.option("header", "false")
.schema(schema)
.load("examples/src/main/resources/people.csv")
You can also skip the schema, which would end up with default column names (not preferred):
val df = spark.read.format("csv")
.option("sep", ",")
.option("header", "false")
.load("examples/src/main/resources/people.csv")
+-----+-----+
| _c0| _c1|
+-----+-----+
| Bob| 32 |
| .. | ... |
+-----+-----+
With that, you can fill in the column names in your Spark SQL query.
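For example, with the field1/field2 schema above, the query from the question becomes (a sketch; the temp_table name matches the question):
df.createOrReplaceTempView("temp_table")
spark.sql("SELECT COUNT(DISTINCT field1) FROM temp_table WHERE field2 > 32").show()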
It seems the default schema has column names _c0, _c1, etc.
val df = spark.read.format("csv").load("test.txt")
scala> df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
In Spark 2.0:
df.createOrReplaceTempView("temp_table")
spark.sql("SELECT COUNT(DISTINCT _c1) FROM temp_table WHERE cast(_c1 as int) > 32")

Spark Adding a column consisting of a tuple to a dataframe

I am using Spark 1.6 and I want to add a column to a DataFrame. The new column is actually a constant sequence: Seq("-0", "-1", "-2", "-3").
Here is my original dataframe:
scala> df.printSchema()
root
|-- user_name: string (nullable = true)
|-- test_name: string (nullable = true)
df.show()
+---------+---------+
|user_name|test_name|
+---------+---------+
|    user1|      SAT|
|    user9|      GRE|
|    user7|     MCAT|
+---------+---------+
I want to add this extra column (attempt) so that the new dataframe becomes:
+---------+---------+------------------------+
|user_name|test_name|attempt                 |
+---------+---------+------------------------+
|user1    |SAT      |Seq("-0","-1","-2","-3")|
|user9    |GRE      |Seq("-0","-1","-2","-3")|
|user7    |MCAT     |Seq("-0","-1","-2","-3")|
+---------+---------+------------------------+
How do I do that?
You can use the withColumn function:
import org.apache.spark.sql.functions._
df.withColumn("attempt", lit(Array("-0","-1","-2","-3")))
You can add it using typedLit (Spark 2.2+):
import org.apache.spark.sql.functions.typedLit
df.withColumn("attempt", typedLit(Seq("-0", "-1", "-2", "-3")))
