Insert into TempView using Spark.sql - apache-spark

How can I do a simple INSERT in Spark SQL?
Spark 2.1
I can run plain SQL inside Spark with spark.sql, but I can't manage to do a simple insert.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.json('/path/people.json')
df.show()
+-----+---------+
|age | name |
+-----+---------+
|null | Michael |
| 30 | Andy |
+-----+---------+
df.createOrReplaceTempView('people')  # create temp view
spark.sql("SELECT * FROM people WHERE age = 30").show()
+-----+---------+
|age | name |
+-----+---------+
| 30 | Andy |
+-----+---------+
So I understand the SQL part, but I don't know how to do an insert.
I have tried every way I can think of.

You don't insert into DataFrames; they are immutable and lazily evaluated.
You need to create a new DataFrame that is the union of the original DataFrame and the new data you want to add to it.
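For example, here is a minimal sketch based on the people view above (the extra row is made up purely for illustration): build a small DataFrame with the new data, union it with the original, and re-register the temp view so later spark.sql queries see the added row.
from pyspark.sql import Row

# Hypothetical new row to "insert"; its schema must match the existing DataFrame.
new_rows = spark.createDataFrame([Row(age=25, name='Carol')])

# union returns a new DataFrame; the original df is left untouched.
df_updated = df.union(new_rows.select('age', 'name'))

# Re-register the temp view so spark.sql sees the appended data.
df_updated.createOrReplaceTempView('people')
spark.sql("SELECT * FROM people").show()
If the data lives in an actual table (for example one created with saveAsTable), a plain INSERT INTO statement through spark.sql works too; a temp view created from a DataFrame generally does not accept INSERT.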

Related

Show first occurrence(s) of a column

I want to use PySpark to create a new dataframe, based on an input dataframe, that keeps only the first occurrence of each distinct value in the "value" column. Would row_number() or a window work? I'm not sure of the best approach, or whether Spark SQL would be better. Basically, the second table below is the output I want: it keeps just the first occurrence of each value from the input. I'm only interested in the first occurrence of the "value" column; if a value is repeated, only show the first one seen.
+-----+----+-----+
|VALUE| DAY|Color|
+-----+----+-----+
|   20| MON| BLUE|
|   20|TUES| BLUE|
|   30| WED| BLUE|
+-----+----+-----+

+-----+----+-----+
|VALUE| DAY|Color|
+-----+----+-----+
|   20| MON| BLUE|
|   30| WED| BLUE|
+-----+----+-----+
Here's how I'd do this without using a window. It will likely perform better on large data sets, as it can use more of the cluster to do the work. In your case you would use 'VALUE' in place of Department and 'DAY' in place of Salary.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
        ("Robert","Sales",4100),("Maria","Finance",3000),
        ("Raman","Finance",3000),("Scott","Finance",3300),
        ("Jen","Finance",3900),("Jeff","Marketing",3000),
        ("Kumar","Marketing",2000)]
df = spark.createDataFrame(data, ["Name","Department","Salary"])

unGroupedDf = df.select(
    df["Department"],
    f.struct(                            # make a struct out of all the record elements
        df["Salary"].alias("Salary"),    # Salary first, so the array sort uses it
        df["Department"].alias("Dept"),
        df["Name"].alias("Name")
    ).alias("record"))

(unGroupedDf.groupBy("Department")                     # group
    .agg(f.collect_list("record").alias("record"))     # gather all the elements in a group
    .select(
        f.reverse(                                     # reverse to make the sort descending
            f.array_sort(f.col("record"))              # sort the array of structs ascending
        )[0].alias("record"))                          # grab the "max" element of the array
    .select("record.*")                                # use the struct fields as columns
    .show())
+------+---------+-------+
|Salary|     Dept|   Name|
+------+---------+-------+
|  4600|    Sales|Michael|
|  3900|  Finance|    Jen|
|  3000|Marketing|   Jeff|
+------+---------+-------+
It appears to me that you want to drop duplicated rows by VALUE. If so, use dropDuplicates:
df.dropDuplicates(['VALUE']).show()
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
| 20|MON| BLUE|
| 30|WED| BLUE|
+-----+---+-----+
Here's how to do it with a window. This example orders by Salary; in your case I think you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("Department").orderBy(col("Salary"))
df.withColumn("row",row_number().over(w2)) \
.filter(col("row") == 1).drop("row") \
.show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+
Yes, you'd need to come up with a way of ordering the days, but I think you can see that it's possible and that you picked the right tool. I always like to warn people: window functions require a shuffle, and a window without a partitionBy (or with heavily skewed partitions) can pull most of the data onto one executor, which is not particularly efficient. On small datasets this is likely fine; on larger data sets it may take far too long to complete.

Dataframe in Pyspark

I was dropping a column from a dataframe, and it appeared to be dropped. But after calling the show method again, it seems like the column was not dropped from the dataframe.
Code:
df.drop('Salary').show()
+-----+
| Name|
+-----+
| Arun|
| Joe|
|Jerry|
+-----+
df.show()
+-----+------+
| Name|Salary|
+-----+------+
| Arun| 5000|
| Joe| 6300|
|Jerry| 9600|
+-----+------+
I am using Spark version 2.4.4. Could you please tell me why it's not dropped? I thought it would be like dropping a column from a table in an Oracle database.
The drop method returns a new DataFrame. The original df is not changed by this transformation, so calling df.show() a second time will return the original data with your Salary column.
You need to keep a reference to the new dataframe returned by drop:
df2 = df.drop('Salary')
df2.show()
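If you don't need the original afterwards, a compact variant of the same idea is to rebind the name:
df = df.drop('Salary')   # df now points to the new DataFrame without the Salary column
df.show()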

Create a column with values created from all other columns as a JSON in PySPARK

I have a dataframe as below:
+----------+----------+--------+
|     FNAME|     LNAME|     AGE|
+----------+----------+--------+
|      EARL|     JONES|      35|
|      MARK|      WOOD|      20|
+----------+----------+--------+
I am trying to add a new column, VALUE, to this dataframe so that it looks like this:
+----------+----------+--------+-------------------------------------------+
|     FNAME|     LNAME|     AGE|VALUE                                      |
+----------+----------+--------+-------------------------------------------+
|      EARL|     JONES|      35|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|      MARK|      WOOD|      20|{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
+----------+----------+--------+-------------------------------------------+
I am not able to achieve this using withColumn or any json function.
Any headstart would be appreciated.
Spark: 2.3
Python: 3.7.x
Please consider using the SQL function to_json, which you can find in org.apache.spark.sql.functions.
Here's the solution:
df.withColumn("VALUE", to_json(struct($"FNAME", $"LNAME", $"AGE")))
And you can also avoid specifying the column names explicitly, as follows:
df.withColumn("VALUE", to_json(struct(df.columns.map(col): _*)))
PS: the code I provided is written in Scala, but the logic is exactly the same in Python; you just have to use the corresponding Spark SQL functions, which are available in both languages.
I hope it helps.
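For reference, a rough PySpark equivalent of the Scala snippet above, using the column names from the question, might look like this:
from pyspark.sql.functions import to_json, struct, col

# Explicit column names:
df_with_value = df.withColumn("VALUE", to_json(struct(col("FNAME"), col("LNAME"), col("AGE"))))

# Or build the struct from all columns without listing them by hand:
df_with_value = df.withColumn("VALUE", to_json(struct(*[col(c) for c in df.columns])))
df_with_value.show(truncate=False)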
Scala solution:
val df2 = df.select(
to_json(
map_from_arrays(lit(df.columns), array('*))
).as("value")
)
Python solution (I don't know how to do it for n columns like in Scala, because map_from_arrays is not available in PySpark before 2.4):
import pyspark.sql.functions as f

df.select(
    f.to_json(
        f.create_map(f.lit("FNAME"), df.FNAME, f.lit("LNAME"), df.LNAME, f.lit("AGE"), df.AGE)
    ).alias("value")
).show(truncate=False)
output:
+-------------------------------------------+
|value |
+-------------------------------------------+
|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
+-------------------------------------------+
Achieved using:
df.withColumn("VALUE", to_json(struct([df[x] for x in df.columns])))

Remove rows from dataframe based on condition in pyspark

I have one dataframe with two columns:
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|   1| 2.1|
|   5|52.1|
|   2|62.9|
|  77|33.3|
+----+----+
I would like to create a new dataframe which will take only rows where
"value of col1" > "value of col2"
Just as a note, col1 is of long type and col2 is of double type.
the result should be like this:
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
I think the best way would be to simply use "filter".
df_filtered=df.filter(df.col1>df.col2)
df_filtered.show()
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
Another possible way is to use the where function of the DataFrame.
For example, this (shown in Scala, but the same call works in PySpark):
val output = df.where("col1>col2")
will give you the expected result:
+----+----+
|col1|col2|
+----+----+
| 22|12.2|
| 77|33.3|
+----+----+
The best way to keep rows based on a condition is to use filter, as mentioned by others.
To answer the question as stated in the title, one option to remove rows based on a condition is to use left_anti join in Pyspark.
For example, to delete all rows with col1 > col2, use:
rows_to_delete = df.filter(df.col1>df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
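If the rows to remove are fully described by the condition itself, a simpler sketch is to keep the complement with a negated filter, with no join key needed (note that rows where the comparison evaluates to NULL are dropped here, whereas the anti-join would keep them):
# Keep every row that does NOT satisfy the deletion condition.
df_with_rows_deleted = df.filter(~(df.col1 > df.col2))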
You can also use the SQL interface to simplify the task.
First register the dataframe as a temp view, for example:
df.createOrReplaceTempView("tbl1")
then run the SQL, like:
sqlContext.sql("select * from tbl1 where col1 > col2")

Why does PySpark SQL not count correctly with a GROUP BY clause?

I load a parquet file into a SQL context like this:
sqlCtx = SQLContext(sc)
rdd_file = sqlCtx.read.parquet("hdfs:///my_file.parquet")
rdd_file.registerTempTable("type_table")
Then I run this simple query:
sqlCtx.sql('SELECT count(name), name from type_table group by name order by count(name)').show()
The result:
+----------------+----------+
|count(name) |name |
+----------------+----------+
| 0| null|
| 226307| x|
+----------------+----------+
However, if I use groupBy on the dataframe, I get a different result:
sqlCtx.sql("SELECT name FROM type_table").groupBy("name").count().show()
+----------+------+
| name | count|
+----------+------+
| x|226307|
| null|586822|
+----------+------+
The count of x is the same for the two methods, but the null count is quite different. It seems like the SQL statement doesn't count nulls correctly with GROUP BY. Can you point out what I did wrong?
Thanks,
count(name) excludes null values; if you use count(*) it will count the rows with null names as well.
Try the query below:
sqlCtx.sql('SELECT count(*), name from type_table group by name order by count(*)').show()
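The same distinction shows up in the DataFrame API; as a rough sketch on the dataframe from the question:
import pyspark.sql.functions as f

# count("name") skips NULL names; count(lit(1)) counts every row in the group.
rdd_file.groupBy("name").agg(
    f.count("name").alias("non_null_names"),
    f.count(f.lit(1)).alias("all_rows")
).show()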
