I am trying to do some analysis with Spark. I tried the same query with foreach, which shows the results correctly, but if I use show or run it through SQL the output is weird; it does not show anything.
sqlContext.sql("select distinct device from TestTable1 where id = 23233").collect.foreach(println)
[ipad]
[desktop]
[playstation]
[iphone]
[android]
[smarTv]
gives the proper devices, but if I just use show or any SQL:
sqlContext.sql("select distinct device from TestTable1 where id = 23233").show()
%sql
select distinct device from TestTable1 where id = 23233
+-----------+
|device |
+-----------+
| |
| |
|ion|
| |
| |
| |
+-----------+
I need graphs and charts, so I would like to use %sql, but it is giving these weird results. Does anyone have any idea why I am getting this?
show is a formatted output of your data, whereas collect.foreach(println) is merely printing the Row data. They are two different things. If you want to format your data in a specific way, then stick with foreach...keeping in mind you are printing a sequence of Row. You'll have to pull the data out of the row if you want to get your own formatting for each column.
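For example, here is a minimal PySpark sketch of pulling the column out of each Row yourself, using the same query; the formatting is then entirely up to you:
# pull the 'device' field out of each Row and apply whatever formatting you want
rows = sqlContext.sql(
    "select distinct device from TestTable1 where id = 23233").collect()
for row in rows:
    print(row["device"])  # row.device also works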
I can probably provide more specific information if you provide the version of spark and zeppelin that you are using.
You stated that you are using %sql because you need Zeppelin's graphs and charts--i.e. you wouldn't be swapping to %sql if you didn't have to.
You can just stick with using Spark dataframes by using z.show(), for example:
%pyspark
df = sqlContext.createDataFrame([
    (23233, 'ipad'),
    (23233, 'ipad'),
    (23233, 'desktop'),
    (23233, 'playstation'),
    (23233, 'iphone'),
    (23233, 'android'),
    (23233, 'smarTv'),
    (12345, 'ipad'),
    (12345, 'palmPilot'),
], ('id', 'device'))
foo = df.filter('id = 23233').select('device').distinct()
z.show(foo)
In the above, z.show(foo) renders the default Zeppelin table view, with options for the other chart types.
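As a usage note, you can also pass the result of sqlContext.sql(...) straight to z.show() instead of switching to %sql:
%pyspark
# same Zeppelin table/chart options as %sql, driven from the DataFrame API
z.show(sqlContext.sql("select distinct device from TestTable1 where id = 23233"))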
I'm working on a data lake where I use AWS Glue for ETL with PySpark. I ran into an issue with a few records where the values end up in the wrong column after the resulting DF is written to a file.
The first two rows in the following example have values in the wrong columns, e.g. customer_postal_code containing lb. The rows after that show correct values:
+-----------------+--------------------+----------------+------------+-------------+
|total_weight_unit|customer_postal_code|from_postal_code|from_density|service_level|
+-----------------+--------------------+----------------+------------+-------------+
|4.070000171661377| lb| 96863| 17406.0| 447.0|
|12.34000015258789| lb| 96863| 17406.0| 447.0|
| lb| 78665| 76051| 1258.0| standard|
| lb| 06708| 17406| 447.0| standard|
| lb| 67357| 63376| 1705.0| standard|
| lb| 91730| 17406| 447.0| standard|
I'm using a custom transform to return a DynamicFrame with the DF in the shape I'm expecting like this:
extracted_fields_node = extract_db_fields(
glue_context,
DynamicFrameCollection(
{"source_node": source_node},
glue_context
),
)
Here's the definition of extract_db_fields:
def extract_db_fields(glue_context: GlueContext, dfc: DynamicFrameCollection) -> DynamicFrameCollection:
    data_frame = dfc.select(list(dfc.keys())[0]).toDF()
    df_base = extract_db_fields_df(data_frame)
    glue_df = DynamicFrame.fromDF(
        df_base, glue_context, "extract_partner_fields"
    )
    return DynamicFrameCollection({"ExtractFieldsTransform": glue_df}, glue_context)
I chose to process the DynamicFrame as a Spark DataFrame with an extract_db_fields_df function to allow for simple Spark unit testing. Here's a snippet of how I extract one of the fields that ends up with the wrong value when written to a file:
df = df.withColumn("total_weight",
col("shipments_totalWeight").cast(DoubleType()))
Some notable details:
The data that is processed by extract_db_fields is a DF of three joined datasets.
Before writing to a file, I repartition to 1 partition to create a single file for discovery and analysis by an ML team (see the sketch after this list).
I've verified the source data does not include values like the ones shown above.
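For context, here is roughly what that final write step looks like; the path, format and variable names are illustrative, not the exact job code:
# repartition to a single partition so the output lands in one file (illustrative only)
single_file_df = df_base.repartition(1)
(single_file_df.write
    .mode("overwrite")
    .option("header", "true")
    .csv("s3://example-bucket/discovery/extracted_fields/"))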
I am creating a new column using an "AS" statement in PySpark SQL code:
accounts_2.price - COALESCE(cast(accounts_2.price as Numeric(38,8)), 0) AS follow_on_price
As you can see, I am creating a new column "follow_on_price", but when I try to use this newly created column in the same Spark SQL code:
, accounts_2.unlock_price - COALESCE(cast(accounts_2.upfront_price as Numeric(38,8)), 0) AS follow_on_price
, follow_on_price * exchange_rate_usd AS follow_on_price_usd
It does not recognise follow_on_price when it is used immediately in the same Spark SQL statement, but when I create a new temp view and use it as a table for the next step, it works. Please explain why. Why can't Spark SQL reference the new column within the same query, so that I don't have to create an extra step for "follow_on_price * exchange_rate_usd AS follow_on_price_usd" and it can be done in a single step, like in normal SQL such as Postgres?
It's standard SQL behavior, which prevents ambiguities in a query: you cannot reference column aliases in the same SELECT list.
You can use an inner query instead, like below:
>>> data2 = [{"COL_A": 'A',"COL_B": "B","price":212891928.90},{"COL_A": "A1","COL_B": "cndmmkdssac","price":9496943.4593},{"COL_A": 'A',"COL_B": "cndds","price":4634609.6994}]
>>> df=spark.createDataFrame(data2)
>>> df.show()
+-----+-----------+-------------+
|COL_A| COL_B| price|
+-----+-----------+-------------+
| A| B|2.128919289E8|
| A1|cndmmkdssac| 9496943.4593|
| A| cndds| 4634609.6994|
+-----+-----------+-------------+
>>> df.registerTempTable("accounts_2")
>>> spark.sql("select s1.follow_on_price, s1.follow_on_price*70 as follow_on_price_usd from (select COALESCE(accounts_2.price) AS follow_on_price from accounts_2) s1").show()
+---------------+-------------------+
|follow_on_price|follow_on_price_usd|
+---------------+-------------------+
|  2.128919289E8|    1.4902435023E10|
|   9496943.4593|    6.64786042151E8|
|   4634609.6994|    3.24422678958E8|
+---------------+-------------------+
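If you prefer, the same rewrite can also be expressed as a CTE, which Spark SQL supports; here is a sketch against the same accounts_2 temp table, with 70 again standing in for the exchange rate:
>>> spark.sql("""
...     WITH s1 AS (
...         SELECT COALESCE(accounts_2.price) AS follow_on_price
...         FROM accounts_2
...     )
...     SELECT follow_on_price,
...            follow_on_price * 70 AS follow_on_price_usd
...     FROM s1
... """).show()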
I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried; it works only when the numbers are contiguous. Could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I can do this by creating a UDF, but I am wondering whether it is possible with Spark's built-in functions. My Spark version is 2.2.0.
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
("This_is_111_222_444_test",Array(3296)),
("This_is_555_666_test",null.asInstanceOf[Array[Int]]),
("This_is_999_test",null.asInstanceOf[Array[Int]]))
.toDF("col1","col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
// OP solution:
bigData.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).count
Here is another way; please check the performance. Note that filter with a lambda and array_join are built-in functions that require Spark 2.4 or later.
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
I have to check whether incoming data has any null, "" or " " values. The columns I have to check are not fixed. I am reading from a config that stores, per file, the column names along with their permissible nullability.
+----------+------------------+--------------------------------------------+
| FileName | Nullable | Columns |
+----------+------------------+--------------------------------------------+
| Sales | Address2,Phone2 | OrderID,Address1,Address2,Phone1,Phone2 |
| Invoice | Bank,OfcAddress | InvoiceNo,InvoiceID,Amount,Bank,OfcAddress |
+----------+------------------+--------------------------------------------+
So for each file I have to determine which fields shouldn't contain nulls, and on that basis either process the file or error it out. Is there any pythonic way to do this?
The table structure you’re showing makes me believe you have read the file containing these job details as a Spark DataFrame. You probably shouldn’t, as it’s very likely not big data. If you have it as a Spark DataFrame, collect it to the driver, so that you can create separate Spark jobs for each file.
Then, each job is fairly straightforward: you have a certain file location from which you must read. That info is captured by the FileName, I presume. Now, I will also presume the file format for each of these files is identical. If not, you'll have to add metadata indicating the file format. For now, I assume it's CSV.
Next, you must determine the subset of columns that needs to be checked for the presence of nulls. That's easy: given that you have a list of all columns in the DataFrame (which could have been derived from the DataFrame loaded in the previous step) and a list of all columns that may contain nulls, the list of columns that must not contain nulls is simply the difference between the two.
Finally, you aggregate over the DataFrame the number of nulls within each of these columns. As this is a DataFrame aggregate, there's only one row in the result set, so you can take head to bring it to the driver. Cast it to a dict for easier access to the attributes.
I’ve added a function, summarize_positive_counts, that returns the columns where there was at least one null record found, thereby invalidating the claim in the original table.
df.show(truncate=False)
# +--------+---------------+------------------------------------------+
# |FileName|Nullable |Columns |
# +--------+---------------+------------------------------------------+
# |Sales |Address2,Phone2|OrderID,Address1,Address2,Phone1,Phone2 |
# |Invoice |Bank,OfcAddress|InvoiceNo,InvoiceID,Amount,Bank,OfcAddress|
# +--------+---------------+------------------------------------------+
jobs = df.collect()  # bring it to the driver, to create separate Spark jobs from its rows
from pyspark.sql.functions import col, sum as spark_sum
def report_null_counts(frame, job):
    cols_to_verify_not_null = (set(job.Columns.split(","))
                               .difference(job.Nullable.split(",")))
    null_counts = frame.agg(*(spark_sum(col(_).isNull().cast("int")).alias(_)
                              for _ in cols_to_verify_not_null))
    return null_counts.head().asDict()

def summarize_positive_counts(filename, null_counts):
    return {filename: [colname for colname, nbr_of_nulls in null_counts.items()
                       if nbr_of_nulls > 0]}

for job in jobs:  # embarrassingly parallelizable
    frame = spark.read.csv(job.FileName, header=True)
    null_counts = report_null_counts(frame, job)
    print(summarize_positive_counts(job.FileName, null_counts))
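If you want to actually error out a file rather than just print the summary, here is a minimal variation of the loop above; raising ValueError is just one choice:
for job in jobs:
    frame = spark.read.csv(job.FileName, header=True)
    null_counts = report_null_counts(frame, job)
    bad_columns = summarize_positive_counts(job.FileName, null_counts)[job.FileName]
    if bad_columns:
        # fail the run, naming the non-nullable columns that contained nulls
        raise ValueError("{}: nulls found in non-nullable columns: {}".format(
            job.FileName, bad_columns))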
I wonder whether Spark SQL supports caching the result of the query defined in a WITH clause.
The Spark SQL query is something like this:
with base_view as
(
select some_columns from some_table
WHERE
expensive_udf(some_column) = true
)
... multiple query join based on this view
While this query works with Spark SQL, I noticed that the UDF was applied to the same data set multiple times.
In this use case, the UDF is very expensive. So I'd like to cache the query result of base_view so the subsequent queries would benefit from the cached result.
P.S. I know you can create and cache a table with the given query and then reference it in the subqueries. In this specific case, though, I can't create any tables or views.
That is not possible. The WITH result cannot be persisted after execution or substituted into a new Spark SQL invocation.
The WITH clause allows you to give a name to a temporary result set so it can be reused several times within a single query. I believe what he's asking for is a materialized view.
This can be done by executing several SQL queries.
spark.sql("""
-- first cache the base query
CACHE TABLE base_view AS
select some_columns
from some_table
WHERE
expensive_udf(some_column) = true
""")
spark.sql("""
-- then join multiple queries against this cached view
... multiple query join based on this view
""")
Not sure if you are still interested in the solution, but the following is a workaround to accomplish the same:
spark.sql("""
| create temp view my_view
| as
| WITH base_view as
| (
| select some_columns
| from some_table
| WHERE
| expensive_udf(some_column) = true
| )
| SELECT *
| from base_view
""");
spark.sql("""CACHE TABLE my_view""");
Now you can use the my_view temp view to join to other tables, as shown below:
spark.sql("""
| select mv.col1, t2.col2, t3.col3
| from my_view mv
| join tab2 t2
| on mv.col2 = t2.col2
| join tab3 t3
| on mv.col3 = t3.col3
""");
Remember to uncache the view when you're done:
spark.sql("""UNCACHE TABLE my_view""");
Hope this helps.