I'm working on a data lake where I use AWS Glue for ETL with PySpark. I ran into an issue with a few records where the values end up in the wrong column after the resulting DF is written to a file.
The first two rows in the following example have values in the wrong columns, e.g. customer_postal_code is lb. The rows after that show correct values:
+-----------------+--------------------+----------------+------------+-------------+
|total_weight_unit|customer_postal_code|from_postal_code|from_density|service_level|
+-----------------+--------------------+----------------+------------+-------------+
|4.070000171661377|                  lb|           96863|     17406.0|        447.0|
|12.34000015258789|                  lb|           96863|     17406.0|        447.0|
|               lb|               78665|           76051|      1258.0|     standard|
|               lb|               06708|           17406|       447.0|     standard|
|               lb|               67357|           63376|      1705.0|     standard|
|               lb|               91730|           17406|       447.0|     standard|
+-----------------+--------------------+----------------+------------+-------------+
I'm using a custom transform to return a DynamicFrame with the DF in the shape I'm expecting like this:
extracted_fields_node = extract_db_fields(
    glue_context,
    DynamicFrameCollection(
        {"source_node": source_node},
        glue_context
    ),
)
Here's the definition of extract_db_fields:
def extract_db_fields(glue_context: GlueContext, dfc: DynamicFrameCollection) -> DynamicFrameCollection:
    data_frame = dfc.select(list(dfc.keys())[0]).toDF()
    df_base = extract_db_fields_df(data_frame)
    glue_df = DynamicFrame.fromDF(
        df_base, glue_context, "extract_partner_fields"
    )
    return DynamicFrameCollection({"ExtractFieldsTransform": glue_df}, glue_context)
I chose to process the DynamicFrame as a Spark DataFrame with an extract_db_fields_df function to allow for simple Spark unit testing. Here's a snippet of how I extract one of the fields that has the wrong value when written to a file:
df = df.withColumn("total_weight",
col("shipments_totalWeight").cast(DoubleType()))
Some notable details:
The data that is processed by extract_db_fields is a DF of three joined datasets.
Before writing to a file, I repartition by 1 to create a single file for discovery and analysis by an ML team.
I've verified the source data does not include values like the ones shown above.
So, I'm trying to load Avro files into DLT and create pipelines and so forth.
As a simple DataFrame in Databricks, I can read and unpack the Avro files using json / rdd.map / lambda functions. I can then create a temp view, run a SQL query, and select the fields I want.
--example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")
--selecting the data
sql_query1 = sqlContext.sql("""
select distinct
data.field.test1 as col1
,data.field.test2 as col2
,data.field.fieldgrp.city as city
from
eventhub
""")
However, I am trying to replicate the process using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just Avro in its rawest form.
I then planned to create a view that lists the unpacked Avro data, much like I did above with "eventhub", so that I can then run queries against it.
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer. It just does not seem to apply the functions that make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, as if the functions are not working, so when I try to select a column, it is not recognised.
--unpacked data
#dlt.view(name=f"eventdata_v")
def eventdata_v():
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data
--trying to query the data but it does not recognise field names, even when i select "data" only
#dlt.view(name=f"eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1 ")
    )
    return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help would be much appreciated. Thank you.
I have to check whether incoming data contains any null, "" or " " values. The columns to check are not fixed: I read them from a config that stores, for each file, the column names and which of them are allowed to be null.
+----------+------------------+--------------------------------------------+
| FileName | Nullable | Columns |
+----------+------------------+--------------------------------------------+
| Sales | Address2,Phone2 | OrderID,Address1,Address2,Phone1,Phone2 |
| Invoice | Bank,OfcAddress | InvoiceNo,InvoiceID,Amount,Bank,OfcAddress |
+----------+------------------+--------------------------------------------+
So for each file I have to determine which fields must not contain nulls, and on that basis either process the file or error out. Is there a Pythonic way to do this?
The table structure you’re showing makes me believe you have read the file containing these job details as a Spark DataFrame. You probably shouldn’t, as it’s very likely not big data. If you have it as a Spark DataFrame, collect it to the driver, so that you can create separate Spark jobs for each file.
Then, each job is fairly straightforward: you have a certain file location from which you must read. That info is captured by the FileName, I presume. Now, I will also presume the file format for each of these files is identical. If not, you’ll have to add metadata indicating the file format. For now, I assume it’s CSV.
Next, you must determine the subset of columns that needs to be checked for the presence of nulls. That’s easy: given that you have a list of all columns in the DataFrame (which could’ve been derived from the DataFrame generated by the previous step (the loading)) and a list of all columns that can contain nulls, the list of columns that can’t contain nulls is simply the difference between these two.
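The column arithmetic in this step needs no Spark at all. A minimal sketch, assuming one config row shaped like the table in the question; the Job namedtuple and the required_columns helper are hypothetical names used only for illustration. The list comprehension keeps the original column order, which a plain set difference does not guarantee:

```python
from collections import namedtuple

# Hypothetical stand-in for one collected row of the config table.
Job = namedtuple("Job", ["FileName", "Nullable", "Columns"])


def required_columns(job):
    """Return the columns that must not contain nulls, in their original order."""
    nullable = {name.strip() for name in job.Nullable.split(",")}
    return [name.strip() for name in job.Columns.split(",")
            if name.strip() not in nullable]


sales = Job("Sales", "Address2,Phone2", "OrderID,Address1,Address2,Phone1,Phone2")
print(required_columns(sales))  # ['OrderID', 'Address1', 'Phone1']
```

Whether order matters depends on how the result is reported; for the aggregation itself either form works.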
Finally, you aggregate over the DataFrame the number of nulls within each of these columns. As this is a DataFrame aggregate, there’s only one row in the result set, so you can take head to bring it to the driver. Cast it to a dict for easier access to the attributes.
I’ve added a function, summarize_positive_counts, that returns the columns where there was at least one null record found, thereby invalidating the claim in the original table.
df.show(truncate=False)
# +--------+---------------+------------------------------------------+
# |FileName|Nullable |Columns |
# +--------+---------------+------------------------------------------+
# |Sales |Address2,Phone2|OrderID,Address1,Address2,Phone1,Phone2 |
# |Invoice |Bank,OfcAddress|InvoiceNo,InvoiceID,Amount,Bank,OfcAddress|
# +--------+---------------+------------------------------------------+
jobs = df.collect()  # bring it to the driver, to create new Spark jobs from it
from pyspark.sql.functions import col, sum as spark_sum

def report_null_counts(frame, job):
    cols_to_verify_not_null = (set(job.Columns.split(","))
                               .difference(job.Nullable.split(",")))
    null_counts = frame.agg(*(spark_sum(col(_).isNull().cast("int")).alias(_)
                              for _ in cols_to_verify_not_null))
    return null_counts.head().asDict()

def summarize_positive_counts(filename, null_counts):
    return {filename: [colname for colname, nbr_of_nulls in null_counts.items()
                       if nbr_of_nulls > 0]}

for job in jobs:  # embarrassingly parallelizable
    frame = spark.read.csv(job.FileName, header=True)
    null_counts = report_null_counts(frame, job)
    print(summarize_positive_counts(job.FileName, null_counts))
Here is how I use Spark-SQL in a little application I am working with.
I have two HBase tables, say t1 and t2.
My input is a CSV file. I parse each line and query (Spark SQL) table t1, then write the output to another file.
Then I parse the second file, query the second table, apply certain functions to the result, and output the data.
Table t1 has the purchase details, and t2 has the list of items each user added to the cart, along with the time frame.
Input -> CustomerID (a list of them in a CSV file)
Output -> A CSV file in the particular format mentioned below.
CustomerID, Details of the item he bought, First item he added to cart, All the items he added to cart until purchase.
For an input of 1100 records, it takes two hours to complete the whole process!
I was wondering if I could speed up the process, but I am stuck.
Any help?
How about this DataFrame approach...
1) Create a dataframe from CSV.
how-to-read-csv-file-as-dataframe
or something like this example:
val csv = sqlContext.sparkContext.textFile(csvPath).map {
  case(txt) =>
    try {
      val reader = new CSVReader(new StringReader(txt), delimiter, quote, escape, headerLines)
      val parsedRow = reader.readNext()
      Row(mapSchema(parsedRow, schema) : _*)
    } catch {
      case e: IllegalArgumentException => throw new UnsupportedOperationException("converted from Arg to Op except")
    }
}
2) Create another DataFrame from the HBase data (if you are using Hortonworks) or Phoenix.
3) Do the join and apply functions (maybe a udf, or when/otherwise, etc.); the result can be a dataframe again.
4) Join the result dataframe with the second table and output the data as CSV, as in the pseudo code example below.
It should be possible to prepare a dataframe with custom columns and corresponding values and save it as a CSV file.
You can try this kind of thing in the spark shell as well.
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema","true").
load("cars93.csv")
val df2=df.filter("quantity <= 4.0")
val col=df2.col("cost")*0.453592
val df3=df2.withColumn("finalcost",col)
df3.write.format("com.databricks.spark.csv").
option("header","true").
save("output-csv")
Hope this helps.. Good luck.
I am trying to do some analysis with Spark. Running the query with collect.foreach(println) shows the results correctly, but if I use show or %sql the output is weird: it shows almost nothing.
sqlContext.sql("select distinct device from TestTable1 where id = 23233").collect.foreach(println)
[ipad]
[desktop]
[playstation]
[iphone]
[android]
[smarTv]
gives proper device but if I use just show or any sql :
sqlContext.sql("select distinct device from TestTable1 where id = 23233").show()
%sql
select distinct device from TestTable1 where id = 23233
+-----------+
|device |
+-----------+
| |
| |
|ion|
| |
| |
| |
+-----------+
I need graphs and charts, so I would like to use %sql, but %sql gives these weird results. Does anyone have any idea why I am getting this?
show is a formatted output of your data, whereas collect.foreach(println) is merely printing the Row data. They are two different things. If you want to format your data in a specific way, then stick with foreach...keeping in mind you are printing a sequence of Row. You'll have to pull the data out of the row if you want to get your own formatting for each column.
I can probably provide more specific information if you provide the version of spark and zeppelin that you are using.
You stated that you are using %sql because you need Zeppelin's graphs and charts--i.e. you wouldn't be swapping to %sql if you didn't have to.
You can just stick with using Spark dataframes by using z.show(), for example:
%pyspark
df = sqlContext.createDataFrame([
(23233, 'ipad'),
(23233, 'ipad'),
(23233, 'desktop'),
(23233, 'playstation'),
(23233, 'iphone'),
(23233, 'android'),
(23233, 'smarTv'),
(12345, 'ipad'),
(12345, 'palmPilot'),
], ('id', 'device'))
foo = df.filter('id = 23233').select('device').distinct()
z.show(foo)
In the above, z.show(foo) renders the default Zeppelin table view, with options for the other chart types.
E.g
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints the entire table on the terminal. But I want to access each row in that table using for or while to perform further calculations.
You simply cannot. DataFrames, like other distributed data structures, are not iterable and can be accessed only through dedicated higher-order functions and/or SQL methods.
You can of course collect
for row in df.rdd.collect():
    do_something(row)
or convert toLocalIterator
for row in df.rdd.toLocalIterator():
    do_something(row)
and iterate locally as shown above, but that defeats the purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)
sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
To loop through the rows of the Date column:
rows = df3.select('Date').collect()
final_list = []
for i in rows:
    final_list.append(i[0])
print(final_list)
Give it a try like this:
result = spark.createDataFrame([('SpeciesId','int'), ('SpeciesName','string')],["col_name", "data_type"])
for f in result.collect():
    print(f.col_name)
If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
In the answer above,
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
because name, age, and city are not variables but simply keys of the dictionary.
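The difference between unquoted and quoted keys is easy to reproduce without Spark. A minimal sketch, using plain dicts as hypothetical stand-ins for the collected rows (a Row supports the same x["name"] lookup):

```python
# Plain dicts standing in for collected Row objects (hypothetical sample data).
rows = [
    {"name": "Ann", "age": 31, "city": "Oslo"},
    {"name": "Bob", "age": 45, "city": "Lima"},
]

try:
    # Unquoted keys are looked up as variables, which do not exist here.
    broken = [{name: x["name"]} for x in rows]
except NameError as error:
    print("unquoted key failed:", error)

# Quoted keys are string literals, so the dicts are built as intended.
tuple_list = [{"name": x["name"], "age": x["age"], "city": x["city"]} for x in rows]

for row in tuple_list:
    print("{} is a {} year old from {}".format(row["name"], row["age"], row["city"]))
```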