Call AWS Lambda service from Glue job - apache-spark

In a Glue script I am collecting data from a table into a DataFrame. In that DataFrame I need to add a new calculated column. Can I call the AWS Lambda service to compute the value for this new column?
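For illustration, a minimal PySpark sketch of that idea, assuming a Lambda function named "calc-column" that takes a JSON payload and returns the computed value (the function name, payload shape, and response shape are all hypothetical):

import json
import boto3
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def call_lambda(value):
    # One boto3 client per call is shown for brevity; in practice reuse a
    # client per executor and prefer batching several rows per invocation.
    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName="calc-column",              # hypothetical Lambda name
        Payload=json.dumps({"input": value}),    # hypothetical payload shape
    )
    return json.loads(response["Payload"].read())["result"]  # hypothetical response shape

calc_udf = F.udf(call_lambda, StringType())
# df is the DataFrame already collected from the table in the Glue script
df = df.withColumn("calculated_col", calc_udf(F.col("source_col")))

Note that this invokes Lambda once per row, so for large tables it is usually better to compute the value in Spark itself, or to call Lambda once per partition or batch of rows.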

Related

How to change the data type of a column (schema) using AWS Athena through a query

I have uploaded my data and started the crawler, but now I want to change the data type of one particular column in the Glue crawler. For example, column 1. name has data type string and 2. day has data type double; now I want to change the day data type to string. Is that possible through a query, or is there any option to change or update the column data type through the Glue API? There is an option to change the data type in the Glue console, but I want to know how to change it using AWS Athena or the Glue APIs. I have referred to https://docs.aws.amazon.com/athena/latest/ug/types-of-updates.html#updates-changing-column-type but it gives me an error:
[FAILED: ParseException line 1:6 missing EOF at '-' near 'my' ]
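For the Glue API route mentioned in the question, a minimal boto3 sketch might look like the following (database, table, and column names are placeholders, and only a subset of the TableInput fields is shown):

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]

# Rebuild a TableInput from the existing definition (update_table does not
# accept the read-only fields returned by get_table) and change the type.
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": table["StorageDescriptor"],
    "PartitionKeys": table.get("PartitionKeys", []),
    "TableType": table.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": table.get("Parameters", {}),
}
for col in table_input["StorageDescriptor"]["Columns"]:
    if col["Name"] == "day":
        col["Type"] = "string"   # double -> string

glue.update_table(DatabaseName="my_db", TableInput=table_input)

Whether queries then read the column correctly still depends on the underlying file format, as described in the Athena documentation linked above.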

How does table data get loaded into a DataFrame in Databricks? Row by row or in bulk?

I am new to Databricks notebooks and DataFrames. I have a requirement to load a few columns (out of many) from a table of around 14 million records into a DataFrame. Once the table is loaded, I need to create a new column based on values present in two columns.
I want to write the logic for the new column along with the select command while loading the table into the DataFrame.
Ex:
df = spark.read.table(tableName) \
    .select(columnsList) \
    .withColumn('newColumnName', 'logic')
Will it have any performance impact? Is it better to first load the few columns into the DataFrame and then perform the column manipulation on the loaded DataFrame?
Does the table data get loaded all at once or row by row into the DataFrame? If row by row, then by including the column-manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - is it backed by Parquet or Delta, is it an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, then it's easier because it's a column-oriented file format, so the data for each column is stored together.
Regarding the question about reading - Spark is lazy by default, so even if you put df = spark.read.table(....) into a separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting DataFrame to see how Spark will perform the operations.
P.S. I recommend grabbing the free copy of Learning Spark, 2nd ed. that is provided by Databricks - it will give you a foundation for developing code for Spark/Databricks.
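As a small illustration of the laziness described above (table and column names are placeholders):

from pyspark.sql import functions as F

df = (spark.read.table("my_table")
      .select("col_a", "col_b")
      .withColumn("new_col", F.when(F.col("col_a") > F.col("col_b"), "a").otherwise("b")))

df.explain()   # only prints the plan; no data has been read yet
df.count()     # an action: this is when Spark actually scans the table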

How do we create a generic mapping data flow in Data Factory that will dynamically extract data from different tables with different schemas?

I am trying to create an Azure Data Factory mapping data flow that is generic for all tables. I am going to pass the table name, the primary column for join purposes, and the other columns to be used in groupBy and aggregate functions as parameters to the data flow.
[Screenshot: parameters passed to the data flow]
I am unable to reference this parameter in groupBy.
Error: DF-AGG-003 - Groupby should reference atleast one column -
MapDrifted1 aggregate(
) ~> Aggregate1,[486 619]
Has anyone tried this scenario? Please help if you have some knowledge on this, or if it can be handled in a U-SQL script.
We first need to look up your parameter string name in your incoming source data to locate the metadata and assign it.
Just add a Derived Column before your Aggregate and it will work. Call the column 'groupbycol' in your Derived Column and use this formula: byName($group1).
In your Aggregate, select 'groupbycol' as your group-by column.

How does Structured Streaming execute pandas_udf?

I'd like to understand how Structured Streaming treats new data coming in.
If more rows arrive at the same time, does Spark append them to the input streaming DataFrame?
If I have a withColumn and apply a pandas_udf, is the function called once per row, or only once with all the rows passed to the pandas_udf?
Let's say something like this:
dfInt = spark \
    .readStream \
    .load() \
    .withColumn("prediction", predict(F.struct([col(x) for x in features])))
If more rows arrive at the same time, are they processed together or one at a time?
Is there a chance to limit this to only one row at a time?
If more rows arrive at the same time, does Spark append them to the input streaming DataFrame?
Let's talk about the Micro-Batch Execution Engine only, right? That's what you most likely use in streaming queries.
Structured Streaming queries the streaming sources in a streaming query using Source.getBatch (DataSource API V1):
getBatch(start: Option[Offset], end: Offset): DataFrame
Returns the data that is between the offsets (start, end]. When start is None, then the batch should begin with the first record.
Whatever the source returns in a DataFrame is the data to be processed in a micro-batch.
If I have a withColumn and apply a pandas_udf, is the function called once per row
Always. That's how user-defined functions work in Spark SQL.
or only once with all the rows passed to the pandas_udf?
The Pandas UDF documentation says:
Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data.
The Python function should take pandas.Series as inputs and return a pandas.Series of the same length. Internally, Spark will execute a Pandas UDF by splitting columns into batches and calling the function for each batch as a subset of the data, then concatenating the results together.
If more rows arrive at the same time, are they processed together or one at a time?
If "arrive" means "part of a single DataFrame", then "they are processed together", but one row at a time (per the UDF contract).
Is there a chance to limit this to only one row at a time?
You don't have to. It's like that by design. One row at a time only.
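As a small illustration of the Series-in/Series-out contract quoted above (the column name and logic are placeholders):

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.pandas_udf(DoubleType())
def times_two(v: pd.Series) -> pd.Series:
    # Called once per Arrow batch: `v` holds all rows of that batch,
    # and the returned Series must have the same length.
    return v * 2.0

df = df.withColumn("doubled", times_two(F.col("value")))

The size of those Arrow batches is controlled by the spark.sql.execution.arrow.maxRecordsPerBatch configuration.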

Google Cloud Storage bucket copy

I have two buckets in GCS. Each bucket has a table.
I want to copy the buckets' content into Hadoop using Java Spark. Is it possible via the GCS Hadoop connector?
GCS pricing relies on the number of operations and their class (A or B); how can I estimate the number of operations needed? For example, is the number of operations needed to copy the content of a table equal to the number of fields (number of columns * number of rows), or is there another calculation method?
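A rough sketch of such a copy (shown in PySpark for brevity; the Java API is analogous), assuming the GCS Hadoop connector is configured on the cluster and that the bucket/path names are placeholders:

# Read directly from GCS via the connector's gs:// scheme, then write to HDFS.
df = spark.read.parquet("gs://source-bucket/table/")        # placeholder path
df.write.mode("overwrite").parquet("hdfs:///data/table/")   # placeholder path

On the pricing side, GCS generally bills per object operation (for example LIST and GET requests), not per field, so the operation count scales with the number of objects read rather than with columns * rows; the exact operation classes and rates are in the GCS pricing documentation.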
