Get value of column in AWS Glue Custom Transform

Get value of column in AWS Glue Custom Transform - python-3.x

I'm working on ETL in AWS Glue. I need to decode text from table which is in base64 - I'm doing that in Custom Transform in Python3.
My code is below:
def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
import base64
newdf = dfc.select(list(dfc.keys())[0]).toDF()
data = newdf["email"]
data_to_decrypt = base64.b64decode(data)
I've got error like that:
TypeError: argument should be a bytes-like object or ASCII string, not 'Column'
How to get plan string from the Column object?

I was wrong and it was completely different thing than I thought.
Column object from newdf["email"] consists all rows for this single column, so it's not possible to just fetch one value from that.
What I ended up doing is iterating through whole rows and mapping them to new value like that:
def map_row(row):
id = row.id
client_key = row.client_key
email = decrypt_jasypt_string(row.email.strip())
phone = decrypt_jasypt_string(row.phone.strip())
created_on = row.created_on
return (id, email, phone, created_on, client_key)
df = dfc.select(list(dfc.keys())[0]).toDF()
rdd2=df.rdd.map(lambda row: map_row(row))
df2=rdd2.toDF(["id","email","phone", "created_on", "client_key"])
dyf_filtered = DynamicFrame.fromDF(df2, glueContext, "does it matter?")

Related

How can I convert from SQLite3 format to dictionary

How can i convert my SQLITE3 TABLE to a python dictionary where the name and value of the column of the table is converted to key and value of dictionary.

I have made a package to solve this issue if anyone got into this problem..
aiosqlitedict
Here is what it can do
Easy conversion between sqlite table and Python dictionary and vice-versa.
Get values of a certain column in a Python list.
Order your list ascending or descending.
Insert any number of columns to your dict.
Getting Started
We start by connecting our database along with
the reference column
from aiosqlitedict.database import Connect
countriesDB = Connect("database.db", "user_id")
Make a dictionary
The dictionary should be inside an async function.
async def some_func():
countries_data = await countriesDB.to_dict("my_table_name", 123, "col1_name", "col2_name", ...)
You can insert any number of columns, or you can get all by specifying
the column name as '*'
countries_data = await countriesDB.to_dict("my_table_name", 123, "*")
so you now have made some changes to your dictionary and want to
export it to sql format again?
Convert dict to sqlite table
async def some_func():
...
await countriesDB.to_sql("my_table_name", 123, countries_data)
But what if you want a list of values for a specific column?
Select method
you can have a list of all values of a certain column.
country_names = await countriesDB.select("my_table_name", "col1_name")
to limit your selection use limit parameter.
country_names = await countriesDB.select("my_table_name", "col1_name", limit=10)
you can also arrange your list by using ascending parameter
and/or order_by parameter and specifying a certain column to order your list accordingly.
country_names = await countriesDB.select("my_table_name", "col1_name", order_by="col2_name", ascending=False)

Script to insert data into tables dynamically using Flask SQLAlchemy

I understand that the documented way to insert data into a table looks like
```class Table(db.Model):
__tablename___ = 'table'
id = db.Column(db.Integer, primary_key=True)
data = db.Column(db.String(50)
...
insert = Table(id = '0', data = 'new data')```
However, I am working on a project that has multiple tables all with different columns, lengths, and data. I have worked out how to get the dynamic data into a dict, prepped to create rows. Below is my actual code:
def load_csv_data(self, ctx):
data_classes = [Locations, Scents, Classes]
data_tables = ['locations', 'scents', 'classes']
tables = len(data_tables)
for i in range(tables):
with open('./development/csv/{}.csv'.format(data_tables[i]), newline='') as times_file:
times_reader = csv.reader(times_file, delimiter=',', quotechar='|')
for row in times_reader:
data_columns = data_classes[i].__table__.columns
columns = len(data_columns)
insert_data = {}
for col in range(columns):
row_key = data_columns[col].key
row_value = row[col]
insert_data.update({row_key: row_value})
The challenge I am having is finding a way to do the actual insert based on these dynamic params. So if the above returns:
insert_data = {val1: val2, val3: val4, val5: val6}
I would like to convert this to:
insert = Table(val1='val2', val3='val4', val5='val6)
Everything I have tried so far has issued a __init__() missing 2 required positional arguments: error.
Anyone have any thoughts on how I might accomplish this?

Return a dataframe to a dataframe

I am trying to take a list of "ID"s from my dataframe dfrep and pass the column with the ID into a function I created in order to pass the values into a query to return back to dfrep.
My function is returning a dataframe, but the results of the dataframe are including the header and when I print dfrep there are two lines. I also cannot write the dataframe to excel using xlwings because I get TypeError: must be a pywintypes time object (got DataFrame).
def overrides(id):
sql = f"select name from sales..rep where id in({id})"
mydf = pd.read_sql(sql, conn)
return mydf
overrides = np.vectorize(overrides)
dfrep['name'] = overrides(dfrep['ID'])
wsData.range('A1').options(pd.DataFrame,index=False).value = dfrep
My goal is to load the column(s) in my function's dataframe into my main dataframe dfrep and then write to excel via xlwings. Any help is appreciated.

CreateDataFrame or SaveAsTable intuitively encode in pyspark 1.6

I am trying to save a table in spark1.6 using pyspark. All of the tables columns are saved as text, I'm wondering if I can change this:
product = sc.textFile('s3://path/product.txt')
product = m3product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
product.saveAsTable("product", mode='overwrite')
Is there something in the last 2 commands that could automatically recognize productid and marketid as numerics? I have a lot of files and a lot of fields to upload so ideally it would be automatic

Is there something in the last 2 commands that could automatically recognize productid and marketid as numerics
If you pass int or float (depending on what you need) pyspark will convert the data type for you.
In your case, changing the lambda function in
product = m3product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
to
from pyspark.sql.types import Row
def split_product_line(line):
fields = line.split('\t')
return Row(
productid=int(fields[0]),
marketID=int(fields[1]),
...
)
product = m3product.map(split_product_line).toDF()
You will find it much easier to control data types and possibly error/exception checks.
Try to prohibit lambda functions if possible :)

Accessing a global lookup Apache Spark

I have a list of csv files each with a bunch of category names as header columns. Each row is a list of users with a boolean value (0, 1) whether they are part of that category or not. Each of the csv files does not have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1,
21322,1,1,1,
21323,1,,,
File 2
user_id,cat4,cat5
21321,1,,,
21323,,1,,
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,1,1,,,
21322,1,1,1,,,
21323,1,1,,,,

Probably the title of the question is misleading in the sense that conveys a certain implementation choice as there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files could be divided in tuples of (user,category).
Any number of CSV files containing an arbitrary number of categories can be transformed to this simple format. The resulting CSV results of the union of the previous step, extraction of the total nr of categories present and some data transformation to get it in the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv:RDD[String]):RDD[(String, String)] = {
val headerLine = csv.first
val header = headerLine.split(",")
val data = csv.filter(_ != headerLine).map(line => line.split(","))
data.flatMap{elem =>
val merged = elem.zip(header)
val id = elem.head
merged.tail.collect{case (v,cat) if v == "1" => (id, cat)}
}
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)

You dont need to do this as a 2 step process if all you need is the resulting values.
A possible design:
1/ Parse your csv. You dont mention whether your data is on a distributed FS, so i'll assume it is not.
2/ Enter your (K,V) pairs into a mutable parallelized (to take advantage of Spark) map.
pseudo-code:
val directory = ..
mutable.ParHashMap map = new mutable.ParHashMap()
while (files[i] != null)
{
val file = directory.spark.textFile("/myfile...")
val cols = file.map(_.split(","))
map.put(col[0], col[i++])
}
and then you can access your (K/V) tuples by way of an iterator on the map.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Get value of column in AWS Glue Custom Transform - python-3.x

Related

How can I convert from SQLite3 format to dictionary

Script to insert data into tables dynamically using Flask SQLAlchemy

Return a dataframe to a dataframe

CreateDataFrame or SaveAsTable intuitively encode in pyspark 1.6

Accessing a global lookup Apache Spark

Categories

Resources