How to SELECT multiple columns dynamically in DolphinDB? - metaprogramming

Below is part of my table schema.
my table schema
I can use statement select factor.column(2) as code from factor to get the second column.
I wonder if I can SELECT the 2nd column, and the 4th to the last column.

Using metaprogramming is a good choice.
Get the column names of the table first using function columnNames.
Access column names by index and join them.
Create and execute a SQL statement using function sql and eval with the metacode generated by function sqlCol.
factor=table(2015.01.15 as date,`00000.SZ as code,-1.05 as factor_value,1.1 as factor01,1.2 as factor02)
colNames = factor.columnNames()
finalColNames = colNames[1] join colNames[3:]
sql(sqlCol(finalColNames), factor).eval()
code factor01 factor02
-------- -------- --------
00000.SZ 1.1 1.2

Related

What is the industry standard Deduping method in Dataflows?

So Deduping is one of the basic and imp Datacleaning technique.
There are a number of ways to do that in dataflow.
Like myself doing deduping with help of aggregate transformation where i put key columns(Consider "Firstname" and "LastName" as cols) which are need to be unique in Group by and a column pattern like name != 'Firstname' && name!='LastName'
$$ _____first($$) in aggregate tab.
The problem with this method is ,if we have a total of 200 cols among 300 cols to be considered as Unique cols, Its a very tedious to do include 200 cols in my column Pattern.
Can anyone suggest a better and optimised Deduping process in Dataflow acc to the above situation?
I tried to repro the deduplication process using dataflow. Below is the approach.
List of columns that needs to be grouped by are given in dataflow parameters.
In this repro, three columns are given. This can be extended as per requirements.
Parameter Name: Par1
Type: String
Default value: 'col1,col2,col3'
Source is taken as in below image.
(Group By columns: col1, col2, col3;
Aggregate column: col4)
Then Aggregate transform is taken and in group by,
sha2(256,byNames(split($Par1,','))) is given in columns and it is named as groupbycolumn
In Aggregates, + Add column pattern near column1 and then delete Column1. Then Enter true() in matching condition. Then click on undefined column expression and enter $$ in column name expression and first($$) in value expression.
Output of aggregation function
Data is grouped by col1,col2 and col3 and first value of col4 is taken for every col1,col2 and col3 combination.
Then using select transformation, groupbycolumn from above output can be removed before copying to sink.
Reference: ** MS document** on Mapping data flow script - Azure Data Factory | Microsoft Learn

How to Flatten a semicolon Array properly in Azure Data Factory?

Context: I've a data flow that extracts data from SQL DB, when data comes is just one column with a string separated by tab, in order to manipulate the data properly, I've tried to separate every single column with its corresponding data:
Firstly, to 'rebuild' the table properly I used a 'Derived Column' activity replacing tab with semicolons instead (1)
dropLeft(regexReplace(regexReplace(regexReplace(descripcion,[\t],';'),[\n],';'),[\r],';'),1)
So, after that use 'split()' function to get an array and build the columns (2)
split(descripcion, ';')
Problem: When I try to use 'Flatten' activity (as here https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten), is just not working and data flow throws me just one column or if I add an additional column in the 'Flatten' activity I just get another column with the same data that the first one:
Expected output:
column2
column1
column3
2000017
ENVASE CORONA CLARA 24/355 ML GRAB
PC13
2004297
ENVASE V FAM GRAB 12/940 ML USADO
PC15
Could you say me what i'm doing wrong, guys? thanks by the way.
You can use the derived column activity itself, try as below.
After the first derived column, what you have is a string array which can just be split again using derived schema modifier.
Where firstc represent the source column equivalent to your column descripcion
Column1: split(firstc, ';')[1]
Column2: split(firstc, ';')[2]
Column3: split(firstc, ';')[3]
Optionally you can select the columns you need to write to SQL sink

dynamically create columns in pyspark sql select statement

I have a pyspark dataframe called unique_attributes. the dataframe has columns productname, productbrand, producttype, weight, id. I am partitioning by some columns and trying to get the first value of the id column using a window function. I would like to be able to dynamically pass a list of columns to partition by. so for example if I wanted to add the weight column to the partition without having to code another 'col('weight') in the select, just instead pass a list. does anyone have a suggestion how to accomplish this? I have an example below.
current code:
w2 = Window().partitionBy(['productname',
'productbrand',
'producttype']).orderBy(unique_attributes.id.asc())
first_item_id_df=unique_attributes\
.select(col('productname'),
col('productbrand'),
col('producttype')),first("id",True).over(w2).alias('matchid')).distinct()
desired dynamic code:
column_list=['productname',
'productbrand',
'producttype',
'weight']
w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())
# somehow creates
first_item_id_df=unique_attributes\
.select(col('productname'),
col('productbrand'),
col('producttype'), col('weight'),first("id",True).over(w2).alias('matchid')).distinct()

SQLite returns int instead of real

I'm working on a website in node js and use SQLite as a database for the first time.
I want to be able to use real for some form data and I noticed that every real in my database are converted to integer once the query is made.
To vizualize the database i am using DB Browser and i checked if the columns are defined as REAL which they are.
If i try to query a data set as 0.1 in my DB I get this :
sqlite> select step_variable
from variables
where id=38;
0.0
After trying as suggested the command TYPEOF(step_variable) it returned :
0.0|real
In the SQLite CREATE TABLE command, one defines a data type affinity, not a data type. SQLite supports the following five column affinities: TEXT, NUMERIC, INTEGER, REAL, NONE.
Thus the data type you specify when creating a table does not enforce a certain data type. You can supply any data type you want or even omit the data type.
CREATE TABLE table1(
column1 ABC,
column2 Others,
column3 WHATEVER);
CREATE TABLE table2(column1, column2, column3);
Populate tables:
INSERT INTO table1 VALUES( 1, 'my text', 123.45);
INSERT INTO table2 VALUES( 1, 'my text', 123.45);
Now let us check what SQLite made out of it:
SELECT column1, TYPEOF(column1) from table1
SELECT column2, TYPEOF(column1) from table1
SELECT column3, TYPEOF(column1) from table1
With:
column TYPEOF(column)
------------------------
1 INTEGER
my text TEXT
123.45 REAL
When you go through a query result e.g. by using sqlite2_step you can use the sqlite3_column_type statement to confirm the column type - unless you know the result anyway and simply cast the result to the data type expected.
Martin
I found the solution it was simply that i didn't save my file after modifying it.

Join two tables with OR logic in PowerQuery

As the title states, I am trying to do a merge of 2 tables. I want a nested joint where the values from the first table are always there and rows matching the second table are added to the first. I believe this is known as the nested join.
Unfortunately, it only allows for 1 key to 1 key matching where as I need it for 1 key in table 1 to 2 keys in table 2
Here is an example
Table1:
Group
..
..
Time
Date
Table2:
Group 1
Group 2
..
..
..
Other Info
What I want is where "Group = Group 1 OR Group = Group 2" and display the matching row from table 2 nested into Table 1
I looked at the following example but I must be confused by the syntax because it doesn't seem to be working for me.
How to join two tables in PowerQuery with one of many columns matching?
So after further investigation of the answer post I linked earlier, I will add an explanation of it here:
Table.AddColumn(Source, "Name_of_Column",
(Q1) => Table.SelectRows(Query2,
each Q1[Col_from_q1] = [Col_from_q2] or Q1[Col_from_q1] = [2_Col_from_q2]
)
)
So this did work for me and it adds an extra column that needs to be expanded to get all the values from the table. What i would add is that I don't know / haven't tested if there are multiple matches and how it treats it, based on nestedjoin, I would assume that it will duplicate rows in the first table.

Resources