I am currently building a Data Flow in ADF, where I am converting the query below, which is already in place in another ETL tool called BigDecission. The query looks like this:
SELECT
    Asset_ID,
    MAX(CASE WHEN meter = 'LTPC' THEN reading_date ELSE NULL END) AS LTPC_Date,
    MAX(CASE WHEN meter = 'LTPC' THEN page_Count ELSE NULL END) AS LTPC
FROM
    mv_latest_asset_read
GROUP BY
    Asset_ID
While converting this piece in an ADF Data Flow, I have used an Aggregate transform and grouped by "ASSET_ID".
In the Aggregates tab I am deriving the columns "LTPC_DATE" and "LTPC" with the expressions mentioned below.
LTPC_DATE ---- > max(case(METER=='LTPC',READING_DATE))
LTPC ---- > max(case(METER=='LTPC',PAGE_COUNT))
But in the output I am getting null values, which shouldn't be the case. Can anyone identify the right way to do it?
I followed the same approach to reproduce the above and I am getting the proper result. Please check the below:
My source data:
Here I have taken 2 additional columns using a derived column transformation and given them sample values.
Group By and aggregate:
Used max(case(condition, expression)) here.
Result in Data preview:
Try checking your projection in the source. Also, write this to a sink file and check whether it gives the correct result or not.
If it still gives the same, you can try maxIf(condition, expression) as suggested by @Mark Kromer MSFT.
The above is also giving the same result for me.
If your source is a database, you can try the query option in the source of the data flow and use the above query.
After importing the projection, you can see the desired result in the data preview.
I'm trying to load data from a Salesforce table to an ADLS path. To perform this, I'm using a SOQL-formatted query in the source dataset (Salesforce) of an ADF pipeline copy activity. Sample below.
Select distinct `col1`, `col2`, `col3`....... from table
This pipeline is working for all the tables except two, where it fails with a HybridDeliveryException (exact error below).
I also tried pulling only 10 rows; still no luck. But the same table works without any issues when selecting all columns -> select * from table
Any suggestions are greatly appreciated.
Error:
Failure happened on 'Source' side. ErrorCode=UserErrorOdbcOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared. ,Message=ERROR [HY000] [Microsoft][DSI] (20051) Internal error using swap file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" in "Simba::DSI::DiskSwapDevice::DoFlushBlock": "[Microsoft][Support] (40635) Simba::Support::BinaryFile: Write of 57168 bytes on file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" failed: No space left on device".,Source=Microsoft.DataTransfer.ClientLibrary.Odbc.OdbcConnector,''Type=System.Data.Odbc.OdbcException,Message=ERROR [HY000] [Microsoft][DSI] (20051) Internal error using swap file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" in "Simba::DSI::DiskSwapDevice::DoFlushBlock": "[Microsoft][Support] (40635) Simba::Support::BinaryFile: Write of 57168 bytes on file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" failed: No space left on device".,Source=Microsoft Salesforce ODBC Driver,'
This might not be a complete answer, but it may be helpful for someone as a workaround.
I ran some more tests today, and when I remove the keyword "distinct" from the SOQL statement, the query works fine with no exceptions this time.
It seems like the issue occurs only with specific large tables.
But the SOQL with distinct (Select distinct col1, col2, col3.......) works fine for other, smaller tables.
My question is how to assign variables within a loop in the KQL magic command in JupyterLab. I refer to Microsoft's documentation on this subject and will base my question on the code given here:
https://learn.microsoft.com/en-us/azure/data-explorer/kqlmagic
1. First, the query below:
%%kql
StormEvents
| summarize max(DamageProperty) by State
| order by max_DamageProperty desc
| limit 10
2. Second: Convert the query result to a dataframe and assign it to the variable 'statefilter':
df = _kql_raw_result_.to_dataframe()
statefilter = df.loc[0].State
statefilter
3. This is where I would like to modify the above query and let statefilter hold multiple values (i.e., consist of different states):
df = _kql_raw_result_.to_dataframe()
statefilter = df.loc[0:3].State
statefilter
4. And finally, I would like to run my KQL query within a for loop, once for each of the values in statefilter. The syntax below may not be correct, but it gives an example of what I am looking for:
dfs = []  # an empty list to store dataframes
for state in statefilter:
    %%kql
    let _state = state;
    StormEvents
    | where State in (_state)
    | do some operations here for that specific state
    df = _kql_raw_result_.to_dataframe()
    dfs.append(df)  # store the df specific to state in the list
The reason why I am not querying all the desired states within a single KQL query is to avoid really large query results being assigned to dataframes. This is not an issue for this sample StormEvents table, which has a reasonable size, but it is for my research data, which consists of many sites and is really big. Therefore, I would like to be able to run a KQL query/analysis for each site within a for loop and assign each site's query results to a dataframe. Please let me know if this is possible, or whether there are other logical ways to do this within KQL...
There are a few ways to do it.
The simplest is to refactor your %%kql cell magic into a %kql line magic.
A line magic can be embedded in a Python cell.
Another option is to: from Kqlmagic import kql
The Kqlmagic kql method accepts a kql cell or line as a string.
You can call kql from Python.
A third way is to call the kql magic via the IPython method:
ip.run_cell_magic('kql', '', <your kql magic cell text>)
You can call it from Python.
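For example, here is a rough sketch of the first option (a %kql line magic embedded in a Python loop), following the Python-variable pattern shown in the Kqlmagic doc linked above; it assumes the connection to the Samples database is already set up:
dfs = []  # one dataframe per state
for state in statefilter:
    # the Python variable `state` is picked up by Kqlmagic on the right side of the let
    %kql let _state = state; StormEvents | where State == _state | summarize max(DamageProperty) by EventType
    dfs.append(_kql_raw_result_.to_dataframe())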
Example of using the single-line magic mentioned by Michael, with a return statement that converts the result to JSON. Without the conversion to JSON I wasn't getting anything back.
def testKQL():
    %kql DatabaseName | take 10000
    return _kql_raw_result_.to_dataframe().to_json(orient='records')
We have a JSON file as input to the Spark program (which describes the schema definition and the constraints we want to check on each column), and I want to perform some data quality checks such as NOT NULL and UNIQUE, as well as datatype validations (i.e., check whether the CSV file contains data that conforms to the JSON schema or not).
JSON File:
{
  "id": "1",
  "name": "employee",
  "source": "local",
  "file_type": "text",
  "sub_file_type": "csv",
  "delimeter": ",",
  "path": "/user/all/dqdata/data/emp.txt",
  "columns": [
    {"column_name": "empid", "datatype": "integer", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "empname", "datatype": "string", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "salary", "datatype": "double", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "doj", "datatype": "date", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "location", "datatype": "string", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]}
  ]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that:
1) I have intentionally put invalid data in the empId and name fields (check the last record).
2) The number of columns in the JSON file is not fixed.
Question:
How can I check whether an input data file contains all the records as per the datatypes given in the JSON file or not?
We have tried the following:
1) If we try to load the data from the CSV file into a data frame by applying an external schema, then the Spark program immediately throws a cast exception (NumberFormatException, etc.) and terminates abnormally. But I want to continue the execution flow and log the specific error as "Datatype mismatch error for column empID".
The above scenario only works when we call some RDD action on the data frame, which I felt was a weird way to validate the schema.
Please guide me on how we can achieve this in Spark.
I don't think there is a free lunch here; you have to write this process yourself, but the process you can follow is (see the sketch after this list):
1. Read the CSV file as a Dataset of Strings, so that every row is read successfully.
2. Parse the Dataset using a map function to check for null or datatype problems per column.
3. Add an extra two columns: a boolean called something like validRow and a String called something like message or description.
4. With the parser mentioned in step 2, do some sort of try/catch or a Try/Success/Failure for each value in each column, catch the exception, and set the validRow and description columns accordingly.
5. Do a filter and write the successful DataFrame/Dataset (validRow set to true) to a success location, and write the error DataFrame/Dataset to an error location.
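Here is a minimal PySpark sketch of that process (using try/except in place of Scala's Try/Success/Failure); the column positions, expected types, and output paths are hard-coded from the sample for illustration, but in practice they would be driven by the JSON definition:
# Minimal sketch only: validates two columns of the sample emp.txt file and
# splits valid/invalid rows; output paths are illustrative.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

def validate(line):
    fields = line.split(",")
    errors = []
    try:
        int(fields[0])      # empid must be an integer
    except (ValueError, IndexError):
        errors.append("Datatype mismatch error for column empId")
    try:
        float(fields[2])    # salary must be a double
    except (ValueError, IndexError):
        errors.append("Datatype mismatch error for column salary")
    return Row(raw=line, validRow=(len(errors) == 0), description="; ".join(errors))

lines = spark.read.text("/user/all/dqdata/data/emp.txt").rdd.map(lambda r: r[0])
header = lines.first()
validated = spark.createDataFrame(lines.filter(lambda l: l != header).map(validate))

validated.filter("validRow = true").write.mode("overwrite").csv("/user/all/dqdata/out/valid")
validated.filter("validRow = false").write.mode("overwrite").csv("/user/all/dqdata/out/errors")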
Hello, I pushed some rows into a BigQuery table as follows:
errors = client.insert_rows("course-big-query-python.api_data_set_course_33.my_table_aut33",[string_tuple], selected_fields = schema2)
assert errors == []
However, when I check the result in the visual interface, I see that the actual table size is 0.
When I check the streaming buffer statistics, I can see the table with the rows successfully inserted:
I also executed a query against the table, and the result appears to be stored in a temporary table, as follows:
So I would appreciate support with inserting the data into the corresponding table rather than a temporary table.
To load data into BigQuery, you can either stream it or batch it in.
If you choose streaming, data will go into a temporary space first, until it gets consolidated into the table.
You can find a longer description of how a streaming insert works here:
https://cloud.google.com/blog/products/gcp/life-of-a-bigquery-streaming-insert
If you want to batch instead of stream, use jobs.load instead of insert_rows.
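As a hedged sketch of the batch route (assuming the same google-cloud-bigquery Python client as in the question and that the destination table already exists), a load job could look roughly like this; the row contents and column names are placeholders:
# Rough sketch of a batch load (load job) instead of a streaming insert;
# column names below are placeholders, not taken from the question.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "course-big-query-python.api_data_set_course_33.my_table_aut33"

rows = [
    {"col1": "value1", "col2": "value2"},  # placeholder row
]

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_json(rows, table_id, job_config=job_config)
load_job.result()  # waits until the load job completes

print(client.get_table(table_id).num_rows)  # rows land in the table itself, not a buffer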
There is NO documentation regarding how to convert PCollections into the PCollections necessary for input into .CoGroupByKey().
Context
Essentially I have two large PCollections and I need to be able to find the differences between the two, for type II ETL changes (if a record doesn't exist in pColl1, then add it to a nested field found in pColl2), so that I am able to retain the history of these records from BigQuery.
Pipeline Architecture:
Read BQ tables into 2 PCollections: dwsku and product.
Apply a CoGroupByKey() to the two sets to return --> Results
Parse results to find and nest all changes in dwsku into product.
Any help would be appreciated. I found a Java link on SO that does the same thing I need to accomplish (but there's nothing on the Python SDK).
Convert from PCollection<TableRow> to PCollection<KV<K,V>>
Is there documentation/support for Apache Beam, especially the Python SDK?
In order to get CoGroupByKey() working, you need to have PCollections of tuples, in which the first element is the key and the second is the data.
In your case, you said that you have BigQuerySource, which in the current version of Apache Beam outputs a PCollection of dictionaries (code), in which every entry represents a row of the table that was read. You need to map this PCollection to tuples as stated above. This is easy to do using ParDo:
import apache_beam as beam

class MapBigQueryRow(beam.DoFn):
    def process(self, element, key_column):
        key = element.get(key_column)
        yield key, element

# `p` is an existing beam.Pipeline
data1 = (p
         | "Read #1 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #1"))
         | "Map #1 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_1"))

data2 = (p
         | "Read #2 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #2"))
         | "Map #2 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_2"))

co_grouped = ({"data1": data1, "data2": data2} | beam.CoGroupByKey())

# do your processing with co_grouped here
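As a hypothetical follow-up to the "processing" comment above: each element emitted by CoGroupByKey here is a (key, grouping) pair, where grouping maps "data1" and "data2" to the rows from the two reads that share that key, so the downstream step could look something like this:
# Hypothetical downstream step: emit keys that appear in dwsku (data1) but are
# missing from product (data2); the transform label and logic are illustrative.
def find_missing(element):
    key, grouped = element
    dwsku_rows = list(grouped["data1"])
    product_rows = list(grouped["data2"])
    if dwsku_rows and not product_rows:
        yield key

missing_in_product = co_grouped | "Find missing keys" >> beam.FlatMap(find_missing)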
BTW, documentation of Python SDK for Apache Beam can be found here.