I am currently building a DF in ADF where i am converting the below Query which is already placed in another ETL tool called BigDecission. The Query looks like below
SELECT
Asset_ID,
MAX(CASE WHEN meter = 'LTPC' THEN reading_date ELSE NULL END) AS LTPC_Date,
MAX(CASE WHEN meter = 'LTPC' THEN page_Count ELSE NULL END) AS LTPC,
FROM
mv_latest_asset_read
GROUP BY
Asset_ID
While converting this piece in ADF DF i have used AGGREGATE transform and done GROUP BY "ASSET_ID" .
In the AGGREGATES Tab i am deriving the column "LTPC_DATE" and "LTPC" with below mentioned code.
LTPC_DATE ---- > max(case(METER=='LTPC',READING_DATE))
LTPC ---- > max(case(METER=='LTPC',PAGE_COUNT))
But in the output i am getting null values which shouldn't be the case. Can anyone identify the right way to do it.
I followed the same approach to reproduced above and getting proper result.
Please check the below:
My source data:
Here I have taken 2 additional columns using derived column transformation and giving a sample value.
Group By and aggregate:
Used max(case(condtion,expression)) here.
Result in Data preview:
Try to check your projection in the source. Also, transform this to a sink file and check if it gives correct result or not.
If it still gives same, you can try maxIf(condition, expression) as suggested by #Mark Kromer MSFT.
The above also giving the same result for me.
If your source is a database, you can try query option in the source of dataflow and give the above query.
After Importing projection, you can see the desired result in the Data preview.
I am extremely new to Python. I've created a DataFrame using a csv file. My file is a complex nested json file having header values at the lowest granular level.
[Example] df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
Requirement: I have to create multiple smaller csvs using each of the columns that are granular and ID1, fullID2.
My approach that I figured is: save the smaller slices by splitting on the header value.
Problem 1: Not able to split the value correctly or traverse to the first location for comparison.
[Example]
I'm using df.columns.str.split('.').tolist(). Suppose I get the below listed value, I want to compare seedValue of id with seedValue of value1 and pull out this entire part as a new df.
{['seedValue','id'],['seedValue'.'value1'], ['seedValue'.'value2']}
Problem 2: Adding ID1 and fullID2 to this new df.
Any help or direction to achieve this would be super helpful !
[Final output]
df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
post-processing the file -
seedValue.columns = ID1,fullID2,id,value1,value2
total.columns = ID1,fullID2,count,value
seedValue.largeFile.columns = ID1,fullID2,id,value1,value2
While I do not possess your complex data to provide a more particular solution. I was able to reproduce a similar case with a .csv sample data, which will exemplify how to achieve what you aim with your data.
In order to save in each ID in a different file, we need to loop through the ID's. Also, assuming there might be more duplicate ID's, the script will save each group of ID's into a .csv file. Below is the script, already with sample data:
import pandas as pd
import csv
my_dict = { 'ids' : [11,11,33,55,55],
'info' : ["comment_1","comment_2", "comment_3", "comment_4", "comment_5"],
'other_column': ["something", "something", "something", "", "something"]}
#Creating a dataframe from the .csv file
df = pd.DataFrame(my_dict)
#sorting the value
df = df.sort_values('ids')
#g=df.groupby('ids')
df
#looping through each group of ids and saving them into a file
for id,g in df.groupby('ids'):
g.to_csv('id_{}.csv'.format(id),index=False)#, header=True, index_label=False)
And the output,
id_11.csv
id_33.csv
id_55.csv
For instance, within id_11.csv:
ids info other_column
11 comment_1 something
11 comment_2 something
Notice that, we use the field ids in the name of each file. Moreover, index=False which means that a new column with indexes for each line of data won't be created.
ADDITIONAL INFO: I have used the Notebook in AI Platform within GCP to execute and test the code.
Compared to the more widely known pd.read_csv, pandas offers more granular json support through pd.json_normalize, which allows you to specify how to unnest the data, which meta-data to use, etc.
Apart from this, reading nested fields from a csv into a two-dimensional dataframe might not be the ideal solution here, and having nested objects inside a dataframe can often be tricky to work with.
Try to read the file as a pure dictionary or list of dictionaries. You can then loop through the keys and create a custom logic to check how many more levels you want to go down, how to return the values and so on. Once you are on a lower level and prefer to have this inside of a dataframe, create a new temporary dataframe, then append these parts together inside the loop.
I looked around at other questions but couldn't find out that addresses the issue I'm having. I am cleaning a data set in an ipython notebook. When I run the cleaning tasks individually they work as expected, but I am having trouble with the replace() and drop() functions when they are included in a UDF. Specifically, these lines aren't doing anything within the UDF, however, a dataframe is returned that completes the other tasks as expected (i.e. reads in the file, sets the index, and filters select dates out).
Any help is much appreciated!
Note that in this problem the df.drop() and df.replace() commands both work as expected when executed outside of the UDF. The function is below for your reference. The issue is with the last two lines "station.replace()" and "station.drop()".
def read_file(file_path):
'''Function to read in daily x data'''
if os.path.exists(os.getcwd()+'/'+file_path) == True:
station = pd.read_csv(file_path)
else:
!unzip alldata.zip
station = pd.read_csv(file_path)
station.set_index('date',inplace=True) #put date in the index
station = station_data[station_data.index > '1984-09-29'] #removes days where there is no y-data
station.replace('---','0',inplace=True)
station.drop(columns=['Unnamed: 0'],axis=1,inplace=True) #drop non-station columns
There was a mistake here:
station = station_data[station_data.index > '1984-09-29']
I was using an old table index. I corrected it to:
station = station[station.index > '1984-09-29']
Note, I had to restart the notebook and re-run it from the top for it to work. I believe it was an issue with conflicting table names in the UDF vs. what was already stored in memory.
There is NO documentation regarding how to convert pCollections into the pCollections necessary for input into .CoGroupByKey()
Context
Essentially I have two large pCollections and I need to be able to find differences between the two, for type II ETL changes (if it doesn't exist in pColl1 then add to a nested field found in pColl2), so that I am able to retain history of these records from BigQuery.
Pipeline Architecture:
Read BQ Tables into 2 pCollections: dwsku and product.
Apply a CoGroupByKey() to the two sets to return --> Results
Parse results to find and nest all changes in dwsku into product.
Any help would be recommended. I found a java link on SO that does the same thing I need to accomplish (but there's nothing on the Python SDK).
Convert from PCollection<TableRow> to PCollection<KV<K,V>>
Is there a documentation / support for Apache Beam, especially Python SDK?
In order to get CoGroupByKey() working, you need to have PCollections of tuples, in which the first element would be the key and second - the data.
In your case, you said that you have BigQuerySource, which in current version of Apache Beam outputs PCollection of dictionaries (code), in which every entry represents a row in the table which was read. You need to map this PCollections to tuples as stated above. This is easy to do using ParDo:
class MapBigQueryRow(beam.DoFn):
def process(self, element, key_column):
key = element.get(key_column)
yield key, element
data1 = (p
| "Read #1 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #1"))
| "Map #1 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_1"))
data2 = (p
| "Read #2 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #2"))
| "Map #2 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_2"))
co_grouped = ({"data1": data1, "data2": data2} | beam.CoGroupByKey())
# do your processing with co_grouped here
BTW, documentation of Python SDK for Apache Beam can be found here.
How do I use IronPython to set the Data Limit By Expression field on a visualization?
(I mean an example of a simple IP script to set the Data Limit By Expression field on a visualization, I couldn't find one on the internet)
for any type of chart, if this is the only operation you need to do, you can use code like:
from Spotfire.Dxp.Application.Visuals import Visualization
viz = v.As[Visualization]()
print viz.Data.WhereClauseExpression # prints Python's nil value None
viz.Data.WhereClauseExpression = "[Column] = 'Value'"
print viz.Data.WhereClauseExpression # prints the above expression
in this example, v is a parameter pointing to the desired visualization. you could also look it up by name or ID or some other method.
if you're already manipulating this visualization with a script and just want to add a data limit, you can add this to your existing script without importing the Visualization class. every visualization type's Data object has this WhereClauseExpression property