ADF Dataflow: Query about transformation - Azure

I am trying to transform source data using ADF Dataflow.
My source data is as follows:
Id,Name,Direction
123, ABC, North
123, ABC, South
123, ABC, East
123, ABC, West
I want the output to be:
Id,Name,Direction1,Direction2,Direction3,Direction4
123, ABC, North, South, East, West
What if there is more than one column with multiple values?
For example, the source data is:
Id,Name,Direction,Status
123, ABC, North, A
123, ABC, South, B
123, ABC, East, C
123, ABC, West, D
I want the output to be:
Id,Name,Direction1,Direction2,Direction3,Direction4, Status1, Status2, Status3, Status4
123, ABC, North, South, East, West, A,B,C,D
Can this be done in ADF Dataflow?
I have tried the pivot transformation in ADF Dataflow but am unable to crack the logic.
Thanks in advance.

You can use the collect(column) aggregate function in two pivot transformations, one for each column.
Please follow the demonstration below to get the desired result (for this specific case).
Use two pivot transformations, one for the Direction column and one for the Status column, and then use a join transformation to combine their outputs.
For some reason this gives errors when writing the result to a CSV sink, so use a cache sink and download the result from the data preview.
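For reference, the same reshaping can be sketched in pandas; this is only an illustration of the pivot logic on the sample rows above (assuming a recent pandas version), not ADF Dataflow syntax:

import pandas as pd

# Sample source rows from the question
df = pd.DataFrame({
    "Id": [123, 123, 123, 123],
    "Name": ["ABC", "ABC", "ABC", "ABC"],
    "Direction": ["North", "South", "East", "West"],
    "Status": ["A", "B", "C", "D"],
})

# Number each row within its Id/Name group, then pivot so the Direction
# and Status values spread into Direction1..4 and Status1..4 columns.
df["n"] = df.groupby(["Id", "Name"]).cumcount() + 1
wide = df.pivot(index=["Id", "Name"], columns="n", values=["Direction", "Status"])
wide.columns = [f"{name}{n}" for name, n in wide.columns]
print(wide.reset_index())

The renaming step flattens the (column, n) pairs produced by the pivot into the Direction1..Direction4 and Status1..Status4 names asked for in the question.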

Related

How to truncate the data to the first 3 letters in data flow?

I want to truncate data if unit=code.
Input:
Country, unit
India, code
Bangladesh, money
China, code
Output:
Country, unit
Ind, code
Bangladesh, money
Chi, code
What I tried:
I used a case expression in the data flow but was not able to truncate the data to a 3-letter code.
You can use the left() function in the data flow to get the first three characters of the data.
I repro'd this with the sample input.
Take a derived column transformation and set the expression for the Country column to case(unit=='code', left(Country,3), Country).
This produces the expected output shown above.
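For comparison, the same conditional truncation can be sketched in pandas using the sample rows from the question (an illustration only, not data flow syntax):

import pandas as pd

df = pd.DataFrame({"Country": ["India", "Bangladesh", "China"],
                   "unit": ["code", "money", "code"]})

# Equivalent of case(unit=='code', left(Country,3), Country):
# keep only the first three characters where unit is 'code'.
df["Country"] = df["Country"].where(df["unit"] != "code", df["Country"].str[:3])
print(df)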

Create a parent child tree dictionary based on two columns in dataframe

Suppose I have a dataframe containing countries and cities that looks like this:
data = {'Parent':['Netherlands','Belgium','Germany','France'],'Child':['Amsterdam','Brussels','Berlin', '']}
I want to create a tree dictionary depicting which city belongs to which country.
In this example the country France has no cities; I don't want to have the empty values as child nodes in the dictionary.
Can someone point me into the right direction on what to use to reach this solution?
Kind regards
Using pandas:
import pandas as pd

df = pd.DataFrame(data)
df.transpose()
                  0         1        2       3
Parent  Netherlands   Belgium  Germany  France
Child     Amsterdam  Brussels   Berlin
Or using zip:
dict(zip(*data.values()))
{'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin', 'France': ''}
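If the empty child values should be excluded, as the question asks, a dict comprehension over the same zip can filter them out; a minimal sketch using the sample data:

# Skip parents whose child value is empty (France in the sample data).
tree = {parent: child for parent, child in zip(data['Parent'], data['Child']) if child}
print(tree)
# {'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin'}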

Sorting a multi-column grouped data frame

I'm working with the drinks-by-country data set and want to find the mean beer servings of each country in each continent, sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, paraguay 213
and like that for the other continents (Don't want to mix countries of different continents!)
Creating the grouped data without the sorting is quite easy like below:
ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
But how do I do the sorting?
Thanks a lot.
In this case you can just sort values by 'continent' and 'beer_servings' without applying .mean():
import pandas as pd

ddf = pd.read_csv('drinks.csv')
#sorting by continent and beer_servings columns
ddf = ddf.sort_values(by=['continent','beer_servings'], ascending=True)
#making the dataframe with only needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
#exporting to csv
ddf.to_csv("drinks1.csv")
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
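If the grouped mean and the highest-to-lowest order from the question are still wanted (for example, if a country could appear more than once), a sketch that keeps .mean() and sorts descending within each continent could look like this; column names follow the question:

import pandas as pd

ddf = pd.read_csv('drinks.csv')

# Mean beer servings per country within each continent,
# then sort descending inside each continent.
result = (ddf.groupby(['continent', 'country'])['beer_servings']
             .mean()
             .reset_index()
             .sort_values(['continent', 'beer_servings'],
                          ascending=[True, False]))
print(result)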

Creating cross-tabulation table from panda data frame

I am new to cross tables. I have 3 data frames (df_1, df_2, df_3) created in my Jupyter Notebook:
df_1
person ID, date, price, location, code
542, 12/04/12, $2.5, 66234, 103
df_2
Brand, region, location
AA, Texas, 66234
BB, SF, 15467
df_3
person ID, First, Last, Region
542, Tom, Barker, Texas
I need to combine these data frames to get something like this:
Brand, code, Texas, SF
AA, 103, $2.5, $3.8
Any idea where I should start?
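One possible starting point, sketched in pandas under the assumption that df_1 and df_2 join on the shared location column and that each region becomes its own column, might be:

import pandas as pd

# Sample frames from the question (only the columns needed here)
df_1 = pd.DataFrame({"person ID": [542], "price": ["$2.5"],
                     "location": [66234], "code": [103]})
df_2 = pd.DataFrame({"Brand": ["AA", "BB"], "region": ["Texas", "SF"],
                     "location": [66234, 15467]})

# Join on 'location', then pivot so each region becomes a column.
merged = df_1.merge(df_2, on="location")
table = merged.pivot_table(index=["Brand", "code"], columns="region",
                           values="price", aggfunc="first")
print(table.reset_index())

With only the sample rows shown, just the Texas column is populated; more rows per region would fill the remaining columns.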

How do I select only a specific column from a Dataset after sorting it?

I have the following table:
DEST_COUNTRY_NAME  ORIGIN_COUNTRY_NAME  count
United States      Romania                 15
United States      Croatia                  1
United States      Ireland                344
Egypt              United States           15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I want to sort the table based on the count column and see only the count column. I have done it, but in 2 steps:
1- First sort to get a sorted DS: dataDS.sort(col("count").desc)
2- Then select on that DS: (dataDS.sort(col("count").desc)).select(col("count")).show();
The above feels like an embedded SQL query to me. In SQL, however, I can do the same query without using an embedded query:
select * from flight_data_2015 ORDER BY count ASC
Is there a better way for me to both sort and select without creating a new Dataset?
There is nothing wrong with:
(dataDS.sort(col("count").desc)).select(col("count")).show();
It is the right thing to do and has no negative performance implications other than the intrinsic cost of sorting as such.
Use it freely and don't worry about it anymore.
