I'm having trouble getting the following code to sort:
I first imported the dataframe with etl.
I checked whether the "quantity" column is numeric (the raw data contains numerous errors).
I tried to sort the "quantity" column by the largest amounts ('nlargest' doesn't work - not sure why).
I was supposed to sort in descending order; I tried a bunch of different combinations, but no luck.
I'm wondering whether the steps I chose to solve this problem are correct, or whether I'm missing something in the syntax... Really appreciate any help, thanks!
import petl as etl
table = etl.fromdataframe(df)
table = etl.select(table, 'quantity', lambda quantity: quantity.isnumeric())
table2 = etl.head(table, 5)
table2
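For the missing sort step, here is a minimal sketch assuming the petl API ('nlargest' is a pandas method, not a petl one; in petl a descending sort is etl.sort(..., reverse=True)). The column is converted to int first so the ordering is numeric rather than lexicographic:
import petl as etl
table = etl.fromdataframe(df)
# keep only rows whose 'quantity' parses as a number
table = etl.select(table, 'quantity', lambda q: str(q).isnumeric())
# convert to int so the sort compares numbers, not strings
table = etl.convert(table, 'quantity', int)
# sort descending and take the five largest quantities
table2 = etl.head(etl.sort(table, 'quantity', reverse=True), 5)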
I have a table with many columns. Three of these columns are:
Package Name (text)
Units Required (Int64)
Assessment (Int64)
What I am trying to do is find the 'minimum' "Package Name": first select the row with the smallest number of "Units Required"; then, because there are sometimes several instances where the number of required units is the same, take the row with the lowest "Assessment".
I am exploring the Table.Group() approach but I am not getting anywhere with my understanding of it. I am doing this in Power Query in Excel 365.
Pseudocode would be something like:
Table.Group("Previous Step Name",{"Package Name"},{MIN("Units Required"),MIN("Assessment")})
As an aside - is it possible to use a single Table.Group and group at two levels, such as "Package Name" and "Column X", so that the result would be: for each "Package Name", then for each "Column X" within that "Package Name" (nested, as it were)?
Thank you in advance for taking a look at this.
Any help greatly appreciated.
Cheers
The Frog
I think you have to do it step by step.
Data
Queries
Load_Data
Load data from Excel table
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content]
in
Source
Min_Unit
Identify the minimum unit by grouping with an empty "group by" list.
let
Source = Load_Data,
Group = Table.Group(Source, {}, {{"Min_Unit", each List.Min([Units Required]), type number}})
in
Group
Min_Unit_And_Assessment
Use an inner join to filter the original data to entries whose "Units Required" equals Min_Unit. Next, group by "Units Required" to get the minimum "Assessment".
let
Source = Table.NestedJoin(Load_Data, {"Units Required"}, Min_Unit, {"Min_Unit"}, "Min_Unit", JoinKind.Inner),
Group = Table.Group(Source, {"Units Required"}, {{"Min_Assessment", each List.Min([Assessment]), type nullable number}})
in
Group
Result
Use an inner join to filter the original data to the combination of Min_Unit and Min_Assessment.
let
Source = Table.NestedJoin(Load_Data, {"Units Required", "Assessment"}, Min_Unit_And_Assessment, {"Units Required", "Min_Assessment"}, "Min_Unit_And_Assessment", JoinKind.Inner),
RemoveUnnecessaryColumns = Table.RemoveColumns(Source,{"Min_Unit_And_Assessment"})
in
RemoveUnnecessaryColumns
Qualia, thank you for pointing me in the right direction.
The way that I solved this was really simple in the end!
Step 1: Sort the rows based on the grouping criteria (package name, system class) in that order
Step 2: Add an Index Column so each row has a unique ID to work with
Step 3: Group the table based on the same fields (package name, system class) and 'aggregate' on the lowest Index Number (MIN)
Step 4: Perform a 'Merge Queries' with a Left Outer Join, using the Index number as the matching field between your current step and the earlier step where the Index was added. The rows are then matched, and only the rows needed will match, since the others are gone thanks to the MIN aggregation from earlier. Here is my example:
Table.NestedJoin(#"Grouped Rows", {"Winner"}, #"Added Index", {"Index"}, "Lookup Data", JoinKind.LeftOuter)
- Grouped Rows was the grouping step (Step 3)
- Winner is the name of the Index that had the minimum value
- Added Index was the last step before grouping that still had all the columns (Step 2)
- Index is the column that was added after the sort to uniquely number each row
Step 5: Expand the table and select the columns of data that you want to hang onto
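For comparison only, a rough pandas sketch of the same recipe (with hypothetical column names - the thread itself is Power Query, so this just illustrates the idea):
import pandas as pd

# hypothetical data standing in for the real table
df = pd.DataFrame({'package_name': ['A', 'A', 'B'],
                   'units_required': [5, 3, 4],
                   'assessment': [2, 1, 7]})
# Step 1: sort so the wanted row comes first within each group
df = df.sort_values(['package_name', 'units_required', 'assessment'])
# Step 2: add a unique index to each row
df['row_id'] = range(len(df))
# Step 3: group and keep the lowest index per group
winners = df.groupby('package_name')['row_id'].min()
# Steps 4-5: match back to recover the full winning rows
result = df[df['row_id'].isin(winners)]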
Treating it a bit like a database was a good approach and I appreciate the suggestion you put together for me. Hopefully this will allow others to solve some of their problems too.
Cheers and many thanks
The Frog
I don't even know if groupby is the correct function to use for this. It's a bit hard to explain, so I'll include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns, because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns are specific to the first row and the last few columns are specific to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several methods using groupby('equipment name') and several using dropna, but none works the way I need it to. I'm also open to separating it into multiple dataframes.
Any method is acceptable; this bug has been driving me crazy. It took me a while to get to this point, because this started out as an unintelligible 10,000-line JSON. I'm pretty new to programming as well.
This is a very cool answer that could be one option - and it does use groupby, so sorry for dismissing it! This will group your data into DataFrames where each DataFrame has a unique group of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows which share the exact same columns, this solution is ideal, I think.
Just to note, though, if your null values are more randomly spread out throughout the dataset, or if one row in a group of rows is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and then more output DataFrames.
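For reference, the linked approach boils down to grouping rows by their pattern of null columns. A minimal sketch (df.isnull().dot(df.columns) concatenates the names of each row's null columns into a single string key):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [np.nan, np.nan, 3]})
# rows 0 and 1 share the key 'B' (column B is null); row 2 has the key 'A'
groups = {key: g.dropna(axis=1) for key, g in df.groupby(df.isnull().dot(df.columns))}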
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but each entry could be converted to a DataFrame):
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contains the DataFrames. Basically you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as it looks like in your data), so here's a slight modification:
d = {}
for i, (name, group) in enumerate(df.groupby(df.isnull().dot(df.columns))):
    d['df' + str(i)] = group.dropna(axis=1)
Now the dictionary keys are "df#" and the values are the DataFrames. So if you wanted to inspect the contents of one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call
for df in d.values():
    print(df.head())  # you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
    d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create either a single DataFrame (equal to the original) or multiple DataFrames with some amount of missing data.
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
I thought your last comment might mean that objects of similar type might not share the same columns; say, for example, if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This may be the type of case where it might be easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could again then save these DataFrames as CSV files or look at them one by one (or, e.g., search for Exhaust Fans by seeing if "Exhaust" is in the key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]:  # using [:5] to limit output
    print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. But note that even though the printed data are tabular, they can't be made into a DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.
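For example, a minimal sketch of dumping every row's table into one searchable text file (reusing df_dict and natsort from the code above):
with open('rows.txt', 'w') as f:
    for key in natsort.natsorted(df_dict.keys()):
        # to_string() preserves the tabular layout shown above
        f.write(df_dict[key].to_string() + '\n\n')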
I have a data frame where one of the columns is days.to.play. Now I want to drop any rows where it is more than 1000. The name of the data frame is basketball_football_2.
I tried a couple of solutions, such as:
basketball_football_2.loc[~(basketball_football_2['days.to.play'] > 1000)]
OR
basketball_football_2['day.to.play'] = basketball_football_2[basketball_football_2['day.to.play'] >= 1000]
OR
basketball_football_2.drop(basketball_football_2.loc[basketball_football_2['days.to.play']>=1000].index, inplace=True)
but it is dropping all the values and making the entire data frame empty.
As already provided by @Marek, here's the link which should help you: Deleting DataFrame row in Pandas based on column value
Here's the code just for the easy reference purpose:
basketball_football_2 = basketball_football_2[basketball_football_2['days.to.play'].values < 1000]
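If you prefer to drop in place, a sketch of an equivalent (note also that the second attempt above references 'day.to.play', singular, while the column is 'days.to.play'; a mismatched name like that can easily cause surprising results):
mask = basketball_football_2['days.to.play'] > 1000
basketball_football_2.drop(index=basketball_football_2.index[mask], inplace=True)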
Note: It's really a bad practice to use "." in your column names. Instead use something like "_".
Good Luck!
May the CODE be with you!
I have the following data model:
Record: Id, ..., CreateDate
FactA: RecordId, CreateDate
FactB: RecordId, CreateDate
Relationships exist from FactA to Record and FactB to Record.
I've written measures on Records such as this with no issues:
FactA's:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactA)
FactB's:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactB)
Now I'd like a count of Records with FactA but no FactB. In SQL I'd do a LEFT JOIN with WHERE FactB.RecordId IS NULL, but I can't figure out how to do something similar in DAX. I've tried:
-- this returns blank, presumably because when there is a FactB then RecordId isn't blank, and when there is no FactB then RecordId is NULL, which isn't blank either
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactA, FILTER(FactB, ISBLANK([RecordId])))
-- this returns the long "The value for columns "RecordId" in table "FactB" cannot be determined in the current context" error.
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[Id]), FILTER(FactA, ISBLANK(FactB[RecordId])))
I've also tried various ways of using RELATED and RELATEDTABLE but I don't really understand enough about DAX and context to know what I'm doing.
Can someone explain how I can write the calculated measure to count Records with FactA but no FactB?
Thanks in advance.
Edit - Workaround
I've come up with this; it looks correct so far, but I'm not sure whether it is the generally correct way to do this:
-- Take the count with FactA and subtract the count of (FactA and FactB)
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactA) - CALCULATE(DISTINCTCOUNT(Records[Id]), FactA, FactB)
Here's an alternative that might still not be the best way of doing it:
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[ID]), FILTER(Records,CONTAINS(FactA, FactA[RecordID],Records[ID]) && NOT(CONTAINS(FactB,FactB[RecordID],Records[ID]))))
The difference between my version and yours is that mine returns a value of 1 for those items in A but not B, and BLANK for everything else. Your version returns 1 for those items in A but not B, 0 for those in both A and B, and BLANK for everything else. Depending on your use case, one outcome may be preferable over the other.
Problem
I have two queries: one contains product data (data_query); the other (recode_query) contains product names from within the data_query and assigns them specific id_tags. id_tags are also column names within the data_query.
What I need to achieve and am failing at
I need the data_query to look at the id_tag of the specific product name within the data_query, as parsed from the recode_query (this is already working and in place), and put the retrieved value in the specific custom column cell. In Excel, I would use an INDEX/MATCH combo:
{=INDEX(data_query[#Data];; MATCH(data_query[#id_tag]; data_query[#Headers]; 0))}
I have searched near and far, but I probably couldn't even spot the solution if I came across it, as I am not that deep into data manipulation and Power Query myself.
Is this what you're wanting?
let
DataQuery = Table.FromColumns({{1,2,3}, {"Boxed", "Bagged", "Rubberbanded"}}, {"ID","Pkg"}),
RecodeQuery = Table.FromColumns({{"Squirt Gun", "Coffee Maker", "Trenching Tool"}, {1,2,3}}, {"Prod Name", "ID2"}),
Rzlt = Table.Join(DataQuery, "ID", RecodeQuery, "ID2", JoinKind.Inner)
in
Rzlt