I have an input DataFrame containing multiple list columns with an unequal number of elements within the lists. I need to expand all the list columns into rows so that each bin has the corresponding value in the same row.
Code for generating the df:
import pandas as pd
df_dict = {'vin':['VIN123','VIN123','VIN123','VIN234','VIN345'],
'date':['01-22-2022','01-23-2022','01-23-2022','01-23-2022','01-22-2022'],
'celltype':['A','A','B','A','B'],
'soc_bins':[['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170'],['0-10','10-20','50-80','85-90','100-150','150-170']],
'soc_value': [[10,300,85,20,5,0],[20,400,125,670,5,7],[20,500,55,60,9,9],[40,300,65,90,1,0],[20,700,35,50,2,0]],
'temp_bins':[['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f'],['50f-55f','60f-70f','90f-110f']],
'temp_value':[[1,2,3],[4,3,4],[5,3,5],[6,900,7],[3,600,9]]}
df = pd.DataFrame(df_dict)
Input_df:
Output_df:
vin     date        celltype  soc_bins  soc_value  temp_bins  temp_value
VIN123  01-22-2022  A         0-10      10         50f-55f    1
VIN123  01-22-2022  A         10-20     300        60f-70f    2
In short, each value in the soc_value column corresponds to the bin at the same position in the soc_bins column, and the same goes for the temp columns.
A few problems I encountered using the explode method or similar methods are:
The number of bins in soc_bins (6) and temp_bins (3) are not equal.
Also, the same value can occur for two bins (e.g. in the 3rd row, soc_value contains the value 9 twice), so when I first expand the soc_value column there is no way for the explode function to identify the two rows as different, and hence I am getting the error "cannot handle a non-unique multi-index!"
There are many more columns that have to be manipulated in the same way.
I can use df.set_index(['date','vin','celltype']).apply(lambda x: x.apply(pd.Series).stack()).reset_index() but I am getting NaNs in the indexed columns.
To fill the NaNs I can use .ffill(), but then I am unable to distinguish them from original null values.
Also, with this method, if some of the index values are null I am getting the error "cannot handle a non-unique multi-index!"
Current output:
Required output: I need output similar to my current output but without the null values. I could use .ffill() to fill the null values, but then I am unable to differentiate the actual null values from the ones created by df.set_index().
Assigning a row_number to the df before exploding it has solved the "cannot handle a non-unique multi-index!" issue.
import numpy as np
df['row_number'] = np.arange(len(df))  # unique key per original row
# row_number is part of the index so that every index entry is unique
df.set_index(['row_number','date','vin','celltype']).apply(lambda x: x.apply(pd.Series).stack()).reset_index()
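An alternative that avoids the NaNs in the identifier columns can be sketched as follows. This is only a sketch under assumptions: the sample frame is a shortened, hypothetical version of the one above, and the names id_cols, list_cols, parts, wide and out are made up for illustration. Each list column is exploded on its own, keyed by (original row, position in the list), and the identifier columns are joined back from the original frame, so any null that appears in them would be an original null.

```python
import pandas as pd

# Shortened, hypothetical version of the frame from the question.
df = pd.DataFrame({
    "vin": ["VIN123", "VIN234"],
    "date": ["01-22-2022", "01-23-2022"],
    "celltype": ["A", "B"],
    "soc_bins": [["0-10", "10-20", "50-80"], ["0-10", "10-20", "50-80"]],
    "soc_value": [[10, 300, 85], [20, 9, 9]],
    "temp_bins": [["50f-55f", "60f-70f"], ["50f-55f", "60f-70f"]],
    "temp_value": [[1, 2], [4, 3]],
})

id_cols = ["vin", "date", "celltype"]
list_cols = [c for c in df.columns if c not in id_cols]

# Explode each list column separately and key it by
# (original row label, position within the list).
parts = []
for col in list_cols:
    s = df[col].explode()
    pos = s.groupby(level=0).cumcount()
    s.index = pd.MultiIndex.from_arrays([s.index, pos.to_numpy()])
    parts.append(s)

# Align all columns on (row, position); shorter lists are padded with NaN.
wide = pd.concat(parts, axis=1).sort_index()

# Re-attach the identifier columns from the original frame.
out = df[id_cols].join(wide.droplevel(1)).reset_index(drop=True)
print(out)
```

Because the position is part of the key, repeated values such as the two 9s in soc_value do not collide, and the padding NaNs only ever appear in the shorter list columns.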
In Excel connected to SSAS, I am trying to build a pivot table and add a custom Measure Calculation using "OLAP Tools" and/or "OLAP Pivot Table Extensions". I am trying to add a calculation that is really simple in my mind, but I cannot get it to work. The calc I need is:
GOAL: A record count of the [Items] dimension records grouped by any of the
[Items] dimension fields.
In particular I am trying to group by [Items].[Items Groups] and [Items].[Item]. Item is the lowest grain, so the count should return the value "1". I have created a couple of calculations that are kind of in the ballpark (see below), but the calcs don't appear to be working as desired.
What I have tried:
Attempt #1 -- [Measures].[Items Count (With net amount values)]
DISTINCTCOUNT( {[Items].[Item].MEMBERS} )
The calc 'Items Count (With net amount values)' appears to be
returning a decent count value, but it appears it only counts the Item
if there are transactional records found (not sure why). Also, when
at the lowest grain level the calc returns that value for the parent
group, not the dimension level selected on the rows.
Attempt #2 -- [Measures].[Items Count (All)]
[Items].[Item].[Item].Count
This calc returns the TOTAL item count for the entire dimension
regardless of the dimension level placed on the rows.
Attempt #3 -- [Measures].[Items Count]
COUNT ( { [Items].[Item].MEMBERS}, EXCLUDEEMPTY)
This calc freezes up Excel and I have to quit Excel. No idea why. I have seen this syntax recommended on a few different sites.
Screenshot:
Help please? This seems really simple, but I am not very skilled with MDX. In DAX and SSAS Tabular this would be a very simple expression, but I'm struggling to count the rows with MDX in SSAS MD.
The "Outside Purchased Beef" group has 18 items with transactions, but 41 items in total. I do not know how to calculate the "41" value.
Take a look at the following samples on AdventureWorks.
with member [Measures].[CountTest]
as
count(existing [Product].[Subcategory].members - [Product].[Subcategory].[All])
select
{
[Measures].[Internet Sales Amount],[Measures].[CountTest]
}
on columns,
{
([Product].[Category].[Category]
,[Product].[Subcategory].[Subcategory] -- comment this line for the second result
)
}
on rows
from [Adventure Works]
Now comment out the indicated line for the parent view.
I have a data set stored in an Excel file. When I import the data using the MATLAB function:
A = xlsread(xls_filename)
matrix A only stores the numeric values of my table. When I use another function such as:
B = readtable(xls_filename)
the table shows the complete data, including row and column headers, but when I apply an operation to it such as
Bnorm = normc(B)
it is unable to perform the normalization because of the row and column headers.
My questions are:
Is there any way to avoid the row and column headers in table B?
Is there any way to store the row and column headers when reading the table with the xlsread function, such that
column headers = first row in (xls_filename)
row headers = first column in (xls_filename)
Thanks for any suggestions.
dataset table
normalized matrix when applying xlsread(xls_filename)
The answers to your specific questions are:
With a table, you can avoid row labels but column labels always exist.
As per the doc for xlsread, the first output is the numeric data, and the second output is the text data, which in this case would include your header information.
But, in this case, you just need to learn how to work with tables properly. You want something like,
>> Bnorm = normc(B{:,2:end});
which extracts all the numeric elements of table B and uses them as input to normc.
If you want the result to be a table then use
Bnorm = B;
Bnorm{:,2:end} = normc(B{:,2:end});
I have this table which has foreign keys from several other tables:
Basically, this table shows which students registered in which module run by which teacher in what term.
I want to query the following:
How many students have registered for more than one module run by a given tutor?
It will look something like this:
For example, Vasiliy Kuznetsov runs two modules: FunPro and NO. If one student registers for both of them, he is counted as one.
My SQL-oriented mind is telling me this: count all the rows in which student_id and tutor_id are the same. For example, in one row student_id is 5 and tutor_id is 10, and the same is true for the third row. Then, I count it as one.
How can I do that with DAX formulas?
RowCount:=
COUNTROWS( ModuleRegistration )
StudentsWithTwoOrMoreRegistrations:=
COUNTROWS(
FILTER(
VALUES( ModuleRegistration[Student_ID] )
,[RowCount] >= 2
)
)
I refer to arguments positionally, thus the first argument to a function is (1), the second (2), and so on.
So, [RowCount] is trivial.
[StudentsWithTwoOrMoreRegistrations] is a bit more involved. DAX, being a functional language, is best understood inside-out.
FILTER() takes a table expression in (1) and evaluates a boolean predicate, (2), for each row in (1). It returns all rows from (1) for which (2) evaluates to true.
Our FILTER()'s (1) is VALUES( ModuleRegistration[Student_ID] ). VALUES() returns the unique rows from a field based on current filter context (it respects slicers and filters in the pivot table). Thus, we will return some subset of the unique list of [Student_ID]s.
Our FILTER()'s (2) is [RowCount] >= 2. For each [Student_ID] in (1), we'll evaluate [RowCount], checking how many times that student appears in ModuleRegistration. [RowCount] is evaluated in the combination of filter context from the pivot table (the [Faculty Name] field in your sample pivot provides filter context) and row context from FILTER()'s (1). Thus it counts how many times the student appears in ModuleRegistration for the [Faculty Name] on the pivot table row.
We check that [RowCount] is >= 2.
You've not indicated if your measure needs to handle grand totals, or how you might want to see that. If you need more help for the grand total to get it to behave the way you like, let me know.
Edit for grand total
There are a few ways you might want to handle grand totals. I'm going to assume that you want a unique count of students.
StudentsWithTwoOrMoreRegistrations:=
COUNTROWS(
SUMMARIZE(
FILTER(
SUMMARIZE(
ModuleRegistration
,ModuleRegistration[Tutor_ID]
,ModuleRegistration[Student_ID]
)
,[RowCount] >= 2
)
,ModuleRegistration[Student_ID]
)
)
WTF happened to our measure?
Let's examine:
Starting with the innermost SUMMARIZE(). SUMMARIZE() navigates relationships outward from the table in (1) and groups by the columns listed in (2)-(N) (these don't have to be from the table in (1), but must be reachable by navigating relationships).
This is equivalent to the following in SQL:
SELECT
mr.Tutor_ID
,mr.Student_ID
FROM ModuleRegistration mr
GROUP BY
mr.Tutor_ID
,mr.Student_ID
We use FILTER() on this table like earlier. [RowCount] is evaluated in the combination of filter context from the pivot table and the row in the table, defined by our SUMMARIZE() above.
Now our row context is a student-tutor pair instead of just a student. This pair will have a [RowCount] >= 2 when the student has taken more than one module from a tutor.
Our FILTER() returns the pairs which have a [RowCount] >= 2. This output table has two fields, [Tutor_ID] and [Student_ID], but we want to count distinct [Student_ID]s out of this.
Thus, we use the table from FILTER() as our (1) in the outer SUMMARIZE(). We group only by the values of [Student_ID]. We then count the rows of this table.
When only one [Faculty_Name] is in context, e.g. on a pivot table row, then our inner SUMMARIZE() is grouping by a single value of [Tutor_ID] and whatever [Student_ID]s are associated with it. This is identical to our earlier measure.
When we have many [Tutor_ID]s in context, like in the grand total, then we'll see the appropriate behavior of only counting each [Student_ID] once.
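For readers who find it easier to follow the logic outside DAX, the same steps can be sketched in pandas. This is only an illustration under assumptions: the registration rows and the Module column are hypothetical, and only Tutor_ID and Student_ID are taken from the table above. The groupby plays the role of the inner SUMMARIZE() plus [RowCount], the boolean mask is the FILTER(), and nunique() stands in for the outer SUMMARIZE() with COUNTROWS().

```python
import pandas as pd

# Hypothetical registrations: one row per student per module.
reg = pd.DataFrame({
    "Student_ID": [5, 5, 5, 5, 6, 6, 6, 7],
    "Tutor_ID":   [10, 10, 11, 11, 10, 11, 11, 11],
    "Module":     ["FunPro", "NO", "DB", "ML", "FunPro", "DB", "ML", "DB"],
})

# Inner SUMMARIZE() + [RowCount]: registrations per (tutor, student) pair.
pair_counts = reg.groupby(["Tutor_ID", "Student_ID"]).size()

# FILTER(): keep only pairs with two or more registrations.
repeat_pairs = pair_counts[pair_counts >= 2].reset_index(name="RowCount")

# One tutor in context (a pivot table row): distinct students per tutor.
per_tutor = repeat_pairs.groupby("Tutor_ID")["Student_ID"].nunique()

# Grand total: each qualifying student is counted once across all tutors.
grand_total = repeat_pairs["Student_ID"].nunique()

print(per_tutor.to_dict())  # {10: 1, 11: 2}
print(grand_total)          # 2
```

Student 5 qualifies under both tutors but contributes only once to the grand total, which is the behavior the second version of the measure is meant to give.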
I would like to calculate the sum of open positions in a receivables account. The entries in the accounting system provide three relevant columns in the source table to that end:
booking date
due (=pay) date
amount due
I would like to have a measure that I can use for a graph, showing the total of all open positions on each day.
An open position is an amount booked with a booking date before "today" and with a due date after "today".
I tried the following approach in my Power Pivot model (with three calendar tables):
booking date related to "calendar table 1"
due date related to "calendar table 2"
Date columns of "calendar table 1" and "calendar table 2" related to a third "calendar table main"
For that formula I am getting an error message:
Hm, not sufficiently proficient in PowerPivot to solve this problem.
SumAmt:=
SUM( Source_Table[Amount] )
OpenPositions:=
CALCULATE(
[SumAmt]
;FILTER(
VALUES( Source_Table[Booking_Date] )
;Source_Table[Booking_Date] < MAX( Calendar_Main[Calendar_Date] )
)
;FILTER(
VALUES( Source_Table[Due_Date] )
;Source_Table[Due_Date] > MAX( Calendar_Main[Calendar_Date] )
)
)
Your error is pretty self-explanatory. If you use a direct column reference in CALCULATE() you can only reference a single column. You are referencing two, Calendar_Main[Calendar_Date] and either Source_Table[Booking_Date] or Source_Table[Due_Date]. This is simply not allowed, so it throws the error.
The workaround is simply to wrap complex filtering logic in table expressions and use those as arguments to CALCULATE(). Pretty much, unless you are hard-coding a literal predicate for a single column, you should be using some sort of table expression, like FILTER(), as your arguments to CALCULATE().
What we do is call FILTER() twice to check the dates. We use MAX()s because we cannot perform comparisons between column references; we need to perform inequality comparisons between scalars.
Since we're FILTER()ing over Source_Table[Booking_Date] and Source_Table[Due_Date], the references to these are evaluated in row context and refer to the value of the current row in FILTER()'s iteration. The reference to Calendar_Main[Calendar_Date] is just a column reference, so we wrap it in MAX() to get a scalar value for our inequality. The MAX() refers to the current filter context coming in from the pivot table, which would be the current row label or column label.
If you aggregate to the month level, this will give you essentially the closing balance, since we're using MAX()s. At the month level the value will be identical to that on the last date of the month.
Finally, with the inequalities you've set up, you're ignoring anything opened on the current day or due on the current day. I'd expect you want [Booking_Date] <= [Calendar_Date] and [Due_Date] > [Calendar_Date].
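To sanity-check the numbers outside Power Pivot, the same rule can be sketched in pandas. This is a hedged illustration only: the three receivables and the calendar range are made-up data, open_positions is a hypothetical helper, and it uses the [Booking_Date] <= [Calendar_Date] and [Due_Date] > [Calendar_Date] variant suggested above rather than the strict inequalities in the measure.

```python
import pandas as pd

# Hypothetical receivables entries.
src = pd.DataFrame({
    "Booking_Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "Due_Date": pd.to_datetime(["2024-01-03", "2024-01-05", "2024-01-04"]),
    "Amount": [100.0, 50.0, 25.0],
})

# Stand-in for the main calendar table.
calendar = pd.date_range("2024-01-01", "2024-01-05", freq="D")

def open_positions(day):
    """Sum of amounts booked on or before `day` and due after `day`."""
    mask = (src["Booking_Date"] <= day) & (src["Due_Date"] > day)
    return src.loc[mask, "Amount"].sum()

open_by_day = pd.Series({day: open_positions(day) for day in calendar})
print(open_by_day.tolist())  # [100.0, 150.0, 75.0, 50.0, 0.0]
```

Comparing a few days of this output against the pivot table is a quick way to confirm which side of each inequality should be inclusive.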