Better way to refresh imported columns? - spotfire

I have a table in Spotfire with a couple of columns imported from another table as a lookup. As an example, Col2 is used as the match key for importing ImportedCol:
+------+------+-------------+
| Col1 | Col2 | ImportedCol |
+------+------+-------------+
| 1 | A | Val1 |
| 2 | B | Val2 |
| 3 | A | Val1 |
| 4 | C | Val3 |
| 5 | B | Val2 |
| 6 | A | Val1 |
| 7 | D | Val4 |
+------+------+-------------+
However, the data in Col2 is subject to change. When it changes, I need ImportedCol to change with it, but Spotfire seems to keep the old imported data. For now I have been deleting the imported column and re-adding it to refresh the link. Is there a way to dynamically import the data as the document loads, or with any refresh of the information links?

I have found that this happens sometimes, although I'm not sure exactly why. My workaround is to create "virtual" data tables based on your existing ones.
Consider your linked table as A and your embedded table as B. Start from a default state, that is, before importing any columns.
Add a new data table. The source for this table should be "From Current Analysis", using A. Call this one C. It becomes your main data table, and it will update when any changes are made to A or B.
To illustrate:

I found the issue.
Turns out that pivoting on data in the same table creates a circular reference which overrides the embed/link setting on that table. My workaround was to make the pivot as its own information link, then have the table join the original link and the new pivot one.

Related

Join on inequality in Power Query

I have been trying to answer this question
With the following data
+---------+---------+-----------+---------+
| Column1 | Column2 | Column3 | Column4 |
+---------+---------+-----------+---------+
| 1 | happy | 1-veggies | GHF |
| 1 | sad | 1-veggies | HGF |
| 2 | angry | 1-veggies | GHG |
| 2 | sad | 1-veggies | FGH |
| 3 | sad | 1-veggies | HGF |
| 4 | moody | 2-meat | FFF |
| 4 | sad | 2-meat | HGF |
| 5 | excited | 2-meat | HGF |
+---------+---------+-----------+---------+
OP was asking for a way of finding how many records there were which matched 'sad' and '1-veggies', and also had another record with the same value in column 1 and a code of GHF or FGH in column 4. The first two rows qualify, but the fourth row does not qualify because (if I understand correctly) it has the correct code, but in the same record as the one matching 'sad' and '1-veggies'. The count should be one.
I think the answer would have been fairly standard if this had been a SQL question - you would do a self-join with an equality on the first column and an inequality on the row number. In SQL it would look something like this:
create table Veggies
(
    num integer,
    emotion varchar(10),
    food varchar(10),
    code varchar(10),
    seq integer
);
insert into Veggies
values
    (1, 'happy', '1-veggies', 'GHF', 1),
    (1, 'sad', '1-veggies', 'HGF', 2),
    (2, 'angry', '1-veggies', 'GHG', 3),
    (2, 'sad', '1-veggies', 'FGH', 4),
    (3, 'sad', '1-veggies', 'HGF', 5),
    (4, 'moody', '2-meat', 'FFF', 6),
    (4, 'sad', '2-meat', 'HGF', 7),
    (5, 'excited', '2-meat', 'HGF', 8);
-- t1: rows matching 'sad' and '1-veggies'; t2: rows carrying one of the target codes
with t1 (num, seq)
as
(
    select num, seq
    from Veggies
    where emotion = 'sad' and food = '1-veggies'
),
t2 (num, seq)
as
(
    select num, seq
    from Veggies
    where code = 'GHF' or code = 'FGH'
)
-- same num, but a different row
select *
from t1 inner join t2 on t1.num = t2.num and t1.seq <> t2.seq;
I thought it might be possible to do the same thing (join on first column equal but row number unequal) in Power Query, but I have worked through the steps of getting the two queries with row numbers, and am stuck here:
I don't see any way of expressing an inequality and the documentation seems unhelpful. Does anyone have any inside knowledge on how to do this?
So although it looks as though you can't translate the SQL in the question directly into Power Query and replicate this in a single step
select *
from t1 inner join t2 on t1.num=t2.num and t1.seq<>t2.seq
you can split it into two steps as suggested by @Ron Rosenfeld.
To recap, the initial steps which hopefully were fairly straightforward were:
Establish a connection to the data as Table 1
Add an index column
Duplicate the table and call it Table 2
Filter Table 1 by 'sad' and '1-veggies'
Filter Table 2 by 'GHF' or 'FGH'
Now join Table 2 to Table 1 using an inner join on Column 1:
and exclude rows that were in Table 1 using a left anti join on the index column:
This leaves one row as required.
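For comparison, here is a minimal pandas sketch of the same two-step idea: join on equality first, then drop the pairs where both sides are the same row. It is not Power Query code; the column name `Index` and the variable names are assumptions chosen for illustration, and the filter step mirrors the `t1.seq<>t2.seq` condition from the SQL rather than the anti join itself.

```python
import pandas as pd

# Sample data from the question; "Index" plays the role of the added index column.
data = pd.DataFrame(
    {
        "Column1": [1, 1, 2, 2, 3, 4, 4, 5],
        "Column2": ["happy", "sad", "angry", "sad", "sad", "moody", "sad", "excited"],
        "Column3": ["1-veggies"] * 5 + ["2-meat"] * 3,
        "Column4": ["GHF", "HGF", "GHG", "FGH", "HGF", "FFF", "HGF", "HGF"],
    }
)
data["Index"] = range(1, len(data) + 1)

# Table 1: rows matching 'sad' and '1-veggies'. Table 2: rows with one of the codes.
table1 = data[(data["Column2"] == "sad") & (data["Column3"] == "1-veggies")]
table2 = data[data["Column4"].isin(["GHF", "FGH"])]

# Step 1: inner join on equality of Column1.
joined = table1.merge(table2, on="Column1", suffixes=("_t1", "_t2"))

# Step 2: apply the inequality as a filter, dropping pairs that are the same row.
result = joined[joined["Index_t1"] != joined["Index_t2"]]

print(len(result))  # 1
```

Only the pair made of the 'sad' row and the 'GHF' row for Column1 = 1 survives, which matches the expected count of one.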

Efficiently update rows of a postgres table from another table in another database based on a condition in a common column

I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (its values are originally the same as those in the column key of df1; the primary key column has only been renamed).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that rows 2 and 3 in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key) because their create_date values have been modified (by the given users).
I would like to update the rows of df1 (in concrete terms, the create_date and update_date values) where update_date in df2 is more recent than the original value in df1 (for the same primary key).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine

# `creator` must be a callable that returns a new DBAPI connection,
# not an already opened connection object.
def connect():
    return psycopg2.connect(**database_parameters_dictionary)

engine = create_engine('postgresql+psycopg2://', creator=connect)
df1.update(df2) # 1) maybe there is something better to do here?
with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace", # 2) maybe there is also something better to do here?
        index=True
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, together with
2) replacing the whole table inside the .to_sql() method, which means "drop+recreate".
As the tables are really large (more than 500'000 entries), I have the feeling that this will involve a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate custom SQL queries that compare the dates for each row and only take the ones that have really changed? But here again, my intuition is that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do this? (It would have been easier in pure SQL if the two tables were on the same host/database, but unfortunately that is not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
BTW, I don't think df1.update(df2) is exactly correct. From my reading, it would update all rows with any differing fields, not just those where update_date in df2 is later than update_date in df1. But it's a moot point if update_date in df2 is only ever more recent than in df1.
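A rough sketch of the approach described above, under some loud assumptions: the helper names `changed_rows` and `apply_updates` are made up for illustration, the sample frames are a trimmed copy of the question's data, and the date 2019-11-21 is an assumed valid stand-in for the question's 2019-21-11 (which does not parse as a date). The table and column names come from the question, and the `%s` placeholders are the psycopg2 parameter style.

```python
import pandas as pd

def changed_rows(df1, df2):
    """Return the new values for rows whose update_date in df2 is newer than in df1."""
    merged = df1.merge(df2, left_on="key", right_on="id", suffixes=("_old", "_new"))
    newer = pd.to_datetime(merged["update_date_new"]) > pd.to_datetime(
        merged["update_date_old"]
    )
    return merged.loc[newer, ["key", "create_date_new", "update_date_new"]]

def apply_updates(connection, params):
    """Issue all UPDATEs in one transaction (expects a psycopg2 connection to DB A)."""
    sql = (
        "UPDATE public.database_table_name "
        "SET create_date = %s, update_date = %s WHERE key = %s"
    )
    with connection.cursor() as cursor:
        cursor.executemany(sql, params)
    connection.commit()

# Trimmed sample data (assumption: valid ISO dates throughout).
df1 = pd.DataFrame(
    {
        "key": [57247, 57248, 57249, 57250],
        "create_date": ["1976-07-29", None, "1992-12-22", None],
        "update_date": ["2018-01-21", "2018-01-21", "2016-01-31", "2015-01-21"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [57247, 57248, 57249, 57250],
        "create_date": ["1976-07-29", None, "1992-12-24", "2001-07-14"],
        "update_date": ["2018-01-21", "2018-01-21", "2020-10-11", "2019-11-21"],
        "user": [None, None, "klm", "ptl"],
    }
)

changed = changed_rows(df1, df2)
# One parameter tuple per changed row: (create_date, update_date, key).
params = [
    (row.create_date_new, row.update_date_new, int(row.key))
    for row in changed.itertuples(index=False)
]
print(params)
```

With a live connection to database A, `apply_updates(connection, params)` would then send only the two changed rows instead of replacing the whole table.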

Finding Duplicates and Creating a column which points out the duplicates in pandas

| Col1 | Col2 | Col3 |
|------|------|------|
| m | n | o |
| m | q | e |
| a | b | r |
Let's say I have a pandas DataFrame as shown above. Notice that the Col1 values are the same for the 0th and 1st rows. Is there a way to find all the duplicate entries in the DataFrame based on Col1 only?
Additionally, I would also like to add another column, say is_duplicate, which would be True for all the duplicate instances in my DataFrame and False otherwise.
Note: I want to find the duplicates based only on the value in Col1. The other columns may or may not be duplicates; they shouldn't be taken into consideration.
.duplicated() has that functionality, although with the default keep='first' it flags only the second and later occurrences:
df['is_duplicate'] = df.duplicated('Col1')
I found it :
df["is_duplicate"] = df.Col1.duplicated(keep=False)
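The two answers differ only in the `keep` argument. A small sketch using the table from the question shows the difference; the variable names `later_only` and `every_occurrence` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": ["m", "m", "a"],
        "Col2": ["n", "q", "b"],
        "Col3": ["o", "e", "r"],
    }
)

# Default keep='first': the first occurrence is not flagged.
later_only = df.duplicated("Col1")

# keep=False: every row whose Col1 value appears more than once is flagged.
every_occurrence = df["Col1"].duplicated(keep=False)

df["is_duplicate"] = every_occurrence
print(later_only.tolist())        # [False, True, False]
print(every_occurrence.tolist())  # [True, True, False]
```

Since the question asks for True on all duplicate instances, `keep=False` is the variant that matches it.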

MS Excel: How to list all column if the rows contain a given date?

My data looks like the table below. I have groups with which I share topics each day. We do this randomly, based on need.
| | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Group 1 | | 19-apr | 30-apr | | | | | | |
| Group 2 | 18-apr | 25-apr | | | | | | | |
| Group 3 | | | | | 19-apr | 30-apr | | | |
| Group 4 | 18-apr | 25-apr | | | | | | | |
| Group 5 | | | | | | | 19-apr | 30-apr | |
| Group 6 | | | 25-apr | | | | | | |
| Group 7 | 18-apr | 25-apr | | | | | | | |
For our metrics and analysis, we need a list of groups per date on a different sheet. We would like to know which groups were engaged on a given day, like below.
Can somebody please help me get this done using only formulas, without macros?
I believe this can somehow be handled with INDEX/MATCH or lookups.
You could definitely do this with macros. You can do something similar without macros; it may not be precisely what you were looking for because it will leave blank space where groups were not addressed.
Method 1
Here is the formula I used and a picture of the sheet it is in:
=IF(IFERROR(MATCH(L$4,$B6:$H6,0),FALSE),INDEX($B$5:$B$13,MATCH($K5,$B$5:$B$13,0),1),"")
The idea is that if you have absolute references alongside your list of groups per date, then you can use index and match to fill in that group's name, but only if Match finds that precise date code in that group's row from the previous table. If you place an equivalent formula in the first cell, you can drag it out to the rest of the array.
The formula I used is not the only way to do this, but if you know Index and Match, then it should make sense to you.
Method 2
A more convoluted method would be to use image references. With these, it is possible to make the report precisely what you asked for on a separate sheet.
Suppose you took Method 1 and separated each column out into a different table. Use nearly the same formula in the cells below the date heading, except that you enclose the heading reference in INT() as shown below. Create one table for each of N dates, where N is the number of days you want to monitor at once. Then, when you want the summary to show you different dates, you go to each table, change the heading, and filter out blanks.
formula:
=IF(IFERROR(MATCH(INT($L$2),$B4:$H4,1),FALSE),INDEX($B$4:$B$11,MATCH($K3,$B$4:$B$11,1),0),"")
The below image shows what I mean by one table for each date:
Then you insert an image. It doesn't matter what image; it could be a screenshot of anything. Click on that image, then click into the formula bar. Then highlight the table column you want it to represent. Below is a screenshot of how to do that:
Now place that picture on its own sheet in the workbook. Place each date table on its own sheet in the workbook. The reason you do this is: if you filter a table, everything else overlapping the filtered rows outside the table will also be hidden. You move tables to separate sheets to prevent them from hiding each other.
Finally, arrange your pictures into the order you like, filter the blanks out of the tables, and your images will be exactly what you were looking for:
Again, this is a little convoluted because if you want the report to show you new date summaries, you would have to change the headings on every table. Then you would have to go to each table and refresh its filter. This is where macros usually come in.
Assume range A1:J8 housed your Source table, and L1:P8 housed the Date/Group Output
1] In L2, copied across :
=IFERROR(1/(1/AGGREGATE(15,6,$B$2:$J$8/($B$2:$J$8>K$2),1)),"")
2] In L3, copied across to P3 and all copied down :
=IF(L$2="","",IFERROR(INDEX($A:$A,AGGREGATE(15,6,ROW($A$2:$A$8)/($B$2:$J$8=L$2),ROW(A1))),""))
You can use the following formula to get a list of dates from a table:
=IFERROR(AGGREGATE(15,6,($B$2:$J$8/($B$2:$J$8*(COUNTIF($A$15:A15,$B$2:$J$8)=0)))*$B$2:$J$8,1),"")
To get a list of groups by date, use the following:
=IFERROR(INDEX($A$1:$A$8,AGGREGATE(15,6,(1/(B$15=$B$1:$J$8))*ROW($B$1:$J$8),ROW(A1))),"")
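To make the target of these formulas easier to check, here is a pandas sketch of the reshaping they aim for, as I read them: unpivot the grid into (group, date) pairs, drop the blanks, then list the groups per date. This is not an Excel solution, only an illustration of the logic. The frame below is a trimmed copy of the question's table, and the names `table`, `pairs` and `groups_by_date` are assumptions.

```python
import pandas as pd

# Trimmed copy of the source table: one row per group, one column per topic.
table = pd.DataFrame(
    {
        "Topic 1": [None, "18-apr", None, "18-apr"],
        "Topic 2": ["19-apr", "25-apr", None, "25-apr"],
        "Topic 3": ["30-apr", None, None, None],
        "Topic 5": [None, None, "19-apr", None],
        "Topic 6": [None, None, "30-apr", None],
    },
    index=["Group 1", "Group 2", "Group 3", "Group 4"],
)

# Unpivot to one (Group, Topic, Date) row per cell, then drop the empty cells.
pairs = (
    table.rename_axis("Group")
    .reset_index()
    .melt(id_vars="Group", var_name="Topic", value_name="Date")
    .dropna(subset=["Date"])
)

# For each date, collect the groups that were engaged on that day.
groups_by_date = pairs.groupby("Date")["Group"].apply(sorted).to_dict()
print(groups_by_date["19-apr"])  # ['Group 1', 'Group 3']
```

The result has one entry per distinct date, which corresponds to the header row of the output range, with the group names listed underneath.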

How to align multiline values in AsciiDoc table?

I would like to dynamically generate a table with asciidoc, which could look like this :
--------------------------------------
|Text | Parameter | Value1 | Value2 |
--------------------------------------
|foo | param1 | val1 | val2 |
--------------------------------------
|bar | param2 | val3 | val4 |
| | param3 | value_ | val6 |
| | | multi_ | |
| | | 5 | |
| | param4 | val7 | val8 |
--------------------------------------
| baz | param5 | val9 | val10 |
--------------------------------------
That is, there might be multiple parameters to one text, and their
values might span multiple lines. I am looking for a way to automatically
align these. I have a program that gathers data which changes, so I can
not manually fix things.
What I currently do: I have frame and gridless nested tables in the
Parameter, Value1 and Value2 columns. The problem with this is they only align if each value does not span multiple lines.
I also tried making Parameter, Value1 and Value2 a nested table together, with grid but no frame.
It works in terms of alignment, but doesn't look very good because the grid lines do not touch the gridlines of the outer table. Adding a frame also looks dull since it emphasizes multiparameter entries.
What I really want to do is add an extra line to the outer table (no table nesting) with no horizontal line in between, if there is an extra parameter.
I can not see how to do this with AsciiDoc. Is that possible at all? Any other suggestions on how to solve this?
It turns out this is rather easy with spans (see chapter 23.5):
.Multiline values aligned with spans
[cols=",,,",width="60%", options="header"]
|================
|Text | Parameter | Value1 | Value2
|foo | param1 | val1 | val2
.3+<.<|foo .3+<.<|bar | val3 | val4
| razzle bla fasel foo bar | dazzle
|bli | bla
|foo2 | param3 | val5 | val6
|================
Now all I need to do is tell my templating system (jinja2) how many rows I need to span, which is tedious but routine work.
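As a sketch of that generation step, the row-span prefix can be computed from the number of parameters per text. The example below uses plain Python string formatting rather than jinja2, and the function name `asciidoc_table` and the input layout are assumptions. It only spans the Text column; multi-line values would need the same treatment for the Parameter column.

```python
def asciidoc_table(entries):
    """Render (text, [(parameter, value1, value2), ...]) entries as an AsciiDoc table.

    The Text cell gets a row span (".N+") covering all of its parameter rows.
    """
    lines = [
        '[cols=",,,", options="header"]',
        "|================",
        "|Text | Parameter | Value1 | Value2",
    ]
    for text, params in entries:
        # Only emit a span specifier when there is more than one parameter row.
        span = f".{len(params)}+" if len(params) > 1 else ""
        for i, (name, value1, value2) in enumerate(params):
            # The spanned Text cell appears only on the first row of its group.
            prefix = f"{span}|{text} " if i == 0 else ""
            lines.append(f"{prefix}|{name} | {value1} | {value2}")
    lines.append("|================")
    return "\n".join(lines)

table_text = asciidoc_table(
    [
        ("foo", [("param1", "val1", "val2")]),
        ("bar", [("param2", "val3", "val4"), ("param3", "val5", "val6")]),
    ]
)
print(table_text)
```

The same span computation could be moved into a jinja2 template by passing the per-text row count as a template variable.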
If you're using asciidoctor, there are many other options for tables including putting columns on new lines and using the metadata for the table to specify how many columns the table contains. This is the recommended way of doing tables in Asciidoctor. You can see this example and many others in the user's guide. To give an example here on SO:
[cols="2*"]
|===
|Cell in column 1, row 1
|Cell in column 2, row 1
|Cell in column 1, row 2
|Cell in column 2, row 2
|===
Asciidoctor can be a drop in replacement for the asciidoc command, though you will want to look at differences between the two.
