So I've created a simple bit of analysis in Alteryx, and I'm looking for a bit of advice.
I have a dataset of merchants and shoppers. I pull out the total number of unique shoppers with a Summarize tool using CountDistinct(shoppers) - let's say it's 100. Then I create a table of the number of unique shoppers within each merchant, which looks something like this:
Merchant | Unique Users
Merchant 1 | 76
Merchant 2 | 19
Merchant 3 | 97
Merchant 4 | 44
Merchant 5 | 55
...
I'd like to create a variable that will be [Number of Distinct Users]/countdistinct(shoppers).
I know that I could just append the value for countdistinct(shoppers) into a new third column in my table, but I'd prefer not to do that unless I have to. Is there a way to save the single number from countdistinct(shoppers) as a value and simply divide directly by that without having to append or join?
There is currently no way to do that in Alteryx. While it feels wrong or wasteful, by including the calculated count distinct value on every record, Alteryx can process each record individually and then pass it downstream.
I have heard talk of creating something like a global variable that could be accessed or updated to handle the situation you're describing, but the complexity of managing reads and writes from different parts of the workflow, as well as any resulting slowdown from thread coordination, always sidelines the discussion.
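For intuition, the append-then-divide pattern is conceptually the same as cross-joining a one-row total in SQL. A rough sketch, assuming hypothetical table and column names (transactions, ShopperID, Merchant), not anything from the actual workflow:

-- Sketch only: table and column names are illustrative.
SELECT m.Merchant,
       m.UniqueUsers,
       m.UniqueUsers * 1.0 / t.TotalShoppers AS ShareOfAllShoppers
FROM (SELECT Merchant, COUNT(DISTINCT ShopperID) AS UniqueUsers
      FROM transactions
      GROUP BY Merchant) m
CROSS JOIN (SELECT COUNT(DISTINCT ShopperID) AS TotalShoppers
            FROM transactions) t;

The Append Fields tool plays the role of the CROSS JOIN here: it copies the single total onto every merchant row so the division can happen record by record.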
Related
I am trying to analyze data based on the following scenario:
A group of places, each with its own ID, becomes available for visiting from time to time for a limited number of people. This number varies according to how well the last visit season performed. So far, visit seasons have been opened three times.
Let's suppose ID_01 had the following available/sold-out slot ratios across those three seasons: 25/24, 30/30, and 30/30; ID_02 had 25/15, 20/18, and 25/21; and ID_03 had 25/10, 15/15, and 20/13.
What would be the best way to design the database for such analysis on a single table?
So far I have used a separate table for each ID with all of its available and sold-out slot counts, but as the number of IDs grows, and the number of visit seasons with it (way beyond three at this point), this is proving to be far from ideal: hard to keep track of and terrible to work with.
The best solution I could come up with was putting all IDs in one column and adding two columns for each season (ID | 1_available | 1_soldout | 2_available | 2_soldout | ...).
The Wikipedia article on database normalization would be a good starting point.
Based on the information you provided in your question, you would create one table:
AvailableDate
-------------
AvailableDateID
LocationID
AvailableDate
AvailableSlots
SoldOutSlots
...
You may also have other columns you haven't mentioned. One possibility is SoldOutTimestamp.
The primary key is AvailableDateID. It's an auto-incrementing integer that has no meaning, other than to sort the rows in input order.
You also create a unique index on (LocationID, AvailableDate) and another unique index on (AvailableDate, LocationID). This lets you retrieve rows efficiently by LocationID or by AvailableDate.
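In SQL, a sketch of that table might look like the following (column types and the exact auto-increment syntax are assumptions and vary by database):

CREATE TABLE AvailableDate (
    AvailableDateID  INTEGER PRIMARY KEY,  -- auto-incrementing surrogate key; syntax varies by DBMS
    LocationID       INTEGER NOT NULL,
    AvailableDate    DATE    NOT NULL,
    AvailableSlots   INTEGER NOT NULL,
    SoldOutSlots     INTEGER NOT NULL
);

CREATE UNIQUE INDEX ix_location_date ON AvailableDate (LocationID, AvailableDate);
CREATE UNIQUE INDEX ix_date_location ON AvailableDate (AvailableDate, LocationID);

-- A season-by-season report for one location is then a simple filtered query:
-- SELECT AvailableDate, AvailableSlots, SoldOutSlots
-- FROM AvailableDate WHERE LocationID = 1 ORDER BY AvailableDate;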
I have a data source in Tableau built from an Excel file. Every row is one ticket (assigned to a customer Id) for a theme park, across two years.
The structure is like the following:
Every Id can buy tickets for different parks (or the same park several times), and in different years.
What I am not able to do is flag those customers who have been to the same park in two different years (in the example, customer 004 has been to park a in 2016 and 2017).
How do I create this calculated field in Tableau?
(I managed to solve this in Excel with a SUMPRODUCT function, but the database has more than 500k rows and after a while it crashes; plus I want to use a calculated field in case I update the Excel file with a new park or a new year.)
Ideally, I thought the structure of the output should be like the following (but I am open to different views, as long as I get to the result): flag with 1 those customers who have visited the same park in two different years.
Create a calculated field called customer_park_years =
{ FIXED [Customerid], [Park] : COUNTD([year]) }
You can use that on the filter shelf to only include data for customer_park_years >= 2
Then you will be able to visualize only the data related to those customers visiting the specific parks they visited in multiple years. If you also want to look at their behavior at other parks, you'll have to adjust your approach rather than simply filtering out the other data; the changes depend on the details of what you want to see.
But to answer your specific question, this should be an easy way to go.
Note that COUNTD() can be slow for very large data sets, but it makes answering questions without reshaping your data easy, so it's often a good tradeoff.
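If it helps to reason about what the FIXED expression computes, it is roughly the SQL aggregation below (the table name tickets and the column names are assumed, not taken from your workbook):

SELECT Customerid, Park, COUNT(DISTINCT Year) AS customer_park_years
FROM tickets
GROUP BY Customerid, Park
HAVING COUNT(DISTINCT Year) >= 2;  -- the customer/park pairs to flag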
Try this!
IFNULL(STR({ FIXED [Customerid], [Park] : IF COUNTD([year]) > 1 THEN 1 ELSE 0 END }), '0')
I just switched from Oracle to Cassandra 2.0 with the DataStax driver, and I'm having difficulty structuring my model for this big-data approach. I have a Persons table with a UUID and a serialized Person. These Persons have lists of addresses, names, identifications, and DOBs. For each of these lists I have an additional table with a compound key on each value in the respective list plus a person_UUID column. This model feels too relational to me, but I don't know how else to structure it so that I can have an index on (i.e., be able to search by) address, name, identification, and DOB. If Cassandra supported indexes on lists, I would have just the one Persons table containing indexed lists for each of these.
In my application we receive transactions, which can contain within them 0 or more of each of those address, name, identification, and DOB. The persons are scored based on which person matched which criteria. A single person with the highest score is matched to a transaction. Any additional address, name, identification, and DOB data from the transaction that was matched is then added to that person.
The problem I'm having is that this matching is taking too long and the processing is falling far behind. This is caused by having to loop through result sets performing additional queries, since I can't make complex queries in Cassandra, and I don't have sufficient memory to just do a huge select-all and filter in Java. For instance, I would like to select all Persons having at least two names in common with the transaction (names can have their order scrambled, so there is no first, middle, last; it would just be three names), but this would require a 'group by', which Cassandra does not support, and if I just selected everyone having any of the names in common in order to filter in Java, the result set is too large and I run out of memory.
I'm currently searching by only identifications and addresses, which yield a smaller result set (although it could still be hundreds), and for each one in this result set I query to see if it also matches on names and/or DOB. Besides still being slow, this does not meet the project's requirements, as a match on name and DOB alone would be sufficient to link a transaction to a person if no higher score is found.
I know in Cassandra you should model your tables by the queries you do, not by the relationships of the entities, but I don't know how to apply this while maintaining the ability to query individually by address, name, identification, and DOB.
Any help or advice would be greatly appreciated. I'm very impressed by Cassandra but I haven't quite figured out how to make it work for me.
Tables:
Persons
[UUID | serialized_Person]
addresses
[address | person_UUID]
names
[name | person_UUID]
identifications
[identification | person_UUID]
DOBs
[DOB | person_UUID]
I did a lot more reading, and I'm now thinking I should change these tables around to the following:
Persons
[UUID | serialized_Person]
addresses
[address | Set of person_UUID]
names
[name | Set of person_UUID]
identifications
[identification | Set of person_UUID]
DOBs
[DOB | Set of person_UUID]
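In CQL, that revised model would look something like this sketch (illustrative only; the names table is shown, and the others would follow the same shape):

CREATE TABLE names (
    name          text PRIMARY KEY,
    person_uuids  set<uuid>   -- all persons known by this name
);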
But I'm afraid of going beyond the maximum size of a set (65,536 UUIDs) for some names and DOBs. Instead I think I'll have to do a dynamic column family with the column names as the person_UUIDs - or is a row with over 65k columns very problematic as well? Thoughts?
It looks like you can't have these dynamic column families in the new version of Cassandra; you have to alter the table to insert a new column with a specific name. I don't know how to store more than 64k values for a row then. With a perfect distribution I will run out of space for DOBs at around 23 million persons, and I'm expecting to have over 200 million persons. Maybe I just have to have multiple set columns?
DOBs
[DOB | Set of person_UUID_A | Set of person_UUID_B | Set of person_UUID_C]
and I just check the size and alter the table when a set reaches 64k? Anything better I can do?
I guess it's just CQL3 that enforces this, and if I really wanted I could still do dynamic columns with Cassandra 2.0?
Ugh, this page from the DataStax documentation seems to say I had it right the first way...:
When to use a collection
This answer is not very specific, but I'll come back and add to it when I get a chance.
First thing - don't serialize your Persons into a single column. This complicates searching and updating any person info. OTOH, there are people who know what they're talking about who disagree with this view. ;)
Next, don't normalize your data. Disk space is cheap, so don't be afraid to write the same data to two places. Your code will need to make sure that the right thing is done.
Those items feed into this: If you want queries to be fast, consider what you need to make that query fast. That is, create a table just for that query. That may mean writing data to multiple tables for multiple queries. Pick a query, and build a table that holds exactly what you need for that query, indexed on whatever you have available for the lookup, such as an id.
So, if you need to query by address, build a table (really, a column family) indexed on address. If you need to support another query based on identification, index on that. Each table may contain duplicate data. This means that when you add a new user, you may be writing the same data to more than one table. This seems unnatural if relational databases are the only kind you've ever used, but you get benefits in return - namely, horizontal scalability (see the CAP theorem).
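As a sketch of what those per-query tables might look like in CQL (names are illustrative, not from the original post; the clustering column keeps one row per person inside each address or identification partition, so you avoid the collection size limits mentioned in the question):

CREATE TABLE persons_by_address (
    address      text,
    person_uuid  uuid,
    PRIMARY KEY (address, person_uuid)   -- one partition per address, one row per matching person
);

CREATE TABLE persons_by_identification (
    identification  text,
    person_uuid     uuid,
    PRIMARY KEY (identification, person_uuid)
);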
Edit:
The two column families in that last example could just hold identifiers into another table. So, voilà, you have made an index. OTOH, that means each query takes two reads - but it will still be a performance improvement in many cases.
Edit:
Attempting to explain the previous edit:
Say you have a users table/column family:
CREATE TABLE users (
id uuid PRIMARY KEY,
display_name text,
avatar text
);
And you want to find a user's avatar given a display name (a contrived example). Searching users will be slow. So, you could create a table/CF that serves as an index, let's call it users_by_name:
CREATE TABLE users_by_name (
display_name text PRIMARY KEY,
user_id uuid
);
The search on display_name is now done against users_by_name, and that gives you the user_id, which you use to issue a second query against users. In this case, user_id in users_by_name has the value of the primary key id in users. Both queries are fast.
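Concretely, the two reads are both primary-key lookups (bind markers shown; the values come from your application):

SELECT user_id FROM users_by_name WHERE display_name = ?;
-- then, with the user_id returned by the first query:
SELECT avatar FROM users WHERE id = ?;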
Or, you could put avatar in users_by_name, and accomplish the same thing with one query by using more disk space.
CREATE TABLE users_by_name (
display_name text PRIMARY KEY,
avatar text
);
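With avatar duplicated into the index table, the lookup collapses to a single read:

SELECT avatar FROM users_by_name WHERE display_name = ?;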
In Excel, I have a log of web requests that I need to analyze for bandwidth usage. I have parsed the log into a number of fields that I will group by in different ways for different reports. Each website page load fetches multiple resources, each on a separate line. The data structure:
RequestID | SIZE | IsImage | IsStatic | Language
A | 100 | TRUE | TRUE | EN
A | 110 | TRUE | FALSE | EN
A | 90 | FALSE | FALSE | EN
...
Report 1: I need the AVERAGE request size: AVERAGE( SELECT SUM(SIZE) GROUPBY RequestID ). I do not need to see the size of each individual request.
Report 2: More elaborate pivot table reports showing average request size broken down by IsStatic / IsImage / Language / etc. This way I can check "average total images per request per language".
Is there a way to define a field/item "SUM(SIZE) GROUPBY RequestID" ?
As far as I know this is not possible to achieve in a single pivot table, because you need to apply two separate aggregations to the same set of numbers based on a grouping condition (RequestId).
It is possible to get what you are looking for using two pivot tables. I would not recommend it, but this is how you would do it.
Create the first pivot table on your base table: add RequestId to the rows and Size to the values. This gives you an intermediate table with the sum of Size per RequestId. Then build a second pivot table, this time using the first pivot table as the source; in this one you only add the 'Sum of Size' value and take the average of it.
Again, I would not recommend this approach for anything but the simplest analysis.
A better way to do this is to use Power Pivot, a separate but related technology to the pivot tables you have used. You will need to import the table (I have assumed it is named [Logs], with columns [RequestId] and [Size]) and then add a measure:
AverageSizeOfRequests:=AVERAGEX(SUMMARIZE(Logs;Logs[RequestId];"SumOfSize";CALCULATE(SUM(Logs[Size])));[SumOfSize])
This gives you two measures: the first is the straight sum, which you already have; the second is the average, which will be the same per RequestId but will aggregate differently at higher levels.
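For readers who think in SQL, both approaches are aiming at this nested aggregation (table and column names follow the question):

SELECT AVG(request_size) AS average_request_size
FROM (
    SELECT RequestID, SUM(SIZE) AS request_size
    FROM Logs
    GROUP BY RequestID
) AS per_request;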
I guess I am not understanding your Q because I expect the group by for Request ID to be automatic (unavoidable in a PT with that as a Row label). Perhaps pick holes in the following and I might understand what I have misunderstood:
I have added i and s to your data just so it is clearer which column is which. It is possible it would be better to convert TRUE and FALSE into 1 and 0 so the PT might count or average these as well.
This seems vaguely along the right lines, so let's try a different PT layout. If RequestID is of little or no relevance for the required analysis, don't include it in the PT or, as here, park it as a Report Filter:
In which case, however many millions of rows of data of the kind in the OP there are, the PT will in effect always be at most a 2x2 matrix (assuming Language is also suited to a Report Filter). There is only one value per record (SIZE) and only two boolean variables. Language could make a difference, but the worst case is one such PT per Language (and bear in mind only one Language is shown in the example!).
I'm trying to set up a SQL Server database for an ASP.NET MVC site that will store both the latest information and a history of all the changes to the data.
The information for my site comes from XML files uploaded by the user. The site parses the XML and writes the contained information into the site's database. Elements in successive uploads may actually represent the same thing, but some data may have changed. Yet, like I said before, I want to keep track of every version.
The table below shows one way I was thinking of approaching this. I would create duplicate records for each item in each upload. New items that match those in previous uploads would be assigned the same ID, but each item would be tied to a unique upload ID.
Upload 1: Erik and Sara are added
Upload 2: Erik renamed to Eric, Bill added
Upload 3: Sarah grew 2" taller, Eric was removed.
[PERSONS TABLE]
PersonID Name Height UploadID
1 Erik 71 1
1 Eric 71 2
2 Sarah 70 1
2 Sarah 70 2
2 Sarah 72 3
3 Bill 76 2
3 Bill 76 3
[UPLOADS TABLE]
UploadID UploadTime
1 3/09/2011
2 4/01/2011
3 4/11/2011
However, this doesn't seem like the optimal solution to me because of how much information ends up being duplicated in the database. Is there a better way to approach this where only the changes are saved with each upload?
I think the problem is that your PERSONS table no longer contains just information about persons; it also contains information about the uploads. What I'm going to recommend probably won't decrease the size of your database, but it will make it a little easier to understand and work with.
PERSONS
PersonID, Name, Height
1 Eric 71
2 Sarah 72
3 Bill 76
UPLOAD
UploadID, UploadTime
1 3/09/2011
2 4/01/2011
3 4/11/2011
PERSONS_EDIT
PersonID, UploadID, ChangeSQL, ChangeDescription
1 1 "insert into PERSONS(Name, Height) VALUES('Erik', 71)" "Erik added"
1 2 "update PERSONS set name='Eric' where Name='Erik'" "Changed Erik's name"
.... ... ...... ....
I don't think you can do much beyond this to make your tables simpler or your database smaller. As you can see, your PERSONS_EDIT table is going to be your largest table. The database you're using might provide mechanisms to do this automatically (some sort of change tracking or transaction recording), but I've never used anything like that, so I'll leave it to other people on Stack Overflow to suggest those if they exist. If the PERSONS_EDIT table gets too large, you can look at deleting entries that are over a week/month/year old; the decision on when to do that would be up to you.
Another reason for making this change: in your first design, you had to use PersonID and UploadID together as the primary key of your PERSONS table. So, to get the most recent version of a person in your application, you would have had to select the person by ID, order by UploadID, and take the row with the largest UploadID every time you do a transaction on one person.
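For illustration, under the original single-table design that "latest version" lookup would have been something like this (SQL Server syntax; names taken from the example data):

-- Latest version of person 1 in the original design:
SELECT TOP (1) Name, Height
FROM PERSONS
WHERE PersonID = 1
ORDER BY UploadID DESC;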
Another benefit is that you don't have to do a bunch of fancy SQL to get your edit history - just SELECT * FROM the PERSONS_EDIT table.
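A minimal DDL sketch of the proposed three tables (SQL Server syntax; column types are assumptions, and whether you want foreign keys is up to you):

CREATE TABLE PERSONS (
    PersonID  INT IDENTITY PRIMARY KEY,
    Name      NVARCHAR(100) NOT NULL,
    Height    INT NULL
);

CREATE TABLE UPLOAD (
    UploadID    INT IDENTITY PRIMARY KEY,
    UploadTime  DATETIME2 NOT NULL
);

CREATE TABLE PERSONS_EDIT (
    PersonID           INT NOT NULL,   -- intentionally no FK, so history survives deletes
    UploadID           INT NOT NULL REFERENCES UPLOAD (UploadID),
    ChangeSQL          NVARCHAR(MAX) NOT NULL,
    ChangeDescription  NVARCHAR(400) NULL,
    PRIMARY KEY (PersonID, UploadID)
);

-- Edit history for one person:
-- SELECT * FROM PERSONS_EDIT WHERE PersonID = 1 ORDER BY UploadID;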