How can I store my hierarchical data?

I'm creating an iOS app which allows you to find the quickest route from one city to another. To make things easier for the user, you can filter down based on a city's continent or country too (so Europe would show Paris, London and Berlin, whereas France would only show Paris). I currently have all my objects stored in various arrays, with the City object pointing to the Country object, and the Country object pointing to the Continent object. See my diagram below for a visual representation:
I feel that this is really inefficient when it comes to filtering the data. I want a data structure that will allow me to filter the cities quickly. I am happy to keep the 3 arrays that I currently have, but I really feel they aren't efficient enough, and I am struggling to find a solution online. Thanks in advance.

You are trying to set up a network-style database. Instead, try using a relational approach: one table of cities, with an extra column linking each city to its country, and another table of countries linking each country to its continent.
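For illustration, here is a minimal relational sketch of that idea (the table and column names are my own placeholders, shown in SQLite syntax since SQLite ships with iOS):

CREATE TABLE continents (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE countries (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    continent_id INTEGER NOT NULL REFERENCES continents(id)
);

CREATE TABLE cities (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES countries(id)
);

-- Filtering then becomes a simple join, e.g. all cities in Europe:
SELECT cities.name
FROM cities
JOIN countries  ON countries.id  = cities.country_id
JOIN continents ON continents.id = countries.continent_id
WHERE continents.name = 'Europe';

On iOS you would typically access such a schema through SQLite directly or through Core Data.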

Related

Tableau Calculated Field using FIXED

I have a database in Tableau from an Excel file. Every row in the database is one ticket (assigned to a customer Id) for one of several theme parks, across two years.
The structure is like the following:
Every Id can buy tickets for different parks (or the same park several times), in different years as well.
What I am not able to do is flag those customers who have been to the same park in two different years (in the example, customer 004 has been to park a in 2016 and 2017).
How do I create this calculated field in Tableau?
(I managed to solve this in Excel with a SUMPRODUCT function, but the database has more than 500k rows and it crashes after a while; plus, I want a calculated field in case I update the Excel file with a new park or a new year.)
Ideally, the structure of the output would be like the following (but I am open to different views, as long as I get the result): flag with 1 those customers who have visited the same park in two different years.
Create a calculated field called customer_park_years =
{ fixed [Customerid], [Park] : countd([year]) }
You can use that on the filter shelf to only include data for customer_park_years >= 2
Then you will be able to visualize only the data related to those customers visiting specific parks that they visited in multiple years. If you also want to look at their behavior at other parks, you'll have to adjust your approach instead of simply filtering out the other data; the changes would depend on the details of your situation.
But to answer your specific question, this should be an easy way to go.
Note that countd() can be slow for very large data sets, but it makes answering questions without reshaping your data easy, so it's often a good tradeoff.
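For intuition, here is the logic of that FIXED calculation expressed in SQL (purely illustrative; the tickets table and column names are assumptions, not part of Tableau):

-- One row per (customer, park); keep only pairs seen in 2+ distinct years
SELECT Customerid, Park, COUNT(DISTINCT year) AS customer_park_years
FROM tickets
GROUP BY Customerid, Park
HAVING COUNT(DISTINCT year) >= 2;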
Try this!
IFNULL(STR({ FIXED [Customerid], [Park] : IF SUM(1) > 1 THEN 1 ELSE 0 END }), '0')

Creating a relationship between two different columns - a relationship that affects other values

I'm going to try to make this as clear as possible; I apologize in advance if my attempt is a failure.
I work in education and am trying to create a predictive analysis document that will tell us how many class sections we will need to offer in a given semester.
I pulled data from the past five years and consolidated it into a pivot chart. I set the pivot chart to combine all courses with a common Subject, Course Title and Catalog Number (See image below for more detail), and output 3 different columns of values based on what we need.
The problem I am facing now is with curriculum changes throughout the years. There are some courses within the list that are no longer being offered and a new course with a new Course Title, Subject and Catalog Number that can now be substituted for the previously needed course. Since the data has been pulled into one pivot chart, both the old curriculum courses and the new curriculum courses are in one list.
I would like to somehow create a relationship between the old curriculum courses and the new curriculum courses. If possible I would like the names of the courses to remain separate, but the values of the old and new to be averaged out together in their respective rows.
In a new page, I plan on putting an easy to use form where the user can select a course subject and name, enter in some other necessary data and the document will output the amount of course sections needed.
Does anyone know of a way to make a relationship between two cells and have other cells affected by this relationship?
Thanks so much!
Mike

Pros and cons of two types of PivotTables for BI purposes

I am trying to figure out how to create the most useful PivotTable for a user to view data for BI purposes. Here are two options I was considering:
(1) Traditional PivotTable, pivot values on top:
(2) Drill-down type PivotTable:
What are the pros and cons of each method? For example, one for each to start might be:
Drilldown:
PRO: trivial to add additional drilldown variables.
Pivot:
PRO: can easily sort by the column headers in the table UI.
And, are there any other possible tabular displays of data, either another type of PivotTable or another type altogether?
I'd suggest keeping it simple. If the objective is to present the revenue figures of each region summarized by gender, then the pivot table in option 1 is the more effective of the two, as it shows everything relevant at a single glance and keeps similar data at the same level, making it easier to compare.
Bear in mind that management requested that view so they could effectively see how each region is performing in that specific category.
If the focus is revenue across genders, option 1 shows both values continuously in the same row for each region. It can easily be seen that the best performer on revenue generated by females is the US, while the best performer on revenue generated by males is Canada; that is not easy to see in option 2.
If the focus is revenue for one gender across regions, option 1 shows that continuously in the same column, which is not the case with option 2.
Option 2 would be useful if the primary focus were revenue by region: if there were a need to see additional detail behind the performance of any region, management could drill down to see what makes up the primary number. That is not the objective here, since the request is to show both dimensions.
Also, my best advice is to always agree requirements with your clients (internal and/or external). You might find they requested only what they believed was possible to achieve, planning to apply some "manual steps" afterwards to reach their ultimate goal, something you could have done entirely if only you had known.
Pivot tables are used to:
summarize data
analyze data
explore data
present summary data
Both layouts (traditional and drill-down) of pivot table can do all of the above.
It depends on what you want to achieve in BI.
If the detailed data does not need to be shown or sorted, then you can use the drill-down layout.
In BI, data is mostly used in summarised form, so the drill-down method is good for displaying data. You can always double-click a value to see the detailed rows behind it.
Drill Down:
a. Pros
Summarise Easily
Add sub points for summary
"Get Details" of Pivot for more details
b. Cons
Sorting the way you would in a traditional pivot is not possible. Instead, I would suggest using the pivot for drilling down, so you can still sort (and this point moves to the Pros section :P) and check the pivot details in another form.
Traditional Way
This is the classic way of using pivot tables on your data. You can explore more in the links below.
5 pivot tables you probably haven't seen before
pivot tables save your job
23 things you should know about Excel pivot tables
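As an aside, for readers who think in SQL, the two layouts correspond roughly to two query shapes (illustrative only; the sales table and its columns are assumptions):

-- Traditional pivot: one row per region, genders spread across columns
SELECT region,
       SUM(CASE WHEN gender = 'F' THEN revenue ELSE 0 END) AS female_revenue,
       SUM(CASE WHEN gender = 'M' THEN revenue ELSE 0 END) AS male_revenue
FROM sales
GROUP BY region;

-- Drill-down: per-gender rows plus a subtotal row per region
-- (MySQL syntax; standard SQL uses GROUP BY ROLLUP(region, gender))
SELECT region, gender, SUM(revenue) AS revenue
FROM sales
GROUP BY region, gender WITH ROLLUP;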
Generally, every representation of data has a purpose, and with that purpose come certain advantages and disadvantages.
Obviously, with any kind of report, the audience matters most, which puts you in the classic requirements-analysis situation where you need to figure out what your customer wants (What data is of interest? How should it be sliced? What medium is it consumed on?).
Is the Revenue by Gender an important KPI?
If it is not, why include it at all?
If it is, let's see what a potential reader would do to answer a question like "How do the women's sales for Mexico compare with Canada's?"
Drill down table:
Understanding the table will take a couple of seconds, since the reader has to understand the different levels and their representation, the meaning of the bold and regular lines, and realize that the men and women values add up to the total value for a region
Find Mexico in the list
Find the row for women and the value of it on the right
Find Canada in the list
Find the row for women and the value of it on the right
Remember the value for Mexico or look it up again
Importantly, this process will be repeated more or less exactly for every follow-up question.
Traditional table
Understanding the table will be faster: the reader sees a country name on the left and male/female at the top. People have generally been used to these tables since primary school and won't need further explanation.
Find Mexico in the list and go right until you find the value for women (if you try it, you will see that you take in the values and the heading automatically)
Find Canada in the list (realizing that it is only one line away), go right, and you have both values directly on top of each other
For all the following questions the structure is easy to remember, and it's a find-and-match game between rows and columns
I know that might be a bit subjective, but I hope the general idea is understandable.
If you now have a question like "In which region do we sell more to men than women?", the advantage of a traditional table over a drill-down table becomes even more obvious.
With the drill-down you have to juggle several rows and their values, while with the traditional table you just skim through one column and look for the biggest value.
Is the Revenue per Region the main KPI?
You should then rather use a drill-down table, possibly with additional levels (e.g. North America in case it's international data, or US state, since it would presumably be of interest whether your product sells better in Alaska than Florida).
Your audience can then decide which granularity they want to see and adjust it accordingly. The gender is on the bottom of the hierarchy so either you have curious people who are interested in just another figure or they don't care and just don't drill down that deep.
The assumption here is that you deliver the table on the highest aggregate level.
One could argue that the same problem of finding rows exists in this case as well, but I would assume you wouldn't necessarily compare the sales for Yucatán with Alberta; you stay within one group of states, for example, and again just have to skim up or down to find the states of the same country you want to compare.
Using drill-downs in pivot tables is, in my opinion, a tool to be used by analysts, not managers. Pivot tables are not quite intuitive enough to be used on reports that are sent on for BI review by management. Typically, any report circulated for review by the powers that be should be consistent from user to user. With drill-downs, different numbers are displayed depending on which items are selected, which could lead to two people talking about different values without knowing it.
Many people in management-level positions outside of the core analytical group will still print anything you e-mail them before they look at it. I suspect this is more likely to be true in a less technologically advanced company (i.e. one which uses Excel as its database analysis tool instead of a full ERP-type system). In either case, anything submitted for high-level review should already be formatted exactly as you want them to see it.
The key in Excel deliverables within the workplace is to make review as easy as possible. That means all necessary information should be immediately visible on each tab, with a minimum of scrolling (maybe scrolling down if necessary, but never scrolling right), and absolutely no clicking required.
Conclusion - Do not force a reviewer to manipulate your Excel file to use it
You may like drill-downs because you see how powerful they are for adjusting reports as you analyze data, but once you have drawn your analytical conclusions, those conclusions should be immediately apparent from the visible workspace you leave for review.
Therefore, to achieve simplicity in high-level review documents, you should use the 'traditional' format as you have shown it, which presents all the numbers next to each other in an easy-to-read table.

How to optimize Cassandra model while still supporting querying by contents of lists

I just switched from Oracle to using Cassandra 2.0 with the Datastax driver, and I'm having difficulty structuring my model for this big-data approach. I have a Persons table with a UUID and serialized Persons. These Persons have lists of addresses, names, identifications, and DOBs. For each of these lists I have an additional table with a compound key on each value in the respective list plus a person_UUID column. This model feels too relational to me, but I don't know how else to structure it so that I have an index on (am able to search by) address, name, identification, and DOB. If Cassandra supported indexes on lists I would have just the one Persons table containing indexed lists for each of these.
In my application we receive transactions, which can contain within them 0 or more of each of those address, name, identification, and DOB. The persons are scored based on which person matched which criteria. A single person with the highest score is matched to a transaction. Any additional address, name, identification, and DOB data from the transaction that was matched is then added to that person.
The problem I'm having is that this matching is taking too long and the processing is falling far behind. This is caused by having to loop through result sets performing additional queries, since I can't make complex queries in Cassandra, and I don't have sufficient memory to just do a huge select-all and filter in Java. For instance, I would like to select all Persons having at least two names in common with the transaction (names can have their order scrambled, so there is no first, middle, last; that would just be three names), but this would require a 'group by', which Cassandra does not support; and if I just selected all having any of the names in common in order to filter in Java, the result set is too large and I run out of memory.
I'm currently searching by only Identifications and Addresses, which yield a smaller result set (although it could still be hundreds) and for each one in this result set I query to see if it also matches on names and/or DOB. Besides still being slow this does not meet the project's requirements as a match on Name and DOB alone would be sufficient to link a transaction to a person if no higher score is found.
I know in Cassandra you should model your tables by the queries you do, not by the relationships of the entities, but I don't know how to apply this while maintaining the ability to query individually by address, name, identification, and DOB.
Any help or advice would be greatly appreciated. I'm very impressed by Cassandra but I haven't quite figured out how to make it work for me.
Tables:
Persons
[UUID | serialized_Person]
addresses
[address | person_UUID]
names
[name | person_UUID]
identifications
[identification | person_UUID]
DOBs
[DOB | person_UUID]
I did a lot more reading, and I'm now thinking I should change these tables around to the following:
Persons
[UUID | serialized_Person]
addresses
[address | Set of person_UUID]
names
[name | Set of person_UUID]
identifications
[identification | Set of person_UUID]
DOBs
[DOB | Set of person_UUID]
But I'm afraid of going beyond the maximum size of a set (65,536 elements) for some names and DOBs. Instead I think I'll have to do a dynamic column family with the column names as the person_UUIDs; or is a row with over 65k columns very problematic as well? Thoughts?
It looks like you can't have these dynamic column families in the new version of Cassandra; you have to alter the table to insert the new column with a specific name. I don't know how to store more than 64k values for a row then. With a perfect distribution I will run out of space for DOBs at 23 million persons, and I'm expecting to have over 200 million persons. Maybe I just have to have multiple set columns?
DOBs
[DOB | Set of person_UUID_A | Set of person_UUID_B | Set of person_UUID_C]
and I just check the size and alter the table when a set reaches 64k? Is there anything better I can do?
I guess it's just CQL3 that enforces this, and if I really wanted I could still do dynamic columns with Cassandra 2.0?
Ugh, this page from the Datastax docs seems to say I had it right the first way:
When to use a collection
This answer is not very specific, but I'll come back and add to it when I get a chance.
First thing - don't serialize your Persons into a single column. This complicates searching and updating any person info. OTOH, there are people who know what they're talking about who disagree with this view. ;)
Next, don't normalize your data. Disk space is cheap, so don't be afraid to write the same data to two places. Your code will need to make sure the right thing is done.
Those items feed into this: If you want queries to be fast, consider what you need to make that query fast. That is, create a table just for that query. That may mean writing data to multiple tables for multiple queries. Pick a query, and build a table that holds exactly what you need for that query, indexed on whatever you have available for the lookup, such as an id.
So, if you need to query by address, build a table (really, a column family) indexed on address. If you need to support another query based on identification, index on that. Each table may contain duplicate data. This means when you add a new user, you may be writing the same data to more than one table. While this seems unnatural if relational databases are the only kind you've ever used, you get benefits in return, namely the horizontal scalability that Cassandra trades strict consistency for (see the CAP theorem).
Edit:
The two column families in that last example could just hold identifiers into another table. So, voilà, you have made an index. OTOH, that means each query takes two reads, but it will still be a performance improvement in many cases.
Edit:
Attempting to explain the previous edit:
Say you have a users table/column family:
CREATE TABLE users (
    id uuid PRIMARY KEY,
    display_name text,
    avatar text
);
And you want to find a user's avatar given a display name (a contrived example). Searching users will be slow. So, you could create a table/CF that serves as an index, let's call it users_by_name:
CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    user_id uuid
);
The search on display_name is now done against users_by_name, and that gives you the user_id, which you use to issue a second query against users. In this case, user_id in users_by_name has the value of the primary key id in users. Both queries are fast.
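As a sketch of that two-step lookup against the tables above ('alice' is just an example value):

-- Step 1: partition-key lookup in the index table
SELECT user_id FROM users_by_name WHERE display_name = 'alice';

-- Step 2: primary-key lookup in the main table, using the id from step 1
SELECT avatar FROM users WHERE id = 123e4567-e89b-12d3-a456-426614174000;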
Or, you could put avatar in users_by_name, and accomplish the same thing with one query by using more disk space.
CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    avatar text
);

Converting an Excel spreadsheet to a MySQL database

I have a horse racing database that has the results for all handicap races for the 2010 flat season. The spreadsheet has now got too big, and I want to convert it to a MySQL database. I have looked at many sites about normalizing data and database structures, but I just can't work out what goes where, and what PRIMARY KEYs, FOREIGN KEYs, etc. are. I have over 30,000 lines in the spreadsheet. The column headings are:
RACE_NO,DATE,COURSE,R_TIME,AGE,FURS,CLASS,PRIZE,RAN,Go,BHB,WA,AA,POS,DRW,BTN,HORSE,WGT,SP,TBTN,PPL,LGTHS,BHB,BHBADJ,BEYER
Most of the columns are obvious; to explain the less obvious ones: BHB is the class of race, WA and AA are weight allowances for age and weight, TBTN is total distance beaten, PPL is pounds per length, and the last 4 are ratings.
I managed to export into MySQL as a flat file by saving the spreadsheet as a comma-delimited file, but I need to structure the data into a normalized state with the proper keys.
I would appreciate any advice.
Many thanks,
Davey H
To do this in the past, I've done it in several steps...
Import your Excel spreadsheet into Microsoft Access
Import your Microsoft Access database into MySQL using the MySQL Workbench (previously MySQL GUI Tools + MySQL Migration Toolkit)
It's a bit disjointed, but it usually works pretty well and saves me time in the long run.
It's kind of an involved question, and it would be difficult to give you a precise answer without knowing a little bit more about your system, but I can try to give you a high-level overview of how Relational Database Management Systems (RDBMSs) are structured.
A primary key is some identifier for a particular record - usually it is unique to that record. In this case, your RACE_NO column might be a suitable primary key. That way, you can identify every race by its unique number.
Foreign keys are numbers that describe the relationships between other objects/tables in your database. For example, you may want to create a table that lists all the different classes of races. Each record in that table would have a primary key, unique to that class. If you wanted to indicate in your "races" table which class each race was, you might have a column for each record called class_id. The value of that column would be populated with primary keys from the "classes" table. You can then use join operations to bring all the information together into one view.
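To make that concrete, here is a minimal sketch in MySQL syntax (the table and column names are illustrative, not prescriptive):

CREATE TABLE classes (
    class_id INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(50) NOT NULL
);

CREATE TABLE races (
    race_no   INT PRIMARY KEY,   -- the primary key, as suggested above
    race_date DATE,
    course    VARCHAR(50),
    class_id  INT,               -- foreign key into classes
    FOREIGN KEY (class_id) REFERENCES classes(class_id)
);

-- A join brings the information back together into one view:
SELECT r.race_no, r.race_date, r.course, c.name AS class
FROM races r
JOIN classes c ON c.class_id = r.class_id;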
For more on data structures and MySQL, I suggest the W3Schools tutorial on SQL: http://www.w3schools.com/sql/sql_intro.asp
Before anything else, you need to define your data: you have to fit every column into a value space known to MySQL.
Numeric value
http://dev.mysql.com/doc/refman/5.0/en/numeric-types.html
Textual value
http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html
Date/Time value
http://dev.mysql.com/doc/refman/5.0/en/date-and-time-type-overview.html
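As a starting point, a first-pass (still unnormalized) MySQL table for your headings might look like the sketch below; the types are guesses from your descriptions, so adjust them to your actual data:

CREATE TABLE results (
    race_no   INT,
    race_date DATE,            -- your DATE column
    course    VARCHAR(40),
    r_time    TIME,
    age       TINYINT,
    furs      DECIMAL(4,1),    -- distance in furlongs
    class     VARCHAR(10),
    prize     DECIMAL(10,2),
    ran       SMALLINT,
    pos       VARCHAR(5),      -- finishing position
    horse     VARCHAR(60),
    beyer     SMALLINT         -- rating
    -- The remaining columns (Go, BHB, WA, AA, DRW, BTN, WGT, SP, TBTN,
    -- PPL, LGTHS, BHBADJ) follow the same pattern. Note that BHB appears
    -- twice in your headings, so one occurrence will need renaming.
);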
