Origin and last destination of a customer who has taken multiple flights - apache-spark

For every customer id, I need to find the first flight's origin and the last flight's destination using Spark / PySpark.
cid --> customer ID
fid --> flight ID -- it is random, so it cannot be used for ordering
Input DataFrame:

cid | fid | origin  | destination
----+-----+---------+------------
1   | f12 | texas   | newyork
1   | f21 | newyork | delhi
1   | f08 | delhi   | japan

Travel path: texas -> newyork -> delhi -> japan
Output DataFrame:

cid | origin | destination
----+--------+------------
1   | texas  | japan
I tried joining the table to itself, but I am not sure how to merge the result set after the joins so that only the starting flight's origin and the last flight's destination remain.

Update: In case the order is not certain:
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df.groupBy("cid") \
  .agg(F.collect_list("origin").alias("origin"),
       F.collect_list("destination").alias("destination")) \
  .select(col("cid"),
          F.array_except(col("origin"), col("destination")).alias("origin"),
          F.array_except(col("destination"), col("origin")).alias("destination")) \
  .show(truncate=False)
Essentially, first collect all origins and all destinations as arrays. The stops (i.e. every point except the starting point and the final destination) will be common to both arrays, so array_except removes them, leaving only the true origin and the true destination.
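The set-difference idea can be checked outside Spark with plain Python (hypothetical sample data matching the question's example):

```python
# Plain-Python sketch of the array_except idea: the true origin is the
# only origin that never appears as a destination, and vice versa.
origins = ["texas", "newyork", "delhi"]
destinations = ["newyork", "delhi", "japan"]

first_origin = [o for o in origins if o not in destinations]
last_destination = [d for d in destinations if d not in origins]

print(first_origin, last_destination)  # ['texas'] ['japan']
```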
If order is certain, this would also work:
df.groupBy("cid").agg(F.first("origin").alias("origin"), F.last("destination").alias("destination")).show()
However, this would not work if the rows are out of order, since first and last depend on row order.
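The order sensitivity of first/last can be seen with a plain-Python analogue (hypothetical sample data):

```python
# Plain-Python analogue of first(origin) / last(destination): it only
# gives the right answer when rows arrive in travel order.
trips = [("texas", "newyork"), ("newyork", "delhi"), ("delhi", "japan")]

origin, destination = trips[0][0], trips[-1][1]
print(origin, destination)  # texas japan

# With the rows shuffled, the same logic picks the wrong endpoints:
shuffled = [trips[1], trips[2], trips[0]]
print(shuffled[0][0], shuffled[-1][1])  # newyork newyork
```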

Assuming that the passenger never takes more than one plane at a time, and that there is therefore a unique path for them to travel, and given a known start location start:

Run a query SELECT * FROM df WHERE origin = start; that way you get the reachable destination.
Store that destination in a list.
Re-run the query with start set to that destination for the next step.

This will discover the whole path. In case start is not given, you can do something more complex, such as picking a random record, reaching all destinations as explained above, and then also walking backwards to reach all sources; you then join these two partial paths and you have successfully reconstructed the walk.
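The steps above can be sketched in plain Python (hypothetical data, one leg per row; the start is found as the only origin that never appears as a destination):

```python
# Map each origin to its destination, find the starting point, then
# follow the chain until there is no next leg.
legs = {"texas": "newyork", "newyork": "delhi", "delhi": "japan"}

start = next(o for o in legs if o not in legs.values())
path = [start]
while path[-1] in legs:
    path.append(legs[path[-1]])

print(path)  # ['texas', 'newyork', 'delhi', 'japan']
```

The first and last elements of the reconstructed path are the origin and final destination asked for in the question.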

Related

organize a set of data based on open slots/sold-out slots

I am trying to analyze data based on the following scenario:
A group of places, each with its own ID, becomes available for visiting from time to time for a limited number of people. This number varies according to how well the last visit season performed; so far, visit seasons have been opened three times.
Let's suppose ID_01 in those three seasons had the following available slots/sold-out slots ratio: 25/24, 30/30, and 30/30, ID_02 had: 25/15, 20/18, and 25/21, and ID_03 had: 25/10, 15/15 and 20/13.
What would be the best way to design the database for such analysis on a single table?
So far I have used one table per ID with all of its available slots and sold-out amounts, but as the number of IDs grows and the number of visit seasons grows too (well beyond three at this point), this has proven far from ideal: hard to keep track of and terrible to work with.
The best solution I could come up with was putting all IDs on a column and adding two columns for each season (ID | 1_available | 1_soldout | 2_available | 2_soldout | ...).
The Wikipedia article on database normalization would be a good starting point.
Based on the information you provided in your question, you can create a single table:
AvailableDate
-------------
AvailableDateID
LocationID
AvailableDate
AvailableSlots
SoldOutSlots
...
You may also have other columns you haven't mentioned. One possibility is SoldOutTimestamp.
The primary key is AvailableDateID. It's an auto-incrementing integer that has no meaning, other than to sort the rows in input order.
You also create a unique index on (LocationID, AvailableDate) and another unique index on (AvailableDate, LocationID). This allows you to retrieve the row by LocationID or by AvailableDate.
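A minimal sketch of this single-table design, using the column names from the answer (SQLite is used here purely for illustration; the season labels and sample rows are hypothetical):

```python
import sqlite3

# One row per location per season, instead of two new columns per season.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE AvailableDate (
        AvailableDateID INTEGER PRIMARY KEY AUTOINCREMENT,
        LocationID      TEXT NOT NULL,
        AvailableDate   TEXT NOT NULL,
        AvailableSlots  INTEGER,
        SoldOutSlots    INTEGER
    )
""")
# The two unique indexes from the answer, for lookups in either direction.
con.execute("CREATE UNIQUE INDEX ix_loc_date ON AvailableDate (LocationID, AvailableDate)")
con.execute("CREATE UNIQUE INDEX ix_date_loc ON AvailableDate (AvailableDate, LocationID)")

rows = [("ID_01", "season1", 25, 24),
        ("ID_01", "season2", 30, 30),
        ("ID_02", "season1", 25, 15)]
con.executemany(
    "INSERT INTO AvailableDate (LocationID, AvailableDate, AvailableSlots, SoldOutSlots)"
    " VALUES (?, ?, ?, ?)", rows)

# Retrieve every season for one location with a single query.
result = con.execute(
    "SELECT AvailableDate, AvailableSlots, SoldOutSlots FROM AvailableDate"
    " WHERE LocationID = ? ORDER BY AvailableDate", ("ID_01",)).fetchall()
print(result)
```

Adding a fourth or fortieth season is then just inserting more rows, with no schema change.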

Spark - partitioning/bucketing of n-tables with overlapping but not identical ids

I'm currently trying to optimize a query over two rather large tables, which are characterized as follows:
Table 1: id column - alphanumerical, about 300 million unique ids, more than 1 billion rows overall
Table 2: id column - identical semantics, about 200 million unique ids, more than 1 billion rows overall
Let's say on a given day, March 17th, I want to join those two tables on id.
Table 1 is on the left, table 2 on the right, and I get about 90% matches, meaning table 2 contains about 90% of the ids present in table 1.
One week later, table 1 has not changed (it could, but to keep the explanation simple, assume it didn't), while table 2 was updated and now contains more records. I do the join again, and some of the formerly missing ids now appear, so I get about 95% matches.
In general, table1.id has some matches with table2.id at a given time, and this can change from day to day.
I now want to optimize this join and came across the bucketing feature. Is this possible?
Example:
1st join: id "ABC123" is present in table1 but not in table2. ABC123 gets sorted into a certain bucket, e.g. bucket 1.
2nd join (a week later): id "ABC123" now appears in table2; how can it be ensured that it lands in the bucket of table 2 that is co-located with the corresponding bucket of table 1?
Or do I have a general misunderstanding of how bucketing works?
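Bucketing gives exactly this guarantee, because the bucket is a pure function of the key value: the same id always hashes to the same bucket number, regardless of which table, or which day's load, it appears in. A rough plain-Python illustration of that determinism (Spark actually uses Murmur3 hashing internally; CRC32 is used here only because it is conveniently available and stable):

```python
import zlib

def bucket_of(record_id: str, num_buckets: int = 8) -> int:
    # The bucket is a deterministic function of the id alone, so the same
    # id lands in the same bucket no matter which table or load date it
    # comes from.
    return zlib.crc32(record_id.encode()) % num_buckets

# "ABC123" maps to the same bucket whether it shows up in table 1 today
# or in table 2 a week later:
b1 = bucket_of("ABC123")
b2 = bucket_of("ABC123")
print(b1 == b2)  # True
```

So as long as both tables are bucketed by the same column into the same number of buckets, a newly arrived id in table 2 is automatically co-located with its counterpart bucket in table 1.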

Counting Unique Customers in Categories Based on Call Log Status Hierarchy

I have a table of call logs of customers. For the purposes of what I'm trying to do, the table only consists of Customer ID and Call Status. Each customer could have multiple entries with varied call statuses.
E.g. table:

Customer | Status
---------+----------------------------
1        | Didn't Reach - No Voicemail
3        | Left a Voicemail
2        | Left a Voicemail
1        | Left a Voicemail
1        | Talked With Customer
3        | Didn't Reach - No Voicemail
2        | Talked With Customer
2        | Left a Voicemail
2        | Talked With Customer
My output should be:

Status                      | # of Customers
----------------------------+---------------
Talked With Customer        | 2
Left a Voicemail            | 1
Didn't Reach - No Voicemail | 0
This is because the three statuses follow this hierarchy:
1. Talked with Customer
2. Left a Voicemail
3. Didn't Reach - No Voicemail
A customer should only be counted in the highest category they reached (1 > 2 > 3), which is why, in my example, there are no customers under "Didn't Reach - No Voicemail": all of them had at least a voicemail left.
Hopefully this makes sense and thanks in advance!
An SQL statement can produce this output. It does require a table of the status hierarchy if you want to show all statuses, even those with no data. You can export the query to Excel, or Excel can link to the query object. Alternatively, Excel VBA can connect to the Access db, pull data with an SQL statement into a Recordset object, and use the CopyFromRecordset method to write the data to a worksheet range.
SELECT StatusHierarchy.StatusDesc, Count(Query1.Customer) AS CntStat
FROM (SELECT Table1.Customer, Min(StatusHierarchy.StatusValue) AS MinOfStatusValue
      FROM StatusHierarchy
      LEFT JOIN Table1 ON StatusHierarchy.StatusDesc = Table1.Status
      GROUP BY Table1.Customer) AS Query1
RIGHT JOIN StatusHierarchy ON Query1.MinOfStatusValue = StatusHierarchy.StatusValue
GROUP BY StatusHierarchy.StatusDesc
ORDER BY Count(Query1.Customer) DESC;
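The inner-query logic (each customer counted only under the lowest rank they ever reached) can be verified in plain Python against the question's example data:

```python
# Rank each status (1 is best); then, per customer, keep only the best
# rank seen across all their calls, and count customers per rank.
calls = [(1, "Didn't Reach - No Voicemail"), (3, "Left a Voicemail"),
         (2, "Left a Voicemail"), (1, "Left a Voicemail"),
         (1, "Talked With Customer"), (3, "Didn't Reach - No Voicemail"),
         (2, "Talked With Customer"), (2, "Left a Voicemail"),
         (2, "Talked With Customer")]
rank = {"Talked With Customer": 1, "Left a Voicemail": 2,
        "Didn't Reach - No Voicemail": 3}

best = {}  # customer -> lowest (best) rank seen, like Min(StatusValue)
for cust, status in calls:
    best[cust] = min(best.get(cust, len(rank) + 1), rank[status])

counts = {s: sum(1 for r in best.values() if r == v) for s, v in rank.items()}
print(counts)
```

This reproduces the expected output: two customers under "Talked With Customer", one under "Left a Voicemail", and zero under "Didn't Reach - No Voicemail".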

Resolve "missing relationships" message

I get the "relationships between tables may be needed" message in Excel 2016 Power Pivot. I suspect this is because I have a many-to-many relationship, but I lack the experience to be sure. I'll be happy to read up on it if someone can help me put a name to the problem.
The data that I'm analyzing looks like this:
Units - Instances of a product. E.g. the computer was produced 100 times = 100 units. Each unit has ONE ProductID, and a UnitID (unique). Units have e.g. a production time.
Products - Each product can consist of several items. A product could be computer xyz = "laptop xyz, driver CD, manual, power brick, 2 chocolates and mains cable". Each product has a unique ProductID and an PackListID. Several products can have the same PackListID.
PackList - Unique PackListIDs
CompositeList - for each PackListID, this can contain several ItemIDs and number included in packing list (e.g. 2 pieces of chocolate)
Items - line items for a product packing list, e.g. "manual for computer xyz"
I want to answer the question: How many (compositelist.number) line items of each Items have been included in units produced on a particular date (units.date).
I read the data via SQL from an existing application/database server. I cannot influence how the data is structured. :(
I've imported my data and connected as follows:
Unit.ProductID -> Product.ProductID (each product can have many units)
Product.PackListID -> PackList.PackListID (bridge table)
PackList.PackListID -> CompositeList.PackListID (each packlist can have many composite-rows)
CompositeList.ItemID -> Items.ItemID (each composite-list refers to one item)
If I filter by PackListID, the result is a correct list of line items with sums.
Now I'd like to multiply that with the number of times that packlist has been produced on a particular day...
My pivot looks like this:
Filter: Production date, filter to a particular date
Rows: Items.text
Values: Sum of compositelist.number
I would love to see the number of line items used in production on a certain day, e.g. 15 manuals and 32 mains cables.
Unfortunately I get the message "Relationships between tables may be needed".

Calculate Sum of occurrences in Excel

I currently have the following data set (see the first image for the dataset and the second image for the expected outcome).
This continues for about 12 columns, up to Service_12, and has multiple rows per Company_id - about 620,000 rows in total.
I would like a way to find the interaction between services for a company, so I would like to create a table that looks something like this:
          | Service_1             | Service_2                 | Service_3
----------+-----------------------+---------------------------+--------------------------
Service_1 | (no. of companies     | (no. of companies that    | (no. of companies that
          |  that used Service_1) |  used Service_1 and _2)   |  used Service_1 and _3)
Service_2 | (as above)            | (as above)                | (as above)
I would like this cross-service table to cover all 12 services; each cell is simply the number of companies that used both services.
Any ideas?
Thanks
Try using this:
=COUNTIFS(B2:B5,">0",C2:C5,">0")
This formula should give you how many companies used Service 1 (column B) and Service 2 (column C). Or:
=SUMIFS(B2:B5,B2:B5,">0",C2:C5,">0")+SUMIFS(C2:C5,B2:B5,">0",C2:C5,">0")
if you want to know how many times the services were actually used.
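What the COUNTIFS formula computes, expressed in plain Python (the four sample rows are hypothetical; one row per company, values are usage counts):

```python
# Count companies whose usage is > 0 in BOTH service columns - this is
# exactly the COUNTIFS(B:B,">0",C:C,">0") logic.
service_1 = [2, 0, 1, 3]  # column B
service_2 = [1, 4, 0, 2]  # column C

both = sum(1 for a, b in zip(service_1, service_2) if a > 0 and b > 0)
print(both)  # 2
```

Repeating this for every pair of service columns fills in one cell of the 12x12 cross table.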
I think a pivot table would do this.
I would use a dynamic named range to capture your data if you intend to add companies regularly:
https://support.microsoft.com/en-us/help/830287/how-to-create-a-dynamic-defined-range-in-an-excel-worksheet
If you put Services in both the rows and the columns, and Count of companies in the Values box, that should do it!