I have a CSV file containing Netflix viewing data for all users on an account (38k entries), which I am analyzing in Power BI.
There was no Movie/TV column, which I needed, but it's clear that entries with the word 'Episode' in them are episodes from TV shows, Netflix series, etc., so I created a column in Power Query based on that. Here is an example of what I mean (other columns removed).
Title                                                        | ContentType
The Office (U.S.): Season 5: Heavy Competition (Episode 24)  | TV
Forensic Files: Collection 1: A Tight Leash (Episode 2)      | TV
Kung Fu Panda                                                | Movie
Teen Wolf: Season 1: Lunatic (Episode 8)                     | TV
Kung Fu Panda 2                                              | Movie
This seems to have worked quite well, but ideally I want a way to be sure I don't have any erroneously labeled entries (e.g. "Star Wars: A New Hope (Episode IV)", which is not an entry in this dataset, but there is a risk of other 'Movie' titles using this format that I can't check for manually).
I am a total regex beginner, and sloppily put together the expression \b[Ee]pisode[\s\S]\b[^0123456789] to try to find any entries with the word "episode" that didn't have a number following it, and all entries were still TV shows, but this would not account for something like "A New Hope(Episode 4)".
I'm a little stuck now and there are likely other exceptions to the 'Episode' rule that I am not considering. Functionally, the way I have done this is working for my purposes, but I'm trying to show due diligence for anyone that reads my report.
My question: is there a better expression to try that would account for such outliers?
Thanks!
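One possible way to audit the labels, sketched here in Python rather than Power Query, is to accept as TV only those titles that match the full shape seen in the sample rows, and to flag everything else containing the word "episode" for manual review. The three-part shape "Show: Season/Collection: Episode title (Episode N)" is an assumption drawn from the five example rows, and the names STRICT_TV, EPISODE_WORD and classify are hypothetical.

```python
import re

# Assumed TV shape, taken from the sample rows:
# "<show>: <season or collection>: <episode title> (Episode <digits>)"
STRICT_TV = re.compile(r"^.+: .+: .+ \(Episode \d+\)$")

# Any mention of the word "episode", in any letter case.
EPISODE_WORD = re.compile(r"\bepisode\b", re.IGNORECASE)


def classify(title: str) -> str:
    """Return 'TV', 'Movie', or 'Review' for a single title."""
    if STRICT_TV.match(title):
        return "TV"
    if EPISODE_WORD.search(title):
        # Mentions "episode" but does not fit the assumed TV shape,
        # so a human should look at it.
        return "Review"
    return "Movie"


titles = [
    "The Office (U.S.): Season 5: Heavy Competition (Episode 24)",
    "Kung Fu Panda",
    "Star Wars: A New Hope (Episode IV)",
    "A New Hope(Episode 4)",
]
labels = {title: classify(title) for title in titles}
```

The count of 'Review' rows is the number to report: if it is zero, every 'Episode' entry follows the assumed TV shape; if not, the flagged titles form a short list that can be checked by hand.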
I have event tracking data in Excel that is a bit like this.
Event   | Name | Email      | Attend 1 | Attend 2 | Attend 3
Event 1 | Joe  | Joe#email  | Yes      |          |
Event 2 | Joe  | Joe#email  |          | No       |
Event 1 | Bob  | Bob#email  | No       |          |
Event 1 | Sara | Sara#email | Yes      |          |
Event 2 | Sara | Sara#email |          | Yes      |
Event 3 | Sara | Sara#email |          |          | Yes
Event 2 | Ray  | Ray#email  |          | Yes      |
Event 3 | Ray  | Ray#email  |          |          | No
I am trying to combine / collapse the data so that each Name & Email is a unique line (using email as the unique identifier) and merge the Attend row data. The blank cells are also important for later work (person did not register for the event). The end data should look like this:
Name | Email      | Attend 1 | Attend 2 | Attend 3
Joe  | Joe#email  | Yes      | No       |
Bob  | Bob#email  | No       |          |
Sara | Sara#email | Yes      | Yes      | Yes
Ray  | Ray#email  |          | Yes      | No
I've tried things like adding helper columns to ID duplicate rows, filtering, and various lookup versions, but I keep getting stuck. Most solutions I find online showed concatenating data, using groups, or pivot tables that don't get the end result I need or used commercial add-ons like AbleBits. I saw a similar sounding post on SO (link) but it didn't quite get at my situation and the solution didn't seem to work for me (unless I was not doing it right). Is there a way to use Excel formulas or regular menu options to obtain my result? If there are more complex solutions, e.g. VBA or Power Query, please provide detailed steps as I'm not familiar with those tools. I could likely export the data and resolve it in R, but I'm looking for a native Excel solution.
Thanks!
You can do =UNIQUE(C2:C100) in a separate column and then copy and paste those values over themselves (assuming the emails are in C2:C100 and the headers are in the first row).
Then in an adjacent column do something like =CONCAT(FILTER(D$2:D$100, $C$2:$C$100=$H2)) and drag down and to the right (where column H contains the unique emails). Then simply copy and paste values over themselves again and remove the old columns.
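To cross-check the result outside Excel, the same collapse can be sketched in a few lines of Python. This is only an illustration of the logic (one record per email, keeping whichever Attend cell is non-blank), not a replacement for the formulas; the function name collapse and the tuple layout are assumptions.

```python
# Each row mirrors the sheet: Event, Name, Email, Attend 1, Attend 2, Attend 3.
# Blank cells are represented as empty strings.
rows = [
    ("Event 1", "Joe", "Joe#email", "Yes", "", ""),
    ("Event 2", "Joe", "Joe#email", "", "No", ""),
    ("Event 1", "Bob", "Bob#email", "No", "", ""),
    ("Event 1", "Sara", "Sara#email", "Yes", "", ""),
    ("Event 2", "Sara", "Sara#email", "", "Yes", ""),
    ("Event 3", "Sara", "Sara#email", "", "", "Yes"),
    ("Event 2", "Ray", "Ray#email", "", "Yes", ""),
    ("Event 3", "Ray", "Ray#email", "", "", "No"),
]


def collapse(rows):
    """Merge rows that share an email, keeping non-blank Attend values."""
    merged = {}  # email -> [name, email, attend 1, attend 2, ...]
    for _event, name, email, *attend in rows:
        record = merged.setdefault(email, [name, email] + [""] * len(attend))
        for i, value in enumerate(attend):
            if value:  # a blank cell never overwrites an answer
                record[2 + i] = value
    return list(merged.values())


result = collapse(rows)
```

Cells that stay blank mean the person never registered for that event, which is the distinction the question wants to preserve.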
I have an Excel table (formatted as a table and named "table 1"), so the table columns are named #LastName, #FirstName and so on.
This table records when people attended a seminar. They can attend any number of seminars, and they can also attend the same seminar two or more times.
I now want to find the most recent date on which a person attended a particular seminar.
Let's give an example:
Table:
#FirstName #LastName #Seminar #Date
Frank Mayer Workshop 1 2017/01/15
Frank Mayer Workshop 2 2019/05/27
Sabine Adams Workshop 1 2017/01/15
Volker Mueller Workshop 1 2017/01/15
Frank Mayer Workshop 1 2018/04/23
As you can see from this simple example, Frank Mayer attended Workshop 1 twice. All others attended each workshop only once.
The goal is to have a list of name, workshop and last attendance. So the final list should look like:
#FirstName #LastName #Seminar #Date
Frank Mayer Workshop 2 2019/05/27
Sabine Adams Workshop 1 2017/01/15
Volker Mueller Workshop 1 2017/01/15
Frank Mayer Workshop 1 2018/04/23
I really have no idea how to solve this with Excel formulas, since it involves not only comparing dates but also finding duplicate entries that differ only in the date. If possible, I'd like to NOT use VBA programming.
Do you guys have any idea? My table has 1500 lines, so doing that by hand is not an option...
Maybe there is a way to create a new sheet or table with the results?
Best Regards
Olaf
A more dynamic approach
(i.e. you would not have to go through the entire process each time more data is added.)
Add a helper column, named e.g. "ID", that combines first name, last name and workshop in each cell. Use either CONCATENATE or =Name & Lastname & Workshop to build it.
Create a pivot table from your table with the following field list: Rows: add "ID"; Values: add "Date" and change the value field settings (right-click) to "Max of Date".
Remember to refresh the pivot table when adding new data, or change the pivot settings to update automatically when opening the workbook.
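The pivot table is effectively computing a maximum date per ID. As a rough sketch of that logic in Python (for checking results, not as part of the Excel solution), with the tuple key standing in for the helper column and latest_attendance being a hypothetical name:

```python
from datetime import date

# Sample rows from the question: first name, last name, seminar, date.
records = [
    ("Frank", "Mayer", "Workshop 1", date(2017, 1, 15)),
    ("Frank", "Mayer", "Workshop 2", date(2019, 5, 27)),
    ("Sabine", "Adams", "Workshop 1", date(2017, 1, 15)),
    ("Volker", "Mueller", "Workshop 1", date(2017, 1, 15)),
    ("Frank", "Mayer", "Workshop 1", date(2018, 4, 23)),
]


def latest_attendance(records):
    """Return {(first, last, seminar): most recent date}."""
    latest = {}
    for first, last, seminar, when in records:
        key = (first, last, seminar)  # plays the role of the "ID" column
        if key not in latest or when > latest[key]:
            latest[key] = when  # the "Max of Date" step
    return latest


latest = latest_attendance(records)
```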
There are a thousand and one ways to do this and you might be surprised how easy it is to do manually too.
If you're reducing the table, meaning removing rows from it, then you could...
Add a helper column which concatenates the first name, last name, and workshop.
Then sort by this column and by date in descending order.
Remove duplicates based only on the helper column.
Remove the helper column
Sort as desired
The trick is the sort. Sorting by the helper column groups the repeated names per workshop, and sorting the date in descending order ensures the most recent entry is at the top of each group. So when you remove duplicates, the first occurrence will be the most recent and will be retained, but subsequent entries will be removed.
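The sort-then-deduplicate idea can be illustrated with a short Python sketch. It assumes the dates are stored as year/month/day text, as in the question, so that plain text ordering matches date ordering; keep_most_recent is a hypothetical name.

```python
# Sample rows from the question: first name, last name, seminar, date text.
records = [
    ("Frank", "Mayer", "Workshop 1", "2017/01/15"),
    ("Frank", "Mayer", "Workshop 2", "2019/05/27"),
    ("Sabine", "Adams", "Workshop 1", "2017/01/15"),
    ("Volker", "Mueller", "Workshop 1", "2017/01/15"),
    ("Frank", "Mayer", "Workshop 1", "2018/04/23"),
]


def keep_most_recent(records):
    """Sort newest first within each person+seminar group, keep the first."""
    # Newest first overall, then a stable sort by the helper key, so the
    # newest row ends up at the top of each group.
    ordered = sorted(records, key=lambda r: r[3], reverse=True)
    ordered.sort(key=lambda r: r[:3])
    seen = set()
    kept = []
    for row in ordered:
        key = row[:3]  # the helper column: first name, last name, seminar
        if key not in seen:  # first occurrence is the most recent one
            seen.add(key)
            kept.append(row)
    return kept


kept = keep_most_recent(records)
```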
I'd advise you to use Excel's "Subtotals" feature, choosing the Max function.
You might encounter problems because you need to give a single column in the first entry of the dialog, but this can be solved by creating an extra column that appends first name and last name, and basing the subtotal on that (hidden) column (sorry for the Dutch, I don't have an English Excel).
The corresponding entries for the other columns can be found using basic search formulas (Match, VLookup, ...).
I would recommend using a pivot table as the simplest and fastest way.
I'm working on a table that lists courses and students, and I'm trying to find out which courses can be put in the same time slot without any conflicts (to prevent a student having more than one exam at the same time), either in Google Sheets or Microsoft Excel.
Below is a simple example of the main table
StudentID Course Name
1 Math
1 English
1 Computer
2 English
2 Computer
3 Physics
and I want something similar to below
English Computer Physics
Math No No Yes
English - No Yes
Computer No - Yes
Physics - Yes -
Simply put, I want to know which courses can be put together in the same time slot without conflicts.
Paste in cell D2:
=UNIQUE(B2:B)
Paste in cell E1:
=TRANSPOSE(UNIQUE(B3:B))
Paste in cell E2, then drag down and to the right:
=ARRAYFORMULA(IF(E$1=$D2, "-",
IF(SUM(N(REGEXMATCH(FILTER($A$2:$A, $B$2:$B=E$1)&"",
"^"&TEXTJOIN("$|^", 1, FILTER($A$2:$A, $B$2:$B=$D2))&"$")))=0, "yes", "no")))
Google Sheets
The condition should be that the count of unique student IDs is the same as the total count of student IDs for the two subjects:
=if(E$1=$D2,"-",if(count(filter($A$2:$A,regexmatch($B$2:$B,E$1&"|"&$D2)))=countunique(filter($A$2:$A,regexmatch($B$2:$B,E$1&"|"&$D2))),"Yes","No"))
You can make it a little shorter with a grouping query that tests whether any student has more than one of the two courses:
=if(E$1=$D2,"-",if(max(query($A$2:$B,"select count(A) where B='"&E$1&"' or B='"&$D2&"' group by A"))=1,"Yes","No"))
Excel
This requires a bit more effort. You'd probably use FREQUENCY to get the same effect as grouping in Google Sheets. The logic is the same though:
=IF(E$1=$D2,"-",IF(MAX(FREQUENCY(IF(($B$2:$B$7=E$1)+($B$2:$B$7=$D2),$A$2:$A$7),$A$2:$A$7))=1,"Yes","No"))
This assumes that the IDs are numeric. It has to be entered as an array formula using Ctrl+Shift+Enter.
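The logic behind both spreadsheet versions is that two courses can share a time slot only if no student is enrolled in both. A minimal Python sketch of that test, using the sample data from the question (the name compatible_pairs is hypothetical and the sketch is meant for checking results, not as a spreadsheet solution):

```python
from itertools import combinations

# Sample enrolments from the question: (student ID, course name).
enrolments = [
    (1, "Math"),
    (1, "English"),
    (1, "Computer"),
    (2, "English"),
    (2, "Computer"),
    (3, "Physics"),
]


def compatible_pairs(enrolments):
    """Return {(course_a, course_b): True if they share no students}."""
    students = {}  # course -> set of student IDs
    for student_id, course in enrolments:
        students.setdefault(course, set()).add(student_id)
    return {
        (a, b): students[a].isdisjoint(students[b])
        for a, b in combinations(students, 2)
    }


pairs = compatible_pairs(enrolments)
```

A value of True corresponds to "Yes" in the grid, and False to "No".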
I am currently working with a spreadsheet in which I have to collect the addresses of UK business directors. Some of the directors have multiple addresses. A UK zip code consists of two segments, and I have to ignore any address where the first segment of the zip code starts with W1, SW1, EC1, EC2, EC3 or EC4 and ends with a letter rather than a number, because those are generally industrial addresses. For example, in the following addresses the first segment starts with W1, SW1, EC1, EC2, EC3 or EC4 and ends with a letter, so I must not collect them.
3RD FLOOR 207 REGENT STREET, LONDON, W1B 3HH
42, CHARTERHOUSE SQUARE, LONDON EC1, EC1M 6EU
5, WESTMINSTER GARDENS, LONDON, MARSHAM STREET, SW1P 4JA
160, QUEEN VICTORIA STREET, LONDON, EC4V 4QQ
THE BROADGATE TOWER 20, PRIMROSE STREET, LONDON, EC2A 2RS
BAKERS' HALL 9, HARP LANE, LONDON, EC3R 6DP
Please also note that zip codes such as W11, W12, W13 and so on are not prohibited, and the same applies to SW1, EC1, EC2, EC3 and EC4.
As we are working on bulk data, it's impossible to notice the zip codes while collecting the addresses. So I have tried combinations of various formulas to show the first segment of the zip codes mentioned above in a separate column right after pasting the address. This way I can see the prohibited zip codes immediately. I have not managed to crack it, but I got close. I am sharing what I have tried and am looking for a compact solution from you guys.
First, I created a separate column E that shows the first segment of the zip code when I paste an address into column D. Here is the formula I used:
=TRIM(LEFT(RIGHT(" "&SUBSTITUTE(TRIM(D2)," ",REPT(" ",60)),120),60))
Next I tried to show only those values which start with W1, SW1, EC1, EC2, EC3 or EC4 and end with a letter. So I created another column F with this formula for "SW1":
=IF(NOT(ISNUMBER(VALUE(RIGHT(E2,1)))), IF(ISNUMBER(SEARCH("SW1",E2)), LEFT(E2,3), ""),"")
So to check for W1, EC1, EC2, EC3 and EC4 as well, I have to create 5 more columns with the same formula, changing only the value in the SEARCH function. That leads to 6 extra columns, and I want a compact formula to save space, because I generally split the screen between the browser and Excel so that I can copy and paste data into the spreadsheet without minimizing it. This saves me a lot of time. But six extra columns would make my work more time consuming, as I would have to check all six columns for those zip codes.
Question - 1:
Is there any way to make a compact formula that shows my desired result in a single column?
Question - 2:
We also have to ignore addresses which contain the words "floor", "house" or "airport". I have tried the formula below for a single word:
=IF(ISNUMBER(SEARCH("Floor",D2)), "Floor", "")
Is there any possibility of combining all the required formulas and showing the result in one column?
Update regarding Question - 1:
I have tried to combine some other formulas to show the required result. But I only managed to show the zip codes which start with W1, SW1, EC1, EC2, EC3 or EC4, and I can't modify it to also exclude the results where the last character is a number. Here is the formula:
=IF(ISNUMBER(FIND("W1",(LEFT(E2,2)),1))=TRUE,LEFT(E2,2)&IF(NOT(ISNUMBER(VALUE(RIGHT(E2,1)))), RIGHT(E2,1),""),IF(ISNUMBER(FIND("SW1",E2,1))=TRUE,LEFT(E2,3)&IF(NOT(ISNUMBER(VALUE(RIGHT(E2,1)))), RIGHT(E2,1),""),IF(ISNUMBER(FIND("EC1",E2,1))=TRUE,LEFT(E2,3)&IF(NOT(ISNUMBER(VALUE(RIGHT(E2,1)))), RIGHT(E2,1),""),IF(ISNUMBER(FIND("EC2",E2,1))=TRUE,LEFT(E2,3)&IF(NOT(ISNUMBER(VALUE(RIGHT(E2,1)))), RIGHT(E2,1),""),IF(ISNUMBER(FIND("EC3",E2,1))=TRUE,LEFT(E2,3)&IF(NOT(ISNUMBER(VALUE(RIGHT(E2,1)))), RIGHT(E2,1),""),IF(ISNUMBER(FIND("EC4",E2,1))=TRUE,LEFT(E2,3)&IF(NOT(ISNUMBER(VALUE(RIGHT(E2,1)))), RIGHT(E2,1),""),""))))))
Your formula in column E could combine the check that the last character is not a number with the parsing of the first section of the postal code, with something like the following.
=IF(ISERROR(--RIGHT(A2)), TRIM(LEFT(TRIM(RIGHT(A2, 8)), 4)), "")
That makes the parsing dependent on a British postal code being either 7 or 8 characters wide with the first section being either 3 or 4 characters. The 'wandering' space is either picked up and trimmed off or not picked up at all depending on the length.
With column Y listing the prefixes of the postal codes to be ignored and column Z (in any worksheet) devoted to a cross-reference list of exceptions like the following:
Note that each of the entries in the Ignore list and the Exceptions list all carry the wildcard asterisk as a suffix. This is necessary to deal with staggered lengths of the comparisons to be made.
The formula used for a Conditional Formatting Rule for A2:E999 would be,
=AND(SUMPRODUCT(COUNTIF($E2, $Y$2:$Y$7)), NOT(SUMPRODUCT(COUNTIF($C2, $Z$2:$Z$4))))
This resolves to TRUE for any postal code that should be ignored and is not in the exception list.
The cross-reference tables of ignores and exceptions may benefit from becoming named ranges for easy reference. You could use a dynamic range definition for the Refers to: of something like:
=Sheet1!$Y$2:INDEX(Sheet1!$Y:$Y, MATCH("zzz", Sheet1!$Y:$Y))
=Sheet1!$Z$2:INDEX(Sheet1!$Z:$Z, MATCH("zzz", Sheet1!$Z:$Z))
Here's a formula to tell you whether the first word after the last comma ends with a number.
=ISNUMBER(NUMBERVALUE(RIGHT(LEFT(TRIM(RIGHT(SUBSTITUTE(B1,",",REPT(" ",LEN(B1))),LEN(B1))),SEARCH(" ",TRIM(RIGHT(SUBSTITUTE(B1,",",REPT(" ",LEN(B1))),LEN(B1))))-1))))
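For checking the spreadsheet results against the rules stated in the question, the two tests (ignored prefix followed by a letter, and ignored words anywhere in the address) can be sketched in Python. This is a rough illustration under stated assumptions, not a replacement for the formulas: it assumes the zip code is always the last two space-separated tokens of the address, reads "ends with a letter" as the prefix followed by exactly one letter (as in W1B, SW1P, EC1M), and uses the hypothetical names outward_code and flag.

```python
import re

IGNORED_PREFIXES = ("W1", "SW1", "EC1", "EC2", "EC3", "EC4")
IGNORED_WORDS = ("floor", "house", "airport")


def outward_code(address: str) -> str:
    """Return the first zip code segment (second-to-last token), upper-cased."""
    parts = address.replace(",", " ").split()
    return parts[-2].upper() if len(parts) >= 2 else ""


def flag(address: str) -> list:
    """Return the reasons for ignoring an address, or an empty list."""
    code = outward_code(address)
    reasons = []
    for prefix in IGNORED_PREFIXES:
        # Prefix followed by exactly one letter: W1B matches, W11 does not.
        if re.fullmatch(re.escape(prefix) + r"[A-Z]", code):
            reasons.append(prefix)
    lowered = address.lower()
    # Substring test, like SEARCH in Excel: "CHARTERHOUSE" contains "house".
    reasons.extend(word for word in IGNORED_WORDS if word in lowered)
    return reasons


reasons = flag("3RD FLOOR 207 REGENT STREET, LONDON, W1B 3HH")
```

An empty list means the address can be collected; anything else lists every rule it trips, which fits in a single column when joined into one string.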