Replacing values that have duplicates in a pandas dataframe column - python-3.x

Suppose i have the following table that lists the maker of a unit.
import pandas as pd
df = pd.DataFrame({'Maker': ['Company1ID', 'SusanID', 'CeramiCorpID', 'PeterID', 'SaraID', 'CeramiCorpID', 'Company1ID']})
print(df)
Now consider that i have a much larger table with multiple Person and Corp ID's and i want to reclassify these into two categories, Person and Corporation has shown in the Expected Column. ID's are much more complex than what is shown (eg: f00568ab456b) and are unique for Each person or company but only companies show up in different rows.
| Maker | Expected |
|--------------|----------|
| Company1ID | Corp |
| SusanID | Person |
| CeramiCorpID | Corp |
| PeterID | Person |
| SaraID | Person |
| CeramiCorpID | Corp |
| Company1ID | Corp |
I am basically stuck trying to understand if i need to use either .apply(lamba x) or .replace with some kind of condition on .duplicated(keep=False). I'm unsure how to go about it either way.
Help appreciated!

Im not really sure this is what you want, but you could create the 'Expected' column like this:
df['Expected'] = ['Corp' if 'Corp' in maker else 'Person' for maker in df['Maker']]
EDIT:
If you want them to be classified on the amount of occurences:
df['Expected'] = ['Corp' if len(df[df['Maker'] == maker]) > 1 else 'Person' for maker in df['Maker']]
That would assume that there is no Corp which only occurs once. But if that can be the case then I how do you know if it's a Person or a Corp?

Related

Creating an MDX measure in Excel to count members of one dimension based on another dimension

I am trying to create an MDX measure in Excel (in OLAP Tools) that will count how many members there are for every other item in another dimension. As I don't know the exact syntax and notation for MDX and OLAP cubes I will try to simply explain what I want to do:
I have a pivot table based on an OLAP Cube. I have a Machine Number field stored in one dimension, that is the "parent" and for every machine number there is a number of articles that were produced (in certain period of time). Those articles are represented by Order Numbers. Those numbers are stored in another dimension. I would like the measure to count how many order numbers there are for every machine number.
So the table looks like this:
+------------------+----------------+
| [Machine Number] | [Order Number] |
+------------------+----------------+
| Machine001 | |
| | 111111111 |
| | 222222222 |
| | 333333333 |
| Machine002 | |
| | 444444444 |
| | 555555555 |
| | 666666666 |
| | 777777777 |
+------------------+----------------+
and I would like the result to be:
+------------------+----------------+------------+
| [Machine Number] | [Order Number] | [Measure1] |
+------------------+----------------+------------+
| Machine001 | | 3 |
| | 111111111 | |
| | 222222222 | |
| | 333333333 | |
| Machine002 | | 4 |
| | 444444444 | |
| | 555555555 | |
| | 666666666 | |
| | 777777777 | |
+------------------+----------------+------------+
I've tried using the COUNT function with EXISTING as well, but it wouldn't work (always showing 1, or the same wrong number for every machine). I believe that I have to somehow connect those two dimensions together so the Order Number is dependent to Machine Number, but lacking the knowledge about MDX and OLAP Cubes I don't even know how to ask Google how to do that.
Thanks in advance for any tips and solutions.
Your problem basicly is, you have two attributes in diffrent dimensions. You want to retrive the valid combinations of these attribute, further you want to count the number of attribute values avaliable in the sceond attribute based on the value of the first attribute.
Based on the above problem statement, in an OLAP cube a fact table or a Measure defines the relations between attributes of diffrent dimension linked to the Measure\Fact-Table. Take a look at the example below.(I have used the SSAS sample db Adventureworks)
--Iam trying to find the promotions that were offered for each product category.
select
[Measures].[Internet Sales Amount]
on columns,
([Product].[Category].[Category],[Promotion].[Promotion].[Promotion])
on rows
from
[Adventure Works]
Result
The result is cross-product of all the product categories and the promotions. Now lets make the cube return the valid combinations only.
select
[Measures].[Internet Sales Amount]
on columns,
nonempty(
([Product].[Category].[Category],[Promotion].[Promotion].[Promotion])
,[Measures].[Internet Sales Amount])
on rows
from
[Adventure Works]
Result
Now we indicated that it needs to return only valid combinations. Note that we provided a measure that belonged to the fact connecting the two dimensions. Now lets count them
with member
[Measures].[test]
as
count(
nonempty(([Product].[Category].currentmember,[Promotion].[Promotion].[Promotion]),[Measures].[Internet Sales Amount])
)
select
[Measures].[Test]
on columns,
[Product].[Category].[Category]
on rows
from
[Adventure Works]
Result
Alternate query
with member
[Measures].[test]
as
{nonempty(([Product].[Category].currentmember,[Promotion].[Promotion].[Promotion]),[Measures].[Internet Sales Amount]) }.count
select
[Measures].[Test]
on columns,
[Product].[Category].[Category]
on rows
from
[Adventure Works]

VLOOKUP to merge two tables into one

I have 2 price lists from 2 different companies but there are some many similar item numbers, is there a code to merge the pricelists into one? below example of what I have.
A-Pricelist
Item | Product | Price
382101 | Truck | 130$
212012 | car | 80$
B-Pricelist
Item | Product | Price
111011 | Airplane | 500$
382101 | truck | 50$
Expected result
Item | Product | A Price | B Price
382101 | Truck | 50$ | 130$
212012 | car | 80$ | -
111011 | Airplane | - |500$
I have seen it is done by Vlookup, but it is just not working for me, thanks.
So, vlookup will work fine, I would think about how to control the list of unique values using a dropdown...
But here is an example, updated to deal with missing prices:

Creating Excel Pivot Table to Include Only Rows with Non-Empty Values in Multiple Columns

I'm looking for something very similar to this: Converting multiple variables into values with excel pivot tables or power pivot
First, I want to say that I've been researching this and trying things for over a week, so my brain may be a little burned out. I could be missing something simple, or what I think should be possible may not be able to be done in Excel.
I think the best way to explain what I need is with a completely different type of situation (i.e. this is not the type of data I'm using, but the situation is the same):
Header row: Movie Name | Release Date | Fantasy | Sci Fi | Comedy | Horror
First row: Galaxy Quest | 1999 | | X | X |
Second row: Jumanji | 1995 | X | | X |
Third row: The Company of Wolves | 1984 | X | | | X
Fourth row: Aliens | 1986 | | X | | X
Fifth row: Spaceballs | 1987 | | X | X |
Sixth row: Willow | 1988 | X | | X |
I have a pivot table showing, at present, all of the rows of data.
What I need is a slicer that will show 1) all comedy, no matter the other choices; or 2) all comedy and fantasy, which would show Galaxy Quest (comedy), Jumanji (comedy and fantasy), The Company of Wolves (fantasy), Spaceballs (comedy), Willow (comedy and fantasy); etc.
I hope the question makes sense. Please let me know if you need further clarification.

Should all fields that are visible on a screen be validated in Gherkin?

We are creating Gherkin feature files for our application to create executable specifications. Currently we have files that look like this:
Given product <type> is found
When the product is clicked
Then detailed information on the product appears
And the field text has a value
And the field price has a value
And the field buy is available
We are wondering if this whole list of and keywords that validate if fields are visible on the screen is the way to go, or if we should shorten that to something like 'validate input'.
We have a similar case in that our service can return a lot of 10's of elements for each case that we could validate. We do not validate every element for each interaction, we only test the elements that are relevant to the test case.
To make it easier to maintain and switch which elements we are using, we use scenario outlines and tables of examples.
Scenario Outline: PO Boxes correctly located
When we search in the USA for "<Input>"
Then the address contains
| Label | Text |
| PO Box | <PoBox> |
| City name | <CityName> |
| State code | <StateCode> |
| ZIP Code | <ZipCode> |
| +4 code | <ZipPlus4> |
Examples:
| ID | Input | PoBox | CityName | StateCode | ZipCode |
| 01 | PO Box 123, 12345 | PO Box 123 | Boston | MA | 12345 |
| 02 | PO Box 321, Whitefish | PO Box 123 | Whitefish | MN | 54321 |
By doing it this way, we have a generic step "the address contains" that uses the 'Label' and 'Text' to test the individual elements. It is a neat and tidy way to test a lot of potential combinations - but it probably depends on your individual use case - how important all of the fields are.
You only need to validate the ones that provide business value, which is probably all of them. I would avoid using tech terms like "field" because it isn't related to a behavior. Al Mills is right on for using the tables.
I'd word it like this:
Scenario Outline: Review product details
Given I find the product <Type>
When I select the product
Then detailed information on the product appears including
| Description | <Description> |
| Price | <Price> |
And I can buy the product
Examples:
| Type | Description | Price |
| Hose | Rubber Hose | 31.99 |
| Sprinkler | Rotating Sprinker | 12.99 |
The words I chose are behaviors or whats, not technical implementations or hows.

SpecFlow - Is it possible to reuse test data within feature file?

Is there any way to reuse data in SpecFlow feature files?
E.g. I have two scenarios, which both uses the same data table:
Scenario: Some scenario 1
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
When ...
Scenario: Some scenario 2
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
In these simple examples the tables are small and there not a big problem, however in my case, the tables have 20+ rows and will be used in at least 5 tests each.
I'd imagine something like this:
Having data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
Given I have a data table "Employee"
When ...
Scenario: Some scenario 2
Given I have a data table "Employee"
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
I couldn't find anything like this in SpecFlow documentation. The only suggestion for sharing data was to put it into *.cs files. However, I can't do that because the Feature Files will be used by non-technical people.
The Background is the place for common data like this until the data gets too large and your Background section ends up spanning several pages. It sounds like that might be the case for you.
You mention the tables having 20+ rows each and having several data tables like this. That would be a lot of Background for readers to wade through before the get to the Scenarios. Is there another way you could describe the data? When I had tables of data like this in the past I put the details into a fixtures class in the automation code and then described just the important aspects in the Feature file.
Assuming for the sake of an example that "Tom" is a potential car buyer and you're running some sort of car showroom then his data table might include:
| Field | Value |
| Name | Tom |
| Age | 16 |
| Address | .... |
| Phone Number | .... |
| Fav Colour | Red |
| Country | UK |
Your Scenario 2 might be "Under 18s shouldn't be able to buy a car" (in the UK at least). Given that scenario we don't care about Tom's address phone number, only his age. We could write that scenario as:
Scenario: Under 18s shouldnt be able to buy a car
Given there is a customer "Tom" who is under 16
When he tries to buy a car
Then I should politely refuse
Instead of keeping that table of Tom's details in the Feature file we just reference the significant parts. When the Given step runs the automation can lookup "Tom" from our fixtures. The step references his age so that a) it's clear to the reader of the Feature file who Tom is and b) to make sure the fixture data is still valid.
A reader of that scenario will immediately understand what's important about Tom (he's 16), and they don't have to continuously reference between the Scenario and Background. Other Scenarios can also use Tom and if they are interested in other aspects of his information (e.g. Address) then they can specify the relevant information Given there is a customer "Tom" who lives at 10 Downing Street.
Which approach is best depends how much of this data you've got. If it's a small number of fields across a couple of tables then put it in the Background, but once it gets to be 10+ fields or large numbers of tables (presumably we have many potential customers) then I'd suggest moving it outside the Feature file and just describing the relevant information in each Scenario.
Yes, you use a background, i.e. from https://github.com/cucumber/cucumber/wiki/Background
Background:
Given I have a data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
When ...
Scenario: Some scenario 2
Given I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
If ever you aren't sure I find http://www.specflow.org/documentation/Using-Gherkin-Language-in-SpecFlow/ a great resource

Resources