Text classification on feature vector X with multiple vs. merged columns

I am working on a text classification problem where I have around 95 data points, and the data looks like this (only two dummy entries shown):
| ID   | Location | Emails      |
|------|----------|-------------|
| AZ12 | UK       | Lorem Ipsum |
| MR34 | USA      | Lorem Ipsum |
In my current approach, I have merged the columns in the .csv, space delimited (shown below), and I am using that single column for text classification.
| Merged_columns       |
|----------------------|
| AZ12 UK Lorem Ipsum  |
| MR34 USA Lorem Ipsum |
This approach seems to work for me: I am getting around 70% accuracy on the test data.
Now I am thinking of performing text classification on multiple columns of my feature vector X instead of merging all of its columns into one, i.e. by performing feature engineering on the individual columns of X and then concatenating the transformed vectors. This approach is also described in this article: https://towardsdatascience.com/natural-language-processing-on-multiple-columns-in-python-554043e05308
Now my question is: when it comes to NLP, are both approaches theoretically equivalent? Should the latter approach yield better or worse results than the former?
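For concreteness, a minimal sketch of the per-column approach with scikit-learn (column names taken from the dummy table above; everything else is illustrative, not the asker's actual pipeline):

```python
# Sketch: one TF-IDF vectorizer per column, outputs concatenated column-wise.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({
    "ID":       ["AZ12", "MR34"],
    "Location": ["UK", "USA"],
    "Emails":   ["Lorem Ipsum", "Lorem Ipsum"],
})

# Each column gets its own vocabulary; the per-column feature matrices are
# concatenated horizontally ("transform individually, then concatenate").
per_column = ColumnTransformer([
    ("id",       TfidfVectorizer(), "ID"),
    ("location", TfidfVectorizer(), "Location"),
    ("emails",   TfidfVectorizer(), "Emails"),
])

features = per_column.fit_transform(X)
print(features.shape)  # (n_samples, sum of per-column vocabulary sizes)
```

Note the practical difference from merging: here a token like "UK" in Location and "UK" appearing inside an email body would get separate features, whereas in the merged-column approach they share one feature.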
Thanks.

Related

Reducing storage space as much as possible in spark when dealing with columns with many repetitive values

I have a data frame of POI data with their corresponding city/state. I am looking for a way to minimize the storage space for this data as much as possible in PySpark. One feature that seemed worth exploiting is that the city and state columns contain lots of duplicate data.
+------+---------+---------+
| poi | city |state |
+------+---------+---------+
| abcd | New York| New York|
| cdef | New York| New York|
| xcvd | Chicago | Illinois|
| hjkq | New York| New York|
| acdr | Austin | Texas |
+------+---------+---------+
I thought if I read the data, partition it by city and state, and then save it to the disk, it might save more space.
df = sqlContext.read.csv(inFile, sep="\t", quote=None, header=False)
df.repartition("city", "state").write.option("header", "true").partitionBy("city", "state").csv(outFile, compression="gzip")
That did not save any space compared to the original gzipped file. I won't be querying this table much, so the main objective is just to save disk space. Is there anything else I could do?
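As an illustration of why repeated values are cheap to store, dictionary encoding (which columnar formats like Parquet apply automatically) can be demonstrated with pandas categorical dtypes. The frame below is a toy stand-in for the poi/city/state data, not the asker's file:

```python
import pandas as pd

# Toy frame mimicking the poi/city/state layout: few distinct cities/states.
df = pd.DataFrame({
    "poi":   [f"poi{i}" for i in range(10_000)],
    "city":  ["New York"] * 9_000 + ["Chicago"] * 1_000,
    "state": ["New York"] * 9_000 + ["Illinois"] * 1_000,
})

plain = df.memory_usage(deep=True).sum()

# Dictionary-encode the low-cardinality columns: each distinct string is
# stored once, and rows hold small integer codes pointing at it.
df["city"] = df["city"].astype("category")
df["state"] = df["state"].astype("category")
encoded = df.memory_usage(deep=True).sum()

print(f"object dtypes: {plain:,} bytes, categorical: {encoded:,} bytes")
```

This is the in-memory analogue; on disk, writing the frame to Parquet instead of gzipped CSV gets the same effect without any manual repartitioning.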

EXCEL: SUMIFS criterion applied to an INDEX MATCH search equals a value

I've spent pretty much all day trying to figure this out. I've read so many threads on here and on various other sites. This is what I'm trying to do:
I've got the total sales output. It's large, and the number of items on it varies depending on the time frame being looked at. There is a major gap in the system: I cannot get the figures by region, because that information is not stored. The records only hold the customer's name, the product information, number of units, price, and purchase date. I want to get the total number of each item sold by region so that I can compare item popularity across regions.
There are only about 50 customers, so it is feasible for me to create a separate sheet assigning a region to the customers.
So, I have three sheets:
Sheet 1: Sales
+-----------------------------------------------------+
|Customer Name | Product | Amount | Price | Date |
-------------------------------------------------------
| Joe's Fish | RT-01 | 7 | 5.45 | 2020/5/20 |
-------------------------------------------------------
| Joe's Fish | CB-23 | 17 | 0.55 | 2020/5/20 |
-------------------------------------------------------
| Mack's Bugs | RT-01 | 4 | 4.45 | 2020/4/20 |
-------------------------------------------------------
| Joe's Fish | VX-28 | 1 | 1.20 | 2020/5/13 |
-------------------------------------------------------
| Karen's \/ | RT-01 | 9 | 3.45 | 2020/3/20 |
+-----------------------------------------------------+
Sheet 2: Regions
+----------------------+
| Customer | Region |
------------------------
| Joe's Fish | NA |
------------------------
| Mack's Bugs | NA |
------------------------
| Karen's \/ | EU |
+----------------------+
And my results are going in Sheet 3:
+----------------------+
| | NA | EU |
------------------------
| RT-01 | 11 | 9 |
+----------------------+
So, looking at the data I made up for this question, I want to compare the number of RT-01s sold in North America to those sold in Europe. I can do it if I add an INDEX/MATCH column to the end of the sales sheet, but I would have to redo that every time I update the sales information.
Is there some way to do a SUMIFS like:
SUMIFS(Sheet1!$D:$D,Sheet1!$A:$A,INDEX(Sheet2!$B:$B,MATCH(Sheet1!#Current A#,Sheet2!$A:$A))=Sheet3!$B2,Sheet1!$B:$B,Sheet3!$A3)
?
I think it's difficult to do it with a SUMIFS because the columns you're matching have to be ranges, but you can certainly do it with a SUMPRODUCT and COUNTIFS:
=SUMPRODUCT(Sheet1!$C$2:$C$10*(Sheet1!$B$2:$B$10=$A2)*COUNTIFS(Sheet2!$A$2:$A$5,Sheet1!$A$2:$A$10,Sheet2!$B$2:$B$5,B$1))
I don't recommend using full-column references because it could be slow.
BTW I was assuming that there were no duplicates in Sheet2 for a particular combination of customer and region - if there were, you could use
=SUMPRODUCT(Sheet1!$C$2:$C$10*(Sheet1!$B$2:$B$10=$A2)*
(COUNTIFS(Sheet2!$A$2:$A$5,Sheet1!$A$2:$A$10,Sheet2!$B$2:$B$5,B$1)>0))
EDIT
It is worth using a dynamic version of the formula, though it is not elegant:
=SUM(Sheet1!$C2:INDEX(Sheet1!$C:$C,MATCH(2,1/(Sheet1!$C:$C<>"")))*(Sheet1!$B2:INDEX(Sheet1!$B:$B,MATCH(2,1/(Sheet1!$B:$B<>"")))=$A2)*
(COUNTIFS(Sheet2!$A$2:INDEX(Sheet2!$A:$A,MATCH(2,1/(Sheet2!$A:$A<>""))),Sheet1!$A2:INDEX(Sheet1!$A:$A,MATCH(2,1/(Sheet1!$A:$A<>""))),Sheet2!$B$2:INDEX(Sheet2!$B:$B,MATCH(2,1/(Sheet2!$B:$B<>""))),B$1)>0))
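As a sanity check on the roll-up logic (not a replacement for the Excel formula), the same customer-to-region lookup and conditional sum can be reproduced in pandas using the made-up data from the question:

```python
import pandas as pd

# Sheet 1: Sales (subset of columns relevant to the roll-up).
sales = pd.DataFrame({
    "Customer": ["Joe's Fish", "Joe's Fish", "Mack's Bugs",
                 "Joe's Fish", "Karen's \\/"],
    "Product":  ["RT-01", "CB-23", "RT-01", "VX-28", "RT-01"],
    "Amount":   [7, 17, 4, 1, 9],
})

# Sheet 2: Regions.
regions = pd.DataFrame({
    "Customer": ["Joe's Fish", "Mack's Bugs", "Karen's \\/"],
    "Region":   ["NA", "NA", "EU"],
})

# Equivalent of the INDEX/MATCH lookup plus the conditional sum:
# attach the region to every sale, then pivot units by product and region.
result = (sales.merge(regions, on="Customer")
               .pivot_table(index="Product", columns="Region",
                            values="Amount", aggfunc="sum", fill_value=0))
print(result.loc["RT-01"])  # RT-01 sold: 11 in NA, 9 in EU
```

The RT-01 row matches Sheet 3 in the question (11 in NA, 9 in EU), which is a useful way to validate the spreadsheet formula against a small sample.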
As you would need to make the match in memory, I don't think it's feasible in Excel alone; you'll have to use a VBA dictionary.
On the other hand, if the number of columns in your sales sheet is fixed, you can just format it as a table and add your INDEX/MATCH in column F.
When updating the sales data, delete all rows from row 3 onward and paste in the updated values. Excel will automatically apply the INDEX/MATCH to all rows.

Replacing values that have duplicates in a pandas dataframe column

Suppose I have the following table that lists the maker of a unit.
import pandas as pd
df = pd.DataFrame({'Maker': ['Company1ID', 'SusanID', 'CeramiCorpID', 'PeterID', 'SaraID', 'CeramiCorpID', 'Company1ID']})
print(df)
Now consider that I have a much larger table with many person and corp IDs, and I want to reclassify these into two categories, Person and Corp, as shown in the Expected column. The IDs are much more complex than shown here (e.g. f00568ab456b) and are unique for each person or company, but only companies show up in more than one row.
| Maker | Expected |
|--------------|----------|
| Company1ID | Corp |
| SusanID | Person |
| CeramiCorpID | Corp |
| PeterID | Person |
| SaraID | Person |
| CeramiCorpID | Corp |
| Company1ID | Corp |
I am basically stuck trying to work out whether I need .apply(lambda x) or .replace with some kind of condition on .duplicated(keep=False). I'm unsure how to go about it either way.
Help appreciated!
I'm not really sure this is what you want, but you could create the 'Expected' column like this:
df['Expected'] = ['Corp' if 'Corp' in maker else 'Person' for maker in df['Maker']]
EDIT:
If you want them to be classified by the number of occurrences:
df['Expected'] = ['Corp' if len(df[df['Maker'] == maker]) > 1 else 'Person' for maker in df['Maker']]
That assumes there is no Corp which occurs only once. But if that can be the case, how would you know whether it's a Person or a Corp?
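For completeness, here is the .duplicated(keep=False) route the question hinted at, which avoids the per-row filtering of the list comprehension (same caveat: a Corp appearing only once gets classified as Person):

```python
import pandas as pd

df = pd.DataFrame({'Maker': ['Company1ID', 'SusanID', 'CeramiCorpID',
                             'PeterID', 'SaraID', 'CeramiCorpID', 'Company1ID']})

# duplicated(keep=False) marks every row whose Maker occurs more than once.
is_repeated = df['Maker'].duplicated(keep=False)
df['Expected'] = is_repeated.map({True: 'Corp', False: 'Person'})
print(df)
```

This is vectorized, so it scales better to the "much larger table" mentioned in the question than scanning the frame once per row.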

Creating Excel Pivot Table to Include Only Rows with Non-Empty Values in Multiple Columns

I'm looking for something very similar to this: Converting multiple variables into values with excel pivot tables or power pivot
First, I want to say that I've been researching this and trying things for over a week, so my brain may be a little burned out. I could be missing something simple, or what I think should be possible may simply not be doable in Excel.
I think the best way to explain what I need is with a completely different type of situation (i.e. this is not the type of data I'm using, but the situation is the same):
| Movie Name            | Release Date | Fantasy | Sci Fi | Comedy | Horror |
|-----------------------|--------------|---------|--------|--------|--------|
| Galaxy Quest          | 1999         |         | X      | X      |        |
| Jumanji               | 1995         | X       |        | X      |        |
| The Company of Wolves | 1984         | X       |        |        | X      |
| Aliens                | 1986         |         | X      |        | X      |
| Spaceballs            | 1987         |         | X      | X      |        |
| Willow                | 1988         | X       |        | X      |        |
I have a pivot table showing, at present, all of the rows of data.
What I need is a slicer that will show 1) all comedy, no matter the other genres; or 2) all comedy and fantasy, which would show Galaxy Quest (comedy), Jumanji (comedy and fantasy), The Company of Wolves (fantasy), Spaceballs (comedy), and Willow (comedy and fantasy); etc.
I hope the question makes sense. Please let me know if you need further clarification.

SpecFlow - Is it possible to reuse test data within feature file?

Is there any way to reuse data in SpecFlow feature files?
E.g. I have two scenarios, which both uses the same data table:
Scenario: Some scenario 1
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
When ...
Scenario: Some scenario 2
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
In these simple examples the tables are small and they're not a big problem; in my case, however, the tables have 20+ rows and will be used in at least 5 tests each.
I'd imagine something like this:
Having data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
Given I have a data table "Employee"
When ...
Scenario: Some scenario 2
Given I have a data table "Employee"
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
I couldn't find anything like this in SpecFlow documentation. The only suggestion for sharing data was to put it into *.cs files. However, I can't do that because the Feature Files will be used by non-technical people.
The Background is the place for common data like this until the data gets too large and your Background section ends up spanning several pages. It sounds like that might be the case for you.
You mention the tables having 20+ rows each, and having several data tables like this would be a lot of Background for readers to wade through before they get to the Scenarios. Is there another way you could describe the data? When I had tables of data like this in the past, I put the details into a fixtures class in the automation code and described just the important aspects in the Feature file.
Assuming for the sake of an example that "Tom" is a potential car buyer and you're running some sort of car showroom then his data table might include:
| Field | Value |
| Name | Tom |
| Age | 16 |
| Address | .... |
| Phone Number | .... |
| Fav Colour | Red |
| Country | UK |
Your Scenario 2 might be "Under 18s shouldn't be able to buy a car" (in the UK, at least). Given that scenario we don't care about Tom's address or phone number, only his age. We could write that scenario as:
Scenario: Under 18s shouldn't be able to buy a car
Given there is a customer "Tom" who is under 18
When he tries to buy a car
Then I should politely refuse
Instead of keeping that table of Tom's details in the Feature file, we just reference the significant parts. When the Given step runs, the automation can look up "Tom" from our fixtures. The step references his age so that a) it's clear to the reader of the Feature file who Tom is, and b) the fixture data is checked to still be valid.
A reader of that scenario will immediately understand what's important about Tom (he's 16), and they don't have to keep cross-referencing between the Scenario and the Background. Other Scenarios can also use Tom, and if they are interested in other aspects of his information (e.g. his address) they can specify the relevant detail: Given there is a customer "Tom" who lives at 10 Downing Street.
Which approach is best depends how much of this data you've got. If it's a small number of fields across a couple of tables then put it in the Background, but once it gets to be 10+ fields or large numbers of tables (presumably we have many potential customers) then I'd suggest moving it outside the Feature file and just describing the relevant information in each Scenario.
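The fixtures-class idea is framework-agnostic. SpecFlow step definitions are written in C#, but a rough sketch of the lookup-and-validate pattern (all names invented for illustration) looks like:

```python
# Hypothetical fixtures module: the full customer data lives in code,
# while scenarios reference customers by name only.
FIXTURES = {
    "Tom": {"Name": "Tom", "Age": 16, "Country": "UK", "Fav Colour": "Red"},
}

def given_customer_under(name, age_limit):
    """Step binding for: Given there is a customer <name> who is under <age_limit>."""
    customer = FIXTURES[name]
    # Validate that the fixture still matches what the scenario claims,
    # so a stale fixture fails loudly instead of silently passing.
    assert customer["Age"] < age_limit, (
        f"fixture {name!r} is {customer['Age']}, not under {age_limit}")
    return customer

tom = given_customer_under("Tom", 18)
print(tom["Age"])  # 16
```

The assertion inside the step is what keeps the Feature file honest: if someone edits Tom's age in the fixtures, any scenario that relies on him being under 18 fails immediately.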
Yes, you use a background, i.e. from https://github.com/cucumber/cucumber/wiki/Background
Background:
Given I have a data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
When ...
Scenario: Some scenario 2
Given I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
If you're ever unsure, I find http://www.specflow.org/documentation/Using-Gherkin-Language-in-SpecFlow/ a great resource.
