How to create spark datasets from a file without using File reader - apache-spark

I have a data file with four sections: header data, summary data, detail data and footer data. Each section has a fixed number of columns, but the number of columns differs between sections. Sections are separated by two rows that contain only a single "#". Is there a way to avoid creating new files and use the Spark TSV (tab-separated format) reader, or any other module, to read the file directly into 4 datasets? If I read the file directly, I lose the extra columns in the later sections: Spark reads only as many columns as the first row of the file has.
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
#
#
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
#
#
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
#
#
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
Output:
Dataset d1 :
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
Dataset d2 :
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
Dataset d3 :
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
Dataset d4 :
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
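One possible way to approach this, sketched in plain Python rather than taken from the thread: read the file as raw lines, cut it into blocks at the rows that contain only "#", and treat the first line of each block as that block's header. The function name `split_sections` and the shortened sample are hypothetical, and the sketch assumes the real file is tab-separated as described.

```python
def split_sections(lines, delimiter="#", sep="\t"):
    """Split raw lines into (header, rows) pairs.

    A row whose whole content is `delimiter` ends the current section.
    The first line of each section is its header (leading '#' removed).
    """
    blocks, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == delimiter:
            # Separator row: close the block collected so far, if any.
            if current:
                blocks.append(current)
                current = []
            continue
        current.append(line)
    if current:
        blocks.append(current)

    sections = []
    for block in blocks:
        header = block[0].lstrip("#").split(sep)
        rows = [row.split(sep) for row in block[1:]]
        sections.append((header, rows))
    return sections


# Shortened, tab-separated version of the sample file (assumption).
sample = [
    "#deptno\tdname\tlocation",
    "10\tAccounting\tNew York",
    "20\tResearch\tDallas",
    "#",
    "#",
    "#grade\tlosal\thisal",
    "1\t700.00\t1200.00",
]

sections = split_sections(sample)
for header, rows in sections:
    print(header, rows)
```

Each (header, rows) pair could then be turned into its own dataset, for example by passing the rows and the column names to Spark's `createDataFrame`. That last step is an assumption and is not shown or tested here.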

Related

Excel cell lookup in subtotaled range

I'd like to use INDEX/MATCH to look up values in a subtotaled range. Using the sample data below, from another sheet (Sheet 2), I need to look up the total NY Company hours for each employee.
Sheet 2:
| Bob | NY Company | ???? |
This formula returns the first match of "NY Company Total":
=INDEX(Sheet1!A1:C45,MATCH(Sheet2!B2 & " Total",Sheet1!B1:B45,0),3)
Now I need to expand the lookup to include the employee (Bob). Note that column A is blank on the total rows. I've started to work with something like the following, but no luck:
=INDEX(Sheet1!A1:C45,MATCH(1,(Sheet2!B2 & " Total"=Sheet1!B1:B45)*(Sheet2!B1=Sheet1!A1:A45)),3)
Also, as the sample data below looks perfect in the preview and then looks really bad after saving, I've added a pic with the sample data.
Sample data:
| A | B | C |
| Employee | Customer | Hours |
| Bob | ABC Company | 5 |
| Bob | ABC Company | 3 |
|  | ABC Company Total | 8 |
| Bob | NY Company | 7 |
| Bob | NY Company | 7 |
| Bob | NY Company | 5 |
| Bob | NY Company | 3 |
|  | NY Company Total | 22 |
| Bob | Jet Company | 1 |
|  | Jet Company Total | 1 |
| Carrie | ABC Company | 1 |
| Carrie | ABC Company | 4 |
|  | ABC Company Total | 5 |
| Carrie | NY Company | 6 |
| Carrie | NY Company | 2 |
| Carrie | NY Company | 3 |
|  | NY Company Total | 11 |
| Carrie | Jet Company | 7 |
| Carrie | Jet Company | 9 |
|  | Jet Company Total | 16 |
| Carrie | XYZ Company | 4 |
|  | XYZ Company Total | 4 |
| Gale | Cats Service | 2 |
| Gale | Cats Service | 6 |
| Gale | Cats Service | 1 |
|  | Cats Service Total | 9 |
| Gale | NY Company | 6 |
| Gale | NY Company | 8 |
|  | NY Company Total | 14 |
| Gale | XYZ Company | 1 |
|  | XYZ Company Total | 1 |
| John | NY Company | 3 |
| John | NY Company | 5 |
|  | NY Company Total | 8 |
| John | XYZ Company | 8 |
| John | XYZ Company | 5 |
|  | XYZ Company Total | 13 |
| Ken | ABC Company | 10 |
|  | ABC Company Total | 10 |
| Ken | NY Company | 2 |
| Ken | NY Company | 3 |
| Ken | NY Company | 5 |
|  | NY Company Total | 10 |
|  | Grand Total | 132 |
Any suggestions??
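The lookup logic can be illustrated outside Excel. Because column A is blank on the total rows, a subtotal has to be attributed to the most recent employee seen above it. The following is only a sketch in Python of that idea, not an Excel formula and not an answer from the thread; `subtotal_for` and the shortened `rows` list are hypothetical.

```python
def subtotal_for(rows, employee, customer):
    """Return the '<customer> Total' value that belongs to `employee`.

    `rows` are (employee, customer, hours) tuples in sheet order.
    Subtotal rows have an empty employee cell, so the owner of a
    subtotal is the last non-empty employee seen above it.
    """
    target = f"{customer} Total"
    last_employee = None
    for emp, cust, hours in rows:
        if emp:
            last_employee = emp
        elif cust == target and last_employee == employee:
            return hours
    return None


# Shortened copy of the sample data (Bob and Carrie only).
rows = [
    ("Bob", "ABC Company", 5),
    ("Bob", "ABC Company", 3),
    ("", "ABC Company Total", 8),
    ("Bob", "NY Company", 7),
    ("Bob", "NY Company", 7),
    ("Bob", "NY Company", 5),
    ("Bob", "NY Company", 3),
    ("", "NY Company Total", 22),
    ("Carrie", "NY Company", 6),
    ("Carrie", "NY Company", 2),
    ("Carrie", "NY Company", 3),
    ("", "NY Company Total", 11),
]

print(subtotal_for(rows, "Bob", "NY Company"))
print(subtotal_for(rows, "Carrie", "NY Company"))
```

The same number can also be obtained by summing the detail rows for the employee and customer directly, which avoids depending on the subtotal rows at all.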

Pandas: How to average rows with two columns having similar id's? [duplicate]

I have a dataframe like the following:
| State Name | County Name | Value |
| Idaho | Ada | 20 |
| Idaho | Ada | 50 |
| Pennsylvania | Adams | 70 |
| Colorado | Adams | 25 |
| Pennsylvania | Adams | 21 |
| Illinois | Adams | 45 |
| Illinois | Madison | 45 |
| Illinois | Madison | 75 |
Then average the rows with similar State and County name such that the dataframe becomes this:
| State Name | County Name | Mean |
| Idaho | Ada | 12.5 |
| Pennsylvania | Adams | 55.47 |
| Colorado | Adams | 47.2 |
| Illinois | Adams | 19.5 |
| Illinois | Madison | 75.14 |
Any kind of help is appreciated.
Try:
df.groupby(['State Name','County Name']).mean()
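As a minimal sketch of that suggestion, here it is applied to the input values posted in the question. The `as_index=False` argument and the rename to "Mean" are additions for readability, not part of the original answer.

```python
import pandas as pd

# Input values copied from the question.
df = pd.DataFrame({
    "State Name": ["Idaho", "Idaho", "Pennsylvania", "Colorado",
                   "Pennsylvania", "Illinois", "Illinois", "Illinois"],
    "County Name": ["Ada", "Ada", "Adams", "Adams",
                    "Adams", "Adams", "Madison", "Madison"],
    "Value": [20, 50, 70, 25, 21, 45, 45, 75],
})

# Average Value within each (State Name, County Name) pair.
# as_index=False keeps the two keys as ordinary columns.
result = (
    df.groupby(["State Name", "County Name"], as_index=False)["Value"]
      .mean()
      .rename(columns={"Value": "Mean"})
)
print(result)
```

With these inputs, Idaho/Ada averages to 35.0 and Pennsylvania/Adams to 45.5, so the means listed in the question appear to be placeholders rather than values computed from the posted data.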

Use my custom row order with pandas .describe() function

Assuming I have the following test DataFrame df:
Car Sold make profit
Honda 100 Accord 10
Honda 20 Fit 5
Toyota 300 Corolla 20
Hyundai 150 Elantra 20
BMW 20 Z-class 100
Toyota 45 Lexus 7
BMW 50 X-class 30
JEEP 150 cherokee 2
Honda 20 CRV 5
Toyota 30 Yaris 3
I need a summary statistic table for number of cars sold, by type of car.
I can do that this way:
df.groupby('Car')['Sold'].describe()
this gives me something like the following:
Car count mean std min 25th 50th 75th max
BMW 2
Honda 3
Hyundai 1
JEEP 1
Toyota 3
The 'Car' column values are listed in the summary statistic table in alphabetically ascending order. I am looking for a way to sort it in my own pre-specified way. I want the summary statistic table to be listed as "Toyota, Hyundai, JEEP, BMW, Honda"
df.groupby('Car')['Sold'].describe().loc[["Toyota", "Hyundai", "JEEP", "BMW", "Honda"]]
helps me put it in order, but I am not able to do it for multi-level indexing. For instance, if I want the summary statistics table by 'Car', and further by the make, .loc does not give me the desired solution.
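One approach that may help here, offered as a sketch and not as an answer from the thread: convert 'Car' to an ordered categorical before grouping, so that the group order follows the category order instead of the alphabet, including when grouping by several columns. The variable names `order` and `summary` are hypothetical.

```python
import pandas as pd

# Test DataFrame from the question.
df = pd.DataFrame({
    "Car": ["Honda", "Honda", "Toyota", "Hyundai", "BMW",
            "Toyota", "BMW", "JEEP", "Honda", "Toyota"],
    "Sold": [100, 20, 300, 150, 20, 45, 50, 150, 20, 30],
    "make": ["Accord", "Fit", "Corolla", "Elantra", "Z-class",
             "Lexus", "X-class", "cherokee", "CRV", "Yaris"],
    "profit": [10, 5, 20, 20, 100, 7, 30, 2, 5, 3],
})

# Desired row order for the 'Car' level.
order = ["Toyota", "Hyundai", "JEEP", "BMW", "Honda"]

# An ordered categorical makes groupby sort by this order.
df["Car"] = pd.Categorical(df["Car"], categories=order, ordered=True)

# observed=True keeps only the (Car, make) pairs that occur in the data.
summary = df.groupby(["Car", "make"], observed=True)["Sold"].describe()
print(summary)
```

Within each car, the makes are still sorted alphabetically; the second level could be given its own categorical order in the same way.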

Combine 2 different sheets with same data in Excel

I have the same data from two different sources. Both are incomplete, but combined they may be less incomplete.
I have 2 files:
File #1 has: ID, Zipcode, YoB, Gender
File #2 has: Email, ID, Zipcode, YoB, Gender
The IDs in both files are the same, but #1 has some IDs that #2 doesn't, and the other way around.
The Email is connected to the ID. IDs are linked to the zipcode, YoB and gender. In both files some of that info is missing. E.g. File #1 and #2 both have ID 1234, but #1 only has a zipcode and YoB, with no Gender, and #2 has the zipcode and gender but no YoB.
I want to have all the information in one file:
Email, ID, YoB, Zipcode, Gender
I tried to sort both ID columns alphabetically, put them next to each other and search for duplicates, but because #1 has some IDs that #2 doesn't, I'm not able to combine them...
What's the best way to fix this?
By the way, it's about 12000 IDs from #1 and 9500 from #2.
If you want a list of all the unique IDs then you could create a new sheet, copy both lots of IDs into the same column and then use Advanced Filter to copy Unique records only to another column.
Then use that column to do vlookups from the two files in the columns you require.
(I'm presuming this is a one-time job and you don't mind a bit of manual-ness)...
If on your first Sheet ("Sheet1") you have:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
10 Betty Thompson 34 Edam
and on your second Sheet ("Sheet2") you have:
ID F_Name S_Name Age
1 Bob Smith 25
3 Jeff Brown 18
4 Alice Smith 39
5 Mark Jones 65
6 Frank Brown 44
7 Sarah Smith 29
9 Tom Brown 28
10 Betty Thompson 34
Then if you're combining them on a 3rd Sheet you need to do something like:
=IFERROR(VLOOKUP($A2,Sheet1!$A$1:$E$9,COLUMN(),FALSE),VLOOKUP($A2,Sheet2!$A$1:$E$9,COLUMN(),FALSE))
If you're trying to get to:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
6 Frank Brown 44 0
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
9 Tom Brown 28 0
10 Betty Thompson 34 Edam
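If scripting is an option, the same "take the value from one file, fall back to the other" idea can be sketched with pandas. This is an alternative to the VLOOKUP approach above, not part of the original answer, and the IDs, zipcodes and email addresses below are made-up placeholders shaped like the two files in the question.

```python
import pandas as pd

# Hypothetical rows shaped like File #1 (no Email column).
file1 = pd.DataFrame({
    "ID": [1234, 2222],
    "Zipcode": ["1011AB", None],
    "YoB": [1980, 1975],
    "Gender": [None, "F"],
})

# Hypothetical rows shaped like File #2.
file2 = pd.DataFrame({
    "Email": ["a@example.com", "c@example.com"],
    "ID": [1234, 3333],
    "Zipcode": ["1011AB", "2022CD"],
    "YoB": [None, 1990],
    "Gender": ["M", None],
})

# combine_first keeps file2's values and fills its gaps from file1,
# using the union of IDs and columns from both files.
combined = (
    file2.set_index("ID")
         .combine_first(file1.set_index("ID"))
         .reset_index()[["Email", "ID", "YoB", "Zipcode", "Gender"]]
)
print(combined)
```

IDs that appear in only one file are kept, with missing values where neither file has the information.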

If a cell value equals this, another cell equals that

I have a spreadsheet with a column for cities, of which there are only 4 different values. What is the formula for filling a new column with the corresponding state and applying it to the entire list? Example:
Atlanta equals GA,
Phoenix equals AZ,
Chicago equals IL,
Nashville equals TN
Thanks!!
You can use the VLookup function for that:
Make a table with your city name in one column and the state in the next column. Then put the following formula next to the city that you want populated:
=VLOOKUP(A1,A$20:B$23,2,FALSE)
In this example, the city you want to identify is in A1, and this formula goes in B1. You can copy it down to B2, B3, etc because the table is hard-coded as A$20:B$23, rather than A20:B23 (where each successive copy down the column would look for a table one row down as well). This example put the lookup table in the A-B columns, but you could put it anywhere you like.
The FALSE at the end means look for an exact match, not the closest one. So if you get a "Dallas" in your list, the function will return #N/A rather than guessing between the state for Chicago and the state for Nashville (either side of Dallas, alphabetically).
Hope that helps!
EDIT:
You added that you also need zipcode info, and that's easy enough to add.
Your table that defines everything would put the zipcode in the 3rd column, so down at A20:B23 (in my example above) you'd end up with A20:C23, where the table would look like
Atlanta GA 12345
Chicago IL 23456
Nashville TN 34567
Phoenix AZ 45678
The cell next to your city in the table you want to populate would be in B1 as shown above giving the state, and then in C1 you'd have the following formula:
=VLOOKUP(A1,A$20:C$23,3,FALSE)
The changes are that here the table is defined out to column C, and instead of "2" returning the second column (i.e. the state abbreviation shown in B), it returns the zipcode shown in column C, the third column.
Again, hope that helps.
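For readers who think in code, the exact-match behaviour described above can be mirrored with a small Python sketch. It is only an illustration of the lookup logic, not a replacement for the formula; `city_table` and `lookup` are hypothetical names, and the zipcodes are the placeholder values from the table above.

```python
# City -> (state, zipcode), mirroring the A20:C23 lookup table.
city_table = {
    "Atlanta": ("GA", "12345"),
    "Chicago": ("IL", "23456"),
    "Nashville": ("TN", "34567"),
    "Phoenix": ("AZ", "45678"),
}


def lookup(city, column):
    """Exact-match lookup: column 2 is the state, column 3 the zipcode.

    Returns None when the city is not in the table, which plays the
    role of the error Excel shows for a failed exact match.
    """
    row = city_table.get(city)
    if row is None:
        return None
    return row[column - 2]


print(lookup("Chicago", 2))
print(lookup("Chicago", 3))
print(lookup("Dallas", 2))
```

The column numbers 2 and 3 are kept on purpose so that the calls line up with the third argument of the two VLOOKUP formulas.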
Since you mention "only 4 different values" maybe:
=CHOOSE(MATCH(LEFT(A1),{"A","P","C","N"},0),"GA","AZ","IL","TN")
You can use a VLOOKUP Table that contains the city and state abbreviation.
Here is a table that has the Capital, State, State Abbreviation.
Montgomery Alabama AL
Juneau Alaska AK
Phoenix Arizona AZ
Little Rock Arkansas AR
Sacramento California CA
Denver Colorado CO
Hartford Connecticut CT
Dover Delaware DE
Tallahassee Florida FL
Atlanta Georgia GA
Honolulu Hawaii HI
Boise Idaho ID
Springfield Illinois IL
Indianapolis Indiana IN
Des Moines Iowa IA
Topeka Kansas KS
Frankfort Kentucky KY
Baton Rouge Louisiana LA
Augusta Maine ME
Annapolis Maryland MD
Boston Massachusetts MA
Lansing Michigan MI
Saint Paul Minnesota MN
Jackson Mississippi MS
Jefferson City Missouri MO
Helena Montana MT
Lincoln Nebraska NE
Carson City Nevada NV
Concord New Hampshire NH
Trenton New Jersey NJ
Santa Fe New Mexico NM
Albany New York NY
Raleigh North Carolina NC
Bismarck North Dakota ND
Columbus Ohio OH
Oklahoma City Oklahoma OK
Salem Oregon OR
Harrisburg Pennsylvania PA
Providence Rhode Island RI
Columbia South Carolina SC
Pierre South Dakota SD
Nashville Tennessee TN
Austin Texas TX
Salt Lake City Utah UT
Montpelier Vermont VT
Richmond Virginia VA
Olympia Washington WA
Charleston West Virginia WV
Madison Wisconsin WI
Cheyenne Wyoming WY
Then you would use =VLOOKUP(A1,A1:C50,3, FALSE) to look for A1 (Montgomery) in the table and it would output AL for example.
