I am a newbie in pyspark, and I'm trying to read RDD rows and merge them into one row.
Assuming that I have the following text file:
A1 B1 C1
A2 B2 C2 D3
A3 X1 YY1
DELIMITER_ROW
Z1 B1 C1 Z4
X2 V2 XC2 D3
DELIMITER_ROW
T1 R1
M2 MB2 NC2
S3 BB1
AQ3 Q1 P1
Now, I want to combine all the rows that appear in each section (between DELIMITER_ROW lines) into one row, and return a list of these merged rows.
I want to create this kind of list:
[[A1 B1 C1 A2 B2 C2 D3 A3 X1 YY1]
[Z1 B1 C1 Z4 X2 V2 XC2 D3]
[T1 R1 M2 MB2 NC2 S3 BB1 AQ3 Q1 P1]]
How can this be done in pyspark using RDDs?
For now I know how to read the file and filter out the delimiter rows:
sc.textFile(pathToFile).filter(lambda line: "DELIMITER_ROW" not in line).collect()
but I don't know how to reduce/merge/combine/group the rows in each section into one row.
Thanks.
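Rather than reading line by line and regrouping afterwards, you can set the record delimiter in the Hadoop configuration before reading the file, so that each section arrives as a single record, and then split each record. In PySpark the Hadoop configuration is reached through the underlying Java context:
spark.sparkContext._jsc.hadoopConfiguration().set("textinputformat.record.delimiter", "DELIMITER_ROW")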
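Once each record holds one whole section, merging its rows is just a whitespace split. Below is a minimal plain-Python sketch of that step, run on the sample text from the question rather than on a real RDD. The names merge_section and merge_sections are hypothetical helpers, not part of Spark; the assumption is that merge_section could be passed to rdd.map on the records, followed by a filter that drops empty results.

```python
DELIMITER = "DELIMITER_ROW"  # the section marker used in the question

def merge_section(record):
    """Flatten one delimiter-separated record into a single list of tokens.

    str.split() with no argument splits on any whitespace, including the
    newlines between the original rows, so the rows are merged for free.
    """
    return record.split()

def merge_sections(text, delimiter=DELIMITER):
    """Split the whole text into sections and merge each one, dropping empties."""
    sections = (merge_section(record) for record in text.split(delimiter))
    return [section for section in sections if section]

text = """A1 B1 C1
A2 B2 C2 D3
A3 X1 YY1
DELIMITER_ROW
Z1 B1 C1 Z4
X2 V2 XC2 D3
DELIMITER_ROW
T1 R1
M2 MB2 NC2
S3 BB1
AQ3 Q1 P1
"""

merged = merge_sections(text)
for row in merged:
    print(row)
```

The filter on empty sections matters when the file starts or ends with a delimiter row, since splitting then yields an empty record.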
Hope this helps!
So, I have a pandas data frame:
df =
a b c
a1 b1 c1
a2 b2 c1
a2 b3 c2
a2 b4 c2
I want to rename a2 into a1 and then group by a and c and add the corresponding values of b
df =
a b c
a1 b1+b2 c1
a1 b3+b4 c2
So, with numbers, something like this input
df =
a value c
a1 10 c1
a2 20 c1
a2 50 c2
a2 60 c2
should become
df =
a value c
a1 30 c1
a1 110 c2
How to do this?
What about
>>> res = df.replace({"a": {"a2": "a1"}}).groupby(["a", "c"], as_index=False).sum()
>>> res
a c value
0 a1 c1 30
1 a1 c2 110
which first replaces "a2" with "a1" in column "a" only, and then groups by "a" and "c" and sums.
To get the original column order back, we can reindex:
>>> res.reindex(df.columns, axis=1)
a value c
0 a1 30 c1
1 a1 110 c2
Try this:
df.groupby([df['a'].replace({'a2':'a1'}),'c']).sum().reset_index()
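For intuition about what the replace-then-groupby chain computes, here is a minimal plain-Python sketch of the same logic on the numeric sample from the question. It is not a substitute for the pandas one-liners; the names rows, rename and totals are made up for this illustration.

```python
from collections import defaultdict

# The numeric sample from the question, one dict per dataframe row.
rows = [
    {"a": "a1", "value": 10, "c": "c1"},
    {"a": "a2", "value": 20, "c": "c1"},
    {"a": "a2", "value": 50, "c": "c2"},
    {"a": "a2", "value": 60, "c": "c2"},
]

# Step 1: the renaming that replace({"a": {"a2": "a1"}}) performs.
rename = {"a2": "a1"}

# Step 2: group by (a, c) and sum the values, like groupby(["a", "c"]).sum().
totals = defaultdict(int)
for row in rows:
    key = (rename.get(row["a"], row["a"]), row["c"])
    totals[key] += row["value"]

# Step 3: back to rows, in the original column order.
result = [
    {"a": a, "value": total, "c": c}
    for (a, c), total in sorted(totals.items())
]
print(result)
```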
Sample dataframe:
             avg
Key1 Key2
a1   b1      v1
     b2      v2
     b3      v3
a2   b4      v4
a3   b5      v5
     b6      v6
a4   b7      v7
How can I convert this to a dict like this?
{a1:v1, a1:v2, a1:v3, a2:v4, a3:v5, a3:v6, a4:v7}
I tried this with no luck
dict(zip(df['ColA'], df['avg']))
Appreciate any help !!
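Since it is a MultiIndex, use get_level_values. Note that a dict cannot hold the same key twice, so the repeated Key1 values (level 0) cannot each keep their own entry; the Key2 values (level 1) are unique, so they work as keys: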
dict(zip(df.index.get_level_values(1), df['avg']))
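The choice of index level matters because of how dicts treat repeated keys. The plain-Python sketch below mimics the sample index with a list of (Key1, Key2) tuples (the names index, avg, by_key1, by_key2 and grouped are made up for this illustration) and shows what each choice of key produces. Collecting values into lists per Key1 is one possible alternative, not something the original answer proposes.

```python
# The sample MultiIndex as (Key1, Key2) pairs, and the matching avg column.
index = [
    ("a1", "b1"), ("a1", "b2"), ("a1", "b3"),
    ("a2", "b4"),
    ("a3", "b5"), ("a3", "b6"),
    ("a4", "b7"),
]
avg = ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]

# Keys from level 0 (Key1): repeated keys overwrite, only the last value survives.
by_key1 = dict(zip((key1 for key1, _ in index), avg))

# Keys from level 1 (Key2): every key is unique, so nothing is lost.
by_key2 = dict(zip((key2 for _, key2 in index), avg))

# Alternative: keep Key1 as the key but collect all of its values in a list.
grouped = {}
for (key1, _), value in zip(index, avg):
    grouped.setdefault(key1, []).append(value)

print(by_key1)
print(by_key2)
print(grouped)
```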
I need some help figuring out a formula to count the number of times a value is listed in a column. I will try to explain the requirement below.
The image below shows a sample of the data set.
The requirement is to list out issues and actions per customer.
As you can see, even where several values are clustered in one cell, we need to find the individual unique values and then map them against the adjacent column or columns.
It just needs an extra sheet/table to execute.
Try:
A2 = a,b,c
A3 = b,c
A4 = c,b,a
A5 = c,a
A6 = b
B2 = ss
B3 = ss
B4 = dd
B5 = dd
B6 = ss
D1 = a
E1 = b
F1 = c
C7 = ss
C8 = dd
D2 =IF(ISNUMBER(FIND(D$1,$A2)),1,"") drag until F6 (FIND returns an error rather than 0 when the text is not found, hence the ISNUMBER wrapper)
D7 =COUNTIFS($B$2:$B$6,$C7,D$2:D$6,1) drag until F8
D7:F8 will be your desired results. Happy trying.
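The helper-table idea above (flag each value per row, then count the flags per category) can be sketched in plain Python as follows. The sample data is the one listed in the answer; the names rows and counts are made up for this illustration. One difference to keep in mind: FIND does a substring match, while this sketch splits on commas and matches whole items.

```python
from collections import Counter

# (clustered cell from column A, category from column B)
rows = [
    ("a,b,c", "ss"),
    ("b,c", "ss"),
    ("c,b,a", "dd"),
    ("c,a", "dd"),
    ("b", "ss"),
]

# For each category, count in how many rows each individual item appears.
counts = {}
for cell, category in rows:
    items = {item.strip() for item in cell.split(",")}  # unique items in this cell
    counts.setdefault(category, Counter()).update(items)

for category, counter in counts.items():
    print(category, dict(sorted(counter.items())))
```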
I have a table like this in Sheet1:
A        | B
1234.jpg | c1
1234.jpg | c2
1234.jpg | c3
3456.jpg | c8
3456.jpg | c9
3456.jpg | c10
haha.jpg | c2
haha.jpg | c5
haha.jpg | c9
I need to match the data against the column headers in Sheet2, and the result should look something like this:
          c1  c2  c3  c4  c5
1234.jpg  Y   Y   Y   N   N
3456.jpg  N   N   N   N   N
haha.jpg  N   Y   N   N   Y
So far I have only come up with this:
=IF(ISERROR(MATCH(A2,Sheet1!$A$1:$B$9,0)),"Y","N")
This returns Y as long as A2 matches something from the array. How do I go about matching against the column headers in Sheet2? I'm open to using functions or VBA.
Use the following formula in cell D3, as per the screenshot.
=IF(SUMPRODUCT(($A$2:$A$10=$C3)*($B$2:$B$10=D$2))=1,"Y","N")
You can also use this array formula.
=IF(ISNUMBER(MATCH($C3&D$2,$A$2:$A$10&$B$2:$B$10,0)),"Y","N")
Press CTRL+SHIFT+ENTER to enter the formula, as it is an array formula.
After entering it as an array formula, drag it right and down as needed.
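Both formulas test whether a given (file, column) pair exists in the source table. A minimal plain-Python sketch of that lookup, using the sample data from the question, could look like this; the names pairs, columns, present and matrix are made up for this illustration.

```python
# (file name, code) pairs as listed in Sheet1.
pairs = [
    ("1234.jpg", "c1"), ("1234.jpg", "c2"), ("1234.jpg", "c3"),
    ("3456.jpg", "c8"), ("3456.jpg", "c9"), ("3456.jpg", "c10"),
    ("haha.jpg", "c2"), ("haha.jpg", "c5"), ("haha.jpg", "c9"),
]
columns = ["c1", "c2", "c3", "c4", "c5"]  # the header row of the result

present = set(pairs)                                   # fast membership test
files = list(dict.fromkeys(f for f, _ in pairs))       # unique files, original order

# One Y/N flag per (file, column): Y only if that exact pair exists.
matrix = {
    f: ["Y" if (f, c) in present else "N" for c in columns]
    for f in files
}

for f, flags in matrix.items():
    print(f, " ".join(flags))
```

Testing the pair as a whole is what keeps 3456.jpg at N everywhere: the file exists in the table, but never together with c1 to c5.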
I need advice/help. I am working on a calculation in Excel where I have data like the sample below.
.     A     B     C     D     E     F      G     H
1|  A275  A277  A273  A777  A777  TOTAL  A222  GRAND TOTAL
2|     5     7     4     3     4      7     7
Now, I want to sum row 2 based on the headers in row 1.
Here is the condition.
If A1 <> B1 then take A1, if B1 <> C1 then take B1, if C1 <> D1 then take C1, and so on.
But tricky part is...
If D1<>E1 then D1 else (if E1<>F1 then E1 else (if F1 = "TOTAL" then F1 else(if F1<>G1 then F1)))
In short H2 should have 30 and not 37.
Added comments:------------------------------------
So, basically, if A1<>B1 then take A1, but if A1=B1 then take B1. The same rule then applies to B1: if B1<>C1 then take B1, but if B1=C1 then take C1, and likewise for C1. The stopping point is "TOTAL". Along with this logic I need to check whether any cell in row 1 is "TOTAL" and, if so, take the value from that column. This "TOTAL" can be in any cell in row 1.
So from the above table my calculation will be 5(A2) + 7(B2) + 4(C2) + 7(F2) + 7(G2) = 30
In this calculation I have not included D2 and E2: D1=E1, so D is skipped in favor of E; E1<>F1, so I would normally take E2, but since F1="TOTAL" I took F2 instead of D2 and E2.
I hope this makes sense. (Sorry, I know it's confusing.)
I have data in more than 100 columns.
Can this be achieved using a macro?
------------------------------------------------------------
Another pain point is that the data and headers are dynamic, so I can't rely on a fixed format. The logic should be able to handle dynamic data and headers.
Any help or suggestion will be greatly appreciated.
I achieved the results you want with this.
Add a helper row. In cell A3 write this formula and drag it to the right:
=IF(OR(A$1=B$1,B$1="TOTAL"),0,1)
Calculate the sum in, say, cell H4 (not H2, because the formula refers to the entire row 2 and would create a circular reference):
=SUMIF($3:$3,1,$2:$2)
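To check the helper-row logic against the sample data, here is a minimal plain-Python sketch of the same two steps. The function name keep_flags and the variables are made up for this illustration; it assumes the cell to the right of the last header is empty, as in a sheet, and it compares headers case-sensitively, unlike Excel's = operator.

```python
def keep_flags(headers):
    """Mirror the helper row: 0 if the next header is equal or is "TOTAL", else 1."""
    flags = []
    for i, header in enumerate(headers):
        # The cell to the right of the last header is treated as empty.
        next_header = headers[i + 1] if i + 1 < len(headers) else ""
        skip = header == next_header or next_header == "TOTAL"
        flags.append(0 if skip else 1)
    return flags

headers = ["A275", "A277", "A273", "A777", "A777", "TOTAL", "A222"]
values = [5, 7, 4, 3, 4, 7, 7]

flags = keep_flags(headers)
# Equivalent of SUMIF over the helper row: add only the values flagged 1.
grand_total = sum(value for flag, value in zip(flags, values) if flag == 1)

print(flags)
print(grand_total)
```

Because the flags depend only on neighboring headers, the same function works for any number of columns and any position of "TOTAL".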