Spark file cleansing and transformation - apache-spark

I have a tab-separated (TSV) file with 5 columns. I need to scrub and transform the data.
case 1) Replace the special characters (\001 and \x0D) with ""
case 2) Filter rows that have fewer than 5 columns into a Bad_Row RDD
case 3) Iterate over the Bad_Row RDD; if the last character of a row is "\n", remove it and append the next row, repeating until the column count reaches 5
Sample File Format
------------------------------
one two 12345 four five
aaa ppp 12345 ttt
bbb
ccc rrr 12355
yyy
ddd
eee iii 12845 rrr two
Good_Rows RDD
-------------------------------
one two 12345 four five
eee iii 12845 rrr two
BAD_Row RDD
-------------------------------
aaa ppp 12345 ttt
bbb
ccc rrr 12355
yyy
ddd
Remove the "\n" from the second line, append the third line to it, and recount the columns. If the column count is 5, treat the result as a good row and add it to the Good_Rows RDD.
Sample Code Snippet
import re
def FilterData(line):
    # case 2: a line is a bad row when it does not have exactly 5 columns
    return len(line.split("\t")) != 5
textFile1 = sc.textFile("hdfs://localhost:9000/A/test.tsv")
# case 1: str.replace does not take a regex, so use re.sub to strip \001 and \x0D
Clean_RDD = textFile1.map(lambda x: re.sub(r"[\x01\x0D]", "", x))
Badrow_RDD = Clean_RDD.filter(FilterData)  # case 2
Please help to implement case 3
Thanks
Vishal

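Below is Scala code for your third case. It flattens the bad rows into one array of fields and then regroups the fields five at a time. Note that splitAt(5) only cuts the array in two, which is enough for this sample; for longer input, data1.grouped(5) gives one array per rebuilt row.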
val data=sc.textFile("file:/home/rieter/Test_Streaming/ab.txt").map(x=>x.split(" +").map(x=>x.trim)).filter(x=>(x.size<5)).flatMap(x=>x).collect
val data1=data.take(data.length-(data.length%5))
data1.splitAt(5)
Output
scala> data1.splitAt(5)
res59: (Array[String], Array[String]) = (Array(aaa, ppp, 12345, ttt, bbb),Array(ccc, rrr, 12355, yyy, ddd))
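For readers working in PySpark, as in the question, the same regrouping idea can be sketched in plain Python. This is only an illustration: merge_broken_rows is a hypothetical helper, and it assumes the bad rows have already been brought to the driver in file order (for example with collect()), because a per-row RDD transformation cannot see neighbouring lines.

```python
def merge_broken_rows(lines, expected_cols=5, sep="\t"):
    """Append consecutive broken lines until a row has expected_cols fields.

    Returns (good_rows, unrepaired). Assumes `lines` is an ordered list of
    strings; any run that overshoots expected_cols is reported as unrepaired.
    """
    good_rows, unrepaired, pending = [], [], []
    for line in lines:
        # Drop a trailing newline, then add this line's fields to the open row.
        pending.extend(line.rstrip("\n").split(sep))
        if len(pending) == expected_cols:
            good_rows.append(sep.join(pending))
            pending = []
        elif len(pending) > expected_cols:
            unrepaired.append(sep.join(pending))
            pending = []
    if pending:
        unrepaired.append(sep.join(pending))
    return good_rows, unrepaired


bad_rows = ["aaa\tppp\t12345\tttt", "bbb", "ccc\trrr\t12355", "yyy", "ddd"]
good, leftover = merge_broken_rows(bad_rows)
print(good)
print(leftover)
```

The rebuilt rows could then be turned back into an RDD with sc.parallelize(good) and combined with the good rows.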

Related

Group all combinations of IDs by separating multiple string IDs [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
I am struggling with the following problem. My dataframe has two different IDs, the leading_ID and the follower_ID (both strings). The leading_ID is always a single unique ID. The follower_ID, however, can take three forms: it can be None, it can be a single ID, or it can be multiple IDs separated by commas. The following table shows the structure:
leading_ID follower_ID
abcd None
dfgh cvnr,eee,rrrr
jrtz brtz
vvvv tttt,dddd
wwww None
... ...
My goal is to separate all the follower_IDs into one row each, grouped under their leading_ID. All rows whose follower_ID is None should be dropped.
leading_ID follower_ID
dfgh cvnr
dfgh eee
dfgh rrrr
jrtz brtz
vvvv tttt
vvvv dddd
... ...
When I run my code, it seems to never finish or to take far too long.
from itertools import product
df1 = pd.DataFrame([j for i in df['follower_ID'].dropna().apply(lambda x: x.split(',')).values
                    for j in product(*i)], columns=df.leading_ID)
The methods described in How do I unnest (explode) a column in a pandas DataFrame? have not worked.
Use:
from itertools import chain
import pandas as pd
# remove missing values
df = df.dropna()
# for better performance, use a list comprehension with split
# (the sample data is separated by ',' with no space after it)
s = [x.split(',') for x in df['follower_ID']]
# repeat each leading_ID once per follower and flatten the lists
df = pd.DataFrame({
    'user': df['leading_ID'].repeat([len(x) for x in s]),
    'follower_ID': list(chain.from_iterable(s))
})
print (df)
user follower_ID
1 dfgh cvnr
1 dfgh eee
1 dfgh rrrr
2 jrtz brtz
3 vvvv tttt
3 vvvv dddd
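As a point of comparison, here is a minimal sketch of the same reshaping with DataFrame.explode. It assumes pandas 0.25 or later and that the missing followers are real None/NaN values rather than the string "None"; the sample frame is rebuilt from the table in the question.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "leading_ID": ["abcd", "dfgh", "jrtz", "vvvv", "wwww"],
    "follower_ID": [None, "cvnr,eee,rrrr", "brtz", "tttt,dddd", None],
})

result = (
    df.dropna(subset=["follower_ID"])            # drop rows without followers
      .assign(follower_ID=lambda d: d["follower_ID"].str.split(","))
      .explode("follower_ID")                    # one row per follower
      .reset_index(drop=True)
)
print(result)
```

Unlike the answer above, this keeps the leading_ID column name and resets the index, which is a choice made for this sketch only.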

Can you normalize data with embedded sections in excel?

I was provided with some data in excel that I'm trying to transform to make it filterable. The data is in a specific order and looks like this:
Group ID RowType Name
1 Section Name1
1 Row AAA
1 Row BBB
1 Row CCC
1 Section Name2
1 Row DDD
1 Row EEE
2 Section ...
I want to take the Name of each "Section" row and turn it into a constant in its own column for the rows that follow it. The final product would look like:
Group ID Section RowType Name
1 Name1 Row AAA
1 Name1 Row BBB
1 Name1 Row CCC
1 Name2 Row DDD
1 Name2 Row EEE
2 ...
I feel like this has probably been done before, but I haven't been able to put together the right terms to search for. Any help would be appreciated.
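The transformation being asked for is a "carry the last Section name forward" pass. Outside Excel, it could be sketched as below; attach_sections is a hypothetical name, and the rows are assumed to arrive as (Group ID, RowType, Name) tuples in sheet order.

```python
def attach_sections(rows):
    """Turn each Section row's name into a column on the rows that follow it."""
    out, current_section = [], None
    for group_id, row_type, name in rows:
        if row_type == "Section":
            current_section = name      # remember it, but emit nothing
        else:
            out.append((group_id, current_section, row_type, name))
    return out


sheet = [
    (1, "Section", "Name1"),
    (1, "Row", "AAA"),
    (1, "Row", "BBB"),
    (1, "Row", "CCC"),
    (1, "Section", "Name2"),
    (1, "Row", "DDD"),
    (1, "Row", "EEE"),
]
for row in attach_sections(sheet):
    print(row)
```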

Getting a modified range of cells

I need to apply a function to a range of cells and then get the modified version of that range back.
I need to apply LEFT(A1;FIND(" ";A1&" ")-1) to a range (A1:A6) and get the range back so that =SUMIFS(B1:B6;A1:A6;"AAA") works.
Actual problem:
A company sells drugs and they want to know the total sales of a specific drug sold.
A | B
1 AAA tbl 20x50g | 3
2 AAA tbl 90x50g | 4
3 BBB 10% raz 1ml | 1
4 BBB 2% raz 1 ml | 8
5 BBB tbl 12x2,5 | 1
6 CCC vet 48x1kg | 7
A second list contains only the "generic" names:
C | D
1 AAA | ?
2 BBB | ?
3 CCC | ?
I need to get the values in D1.
So far I have come up with:
=SUMIFS(B1:B6;A1:A6;LEFT(A1;FIND(" ";A1&" ")-1))
This should compare everything in A1:A6 with the first word of A1 (AAA). Where it gets a match (A1, A2) it should take the values from B1:B6, so the answer would be 7.
But I need to apply the "get first word" function
LEFT(A1;FIND(" ";A1&" ")-1)
to the range A1:A6
EDIT:
My bad. It appears I was trying to reinvent fire by spitting water on wood.
The problem was elsewhere.
The values in B were being produced by
=IF(AB34=0;" ";AB34*$Y34)
and the TRUE clause (" ") messed up the math.
You can use SUMPRODUCT with LEFT(FIND()). With the generic name in C1, put this in D1:
=SUMPRODUCT((LEFT($A$1:$A$6;FIND(" ";$A$1:$A$6 & " ")-1)=C1)*$B$1:$B$6)
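The logic behind that formula (take the first word of each name, then sum the quantities per word) can be checked with a short sketch. The function name totals_by_first_word is made up for this illustration, and the data is copied from the question.

```python
from collections import defaultdict


def totals_by_first_word(rows):
    """Sum quantities keyed by the first space-separated word of each name."""
    totals = defaultdict(int)
    for name, qty in rows:
        first_word = name.split(" ", 1)[0]   # same role as LEFT(A1;FIND(" ";A1&" ")-1)
        totals[first_word] += qty
    return dict(totals)


sales = [
    ("AAA tbl 20x50g", 3),
    ("AAA tbl 90x50g", 4),
    ("BBB 10% raz 1ml", 1),
    ("BBB 2% raz 1 ml", 8),
    ("BBB tbl 12x2,5", 1),
    ("CCC vet 48x1kg", 7),
]
print(totals_by_first_word(sales))
```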

Sum the value in one column based on the value in another column - excel 2013

I have 3 columns: B, C and D.
They will be updated periodically, and the number of rows will grow day by day.
Column B contains the Name, C the Quantity and D a Date value.
Name Qt Date
SSS 20
SSS 30
NNN 50
PPP 40 13-Jul-15
PPP 20 13-Jul-15
AAA 20
CCC 100
GGG 300
FFF 200
BBB 50 28-Aug-15
AAA 20
GGG 100
BBB 30
JJJ 50
BBB 30
FFF 50
FFF 25 24-Aug-15
CCC 75
JJJ 30
FFF 50
JJJ 36 24-Aug-15
FFF 50 24-Aug-15
I need the result as below.
Name Qt
AAA 40
BBB 30
CCC 175
FFF 300
GGG 400
JJJ 80
NNN 50
SSS 50
Rows that have a value in the date field (column D) should be left out of the sum.
The Name field can contain any name, and future names may differ from the current ones.
The result can be on the same sheet or on a different sheet.
This may not meet your exact needs but it should get you the data you want in the first instance.
In Excel select the Insert tab and select Pivot Table.
Set the table range to B1:D60000 (or whatever the maximum number of rows is), since the data is in columns B to D.
Select New Worksheet and click OK.
Add the Name and Qt fields to the report. It should automatically sum Qt.
Add Date to the Filters. This should add a bar above the pivot table that says something like:
Date (All)
Click on (All) and select (blank), and that should exclude the rows with dates specified.
Alternatively, use helper cells next to the date column: =ISBLANK(D2) in E2 returns TRUE when D2 is empty.
Put AAA in G6, and in H6 put the following formula: =SUMIFS(C:C,E:E,"TRUE",B:B,G6)
If the seemingly blank cells contain junk such as stray spaces, use =TRIM(D2)="" as the helper instead, because ISBLANK returns FALSE for any cell that holds text, even if it is only spaces.
I love pivot tables, but they can make the file too big when a workbook uses a lot of them, and I also get tired of refreshing them.
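Both answers amount to "sum Qt per Name, skipping rows that have a date". A small sketch of that rule follows; sum_undated is a hypothetical helper, an empty date is assumed to be None or a blank string, and only a subset of the question's rows is used.

```python
from collections import defaultdict


def sum_undated(rows):
    """Sum quantity per name, ignoring rows whose date cell is filled in."""
    totals = defaultdict(int)
    for name, qty, date in rows:
        # Treat None and whitespace-only strings as "no date", like TRIM(D2)="".
        if date is None or not str(date).strip():
            totals[name] += qty
    return dict(sorted(totals.items()))


rows = [
    ("SSS", 20, None), ("SSS", 30, None), ("NNN", 50, None),
    ("PPP", 40, "13-Jul-15"), ("PPP", 20, "13-Jul-15"),
    ("FFF", 200, None), ("FFF", 25, "24-Aug-15"), ("FFF", 50, " "),
]
print(sum_undated(rows))
```

Names whose rows all carry a date (PPP here) simply do not appear in the result, which matches the expected table in the question.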

Excel - identify duplicates consisting of two columns, in changing order

I am trying to identify the second duplicate combinations of cells across two (or more) columns. Complicating matters, the order of combinations varies.
To illustrate - my data looks something like this:
aaa 111
222 aaa
111 aaa
111 bbb
bbb 111
I'm looking for a formula that for this example would tell me that rows 1, 2 and 4 are the first instances of the combined values, whereas rows 3 and 5 are the second instances.
If you don't mind adding another column to the sheet, you can create an "index" for each row, making sure you always list the columns in the same order. (You can always copy values and delete the intermediary column.)
For example the following formula:
=IF($A1<$B1,CONCATENATE($A1,"!",$B1),CONCATENATE($B1,"!",$A1))
Will create a unique identifier like the following, using a ! character for a delimiter:
aaa 111 111!aaa
222 aaa 222!aaa
111 aaa 111!aaa
111 bbb 111!bbb
bbb 111 111!bbb
Then it's just a matter of using COUNTIF, for example with the index in column C:
=COUNTIF($C$1:$C1,$C1)
This will produce the following sheet, the number being the ordinal 1-based count of how many times that pair has appeared.
aaa 111 111!aaa 1
222 aaa 222!aaa 1
111 aaa 111!aaa 2
111 bbb 111!bbb 1
bbb 111 111!bbb 2
Someone with more time on their hands can probably come up with an array formula to do the same thing in memory.
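The two helper columns (an order-independent key, then a running count) can also be sketched in a few lines of Python. occurrence_numbers is a name invented for this example, and every cell is compared as text, which happens to give the same ordering as the sample above.

```python
from collections import Counter


def occurrence_numbers(pairs):
    """Return (key, nth occurrence) for each pair, ignoring column order."""
    seen = Counter()
    result = []
    for a, b in pairs:
        # Sort the two values so (aaa, 111) and (111, aaa) share one key.
        key = "!".join(sorted((str(a), str(b))))
        seen[key] += 1                       # running count, like COUNTIF($C$1:$C1,$C1)
        result.append((key, seen[key]))
    return result


data = [("aaa", "111"), ("222", "aaa"), ("111", "aaa"), ("111", "bbb"), ("bbb", "111")]
for key, n in occurrence_numbers(data):
    print(key, n)
```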
