Joining two big files in a one-to-many relationship in Java Spark

I have two big files:
an email file
an attachment file
For simplicity, say the email file contains:
eId emailcontent
e1 xxxxxxxx
e2 yyyyyyyy
e3 zzzzzzzz
and the attachment file contains:
aid attachmentcontent eid
a1 att1 e1
a2 att2 e1
a3 att3 e2
a4 att4 e3
a5 att5 e3
a6 att6 e3
NOTE: A broadcast-variable join has already been performed between the email file and another small file. Both files are big enough that a broadcast variable can't be used again.
I want to join these two files using JavaPairRDD with eId as the join column, but I can't build the pair RDD keyed on eId directly, because multiple attachments are linked to the same eId key.
I tried converting the JavaRDD<Email> and JavaRDD<Attachment> to Datasets and performing the join there, but Email is a complex class (it contains lists of other classes as fields), so the converted Dataset does not return any records.
Neither approach solves my problem, so I am looking for any solution not considered here, or for anything I am missing in the scenarios above.

The above problem was solved using JavaPairRDD.
For the email file, a JavaPairRDD<String, Email> keyed by eId was created, since eId is unique for each email; for the attachment file, a JavaPairRDD<String, Iterable<Attachment>> was created, since one eId is linked to multiple attachments.
The pair RDD for email (assuming a getEId() accessor on each class):
JavaPairRDD<String, Email> rddEmail = emailRdd.mapToPair(email -> new Tuple2<>(email.getEId(), email));
and the pair RDD for attachment, where groupByKey collects the many attachments per eId into one Iterable:
JavaPairRDD<String, Iterable<Attachment>> rddAttachment = attachmentRdd.mapToPair(att -> new Tuple2<>(att.getEId(), att)).groupByKey();
Finally, rddEmail.join(rddAttachment) was performed, giving one (eId, (Email, Iterable<Attachment>)) record per email, followed by the other logic as per the requirement.
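For reference, here is a minimal, self-contained sketch of this pattern using the sample data from the question. The Email and Attachment classes and their getEId() accessors are illustrative stand-ins for the real, more complex classes:

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class EmailAttachmentJoin {
    // Minimal stand-ins for the real classes.
    static class Email implements Serializable {
        final String eId; final String content;
        Email(String eId, String content) { this.eId = eId; this.content = content; }
        String getEId() { return eId; }
    }
    static class Attachment implements Serializable {
        final String aId; final String content; final String eId;
        Attachment(String aId, String content, String eId) { this.aId = aId; this.content = content; this.eId = eId; }
        String getEId() { return eId; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("join").setMaster("local[*]"));

        // In the real job these two RDDs come from the two big files.
        JavaRDD<Email> emailRdd = sc.parallelize(Arrays.asList(
            new Email("e1", "xxxxxxxx"), new Email("e2", "yyyyyyyy"), new Email("e3", "zzzzzzzz")));
        JavaRDD<Attachment> attachmentRdd = sc.parallelize(Arrays.asList(
            new Attachment("a1", "att1", "e1"), new Attachment("a2", "att2", "e1"),
            new Attachment("a3", "att3", "e2"), new Attachment("a4", "att4", "e3"),
            new Attachment("a5", "att5", "e3"), new Attachment("a6", "att6", "e3")));

        // eId is unique per email, so it can key the email RDD directly.
        JavaPairRDD<String, Email> rddEmail =
            emailRdd.mapToPair(e -> new Tuple2<>(e.getEId(), e));

        // Group the many attachments per eId into one Iterable before joining.
        JavaPairRDD<String, Iterable<Attachment>> rddAttachment =
            attachmentRdd.mapToPair(a -> new Tuple2<>(a.getEId(), a)).groupByKey();

        // One record per email: (eId, (Email, all attachments with that eId)).
        JavaPairRDD<String, Tuple2<Email, Iterable<Attachment>>> joined =
            rddEmail.join(rddAttachment);

        // Inspect the raw result tuples; real code would map these onward as required.
        joined.collect().forEach(t -> System.out.println(t));
        sc.stop();
    }
}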

Related

Generate inventory item codes in Excel

Can you support me with the following? I need to create codes for inventory items.
I built the structure (4 letters and 4 digits):
(Warehouse Type, Warehouse Group, Family, Sub-Family), followed by a 4-digit sequence number.
If an item is in Warehouse Type (A), Warehouse Group (B), Family (C), Sub-Family (D),
I need to generate the code ABCD0001; the following item, if it is from the same warehouse type, group, family, and sub-family, gets the code ABCD0002.
BUT if the item differs in any part of the structure, the sequence restarts from 0001.
For example, if the item is:
Warehouse Type (B), Warehouse Group (B), Family (C), Sub-Family (D)
the code should be BBCD0001.
What should I do to achieve this? I have almost 1,200 items and I need to add a code to all of them.
Create 4 lookup tables (as a sample I have made two, for Warehouse Type & Group, seen on the left). Then apply the mapping you wish to have for each Type, Group, etc.
In your raw data (on the right), create two helper columns: one to build your warehouse string and another to count the instances of that string, i.e. your sequence number.
The formulas are below. Add them to the first row of the raw data and drag down to create the full list of unique keys:
I2 = VLOOKUP(G2,$A$1:$B$6,2,0) & VLOOKUP(H2,$D$1:$E$6,2,0)
J2 = I2 & TEXT(COUNTIF(I$2:I2,I2),"0000")
In cell I2 you will actually need to combine 4 VLOOKUPs, one each for Type, Group, Family, & Sub-Family; just extend the method shown here, as sketched below.
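For illustration only, the combined key in I2 follows this pattern, with the Family and Sub-Family cells and lookup tables left as placeholders since only the first two tables are shown:
I2 = VLOOKUP(<Type cell>,<Type table>,2,0) & VLOOKUP(<Group cell>,<Group table>,2,0) & VLOOKUP(<Family cell>,<Family table>,2,0) & VLOOKUP(<Sub-Family cell>,<Sub-Family table>,2,0)
The J2 formula then works unchanged: COUNTIF(I$2:I2,I2) counts how many times that key has appeared so far, and TEXT(...,"0000") pads the count to a 4-digit sequence number.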

Not able to override the schema of an ORC file read from an ADLS location

I have to change the schema of an ORC file. The ORC file is kept in an ADLS location.
The original schema in the ORC file is:
Old schema column headers: (C1, C2, C3, C4)
I want to override the original schema with a new schema (created from StructType and StructField).
New schema column headers: (Name, Age, Sex, Time)
The Spark command I am using is:
val df2 = spark.read.format("orc").schema(schema).load("path/")
As soon as I run df2.show(2, false),
the data in all the columns becomes null.
When I do not override the old schema that is already present and run
val df2 = spark.read.format("orc").load("path/")
I get the data, but the column headers are C1, C2, C3, and C4.
Could you please tell me how to read the data with the new schema, and why it is not working?
Thank you in advance.
why it is not working?
Yes, this is expected behavior. Your source df has columns C1, C2, etc. The .schema(...) option at read time helps you select certain columns or cast them, but it only works if the given columns exist in the source. This option is mostly useful for text-based formats like CSV, JSON, and text.
Since you are giving the columns as (Name, Age, Sex, Time) and your source does not contain these columns, the data is null.
Could you please tell me how to read data in the new schema
Read the file normally:
val df = spark.read.format("orc").load("path/")
then explicitly rename the columns:
val df2 = df.withColumnRenamed("C1", "Name").withColumnRenamed("C2", "Age") ...
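As a sketch of the full rename, written here in Java to match the main question's language, with all four renames spelled out per the column names described above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("orc-rename").getOrCreate();

// Read with the schema the ORC file actually carries (C1..C4)...
Dataset<Row> df = spark.read().format("orc").load("path/");

// ...then rename the columns to the desired headers.
Dataset<Row> df2 = df
    .withColumnRenamed("C1", "Name")
    .withColumnRenamed("C2", "Age")
    .withColumnRenamed("C3", "Sex")
    .withColumnRenamed("C4", "Time");

df2.show(2, false);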

How to rename badly typed student names in a dataframe column, based on a reference list

We have students answer MCQs after each lesson on Socrative.
They enter their name first, then answer. For each lesson we collect data from the Socrative platform, but we have issues "normalizing the names", so that 'John Doe', 'johndoe' or 'John,Doe' can all be transformed into 'doe', as it is written in our main file.
Our main file for following up on students (treated as a dataframe with Python) initially has just 1 column, the name (as a string, 'doe' for Mr. John Doe).
I'd like to write a function that goes through the 'name' column of my lesson1 dataframe and, for each value of the name column, replaces the badly typed name with the reference name.
To lower the case, strip surrounding spaces and remove punctuation, I've used the following code:
lesson1["name"] = lesson1["name"].str.lower()
lesson1["name"] = lesson1["name"].str.strip()
import re
lesson1["name"]=lesson1["name"].apply(lambda x : re.sub('[^A-Za-z0-9]+', '', x))
Then I want to change the 'name' values to the reference name where necessary.
I've tried the following code on the 2 lists:
bad = lesson1['name']
good = reference['name']

def changenames(lesson_list, reference_list):
    for i, name in enumerate(lesson_list):
        for j, ref in enumerate(reference_list):
            if ref in name:
                lesson_list[i] = ref

changenames(bad, good)
But 1) it's not working, due to a SettingWithCopyWarning, and 2) I fail to apply it to a column of the dataframe.
Could you help me?
Thanks,
L.
I've found a way.
I have 2 dataframes:
- the reference_list dataframe, with the names of the students; it has a column 'name'
- the lesson dataframe, with the names as the students typed them when answering the MCQs (not standardized), plus their answers to the MCQs
To transform the student names in the lesson dataframe, based on the well-typed names in reference_list['name'], I have used:
for i in lesson['name']:
    for ref in reference_list['name']:
        if ref in i:
            lesson.loc[lesson['name'] == i, 'name'] = ref
and it works fine.
After that, you can apply functions to treat duplicates, merge data, and so on.
I found help in this thread: Replace single value in a pandas dataframe, when index is not known and values in column are unique.
Hope it'll help some of you.
Louis

Copying data from a DataFrame and writing back to Excel?

I have not worked with Pandas before and am seeking guidance on the best course of action.
Currently, I have an Excel (.xlsx) spreadsheet that I am reading into a Pandas DataFrame. That spreadsheet contains account data, document control number, contract id, manufacturer contract id, series number, include exclude, start date, end date, and vendors customer id.
From that data, every account number needs to be copied to each row of data spanning document key co, document control number, contract id, manufacturer contract id, series number, include exclude, start date, end date, and vendors customer id.
Here is a sample of the data:
I've read in the DataFrame and iterated over it with the following code:
import pandas as pd

# Reads in the template data; keeps leading zeros in column B and prevents "NaN" from appearing in blank cells.
df = pd.read_excel('Contracts.xlsx', converters={'document_key_co': lambda x: str(x)}, na_filter=False)

# Iterate over the rows.
for row in df.itertuples():
    print(row)
After doing those things, that is where I am stuck. The desired outcome is this:
As you can see, the three accounts are copied to each of the contract ids.
Reading through the Pandas documentation, I considered separating each account into its own DataFrame and using concat/merge to combine it with another DataFrame covering document key co through vendors customer id, but that felt like a lot of extra code when there is likely a better solution.
I was able to accomplish the task with this snippet of code:
concats = []
# For each account value, copy the whole frame and stamp that account onto every row.
for x in df.account.values:
    concats.append(df.copy())
    concats[-1].account = x
# Stack the copies: every account is now paired with every contract row.
result = pd.concat(concats)

Copy data from another file in Excel

I have data from two different files. The first file is SHU.xls, with the data in C8:C1484:
id
=========
198610030
199210037
199210038
199410020
199410042
and the second is ikprmeidet13.xls, with the data in B2:B1040:
id name
===================
200210046 MARINA
200110026 ERRIE
200110031 KANAE
200210061 SHIINA
I want to copy the data (id and name) from ikprmeidet13.xls that doesn't exist in SHU.xls. I tried this, but it doesn't work:
=IF((VLOOKUP([ikprmeidet13.xls]ikprmeidet13!$B$2:$B$1040;$C$8:$C$1484;1;FALSE)<>$C$8:$C$1484);[ikprmeidet13.xls]ikprmeidet13!$B$2:$B$1040;"")
I put that formula in cell A1489 in SHU.xls; when I tried to evaluate it, the VLOOKUP got an error. Is there any other way to do this?
Append a copy of one list to a copy of the other, then identify and remove the duplicates to suit. Say, use:
=VLOOKUP(C1485,C$1:C$1484,1,FALSE)
to identify id matches (which display the id) or misses (which display #N/A), then filter to select the rows where the formula shows an id and delete them.
