Controlling the data partition in Apache Spark - apache-spark

Data Looks Like:
col 1 col 2 col 3 col 4
row 1 row 1 row 1 row 1
row 2 row 2 row 2 row 2
row 3 row 3 row 3 row 3
row 4 row 4 row 4 row 4
row 5 row 5 row 5 row 5
row 6 row 6 row 6 row 6
Problem: I want to partition this data so that, say, rows 1 and 2 are processed as one partition, rows 3 and 4 as another, and rows 5 and 6 as another, and then create JSON data by merging each group with the column headers (column headers paired with the data values in the rows).
Output should be like:
[
{col1:row1, col2:row1, col3:row1, col4:row1},
{col1:row2, col2:row2, col3:row2, col4:row2},
{col1:row3, col2:row3, col3:row3, col4:row3},
{col1:row4, col2:row4, col3:row4, col4:row4}, ...
]
I tried using repartition(num), which is available in Spark, but it does not partition the data exactly the way I want, so the JSON data generated is not valid. I previously had an issue where my program took the same time to process the data no matter how many cores I used (that question can be found here), and the repartition suggestion was made by @Patrick McGloin. The code mentioned in that problem is what I am trying to do.

I guess what you need is partitionBy. In Scala you can provide it with a custom-built HashPartitioner, while in Python you pass a partitionFunc. There are a number of examples out there in Scala, so let me briefly explain the Python flavour.
partitionFunc expects a tuple, with the first element being the key. Let's assume you organise your data in the following fashion:
(ROW_ID, (A, B, C, ...)) where ROW_ID = [1, 2, 3, ..., k]. You can always add ROW_ID and remove it afterwards.
To get a new partition every two rows:
rdd.partitionBy(numPartitions=int(rdd.count() / 2),
                partitionFunc=lambda key: (key - 1) // 2)
partitionFunc will produce the sequence 0, 0, 1, 1, 2, 2, ... and this number is the partition to which a given row is assigned.
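Below is a minimal PySpark sketch of the whole idea, assuming an existing SparkContext sc; the four column names and the row values are only placeholders mirroring the sample data, and the per-partition JSON step uses mapPartitions, which is one possible way (not the only one) to merge the two rows of each partition:

import json

# (ROW_ID, record) pairs; the record dict stands in for the four columns
rows = [(i, {"col1": f"row{i}", "col2": f"row{i}",
             "col3": f"row{i}", "col4": f"row{i}"}) for i in range(1, 7)]
rdd = sc.parallelize(rows)  # sc is assumed to be an existing SparkContext

# one partition per two rows: rows 1-2 -> partition 0, rows 3-4 -> 1, rows 5-6 -> 2
partitioned = rdd.partitionBy(numPartitions=int(rdd.count() / 2),
                              partitionFunc=lambda key: (key - 1) // 2)

# turn each partition (two rows) into a single JSON array string
json_per_partition = partitioned.mapPartitions(
    lambda it: [json.dumps([record for _, record in it])])
print(json_per_partition.collect())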

Related

How does excel calculate values when you drag out a range?

I have been trying to find an answer online but haven't been able to find one.
When given a range of values, selecting the range and dragging out the cells will generate more values. How are these values calculated? In certain cases it is easy to figure out, like when all values are the same or when they increase by a steady interval, but how are the values calculated when a more random sequence of values is given?
For example, given the range
Val 1  Val 2  Val 3  Val 4  Val 5  Val 6
5      5      6      54     5      2
when selecting all values and dragging out to the right, I will end up with the following range:
Val 1  Val 2  Val 3  Val 4  Val 5  Val 6  Dragged out 1  Dragged out 2  Dragged out 3
5      5      6      54     5      2      16.133         17.076         18.019
How are the three dragged out values calculated?
This is done using linear regression, as calculated by the least squares method, explained in this Wikipedia article.
As an illustration, I created an Excel sheet containing the numbers 1 to 6 and added your numbers. Then I added the numbers 7-9, used the least squares method (as supported by Excel) and put everything in a graph. Please note that the original values are shown but overwritten by the estimated values in the attached graph (the yellow cells contain the formula of the cell to their left):
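To make the arithmetic concrete, here is a small plain-Python sketch (my own addition, not Excel itself) that fits the least-squares line through the six known points and extends it to positions 7-9; it reproduces the dragged-out values:

# fit y = slope * x + intercept through the six known points
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 5, 6, 54, 5, 2]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

for x in (7, 8, 9):
    print(x, round(slope * x + intercept, 3))   # 16.133, 17.076, 18.019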

How to write function to extract n+1 column in pandas

I have an Excel file with 200 columns. The first column is the number of visits, and the other columns contain the number of people for that number of visits.
Visits A B C D
2 10 0 30 40
3 5 6 0 1
4 2 3 1 0
I want to write a function that gives me multiple dataframes, each with the Visits column and one data column (Visits and A, Visits and B, and so on). I want a function because the number of columns will increase in the future and I want to automate the process. I also want to remove the rows with 0.
Desired output:
dataframe 1:
Visits A
2 10
3 5
4 2
dataframe 2:
Visits B
3 6
4 3
This is my first question, so sorry if it is not properly framed. Thank you for your help.
Use DataFrame.items:
for i, col in df.set_index('Visits').items():
    print(col[col.ne(0)].to_frame(i).reset_index())
You can create a dict to keep each result under the name of its column:
dfs = {i: col[col.ne(0)].to_frame(i).reset_index()
       for i, col in df.set_index('Visits').items()}
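Since the question asks for a function, here is a small sketch of how the same idea could be wrapped up and reused; the function name split_by_column and the inline sample dataframe are only illustrative, not part of the answer above:

import pandas as pd

def split_by_column(df, key='Visits'):
    # one {column name: two-column frame} entry per data column, with zero rows dropped
    return {name: col[col.ne(0)].to_frame(name).reset_index()
            for name, col in df.set_index(key).items()}

df = pd.DataFrame({'Visits': [2, 3, 4],
                   'A': [10, 5, 2], 'B': [0, 6, 3],
                   'C': [30, 0, 1], 'D': [40, 1, 0]})
frames = split_by_column(df)
print(frames['B'])   # the Visits/B rows where B != 0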

How do you search through a row (Row 1) of a CSV file, but search through the next row (Row 2) at the same time?

Imagine there are THREE columns and a certain number of rows in a dataframe. The first column holds random values, the second column Names, the third column Ages.
I want to search through every row of this dataframe and find where value 1 appears in the first column. Then, whenever value 1 does appear in that column, I want to know whether value 2 appears in the SAME column but in the next row.
If that is the case, copy that row's Value, Name and Age into an empty dataframe. Every time this condition is met, copy the row into the empty dataframe.
EmptyDataframe = pd.DataFrame(columns=['Name', 'Age'])
csvfile = pd.DataFrame(columns=['Value', 'Name', 'Age'])

for index, row_for_csv_dataframe in csvfile.iterrows():
    if row_for_csv_dataframe['Value'] == '1':
        # How to code this:
        # if the NEXT row after row_for_csv_dataframe has 'Value' == 2,
        # then copy 'Age' and 'Name' from row_for_csv_dataframe into the empty DataFrame.
Assuming you have a dataframe data like this:
Value Name Age
0 1 Anne 10
1 2 Bert 20
2 3 Caro 30
3 2 Dora 40
4 1 Emil 50
5 1 Flip 60
6 2 Gabi 70
You could do something like this, although this is probably not the most efficient:
iterator1 = data.iterrows()
iterator2 = data.iterrows()
iterator2.__next__()   # advance the second iterator by one row

for current, nxt in zip(iterator1, iterator2):
    if current[1].Value == 1 and nxt[1].Value == 2:
        print(current[1].Value, current[1].Name, current[1].Age)
And would get this result:
1 Anne 10
1 Flip 60
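For larger frames a vectorised variant is also possible; this is my own sketch rather than part of the answer above, using shift(-1) to compare each row's Value with the next row's:

import pandas as pd

data = pd.DataFrame({'Value': [1, 2, 3, 2, 1, 1, 2],
                     'Name': ['Anne', 'Bert', 'Caro', 'Dora', 'Emil', 'Flip', 'Gabi'],
                     'Age': [10, 20, 30, 40, 50, 60, 70]})

# rows whose Value is 1 and whose following row's Value is 2
mask = (data['Value'] == 1) & (data['Value'].shift(-1) == 2)
print(data.loc[mask, ['Value', 'Name', 'Age']])   # Anne/10 and Flip/60, as above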

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops perform correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT that the values are those of the final column of the source spreadsheet rather than the values related to each individual column.
I have tried changing the indentation to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
    #listname = i[4:] + '_norm'
    df2 = pd.read_excel(i, header=0, index_col=None, skip_blank_lines=True)
    df3 = df2.dropna(axis=0, how='any')
    cols = []
    for column in df3:
        cols.append(column)
    for x in cols:
        listname = x + ' norm'
        maxval = df3[x].max()
        print(maxval)
        mylist = []
        for j in df3[x]:
            findNL = (j / maxval)
            mylist.append(findNL)
        df3[listname] = mylist
    saveloc = 'E:/test/'
    filename = i[:-18] + '_Normalised.xlsx'
    df3.to_excel(saveloc + filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet, renamed according to listname. However, the data in each of these new columns is identical and relates to the final column of the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet rather than outputting per column) before adding the result to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what.
If I understand you correctly, you are overcomplicating things. You don't need a for loop for this; you can simplify your code:
# Make example dataframe, this is not provided
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [5, 6, 7, 8]})
print(df)

   col1  col2
0     1     5
1     2     6
2     3     7
3     4     8
Now we can use DataFrame.apply to normalise each column, add_suffix to give the new columns a _norm suffix, and after that concat the columns into one final dataframe:
df_conc = pd.concat([df, df.apply(lambda x: x / x.max()).add_suffix('_norm')], axis=1)
print(df_conc)
col1 col2 col1_norm col2_norm
0 1 5 0.25 0.625
1 2 6 0.50 0.750
2 3 7 0.75 0.875
3 4 8 1.00 1.000
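Folding that one-liner back into the original file loop could look roughly like this; distlist, the E:/test/ save location and the filename slicing come from the question's code, and the sketch assumes every remaining column is numeric:

import pandas as pd

saveloc = 'E:/test/'
for i in distlist:
    df2 = pd.read_excel(i, header=0, index_col=None)
    df3 = df2.dropna(axis=0, how='any')
    # normalise every column by its own maximum and append it as '<col>_norm'
    df3 = pd.concat([df3, df3.apply(lambda x: x / x.max()).add_suffix('_norm')], axis=1)
    df3.to_excel(saveloc + i[:-18] + '_Normalised.xlsx', index=False)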
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't noticeable. Thanks for your help @Erfan.

How to get the latest date with same ID in Excel

I want to get the record with the most recent date, as the same IDs have different dates; for each ID I need to pick the row with the latest date. Below is the sample data; the original data consists of 10,000 records.
ID Date
5 25/02/2014
5 7/02/2014
5 6/12/2013
5 25/11/2013
5 4/11/2013
3 5/05/2013
3 19/02/2013
3 12/11/2012
1 7/03/2013
2 24/09/2012
2 7/09/2012
4 6/12/2013
4 19/04/2013
4 31/03/2013
4 26/08/2012
What I would do (assuming the ID and date sit together in a single cell in column A) is put this formula in column B and fill it down
=LEFT(A1,1)
in column C
=DATEVALUE(MID(A1,2,99))
then filter column B to a specific value of interest and sort by column C to order those values by date.
Edit: even easier, do a two-level sort, by B and then by C newest to oldest. The first B in the list is the newest.
Do you need a programmatic / formula-only solution, or can you use a workflow? If a workflow will work, then how about this:
Construct a pivot table of your data.
Make the Row Labels the ID.
Make the Values "Max of Date".
The resulting table is your answer.
Row Labels Max of Date
1 07/03/13
2 24/09/12
3 05/05/13
4 06/12/13
5 25/02/14
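Not an Excel answer, but as a cross-check here is a short pandas sketch (my own addition) that reproduces the same "latest date per ID" table from the sample data:

import pandas as pd

data = pd.DataFrame({
    'ID':   [5, 5, 5, 5, 5, 3, 3, 3, 1, 2, 2, 4, 4, 4, 4],
    'Date': ['25/02/2014', '7/02/2014', '6/12/2013', '25/11/2013', '4/11/2013',
             '5/05/2013', '19/02/2013', '12/11/2012', '7/03/2013',
             '24/09/2012', '7/09/2012',
             '6/12/2013', '19/04/2013', '31/03/2013', '26/08/2012'],
})
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)

# one row per ID with its most recent date, matching the pivot table above
print(data.groupby('ID')['Date'].max())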
