How do you search through a row (Row 1) of a CSV file, but search through the next row (Row 2) at the same time? - python-3.x

Imagine there are THREE columns and a certain number of rows in a dataframe. The first column contains random values, the second column Names, and the third column Ages.
I want to search through every row of this dataframe and find where the value 1 appears in the first column. At the same time, whenever value 1 does appear, I want to know whether value 2 appears in the SAME column but in the next row.
If this is the case, copy that first row's Value, Name and Age into an empty dataframe. Every time this condition is met, copy the row into the empty dataframe.
import pandas as pd

EmptyDataframe = pd.DataFrame(columns=['Name', 'Age'])
csvfile = pd.DataFrame(columns=['Value', 'Name', 'Age'])

for index, row_for_csv_dataframe in csvfile.iterrows():
    if row_for_csv_dataframe['Value'] == '1':
        # How to code this:
        # if the NEXT row after row_for_csv_dataframe has 'Value' == 2,
        # then copy 'Age' and 'Name' from row_for_csv_dataframe into the empty DataFrame.
        pass

Assuming you have a dataframe data like this:
Value Name Age
0 1 Anne 10
1 2 Bert 20
2 3 Caro 30
3 2 Dora 40
4 1 Emil 50
5 1 Flip 60
6 2 Gabi 70
You could do something like this, although this is probably not the most efficient:
iterator1 = data.iterrows()
iterator2 = data.iterrows()
next(iterator2)  # advance the second iterator so it always points one row ahead

for current, nxt in zip(iterator1, iterator2):
    if current[1].Value == 1 and nxt[1].Value == 2:
        print(current[1].Value, current[1].Name, current[1].Age)
And you would get this result:
1 Anne 10
1 Flip 60
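A vectorized alternative (just a sketch, assuming the same data dataframe as above): shift(-1) aligns each row with the row that follows it, so no explicit loop is needed:
mask = (data['Value'] == 1) & (data['Value'].shift(-1) == 2)
result = data.loc[mask, ['Name', 'Age']]
print(result)  # prints Name and Age for Anne and Flip, matching the loop above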

Related

How to find if a number in a row matches a number in a column, and then automatically offset the number in the column to another value

How to find if a number in a row matches a number in a column, and then automatically offset the number in the column to another value. For example:
Row 0 1 2 3 4 5
Answer x x x x x x
Column
0 100
1 340
2 500
3 266
4 455
5 800
So if "0" in the Row array matches "0" in the Column array, then show 340, and so on. I can do this with nested IF statements, but is there an easier way when you have 100s of columns? Thanks
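One way to avoid nested IF statements (a sketch, assuming the intent is to look up each Row value as a position in the Column array and return the entry one place further on, e.g. 0 -> 340; the names row and column are illustrative):
import pandas as pd

row = pd.Series([0, 1, 2, 3, 4, 5])
column = pd.Series([100, 340, 500, 266, 455, 800])

# Use the shifted column as a lookup table indexed by the Row values
answer = row.map(column.shift(-1))
print(answer.tolist())  # [340.0, 500.0, 266.0, 455.0, 800.0, nan]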

How to extract rows before and after when the flag changes from 0 to 1

I have one dataframe. I want to extract the 2 rows before the flag changes from 0 to 1 and get the row where the value of 'B' is minimum, and also extract the two rows after the flag becomes 1 and get the row with the minimum value of 'B'.
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
df_out = pd.DataFrame({'A': [4, 1],
                       'B': [4, 1],
                       'flag': [0, 1]})
To find indices of both rows of interest, run:
ind1 = df[df.flag.shift(-1).eq(0) & df.flag.shift(-2).eq(1)].index[0]
ind2 = df[df.index > ind1].B.idxmin()
For your data sample the result is 2 and 6.
Then, to retrieve rows with these indices, run:
df.loc[[ind1, ind2]]
The result is:
A B flag
2 4 4 0
6 1 1 1
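Putting the pieces together as one runnable sketch (using the df from the question; the comments describe what the shift conditions select):
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})

# ind1: the row whose next row still has flag 0 and whose second-next row has flag 1,
# i.e. the row two positions before the 0 -> 1 change (index 2 here)
ind1 = df[df.flag.shift(-1).eq(0) & df.flag.shift(-2).eq(1)].index[0]

# ind2: among the rows after ind1, the index of the minimum B (index 6 here)
ind2 = df[df.index > ind1].B.idxmin()

print(df.loc[[ind1, ind2]])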

Selecting different columns for each row

I have a dataframe which has 500K rows, 7 columns for days, and columns for the start and end day.
I search for a value (for example, equal to 0) in the range (startDay, endDay).
For example, for id_1, startDay=1 and endDay=7, so I should look for the value in columns D1 to D7.
For id_2, startDay=4 and endDay=7, so I should look for the value in columns D4 to D7.
However, I couldn't search over a different column range for each row successfully.
As mentioned above:
if startDay > endDay, I should see "-999";
otherwise, I need to find the first zero (within the day range): for example, for id_3 the first zero is in column D2 (day 2) and the startDay of id_3 is 1, so I want to see 2 - 1 = 1 (D2 - startDay);
if I cannot find a 0, I want to see "8".
Here is my data:
data = {
    'D1': [0,1,1,0,1,1,0,0,0,1],
    'D2': [2,0,0,1,2,2,1,2,0,4],
    'D3': [0,0,1,0,1,1,1,0,1,0],
    'D4': [3,3,3,1,3,2,3,0,3,3],
    'D5': [0,0,3,3,4,0,4,2,3,1],
    'D6': [2,1,1,0,3,2,1,2,2,1],
    'D7': [2,3,0,0,3,1,3,2,1,3],
    'startDay': [1,4,1,1,3,3,2,2,5,2],
    'endDay': [7,7,6,7,7,7,2,1,7,6]
}
data_idx = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5',
            'id_6', 'id_7', 'id_8', 'id_9', 'id_10']
df = pd.DataFrame(data, index=data_idx)
What I want to see:
df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)
You can create a boolean array to check, in each row, which 'Dx' column(s) lie at or after 'startDay', at or before 'endDay', and hold the value 0. For the first two conditions, you can use np.ufunc.outer with the ufuncs np.less_equal and np.greater_equal, such as:
import numpy as np

arr_bool = (np.less_equal.outer(df.startDay, range(1, 8))     # which columns Dx are at or after startDay
            & np.greater_equal.outer(df.endDay, range(1, 8))  # which columns Dx are at or before endDay
            & (df.filter(regex='D[0-9]').values == 0))        # which values in the columns Dx are 0
Then you can use np.argmax to find the first True per row. By adding 1 and subtracting 'startDay', you get the values you are looking for. Then you handle the other conditions with np.select, replacing the value with -999 where df.startDay >= df.endDay and with 8 where there is no True in the row of arr_bool, such as:
df_need = pd.DataFrame((np.argmax(arr_bool, axis=1) + 1 - df.startDay).values,
                       index=data_idx, columns=['need'])
df_need.need = np.select(condlist=[df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                         choicelist=[-999, 8],
                         default=df_need.need)
print(df_need)
need
id_1 0
id_2 1
id_3 1
id_4 0
id_5 8
id_6 2
id_7 -999
id_8 -999
id_9 8
id_10 1
One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select, not df.startDay > df.endDay as in your question. You can change it to the strict comparison, but then you get 8 instead of -999 in this case.

Compare each row with all previous strings in one column and change the value of another column in Python

I have a csv file named namelist.csv; it includes:
Index String Size Name
1 AAA123000DDD 10 One
2 AAA123DDDQQQ 20 One
3 AAA123000DDD 25 One
4 AAA123D 20 One
5 ABA 15 One
6 FFFrrrSSSBBB 60 Two
7 FFFrrrSSSBBB 30 Two
8 FFFrrrSS 50 Two
9 AAA12 70 Two
I want to compare the rows in the String column within each Name group: if the string in a row matches or is a substring of a row above it, then remove that previous row and add its value in the Size column to the Size of the substring row.
Example: take the 3rd row, AAA123000DDD, and compare it to the two rows above it (1st and 2nd). It matches the 1st row, so the 1st row is removed and the value of its Size column is added to the Size column of the 3rd row.
Then the table will look like:
Index String Size Name
2 AAA123DDDQQQ 20 One
3 AAA123000DDD 35 One
4 AAA123D 20 One
...
The final result will be:
Index String Size Name
3 AAA123000DDD 35 One
4 AAA123D 40 One
5 ABA 15 One
8 FFFrrrSS 140 Two
9 AAA12 70 Two
I am thinking of using pandas groupby to group by the Name column, but I don't know how to apply the comparison on the String column and the sum on the Size column.
I am new to Python, so I will greatly appreciate any help.
Assuming Name is uniquely determined by String, here's how you would do the aggregation. I kept Name so that it also shows in the final DataFrame.
df_group = df.groupby(['String', 'Name'])['Size'].sum().reset_index()
Edit:
To match the substrings (assuming, as the example above suggests, that a substring will not match multiple full strings), you can build a mapping of substrings to full strings and then group by the full-string column as before:
all_strings = set(df['String'])
substring_dict = dict()
for row in df.itertuples():
    for item in all_strings:
        if row.String in item:
            substring_dict[row.String] = item

def match_substring(x):
    return substring_dict[x]

df['full_strings'] = df.String.apply(match_substring)
df_group = df.groupby(['full_strings', 'Name'])['Size'].sum().reset_index()
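If the grouping also has to follow the row order described in the question, so that each row absorbs only the earlier rows it matches or is a substring of, a more literal loop-based sketch could look like this (the DataFrame construction simply reproduces namelist.csv from the question; it is not optimised for large data):
import pandas as pd

df = pd.DataFrame({
    'Index': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'String': ['AAA123000DDD', 'AAA123DDDQQQ', 'AAA123000DDD', 'AAA123D', 'ABA',
               'FFFrrrSSSBBB', 'FFFrrrSSSBBB', 'FFFrrrSS', 'AAA12'],
    'Size': [10, 20, 25, 20, 15, 60, 30, 50, 70],
    'Name': ['One', 'One', 'One', 'One', 'One', 'Two', 'Two', 'Two', 'Two'],
})

def collapse(group):
    kept = []                                    # rows that survive so far, in order
    for _, row in group.iterrows():
        row = row.copy()
        survivors = []
        for prev in kept:
            if row['String'] in prev['String']:  # match or substring of an earlier row
                row['Size'] += prev['Size']      # absorb that row's Size
            else:
                survivors.append(prev)           # otherwise the earlier row is kept
        survivors.append(row)
        kept = survivors
    return pd.DataFrame(kept)

result = df.groupby('Name', sort=False, group_keys=False).apply(collapse)
print(result)  # matches the final result shown in the question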

Controlling the data partition in Apache Spark

Data Looks Like:
col 1 col 2 col 3 col 4
row 1 row 1 row 1 row 1
row 2 row 2 row 2 row 2
row 3 row 3 row 3 row 3
row 4 row 4 row 4 row 4
row 5 row 5 row 5 row 5
row 6 row 6 row 6 row 6
Problem: I want to partition this data, let's say row 1 and row 2 will be processed as one partition, row 3 and row 4 as another, row 5 and row 6 as another, and create JSON data merging them together with the columns (column headers paired with the data values in the rows).
Output should be like:
[
{col1: row1, col2: row1, col3: row1, col4: row1},
{col1: row2, col2: row2, col3: row2, col4: row2},
{col1: row3, col2: row3, col3: row3, col4: row3},
{col1: row4, col2: row4, col3: row4, col4: row4}, ...
]
I tried using repartition(num), which is available in Spark, but it does not partition exactly as I want, therefore the JSON data generated is not valid. I had an issue with my program taking the same time to process the data even though I was using a different number of cores, which can be found here, and repartition was suggested by Patrick McGloin. The code mentioned in that problem is what I am trying to do.
I guess what you need is partitionBy. In Scala you can provide it with a custom-built HashPartitioner, while in Python you pass a partitionFunc. There are a number of examples out there in Scala, so let me briefly explain the Python flavour.
partitionBy expects an RDD of tuples whose first element is the key; partitionFunc is applied to that key. Let's assume you organise your data in the following fashion:
(ROW_ID, (A, B, C, ...)) where ROW_ID = [0, 1, 2, ..., k-1]. You can always add a ROW_ID and remove it afterwards.
To get a new partition every two rows:
rdd.partitionBy(numPartitions=int(rdd.count() / 2),
                partitionFunc=lambda key: int(key / 2))
partitionFunc will produce the sequence 0, 0, 1, 1, 2, 2, ... and this number is the partition to which a given row will belong.
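A minimal end-to-end sketch of this idea (the column names, the zipWithIndex keying and the to_json helper are illustrative assumptions, not part of the original question):
import json
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

cols = ['col1', 'col2', 'col3', 'col4']
rows = [['row1'] * 4, ['row2'] * 4, ['row3'] * 4,
        ['row4'] * 4, ['row5'] * 4, ['row6'] * 4]

# Attach a 0-based ROW_ID and key the RDD by it: (ROW_ID, [values...])
rdd = sc.parallelize(rows).zipWithIndex().map(lambda x: (x[1], x[0]))

# Two consecutive rows per partition: ROW_IDs 0,1 -> partition 0; 2,3 -> 1; ...
partitioned = rdd.partitionBy(numPartitions=int(rdd.count() / 2),
                              partitionFunc=lambda key: int(key / 2))

# Build one JSON array per partition by zipping the headers with each row's values
def to_json(rows_in_partition):
    records = [dict(zip(cols, values)) for _, values in rows_in_partition]
    yield json.dumps(records)

print(partitioned.mapPartitions(to_json).collect())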
