Make a subset of a dataframe according to column value in Python - python-3.x

I have generated a dataframe and saved it as a CSV file. Now I want to build a subset of the dataframe by checking the value of the "dst" column: each block of rows starts where dst is 0 and runs up to the next 0, and I want the values of the Image column from each block.
My current dataframe is:
Image Maxval locx locy dst
0 1.jpg 0.99 22 47 0
1 7.jpg 0.46 27 65 18.68
2 11.jpg 0.32 18 29 18.43
8 18.jpg 0.25 7 38 17.49
10 1.jpg 0.99 40 71 0
11 18.jpg 0.56 27 71 17.68
13 7.jpg 0.42 93 17 19.43
19 11.jpg 0.35 70 39 17.49
The images are sorted according to Maxval, so I don't want to change the order of the images. I want my dataframe to be:
Image Image
1.jpg 1.jpg
7.jpg 18.jpg
11.jpg 7.jpg
18.jpg 11.jpg

If the first value in the dst column is always 0, compare the column with 0 and build a group id with cumsum (cumulative sum), number the rows within each group with GroupBy.cumcount, and finally use DataFrame.pivot:
df['c'] = df['dst'].eq(0).cumsum()
df['g'] = df.groupby('c').cumcount()
# DataFrame.pivot requires keyword arguments in pandas 2.0+
df1 = df.pivot(index='g', columns='c', values='Image').add_prefix('Image_').rename_axis(None).rename_axis(None, axis=1)
print (df1)
Image_1 Image_2
0 1.jpg 1.jpg
1 7.jpg 18.jpg
2 11.jpg 7.jpg
3 18.jpg 11.jpg
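To make the intermediate columns easier to follow, here is a minimal self-contained sketch of the same idea. It rebuilds only the Image and dst columns from the question, and the names block, position and wide are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (only the columns that matter here).
df = pd.DataFrame(
    {
        "Image": ["1.jpg", "7.jpg", "11.jpg", "18.jpg",
                  "1.jpg", "18.jpg", "7.jpg", "11.jpg"],
        "dst": [0, 18.68, 18.43, 17.49, 0, 17.68, 19.43, 17.49],
    },
    index=[0, 1, 2, 8, 10, 11, 13, 19],
)

# Every dst == 0 starts a new block, so the running sum acts as a block id.
block = df["dst"].eq(0).cumsum()

# Position of each row inside its block (0, 1, 2, ...), in the original order.
position = df.groupby(block).cumcount()

# One row per position, one column per block.
wide = (
    df.assign(c=block, g=position)
    .pivot(index="g", columns="c", values="Image")
    .add_prefix("Image_")
)
print(wide)
```

Because the row order inside each block is taken from cumcount, the Maxval ordering of the images is preserved.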

Here is another approach:
Get the groups of images based on the dst column:
groups = df.groupby(df.dst.eq(0).cumsum())['Image']
Concatenate the groups after resetting the index of each one:
pd.concat([group.rename('Image_' + str(indx)).reset_index(drop=True) for indx, group in groups], axis=1)
Output:
Image_1 Image_2
0 1.jpg 1.jpg
1 7.jpg 18.jpg
2 11.jpg 7.jpg
3 18.jpg 11.jpg
As you can see, I also renamed the columns in the concat call by renaming each series, but that is not necessary if you really want the name "Image" for every group.

Related

Fill in missing values in DataFrame Column which is incrementing by 10

Say some values in the 'Counts' column are missing. The numbers are meant to increase by 10 with each row, so 35 and 55 need to be filled in. I would like to fill in these missing values.
Counts
0 25
1 NaN
2 45
3 NaN
4 65
So my output should be :
Counts
0 25
1 35
2 45
3 55
4 65
Thanks,
You can use interpolate, which fills the gaps linearly by default:
df=df.interpolate()
Counts
0 25.0
1 35.0
2 45.0
3 55.0
4 65.0
Since you know the pattern, you can simply recreate it:
import numpy as np
start = df.iloc[0]['Counts'] # first row
end = df.iloc[-1]['Counts'] # last row
df['Counts'] = np.where(df['Counts'].notnull(), df['Counts'],
                        np.arange(start, end + 1, 10))
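As a small runnable comparison of the two answers, the sketch below applies both to the sample data. The variable names interpolated, expected and rebuilt are illustrative, and the second variant is a slight variation that assumes only that the first value is present and that the step is exactly 10.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Counts": [25, np.nan, 45, np.nan, 65]})

# Variant 1: linear interpolation between the known neighbours.
interpolated = df["Counts"].interpolate()

# Variant 2: rebuild the arithmetic sequence from the first value
# (assumes a fixed step of 10 and a non-missing first row).
start = df["Counts"].iloc[0]
expected = start + 10 * np.arange(len(df))
rebuilt = df["Counts"].fillna(pd.Series(expected, index=df.index))

print(interpolated.tolist())
print(rebuilt.tolist())
```

Interpolation recovers the pattern for gaps that sit between two known values, while rebuilding the sequence relies on the step really being constant.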

Pandas: saving individual files for columns that have the same name in two dataframes

Hello, I want to concatenate two dataframes that share the same column names and save one file per column, with each file named after its column.
My dataframes look like this:
A1=
name exam1 exam2 exam3 exam4
arun 0 12 25 0
joy 20 1 0 26
jeev 30 0 0 25
B1=
name exam1 exam2 exam3 exam4
arun 20 26 0 0
joy 30 0 25 3
jeev 17 2 15 25
What I want as output:
Save a different file for each column name, such as exam1.txt, exam2.txt, exam3.txt, etc. I have a very big dataframe.
Each output file should look like this
example: exam1.txt
name exam1_A1 exam1_B1
arun 0 20
joy 20 30
jeev 30 17
I tried to concatenate the two dataframes with pd.concat([A1,B1], axis=0) but was not able to get what I wanted. Can anyone suggest a solution?
You can do a loop with merge:
for col in A1.columns[1:]:
    # pair the same exam column from both frames, matched on name
    (A1[['name', col]]
     .merge(B1[['name', col]], on='name', suffixes=('_A1', '_B1'))
     .to_csv(f'{col}.txt')
    )
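A self-contained sketch of the merge loop on the sample data follows. Writing into a temporary directory, using a tab separator and dropping the index are assumptions made here so the files resemble the requested layout; out_dir is an illustrative name.

```python
import tempfile
from pathlib import Path

import pandas as pd

A1 = pd.DataFrame({
    "name": ["arun", "joy", "jeev"],
    "exam1": [0, 20, 30], "exam2": [12, 1, 0],
    "exam3": [25, 0, 0], "exam4": [0, 26, 25],
})
B1 = pd.DataFrame({
    "name": ["arun", "joy", "jeev"],
    "exam1": [20, 30, 17], "exam2": [26, 0, 2],
    "exam3": [0, 25, 15], "exam4": [0, 3, 25],
})

# Assumption: write to a temporary directory instead of the working directory.
out_dir = Path(tempfile.mkdtemp())

for col in A1.columns[1:]:
    merged = A1[["name", col]].merge(
        B1[["name", col]], on="name", suffixes=("_A1", "_B1")
    )
    # One file per exam column, named after the column.
    merged.to_csv(out_dir / f"{col}.txt", sep="\t", index=False)

print(pd.read_csv(out_dir / "exam1.txt", sep="\t"))
```

Selecting only two columns per frame before merging keeps each intermediate result small, which may matter for a very large dataframe.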

Pandas : merge dataframes with conditions

I'd like to do something pretty complicated, I think.
So I have 2 pandas DataFrames,
contact_extrafields (which is a CSV file converted to a DataFrame):
contact_id departement age region size
0 17068CE3 5 19.5
1 788159ED 59 18 ABC
2 4796EDA9 69 100.0
3 2BB080E4 32 DEF 50.5
4 8562B30E 10 GHI 79.95
5 9602758E 67 JKL 23.7
6 3CBBA9F7 65 MNO 14.7
7 DAE5EE44 75 98 159.6
8 5B9E3410 49 10 PQR 890.1
...
datafield_types (which is a dictionary converted to a DataFrame):
name datatype_id datafield_id datatype_name
0 size 1 4 float
1 region 2 3 string
2 age 3 2 integer
3 departement 3 1 integer
I would like a new DataFrame like this :
contact_id datafield_id string_value integer_value boolean_value float_value
0 17068CE3 4 19.5
1 17068CE3 3
2 17068CE3 2 5
3 17068CE3 1
4 788159ED 4
5 788159ED 3 ABC
6 788159ED 2 18
7 788159ED 1 59
....
The DataFrame contact_extrafields contains about 3 million lines.
EDIT (example):
If I take contact_id 788159ED from the DataFrame contact_extrafields,
I'll take the name of each column and its value,
check the type of the value in the DataFrame datafield_types using the column name,
for example, for the column departement the value is 59 and its type is integer according to the DataFrame datafield_types, so the datatype_id is 3,
and it should insert a line in the new DataFrame that I will create, like this:
contact_id datafield_id string_value integer_value boolean_value float_value
0 788159ED 1 59
....
The datafield_id is retrieved from the DataFrame datafield_types; this tells me that contact 788159ED had the value 59 for the column departement, which is of integer type.
Each column creates a row in the DataFrame I want to create.
Is it possible to do it with pandas?
How to do it?
The columns in contact_extrafields can change (so I will change the names in datafield_types too).
I've tried a lot of things, all of which ran out of memory.
My code is running on a machine with 16 GB of RAM.
Thanks a lot !
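One possible direction, offered as a rough sketch and not as a tested solution for the full 3 million rows: reshape contact_extrafields to long form with melt, attach the type information with merge, and spread the value into one column per type. The two-row sample below is reconstructed from the first rows of the question, and the names long and result are illustrative.

```python
import numpy as np
import pandas as pd

# Two contacts reconstructed from the question; missing cells are NaN/None.
contact_extrafields = pd.DataFrame({
    "contact_id": ["17068CE3", "788159ED"],
    "departement": [np.nan, 59],
    "age": [5, 18],
    "region": [None, "ABC"],
    "size": [19.5, np.nan],
})
datafield_types = pd.DataFrame({
    "name": ["size", "region", "age", "departement"],
    "datatype_id": [1, 2, 3, 3],
    "datafield_id": [4, 3, 2, 1],
    "datatype_name": ["float", "string", "integer", "integer"],
})

# One row per (contact, column) pair.
long = contact_extrafields.melt(
    id_vars="contact_id", var_name="name", value_name="value"
)

# Attach datafield_id and the type name for each source column.
long = long.merge(
    datafield_types[["name", "datafield_id", "datatype_name"]],
    on="name", how="left",
)

# Copy the value into the column matching its declared type, NaN elsewhere.
type_names = ["string", "integer", "boolean", "float"]
for type_name in type_names:
    long[f"{type_name}_value"] = long["value"].where(
        long["datatype_name"] == type_name
    )

result = (
    long.sort_values(["contact_id", "datafield_id"], ascending=[True, False])
    .reset_index(drop=True)
    [["contact_id", "datafield_id"] + [f"{t}_value" for t in type_names]]
)
print(result)
```

The melted frame has one row per contact per column, so on the full data it may be worth processing contact_extrafields in slices and appending each slice's result to a file to limit peak memory.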

Iterate over rows in a data frame, create a new column, then add more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output I need takes the quantity difference between two consecutive dates, divides it by 24 hours, and creates 23 columns, each adding the hourly quantity difference to the column before it, as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to do this with a loop but it's not working. My code is as below:
for i in df.index:
    diff=(df.get_value(i+1,'Quantity')-df.get_value(i,'Quantity'))/24
    for j in range(24):
        df[i,[1+j]]=df.[i,[j]]*(1+diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, use resample and interpolate, then pivot the output (this assumes Date is already a datetime column):
s=df.set_index('Date').resample('1 H').interpolate()
s=pd.pivot_table(s,index=s.index.date,columns=s.groupby(s.index.date).cumcount(),values=s,aggfunc='mean')
s.columns=s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly.
for loop approach:
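list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:  # every row except the last has a next row
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty) / 24  # hourly step towards the next day's quantity
        list_of_values.append(diff)
    else:
        list_of_values.append(0)  # no next day for the last row
df['diff'] = list_of_values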
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the columns required.
i.e.
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
.
.
.
.
There are other approaches which will work way better.
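One such alternative, sketched under the assumption that the rows are already sorted by date: compute the hourly step with shift and build all 23 columns in one pass. The names step, hourly and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-04-25", "2019-04-26", "2019-04-27"]),
    "Quantity": [100, 148, 124],
})

# Hourly step towards the next day's quantity; NaN for the last row,
# which has no following day.
step = (df["Quantity"].shift(-1) - df["Quantity"]) / 24

# Hour-h is the starting quantity plus h hourly steps.
hourly = pd.DataFrame(
    {f"Hour-{h}": df["Quantity"] + h * step for h in range(1, 24)}
)

result = pd.concat([df, hourly], axis=1)
print(result[["Date", "Quantity", "Hour-1", "Hour-2", "Hour-23"]])
```

Here the last row keeps NaN in the hourly columns, which matches the empty last row in the desired output of the question.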

Spotfire Add several columns with a custom expression

I would like to add several columns to the Y axis of a bar chart with a custom expression. I have several columns whose names begin with "HB" or "PASS".
Their number and their names change every time I refresh the table, but HB or PASS always remains in the column name.
I tried to use this expression:
Sum($map("[$csearch([pvtable],"PASS*")]",","))/Count([SUBLOT_ID])
or
$map("[$csearch([pvtable],"PASS*")]",","))
It works if only one column has PASS or HB in its name, but not if several columns have these keywords in their names.
Here is an example of my data. The values are percentages.
LOT_ID SUBLOT_ID WL_PART_CNT PASS_HB1 PASS_HB2 HB5 HB10 HB13 HB25
Q640123 01 3841 86 11 0.25 0.5 0.25 2
Q640123 05 3841 96 3 0 1 0 0
Q640123 10 3841 80 12 0 2 4 2
Q640123 16 3841 40 50 1 1 4 4
Q640123 22 3841 85 5 9 0.5 0.5 0
Q640345 01 3841 86 11 0.25 0.5 0.25 2
Q640345 05 3841 96 3 1 0 0 0
Q640345 10 3841 80 12 0 2 4 2
Q640345 16 3841 40 50 1 1 4 4
Q640345 22 3841 85 5 9 0.5 0.5 0
I want to put LOT_ID on X and all the PASS columns together on Y. I don't want to color my bar chart, but I would like a result like this: one bar chart with all the PASS columns and another with all the HB columns.
This bar chart represents HB.
Thank you for your help, regards, Laurent
You shouldn't need the $map function, only $csearch:
Sum($csearch([pvtable],"PASS*")) /Count([SUBLOT_ID])
EDIT
After looking at your test data, you will need to map the values.
$map("sum([$csearch([pvtable],"PASS*")])","+"),$map("sum([$csearch([pvtable],"HB*")])","+")
Then, on your X axis you will need: <[LOT_ID] NEST [Axis.Default.Names]>

Resources