New dataframe row by row from a different format dataframe - python-3.x

I'm wondering if something like this could be achieved with Python.
I currently have the following dataframe (df1):
A B C D E F
1.1.1 amba 131 1 50 4
2.2.2 erto 50 7 131 8
3.3.3 gema 131 2 50 5
And I would like to get this output in a new dataframe (df2):
ID User 131 50
1.1.1 amba 1 4
2.2.2 erto 8 7
3.3.3 gema 2 5
Bear in mind that df1 has an undetermined number of rows and df2 should have the same number of rows as df1. The first and second columns do not change and stay the same. Columns C and E in df1 store attribute IDs while columns D and F store the attributes' values. For example, in the first row of df1, 131=1 and 50=4. Also, attribute IDs are not always in the same column: a given attribute ID could appear in column C or column E.
I am thinking of creating df2 using a loop and analyzing rows with a lambda, but I am currently having trouble making anything work. Any ideas?
I have understood every part of the code and I am now adding columns, but I am wondering if this could be done with a loop or something similar. This is how the code looks after adding 4 extra columns:
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 131 1 50 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep="\s+")
df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
                  df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
                  df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
                  df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
                  ])
       .rename(columns={"A":"ID","B":"User"})
       .set_index(["ID","User","key"])
       .unstack(2)
       .reset_index()
       )
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
print(df2)
And this is the output:
ID User 40 50 131 150
0 1.1.1 amba 3 4 1 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
So yes, everything is working fine, but I would like to find a way to do this with a loop instead of having tons of lines (I have about 70 columns per row). Thank you very much for the help.
I have just one extra question and then I will have everything working. In my actual table some rows have 60 columns and others only 30 or so, which means the shorter rows are full of NaN, and I get an error when I try to unstack. I have read about pivot_table, drop_duplicates, etc., but I'm not sure how to make any of these work with this code. Thanks!
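One possible way to build the concat pieces in a loop and cope with the NaN rows is sketched below. It is only a sketch: it assumes the first two columns are the ID/User columns and the remaining columns always alternate key, value, key, value; the NaN keys produced by shorter rows are dropped before unstacking (duplicate NaN keys per ID are likely what makes unstack fail).
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 131 1 50 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep="\s+")

# assumption: the columns after the first two alternate key, value, key, value, ...
id_cols = list(df1.columns[:2])
rest = list(df1.columns[2:])

pieces = []
for key_col, val_col in zip(rest[::2], rest[1::2]):
    pieces.append(df1[id_cols + [key_col, val_col]]
                  .rename(columns={id_cols[0]: "ID", id_cols[1]: "User",
                                   key_col: "key", val_col: "val"}))

df2 = (pd.concat(pieces)
       .dropna(subset=["key"])     # rows with fewer attributes produce NaN keys; drop them
       .set_index(["ID", "User", "key"])
       .unstack(2)
       .reset_index())

# flatten the columns..
df2.columns = [c[1] if c[0] == "val" else c[0] for c in df2.columns.to_flat_index()]
print(df2)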

Logically you have a mix of keys that are partly in the rows and partly in the columns. Construct a df with concat() that has the whole key as part of the row. Then it's a simple case of using unstack() to get what you want:
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO(""" A B C D E F
1.1.1 amba 131 1 50 4
2.2.2 erto 50 7 131 8
3.3.3 gema 131 2 50 5"""), sep="\s+")

df2 = (pd.concat([df1.drop(columns=["C","D"]).rename(columns={"E":"key","F":"val"}),
                  df1.drop(columns=["E","F"]).rename(columns={"C":"key","D":"val"}),
                  ])
       .rename(columns={"A":"ID","B":"User"})
       .set_index(["ID","User","key"])
       .unstack(2)
       .reset_index()
       )
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
output
ID User 50 131
1.1.1 amba 4 1
2.2.2 erto 7 8
3.3.3 gema 5 2

Related

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df = pd.DataFrame(dict)
In this manner I have 20 columns with 5 numerical entries in each column.
I want to have a new column where the value should come as the following logic:
0 A[0]*B[0] + A[0]*C[0] + A[0]*D[0].......
1 A[1]*B[1] + A[1]*C[1] + A[1]*D[1].......
2 A[2]*B[2] + A[2]*C[2] + A[2]*D[2].......
I tried the following, but I cannot type out all 20 columns manually, so I wanted to know how to apply a loop to get the desired output:
lst=[]
for i in range(0,5):
    j = df.A[i]*df.B[i] + df.A[i]*df.C[i] + .......
    lst.append(j)
    i = i+1
A potential solution is the following. I am only taking the example you posted, but it works fine for more columns. Your data is df:
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column D by first subsetting your dataframe:
add = df.loc[:, df.columns != 'A']
and then multiplying column A by the row-wise sum of that subset, which is the same as summing all the products A*B + A*C + ...:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200
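If a loop is still preferred, here is a minimal sketch under the same assumption (every column except A gets multiplied by A and the products are summed); it is just an illustration, not the only way:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 233, 29, 2],
                   'C': [10, 20, 3040, 230, 238]})

# accumulate A*col for every column except A, however many columns there are
total = 0
for col in df.columns:
    if col != 'A':
        total = total + df['A'] * df[col]

df['D'] = total
print(df)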

How can I sort 3 columns and assign it to one python pandas

I have a dataframe:
df = {A:[1,1,1], B:[2012,3014,3343], C:[12,13,45], D:[111,222,444]}
but I need to join the last 3 columns in consecutive order horizontally and assign the result to the first column, something like this:
df2 = {A:[1,1,1,2,2,2], Fusion3:[2012,12,111,3014,13,222]}
I have tried with .melt, but I am struggling with some ideas and would be grateful for your comments.
From the desired output I'm making the assumption that the initial dataframe should have 1, 2, 3 in the A column rather than 1, 1, 1.
import pandas as pd
df= pd.DataFrame({'A':[1,2,3], 'B':[2012,3014,3343], 'C':[12,13,45], 'D':[111,222,444]})
df = df.set_index('A')
df = df.stack().droplevel(1)
will give you this series:
A
1 2012
1 12
1 111
2 3014
2 13
2 222
3 3343
3 45
3 444
Check melt
out = df.melt('A').drop('variable', axis=1)
Out[15]:
A value
0 1 2012
1 2 3014
2 3 3343
3 1 12
4 2 13
5 3 45
6 1 111
7 2 222
8 3 444
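If the exact layout asked for in the question is needed (a single Fusion3 column, with each row's B, C, D values kept consecutive per value of A), one possible follow-up on the melt idea, a minimal sketch (the Fusion3 name comes from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2012, 3014, 3343],
                   'C': [12, 13, 45], 'D': [111, 222, 444]})

# melt, drop the column-name label, then group each original row's values together
df2 = (df.melt('A', value_name='Fusion3')
         .drop(columns='variable')
         .sort_values('A', kind='stable')   # stable sort keeps the B, C, D order within each A
         .reset_index(drop=True))
print(df2)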

pandas: get rows from one dataframe which exist in another dataframe

I have two dataframes. The dataframes are as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6).
Here in df1, the numbers column holds multiple unique numbers. These numbers also appear in df2, where they are repeated depending on the duration. I want to get all the data from df2 for the numbers that appear in df1.
Here is the code I've tried, but I'm not able to figure out how this can be solved using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique numbers that are in df2, but I need to get all the data, including the other columns from the second dataframe.
It would be great that anyone can help me on this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same.
import pandas as pd

df1 = pd.DataFrame({"numbers": [123, 1234, 12345, 5421]})
df2 = pd.DataFrame({"numbers": [123, 1234, 12345, 123, 123, 45643], "B": [1, 2, 3, 4, 5, 6], "C": [2, 3, 4, 5, 6, 7]})
final_df = df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: all rows of df2 whose numbers are present in df1 are returned.
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
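As an alternative sketch (not part of the original answer), an inner merge on numbers returns the same matching rows; note that, unlike isin, a merge would duplicate rows if df1 itself contained repeated numbers.
final_df = df2.merge(df1, on="numbers", how="inner")  # keep only rows whose numbers appear in df1
print(final_df)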

Pandas : merge dataframes with conditions

I'd like something pretty complicated, I think.
So I have 2 pandas DataFrames:
contact_extrafields (which is a CSV file converted to a DataFrame):
contact_id departement age region size
0 17068CE3 5 19.5
1 788159ED 59 18 ABC
2 4796EDA9 69 100.0
3 2BB080E4 32 DEF 50.5
4 8562B30E 10 GHI 79.95
5 9602758E 67 JKL 23.7
6 3CBBA9F7 65 MNO 14.7
7 DAE5EE44 75 98 159.6
8 5B9E3410 49 10 PQR 890.1
...
datafield_types (which is a dictionary converted to a DataFrame):
name datatype_id datafield_id datatype_name
0 size 1 4 float
1 region 2 3 string
2 age 3 2 integer
3 departement 3 1 integer
I would like a new DataFrame like this :
contact_id datafield_id string_value integer_value boolean_value float_value
0 17068CE3 4 19.5
1 17068CE3 3
2 17068CE3 2 5
3 17068CE3 1
4 788159ED 4
5 788159ED 3 ABC
6 788159ED 2 18
7 788159ED 1 59
....
The DataFrame contact_extrafields contains about 3 million lines.
EDIT (example):
If I take contact_id 788159ED from the DataFrame contact_extrafields,
I take the name of each column and its value,
and check the type of that value in the DataFrame datafield_types using the column name.
For example, for the column departement the value is 59 and its type is integer according to the DataFrame datafield_types, so the id is 3.
It should insert a row in the new DataFrame I will create, like this:
contact_id datafield_id string_value integer_value boolean_value float_value
0 788159ED 1 59
....
The datafield_id is retrieved from the DataFrame datafield_types; this lets me know that contact 788159ED had the value 59 for the column departement, which is of integer type.
Each column creates a row in the DataFrame I want to build.
Is it possible to do it with pandas?
How to do it?
The columns in contact_extrafields can change (so I will change the datafield_types names too).
I've tried a lot of things, but they have led to memory saturation.
My code is running on a machine with 16 GB of RAM.
Thanks a lot!
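No answer is shown here, but a minimal sketch of one possible approach: melt contact_extrafields to long format, join on the column name to pick up datafield_id and datatype_name from datafield_types, then spread each value into one typed column per datatype. The column names come from the question; the sample data below is made up, and chunking the melt would probably be needed for the real 3 million rows.
import pandas as pd
import numpy as np

# hypothetical sample data following the question's tables
contact_extrafields = pd.DataFrame({
    "contact_id": ["17068CE3", "788159ED"],
    "departement": [5, 59],
    "age": [np.nan, 18],
    "region": [np.nan, "ABC"],
    "size": [19.5, np.nan],
})
datafield_types = pd.DataFrame({
    "name": ["size", "region", "age", "departement"],
    "datatype_id": [1, 2, 3, 3],
    "datafield_id": [4, 3, 2, 1],
    "datatype_name": ["float", "string", "integer", "integer"],
})

# 1) wide -> long: one row per (contact_id, column name, value)
long_df = contact_extrafields.melt(id_vars="contact_id", var_name="name", value_name="value")

# 2) attach datafield_id and datatype_name for each column name
long_df = long_df.merge(datafield_types[["name", "datafield_id", "datatype_name"]], on="name", how="left")

# 3) spread the value into one typed column per datatype
for dtype in ["string", "integer", "boolean", "float"]:
    long_df[dtype + "_value"] = long_df["value"].where(long_df["datatype_name"] == dtype)

result = (long_df.drop(columns=["name", "value", "datatype_name"])
                 .sort_values(["contact_id", "datafield_id"], ascending=[True, False])
                 .reset_index(drop=True))
print(result)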

Using Pandas filtering non-numeric data from two columns of a Dataframe

I'm loading a Pandas dataframe which has many data types (loaded from Excel). Two particular columns should be floats, but occasionally a researcher entered in a random comment like "not measured." I need to drop any rows where any values in one of two columns is not a number and preserve non-numeric data in other columns. A simple use case looks like this (the real table has several thousand rows...)
import pandas as pd
df = pd.DataFrame(dict(A=pd.Series([1, 2, 3, 4, 5]),
                       B=pd.Series([96, 33, 45, '', 8]),
                       C=pd.Series([12, 'Not measured', 15, 66, 42]),
                       D=pd.Series(['apples', 'oranges', 'peaches', 'plums', 'pears'])))
Which results in this data table:
A B C D
0 1 96 12 apples
1 2 33 Not measured oranges
2 3 45 15 peaches
3 4 66 plums
4 5 8 42 pears
I'm not clear how to get to this table:
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
I tried dropna, but the types are "object" since there are non-numeric entries.
I can't convert the values to floats without either converting the whole table, or doing one series at a time which loses the relationship to the other data in the row. Perhaps there is something simple I'm not understanding?
You can first create a subset with columns B and C, apply to_numeric, and check that all values are not null. Then use boolean indexing:
print(df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1))
0 True
1 False
2 True
3 False
4 True
dtype: bool
print(df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
The next solution uses str.isdigit with isnull and xor (^):
print(df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull())
0 True
1 False
2 True
3 False
4 True
dtype: bool
print(df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
But the solution with to_numeric, isnull and notnull is fastest:
print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
         ^ pd.to_numeric(df['C'], errors='coerce').isnull()])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
Timings:
#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)
In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop
In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop
In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.49 ms per loop
