I have 2 related DataFrames. Is there an easy way to combine them into a single DataFrame with a MultiIndex?
import pandas as pd

df = pd.DataFrame(
    [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 1]],
    columns=["c1", "c2", "c3", "c4"],
)
idx = pd.Index(['p1', 'p2', 'p3', 'p4'])
df = df.set_index(idx)
df output is:
c1 c2 c3 c4
p1 1 0 1 0
p2 1 1 0 0
p3 1 0 0 1
p4 0 1 0 1
df2 = pd.DataFrame(
    [[0, 10, 30, 0],
     [20, 10, 0, 0],
     [0, 10, 0, 6],
     [15, 0, 18, 5]],
    columns=["c1", "c2", "c3", "c4"],
)
idx2 = pd.Index(['a1', 'a2', 'a3', 'a4'])
df2 = df2.set_index(idx2)
df2 output is:
c1 c2 c3 c4
a1 0 10 30 0
a2 20 10 0 0
a3 0 10 0 6
a4 15 0 18 5
The final DataFrame should have a three-level MultiIndex (p, c, a) and a single column (value):
          value
p1 c1 a2     20
      a4     15
   c3 a1     30
      a4     18
p2 c1 a2     20
      a4     15
   c2 a1     10
      a2     10
      a3     10
p3 c1 a2     20
      a4     15
   c4 a3      6
      a4      5
p4 c2 a1     10
      a2     10
      a3     10
   c4 a3      6
      a4      5
You can reshape and merge:
(df.reset_index().melt('index')
.loc[lambda x: x.pop('value').eq(1)]
.merge(df2.reset_index().melt('index').query('value != 0'),
on='variable')
.set_index(['index_x', 'variable', 'index_y'])
.rename_axis([None, None, None])
)
output:
          value
p1 c1 a2     20
      a4     15
p2 c1 a2     20
      a4     15
p3 c1 a2     20
      a4     15
p2 c2 a1     10
      a2     10
      a3     10
p4 c2 a1     10
      a2     10
      a3     10
p1 c3 a1     30
      a4     18
p3 c4 a3      6
      a4      5
p4 c4 a3      6
      a4      5
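For reference, the long-form frame that feeds the merge looks roughly like this; melt keeps the original row labels in an index column and the column names under variable (a quick sketch of the intermediate step, with melted and kept as purely illustrative names):

melted = df.reset_index().melt('index')   # columns: index, variable, value
kept = melted[melted['value'].eq(1)]      # keep only the cells that are 1
print(kept)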
If order matters:
(df.stack().reset_index()
.loc[lambda x: x.pop(0).eq(1)]
.set_axis(['index', 'variable'], axis=1)
.merge(df2.reset_index().melt('index').query('value != 0'),
on='variable', how='left')
.set_index(['index_x', 'variable', 'index_y'])
.rename_axis([None, None, None])
)
output:
          value
p1 c1 a2     20
      a4     15
   c3 a1     30
      a4     18
p2 c1 a2     20
      a4     15
   c2 a1     10
      a2     10
      a3     10
p3 c1 a2     20
      a4     15
   c4 a3      6
      a4      5
p4 c2 a1     10
      a2     10
      a3     10
   c4 a3      6
      a4      5
I am attempting to merge two DataFrames and fill in missing values in one from the other. Hopefully this isn't too long an explanation; I have been wracking my brain around this for too long. I am working with 2 huge CSV files, so I made a small example here. I have included the entire code at the end in case you are curious or want to assist. Thank you in advance! Here we go:
print(df1)
    A   B   C   D   E
0   1  B1      D1  E1
1          C1  D1  E1
2   1  B1      D1  E1
3   2  B2      D2  E2
4      B2  C2  D2  E2
5   3          D3  E3
6   3  B3  C3  D3  E3
7   4      C4  D4  E4
print(df2)
    A   B   C   F   G
0   1      C1  F1  G1
1      B2  C2  F2  G2
2   3  B3      F3  G3
3   4  B4  C4  F4  G4
I would essentially like to merge df2 into df1 by 3 different columns. I understand that you can merge on multiple column names, but that does not give me the desired result. I want to KEEP all data in df1 and fill in the data from df2, so I use how='left'.
I am fairly new to Python and have done a lot of research but have hit a sticking point. Here is what I have tried.
data3 = df1.merge(df2, how='left', on=['A'])
print(data3)
    A B_x C_x   D   E B_y C_y   F   G
0   1  B1      D1  E1      C1  F1  G1
1          C1  D1  E1  B2  C2  F2  G2
2   1  B1      D1  E1      C1  F1  G1
3   2  B2      D2  E2 NaN NaN NaN NaN
4      B2  C2  D2  E2  B2  C2  F2  G2
5   3          D3  E3  B3      F3  G3
6   3  B3  C3  D3  E3  B3      F3  G3
7   4      C4  D4  E4  B4  C4  F4  G4
As you can see, it sort of worked with just A. However, since this is a CSV file with blank values, the blank values merge with each other, which I do not want: because df2 was blank in row 2, it filled in data wherever it saw blanks. It should be NaN if it could not find a match.
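I suspect part of the fix is to make the blanks real missing values so an empty cell never counts as a key, something like the sketch below (assuming the blanks really are empty strings after reading the CSV), though I am not sure it gets me the rest of the way:

import numpy as np

# Sketch: read blanks as missing values, then drop df2 rows with no A value
# from the right-hand side so a blank A in df1 cannot find a "match".
df1 = df1.replace('', np.nan)
df2 = df2.replace('', np.nan)
data3 = df1.merge(df2.dropna(subset=['A']), how='left', on=['A'])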
Whenever I start putting additional columns into my on=['A', 'B'], it does not do anything different. In fact, the rows that matched on A alone no longer merge.
data3 = df1.merge(df2, how='left', on=['A', 'B'])
print(data3)
    A   B C_x   D   E C_y   F   G
0   1  B1      D1  E1 NaN NaN NaN
1          C1  D1  E1 NaN NaN NaN
2   1  B1      D1  E1 NaN NaN NaN
3   2  B2      D2  E2 NaN NaN NaN
4      B2  C2  D2  E2  C2  F2  G2
5   3          D3  E3 NaN NaN NaN
6   3  B3  C3  D3  E3      F3  G3
7   4      C4  D4  E4 NaN NaN NaN
Columns A, B, and C are the values I want to correlate and merge on. Using both data frames, it should have enough information to fill in all the gaps. My ending df should look like:
print(desired_output):
    A   B   C   D   E   F   G
0   1  B1  C1  D1  E1  F1  G1
1   1  B1  C1  D1  E1  F1  G1
2   1  B1  C1  D1  E1  F1  G1
3   2  B2  C2  D2  E2  F2  G2
4   2  B2  C2  D2  E2  F2  G2
5   3  B3  C3  D3  E3  F3  G3
6   3  B3  C3  D3  E3  F3  G3
7   4  B4  C4  D4  E4  F4  G4
Even though A, B, and C have repeating rows, I want to keep ALL the data and just fill in the data from df2 where it fits, even if it is repeated data. I also do not want all of the _x and _y suffixes from merging. I know how to rename, but doing 3 different merges and then merging those merges gets really complicated really fast with repeated rows and suffixes.
Long story short, how can I merge both DataFrames by A, then B, then C? The order in which it happens is irrelevant.
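The closest I have gotten is a rough sketch that merges on one key at a time and uses combine_first to fill the gaps; it assumes the blanks have been turned into NaN and that df2 has at most one row per key value, and it still would not fill values that only exist in other df1 rows:

import numpy as np

df1 = df1.replace('', np.nan)
df2 = df2.replace('', np.nan)

out = df1.copy()
for key in ['A', 'B', 'C']:
    # Merge on one key at a time; df2 rows without that key are left out so blanks never match.
    m = out.merge(df2.dropna(subset=[key]), how='left', on=key, suffixes=('', '_new'))
    # Fill gaps in the existing columns from the freshly merged ones, then drop the helpers.
    new_cols = [c for c in m.columns if c.endswith('_new')]
    for col in new_cols:
        m[col[:-4]] = m[col[:-4]].combine_first(m[col])
    out = m.drop(columns=new_cols)
print(out)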
Here is a sample of the actual data. I have my own data with additional columns, and I relate it to this data by certain identifiers, basically MMSI, Name and IMO. I want to keep duplicates because they aren't actually duplicates, just additional data points for each vessel.
MMSI BaseDateTime LAT LON VesselName IMO CallSign
366940480.0 2017-01-04T11:39:36 52.48730 -174.02316 EARLY DAWN 7821130 WDB7319
366940480.0 2017-01-04T13:51:07 52.41575 -174.60041 EARLY DAWN 7821130 WDB7319
273898000.0 2017-01-06T16:55:33 63.83668 -174.41172 MYS CHUPROVA NaN UAEZ
352844000.0 2017-01-31T22:51:31 51.89778 -176.59334 JACHA 8512920 3EFC4
352844000.0 2017-01-31T23:06:31 51.89795 -176.59333 JACHA 8512920 3EFC4
Let's say that I have a dataset that contains 5 binary columns for 2 rows.
It looks like this:
c1 c2 c3 c4 c5
r1 0 1 0 1 0
r2 1 1 1 1 0
I want to create a matrix that counts, for each pair of columns, the number of rows in which both columns are 1, kind of like a confusion matrix.
My desired output is:
c1 c2 c3 c4 c5
c1 - 1 1 1 0
c2 1 - 1 2 0
c3 1 1 - 1 0
c4 1 2 1 - 0
I have used pandas crosstab, but it only gives the desired output when using 2 columns. I want to use all of the columns.
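For reference, here is one way to build the example frame shown above:

import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 1, 0],
     [1, 1, 1, 1, 0]],
    index=['r1', 'r2'],
    columns=['c1', 'c2', 'c3', 'c4', 'c5'],
)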
dot
df.T.dot(df)
# same as
# df.T @ df
c1 c2 c3 c4 c5
c1 1 1 1 1 0
c2 1 2 1 2 0
c3 1 1 1 1 0
c4 1 2 1 2 0
c5 0 0 0 0 0
You can use np.fill_diagonal to make the diagonal zero:
import numpy as np

d = df.T.dot(df)
np.fill_diagonal(d.to_numpy(), 0)
d
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
And as long as we're using NumPy, you could go all the way...
a = df.to_numpy()
b = a.T @ a
np.fill_diagonal(b, 0)
pd.DataFrame(b, df.columns, df.columns)
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
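To see what the product is counting, the same number can be computed directly for a single pair of columns (just an illustration):

(df['c2'].eq(1) & df['c4'].eq(1)).sum()   # 2, matching the c2/c4 entry above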
A way of using melt and merge with groupby:
s = df.reset_index().melt('index').loc[lambda x: x.value == 1]
(s.merge(s, on='index')
 .query('variable_x != variable_y')
 .groupby(['variable_x', 'variable_y'])['value_x']
 .sum()
 .unstack(fill_value=0))
Out[32]:
variable_y c1 c2 c3 c4
variable_x
c1 0 1 1 1
c2 1 0 1 2
c3 1 1 0 1
c4 1 2 1 0
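Note that c5 drops out here because it never equals 1. If you need it back, you can reindex the result (result is just a hypothetical name for the frame produced above):

result = result.reindex(index=df.columns, columns=df.columns, fill_value=0)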
I've never used an Excel macro before and was just given the task of writing a macro that moves a completed task to the bottom of the document 10 days after the entered completion date (i.e. if it was finished on April 14 and that date was entered into the completion column, then on April 24 the whole row is moved to the bottom of the document).
Below is the table I made up to practice with.
Current_Date Id Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Completed_Date Completed_Plus_10 Real_C+10 Post
5/2/2016 1 a1 ab1 b1 bc1 c1 1 2/18/2016 2/28/2016 2/28/2016 Y
5/2/2016 2 a2 ab2 b2 bc2 c2 2 1/10/1900 N
5/2/2016 3 a3 ab3 b3 bc3 c3 3 1/10/1900 N
5/2/2016 4 a4 ab4 b4 bc4 c4 4 4/21/2016 5/1/2016 5/1/2016 Y
5/2/2016 5 a5 ab5 b5 bc5 c5 5 1/10/1900 N
5/2/2016 6 a6 ab6 b6 bc6 c6 6 1/10/1900 N
5/2/2016 7 a7 ab7 b7 bc7 c7 7 3/14/2016 3/24/2016 3/24/2016 Y
5/2/2016 8 a8 ab8 b8 bc8 c8 8 1/10/1900 N
5/2/2016 9 a9 ab9 b9 bc9 c9 9 1/10/1900 N
5/2/2016 10 a10 ab10 b10 bc10 c10 10 5/2/2016 5/12/2016 5/12/2016 N
5/2/2016 11 a11 ab11 b11 bc11 c11 11 1/10/1900 N
5/2/2016 12 a12 ab12 b12 bc12 c12 12 1/10/1900 N
5/2/2016 13 a13 ab13 b13 bc13 c13 13 1/10/1900 N
5/2/2016 14 a14 ab14 b14 bc14 c14 14 1/10/1900 N
5/2/2016 15 a15 ab15 b15 bc15 c15 15 1/10/1900 N
5/2/2016 16 a16 ab16 b16 bc16 c16 16 1/10/1900 N
5/2/2016 17 a17 ab17 b17 bc17 c17 17 1/10/1900 N
5/2/2016 18 a18 ab18 b18 bc18 c18 18 1/10/1900 N
5/2/2016 19 a19 ab19 b19 bc19 c19 19 1/10/1900 N
5/2/2016 20 a20 ab20 b20 bc20 c20 20 1/10/1900 N
Here is a little macro so that you don't have to start on an empty page:
My table looks as follows:
Date id samp1 samp2 Complete date
05.02.2016 2 abc abc asd
10.02.2016 4 ghi ghi 01.02.2016
07.02.2016 3 def def asd
11.05.2016 5 jkl jkl 06.05.2016
I used the macro recorder and copied the second line to the end of the table. Then I used "Find & Select" > "Go to Special" > choose "Blanks" > "Enter". Then I deleted them (found here: http://www.exceltrick.com/how_to/delete-blank-rows-in-excel/).
In the recorded macro I added the variable LastRow (to make it dynamic).
Sub mv()
'
' mv macro
'
' Keyboard shortcut: Ctrl+Shift+A
'
    Dim LastRow As Long
    LastRow = Cells(Rows.Count, "A").End(xlUp).Row

    'here you should add a loop over the rows and the If statement
    Range("A3:E3").Select
    Selection.Cut
    Range("A" & LastRow + 1).Select
    ActiveSheet.Paste
    'and here the end of the If statement and the loop

    Range("A1").Select
    Application.CommandBars("Selection and Visibility").Visible = False
    Selection.SpecialCells(xlCellTypeBlanks).Select
    Selection.EntireRow.Delete
End Sub
What the macro basically does:
cuts the Range("A3:E3")
pastes it at the end of the table
deletes the empty rows left behind
You can take this macro and adjust it to your problem: loop over the rows, and if the date from column 1 minus the date from column 5 is bigger than 10, then cut and paste the row; otherwise do nothing.
I hope this helps
I have to load a bunch of numbers from Excel into Access. I used the Import Excel data option in Access to load the data earlier.
Earlier:
Field1 Field2 Field3 QTY
A1 B1 C1 1
A1 B2 C2 2
A1 B3 C3 3
A1 B4 C4 4
A1 B5 C5 5
My data is in a crosstab format now.
For example:
A1 B1 B2 B3 B4 B5
C1 1 0 0 0 0
C2 0 2 0 0 0
C3 0 0 3 0 0
C4 0 0 0 4 0
C5 0 0 0 0 5
Is there a direct way in which I can import this into Access, or do I have to use a macro to convert it into the linear form used earlier?
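For context, the reshaping I need is a classic unpivot. Just to illustrate the target shape (not the Access import itself), in pandas it would look something like this:

import pandas as pd

# Illustration only: the crosstab above, unpivoted back to the earlier linear layout.
ct = pd.DataFrame(
    [[1, 0, 0, 0, 0],
     [0, 2, 0, 0, 0],
     [0, 0, 3, 0, 0],
     [0, 0, 0, 4, 0],
     [0, 0, 0, 0, 5]],
    index=['C1', 'C2', 'C3', 'C4', 'C5'],
    columns=['B1', 'B2', 'B3', 'B4', 'B5'],
)
linear = (ct.rename_axis('Field3')
            .reset_index()
            .melt('Field3', var_name='Field2', value_name='QTY')
            .query('QTY != 0')
            .assign(Field1='A1')
            [['Field1', 'Field2', 'Field3', 'QTY']])
print(linear)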