I have two dataframes, df1 and df2:
df1:
        A1     A2
0    0.001  0.002
1      100    200
2      0.3    0.4

df2:
         B1     B2
86   0.0002  0.003
12   0.2     0.3
123  -0.001  0.000
Due to the way they were generated, their indexes do not match, and one of the dataframes is not even indexed in order. I need to combine these two dataframes regardless of index; the expected result is as follows.
      A1     A2      B1     B2
0  0.001  0.002  0.0002  0.003
1    100    200  0.2     0.3
2    0.3    0.4  -0.001  0.000
In other words, I just need to combine them based on row order. I tried
pd.concat([df1, df2], index=1, ignore_index=True) and pd.DataFrame([df1, df2]); neither works.
Use pd.concat:
>>> pd.concat([df1.reset_index(drop=True),
...            df2.reset_index(drop=True)], axis=1)
        A1       A2      B1     B2
0    0.001    0.002  0.0002  0.003
1  100.000  200.000  0.2000  0.300
2    0.300    0.400 -0.0010  0.000
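For completeness, here is a minimal reproducible sketch of this approach; the construction of df1 and df2 below is an assumption matching the example frames above.
import pandas as pd

# Rebuild the example frames with deliberately mismatched indexes
df1 = pd.DataFrame({'A1': [0.001, 100, 0.3],
                    'A2': [0.002, 200, 0.4]}, index=[0, 1, 2])
df2 = pd.DataFrame({'B1': [0.0002, 0.2, -0.001],
                    'B2': [0.003, 0.3, 0.000]}, index=[86, 12, 123])

# Dropping both indexes first makes concat align purely on row order
combined = pd.concat([df1.reset_index(drop=True),
                      df2.reset_index(drop=True)], axis=1)
print(combined)
Without the reset_index(drop=True) calls, concat would align on the original index labels (0, 1, 2 versus 86, 12, 123) and produce NaN-padded rows instead.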
1) A numpy array r which consists of strings:
import numpy as np
r = np.array([['S', 'S'],['S', 'V1'],['S', 'V2'],['V1', 'S'],['V1', 'V1']])
2) A numpy array acc containing values. The first value corresponds to the first element of each row of the two-dimensional array r, and the second value to the second element:
acc = np.array([0.613,0.387])
3) Question: I want to fill dataframe df1. For example: Row 1) Array r[0] = ['S', 'S'] contains 'S' in both positions, so fill S = 0.613 + 0.387 = 1.0 in df1, and V1 and V2 in df1 will be zero as they do not appear in that row. Row 2) Array r[1] = ['S', 'V1'] contains one 'S', so fill S = 0.613 and V1 = 0.387 in df1, and V2 = 0 (does not appear) ... and so on.
Desired output:
import pandas as pd
df1 = pd.DataFrame({'S':[1,0.613,0.613,0.387,0], 'V1': [0,0.387,0,0.613,1],'V2': [0,0,0.387,0,0]})
print(df1)
       S     V1     V2
0  1.000  0.000  0.000
1  0.613  0.387  0.000
2  0.613  0.000  0.387
3  0.387  0.613  0.000
4  0.000  1.000  0.000
You can stack the dataframe, map the values and pivot back:
s = pd.DataFrame(r).stack().reset_index(name='val')
s['level_1'] = acc[s['level_1']]
s.pivot_table(index='level_0',
              columns='val',
              values='level_1',
              aggfunc='sum',
              fill_value=0)
Output:
val          S     V1     V2
level_0
0        1.000  0.000  0.000
1        0.613  0.387  0.000
2        0.613  0.000  0.387
3        0.387  0.613  0.000
4        0.000  1.000  0.000
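To see what is happening in the intermediate step, this is roughly what the stacked frame s looks like before and after mapping level_1 into acc (a sketch reusing r and acc from the question):
s = pd.DataFrame(r).stack().reset_index(name='val')
print(s.head(4))
#    level_0  level_1 val
# 0        0        0   S
# 1        0        1   S
# 2        1        0   S
# 3        1        1  V1

# level_1 is the position within each row of r, so it indexes directly into acc
s['level_1'] = acc[s['level_1']]
The pivot then groups these mapped weights by row (level_0) and by label (val), summing duplicates such as the two 'S' entries in row 0.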
Another way, using pd.get_dummies(), np.vectorize and df.groupby() on axis=1:
df = pd.get_dummies(pd.DataFrame(r), prefix='', prefix_sep='')
s = pd.Series(acc, index=range(1, len(acc) + 1))
final = (pd.DataFrame(np.vectorize(s.get)(np.where(df.eq(1), df.cumsum(axis=1), df)),
                      columns=df.columns)
           .groupby(df.columns, axis=1).sum())
       S     V1     V2
0  1.000  0.000  0.000
1  0.613  0.387  0.000
2  0.613  0.000  0.387
3  0.387  0.613  0.000
4  0.000  1.000  0.000
I have a pandas data-frame (df) and a list (df_values).
I want to make another data-frame which contains, for each data-point in df, its distribution/percentage towards the values in the list df_values.
data-frame df is:
A
0 100
1 300
2 150
List df_values (set of referential values) is:
df_values = [[0,200,400,600]]
Desired data-frame:
Here number 100 in df is 0.50 towards 0 and 0.50 towards 200 in df_values. Similarly, 300 in df is 0.50 towards 200 and 0.50 towards 400 in df_values and so on.
      0   200  400  600
0  0.50  0.50  0.0    0
1  0.00  0.50  0.5    0
2  0.25  0.75  0.0    0
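A possible approach (a sketch, not part of the original thread): use np.searchsorted to find the pair of referential values that bracket each data point and split the weight linearly between them. This assumes every value in df['A'] falls inside the range spanned by df_values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 300, 150]})
vals = np.array([0, 200, 400, 600])      # flattened referential values

a = df['A'].to_numpy()
right = np.searchsorted(vals, a)         # index of the upper referential value
left = right - 1                         # index of the lower referential value
frac = (a - vals[left]) / (vals[right] - vals[left])   # position inside the interval

weights = np.zeros((len(df), len(vals)))
rows = np.arange(len(df))
weights[rows, left] = 1 - frac           # share attributed to the lower value
weights[rows, right] = frac              # share attributed to the upper value

out = pd.DataFrame(weights, columns=vals, index=df.index)
print(out)                               # matches the desired data-frame above
For 150, for example, searchsorted picks the interval [0, 200], and frac = 150/200 = 0.75, giving 0.25 towards 0 and 0.75 towards 200.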
I'm working with dynamic .csv files, so I never know the column names in advance. Examples:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
As I would like to create a new column with the SUM of all columns between META and %, I need to get the names of those columns, so I can write something like this:
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
As the column names change, the code above will only work for example 1). So I need to: 1) identify all the columns; and 2) sum them.
The solution has to work for the 3 examples above (1, 2 and 3).
Note that the only certainty is that the target columns lie between META and %, but even those columns are not fixed.
Select all columns except the first and last with DataFrame.iloc and then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove the META and % columns with DataFrame.drop before summing:
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
EDIT: You can select the columns between META and %:
# META and % are not numeric, so sum(axis=1) ignores them
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1)
# META is not numeric; the slice stops just before %
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1)
# most general: take only the columns strictly between META and %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
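For instance, applying the last, most general variant to example 3) above (the frame is rebuilt here by hand for illustration):
import pandas as pd

df = pd.DataFrame({'FILL': ['T', 'F', 'T'],
                   'ETC': [2.0, 4.2, 5.0],
                   'META': ['A', 'B', 'C'],
                   'G': [0.00, 2.90, 0.00],
                   'F': [6.70, 0.00, 34.53],
                   '%': ['-0.00%', '3.55%', '6.14%']})

start = df.columns.get_loc('META') + 1   # first column after META
stop = df.columns.get_loc('%')           # stop just before %
df['Total'] = df.iloc[:, start:stop].sum(axis=1)
print(df[['META', 'Total']])
#   META  Total
# 0    A   6.70
# 1    B   2.90
# 2    C  34.53
This works regardless of how many columns sit before META or between META and %.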
In addition to my previous question: how can I extract values in column format from a large text file, and how can I extract the values of a specific column? The text file looks like the excerpt below.
I want to extract the RESOL and FSC values as given in the text, to plot them in Numbers or Excel.
To be more general: what should be done if I want to extract the Part_FSC and CC values later?
Thanks
Text File:
Opening MRC/CCP4 file for WRITE...
File : map2.mrc
NX, NY, NZ: 600 600 600
MODE : real
SYMMETRY REDUNDANCY: 1
Fraction_mask, Fraction_particle = 0.1193 0.0111
C sqrt sqrt
C NO. RESOL RING RAD FSPR FSC Part_FSC Part_SSNR Rec_SSNR CC
C 2 952.50 0.0017 0.09 1.000 0.000 400.0145 479.40 0.1222
C 3 476.25 0.0033 0.19 1.000 0.000 159.3959 159.98 0.1586
C 4 317.50 0.0050 0.92 0.999 0.000 48.2248 43.27 0.0155
C 5 238.12 0.0067 0.42 1.000 0.000 88.3074 76.69 0.2637
C 6 190.50 0.0083 0.48 0.999 0.000 64.0162 56.25 0.4148
C 7 158.75 0.0100 1.41 0.992 0.000 17.1695 15.64 0.1282
C 8 136.07 0.0117 5.56 0.954 0.000 6.8244 6.47 0.0171
C 9 119.06 0.0133 1.49 0.993 0.000 16.1918 16.42 0.2729
C 10 105.83 0.0150 1.68 0.990 0.000 12.8313 13.83 0.3729
C 11 95.25 0.0167 3.55 0.969 0.000 6.8012 7.95 0.2624
C 12 86.59 0.0183 16.00 0.830 0.000 2.5273 3.13 0.0826
Untested, but should work
awk '$1 == "C"{printf "%s\t%s\n", $3,$6}' <filename>
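If you would rather do this in Python (for example to write a file that Numbers or Excel can open directly), a pandas-based sketch along the same lines could look like this; the file name fsc_log.txt and the column names are assumptions based on the header shown above:
import pandas as pd

cols = ['C', 'NO', 'RESOL', 'RING_RAD', 'FSPR', 'FSC',
        'Part_FSC', 'Part_SSNR', 'Rec_SSNR', 'CC']

# Keep only the data rows: they start with "C" followed by a ring number
rows = []
with open('fsc_log.txt') as fh:
    for line in fh:
        parts = line.split()
        if len(parts) == len(cols) and parts[0] == 'C' and parts[1].isdigit():
            rows.append(parts)

df = pd.DataFrame(rows, columns=cols).astype({'RESOL': float, 'FSC': float,
                                              'Part_FSC': float, 'CC': float})
df[['RESOL', 'FSC']].to_csv('resol_fsc.csv', index=False)   # open in Numbers/Excel
Extracting Part_FSC and CC later is then just a matter of selecting those columns instead.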
I have been reading Excel sheets into R using the RODBC package and have hit an issue with the Excel ODBC driver. Columns that contain (sufficient) leading NAs are coerced to logical.
In Excel the data appears as follows:
period n n.ft n.pt
1/02/1985 0.008 NA 0.025
1/03/1985 -0.003 NA -0.024
1/04/1985 0.002 NA 0.015
1/05/1985 0.006 NA 0.012
1/06/1985 0.001 NA 0.003
1/07/1985 0.005 NA 0.010
1/08/1985 0.006 NA 0.001
1/09/1985 0.007 NA 0.013
1/10/1985 -0.002 NA 0.009
1/11/1985 0.013 NA 0.019
1/12/1985 -0.004 NA -0.021
1/01/1986 0.008 NA 0.009
1/02/1986 0.002 NA 0.009
1/03/1986 0.002 -0.003 1.000
1/04/1986 0.010 -0.003 0.041
1/05/1986 0.000 -0.001 -0.004
1/06/1986 0.005 0.003 0.005
1/07/1986 -0.003 0.005 0.012
1/08/1986 -0.001 -0.003 -0.021
1/09/1986 0.003 -0.001 0.012
1/10/1986 0.003 0.003 0.010
1/11/1986 -0.003 0.003 -0.003
1/12/1986 0.003 -0.003 0.022
1/01/1987 0.001 0.013 -0.004
1/02/1987 0.004 -0.004 0.011
1/03/1987 0.004 0.008 0.005
1/04/1987 0.000 0.002 -0.002
1/05/1987 0.001 0.002 0.006
1/06/1987 0.004 0.010 0.00
I read in the data with:
require(RODBC)
conexcel <- odbcConnectExcel(xls.file="C:/data/example.xls")
s1 <- 'SOx'
dd <- sqlFetch(conexcel, s1)
odbcClose(conexcel)
This reads in the entire n.ft column as NA. I think this is because the column is guessed to be logical, and the subsequent numbers are therefore assessed as invalid and hence NA.
> str(dd)
'data.frame': 29 obs. of 4 variables:
$ period: POSIXct, format: "1985-02-01" "1985-03-01" ...
$ n : num 0.00833 -0.00338 0.00157 0.00562 0.00117 ...
$ n#ft : logi NA NA NA NA NA NA ...
$ n#pt : num 0.02515 -0.02394 0.0154 0.01224 0.00301 ...
I am trying to find a way to prevent this coercion to logical, which I think is causing the subsequent error.
I found this Q+A by searching SO; however, I am at work and have no hope of being permitted to edit the registry to change the default DWORD value, as suggested (I understand that the value set there determines how many NAs are required before Microsoft guesses the data type and bombs my read).
Right now, I'm thinking that the best solution is to invert the data in Excel and read it into R upside down.
I love a good hack but surely there's a better solution?
This is not a bug but a feature of ODBC (note the lack of R), as documented here:
http://support.microsoft.com/kb/257819/en-us
(long page, search for "mixed data type").
Since reading Excel files with ODBC is rather limited, I prefer one of the alternatives mentioned by Gabor, with a preference for XLConnect.