How to add number from file to variable? - python-3.x

I have file:
0 3 0.071 0.082 0.002
0 4 144 145.5 0.2
0 6 0.36 0.46 0.02
and I would like to store a single number from it in a variable. How can I do that? I know how to load a whole column into a variable. Is it possible by using a table?
This is the code that loads a column into a variable, but I don't know how to load just one number.
x = np.loadtxt('file', unpack=True, usecols=[1])
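One possible reading of the question is that "one number" means a single cell of the table. Under that assumption, a minimal sketch is to load the whole file as a 2-D array and index it by row and column. The file contents are inlined through io.StringIO here only so the example runs on its own; with a real file, pass its path to np.loadtxt instead.

```python
import io
import numpy as np

# Stand-in for the file from the question; with a real file, pass its path instead.
text = """0 3 0.071 0.082 0.002
0 4 144 145.5 0.2
0 6 0.36 0.46 0.02
"""

table = np.loadtxt(io.StringIO(text))  # 2-D array with shape (3, 5)

# A single number: row index first, then column index (both start at 0).
value = table[1, 3]  # second row, fourth column
print(value)         # 145.5
```

The same array also gives whole columns (table[:, 1]) or rows (table[0]), so one load covers both cases.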

Related

How to get the column name of a dataframe from values in a numpy array

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
After some transformations I got a numpy.ndarray with shape (15, 3) named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
so on so on so on
So, 3 columns with 15 values.
What I need to do:
I want to get the df column names for the values in the first column of loads that are greater than 0.50.
For this example, the columns of df related to the first column of loads with values higher than 0.5 should return:
0 Class
2 Location
Same for the second column of loads, should return:
1 name
3 income
4 edu_level
and the same logic to the 3rd column of loads.
I managed to get the NumPy array loads the way I need it, but I am having a bad time with this last part. I know I can simply pick the columns manually, but this will be a hard task when df has more than 15 features.
Can anyone help me, please?
Given your threshold, you can create a boolean array to filter df.columns:
threshold = .5
for j in range(loads.shape[1]):
    # boolean mask over the 15 rows of column j selects the matching column names
    print(df.columns[loads[:, j] > threshold])

Creating a new column into a dataframe based on conditions

For the dataframe df :
dummy_data1 = {'category': ['White', 'Black', 'Hispanic','White'],
'Pop':['75','85','90','100'],'White_ratio':[0.6,0.4,0.7,0.35],'Black_ratio':[0.3,0.2,0.1,0.45], 'Hispanic_ratio':[0.1,0.4,0.2,0.20] }
df = pd.DataFrame(dummy_data1, columns = ['category', 'Pop','White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column to this data frame,'pop_n', by first checking the category, and then multiplying the value in 'Pop' by the corresponding ratio value in the columns. For the first row,
the category is 'White' so it should multiply 75 with 0.60 and put 45 in pop_n column.
I thought about writing something like :
df['pop_n']= (df['Pop']*df['White_ratio']).where(df['category']=='W')
This works, but just for one category.
I would appreciate any help with this.
Thanks.
Using DataFrame.filter and DataFrame.lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this needs an older version):
First we use filter to get the columns with ratio in the name. Then we split each name and keep only the first word before the underscore.
Finally we use lookup to match the category values to these columns.
# 'Pop' holds strings in the example data, so convert it before multiplying
df['Pop'] = df['Pop'].astype(int)
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
   category  Pop  White_ratio  Black_ratio  Hispanic_ratio  pop_n
0     White   75         0.60         0.30             0.1   45.0
1     Black   85         0.40         0.20             0.4   17.0
2  Hispanic   90         0.70         0.10             0.2   18.0
3     White  100         0.35         0.45             0.2   35.0
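DataFrame.lookup is not available in pandas 2.0 and later, so here is a sketch of the same idea using plain NumPy indexing instead. It is a swapped-in technique, not the answerer's code: Index.get_indexer finds the column position for each row's category, and the names ratios, col_pos and picked are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['White', 'Black', 'Hispanic', 'White'],
    'Pop': ['75', '85', '90', '100'],
    'White_ratio': [0.6, 0.4, 0.7, 0.35],
    'Black_ratio': [0.3, 0.2, 0.1, 0.45],
    'Hispanic_ratio': [0.1, 0.4, 0.2, 0.20],
})

# Ratio columns only, renamed to the bare category name ('White_ratio' -> 'White').
ratios = df.filter(like='ratio').rename(columns=lambda c: c.split('_')[0])

# For each row, the position of the column named after its category
# (get_indexer returns -1 for a category with no matching column).
col_pos = ratios.columns.get_indexer(df['category'])
picked = ratios.to_numpy()[np.arange(len(df)), col_pos]

# 'Pop' holds strings in the question, so convert before multiplying.
df['pop_n'] = df['Pop'].astype(int) * picked
print(df[['category', 'Pop', 'pop_n']])
```

If a category could be missing from the ratio columns, checking for -1 in col_pos before indexing would avoid silently picking the last column.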
Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors:
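stack = df.rename(columns=to_rename)\
          .set_index('category').stack()
# keep the entries whose category matches the column name
# (a list is needed here: indexing with a map object fails in Python 3)
factors = stack[[cat == col for cat, col in stack.index]]\
          .reset_index(drop=True)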
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35

reading data with varying length header

I want to read in python a file which contains a varying length header and then extract in a dataframe/series the variables which are coming after the header.
The data looks like :
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as :
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do that with
infile="/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'],)
but I'm skipping 53 rows only because I counted how many I should skip. I have a bunch of these files and some don't have exactly 53 rows in the header, so I was wondering what the best way to deal with this would be: is there a criterion that makes Python read only the three columns of data once it finds them? For example, I thought of having Python start reading the data after it encounters
Mole fraction error flag description :
0 : Valid data
2 : Missing data
What should I do? Is there another criterion that would work better?
You can split on the header delimiter, like so:
from io import StringIO
with open(filename, 'r') as f:
    myfile = f.read()
# keep only the text after the last header line
infile = myfile.split('Mole fraction error flag description :')[-1]
# drop the flag-description lines (they contain ' : '); you know the data better,
# so there is likely a better indicator of a line with the wrong format
infile = '\n'.join(line for line in infile.split('\n') if ' : ' not in line)
# create dataframe; read_fwf needs a path or file-like object, so wrap the string
flightdata = pd.read_fwf(StringIO(infile), header=None, names=['Time', 'C2H6', 'Flag'])
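As an alternative sketch of the "start reading after the marker" idea, the rows can be filtered by shape instead of by string content: keep only lines after the marker that split into exactly three numeric fields. The function name read_after_header and the constant MARKER are made up for this illustration, and the sketch assumes the marker line is present in every file.

```python
import pandas as pd

MARKER = 'Mole fraction error flag description'

def read_after_header(lines):
    """Parse the data rows that follow the header, given the file as a list of lines."""
    # Index of the last line containing the marker (assumed to exist in every file).
    start = max(i for i, line in enumerate(lines) if MARKER in line)
    rows = []
    for line in lines[start + 1:]:
        parts = line.split()
        if len(parts) != 3:
            continue  # legend lines such as '0 : Valid data', blanks, '.....'
        try:
            rows.append([float(p) for p in parts])
        except ValueError:
            continue  # three fields, but not all numeric
    frame = pd.DataFrame(rows, columns=['Time', 'C2H6', 'Flag'])
    return frame.astype({'Time': int, 'Flag': int})

sample = """Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
.....
"""
flightdata = read_after_header(sample.splitlines())
print(flightdata)
```

With a real file, the list of lines would come from something like open(path).read().splitlines(); the header length then no longer matters.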

Pandas is converting month 10 into month 1. Is there a format issue here?

I have the following DataFrame
data inflation
0 2000.01 0.62
1 2000.02 0.13
2 2000.03 0.22
3 2000.04 0.42
4 2000.05 0.01
5 2000.06 0.23
6 2000.07 1.61
7 2000.08 1.31
8 2000.09 0.23
9 2000.10 0.14
Note that the format of the Year Month is with a dot
When I try to convert to DateTime as in:
inflation.data = pd.to_datetime(inflation.data, format='%Y.%m')
I get both line 0 and line 9 as 2000-01-01
That means pandas is automatically changing .10 into .01
Is that a bug? or just a format issue?
You're actually using the formatting codes in pandas slightly incorrectly.
Look at the Pandas helpfile
pandas.to_datetime(*args, **kwargs)[source]
Convert argument to datetime.
Parameters:
arg : string, datetime, list, tuple, 1-d array, Series
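You appear to be feeding it float64 values when it expects strings. As a float, 2000.10 is the same number as 2000.1, so the trailing zero is gone before the format is applied.
Try the following code, which builds the column from strings.
Converting an existing float column with inflation.data.apply(str) is not enough, because str(2000.10) gives '2000.1'; read the column as text in the first place, or format it with two decimals.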
f0=['2000.01',
'2000.02',
'2000.03',
'2000.04',
'2000.05',
'2000.06',
'2000.07',
'2000.08',
'2000.09',
'2000.10']
inflation = pd.DataFrame(f0, columns=['data'])  # a list, not a set, for the column names
inflation.data = pd.to_datetime(inflation.data, format='%Y.%m')
output
Out[3]:
0 2000-01-01
1 2000-02-01
2 2000-03-01
3 2000-04-01
4 2000-05-01
5 2000-06-01
6 2000-07-01
7 2000-08-01
8 2000-09-01
9 2000-10-01
Name: data, dtype: datetime64[ns]
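If the column has already been read as float64, the trailing zero can be restored by formatting each value with two decimals before parsing. This is a minimal sketch under the assumption that every value has the form year.month with a two-digit month; the variable name as_text is made up for the example.

```python
import pandas as pd

inflation = pd.DataFrame({
    'data': [2000.01, 2000.09, 2000.10],  # float64, as in the question
    'inflation': [0.62, 0.23, 0.14],
})

# str(2000.10) is '2000.1', which '%Y.%m' reads as January.
# Formatting with two decimals puts the trailing zero back first.
as_text = inflation['data'].map('{:.2f}'.format)
inflation['data'] = pd.to_datetime(as_text, format='%Y.%m')
print(inflation)
```

Reading the column as text from the start (for example with dtype=str when loading the file) avoids the round trip through floats altogether.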
This is an interesting problem. Because the column is stored as floats, 2000.10 becomes 2000.1 and is then read as month 1, and you can't use any string split methods on the current float type.
Here is my take on this:
Use the modf function from Python's math module, which returns the fractional and integer parts of x.
Then round the year and month values and convert them to strings for to_datetime to interpret.
import math
df['Year']= df.data.apply(lambda x: round(math.modf(x)[1])).astype(str)
df['Month']= df.data.apply(lambda x: round((math.modf(x)[0])*100)).astype(str)
df = df.drop('data', axis = 1)
df['Date'] = pd.to_datetime(df.Year+':'+df.Month, format = '%Y:%m')
df = df.drop(['Year', 'Month'], axis = 1)
You get
inflation Date
0 0.62 2000-01-01
1 0.13 2000-02-01
2 0.22 2000-03-01
3 0.42 2000-04-01
4 0.01 2000-05-01
5 0.23 2000-06-01
6 1.61 2000-07-01
7 1.31 2000-08-01
8 0.23 2000-09-01
9 0.14 2000-10-01

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns form a key and 4 columns are the value to find.
Actually, I have rendered an object at different distances and perspective angles and have calculated Hu moments for its contour. This is not important to the question, just an example to picture.
So, when I have 7 values, I need to scan the table, find the closest values in those 7 columns and extract the corresponding 4 values.
The aspects of the task to consider are as follows:
1) the numbers have errors
2) the scale in the function domain is not the same as the scale in the function value; i.e. the "distance" between points in the 7-dimensional space should depend on how it affects those 4 values
3) the search should be fast
So the question is: is there an algorithm that solves this task efficiently, i.e. performs some indexing on those 7 columns, not the way conventional databases do, but taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values looks like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can "lookup" the associated value with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
    distances, indices = tree.query(observation)
    print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
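The benchmarking advice above can be sketched as follows. The array sizes and the helper brute_force are assumptions chosen for illustration, and the timings depend entirely on the machine and on the real data sizes, so no expected numbers are shown.

```python
import timeit
import numpy as np
import scipy.cluster.vq as vq
import scipy.spatial as spatial

rng = np.random.default_rng(0)
code_book = rng.random((2000, 7))     # illustrative size, not from the question
observations = rng.random((200, 7))

def brute_force(obs, book):
    # Squared Euclidean distance from every observation to every code_book row.
    d2 = ((obs[:, None, :] - book[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

tree = spatial.KDTree(code_book)      # built once, outside the timed calls

idx_brute = brute_force(observations, code_book)
idx_vq, _ = vq.vq(observations, code_book)
_, idx_tree = tree.query(observations)

# Report whether the three methods pick the same nearest rows.
print((idx_brute == idx_vq).all(), (idx_brute == idx_tree).all())

for name, fn in [('brute force', lambda: brute_force(observations, code_book)),
                 ('vq.vq', lambda: vq.vq(observations, code_book)),
                 ('KDTree.query', lambda: tree.query(observations))]:
    print(name, timeit.timeit(fn, number=5))
```

Rerunning this with the real number of rows and observations is what decides between the approaches.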
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key in that row:
( (X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2 )^0.5
where {X1,X2,X3,X4,X5,X6,X7} is your key and {K1,K2,K3,K4,K5,K6,K7} is the key at row K.
You could make one component of the key more or less relevant than the others by multiplying it while computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to give it more influence.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the one you stored, replace the distance and the row number.
When you have checked all the rows in your table, the second variable will hold the row nearest to the key.
Here is some pseudo-code:
int Row = 0
float Key[7]    # suppose it is already filled with some values
float ClosestDistance = +infinity
int ClosestRow = 0
while Row < NumberOfRows {
    # suppose Distance is a function that outputs the distance and
    # Table is the table to search, with NumberOfRows rows and 7+4 columns
    NewDistance = Distance(Key, Table[Row][0:7])
    if NewDistance < ClosestDistance {
        ClosestDistance = NewDistance
        ClosestRow = Row
    }
    increase Row by 1
}
ValueFound = Table[ClosestRow][7:11]    # this should be the value you were looking for
I know it isn't fast, but it is the best I could do. Hope it helped.
P.S. I know I haven't considered measurement errors.
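The exhaustive search described in this answer, including the per-component weighting, can be sketched in Python with NumPy. The function name nearest_row, the weights argument and the small example table are made up for illustration; the square root is skipped because it does not change which row is closest.

```python
import numpy as np

def nearest_row(key, table, weights=None, n_key=7):
    """Return (row_index, values) for the row whose first n_key columns are closest to key."""
    keys = table[:, :n_key]
    if weights is None:
        weights = np.ones(n_key)
    # Weighted squared distance from the key to every row at once.
    d2 = (weights * (keys - key) ** 2).sum(axis=1)
    row = int(d2.argmin())
    return row, table[row, n_key:]

# Three rows: 7 key columns followed by 4 value columns.
table = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10, 11, 12, 13],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 20, 21, 22, 23],
    [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 30, 31, 32, 33],
])
key = np.array([0.9, 0.8, 1.1, 1.0, 0.9, 1.2, 1.0])

row, values = nearest_row(key, table)
print(row, values)  # 1 [20. 21. 22. 23.]
```

Passing, say, weights=np.array([5, 1, 1, 1, 1, 1, 1]) makes the first key component count five times as much, as suggested in the answer.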
