I have two arguments that I want to print:
print('{0:25}${2:>5.2f}'.format('object', 20))
But it gives the following error:
Traceback (most recent call last):
IndexError: tuple index out of range
But I get the desired output when I change the code to the following:
print('{0:25}${2:>5.2f}'.format('object', 20, 20))
I don't understand why, as I only have two sets of {}. Thanks
Your problem is the 2 index after the $ sign:
print('{0:25}${2:>5.2f}'.format('object', 20, 20))
When you use .format on a string in Python, the number in {number:} is the index of the argument you want in that spot.
For example, the following:
"hello there {1:} i want you to give me {0:} dollars".format(2,"Tom")
will result in the following output:
'hello there Tom i want you to give me 2 dollars'
There is a simple example here:
https://www.programiz.com/python-programming/methods/string/format
So to sum up, for your code to work, just use:
print('{0:25}${1:>5.2f}'.format('object', 20))
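As a side note, if you don't want to track the indices at all, empty braces are auto-numbered from left to right, so this is equivalent:
print('{:25}${:>5.2f}'.format('object', 20))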
It should be
>>> print('{0:25}${1:>5.2f}'.format('object', 20))
object                   $20.00
Note the change of the placeholder from 2 to 1
print('{0:25}${1:>5.2f}'.format('object', 20))
###            ^
When you add a third parameter (a second 20), the placeholder 2 finds a value:
>>> print('{0:25}${2:>5.2f}'.format('object', 20, 20))
object                   $20.00
But without the third parameter, an index out of range exception is thrown.
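An alternative that sidesteps index bookkeeping entirely (just an option, not required here) is named placeholders:
>>> print('{name:25}${price:>5.2f}'.format(name='object', price=20))
object                   $20.00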
I'm getting a lengthy error traceback whose last line is as stated in the title.
I'm trying to use the 'nearest' method to fill the missing values during reindexing.
Here's my code:
import pandas as pd
s1=pd.Series([1,2,3,4],index=list('aceg'))
print(s1.reindex(pd.Index(list('abdg')),method='nearest'))
I was trying to see whether filling in the missing values happens after reindexing or during it, which might affect the result in this case of method='nearest'.
Changing the method to ffill or bfill works fine.
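For example, the same reindex with method='ffill' runs without error (a quick check using the same series):
# ffill/bfill only need an ordered index, not a numeric distance, so they work here
print(s1.reindex(pd.Index(list('abdg')), method='ffill'))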
It's not possible to do that with strings because the distance between two strings doesn't mean much. For this use case, you can convert your one-character index as a number with the ord function:
s1 = pd.Series([1,2,3,4], index=list('aceg'))
idx = pd.Index(list('gdba'))
s1.index = idx[s1.index.map(ord).reindex(idx.map(ord), method='nearest')[1]]
print(s1)
# Output:
a 1
b 2
d 3
g 4
dtype: int64
Details:
>>> s1.index.map(ord)
Int64Index([97, 98, 100, 103], dtype='int64')
>>> idx.map(ord)
Int64Index([103, 100, 98, 97], dtype='int64')
If you have a string index instead of a one-character index, you can handle it with fuzzy matching and the Levenshtein distance.
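As a rough sketch of that idea (my own illustration: the labels are made up, and stdlib difflib stands in for a proper Levenshtein library):
import difflib
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['apple', 'cherry', 'egg', 'grape'])
target = ['apricot', 'banana', 'date', 'grapefruit']

# map each target label to the most similar existing label
closest = [difflib.get_close_matches(t, s1.index, n=1, cutoff=0)[0] for t in target]
result = pd.Series(s1[closest].values, index=target)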
I am attempting to speed up dozens of calls I make to pandas groupby using Cython-optimised functions. These include a straight groupby, a groupby with ranking, and others. I have one that does a groupby: the cell compiles and runs in my notebook, but when the function is called I get a NameError.
Here is the test code from my notebook (in 3 cells there):
%%cython
def _set_indices(keys_as_int, n_keys):
    import numpy
    cdef int i, j, k
    cdef object[:, :] indices = [[i for i in range(0)] for _ in range(n_keys)]
    for j, k in enumerate(keys_as_int):
        indices[k].append(j)
    return [([numpy.array(elt) for elt in indices])]
def group_by(keys):
    _, first_occurrences, keys_as_int = np.unique(keys, return_index=True, return_inverse=True)
    n_keys = max(keys_as_int) + 1
    indices = [[] for _ in range(max(keys_as_int) + 1)]
    print(str(keys_as_int) + str(n_keys) + str(indices))
    indices = _set_indices(keys_as_int, n_keys)
    return indices
%%timeit
result = group_by(['1', '2', '3', '1', '3'])
print(str(result))
The error I get is:
<ipython-input-20-3f8635aec47f> in group_by(keys)
4 indices = [[] for _ in range(max(keys_as_int) + 1)]
5 print(str(keys_as_int) + str(n_keys) + str(indices))
----> 6 indices = _set_indices(keys_as_int, n_keys)
7 return indices
NameError: name '_set_indices' is not defined
Can someone explain whether this is due to the notebook, or whether I have done something wrong in the way Cython is used? I am new to it.
Also, any hints towards a strongly typed, cache-friendly solution are most welcome.
You need to put your _set_indices function in the same cell, or you need to explicitly import it. From the Compiling with a Jupyter Notebook documentation:
Note that each cell will be compiled into a separate extension module.
After compilation, you do have a global name _set_indices, but that doesn't make it available as a global in the separate extension module for the group_by() function.
You'll need to put the two function definitions into the same cell, or create a separate module for the utility functions.
Note that there is also another issue with the code; you can't just create a typed memory view from a list of integers:
Traceback (most recent call last):
File "so58378716.pyx", line 22, in init so58378716
result = group_by(['1', '2', '3', '1', '3'])
File "so58378716.pyx", line 19, in so58378716.group_by
indices = _set_indices(keys_as_int, n_keys)
File "so58378716.pyx", line 6, in so58378716._set_indices
cdef object[:, :] indices = [[i for i in range(0)] for _ in range(n_keys)]
File "stringsource", line 654, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'list'
You'd have to create an actual numpy array, or use a cython.view.array object, or an array.array.
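Putting both fixes together, a minimal sketch of a single %%cython cell (plain Python lists replace the memoryview, since a ragged list of lists can't back one):
%%cython
import numpy as np

def _set_indices(keys_as_int, int n_keys):
    cdef int j, k
    # one plain Python list per group; you can't append to a typed memoryview
    indices = [[] for _ in range(n_keys)]
    for j, k in enumerate(keys_as_int):
        indices[k].append(j)
    return [np.array(elt) for elt in indices]

def group_by(keys):
    _, first_occurrences, keys_as_int = np.unique(keys, return_index=True, return_inverse=True)
    n_keys = keys_as_int.max() + 1
    # _set_indices now lives in the same extension module, so the name resolves
    return _set_indices(keys_as_int, n_keys)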
Input code:
best = sorted(word_scores.items(), key=lambda w, s: s, reverse=True)[:10000]
Result:
Traceback (most recent call last):
File "C:\Users\Sarah\Desktop\python\test.py", line 78, in <module>
best = sorted(word_scores.items(), key=lambda w, s: s, reverse=True)[:10000]
TypeError: <lambda>() missing 1 required positional argument: 's'
How do I solve it?
If I've understood the format of your word_scores dictionary correctly (that the keys are words and the values are integers representing scores), and you're simply looking to get an ordered list of words with the highest scores, it's as simple as this:
best = sorted(word_scores, key=word_scores.get, reverse=True)[:10000]
If you want to use a lambda to get an ordered list of tuples, where each tuple is a word and a score, and they are ordered by score, you can do the following:
best = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:10000]
The difference between this and your original attempt is that I have passed one argument (x) to the lambda, and x is a tuple of length 2 - x[0] is the word and x[1] is the score. Since we want to sort by score, we use x[1]. Your original lambda declared two separate parameters, but sorted passes each item as a single tuple, so the second argument (s) was never supplied; Python 3 no longer unpacks tuples in a parameter list (PEP 3113).
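Equivalently, operator.itemgetter from the standard library does the same job:
from operator import itemgetter
best = sorted(word_scores.items(), key=itemgetter(1), reverse=True)[:10000]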
In TensorFlow, I have a tensor with 512 rows and 2 columns. What I want to do is: filter column 2 of the tensor on the basis of the unique values of column 1, and then, for each unique value of column 1, process the corresponding values of column 2 in an inner loop.
So, as an example, I have a 2-dimensional tensor whose value (after evaluating it in a session) looks like the following:
[[ 509, 270],
[ 533, 568],
[ 472, 232],
...,
[ 6, 276],
[ 331, 165],
[ 401, 1144]]
509, 533, 472 ... are elements of column1 and 270, 568, 232,... are elements of column 2.
Is there a way I can define the following 2 steps within the graph (not while executing the session):
1. get unique values of column1
2. for each `unique_value` in column1:
       values_in_column2 = values in column2 corresponding to `unique_value` (filter column2 according to `unique_value`)
       some_function(values_in_column2)
I can do the above steps while running the session, but I would like to define these 2 steps in the graph, which I can then run in a session after defining many subsequent steps.
Is there any way to do this? I'd appreciate any help in this regard.
Here is pseudocode for what I want to do:
tensor1 = tf.stack([column1, column2], axis = 1)
column1 = tensor1[0, :]
unique_column1, unique_column1_indexes = tf.unique(column1)
for unique_column1_value in unique_column1:
    column1_2_indexes = tf.where(column1 == unique_column1_value)
    corresponding_column2_values = tensor1[column1_2_indexes][:, 1]
But as of now it gives an error:
TypeError: 'Tensor' object is not iterable.
at the following line:
for unique_column1_value in unique_column1.
I have looked at this question: "TypeError: 'Tensor' object is not iterable" error with tensorflow Estimator, but it does not apply to my case.
I understand that I need to use while_loop but I don't know how.
Update: there is a solution for the case when column1 is sorted, here. Note that this is also a feature request for the more general version, but it was closed for inactivity. The sorted-version solution looks like this:
column1 = tf.constant([1,2,2,2,3,3,4])
column2 = tf.constant([5,6,7,8,9,10,11])
tensor1 = tf.stack([column1, column2], axis = 1)
unique_column1, unique_column1_indices, counts = tf.unique_with_counts(column1)
unique_ix = tf.cumsum(tf.pad(counts,[[1,0]]))[:-1]
output = tf.gather(tensor1, unique_ix)
which outputs: [[1, 5], [2, 6], [3, 9], [4, 11]]
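For the general (unsorted) case, one in-graph option (a sketch of my own, not from the linked solution) is tf.map_fn over the unique values, with tf.boolean_mask doing the per-value filtering; it assumes some_function returns the same shape for every group:
unique_vals, _ = tf.unique(column1)

def per_group(value):
    # select the column2 entries whose column1 entry equals `value`
    group = tf.boolean_mask(column2, tf.equal(column1, value))
    return tf.reduce_sum(group)  # stand-in for some_function

# the loop is part of the graph; no Python-level iteration over tensors
result = tf.map_fn(per_group, unique_vals)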
First time posting on Stack Overflow, so please bear with me if I'm making a faux pas :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
start lat start long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
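For example (using the first row of the dataframe above):
dist_calc(38.902760, -77.038630, 38.880300, -76.986200)  # returns roughly 3.22 (miles)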
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to Python; any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 5, in dist_calc2
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
super(vincenty, self).__init__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
kilometers += self.measure(a, b)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')
The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
start lat start.1 long end_lat end_long
0 0 38.902760 -77.038630 38.880300 -76.986200 NaN
1 2 38.895914 -77.026064 38.915400 -77.044600 NaN
2 3 38.888251 -77.049426 38.895914 -77.026064 NaN
3 4 38.892300 -77.043600 38.888251 -77.049426 NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice the column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error you did.
It's usually better to try to fix this before it gets imported as a dataframe. I can think of 2 methods:
1. Clean the data before importing; for example, copy it into an editor and replace the offending spaces with underscores. This is the easiest.
2. Use a regex to fix it during import. This may be necessary if the dataset is very large, or it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename_axis({'start lat': 'start_lat', 'start long': 'start_long'}, axis=1)
In [37]: df
Out[37]:
start_lat start_long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
The regex specifies that separators must contain either 2+ whitespace characters, or 1 whitespace followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
From this point your function / apply works fine, but I've changed it a little:
- PEP8 recommends putting imports at the top of each file, rather than in a function.
- Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.
For example:
In [51]: def dist_calc(row):
...: start = row[['start_lat','start_long']]
...: end = row[['end_lat', 'end_long']]
...: return vincenty(start, end).miles
...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0 3.223232
2 1.674780
3 1.365851
4 0.420305
dtype: float64
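One caveat worth adding: vincenty was removed in geopy 2.0, so on newer installs the drop-in replacement is geopy.distance.geodesic (same call pattern):
from geopy.distance import geodesic

def dist_calc(row):
    start = (row['start_lat'], row['start_long'])
    end = (row['end_lat'], row['end_long'])
    return geodesic(start, end).miles

distances = df.apply(dist_calc, axis=1)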