PySpark error - Cannot resolve column name "age" among (_c0, _c1) [closed] - apache-spark

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
When I execute the below script:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("Filterexample"). getOrCreate()
data01 = [(1, "John", 25), (2, "James", 30), (3, "Thomas", 35), (4, "mohi", 44)]
df01 = spark.createDataFrame(data01, ["id", "name", "age"])
filterDf01 = df01.filter( df["age"] > 30)
filterDf01.show()
I get the following error:
AnalysisException Traceback (most recent call last)
<ipython-input-49-f8a2b46cec55> in <module>
3 data01 = [(1, "John", 25), (2, "James", 30), (3, "Thomas", 35), (4, "mohi", 44)]
4 df01 = spark.createDataFrame(data01, ["id", "name", "age"])
----> 5 filterDf01 = df01.filter( df["age"] > 30)
6 filterDf01.show()
2 frames
/content/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
AnalysisException: Cannot resolve column name "age" among (_c0, _c1)
Please advise how to resolve this error. Since I applied a filter to display rows with age greater than 30, I expect the following output:
(3, "Thomas", 35), (4, "mohi", 44)

Related

pandas fuzzy match on the same column but prevent matching against itself

This is a common question but I have an extra condition: how do I remove matches based on a unique ID? Or, how to prevent matching against itself?
Given a dataframe:
df = pd.DataFrame({'id':[1, 2, 3],
                   'name':['pizza','pizza toast', 'ramen']})
I used solutions like this one to create a multi-index dataframe:
Fuzzy match strings in one column and create new dataframe using fuzzywuzzy
from fuzzywuzzy import fuzz
df_copy = df.copy()
compare = pd.MultiIndex.from_product([df['name'], df_copy['name']]).to_series()
def metrics(tup):
    # Score one (name, other_name) pair with two fuzzywuzzy ratios
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
So that's great but how can I use the unique ID to prevent matching against itself?
If there's a case of ID/name = 1/pizza and 10/pizza, obviously I want to keep those. But I need to remove the same ID in both indexes.
I suggest a slightly different approach that gives the same result, using the difflib module from the Python standard library, which provides helpers for computing deltas.
So, with the following dataframe in which pizza has two different ids (and thus should be checked against one another later on):
import pandas as pd
df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "name": ["pizza", "pizza toast", "ramen", "pizza"]}
)
Here is how you can find similarities between different id/name combinations, but avoid checking an id/name combination against itself:
from difflib import SequenceMatcher
# Define a simple helper function
def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()
And then, with the following steps:
# Create a column of unique identifiers: (id, name)
df["id_and_name"] = list(zip(df["id"], df["name"]))
# Calculate ratio only for different id_and_names
df = df.assign(
    match=df["id_and_name"].map(
        lambda x: {
            value: ratio(x[1], value[1])
            for value in df["id_and_name"]
            if x[0] != value[0] or ratio(x[1], value[1]) != 1
        }
    )
)
# Format results in a readable fashion
df = (
    pd.DataFrame(df["match"].to_list(), index=df["id_and_name"])
    .reset_index(drop=False)
    .melt("id_and_name", var_name="other_id_and_name", value_name="ratio")
    .dropna()
    .sort_values(by=["id_and_name", "ratio"], ascending=[True, False])
    .reset_index(drop=True)
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"] * 100))
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"].astype(int)))
)
You get the expected result:
print(df)
# Output
         id_and_name  other_id_and_name  ratio
0         (1, pizza)         (4, pizza)    100
1         (1, pizza)   (2, pizza toast)     62
2         (1, pizza)         (3, ramen)     20
3   (2, pizza toast)         (4, pizza)     62
4   (2, pizza toast)         (1, pizza)     62
5   (2, pizza toast)         (3, ramen)     12
6         (3, ramen)         (4, pizza)     20
7         (3, ramen)         (1, pizza)     20
8         (3, ramen)   (2, pizza toast)     12
9         (4, pizza)         (1, pizza)    100
10        (4, pizza)   (2, pizza toast)     62
11        (4, pizza)         (3, ramen)     20
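If pandas is not strictly needed for the comparison step, the same idea (compare every row with every other row, but never with itself) can be sketched with the standard library alone. This is only an illustration: the `records` list and the `pairs` name are assumptions made for this example, and `itertools.permutations` is used so that self-pairs never get generated in the first place.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical input: (id, name) rows, with "pizza" appearing under two ids
records = [(1, "pizza"), (2, "pizza toast"), (3, "ramen"), (4, "pizza")]


def ratio(a, b):
    # Similarity between two strings, from 0.0 to 1.0
    return SequenceMatcher(None, a, b).ratio()


# permutations(records, 2) yields every ordered pair of two different rows,
# so a row is never compared with itself, even when names are identical
pairs = [
    (left, right, int(ratio(left[1], right[1]) * 100))
    for left, right in permutations(records, 2)
]

for left, right, score in pairs:
    print(left, right, score)
```

Rows 1 and 4 share the name "pizza" but have different ids, so they are still compared with each other and score 100.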

Unable to calculate silhouette_score using a sparse matrix in sklearn [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
I am trying to calculate silhouette_score or silhouette_samples using a sparse matrix but getting the following error:
ValueError: diag requires an array of at least two dimensions
The sample code is as follows:
import networkx as nx
edges = [
    (1, 2, 0.9),
    (1, 3, 0.7),
    (1, 4, 0.1),
    (1, 5, 0),
    (1, 6, 0),
    (2, 3, 0.8),
    (2, 4, 0.2),
    (2, 5, 0),
    (2, 6, 0.3),
    (3, 4, 0.3),
    (3, 5, 0.2),
    (3, 6, 0.25),
    (4, 5, 0.8),
    (4, 6, 0.6),
    (5, 6, 0.9),
    (7, 8, 1.0)]
gg = nx.Graph()
for u,v, w in edges:
    gg.add_edge(u, v, weight=w)
adj = nx.adjacency_matrix(gg)
adj.setdiag(0)
from sklearn.metrics import silhouette_score, silhouette_samples
print(silhouette_score(adj, metric='precomputed', labels=labels))
silhouette_samples(adj, metric='precomputed', labels=labels)
This is a bug, and you should report it. The relevant code is:
X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
# Check for non-zero diagonal entries in precomputed distance matrix
if metric == 'precomputed':
    atol = np.finfo(X.dtype).eps * 100
    if np.any(np.abs(np.diagonal(X)) > atol):
        raise ValueError(
            'The precomputed distance matrix contains non-zero '
            'elements on the diagonal. Use np.fill_diagonal(X, 0).'
        )
Although the input checking explicitly accepts CSC/CSR matrices, if metric is 'precomputed' it drops X into numpy functions that don't work on sparse matrices.
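Until that is fixed, one possible workaround is to hand the function a dense copy of the matrix, assuming it is small enough to fit in memory. The sketch below only demonstrates the diagonal check on a made-up 3x3 distance matrix (`dist` and its values are assumptions for illustration). It has not been checked against every scikit-learn version, so treat it as a starting point rather than a confirmed fix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical precomputed distance matrix stored in CSR format
dist = csr_matrix(np.array([
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]))

# Sparse matrices expose their own diagonal() method
sparse_diag = dist.diagonal()

# A dense copy is a regular ndarray, so np.diagonal works on it
dense = dist.toarray()
atol = np.finfo(dense.dtype).eps * 100
has_nonzero_diagonal = bool(np.any(np.abs(np.diagonal(dense)) > atol))

print(sparse_diag, has_nonzero_diagonal)
```

With the question's data, the equivalent step would be to pass `adj.toarray()` instead of `adj` as the first argument.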

Hackerrank problem not working for large data within time constraints(stack problem to print maximum element)

The hackerrank problem statement is:
You have an empty sequence, and you will be given N queries. Each query is one of these three types:
1 x - Push the element x into the stack.
2 - Delete the element present at the top of the stack.
3 - Print the maximum element in the stack.
Input Format
The first line of input contains an integer, N.
The next N lines each contain one of the queries mentioned above. (It is guaranteed that each query is valid.)
Constraints:
Output Format
For each type 3 query, print the maximum element in the stack on a new line.
Sample Input
10
1 97
2
1 20
2
1 26
1 20
2
3
1 91
3
Sample Output
26
91
My code:
n=int(input())
class Stack:
    def __init__(self):
        self.stack1=[]
    def push(self,x):
        return self.stack1.append(x)
    def pop(self):
        self.stack1.pop()
        return
    def maximum(self):
        return max(self.stack1)
stack_object=Stack()
for _ in range(n):
    a=list(map(int,input().split()))
    if a[0]==1:
        stack_object.push(a[1])
    elif a[0]==2:
        stack_object.pop()
    else:
        print(stack_object.maximum())
With this algorithm, which has time complexity O(n^2), I am able to pass 16 out of 27 test cases.
Can someone share a more optimized solution to the problem, with time complexity O(n)?
Thanks in advance.
There is a simple O(n) algorithm.
Instead of pushing x onto the top of the stack, simply push max(x, current_top).
Then the top of the stack always holds the maximum of all values currently in the stack.
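A minimal sketch of that idea follows. The function name `process_queries` and the tuple encoding of the queries are assumptions made for this example (the real problem reads the queries from standard input). Each query runs in constant time, so N queries take O(N) overall.

```python
def process_queries(queries):
    """Run the three query types and return the values printed by type 3."""
    stack = []    # stack[i] is the maximum of the first i + 1 pushed values
    printed = []
    for query in queries:
        if query[0] == 1:
            x = query[1]
            # Store the running maximum instead of x itself
            stack.append(x if not stack else max(x, stack[-1]))
        elif query[0] == 2:
            stack.pop()
        else:
            printed.append(stack[-1])
    return printed


# The sample input from the problem statement, as hypothetical tuples
sample = [(1, 97), (2,), (1, 20), (2,), (1, 26), (1, 20), (2,), (3,), (1, 91), (3,)]
print(process_queries(sample))  # [26, 91]
```

Because only the maximum is ever queried, the original values do not need to be kept.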
A stack is like a tower of elements. Imagine what each of the actions you listed would look like if it needed to work on tuples rather than numbers, of the form:
(number, h)
where h is the highest element at this level or lower in the tower. For example:
input: 1 8, 1 6, 1 9, 1 5, 1 10, 3, 2, 3, 2, 2, 3
query  out  stack
1 8         [(8, 8)]
1 6         [(8, 8), (6, 8)]
1 9         [(8, 8), (6, 8), (9, 9)]
1 5         [(8, 8), (6, 8), (9, 9), (5, 9)]
1 10        [(8, 8), (6, 8), (9, 9), (5, 9), (10, 10)]
3      10
2           [(8, 8), (6, 8), (9, 9), (5, 9)]
3      9
2           [(8, 8), (6, 8), (9, 9)]
2           [(8, 8), (6, 8)]
3      8
Working code:
n=int(input())
class Stack:
    def __init__(self):
        self.stack1=[(None, -float('inf'))]
    def push(self,x):
        return self.stack1.append((x, max(self.maximum(), x)))
    def pop(self):
        self.stack1.pop()
        return
    def maximum(self):
        return self.stack1[-1][1]
stack_object=Stack()
for _ in range(n):
    a=list(map(int,input().split()))
    if a[0]==1:
        stack_object.push(a[1])
    elif a[0]==2:
        stack_object.pop()
    else:
        print(stack_object.maximum())

Creating set from nested list in python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
marksheet = [['harry',87], ['bob', 76], ['bucky', 98]]
print(set([marks for name, marks in marksheet]))
output: {98, 76, 87}
Can someone please explain how this works?
You're iterating over marksheet and unpacking each inner list into two values: name, which you ignore, and marks, which you collect into a list. The list you just created is then passed to set, which creates a set. You can break the code down step by step:
marksheet = [['harry',87], ['bob', 76], ['bucky', 98]]
In [40]: marksheet
Out[40]: [['harry', 87], ['bob', 76], ['bucky', 98]]
In [41]: l = [marks for name, marks in marksheet]
In [42]: l
Out[42]: [87, 76, 98]
You can also surround the values you're extracting in parentheses to help make it more clear:
In [43]: l = [marks for (name, marks) in marksheet]
In [44]: l
Out[44]: [87, 76, 98]
Some people use _ to denote the returned value is ignored:
In [45]: l = [marks for (_, marks) in marksheet]
In [46]: l
Out[46]: [87, 76, 98]
The above is an example of a list comprehension. It is equivalent to:
In [47]: l=[]
In [48]: for (name, marks) in marksheet:
    ...:     l.append(marks)
    ...:
In [49]: l
Out[49]: [87, 76, 98]
From there you are simply passing the list to set, which can take an iterable. In this case, the list you just created is the iterable:
In [50]: set(l)
Out[50]: {76, 87, 98}
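As a small extension of the steps above, the intermediate list can be skipped with a set comprehension. The extra `'amy'` row is a hypothetical addition for this sketch, included to show that a set keeps each mark only once. A set is also unordered, which is why the same elements can print in a different order (`{98, 76, 87}` in the question, `{76, 87, 98}` above).

```python
# 'amy' is a made-up extra row that repeats the mark 87
marksheet = [['harry', 87], ['bob', 76], ['bucky', 98], ['amy', 87]]

# Curly braces build the set directly, without an intermediate list
unique_marks = {marks for _, marks in marksheet}

# Same result as wrapping the list comprehension in set()
same_result = set([marks for _, marks in marksheet])

print(unique_marks == same_result)  # True
print(len(unique_marks))            # 3
```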

Write a Python function histogram(l) that takes as input a list of integers with repetitions and returns a list of pairs [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Write a Python function histogram(l) that takes as input a list of integers with repetitions and returns a list of pairs as follows:
For each number n that appears in l, there should be exactly one pair (n,r) in the list returned by the function, where r is the number of repetitions of n in l.
The final list should be sorted in ascending order by r, the number of repetitions. For numbers that occur with the same number of repetitions, arrange the pairs in ascending order of the value of the number.
For instance:
histogram([13,12,11,13,14,13,7,7,13,14,12])
[(11, 1), (7, 2), (12, 2), (14, 2), (13, 4)]
histogram([7,12,11,13,7,11,13,14,12])
[(14, 1), (7, 2), (11, 2), (12, 2), (13, 2)]
>>> def histogram(L):
...     from collections import Counter
...     # Sort by count first, then by value, as the question requires
...     return sorted(Counter(L).items(), key=lambda pair: (pair[1], pair[0]))
...
>>> histogram([13,12,11,13,14,13,7,7,13,14,12])
[(11, 1), (7, 2), (12, 2), (14, 2), (13, 4)]
>>> histogram([7,12,11,13,7,11,13,14,12])
[(14, 1), (7, 2), (11, 2), (12, 2), (13, 2)]
def histogram(l):
    b=[]
    x=[]
    f=[]
    for i in range(len(l)):
        a=()
        if l[i] not in x:
            c=l.count(l[i])
            f.append(c)
            f.sort()
            a=a+(l[i],c)
            b.append(a)
            x.append(l[i])
            del a
        else:
            continue
    b=sorted(b,key=lambda f: (f[1],f[0]))
    return b
If you have any questions about my code, feel free to ask.
And if my code needs any improvements, please tell me.
def histogram(l):
    count = 0
    x=[]
    k=[]
    for i in range(len(l)):
        index=i
        count=0
        for j in range(index,len(l)):
            if l[index] == l[j] and l[index] not in k :
                count =count + 1
        k = k + [l[index]]
        if (count != 0):
            x = x + [(l[index], count)]
    x.sort()
    x=sorted(x,key=lambda x:x[1])
    return x
#print(histogram([13,12,11,13,14,13,7,7,13,14,12]))
def histogram(l):
    a=[]
    ans=[]
    for n in l:
        if n not in a:
            r=l.count(n)
            a.append(n)
            y=(n,r)
            ans.append(y)
    ans.sort()
    ans.sort(key=lambda r :r[1])
    return ans
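The loop-based versions above rescan the list for each distinct value, either with l.count or with a nested loop. As a hedged alternative, the counts can be gathered in one pass with a plain dictionary and then sorted once. The name `histogram_single_pass` is made up for this sketch; the expected results are the two examples from the question.

```python
def histogram_single_pass(values):
    counts = {}
    for n in values:
        # One dictionary update per element, no rescanning of the list
        counts[n] = counts.get(n, 0) + 1
    # Sort by repetitions first, then by the number itself to break ties
    return sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))


print(histogram_single_pass([13, 12, 11, 13, 14, 13, 7, 7, 13, 14, 12]))
# [(11, 1), (7, 2), (12, 2), (14, 2), (13, 4)]
print(histogram_single_pass([7, 12, 11, 13, 7, 11, 13, 14, 12]))
# [(14, 1), (7, 2), (11, 2), (12, 2), (13, 2)]
```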
