Replace double loops python with apply - python-3.x

Does anyone know if it is possible to replace a double loop in Python with something faster, like the apply function?
For instance, I have this dataframe:
df = pd.DataFrame()
df["col_1"] = ["hello", "salut","hello", "bye", "bye","hi","hello", "hello"]
df["col_2"] = ["dog", "dog", "dog", "cat", "cat", "mouse","dog","cat"]
df["col_3"] = [100,45,100,51,51,32,100,85]
and this function :
def f(l1, l2):
    if list(l1) == list(l2):
        return 1
    else:
        return 0
It returns 1 if the two lists are identical and 0 otherwise. I would like to apply this function to create a column "similar".
I can easily do this with a double loop, but I would like something faster with lower complexity.
Thank you for your help ! :)

Basically, you want to find rows whose column values are duplicated and mark them as 1 in the "similar" column. pandas.DataFrame.duplicated does exactly that; you just have to do:
df.duplicated(keep=False)
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html#pandas.DataFrame.duplicated ; keep=False marks all duplicates as True.
Then you just need to convert the booleans to int:
df['similar'] = df.duplicated(keep=False).astype(int)
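As a quick check of the approach above, here is a minimal sketch that rebuilds the question's dataframe and flags the duplicated rows. The `subset` variant is an illustrative addition that was not part of the original answer; it assumes you only want to compare some of the columns.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": ["hello", "salut", "hello", "bye", "bye", "hi", "hello", "hello"],
    "col_2": ["dog", "dog", "dog", "cat", "cat", "mouse", "dog", "cat"],
    "col_3": [100, 45, 100, 51, 51, 32, 100, 85],
})

# keep=False marks every member of a duplicate group, not only the repeats
df["similar"] = df.duplicated(keep=False).astype(int)

# Illustrative variant: compare only col_1 and col_2 when looking for duplicates
similar_on_two = df.duplicated(subset=["col_1", "col_2"], keep=False).astype(int)

print(df)
```

This avoids the pairwise comparison entirely, since pandas groups identical rows internally instead of comparing each row with every other row.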

Related

How to code in such a way that you are taking an element from a list and comparing it with all other elements of another list in python

If an element matches the string, increment a counter; otherwise continue.
For example, if x = ["you","me","us","then"] and y = ["hi","king","you","you","thai","you"], the code should take the string "you", compare it with every element in y, increment a variable on each match, and return 3.
Note: the code should not stop after the first match with "you"; it should keep searching until the end of the list.
x = ["you","me","us","then"]
y = ["hi","king","you","you","thai","you"]
word_count = {}
for each_word in x:
    word_count[each_word] = y.count(each_word)
print(word_count)
# result: {'you': 3, 'me': 0, 'us': 0, 'then': 0}
If this is not the answer you are looking for, please explain the question by providing a sample of the result you expect.
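Calling y.count() once per word rescans y every time. A minimal alternative sketch, assuming the same x and y as above, counts y once with collections.Counter and then looks up each word; the name `y_counts` is only illustrative.

```python
from collections import Counter

x = ["you", "me", "us", "then"]
y = ["hi", "king", "you", "you", "thai", "you"]

# Count every element of y in a single pass
y_counts = Counter(y)

# A Counter returns 0 for keys it has never seen, so missing words need no special case
word_count = {word: y_counts[word] for word in x}

print(word_count)
```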

Is this the valid "if" expression for not printing the names of less than four characters [duplicate]

I would like to filter out data whose string length is not equal to 10.
To drop any row where the string in column A or B is not 10 characters long, I tried this:
df=pd.read_csv('filex.csv')
df.A=df.A.apply(lambda x: x if len(x)== 10 else np.nan)
df.B=df.B.apply(lambda x: x if len(x)== 10 else np.nan)
df=df.dropna(subset=['A','B'], how='any')
This is slow, but it works.
However, it sometimes produces an error when the data in A is not a string but a number (interpreted as a number when read_csv reads the input file):
File "<stdin>", line 1, in <lambda>
TypeError: object of type 'float' has no len()
I believe there should be a more efficient and elegant way to do this.
Based on the answers and comments below, the simplest solutions I found are:
df=df[df.A.apply(lambda x: len(str(x))==10)]
df=df[df.B.apply(lambda x: len(str(x))==10)]
or
df=df[(df.A.apply(lambda x: len(str(x))==10)) & (df.B.apply(lambda x: len(str(x))==10))]
or
df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)]
import pandas as pd
df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)
Applied to filex.csv:
A,B
123,abc
1234,abcd
1234567890,abcdefghij
the code above prints
            A           B
2  1234567890  abcdefghij
A more Pythonic way of filtering out rows based on given conditions of other columns and their values:
Assuming a df of:
data = {
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
}
df=pd.DataFrame(data)
df:
  age        cars  names
0   1       Civic  Alice
1   4         BMW    Zac
2   2  Mitsubishi   Anna
3   0        Benz      O
Then:
df[
    df["names"].apply(lambda x: len(x) > 1)
    & df["cars"].apply(lambda x: "i" in x)
    & df["age"].apply(lambda x: int(x) < 2)
]
We will have :
  age   cars  names
0   1  Civic  Alice
In the conditions above, we first check the length of the strings in names, then whether the letter "i" appears in the strings in cars, and finally whether the integer value of age is less than 2.
I personally found this way to be the easiest:
df = df[df['column_name'].str.len() == 10]
You can also use query:
df.query('A.str.len() == 10 & B.str.len() == 10')
If you have numbers in the rows, they will be converted to floats.
Convert all the rows to strings after importing from CSV. For better performance, split the lambdas across multiple threads.
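To illustrate the float conversion mentioned above, here is a minimal sketch using an in-memory CSV (the sample data is made up for illustration). It assumes that reading the columns as text with `dtype=str` is acceptable for your data, which avoids the number 1234567890 later turning into the 12-character text '1234567890.0'.

```python
import io
import pandas as pd

# Hypothetical sample: the second row has a missing value in column A
csv_text = "A,B\n1234567890,abcdefghij\n,abcdefghij\n123,abc\n"

# Default parsing: column A holds numbers plus a missing value, so it becomes float64
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default["A"].dtype)

# dtype=str keeps the values as text; the missing value stays NaN
df_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# A missing value never has length 10, so that row is dropped by the mask
mask = (df_text["A"].str.len() == 10) & (df_text["B"].str.len() == 10)
filtered = df_text[mask]
print(filtered)
```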
You can use df['A'].apply(len); it returns the length of each value in the column.
For string operations such as this, plain Python using built-in functions (without lambda) is much faster than apply() or str.len().
Building a boolean mask by mapping len over each string inside a list comprehension is approximately 40-70% faster than apply() and str.len(), respectively.
For multiple columns, zip() lets you evaluate values from different columns together.
col_A_len = map(len, df['A'].astype(str))
col_B_len = map(len, df['B'].astype(str))
m = [a==3 and b==3 for a,b in zip(col_A_len, col_B_len)]
df1 = df[m]
For a single column, drop zip() and loop over the column and check if the length is equal to 3:
df2 = df[[a==3 for a in map(len, df['A'].astype(str))]]
This code can be written a little more concisely using the Series.map() method (but it is a little slower than the list comprehension due to pandas overhead):
df2 = df[df['A'].astype(str).map(len)==3]
Filter out values whose length is not 10 from columns A and B. Here I pass a lambda expression to the map() function, which applies it to each element of the Series.
df = df[df['A'].map(lambda x: len(str(x)) == 10)]
df = df[df['B'].map(lambda x: len(str(x)) == 10)]
You could use applymap to test all the columns you want at once, followed by the .all() method to keep only the rows where both columns are True. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)
# The mask variable is a dataframe of booleans, giving True or False for the selected condition
mask = df[['A','B']].applymap(lambda x: len(str(x)) == 10)
# Use the mask to filter your rows: .all(axis=1) keeps only rows where every column is True; .any(axis=1) is available for other needs
df = df[mask.all(axis=1)]

Python3: comparing dynamic list to create a regexp

I am currently writing a class to create a regexp.
As input, we have 3 sentences in a list ("textContent"), and the output regexp should match all 3 sentences.
For this, I use zip. The code below works.
from array import *
textContent = []
textContent.append("The sun is.shining")
textContent.append("the Sun is ShininG")
textContent.append("the_sun_is_shining")
s = ""
for x, y, z in zip(textContent[0], textContent[1], textContent[2]):
    if x == y == z:
        s+=str(x)
    else:
        s+="."
#answer is ".he..un.is..hinin."
print(s)
It works, but ONLY with 3 sentences in the list.
Now I want the same comparison with a dynamic list that could contain, for example, 2 or 256 sentences. I'm stuck: I don't know how to adjust the code for that.
I noticed that the following throws no error:
zip(*textContent)
So I'm stuck on the variables I used for the comparison before: x, y, z
for x, y, z in zip(*textContent):
This works only if textContent contains exactly 3 values...
Any ideas? Maybe something other than zip could do the job.
Thanks
This will solve your problem with zipping and comparing:
l = ['asd', 'agd', 'add', 'abd']
for letters in list(zip(*l)):
    if all([letters[0] == letter for letter in letters]):
        print('Yey')
    else:
        print('Ugh')
>>> Yey
>>> Ugh
>>> Yey
And for l = ['asd', 'agg', 'add', 'cbb'] it will print 3 'Ugh'.
Also, you should check that l is not empty.
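Applied to the original question, the same idea can be sketched as a small helper that builds the pattern for any number of sentences. The function name `common_pattern` is hypothetical, and the use of `re.escape` for shared characters is an added assumption so that literal characters such as "." or "(" are not read as regex syntax.

```python
import re

def common_pattern(sentences):
    """Keep characters shared by all sentences; use '.' where they differ."""
    if not sentences:
        return ""
    parts = []
    # zip(*sentences) yields one tuple of characters per position
    # and stops at the end of the shortest sentence
    for chars in zip(*sentences):
        if len(set(chars)) == 1:
            parts.append(re.escape(chars[0]))
        else:
            parts.append(".")
    return "".join(parts)

text_content = [
    "The sun is.shining",
    "the Sun is ShininG",
    "the_sun_is_shining",
]
pattern = common_pattern(text_content)
print(pattern)
```

Because the loop unpacks each position into a single tuple instead of named variables, the list can hold 2 sentences or 256 without any change to the code.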

How can i optimise my code and make it readable?

The task is:
The user enters a number. You take one digit from the left and one from the right and sum them. Then you take the rest of the number and sum every digit in it. That gives you two results. You have to sort them from largest to smallest and join them into a single number. I solved it, but I don't like how it looks. I mean, the task is pretty simple, but my code looks like trash. Maybe I should use more built-in functions and libraries. If so, could you please recommend some? Thank you
a = int(input())
b = [int(i) for i in str(a)]
closesum = 0
d = []
e = ""
farsum = b[0] + b[-1]
print(farsum)
b.pop(0)
b.pop(-1)
print(b)
for i in b:
    closesum += i
print(closesum)
d.append(int(closesum))
d.append(int(farsum))
print(d)
for i in sorted(d, reverse = True):
    e += str(i)
print(int(e))
input()
You can use reduce
from functools import reduce
a = [0,1,2,3,4,5,6,7,8,9]
print(reduce(lambda x, y: x + y, a))
# 45
and you can just pass in a shortened list instead of popping elements: b[1:-1]
The first two lines:
str_input = input() # input will always read strings
num_list = [int(i) for i in str_input]
The for loop at the end is unnecessary, and there is no need to sort only 2 elements. You can just use a simple if..else condition to print what you want.
You don't need a loop to sum a slice of a list. You can also use join to concatenate a list of strings without looping. Sort the numbers first and convert them to strings afterwards with map(str, ...); sorting the strings instead compares them lexicographically, so "9" would sort ahead of "10".
farsum = b[0] + b[-1]
closesum = sum(b[1:-1])
"".join(map(str, sorted((farsum, closesum), reverse=True)))

Common values in a Python dictionary

I'm trying to write a code that will return common values from a dictionary based on a list of words.
Example:
inp = ['here','now']
D = {'here':{1,2,3}, 'now':{2,3}, 'stop':{1, 3}}
for val in inp.intersection(D):
    lst = D[val]
    print(sorted(lst))
Expected output: [2, 3]
The input inp may contain any one or all of the above words, and I want to know what values they have in common. I just cannot seem to figure out how to do that. Please, any help would be appreciated.
The easiest way to do this is to count every value, and then build a list of the values whose count equals the number of sets you intersected.
To accomplish the first part, we do something like this:
answer = {}
for word in inp:
    for itm in D[word]:
        if itm in answer:
            answer[itm] += 1
        else:
            answer[itm] = 1
To accomplish the second part, we just have to iterate over answer and build a list like so:
answerArr = []
for i in answer:
    if answer[i] == len(inp):
        answerArr.append(i)
I'm not certain that I understood your question perfectly, but I think this is what you meant, albeit in a very simple way:
inp = ['here','now']
D = {'here':{1,2,3}, 'now':{2,3}, 'stop':{1, 3}}
values = []
for item in inp:
    values.extend(D[item])  # collect every value from each selected set
output = []
for value in set(values):
    if values.count(value) > 1:  # seen in more than one input
        output.append(value)
print(sorted(output))
This outputs the values that occur in more than one input. If you want values that are common to all of the inputs, change the condition to == len(inp).
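Since the dictionary values are already sets, another option is to intersect them directly. This is a minimal sketch rather than either answer's method; the function name `common_values` and the choice to skip words that are missing from the dictionary are assumptions.

```python
def common_values(words, index):
    """Return the sorted values shared by every selected word's set."""
    # Skip words that are not keys of the dictionary (an assumption for this sketch)
    sets = [index[word] for word in words if word in index]
    if not sets:
        return []
    # set.intersection accepts any number of sets, so one word or many both work
    return sorted(set.intersection(*sets))

D = {'here': {1, 2, 3}, 'now': {2, 3}, 'stop': {1, 3}}
print(common_values(['here', 'now'], D))
```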
