Elegant way to find indexes for lists within lists? - python-3.x

Pretty new to python. I'm trying to index items in CSV files by row/column. The only method I've found is implementing a for loop to search each row in the list.
readCSV = [['', 'A', 'B', 'C', 'D'],
[1.0, 3.1, 5.0, 1.7, 8.2],
[2.0, 6.2, 7.0, 2.2, 9.3],
[3.0, 8.8, 5.5, 4.4, 6.0]]
row_column = []
for row in readCSV:
if my_item in row:
row_column.append(row[0])
row_column.append(readCSV[0][row.index(my_item)])
So for my_item = 6.2, I get row_column = [2.0, 'A'].
This works fine, but I can't help thinking there's a more elegant solution.

Try this one:
result = [(i, j) for i, k in enumerate(readCSV) for j, n in enumerate(k) if my_item == n]

import pandas as pd
import numpy as np
df = pd.DataFrame(readCSV[1:],columns=readCSV[0])
#### Output ####
No A B C D
0 1.0 3.1 5.0 1.7 8.2
1 2.0 6.2 7.0 2.2 9.3
2 3.0 8.8 5.5 4.4 6.0
##This provides the row in which there is a hit.
df1 = df[(df.A == my_item) | (df.B == my_item) |(df.C == my_item) | (df.D == my_item)]
print(df1)
#### Output ####
No A B C D
1 2.0 6.2 7.0 2.2 9.3
##If you want only those column values which is a hit for your my_item.
z1 = pd.concat([df[df['A'] == my_item][['No','A']],df[df['B'] == my_item][['No','B']],df[df['C'] == my_item][['No','C']],df[df['D'] == my_item][['No','D']]])
print(z1)
#### Output ####
A B C D No
1 6.2 NaN NaN NaN 2.0
## Incase if you want drop the nan , you can use np.isnan
z1 = np.array(z1)
print(z1[:,~np.any(np.isnan(z1), axis=0)])
#### Output ####
[[6.2 2. ]]

Related

How to remove header from pandas Styler?

I want to remove the header from the pandas Styler so that I can render it.
What I have tried:
def highlight(x):
c1 = 'background-color: #f5f5dc'
df1 = pd.DataFrame('', index=x.index, columns=x.columns)
df1.loc[['A'], :] = c1
return df1
temp = {'col1': ['abc', 'def'], 'col2': [1.0, 2.0]}
df = pd.DataFrame(temp)
df.index = ['A', 'B']
print(df)
df.style.apply(highlight, axis=None).hide_index()
Output
col1 col2
-------------
abc 1.000000
def 2.000000
But I want to remove col1 and col2 as it comes in my post rendering page which I don't need.
Is there any way I can do it?
In pandas 1.4.0 both hide_index and hide_columns were deprecated (GH43758) in favour of hide with axis=...
With labels:
df.style.apply(highlight, axis=None).hide(axis='index').hide(axis='columns')
With axis numbers:
df.style.apply(highlight, axis=None).hide(axis=0).hide(axis=1)
In versions prior to 1.4.0 the hide_columns function can be used to hide the columns. Similar to how the hide_index function does for the index:
df.style.apply(highlight, axis=None).hide_index().hide_columns()

Is there a module that can count occurrences of a list of strings in Python?

I have defined a list that reads the contents of a number of files and stores all of them.
How do I create a dataframe, with each filename in a row, and the corresponding columns count the occurrence of each word and output it.
For the sake of example, assume this is all well-defined (but I can provide original code if needed):
#define list
words = [ file1_contents, file2_contents ]
file1_contents = "string with dogs, cats and my pet sea turtle that lives in my box with my other turtles."
file2_contents = "another string about my squirrel, box turtle (who lives in the sea), but not my cat or dog".
filter_words = ["cat", "dog", "box turtle", "sea horse"]
Output would be something like this:
output = {'file1'{'cat': 1, 'dog':1, 'box turtle': 1, 'sea horse': 0}, 'file2'{ ...}}
I have attached an image of my end goal. I am just beginning to use python, so I'm not too sure about what package/module I would use here? I know pandas lets you work with dataframes.
I had the idea of using Counter from collections
from collections import Counter
z = ['blue', 'red', 'blue', 'yellow', 'blue', 'red']
Counter(z)
Counter({'blue': 3, 'red': 2, 'yellow': 1})
BUT, here is where I am stuck. How do I organise a table in python which would look like the attached image?
Example output:
Idea is loop be each file content, filter values from list filter_words by re.findall, count by Counter and create dictionary for DataFrame:
file1_contents = "string with dogs, cats and my pet sea turtle that lives in my box with my other turtles."
file2_contents = "another string about my squirrel, box turtle (who lives in the sea), but not my cat or dog."
import re
from collections import Counter
words = {'file1': file1_contents, 'file2':file2_contents}
filter_words = ["cat", "dog", "box turtle", "sea horse"]
out = {}
for k, w in words.items():
new = []
for fw in filter_words:
new.extend(re.findall(r"{}".format(fw),w) )
out[k] = dict(Counter(new))
print (out)
{'file1': {'cat': 1, 'dog': 1}, 'file2': {'cat': 1, 'dog': 1, 'box turtle': 1}}
df = pd.DataFrame.from_dict(out, orient='index').fillna(0).astype(int)
print (df)
cat dog box turtle
file1 1 1 0
file2 1 1 1
There are quiet a few things to consider to get this right, such as handling punctuation, plurals, 1 term 2 term words etc.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')
import string
import pandas as pd
def preproc(x):
#make translator object
trans=str.maketrans('','',string.punctuation)
wnl = WordNetLemmatizer()
x = ' '.join([wnl.lemmatize(e) for e in x.translate(trans).split()])
return x
vectorizer = CountVectorizer(vocabulary=filter_words,
ngram_range=(1,2),
preprocessor=preproc)
X = vectorizer.fit_transform(words)
pd.DataFrame(columns=filter_words,
data=X.todense())
Output:
cat dog box turtle sea horse
0 1 1 0 0
1 1 1 1 0
from collections import Counter
df_st = pd.DataFrame()
for i in range(1,3):
filename = 'file'+str(i)+'.txt'
with open(filename,'r') as f:
list_words = []
word_count = 0
for line in f:
for word in line.split():
word_count = word_count + 1
list_words.append(word)
df2 = pd.DataFrame(index = (0,),data=Counter(list_words))
df2['0_word_count'] = word_count
df2['0_file_name'] = filename
df_st = df_st.append(df2, ignore_index=True)
df_st
Out[2]:
(who 0_file_name 0_word_count about and another box but cat cats ... pet sea sea), squirrel, string that the turtle turtles. with
0 NaN file1.txt 18 NaN 1.0 NaN 1 NaN NaN 1.0 ... 1.0 1.0 NaN NaN 1 1.0 NaN 1 1.0 2.0
1 1.0 file2.txt 18 1.0 NaN 1.0 1 1.0 1.0 NaN ... NaN NaN 1.0 1.0 1 NaN 1.0 1 NaN NaN

How do I get values of same indexes from multiple arrays using numpy?

a = np.array([ 0.1, 0.2, 0.4, 0.5, 0.9])
b = np.array([ 1.2, 1.5, 1.7, 2.0, 2.4])
c = np.array([ 4.1, 5.3, 5.1, 5.0, 3.2])
First, I have to find the elements where c >= 5 so I typed:
threshold = np.where( c >= 5 )
I then have to find the elements of a and b in the same index where c is the lowest value within that threshold. As we can see the lowest is c = 5.0 in this example so it should show me:
a = 0.5
b = 2.0
c = 5.0
I have no idea how to do so, please help me! Thanks.
Here's one way -
# Get thresholded indices
In [97]: threshold = np.where( c >= 5 )[0]
# Or np.flatnonzero( c>=5 )
# Index into c, get argmin indices among them, index back to original indices
In [98]: idx = threshold[c[threshold].argmin()]
# Finally extract values off the input arrays
In [99]: a[idx],b[idx],c[idx]
Out[99]: (0.5, 2.0, 5.0)

How to join each 2 rows with 2 column in one row in python?

I am trying to join each 2 rows with 2 columns into one column.
I have a data like that, And it is stored in a text file
7.0 1042.3784354104064 1041.8736266399212 0.0
7.0 567.603384919274 566.8152346188947 0.0
8.0 709.5076838990026 709.3588638367074 0.0
8.0 386.811514883702 386.6412338380912 0.0
The expected output will be like that
1042.3784354104064 1041.8736266399212 567.603384919274 566.8152346188947
709.5076838990026 709.3588638367074 386.811514883702 386.6412338380912
You can create a dictionary mapping your first column values to lists, and then populate those lists as you iterate through your matrix:
from collections import defaultdict
matrix = [[7.0, 1042.3784354104064, 1041.8736266399212, 0.0],
[7.0, 567.603384919274, 566.8152346188947, 0.0],
[8.0, 709.5076838990026, 709.3588638367074, 0.0],
[8.0, 386.811514883702, 386.6412338380912, 0.0]]
dd = defaultdict(list)
for key, *values, discard in matrix:
dd[key].extend(values)
result = list(dd.values())
print(result)
# [[1042.3784354104064, 1041.8736266399212, 567.603384919274, 566.8152346188947],
# [709.5076838990026, 709.3588638367074, 386.811514883702, 386.6412338380912]]
Here's a pure numpy solution based on this answer
import numpy as np
mat = np.loadtxt('file.txt')
indices = np.cumsum(np.unique(mat[:, 0], return_counts=True)[1])[:-1]
result = np.array(np.split(mat[:, 1:-1], indices)).reshape((len(indices)+1, -1))
print(result)
# [[1042.37843541 1041.87362664 567.60338492 566.81523462]
# [ 709.5076839 709.35886384 386.81151488 386.64123384]]
The following code will transpose a list of lists, which I believe is what you're asking for. You can trim this new_data if you are looking to remove some rows.
raw_data = [[7.0, 1042.3784354104064, 1041.8736266399212, 0.0],
[7.0, 567.603384919274, 566.8152346188947, 0.0],
[8.0, 709.5076838990026, 709.3588638367074, 0.0],
[8.0, 386.811514883702, 386.6412338380912, 0.0]]
new_data = []
for i, data in enumerate(raw_data):
for j, d in enumerate(data):
if(i==0):
new_data.append([])
new_data[j].append(d)
print(new_data)

Pandas - Fastest way indexing with 2 dataframes

I am developing a software in Python 3 with Pandas library.
Time is very important but memory not so much.
For better visualization I am using the names a and b with few values, although there are many more:
a -> 50000 rows
b -> 5000 rows
I need to select from dataframe a and b (using multiples conditions)
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] ,
'a2': [1, 2, 3],
'a3': [3.14, 2.73, -23.00],
'a4': [pd.np.nan, pd.np.nan, pd.np.nan]
})
a
a1 a2 a3 a4
0 x 1 3.14 NaN
1 y 2 2.73 NaN
2 z 3 -23.00 NaN
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'],
'b2': [2018, 2019, 2020, 2015, 2012]
})
b
b1 b2
0 x 2018
1 y 2019
2 z 2020
3 k 2015
4 l 2012
So far my code is like this:
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if (len(mask.index) != 0): #not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except KeyError: #not found
pass
But as you can see, I'm using for iterrows that is very slow compared to other methods and I'm changing the value of the DataFrame I'm iterating, that is not recommended.
Could you help me find a better way? The results should be like this:
a
a1 a2 a3 a4
0 x 1 3.14 2018
1 y 2 2.73 NaN
2 z 3 -23.00 2020
I tried things like this below, but I didnt made it work.
a.loc[ (a['a1'] == b['b1']) , 'a4'] = b.loc[b['b2'] != 2019]
*the real code has more conditions
Thanks!
EDIT
I benchmark using: iterrows, merge, set_index/loc. Here is the code:
import timeit
import pandas as pd
def f_iterrows():
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
a3 = row["a3"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if len(mask.index) != 0: # not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except: # not found
pass
def f_merge():
a.merge(b[b.b2 != 2019], left_on='a1', right_on='b1', how='left').drop(['a4', 'b1'], 1).rename(columns={'b2': 'a4'})
def f_lock():
df1 = a.set_index('a1')
df2 = b.set_index('b1')
df1.loc[:, 'a4'] = df2.b2[df2.b2 != 2019]
#variables for testing
number_rows = 100
number_iter = 100
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] * number_rows,
'a2': [1, 2, 3] * number_rows,
'a3': [3.14, 2.73, -23.00] * number_rows,
'a4': [pd.np.nan, pd.np.nan, pd.np.nan] * number_rows
})
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'] * number_rows,
'b2': [2018, 2019, 2020, 2015, 2012] * number_rows
})
print('For: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
print('Merge: %s s' % str(timeit.timeit(f_merge, number=number_iter)))
print('Loc: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
They all worked :) and the time to run is:
For: 277.9994369489998 s
Loc: 274.04929955067564 s
Merge: 2.195712725706926 s
So far Merge is the fastest.
If another option appears I will update here, thanks again.
IIUC
a.merge(b[b.b2!=2019],left_on='a1',right_on='b1',how='left').drop(['a4','b1'],1).rename(columns={'b2':'a4'})
Out[263]:
a1 a2 a3 a4
0 x 1 3.14 2018.0
1 y 2 2.73 NaN
2 z 3 -23.00 2020.0

Resources