I am trying to prune texts in a list based on texts in another list. The following function works fine when called directly on two lists
def remove_texts(texts, texts2):
    to_remove = []
    for i in texts2:
        if i in texts:
            to_remove.append(i)
    texts = [j for j in texts if j not in to_remove]
    return texts
However, the following does nothing and I get no errors
df_other.texts = df_other.texts.map(lambda x: remove_texts(x, df_other.to_remove_split))
Nor does the following. Again no error is returned
for i, row in df_other.iterrows():
    row['texts'] = remove_texts(row['texts'], row['to_remove_split'])
Any thoughts appreciated.
You actually want to find the set difference between texts
and texts2. Assume that they contain:
texts = [ 'AAA', 'BBB', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH' ]
texts2 = [ 'CCC', 'EEE' ]
Then the shortest solution is to compute just the set difference,
without using Pandas:
set(texts).difference(texts2)
gives:
{'AAA', 'BBB', 'DDD', 'FFF', 'GGG', 'HHH'}
Or if you want just a list (not a set), write:
sorted(set(texts).difference(texts2))
And if for some reason you want to use Pandas, start by
creating DataFrames from both lists:
df = pd.DataFrame(texts, columns=['texts'])
df2 = pd.DataFrame(texts2, columns=['texts'])
Then you can compute the set difference as:
df.query('texts not in @df2.texts')
or
df.texts[~df.texts.isin(df2.texts)]
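If the goal is to prune each row's list in the original DataFrame (as in the question), the same per-row logic can be applied with a comprehension over zip — a minimal sketch, assuming both columns hold plain lists:

```python
import pandas as pd

# hypothetical data shaped like df_other in the question
df_other = pd.DataFrame({
    'texts': [['AAA', 'BBB', 'EEE'], ['CCC', 'DDD']],
    'to_remove_split': [['EEE'], ['CCC']],
})

# pair each row's texts with its own to_remove_split list
df_other['texts'] = [
    [t for t in texts if t not in set(remove)]
    for texts, remove in zip(df_other['texts'], df_other['to_remove_split'])
]
print(df_other['texts'].tolist())  # [['AAA', 'BBB'], ['DDD']]
```

This avoids the bug in the original `map` call, which passed the whole `to_remove_split` Series to every row instead of that row's own list.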
Related
I am trying to limit the values to 2 decimal places in the result of a nested list. I have already tried setting the display precision and other things, but cannot find a way.
r_ij_matrix = variables[1]
print(type(r_ij_matrix))
print(type(r_ij_matrix[0]))
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.precision", 2)
data = pd.DataFrame(r_ij_matrix, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Combined Decision Matrix')
You can solve your problem with the apply() method of the DataFrame. You can do something like this:
df.apply(lambda x: [[round(elt, 2) for elt in list_] for list_ in x])
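For instance, on a small dummy frame with cells shaped like the nested-list data (hypothetical values):

```python
import pandas as pd

# cells are lists of floats, mirroring the r_ij_matrix structure
df = pd.DataFrame({'A': [[2.04939, 2.28035], [2.04939, 2.28035]],
                   'B': [[3.11448, 3.27108], [3.11448, 3.27108]]})

# round every element of every inner list, column by column
rounded = df.apply(lambda col: [[round(elt, 2) for elt in list_] for list_ in col])
print(rounded['A'].iloc[0])  # [2.05, 2.28]
```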
Solved it by copying the list to another with the desired decimal points. Thanks everyone.
rij_matrix = variables[1]
rij_nparray = np.empty([8, 6, 3])
for i in range(8):
    for j in range(6):
        for k in range(3):
            rij_nparray[i][j][k] = round(rij_matrix[i][j][k], 2)
rij_list = rij_nparray.tolist()
pd.set_option('display.expand_frame_repr', False)
data = pd.DataFrame(rij_list, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Normalized Fuzzy Decision Matrix (r_ij)')
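For reference, the triple loop can also be collapsed into one vectorised call with np.round — a sketch on hypothetical sample data of the same nested shape:

```python
import numpy as np

# hypothetical stand-in for rij_matrix: a nested list of floats
rij_matrix = [[[1.231, 2.347], [3.452, 4.568]]]

# np.round applies element-wise over the whole 3-D array at once
rij_list = np.round(np.array(rij_matrix), 2).tolist()
print(rij_list)  # [[[1.23, 2.35], [3.45, 4.57]]]
```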
applymap seems to be good here:
df.applymap(lambda lst: list(map("{:.2f}".format, lst)))
But be aware that it is probably not the best idea to store lists as values of a DataFrame; you give up much of the functionality of pandas. Also, after formatting them like this, the values are stored as strings. This (if really wanted) should only be for presentation.
Output:
A B
0 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
1 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
2 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
Used Input:
df = pd.DataFrame({
'A': [[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463]],
'B': [[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414]]})
I am extracting data from pdfs into lists
list1 = []
for page in pages:
    for lobj in element:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
        if isinstance(lobj, LTTextContainer):
            for text_line in lobj:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size = character.size
            list1.append([Font_size, (lobj.get_text())])
        if isinstance(lobj, LTTextContainer):
            for text_line in lobj:
                for character in text_line:
                    if isinstance(character, LTChar):
                        font_name = character.fontname
            list1.append(font_name)
print(list1)
gives me a list of lists in which the font_name is not inside each inner list with the size and text.
list = [[12.0, 'aaa'], 'IJEAMP+Times-Bold', [12.0, 'bbb'], 'IJEAOO+Times-Roman', [12.0, 'ccc'], 'IJEAMP+Times-Bold', [10.0, 'ddd'], 'IJEAOO+Times-Roman', [10.0, 'eee'], 'IJEAOO+Times-Roman', [8.0, '2\n'], 'IJEAOO+Times-Roman', 'IJEAOO+Times-Roman']
How the list of lists should look:
list = [[12.0, 'aaa', 'IJEAMP+Times-Bold'], [12.0, 'bbb', 'IJEAOO+Times-Roman'], [12.0, 'ccc', 'IJEAMP+Times-Bold'], [10.0, 'ddd', 'IJEAOO+Times-Roman'], [10.0, 'eee', 'IJEAOO+Times-Roman'], [8.0, '2\n', 'IJEAOO+Times-Roman'], 'IJEAOO+Times-Roman']
If possible, I would like an answer that fixes the error in my code. I believe it is possible to do this without creating two lists and zipping them afterwards.
I tried list2.extend([list1, font_name]), but that doesn't do it, as the font_name keeps getting split into individual letters.
You are appending to the outer list, not the list you just added into it.
This adds your inner list:
list1.append([Font_size,(lobj.get_text())])
if you want to extend that added list, you can do so by using
list1[-1].append(font_name)
instead of
list1.append(font_name)
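A minimal sketch of the difference, with dummy values standing in for the pdfminer objects:

```python
list1 = []

# inner loop appends the [size, text] record once per text box
list1.append([12.0, 'aaa'])

# extend the record just added, not the outer list
list1[-1].append('IJEAMP+Times-Bold')

print(list1)  # [[12.0, 'aaa', 'IJEAMP+Times-Bold']]
```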
I am doing a K-means project and I have to do it by hand, which is why I am trying to figure out the best way to group things according to their last values into a list or a dictionary. Here is what I am talking about:
list_of_tuples = [('honey', 1), ('bee', 2), ('tree', 5), ('flower', 2), ('computer', 5), ('key', 1)]
Now my ultimate goal is to be able to sort out the list and have 3 different lists each with its respected element
"""This is the goal"""
list_1 = [honey,key]
list_2 = [bee,flower]
list_3 = [tree, computer]
I can use a lot of if statements and a for loop, but is there a more efficient way to do it?
If you're not opposed to using something like pandas, you could do something along these lines:
import pandas as pd
list_1, list_2, list_3 = pd.DataFrame(list_of_tuples).groupby(1)[0].apply(list).values
Result:
In [19]: list_1
Out[19]: ['honey', 'key']
In [20]: list_2
Out[20]: ['bee', 'flower']
In [21]: list_3
Out[21]: ['tree', 'computer']
Explanation:
pd.DataFrame(list_of_tuples).groupby(1) groups your list of tuples by the value at index 1, then you extract the values as lists of index 0 with [0].apply(list).values. This gives you an array of lists as below:
array([list(['honey', 'key']), list(['bee', 'flower']),
list(['tree', 'computer'])], dtype=object)
Something to that effect can be achieved with a dictionary and a for loop, using the second element of each tuple as the key.
list_of_tuples = [("honey",1),("bee",2),("tree",5),("flower",2),("computer",5),("key",1)]
dict_list = {}
for t in list_of_tuples:
    # create the key with a single-element list if it doesn't exist yet,
    # append to the existing list otherwise
    if t[1] not in dict_list:
        dict_list[t[1]] = [t[0]]
    else:
        dict_list[t[1]].append(t[0])
list_1, list_2, list_3 = dict_list.values()
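The key-existence check can also be dropped by using collections.defaultdict; this sketch relies on insertion-ordered dicts (Python 3.7+), just like the loop above:

```python
from collections import defaultdict

list_of_tuples = [("honey", 1), ("bee", 2), ("tree", 5),
                  ("flower", 2), ("computer", 5), ("key", 1)]

# missing keys are created with an empty list automatically
groups = defaultdict(list)
for word, label in list_of_tuples:
    groups[label].append(word)

list_1, list_2, list_3 = groups.values()
print(list_1, list_2, list_3)  # ['honey', 'key'] ['bee', 'flower'] ['tree', 'computer']
```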
I have 6 data sets. Their names are: e10_all, e11_all, e12_all, e13_all, e14_all, and e19_all.
All have different numbers of columns and rows, but with some common columns. I need to append the rows of these columns together. First, I want to determine the columns that are common to all of the data sets, so I know which columns to select in my SQL query.
In R, I am able to do this using:
# Create list of dts
list_df = list(e10_all, e11_all, e12_all, e13_all, e14_all, e19_all)
col_common = colnames(list_df[[1]])
# Write for loop
for (i in 2:length(list_df)) {
  col_common = intersect(col_common, colnames(list_df[[i]]))
}
# View the common columns
col_common
# Get as a comma-separated list
cat(noquote(paste(col_common, collapse = ',')))
I want to do the same thing, but in Python. Does anyone happen to know a way?
Thank you
It's not that different in pandas. Making some dummy dataframes:
>>> import pandas as pd
>>> e10_all = pd.DataFrame({"A": [1,2], "B": [2,3], "C": [2,3]})
>>> e11_all = pd.DataFrame({"B": [4,5], "C": [5,6]})
>>> e12_all = pd.DataFrame({"B": [1,2], "C": [3,4], "M": [8,9]})
Then your code would translate to something like
>>> list_df = [e10_all, e11_all, e12_all]
>>> col_common = set.intersection(*(set(df.columns) for df in list_df))
>>> col_common
{'C', 'B'}
>>> ','.join(sorted(col_common))
'B,C'
That second line turns each frame's columns into a set and then takes the intersection of all of them. A more literal translation of your code would work too, although in pandas we tend to avoid explicit loops where possible, and to loop over elements directly (for df in list_df[1:]:) rather than by index. Still,
col_common = set(list_df[0].columns)
for i in range(1, len(list_df)):
    col_common = col_common.intersection(list_df[i].columns)
would get the job done.
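The same fold can also be written with functools.reduce, if you prefer a one-liner over the loop — a sketch on dummy frames like the ones above:

```python
from functools import reduce
import pandas as pd

e10_all = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
e11_all = pd.DataFrame({"B": [4], "C": [5]})
e12_all = pd.DataFrame({"B": [1], "C": [3], "M": [8]})

list_df = [e10_all, e11_all, e12_all]

# fold the set intersection across all column sets
col_common = reduce(lambda acc, df: acc & set(df.columns),
                    list_df[1:], set(list_df[0].columns))
print(sorted(col_common))  # ['B', 'C']
```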
I have a data frame with one column of sub-instances of a larger group, and want to categorize this into a smaller number of groups. How do I do this?
Consider the following sample data:
df = pd.DataFrame({
    'a': np.random.randn(60),
    'b': np.random.choice([5, 7, np.nan], 60),
    'c': np.random.choice(['panda', 'elephant', 'python', 'anaconda', 'shark', 'clown fish'], 60),
    # some ways to create systematic groups for indexing or groupby
    'e': np.tile(range(20), 3),
})
I would now want, in a new column, e.g. panda and elephant categorized as mammals, etc.
The most intuitive would be to create a new series, create a dict and then remap according to it:
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal', 'python': 'snake', 'anaconda': 'snake', 'shark': 'fish', 'clown fish': 'fish'}
c_Series = pd.Series(df['c']) # create new series
classified_c = c_Series.map(mapping_dict) # remap new series
if 'c_classified' not in df.columns: df.insert(3, 'c_classified', classified_c)  # insert only if not already in df (so you can run the code multiple times)
I think you need map with fillna to replace the NaNs produced for non-matching values:
#borrowed dict from Ivo's answer
mapping_dict = {'panda': 'mammal', 'elephant': 'mammal',
'python': 'snake', 'anaconda': 'snake',
'shark': 'fish', 'clown fish': 'fish'}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
Also, if changing the format of the dictionary is possible, you can generate the final dictionary by swapping keys with values:
d = {'mammal':['panda','elephant'],
'snake':['python','anaconda'],
'fish':['shark','clown fish']}
mapping_dict = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
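Put together on a small sample (the 'lion' row is a hypothetical non-matching value):

```python
import pandas as pd

df = pd.DataFrame({'c': ['panda', 'python', 'shark', 'lion']})

d = {'mammal': ['panda', 'elephant'],
     'snake': ['python', 'anaconda'],
     'fish': ['shark', 'clown fish']}

# invert: each animal becomes a key pointing at its category
mapping_dict = {k: oldk for oldk, oldv in d.items() for k in oldv}

df['d'] = df['c'].map(mapping_dict).fillna('not_matched')
print(df['d'].tolist())  # ['mammal', 'snake', 'fish', 'not_matched']
```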