How can I count different duplicates - python-3.x

I have shared my code below.
I want to delete the duplicates and count them, and I also want a column holding the counts.
So the code should count the values in column A, drop the duplicates, and finally add the counts as a new column. Is this possible somehow? Starting from
df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"]})
the desired output is
df = pd.DataFrame({"A": ["foo", "bar"], "B": [3, 1]})

Without using pandas at all, you could achieve this with Counter from the standard collections module:
>>> from collections import Counter
>>> counter = Counter(["foo", "foo", "foo", "bar"])
>>> counter
Counter({'foo': 3, 'bar': 1})
>>> counter.keys()
dict_keys(['foo', 'bar'])
>>> counter.values()
dict_values([3, 1])
So, for your case:
counter = Counter(["foo", "foo", "foo", "bar"])
df = pd.DataFrame({"A": list(counter.keys()), "B": list(counter.values())})

column comprehension robust to missing values

I have only been able to create a two-column data frame from a defaultdict (named output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to do, using this same basic format, is initialize the dataframe with three columns: 'id', 'id2', and 'value'. I have a separately defined dict, called id_lookup, that contains the necessary lookup info.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible keys. For my purposes, simply putting it all together and inserting 'N/A' or something similar on those errors would be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)
# Loop through the df and populate the defaultdict,
# falling back to 'N/A' when the id is missing from the lookup
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'
# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)
# Print the updated DataFrame
print(df)
which gives:
   id  value  id2
0   1     10    A
1   2     20    B
2   3     30    C
3   4     40  N/A
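As a side note, the explicit loop can be avoided with plain pandas: Series.map looks each id up in the dict and yields NaN for missing keys, which fillna then replaces. A minimal sketch on the same example data:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
# map() returns NaN where the id is absent from id_lookup
df['id2'] = df['id'].map(id_lookup).fillna('N/A')
print(df)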

sorting a list of strings based on a dictionary

I have a dictionary containing high-level job titles and their order, for example:
{'ceo': 0, 'founder': 1, 'chairman': 2}
I also have a list of job titles:
['ceo', 'manager', 'founder', 'partner', 'chairman']
What I want is this:
['ceo', 'founder', 'chairman', 'manager', 'partner']
Try sorting with a key function that looks up each title's rank, defaulting to infinity so that titles missing from the dictionary sort last:
order = {"ceo": 0, "founder": 1, "chairman": 2}
lst = ["ceo", "manager", "founder", "partner", "chairman"]
out = sorted(lst, key=lambda v: order.get(v, float("inf")))
print(out)
Prints:
["ceo", "founder", "chairman", "manager", "partner"]

Convert pandas MultiIndex columns to uppercase

I would like to replace pandas MultiIndex column names with their uppercase forms. With a normal (1-level) index, I would do something like
df.columns = [c.upper() for c in df.columns]
When this is done on a DataFrame with a pd.MultiIndex, I get the following error:
AttributeError: 'tuple' object has no attribute 'upper'
How would I apply the same logic to a pandas multi index? Example code is below.
import pandas as pd
import numpy as np
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)
arrays_upper = [
    ["BAR", "BAR", "BAZ", "BAZ", "FOO", "FOO", "QUX", "QUX"],
    ["ONE", "TWO", "ONE", "TWO", "ONE", "TWO", "ONE", "TWO"],
]
tuples_upper = list(zip(*arrays_upper))
index_upper = pd.MultiIndex.from_tuples(tuples_upper, names=["first", "second"])
df_upper = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index_upper)
print(f'Have: {df.columns}')
print(f'Want: {df_upper.columns}')
You can convert the MultiIndex to a DataFrame, uppercase the values in that DataFrame, and then convert it back to a MultiIndex:
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame().applymap(str.upper))
print(df)
first        BAR                 BAZ                 FOO                 QUX
second       ONE       TWO       ONE       TWO       ONE       TWO       ONE       TWO
A      -0.374874  0.049597 -1.930723 -0.279234  0.235430  0.351351 -0.263074 -0.068096
B       0.040872  0.969948 -0.048848 -0.610735 -0.949685  0.336952 -0.012458 -0.258237
C       0.932494 -1.655863  0.900461  0.403524 -0.123720  0.207627 -0.372031 -0.049706
Or, following your loop idea, uppercase each element of each column tuple:
df.columns = pd.MultiIndex.from_tuples([tuple(map(str.upper, c)) for c in df.columns])
Alternatively, use set_levels, which rewrites the unique values of each level:
df.columns = df.columns.set_levels([level.str.upper() for level in df.columns.levels])
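If memory serves, DataFrame.rename with a function mapper applies it to every label in every level of a MultiIndex when no level argument is given, so the following sketch should work as well (worth verifying on your pandas version):
# Hedged sketch: rename should apply str.upper to each label in every level
df = df.rename(columns=str.upper)
print(df.columns)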

Python: Convert 2d list to dictionary with indexes as values

I have a 2d list with arbitrary strings like this:
lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]
I want to create a dictionary out of this:
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
How do I do this? This answer covers a 1D list with non-repeated values, but I have a 2D list whose values can repeat. Is there a generic way of doing this?
Maybe you could use two for-loops:
lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]
d = {}
overall_idx = 0
for sub_lst in lst:
    for word in sub_lst:
        if word not in d:
            d[word] = overall_idx
            # Uncomment the next line (and remove the increment below the `if`)
            # to advance the index only when a word has not been seen before
            # overall_idx += 1
        overall_idx += 1
print(d)
Output:
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
You could first flatten the list of lists into a single list using a 'double' list comprehension.
Next, get rid of all the duplicates using a dictionary comprehension; we could use a set for that, but we would lose the order.
Finally, use another dictionary comprehension to get the desired result.
lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]
# flatten list of lists to a list
flat_list = [item for sublist in lst for item in sublist]
# remove duplicates
ordered_set = {x:0 for x in flat_list}.keys()
# create required output
the_dictionary = {v:i for i, v in enumerate(ordered_set)}
print(the_dictionary)
""" OUTPUT
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
"""
Also, with collections and itertools:
import itertools
from collections import OrderedDict

lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]
# Build an ordered, de-duplicated key list by zipping the words with a dummy value
lstkeys = list(OrderedDict(zip(itertools.chain(*lst), itertools.repeat(None))))
lstdict = {lstkeys[i]: i for i in range(0, len(lstkeys))}
print(lstdict)
output:
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
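Since Python 3.7, plain dicts preserve insertion order, so dict.fromkeys gives the same ordered de-duplication without OrderedDict:
import itertools

lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]
# dict.fromkeys keeps the first occurrence of each word, in order
lstdict = {word: i for i, word in enumerate(dict.fromkeys(itertools.chain(*lst)))}
print(lstdict)  # {'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}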

Common column names among data sets in Python

I have 6 data sets. Their names are: e10_all, e11_all, e12_all, e13_all, e14_all, and e19_all.
All have different numbers of columns and rows, but they share some common columns. I need to append the rows of these common columns together, so first I want to determine the columns common to all of the data sets; that way I know which columns to select in my SQL query.
In R, I am able to do this using:
# Create a list of the data frames
list_df = list(e10_all, e11_all, e12_all, e13_all, e14_all, e19_all)
col_common = colnames(list_df[[1]])
# Intersect the column names pairwise
for (i in 2:length(list_df)) {
  col_common = intersect(col_common, colnames(list_df[[i]]))
}
# View the common columns
col_common
# Get them as a comma-separated list
cat(noquote(paste(col_common, collapse = ',')))
I want to do the same thing, but in Python. Does anyone happen to know a way?
Thank you
It's not that different in pandas. Making some dummy dataframes:
>>> import pandas as pd
>>> e10_all = pd.DataFrame({"A": [1,2], "B": [2,3], "C": [2,3]})
>>> e11_all = pd.DataFrame({"B": [4,5], "C": [5,6]})
>>> e12_all = pd.DataFrame({"B": [1,2], "C": [3,4], "M": [8,9]})
Then your code would translate to something like
>>> list_df = [e10_all, e11_all, e12_all]
>>> col_common = set.intersection(*(set(df.columns) for df in list_df))
>>> col_common
{'C', 'B'}
>>> ','.join(sorted(col_common))
'B,C'
That second line turns each frame's columns into a set and then takes the intersection of all of them. A more literal translation of your code would work too, although in Python we tend to avoid explicit index-based loops and instead iterate over elements directly (for df in list_df[1:]:). Still,
col_common = set(list_df[0].columns)
for i in range(1, len(list_df)):
    col_common = col_common.intersection(list_df[i].columns)
would get the job done.
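One caveat: sets discard the original column order. If order matters, pandas Index objects have their own intersection method; with sort=False (supported in recent pandas versions, as far as I know) the result should follow the calling index's order:
# Order-preserving sketch using pandas Index.intersection
col_common = list_df[0].columns
for df in list_df[1:]:
    col_common = col_common.intersection(df.columns, sort=False)
print(','.join(col_common))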
