I have a simple dataframe to which I want to add a new column (col3) with values determined by the values in 'col1'. If the value in 'col1' starts with A, I want to put 'A' in col3, and similarly 'B' for values that start with B.
import pandas as pd
d = {"col1" : ["A1", "A2", "B1", "B2"], "col2" : [1, 2, 3, 4]}
df = pd.DataFrame(data = d)
df
import numpy as np
import pandas as pd
d = {"col1" : ["A1", "A2", "B1", "B2"], "col2" : [1, 2, 3, 4]}
df = pd.DataFrame(data = d)
df['col3'] = np.where(df.col1.str.startswith('A'), 'A', df.col1)
df['col3'] = np.where(df.col1.str.startswith('B'), 'B', df.col3)
df
Output
col1 col2 col3
0 A1 1 A
1 A2 2 A
2 B1 3 B
3 B2 4 B
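If more than two prefixes need handling, np.select avoids chaining several np.where calls (a sketch building on the same df; the 'other' fallback label is an assumption, since the chained np.where version kept the original col1 value instead):
conditions = [df.col1.str.startswith('A'), df.col1.str.startswith('B')]
choices = ['A', 'B']
df['col3'] = np.select(conditions, choices, default='other')  # assumed fallback label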
Try this
import numpy as np
import pandas as pd
d = {"col1" : ["A1", "A2", "B1", "B2", "C1", "C2"], "col2" : [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data = d)
df['col3'] = df['col1'].str[0]
print(df)
This gives you:
col1 col2 col3
0 A1 1 A
1 A2 2 A
2 B1 3 B
3 B2 4 B
4 C1 5 C
5 C2 6 C
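Note that .str[0] assumes the label is always the first single character. If the prefix can be longer, a regex via .str.extract is one option (a sketch, assuming the label is the leading run of letters):
df['col3'] = df['col1'].str.extract(r'^([A-Za-z]+)', expand=False)  # capture leading letters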
Related
I've got a CSV that I'm reading into a pandas dataframe. However, one of the columns is in the form of a dictionary. Here is an example:
ColA, ColB, ColC, ColdD
20, 30, {"ab":"1", "we":"2", "as":"3"},"String"
How can I turn this into a dataframe that looks like this:
ColA, ColB, AB, WE, AS, ColdD
20, 30, "1", "2", "3", "String"
Edit: I fixed up the question; the column looks like a dict but is actually a string that needs to be parsed, not a dict object.
As per https://stackoverflow.com/a/38231651/454773, you can use .apply(pd.Series) to map the dict-containing column onto new columns, then concatenate those new columns back onto the original dataframe minus the original dict column:
dw = pd.DataFrame([[20, 30, {"ab":"1", "we":"2", "as":"3"}, "String"]],
                  columns=['ColA', 'ColB', 'ColC', 'ColdD'])
pd.concat([dw.drop(['ColC'], axis=1), dw['ColC'].apply(pd.Series)], axis=1)
Returns:
ColA ColB ColdD ab as we
20 30 String 1 3 2
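If, as the edit notes, ColC actually arrives as strings rather than dicts, they would need to be parsed first (a sketch using ast.literal_eval, the same approach the next answer takes):
import ast
dw['ColC'] = dw['ColC'].apply(ast.literal_eval)  # turn "{'ab': '1', ...}" strings into dicts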
So, starting with your one-row df:
Col A Col B Col C Col D
0 20 30 {u'we': 2, u'ab': 1, u'as': 3} String1
EDIT: based on the comment by the OP, I'm assuming we need to convert the string first:
import ast
df["ColC"] = df["ColC"].map(lambda d : ast.literal_eval(d))
Then we convert Col C to a dict, transpose it, and join it to the original df:
dfNew = df.join(pd.DataFrame(df["Col C"].to_dict()).T)
dfNew
which gives you this
Col A Col B Col C Col D ab as we
0 20 30 {u'we': 2, u'ab': 1, u'as': 3} String1 1 3 2
Then we just select the columns we want in dfNew
dfNew[["Col A", "Col B", "ab", "we", "as", "Col D"]]
Col A Col B ab we as Col D
0 20 30 1 2 3 String1
What about something like:
import pandas as pd
# Create mock dataframe
df = pd.DataFrame([
[20, 30, {'ab':1, 'we':2, 'as':3}, 'String1'],
[21, 31, {'ab':4, 'we':5, 'as':6}, 'String2'],
[22, 32, {'ab':7, 'we':8, 'as':9}, 'String2'],
], columns=['Col A', 'Col B', 'Col C', 'Col D'])
# Create dataframe where you'll store the dictionary values
ddf = pd.DataFrame(columns=['AB','WE','AS'])
# Populate ddf dataframe
for (i, r) in df.iterrows():
    e = r['Col C']
    ddf.loc[i] = [e['ab'], e['we'], e['as']]
# Replace df with the output of concat(df, ddf)
df = pd.concat([df, ddf], axis=1)
# New column order, also drops old Col C column
df = df[['Col A', 'Col B', 'AB', 'WE', 'AS', 'Col D']]
print(df)
Output:
Col A Col B AB WE AS Col D
0 20 30 1 2 3 String1
1 21 31 4 5 6 String2
2 22 32 7 8 9 String2
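For larger frames, both .apply(pd.Series) and iterrows can be slow; building the new columns straight from a list of dicts is usually faster (a sketch, starting again from the mock df above and assuming Col C already holds dict objects):
expanded = pd.DataFrame(df['Col C'].tolist(), index=df.index)  # one column per dict key
df = pd.concat([df.drop(columns=['Col C']), expanded], axis=1)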
I have a few lists and a dictionary and would like to create a pandas dataframe from them.
Could someone help me out? I seem to be missing something.
One simple example below:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using Series, I would do it like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and the lists end up in the df as expected.
For the dict, I would do:
df = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
1 x a 1
2 y b 3
3 c text1
4
The problem is that, written like this, the first df gets overwritten by the second call to pd.DataFrame.
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict into two separate lists and just use Series over four lists, but I would think there is a better way to go straight from two lists and one dict to a single df with four columns.
Thanks for the help.
You can also use pd.concat to concatenate the two dataframes:
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
Why not build each column separately via dict.keys() and dict.values() instead of using dict.items()?
df = pd.DataFrame({
'col1': pd.Series(l1),
'col2': pd.Series(l3),
'col3': pd.Series(dict.keys()),
'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack the zipped values of the list generated from d.items() and pass them to itertools.zip_longest, which fills in missing values so every column matches the length of the longest list:
import numpy as np
import pandas as pd
from itertools import zip_longest
# dict is a Python builtin name, so d is used for the variable instead
d = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()), fillvalue=np.nan),
                  columns=['col1', 'col2', 'col3', 'col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals
df['Combos'] = unique_across_rows(df)
This returns the expected output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
It seems to me that there should be a more vectorized approach within pandas to do this. How could I do that?
You can try a simple list comprehension which might be more efficient for larger dataframes:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
Col1 Col2 Col3 combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
You can also use agg/apply on axis=1 like below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)
print(df)
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Try this (the explanation is in the comments):
df['Combos'] = (df.stack() # this removes NaN values
.sort_values() # so we have A_B instead of B_A in 3rd row
.groupby(level=0) # group by original index
.agg(lambda x: '_'.join(x.unique())) # join the unique values
)
Output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Fill the NaN with a string placeholder '-'. Create a unique array from the [Col1, Col2, Col3] list and remove the placeholder, then join the unique values with '_':
import pandas as pd
import numpy as np
def unique(list1):
    # drop all placeholder entries before taking the unique values
    list1 = [value for value in list1 if value != '-']
    return np.unique(np.array(list1))
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')
s="-"
for key,row in df.iterrows():
df.loc[key,'combos']=s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())
I have a dataframe like this:
d = {'col1': ['a', 'b'], 'col2': [2, 4]}
df = pd.DataFrame(data=d)
df
>> col1 col2
0 a 2
1 b 4
and I want to duplicate the rows by col2 and get a table like this:
>> col1 col2
0 a 2
1 a 2
2 b 4
3 b 4
4 b 4
5 b 4
Thanks to everyone for the help!
Here's my solution using some numpy:
numRows = np.sum(df.col2)
blankSpace = np.zeros(numRows).astype(str)
d2 = {'col1': blankSpace, 'col2': blankSpace}
df2 = pd.DataFrame(data=d2)
counter = 0
for i in range(df.shape[0]):
    letter = df.col1[i]
    numRowsForLetter = df.col2[i]
    for j in range(numRowsForLetter):
        df2.at[counter, 'col1'] = letter
        df2.at[counter, 'col2'] = numRowsForLetter
        counter += 1
df2 is your output dataframe!
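For reference, the same expansion can be done without an explicit loop by repeating the index (a sketch using pandas' Index.repeat, not part of the answer above):
df2 = df.loc[df.index.repeat(df['col2'])].reset_index(drop=True)  # each row repeated col2 times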
I'm wondering about existing pandas functionality that I might not have been able to find so far.
Basically, I have a data frame with various columns. I'd like to select specific rows depending on the values of certain columns (FYI: I was interested in the value of column D, which had several parameters described in A-C).
E.g. I want to know which row(s) have A==1 & B==2 & C==5?
df
A B C D
0 1 2 4 a
1 1 2 5 b
2 1 3 4 c
df_result
1 1 2 5 b
So far I have been able to basically reduce this:
import pandas as pd
df = pd.DataFrame({'A': [1,1,1],
'B': [2,2,3],
'C': [4,5,4],
'D': ['a', 'b', 'c']})
df_A = df[df['A'] == 1]
df_B = df_A[df_A['B'] == 2]
df_C = df_B[df_B['C'] == 5]
To this:
parameter = [['A', 1],
['B', 2],
['C', 5]]
df_filtered = df
for x, y in parameter:
    df_filtered = df_filtered[df_filtered[x] == y]
which yielded the same results. But I wonder if there's another way, maybe without a loop, in one line?
You could use the query() method to filter the data, and construct the filter expression from the parameters, like:
In [288]: df.query(' and '.join(['{0}=={1}'.format(x[0], x[1]) for x in parameter]))
Out[288]:
A B C D
1 1 2 5 b
Details
In [296]: df
Out[296]:
A B C D
0 1 2 4 a
1 1 2 5 b
2 1 3 4 c
In [297]: query = ' and '.join(['{0}=={1}'.format(x[0], x[1]) for x in parameter])
In [298]: query
Out[298]: 'A==1 and B==2 and C==5'
In [299]: df.query(query)
Out[299]:
A B C D
1 1 2 5 b
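One caveat worth noting: the format-based expression works here because all the values are numeric. If some values were strings, they would need quoting inside the query, which !r (repr) handles (a sketch):
query = ' and '.join('{0}=={1!r}'.format(col, val) for col, val in parameter)
# e.g. ('D', 'b') would become "D=='b'", while numbers stay unquoted
df.query(query)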
Just for information, in case others are interested, I would have done it this way:
import numpy as np
matched = np.all([df[vn] == vv for vn, vv in parameter], axis=0)
df_filtered = df[matched]
But I like the query function better, now that I have seen it. Thanks @John Galt.
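For completeness, a similar loop-free variant that stays within pandas (a sketch using functools.reduce, not from the answers above):
from functools import reduce
# start from an all-True mask and AND in one condition per (column, value) pair
mask = reduce(lambda acc, cv: acc & (df[cv[0]] == cv[1]),
              parameter, pd.Series(True, index=df.index))
df_filtered = df[mask]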