Pandas to replace value using map function - python-3.x

I have a pandas DataFrame like this (data.png).
I want to replace the single value '218.188.2.4' inside a list cell such as [0, '218.188.2.4'] using a dict (dat1).
Is there a more efficient way or function to do that?
I hope to get your response.
When I used the map function I ended up with NaN. I know map can only replace a single value with a single value. I could write my own function to do the replacement, but that seems a little complex.
With the map function:
fd_id['parameter value vector'].map(token_encode_dict)
returns NaN at the corresponding position.

I don't know if I understand your question correctly, but you can map the dict values across the columns as follows.
Example DataFrame:
>>> df
A B C D
0 no no no yes
1 yes yes yes no
2 yes no yes no
3 no yes no yes
Result:
>>> for col in 'ABCD':
...     df[col] = df[col].map({'yes': True, 'no': False})
>>> df
A B C D
0 False False False True
1 True True True False
2 True False True False
3 False True False True

The map function looks up each cell as a single value, so it cannot replace an element inside a list.
You can split the list into separate columns, use map to replace the values in each column, and then combine the columns again to get the replaced lists.
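As an alternative to splitting the lists into columns, here is a minimal sketch that applies the dict to every element of each list with apply. The data below is a hypothetical stand-in for the question's screenshots (the real contents of token_encode_dict and fd_id are not shown), and elements without a dict entry are assumed to stay unchanged.

```python
import pandas as pd

# Hypothetical stand-ins for the data shown in the question's screenshots.
token_encode_dict = {"218.188.2.4": 7}
fd_id = pd.DataFrame(
    {"parameter value vector": [[0, "218.188.2.4"], [1, "10.0.0.1"]]}
)

# Look up every list element; keep elements that have no entry in the dict.
fd_id["encoded"] = fd_id["parameter value vector"].apply(
    lambda values: [token_encode_dict.get(v, v) for v in values]
)

print(fd_id["encoded"].tolist())
```

Using dict.get with the element itself as the default is what avoids the NaN: unknown elements pass through instead of becoming missing values.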

Related

Appending a value to a specific DataFrame cell

I have tested many different options to append a value to a certain cell in a DataFrame, but haven't figured out how to do it, nor have I found anything relevant in my research.
I have a column in my DataFrame that starts with False in all positions. It then receives values over time, one at a time. The problem starts when I have to add more than one value to the same cell. E.g.
df = pd.DataFrame(data=[[1, 2, False], [4, 5, False], [7, 8, False]],columns=["A","B","C"])
which gives me:
- A B C
0 1 2 False
1 4 5 False
2 7 8 False
I've tried to transform the cell into a list in different ways, e.g (just a few as examples):
df.iloc[0,0] = df.iloc[0,0].tolist().append("A")
OR -
df.iloc[0,0] = df.iloc[0,0].tolist()
df.iloc[0,0] = df.iloc[0,0].append("A")
But nothing worked so far.
Any way I can append a value (a string) to a specific cell, a cell that might start as a Boolean or as a String?
If you need to concatenate a cell's value with a string, you can use:
df.iloc[1,0] = str(df.iloc[1,0]) + "A"
df.iloc[0,2] = str(df.iloc[0,2]) + "A"
Or f-string can be used:
df.iloc[1,0] = f'{df.iloc[1,0]}' + "A"
df.iloc[0,2] = f'{df.iloc[0,2]}' + "A"
It is generally not advisable (check this article for example) to have Pandas dataframes with mixed dtypes since you cannot guarantee the behaviour of each "cell".
Therefore, one solution would be to first ensure that the whole column that you might change in the future is of type list. For example, if you know that the column "C" will or might be updated in the future to append values to it as if it's a list, then it's preferable that the False values you mentioned as a "starting point" are already encoded as part of a list. For example, with the dataframe you provided:
df["C"] = df["C"].apply(lambda x: [x])
df.iloc[0, 2].append("A")
df
This outputs:
A B C
0 1 2 [False, A]
1 4 5 [False]
2 7 8 [False]
And now, if you want to go through column C and check whether the first value is False or True, you could, for example, use:
df["C"].apply(lambda x: x[0])
This ensures that you can still access this value without resorting to tricks like checking the type, etc.

Is there a way to filter out values in a pandas dataframe column that has the same format?

I am a beginner in Python and I am trying to find out if there is a method to check whether the values in a column of a pandas DataFrame follow a certain format.
For example,
1234_ABC_12 passes
4567_ABC_12 passes
but,
123A_ABC_12 fails
I have tried something like this, but it does not work.
for item in df[col].item():
if item != ('\d\d\d\d_ABD_\d\d')
print('fail')
else:
print('success')
Please help and suggest a better way to do this. Thanks in advance.
Use str.match
df
a
0 1234_ABC_12
1 4567_ABC_12
2 123A_ABC_12
df.a.str.match(r'\d\d\d\d_ABC_\d\d')
0 True
1 True
2 False
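One caveat worth a quick sketch: str.match only anchors the pattern at the start of the string, so a value with extra trailing characters still passes. Assuming pandas 1.1 or later, str.fullmatch requires the whole string to match. The fourth value below is a hypothetical addition to the question's data.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": ["1234_ABC_12", "4567_ABC_12", "123A_ABC_12", "1234_ABC_123"]}
)
pattern = r"\d{4}_ABC_\d{2}"

# match anchors only at the start, so the trailing "3" in the last row is ignored.
df["match"] = df["a"].str.match(pattern)
# fullmatch requires the entire string to fit the pattern.
df["fullmatch"] = df["a"].str.fullmatch(pattern)

print(df)
```

To keep only the rows that pass, index with the mask: df[df["a"].str.fullmatch(pattern)].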

Data Cleaning with Pandas in Python

I am trying to clean a csv file for data analysis. How do I convert TRUE FALSE into 1 and 0?
When I searched Google, the suggestion was df.somecolumn=df.somecolumn.astype(int). However, this csv file has 100 columns and not every column is true/false (some are categorical, some are numerical). How do I write sweeping code that converts any column with TRUE FALSE to 1 and 0 without typing 50 lines of df.somecolumn=df.somecolumn.astype(int)?
You can use:
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)
A slightly different approach.
First, dtypes of a dataframe can be returned using df.dtypes, which gives a pandas series that looks like this,
a int64
b bool
c object
dtype: object
Second, we could replace bool with int type using replace,
df.dtypes.replace('bool', 'int8'), this gives
a int64
b int8
c object
dtype: object
Finally, a pandas Series is dict-like, so it can be passed to pd.DataFrame.astype.
We could write it as a oneliner,
df.astype(df.dtypes.replace('bool', 'int8'))
I would do it like this:
df.somecolumn = df.somecolumn.apply(lambda x: 1 if x=="TRUE" else 0)
If you want to iterate through all your columns and check whether they contain TRUE/FALSE values, you can do this:
for c in df:
    # `in` on a Series tests the index labels, so test the values with isin instead
    if df[c].isin(['TRUE', 'FALSE']).any():
        df[c] = df[c].apply(lambda x: 1 if x == 'TRUE' else 0)
Note that this approach is case-sensitive and won't work well if in the column the TRUE/FALSE values are mixed with others.
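The two situations discussed in this thread (columns already parsed as bool, and columns that hold the text TRUE/FALSE) can be handled together. A minimal sketch, assuming a hypothetical frame and assuming that a text column should only be converted when every value is exactly 'TRUE' or 'FALSE':

```python
import pandas as pd

# Hypothetical frame: one real bool column, one text TRUE/FALSE column, two others.
df = pd.DataFrame({
    "flag": [True, False, True],
    "text_flag": ["TRUE", "FALSE", "TRUE"],
    "category": ["a", "b", "c"],
    "amount": [1.5, 2.0, 3.5],
})

# Columns in which every value is the text TRUE or FALSE.
for col in df.columns:
    if df[col].isin(["TRUE", "FALSE"]).all():
        df[col] = df[col].map({"TRUE": 1, "FALSE": 0})

# Columns that pandas already parsed as bool.
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)

print(df)
```

Requiring all() rather than any() leaves mixed columns untouched, which sidesteps the problem of other values silently turning into 0.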

Filtering all rows with a specific value, without specifiying column names

I would like to filter a dataframe for rows that contain only False in every column. Because the number of columns and their names may vary I would like to do so without explicitly naming the columns. Column names from a list, or any function akin to pandas.DataFrame.all is fine. The index needs to be preserved.
bool_dict = {'one':[True, False, False,False], 'two':[False, True, True, False], 'three':[False,False,True,False]}
bool_df = pd.DataFrame(bool_dict)
Expected output is a dataframe comprising row index 3. i.e the result of this command
df_false = bool_df[(bool_df['one']==False) & (bool_df['two']==False) & (bool_df['three']==False)]
I'm sure there must be a simple solution, though I seem to be having trouble finding it. Thanks.
Use .any() or .all() along axis=1 so that the column names don't matter. Check if all are False with either:
bool_df[(~bool_df).all(axis=1)]
# or
bool_df[~bool_df.any(axis=1)]
# one two three
#3 False False False
You could also use sum:
bool_df[bool_df.sum(axis=1)==0]
one two three
3 False False False
Or max:
bool_df[~bool_df.max(axis=1)]
one two three
3 False False False
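For anyone who wants to confirm that these masks agree on the question's data, here is a small sketch. The variable names are illustrative only.

```python
import pandas as pd

bool_df = pd.DataFrame({
    "one": [True, False, False, False],
    "two": [False, True, True, False],
    "three": [False, False, True, False],
})

no_true = ~bool_df.any(axis=1)        # rows with no True in any column
all_false = (~bool_df).all(axis=1)    # rows where every value is False
zero_sum = bool_df.sum(axis=1) == 0   # True counts as 1, so the row sum is 0

result = bool_df[no_true]
print(result)
```

Boolean indexing keeps the original index, so the surviving row is still labelled 3.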

Checking if string in column contain word [duplicate]

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.
Something like this idiom:
re.search(pattern, cell_in_question)
returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.
Vectorized string methods (i.e. Series.str) let you do the following:
df[df['A'].str.contains("hello")]
This is available in pandas 0.8.1 and up.
I am using pandas 0.14.1 on macos in ipython notebook. I tried the proposed line above:
df[df["A"].str.contains("Hello|Britain")]
and got an error:
cannot index with vector containing NA / NaN values
but it worked perfectly when an "==True" condition was added, like this:
df[df['A'].str.contains("Hello|Britain")==True]
How do I select by partial string from a pandas DataFrame?
This post is meant for readers who want to
search for a substring in a string column (the simplest case) as in df1[df1['col'].str.contains(r'foo(?!$)')]
search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay"), e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
match multiple whole words
Understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and correct it with str.contains('pattern',na=False)
...and would like to know more about what methods should be preferred over others.
(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)
Friendly disclaimer: this post is long.
Basic Substring Search
# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1
col
0 foo
1 foobar
2 bar
3 baz
str.contains can be used to perform either substring searches or regex based search. The search defaults to regex-based unless you explicitly disable it.
Here is an example of regex-based search,
# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]
col
1 foobar
Sometimes regex search is not required, so specify regex=False to disable it.
#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
col
0 foo
1 foobar
Performance wise, regex search is slower than substring search:
df2 = pd.concat([df1] * 1000, ignore_index=True)
%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]
6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoid using regex-based search if you don't need it.
Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in
ValueError: cannot index with vector containing NA / NaN values
This is usually because of mixed data or NaNs in your object column,
s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')
0 True
1 True
2 NaN
3 True
4 False
5 NaN
dtype: object
s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,
s.str.contains('foo|bar', na=False)
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool
How do I apply this to multiple columns at once?
The answer is in the question. Use DataFrame.apply:
# By default (axis=0), `apply` passes each column to the lambda function.
df.apply(lambda col: col.str.contains('foo|bar', na=False))
A B
0 True True
1 True False
2 False True
3 True False
4 False False
5 False False
All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).
If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
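A minimal sketch of combining select_dtypes with the column-wise apply, on a hypothetical mixed frame. The text columns are created with dtype=object explicitly so that include="object" picks them up regardless of how pandas infers string columns by default.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed frame: two text columns and one numeric column.
df = pd.DataFrame({
    "A": pd.Series(["foo", "bar", np.nan], dtype=object),
    "B": pd.Series(["baz", "food", "qux"], dtype=object),
    "n": [1, 2, 3],
})

# Restrict the search to the object (text) columns.
text_cols = df.select_dtypes(include="object")
mask = text_cols.apply(lambda col: col.str.contains("foo|bar", na=False))

# Keep rows where at least one text column matched.
rows_with_hit = df[mask.any(axis=1)]
print(rows_with_hit)
```

Swap any(axis=1) for all(axis=1) if every text column has to match.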
Multiple Substring Search
This is most easily achieved through a regex search using the regex OR pipe.
# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4
col
0 foo abc
1 foobar xyz
2 bar32
3 baz 45
df4[df4['col'].str.contains(r'foo|baz')]
col
0 foo abc
1 foobar xyz
3 baz 45
You can also create a list of terms, then join them:
terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]
col
0 foo abc
1 foobar xyz
3 baz 45
Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...
. ^ $ * + ? { } [ ] \ | ( )
Then, you'll need to use re.escape to escape them:
import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]
col
0 foo abc
1 foobar xyz
3 baz 45
re.escape has the effect of escaping the special characters so they're treated literally.
re.escape(r'.foo^')
# '\\.foo\\^'
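To see why escaping matters, here is a short sketch with a hypothetical term that contains a metacharacter. The Series and the term are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical data: the search term "a.c" contains the regex metacharacter ".".
s = pd.Series(["a.c", "abc", "xyz"])
terms = ["a.c"]

# Unescaped: "." matches any character, so "abc" is a hit as well.
unescaped = s.str.contains("|".join(terms))
# Escaped: "." only matches a literal dot.
escaped = s.str.contains("|".join(map(re.escape, terms)))

print(unescaped.tolist(), escaped.tolist())
```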
Matching Entire Word(s)
By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).
For example,
df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3
col
0 the sky is blue
1 bluejay by the window
Now consider,
df3[df3['col'].str.contains('blue')]
col
0 the sky is blue
1 bluejay by the window
v/s
df3[df3['col'].str.contains(r'\bblue\b')]
col
0 the sky is blue
Multiple Whole Word Search
Similar to the above, except we add a word boundary (\b) to the joined pattern.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]
col
0 foo abc
3 baz 45
Where p looks like this,
p
# '\\b(?:foo|baz)\\b'
A Great Alternative: Use List Comprehensions!
Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.
Instead of,
df1[df1['col'].str.contains('foo', regex=False)]
Use the in operator inside a list comp,
df1[['foo' in x for x in df1['col']]]
col
0 foo
1 foobar
Instead of,
regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]
Use re.compile (to cache your regex) + Pattern.search inside a list comp,
p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]
col
1 foobar
If "col" has NaNs, then instead of
df1[df1['col'].str.contains(regex_pattern, na=False)]
Use,
def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]
col
1 foobar
More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.
In addition to str.contains and list comprehensions, you can also use the following alternatives.
np.char.find
Supports substring searches (read: no regex) only.
df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]
col
0 foo abc
1 foobar xyz
np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True, True, False, False])
df1[f(df1['col'], 'foo')]
col
0 foo
1 foobar
Regex solutions possible:
regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]
col
1 foobar
DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.
df1.query('col.str.contains("foo")', engine='python')
col
0 foo
1 foobar
More information on query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.
Recommended Usage Precedence
(First) str.contains, for its simplicity and ease of handling NaNs and mixed data
List comprehensions, for their performance (especially if your data is purely strings)
np.vectorize
(Last) df.query
If anyone wonders how to solve a related problem, "Select column by partial string":
Use:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string matching, pass axis=0 to filter:
# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
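Note that filter looks at labels, not at cell values. A quick sketch on a hypothetical frame whose column names and index labels are made up for illustration:

```python
import pandas as pd

# Hypothetical frame: "hello" appears in two column names and one index label.
df = pd.DataFrame(
    {"hello_a": [1, 2], "b": [3, 4], "say_hello": [5, 6]},
    index=["hello_row", "other"],
)

cols = df.filter(like="hello")          # columns whose name contains "hello"
rows = df.filter(like="hello", axis=0)  # rows whose index label contains "hello"

print(list(cols.columns), list(rows.index))
```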
Quick note: if you want to do selection based on a partial string contained in the index, try the following:
df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
Should you need to do a case insensitive search for a string in a pandas dataframe column:
df[df['A'].str.contains("hello", case=False)]
Say you have the following DataFrame:
>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
a b
0 hello hello world
1 abcd defg
You can always use the in operator in a lambda expression to create your filter.
>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0 True
1 False
dtype: bool
The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
You can try treating the values as strings:
df[df['A'].astype(str).str.contains("Hello|Britain")]
Suppose we have a column named "ENTITY" in the dataframe df. We can filter df to keep only the rows whose "ENTITY" value doesn't contain "DM" by using a mask as follows:
mask = df['ENTITY'].str.contains('DM')
df = df.loc[~(mask)].copy(deep=True)
Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.
def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for idx, record in df[colName].items():
        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf
Using contains didn't work well for my string with special characters. Find worked though.
df[df['A'].str.find("hello") != -1]
A more generalised example - if looking for parts of a word OR specific words in a string:
df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
Specific parts of sentence or word:
searchfor = '.*cat.*hat.*|.*the.*dog.*'
Create a column showing the affected rows (you can always filter on it as necessary):
df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)
col1 col2 TrueFalse
0 cat andhat 1000.0 True
1 hat 2000000.0 False
2 the small dog 1000.0 True
3 fog 330000.0 False
4 pet 330000.0 False
Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in the subset of them. In this case, the following code will help.
df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]
Warning. This method is relatively slow, albeit convenient.
Somewhat similar to @cs95's answer, but here you don't need to specify an engine:
df.query('A.str.contains("hello").values')
Earlier answers already accomplish what was asked, but I would like to show the most general way:
df.filter(regex=".*STRING_YOU_LOOK_FOR.*")
This way you get the columns you are looking for, however their names are written.
(Obviously, you have to write the proper regex for each case.)
My 2c worth:
I did the following:
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = np.where(
    sale_method['Sale Method'].isin(['PRIVATE']),
    'private',
    np.where(
        sale_method['Sale Method'].str.contains('AUCTION'),
        'auction',
        'other',
    ),
)
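Nested np.where calls get hard to read once there are more than two categories. A possible alternative is np.select, sketched here on a hypothetical sample of the 'Sale Method' column (model_data itself is not shown in the answer). Missing values are assumed to fall into 'other'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the 'Sale Method' column.
sale_method = pd.DataFrame(
    {"Sale Method": ["Private", "Online Auction", "Tender", None]}
)
upper = sale_method["Sale Method"].str.upper()

# Conditions are checked in order; the first one that is True wins.
conditions = [
    upper.isin(["PRIVATE"]).to_numpy(),
    upper.str.contains("AUCTION", na=False).to_numpy(),
]
choices = ["private", "auction"]

sale_method["sale_classification"] = np.select(conditions, choices, default="other")
print(sale_method)
```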
