Pandas print unique values as string - python-3.x

I've got a list of unique value from selected column in pandas dataframe. What I want to achieve is to print the result as string.
import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(a)
Output: ['A' 'C' 'B']
Desired output: A, C, B
So far I've tried below,
print(a.to_string())
Got this error: AttributeError: 'numpy.ndarray' object has no attribute 'to_string'
print(a.tostring())
Got this: b'\xf0\x04\xa6P\x9e\x01\x00\x000\xaf\x92P\x9e\x01\x00\x00\xb0\xaf\x92P\x9e\x01\x00\x00'
Can anyone give a hint.

import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(', '.join(a)) # or print(*a, sep=', ')
Prints:
A, C, B
EDIT: To store as variable:
text = ', '.join(a)
print(text)

This should work:
print(', '.join(a))

py3 solution
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(*a, sep=", ")

Related

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
pd.DataFrame(data=d)
expected output
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried like this :
arr = ['a', 'a', 'b', 'a', 'a', 'c','c']
print([x[0] for x in groupby(arr)])
How do I remove the duplicate entries in each row and column of dataframe?
a,a,b,c should be a,b,c
From what I understand, you don't want to include values which repeat in a sequence, you can try with this custom function:
def myfunc(x):
s=pd.Series(x.split(','))
res=s[s.ne(s.shift())]
return ','.join(res.values)
print(df.applymap(myfunc))
col1 col2
0 a,b a,b
1 a,c,b a,b,a
Another function can be created with itertools.groupby such as :
from itertools import groupby
def myfunc(x):
l=[x[0] for x in groupby(x.split(','))]
return ','.join(l)
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
split = string.split(',') # split string into a list
uniques = set(split) # remove duplicate list elements
return ','.join(uniques) # rejoin the list elements into a string
result = df.applymap(remove_dups)
This returns:
col1 col2
0 a,b a,b
1 a,c,b a,b
Edit: This looks slightly different to your expected output, why do you expect a,b,a for the second row in col2?
Edit2: to preserve the original order, you can replace the set() function with unique_everseen()
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)

Pandas checks with prefix and more checksum if searched prefix exists or no data

I have below code snippet which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
sj000001
sj000002
sj000003
sj000004
sj124000
sj125000
sj126000
sj127000
sj128000
sj129000
sj130000
sj131000
sj132000
cr000011
cr000012
cr000013
cr000014
crn00001
crn00002
crn00003
crn00004
euk000011
eu0000012
eu0000013
eu0000014
eu5000011
eu5000013
eu5000014
eu5000015
Current output:
sj00 sj12 cr00 cr08 eu00 eu50
sj000001 cr000011 crn00001 euk000011 eu5000011
sj000002 cr000012 crn00002 eu0000012 eu5000013
sj000003 cr000013 crn00003 eu0000013 eu5000014
sj000004 cr000014 crn00004 eu0000014 eu5000015
What's expected:
1) As code works fine but as you see the current output the second column don't have any values but still appearing So, how could i have a checksum if a particular column don't have any values then remove that from display.
2) Can we place a check for the prefixes if they exists in the dataframe before processing to avoid the error.
Appreciate any help.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve your first part. Second part can be reindex?
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note the axis=1, not axis=0 is identical to what I propose for question 1.
Much thanks to Quang Hoang for the hints on the post, Just for the workaround, i got it working as follows until i get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12` only extract the values having `sj12` and a should be a word immediately after that like `sj12[a-z]`
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# again drop if all values in the row are nan and replace nan to empty for live columns
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# drop the index field
df = df.rename_axis(None)
print(df)

How to skip over np.nan while iterating through a dataframe for sentiment analysis

I have a data frame with 201279 entries, the last column is labeled "text" with customer reviews. The problem is that most of them are missing values, and come up as NaN.
I read some interesting information from this question:
Python numpy.nan and logical functions: wrong results
and I tried applying it to my problem:
df1.columns
Index(['id', 'sku', 'title', 'reviewCount', 'commentCount', 'averageRating',
'date', 'time', 'ProductName', 'CountOfBigTransactions', 'ClassID',
'Weight', 'Width', 'Depth', 'Height', 'LifeCycleName', 'FinishName',
'Color', 'Season', 'SizeOrUtility', 'Material', 'CountryOfOrigin',
'Quartile', 'display-name', 'online-flag', 'long-description', 'text'],
dtype='object')
I tried experimentingby doing this:
df['firstName'][202360]== np.nan
which returns False but indeed that index contains an np.nan.
So I looked for an answer, read through the question I linked, and saw that
np.bool(df1['text'][201279])==True
is a true statement. I thought, okay, I can run with this.
So, here's my code so far:
from textblob import TextBlob
import string
def remove_num_punct(aText):
p = string.punctuation
d = string.digits
j = p + d
table = str.maketrans(j, len(j)* ' ')
return aText.translate(table)
#Process text
aList = []
for text in df1['text']:
if np.bool(df1['text'])==True:
aList.append(np.nan)
else:
b = remove_num_punct(text)
pol = TextBlob(b).sentiment.polarity
aList.append(pol)
Then I would just convert aList with the sentiment to pd.DataFrame and join it to df1, then impute the missing values with K-nearest neighbors.
My problem is that the little routine I made throws a value error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So I'm not really sure what else to try. Thanks in advance!
EDIT: I have tried this:
i = 0
aList = []
for txt in df1['text'].isnull():
i += 1
if txt == True:
aList.append(np.nan)
which correctly populates the list with NaN.
But this gives me a different error:
i = 0
aList = []
for txt in df1['text'].isnull():
if txt == True:
aList.append(np.nan)
else:
b = remove_num_punct(df1['text'][i])
pol = TextBlob(b).sentiment.polarity
aList.append(pol)
i+=1
AttributeError: 'float' object has no attribute 'translate'
Which doesn't make sense, since if it is not NaN, then it contains text, right?
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [5, 6, np.NaN],
'born': [pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
'name': ['Alfred', 'Batman', ''],
'toy': [None, 'Batmobile', 'Joker']})
df1 = df['toy']
for i in range(len(df1)):
if not df1[i]:
df2 = df1.drop(i)
df2
you can try in this way to deal the text which is null
I fixed it, I had to move the i += 1 back from the else indentation to the for indentation:
i = 0
aList = []
for txt in df1['text'].isnull():
if txt == True:
aList.append(np.nan)
else:
b = remove_num_punct(df1['text'][i])
pol = TextBlob(b).sentiment.polarity
aList.append(pol)
i+=1

PySpark: Search For substrings in text and subset dataframe

I am brand new to pyspark and want to translate my existing pandas / python code to PySpark.
I want to subset my dataframe so that only rows that contain specific key words I'm looking for in 'original_problem' field is returned.
Below is the Python code I tried in PySpark:
def pilot_discrep(input_file):
df = input_file
searchfor = ['cat', 'dog', 'frog', 'fleece']
df = df[df['original_problem'].str.contains('|'.join(searchfor))]
return df
When I try to run the above, I get the following error:
AnalysisException: u"Can't extract value from original_problem#207:
need struct type but got string;"
In pyspark, try this:
df = df[df['original_problem'].rlike('|'.join(searchfor))]
Or equivalently:
import pyspark.sql.functions as F
df.where(F.col('original_problem').rlike('|'.join(searchfor)))
Alternatively, you could go for udf:
import pyspark.sql.functions as F
searchfor = ['cat', 'dog', 'frog', 'fleece']
check_udf = F.udf(lambda x: x if x in searchfor else 'Not_present')
df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
But the DataFrame methods are preferred because they will be faster.

How to write a Pandas Dataframe into a HDF5 dataset

I'm trying to write data from a Pandas dataframe into a nested hdf5 file, with multiple groups and datasets within each group. I'd like to keep it as a single file which will grow in the future on a daily basis. I've had a go with the following code, which shows the structure of what I'd like to achieve
import h5py
import numpy as np
import pandas as pd
file = h5py.File('database.h5','w')
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
groups = ['A','B','C']
for m in groups:
group = file.create_group(m)
dataset = ['1','2','3']
for n in dataset:
data = df
ds = group.create_dataset(m + n, data.shape)
print ("Dataset dataspace is", ds.shape)
print ("Dataset Numpy datatype is", ds.dtype)
print ("Dataset name is", ds.name)
print ("Dataset is a member of the group", ds.parent)
print ("Dataset was created in the file", ds.file)
print ("Writing data...")
ds[...] = data
print ("Reading data back...")
data_read = ds[...]
print ("Printing data...")
print (data_read)
file.close()
This way the nested structure is created but it loses the index and columns. I've tried the
df.to_hdf('database.h5', ds, table=True, mode='a')
but didn't work, I get this error
AttributeError: 'Dataset' object has no attribute 'split'
Can anyone shed some light please. Many thanks
df.to_hdf() expects a string as a key parameter (second parameter):
key : string
identifier for the group in the store
so try this:
df.to_hdf('database.h5', ds.name, table=True, mode='a')
where ds.name should return you a string (key name):
In [26]: ds.name
Out[26]: '/A1'
I thought to have a go with pandas\pytables and the HDFStore class instead of h5py. So I tried the following
import numpy as np
import pandas as pd
db = pd.HDFStore('Database.h5')
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])
groups = ['A','B','C']
i = 1
for m in groups:
subgroups = ['d','e','f']
for n in subgroups:
db.put(m + '/' + n, df, format = 'table', data_columns = True)
It works, 9 groups (groups instead of datasets in pyatbles instead fo h5py?) created from A/d to C/f. Columns and indexes preserved and can do the dataframe operations I need. Still wondering though whether this is an efficient way to retrieve data from a specific group which will become huge in the the future i.e. operations like
db['A/d'].Col1[4:]

Resources