Pandas print unique values as string - python-3.x

I've got a list of unique values from a selected column in a pandas DataFrame. What I want to achieve is to print the result as a string.
import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
Output: ['A' 'C' 'B']
Desired output: A, C, B
So far I've tried the following:
Got this error: AttributeError: 'numpy.ndarray' object has no attribute 'to_string'
Got this: b'\xf0\x04\xa6P\x9e\x01\x00\x000\xaf\x92P\x9e\x01\x00\x00\xb0\xaf\x92P\x9e\x01\x00\x00'
Can anyone give a hint?

import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(', '.join(a)) # or print(*a, sep=', ')
A, C, B
EDIT: To store as variable:
text = ', '.join(a)

This should work:
print(', '.join(a))

py3 solution
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(*a, sep=", ")
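Note that ', '.join(a) only works when every unique value is already a string. A minimal sketch, assuming a hypothetical mixed-type column named 'B' (not part of the question), that converts each value with str first:

```python
import pandas as pd

# Hypothetical mixed-type column; 'B' is an illustrative name, not from the question.
df = pd.DataFrame({'B': ['A', 3, 'A', 2.5, 3]})

# unique() keeps first-appearance order; map(str, ...) makes every item joinable.
text = ', '.join(map(str, df['B'].unique()))
print(text)
```

The print(*a, sep=', ') form does not need this conversion, because print calls str on each argument itself.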


Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
expected output
I have tried this:
from itertools import groupby
arr = ['a', 'a', 'b', 'a', 'a', 'c','c']
print([x[0] for x in groupby(arr)])
How do I remove the duplicate entries in each row and column of dataframe?
a,a,b,c should be a,b,c
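From what I understand, you want to drop values that repeat consecutively. You can try this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]  # assumption: res keeps each value that differs from its predecessor
    return ','.join(res.values)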
    col1   col2
0    a,b    a,b
1  a,c,b  a,b,a
Another function can be created with itertools.groupby, for example:
from itertools import groupby
def myfunc(x):
    l = [x[0] for x in groupby(x.split(','))]
    return ','.join(l)
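A sketch of applying such a function to every cell of the example frame. The helper name drop_consecutive_dups is illustrative. It uses DataFrame.apply with Series.map, which should behave the same on pandas versions where applymap is deprecated in favour of DataFrame.map:

```python
from itertools import groupby

import pandas as pd

def drop_consecutive_dups(cell, sep=','):
    """Collapse runs of equal neighbours in a delimited string."""
    return sep.join(key for key, _ in groupby(cell.split(sep)))

df = pd.DataFrame({'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']})

# Apply the helper cell by cell, one column at a time.
result = df.apply(lambda col: col.map(drop_consecutive_dups))
print(result)
```

Only adjacent repeats are removed, so 'a,b,b,a' keeps its trailing 'a'.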
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
    split = string.split(',')  # split string into a list
    uniques = set(split)  # remove duplicate list elements (order is not preserved)
    return ','.join(uniques)  # rejoin the list elements into a string
result = df.applymap(remove_dups)
This returns:
    col1 col2
0    a,b  a,b
1  a,c,b  a,b
Edit: This looks slightly different from your expected output. Why do you expect a,b,a for the second row in col2?
Edit2: To preserve the original order, you can replace the set() function with unique_everseen():
from more_itertools import unique_everseen
uniques = unique_everseen(split)
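If installing more_itertools is not an option, a standard-library sketch of the same idea is dict.fromkeys, which keeps the first occurrence of each key in insertion order (guaranteed from Python 3.7). The function name remove_dups_keep_order is illustrative:

```python
def remove_dups_keep_order(string):
    parts = string.split(',')                # split string into a list
    return ','.join(dict.fromkeys(parts))    # dict keys are unique and keep insertion order

print(remove_dups_keep_order('a,c,c,b'))
print(remove_dups_keep_order('a,b,b,a'))
```

Like the set() version, this removes every repeat, not only consecutive ones, so 'a,b,b,a' becomes 'a,b'.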

Pandas checks with prefix and more checksum if searched prefix exists or no data

I have the code snippet below, which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
Current output:
sj00      sj12  cr00      cr08      eu00       eu50
sj000001        cr000011  crn00001  euk000011  eu5000011
sj000002        cr000012  crn00002  eu0000012  eu5000013
sj000003        cr000013  crn00003  eu0000013  eu5000014
sj000004        cr000014  crn00004  eu0000014  eu5000015
What's expected:
1) The code works, but as the current output shows, the second column has no values and still appears. How can I check whether a column is empty and, if so, drop it from the output?
2) Can we check whether the prefixes exist in the DataFrame before processing, to avoid an error?
Any help is appreciated.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve the first part. For the second part, you can use reindex:
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note that dropna(axis=1, how='all'), with axis=1 rather than axis=0, is identical to what I proposed for question 1.
Many thanks to Quang Hoang for the hints on the post. As a workaround, I got it working as follows until I get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12`, extract only the values that start with `sj12` immediately followed by a word character, like `sj12[a-z]`
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# Drop rows where every value is NaN, then replace NaN with '' in the remaining columns
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# Drop the index name
df = df.rename_axis(None)
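The two steps (tolerating prefixes that are absent, then dropping empty columns) can be sketched on a small in-memory frame. The data and the 'zz99' prefix are made up for illustration and stand in for the pivoted result of new_hosts:

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted frame standing in for the result of the pivot step.
df = pd.DataFrame({
    'sj00': ['sj000001', 'sj000002'],
    'sj12': [np.nan, np.nan],          # present but entirely empty
    'cr00': ['cr000011', np.nan],
})

prefixes = ['sj00', 'sj12', 'cr00', 'zz99']   # 'zz99' does not exist in the frame

out = (
    df.reindex(columns=prefixes)       # absent prefixes become all-NaN columns instead of raising KeyError
      .dropna(axis=1, how='all')       # drop columns that hold no values
      .dropna(axis=0, how='all')       # drop rows that hold no values
      .fillna('')                      # blank out the remaining gaps for display
)
print(out)
```

With df[prefixes] the missing 'zz99' label would raise a KeyError, which is why reindex is used for the selection here.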

How to skip over np.nan while iterating through a dataframe for sentiment analysis

I have a data frame with 201279 entries, the last column is labeled "text" with customer reviews. The problem is that most of them are missing values, and come up as NaN.
I read some interesting information from this question:
Python numpy.nan and logical functions: wrong results
and I tried applying it to my problem:
Index(['id', 'sku', 'title', 'reviewCount', 'commentCount', 'averageRating',
'date', 'time', 'ProductName', 'CountOfBigTransactions', 'ClassID',
'Weight', 'Width', 'Depth', 'Height', 'LifeCycleName', 'FinishName',
'Color', 'Season', 'SizeOrUtility', 'Material', 'CountryOfOrigin',
'Quartile', 'display-name', 'online-flag', 'long-description', 'text'],
I tried experimenting by doing this:
df['firstName'][202360]== np.nan
which returns False but indeed that index contains an np.nan.
So I looked for an answer, read through the question I linked, and saw that
is a true statement. I thought, okay, I can run with this.
So, here's my code so far:
from textblob import TextBlob
import string
def remove_num_punct(aText):
    p = string.punctuation
    d = string.digits
    j = p + d
    table = str.maketrans(j, len(j) * ' ')
    return aText.translate(table)
#Process text
aList = []
for text in df1['text']:
    if np.bool(df1['text'])==True:
        b = remove_num_punct(text)
        pol = TextBlob(b).sentiment.polarity
Then I would just convert aList with the sentiment to pd.DataFrame and join it to df1, then impute the missing values with K-nearest neighbors.
My problem is that the little routine I made throws a value error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So I'm not really sure what else to try. Thanks in advance!
EDIT: I have tried this:
i = 0
aList = []
for txt in df1['text'].isnull():
i += 1
if txt == True:
which correctly populates the list with NaN.
But this gives me a different error:
i = 0
aList = []
for txt in df1['text'].isnull():
if txt == True:
b = remove_num_punct(df1['text'][i])
pol = TextBlob(b).sentiment.polarity
AttributeError: 'float' object has no attribute 'translate'
Which doesn't make sense, since if it is not NaN, then it contains text, right?
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})
df1 = df['toy']
for i in range(len(df1)):
    if not df1[i]:
        df2 = df1.drop(i)
You can try this approach to handle text that is null.
I fixed it, I had to move the i += 1 back from the else indentation to the for indentation:
i = 0
aList = []
for txt in df1['text'].isnull():
if txt == True:
b = remove_num_punct(df1['text'][i])
pol = TextBlob(b).sentiment.polarity
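Because NaN compares unequal to everything, including itself, a test like == np.nan can never find it. pandas' notna()/isna() are the usual way to detect it. A minimal sketch of scoring only the rows that hold text, without a manual counter. The score function is a hypothetical stand-in for TextBlob(text).sentiment.polarity so the example runs without TextBlob, and the sample reviews are made up:

```python
import numpy as np
import pandas as pd

def score(text):
    # Hypothetical stand-in for TextBlob(text).sentiment.polarity.
    return float(len(text.split()))

# Made-up reviews; NaN and None both count as missing.
df1 = pd.DataFrame({'text': ['great chair', np.nan, 'bad', None]})

# Boolean mask: True where the row actually contains text.
mask = df1['text'].notna()

# Start with NaN everywhere, then fill in scores only for rows with text.
df1['polarity'] = np.nan
df1.loc[mask, 'polarity'] = df1.loc[mask, 'text'].map(score)
print(df1)
```

Rows without text keep NaN in the new column and stay aligned with the original index, so they can still be imputed afterwards.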

PySpark: Search For substrings in text and subset dataframe

I am brand new to pyspark and want to translate my existing pandas / python code to PySpark.
I want to subset my dataframe so that only rows that contain the specific keywords I'm looking for in the 'original_problem' field are returned.
Below is the Python code I tried in PySpark:
def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df
When I try to run the above, I get the following error:
AnalysisException: u"Can't extract value from original_problem#207:
need struct type but got string;"
In pyspark, try this:
df = df[df['original_problem'].rlike('|'.join(searchfor))]
Or equivalently:
import pyspark.sql.functions as F
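Alternatively, you could use a UDF:
import pyspark.sql.functions as F
searchfor = ['cat', 'dog', 'frog', 'fleece']
# keep the text if it contains any keyword as a substring, else mark it as 'Not_present'
check_udf = F.udf(lambda x: x if x is not None and any(k in x for k in searchfor) else 'Not_present')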
df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
But the DataFrame methods are preferred because they will be faster.
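Since '|'.join(searchfor) is interpreted as a regular expression, keywords containing regex metacharacters need escaping before they are joined. A sketch of the idea using Python's re module on plain strings. The keyword 'c++' and the sample rows are hypothetical, and Spark's rlike uses Java regex syntax, so the escaped pattern should be checked against it before relying on it:

```python
import re

searchfor = ['cat', 'dog', 'c++']   # 'c++' is a hypothetical keyword with regex metacharacters

# Escape each keyword so it is matched literally, then build the alternation.
pattern = '|'.join(re.escape(word) for word in searchfor)

rows = ['my cat is ill', 'written in c++', 'nothing here']
matches = [row for row in rows if re.search(pattern, row)]
print(matches)
```

Without the escaping step, 'c++' would not compile as a pattern in Python's re, because the second '+' has nothing to repeat.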

How to write a Pandas Dataframe into a HDF5 dataset

I'm trying to write data from a Pandas dataframe into a nested hdf5 file, with multiple groups and datasets within each group. I'd like to keep it as a single file which will grow in the future on a daily basis. I've had a go with the following code, which shows the structure of what I'd like to achieve
import h5py
import numpy as np
import pandas as pd
file = h5py.File('database.h5','w')
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
groups = ['A','B','C']
for m in groups:
    group = file.create_group(m)
    dataset = ['1','2','3']
    for n in dataset:
        data = df
        ds = group.create_dataset(m + n, data.shape)
        print("Dataset dataspace is", ds.shape)
        print("Dataset Numpy datatype is", ds.dtype)
        print("Dataset name is", ds.name)
        print("Dataset is a member of the group", ds.parent)
        print("Dataset was created in the file", ds.file)
        print("Writing data...")
        ds[...] = data
        print("Reading data back...")
        data_read = ds[...]
        print("Printing data...")
        print(data_read)
This way the nested structure is created but it loses the index and columns. I've tried the
df.to_hdf('database.h5', ds, table=True, mode='a')
but didn't work, I get this error
AttributeError: 'Dataset' object has no attribute 'split'
Can anyone shed some light please. Many thanks
df.to_hdf() expects a string as a key parameter (second parameter):
key : string
identifier for the group in the store
so try this:
df.to_hdf('database.h5', ds.name, table=True, mode='a')
where ds.name should return a string (the key name):
In [26]: ds.name
Out[26]: '/A1'
I decided to have a go with pandas/PyTables and the HDFStore class instead of h5py, so I tried the following:
import numpy as np
import pandas as pd
db = pd.HDFStore('Database.h5')
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])
groups = ['A','B','C']
i = 1
for m in groups:
    subgroups = ['d','e','f']
    for n in subgroups:
        db.put(m + '/' + n, df, format = 'table', data_columns = True)
It works: 9 groups (groups instead of datasets in PyTables, as opposed to h5py?) are created, from A/d to C/f. Columns and indexes are preserved, and I can do the DataFrame operations I need. I'm still wondering, though, whether this is an efficient way to retrieve data from a specific group which will become huge in the future, i.e. operations like
