This question already has answers here:
Pandas Passing Variable Names into Column Name
(3 answers)
Closed 4 years ago.
I am trying to read the unique values for the columns in a list, but I am unable to substitute the variable in a way that turns it into a command. If I run c_data.ABC.unique() directly, I get the list of unique values in the ABC column. Please suggest what is going wrong.
import pandas as pd

c_data = pd.read_csv("/home/fileName.csv")
list = ['ABC', 'DEF']
for f in list:
    cl = "c_data.{}.unique()".format(f)
    print(cl)
Output:
c_data.ABC.unique()
c_data.DEF.unique()
You should definitely check out these indexing basics in pandas. As for your question, you can use the most basic indexing with brackets [] and the column name as a string, for example c_data['ABC'], so you can iterate like this:
c_data = pd.read_csv("/home/fileName.csv")
list = ['ABC', 'DEF']
for f in list:
    print(c_data[f].unique())
If you want/need to use the format method, you can just replace the column name with a formatted string:
c_data = pd.read_csv("/home/fileName.csv")
list = ['ABC', 'DEF']
for f in list:
    print(c_data['{0}'.format(f)].unique())
Also, you can use bracket indexing with a list of strings, which will give you another DataFrame. Then you can iterate over the DataFrame itself, which will give you the column names:
c_data = pd.read_csv("/home/fileName.csv")
f_data = c_data[['ABC', 'DEF']]
for f in f_data:
    print(f_data[f].unique())
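If you really do want attribute-style access driven by a string (which is what the format-string attempt was reaching for), getattr also works; this is just a sketch using the question's column names, and bracket indexing remains the idiomatic choice:
for f in ['ABC', 'DEF']:
    # getattr(c_data, f) is equivalent to c_data.ABC / c_data.DEF
    print(getattr(c_data, f).unique())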
I have a DataFrame (df). The DataFrame contains a string column called supported_cpu. The supported_cpu data is a comma-separated string. I want to use this data for an ML model.
I had to get the unique values for the supported_cpu column; the output is a list of unique values.
def pars_string(df, col):
    # Separate the strings in the column using split
    data = df[col].value_counts().reset_index()
    data['index'] = data['index'].str.split(",")
    # Create a list including all of the items, which are separated by commas
    df_01 = []
    for i in range(data.shape[0]):
        for j in data['index'][i]:
            df_01.append(j)
    # Get the unique values from df_01
    list_01 = list(set(df_01))
    # Strip leading/trailing spaces from list_01 so duplicates collapse correctly
    list_02 = [x.strip(' ') for x in list_01]
    # Get the unique values from list_02
    list_03 = list(set(list_02))
    return list_03

supported_cpu_list = pars_string(df=df, col='supported_cpu')
The output (shown as an image in the original post) is a list of unique CPU values.
I want to map this output back onto the DataFrame to encode it for the ML model. How could I store the output in the DataFrame? Note: some rows have multiple values (more than one CPU).
Input: a string, separated by commas.
Output: I do not know what it should be.
I really recommend that anyone who is starting to use pandas read about vectorization and thinking in terms of columns (aka Series). That is the way it was built and the way it is supposed to be used.
From what I understand (I may be wrong), you want to get the unique values from the supported_cpu column. You can use the Series string methods to split that particular column, then flatten the resulting lists using itertools.chain:
from itertools import chain

df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')
unique_vals = set(chain(*df['supported_cpu'].tolist()))
unique_vals = (item for item in unique_vals if item)
Multi-value rows should be parsed into single values for later ML model training. The list can be converted to a DataFrame simply with pd.DataFrame(supported_cpu_list).
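One way to map those unique values back onto the DataFrame for an ML model is multi-hot encoding with Series.str.get_dummies. This is only a sketch, assuming supported_cpu holds comma-separated names; the sample values below are made up:
import pandas as pd

# Hypothetical data: each row may list more than one CPU, comma-separated
df = pd.DataFrame({'supported_cpu': ['i5, i7', 'i7', 'ryzen5, i5']})

# One indicator column per unique CPU: 1 where the row mentions it, 0 otherwise
encoded = (df['supported_cpu']
           .str.split(',')
           .apply(lambda cpus: '|'.join(c.strip() for c in cpus))
           .str.get_dummies())

df = df.join(encoded)
print(df)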
Using a CSV extract from a registration system, I am attempting to format the data for use as a contact/distribution list import into a virtual meeting application. Using the following function, I am able to pull the needed data into a nested list ([name1, email1], [name2, email2], ...).
import pandas as pd

def createDistributionList():
    with open(fileOpen) as readFile, open('test2.txt', 'w') as writeFile:
        data = pd.read_csv(readFile)
        df = pd.DataFrame(data, columns=['Attendee Name', 'Attendee Email'])
        distList = df.values.tolist()
        print(' '.join(map(str, distList)))
The format I need the data in is one long string - name1(email1);name2(email2);...
I have been unable to get the output that I am looking for. Any assistance or a pointer to a relevant reference would be greatly appreciated.
You can use a list comprehension for that:
tup = (["name1", "email1"], ["name2", "email2"], ["name3", "email3"])
print(";".join(["{}({})".format(l[0], l[1]) for l in tup]))
I have the following DataFrame:
And I have the following list:
I want to replace the Series value of team_stat['First Half']['W'] with the list value of first_half_win_result.
Try the following code; it converts the list to a Series for pandas:
team_stat['First Half']['W'] = pd.Series(first_half_win_result)
Well, I found the solution:
team_stat = team_stat.transpose()
team_stat.loc['First Half', 'W'] = first_half_win_result
team_stat = team_stat.transpose()
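As a self-contained sketch of that pattern, assuming team_stat has two-level columns (which is how team_stat['First Half']['W'] reads) and using made-up numbers: transposing turns ('First Half', 'W') into a row label, so .loc can assign the whole list in one step:
import pandas as pd

# Hypothetical stats table with two-level columns
cols = pd.MultiIndex.from_product([['First Half', 'Second Half'], ['W', 'L']])
team_stat = pd.DataFrame([[1, 0, 2, 1],
                          [0, 1, 1, 2],
                          [3, 2, 0, 0]], columns=cols)
first_half_win_result = [10, 20, 30]  # hypothetical replacement values, one per row

team_stat = team_stat.transpose()
team_stat.loc[('First Half', 'W')] = first_half_win_result
team_stat = team_stat.transpose()
print(team_stat)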
This question already has answers here:
Pandas: get second character of the string, from every row
(2 answers)
Closed 4 years ago.
I have a data frame and want to parse the 9th character into a second column. I'm missing the syntax somewhere though.
import pandas as pd

# develop the data
df = pd.DataFrame(columns=["vin"],
                  data=['LHJLC79U58B001633', 'SZC84294845693987', 'LFGTCKPA665700387', 'L8YTCKPV49Y010001',
                        'LJ4TCBPV27Y010217', 'LFGTCKPM481006270', 'LFGTCKPM581004253', 'LTBPN8J00DC003107',
                        '1A9LPEER3FC596536', '1A9LREAR5FC596814', '1A9LKEER2GC596611', '1A9L0EAH9C596099',
                        '22A000018'])
df['manufacturer'] = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'D', 'D']

def check_digit(df):
    df['check_digit'] = df['vin'][8]
    print(df['check_digit'])
For some reason, this puts the 8th row VIN in every line.
In your code, this line:
df['check_digit'] = df['vin'][8]
only selects the element at index 8 of the 'vin' column (and assigns it to every row). Try this instead:
for i in range(len(df['vin'])):
    df.loc[i, 'check_digit'] = df['vin'][i][8]
As a rule of thumb, whenever you are stuck, simply check the type of the variable returned. It solves a lot of small problems.
EDIT: As pointed out by @Georgy in the comments, using a loop isn't pythonic; a more efficient way of solving this would be:
df['check_digit'] = df['vin'].str[8]
The .str accessor does the trick. For future reference, I think you would find this helpful.
The correct way is:
def check_digit(df):
    df['check_digit'] = df['vin'].str[8]
    print(df)
This question already has answers here:
How to convert index of a pandas dataframe into a column
(9 answers)
Closed 1 year ago.
I'm not sure where I've gone astray, but I cannot seem to reset the index on a DataFrame.
When I run test.head(), I get the output below:
As you can see, the dataframe is a slice, so the index is out of bounds.
What I'd like to do is reset the index for this DataFrame. So I run test.reset_index(drop=True). This outputs the following:
That looks like a new index, but it isn't. Running test.head() again, the index is still the same. Attempting to use apply with a lambda or iterrows() creates problems with the DataFrame.
How can I really reset the index?
reset_index by default does not modify the DataFrame; it returns a new DataFrame with the reset index. If you want to modify the original, use the inplace argument: df.reset_index(drop=True, inplace=True). Alternatively, assign the result of reset_index by doing df = df.reset_index(drop=True).
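A minimal sketch of both options (the data here is made up, just to show a non-sequential index being replaced):
import pandas as pd

test = pd.DataFrame({'a': [10, 20, 30]}, index=[5, 7, 9])  # non-sequential index, e.g. from a slice

# Option 1: assign the returned DataFrame back
test = test.reset_index(drop=True)

# Option 2: modify in place instead
# test.reset_index(drop=True, inplace=True)

print(test.index)  # RangeIndex(start=0, stop=3, step=1)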
BrenBarn's answer works.
The following also worked, via this thread, which isn't troubleshooting so much as an articulation of how to reset the index:
test = test.reset_index(drop=True)
As an extension of in code veritas's answer... instead of doing del at the end:
test = test.reset_index()
del test['index']
You can set drop to True.
test = test.reset_index(drop=True)
I would add to in code veritas's answer:
If you already have an index column specified, then you can save the del, of course. In my hypothetical example:
df_total_sales_customers = pd.DataFrame({'Sales': total_sales_customers['Sales'],
                                          'Customers': total_sales_customers['Customers']},
                                         index=total_sales_customers.index)
df_total_sales_customers = df_total_sales_customers.reset_index()
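A short illustration of why the del isn't needed in that case (the data is made up): reset_index() moves a named index into a column with that name rather than a generic 'index' column, so you usually want to keep it:
import pandas as pd

df = pd.DataFrame({'Sales': [100, 200]},
                  index=pd.Index(['north', 'south'], name='Region'))
df = df.reset_index()
print(df.columns.tolist())  # ['Region', 'Sales']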