Pandas object - save to .csv - python-3.x

I have a pandas DataFrame df and I would like to save it to a .csv file:
df.to_csv('output.csv', index = False)
The data frame displays correctly when printed in the terminal, but in the *.csv file some rows are shifted several columns to the right. I do not know how to reproduce this in a minimal working example: when I exported only the one problematic column, the resulting *.csv was correct. What should I check, please? The whole column contains strings.
After advice:
selected['SpType'] = selected['SpType'].str.replace('\t', '')
I got a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
selected['SpType'] = selected['SpType'].str.replace('\t', '')

If the tabs are the problem, you could just replace all tabs.
If the tabs occur in column column_name you could do something like:
df['column_name'] = df['column_name'].str.replace('\t', '')
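If the problem is in several columns, you could loop over all string columns, e.g.:
# .str only works on string (object) columns, so skip numeric ones
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.replace('\t', '')
df.to_csv('output.csv', index=False)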
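The warning quoted in the question usually means that selected was created by slicing another DataFrame. The question does not show how selected was built, so the following is only a sketch under that assumption; the column names other than SpType and all the values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical data: two SpType values contain a stray tab character.
df = pd.DataFrame({
    "Name": ["HD 1", "HD 2", "HD 3"],
    "SpType": ["G2\tV", "K0III", "A1\tV"],
    "Vmag": [4.8, 6.1, 5.5],
})

# Assumption: `selected` is a filtered subset of df. Taking an explicit
# .copy() makes it an independent frame, so assigning a column to it
# does not trigger SettingWithCopyWarning.
selected = df[df["Vmag"] < 6.0].copy()

# regex=False treats "\t" as a literal tab rather than a pattern.
selected["SpType"] = selected["SpType"].str.replace("\t", "", regex=False)

# Write to an in-memory buffer here; pass a file name such as
# 'output.csv' instead to write to disk.
buffer = io.StringIO()
selected.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

If the tabs are meaningful and should be kept, another option is to leave the data alone and check which delimiter the program opening the file is using.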

Related

split time series dataframe when value change

I have a DataFrame that corresponds to the lat/long of a moving object.
This object goes from one place to another, and I created a column that records which place it is at every second.
I want to split that DataFrame so that when the object stays in one place and then leaves for another, I get two separate DataFrames.
'None' means it is between places.
My current code:
def cut_df2(df):
    df_copy = df.copy()
    # check if change of place
    df_copy['changed'] = df_copy['place'].ne(df_copy['place'].shift().bfill()).astype(int)
    last = 0
    dfs = []
    for num, line in df_copy.iterrows():
        if line.changed:
            dfs.append(df.iloc[last:num, :])
            last = num
    # Check if last line was in a place
    if line.place != 'None':
        dfs.append(df.iloc[last:, :])
    df_outs = []
    # Delete empty dataframes
    for num, dataframe in enumerate(dfs):
        if not dataframe.empty:
            if dataframe.reset_index().place.iloc[0] != 'None':
                df_outs.append(dataframe)
    return df_outs
It works on simple examples but not on a big dataset, and I have no idea why. Can anyone help me?
Try using this instead:
https://www.geeksforgeeks.org/split-pandas-dataframe-by-rows/
iloc can be a good way to split a DataFrame. For example, this splits it by column position:
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
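One possible reason the function in the question fails on a larger dataset is that iterrows yields index labels while iloc expects integer positions, and the two only coincide when the index is 0, 1, 2, and so on. This is a guess, since the large dataset is not shown. The sketch below avoids positions altogether by labelling each run of consecutive identical place values and grouping on that label; the function name and the sample data are hypothetical.

```python
import pandas as pd

def split_by_place(df, column="place", between="None"):
    """Split df into one DataFrame per consecutive run of the same place."""
    # A new run starts wherever the value differs from the previous row;
    # the cumulative sum turns those change points into run numbers.
    run_id = df[column].ne(df[column].shift()).cumsum()
    # Keep only runs spent at a place, dropping the in-between runs.
    return [
        chunk
        for _, chunk in df.groupby(run_id)
        if chunk[column].iloc[0] != between
    ]

# Hypothetical track whose index deliberately does not start at 0.
track = pd.DataFrame(
    {"place": ["A", "A", "None", "B", "B", "None", "A"]},
    index=range(100, 107),
)
parts = split_by_place(track)
```

Each element of parts keeps its original index, so the pieces can still be matched back to the rows of the full DataFrame.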

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map to (in the labels_df) culture::french, tag::dogs, tag::men. I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am wayyy off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from @Danny:
sample_train_df['attribute_ids'].apply(lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num'] for i in x])
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
This is super ugly, and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need:
split train_df['attribute_ids'] into their own cell/column
helper_df = train_df['attribute_ids'].str.split(expand=True)
combine train_df with the helper_df so I have the id column (they are photo id's)
train_df2 = pd.concat([train_df, helper_df], axis=1)
drop the original attribute_ids column
train_df2.drop(columns = 'attribute_ids', inplace=True)
rename the new columns
train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4', 4: 'attr5', 5: 'attr6',
                                      6: 'attr7', 7: 'attr8', 8: 'attr9', 9: 'attr10', 10: 'attr11'})
convert the labels_df into a dictionary
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
map and replace the tag numbers with their corresponding tag names
my_map = create_file_mapping(labels_df)
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
create a new column of the observations tags in a list of concatenated values
helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis = 1)
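A shorter route to the same result might be to build the id-to-name dictionary once and translate each row's ids in place, without creating helper columns. This is only a sketch: it assumes attribute_ids holds space-separated ids as a string (which the str.split call above suggests), the label column is assumed to be called attribute_id (the thread uses both attribute_id and attribute_id_num), and the sample rows are invented.

```python
import pandas as pd

# Hypothetical stand-ins for labels.csv and train.csv.
labels_df = pd.DataFrame({
    "attribute_id": [147, 616, 813],
    "attribute_name": ["culture::french", "tag::dogs", "tag::men"],
})
train_df = pd.DataFrame({
    "id": ["photo_a", "photo_b"],
    "attribute_ids": ["147 616 813", "616"],
})

# Build the lookup once; keys are strings because splitting yields strings.
id_to_name = dict(
    zip(labels_df["attribute_id"].astype(str), labels_df["attribute_name"])
)

# Split each cell into ids and translate them; unknown ids are kept as-is.
train_df["attribute_names"] = train_df["attribute_ids"].str.split().apply(
    lambda ids: [id_to_name.get(i, i) for i in ids]
)
```

A dictionary lookup per id is much cheaper than filtering labels_df once per id, which is likely where the long run times mentioned above come from.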

change the column value based on the multiple columns

I am developing code to search for keywords in given data. For example, I have data in column A and I want to check whether a keyword is present as a substring in each row. If it is, give me that keyword next to the data; if no keyword is present, give me 'blank'.
import pandas as pd
data = pd.read_excel("C:/Users/606736.CTS/Desktop/Keyword.xlsx")
# Drop rows with null values to avoid errors
data.dropna(inplace=True)
# Convert the column to uppercase
data["Uppercase"] = data["Skill"].str.upper()
# Keywords I want to search for in the data
sub = ['MEMORY', 'PASSWORD', 'DISK', 'LOGIN', 'RESET']
# The code below creates one column per keyword and gives me boolean output
for keyword in sub:
    data[keyword] = data.astype(str).sum(axis=1).str.contains(keyword)
# What I want: if a keyword exists in the row, give me the keyword name, else blank
Try this:
data['Keyword'] = None  # object column, so keyword strings can be assigned later
for i in sub:
    data.loc[(data['Uppercase'].apply(lambda x: i in x.split(' '))) & (data['Keyword'].isna()), 'Keyword'] = i
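The same lookup can probably be done without a loop by building one regular expression from the keyword list and using str.extract. The sketch below assumes the text lives in a Skill column as in the question; the sample rows are invented.

```python
import re
import pandas as pd

keywords = ["MEMORY", "PASSWORD", "DISK", "LOGIN", "RESET"]

# Hypothetical sample rows standing in for the Excel file.
data = pd.DataFrame(
    {"Skill": ["Password reset needed", "disk is full", "Printer offline"]}
)

# One alternation pattern with word boundaries, e.g. \b(MEMORY|PASSWORD|...)\b
pattern = r"\b(" + "|".join(map(re.escape, keywords)) + r")\b"

# extract returns the first match in each string, or NaN when nothing
# matches; fillna("") turns the NaN into a blank.
data["Keyword"] = (
    data["Skill"].str.upper().str.extract(pattern, expand=False).fillna("")
)
```

Note one difference from the loop above: when a row contains several keywords, this keeps the one that appears first in the text, whereas the loop keeps the one that comes first in the keyword list.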

How to save tuples output form for loop to DataFrame Python

I have some data: 33k rows x 57 columns.
Some columns contain data that I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
I have a problem saving the tuple output from the for loop.
I am using tuples to build the translation correctly. .join and .append do not work in my case. I have tried many variations without any success.
I am looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = (tuple(slownik.get(znak) for znak in row["1st_service"]))
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it. Use an object column so that a cell can hold a tuple:
data['translated'] = None
The problem is that the row object you're writing to is generally a copy of the row, not the dataframe itself, so the assignment never reaches data. The generator expression inside tuple() is fine as written; square brackets are optional there. So change your last line to:
data.at[index, "translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
and you'll get a tuple written into that one cell (.at sets a single cell, whereas .loc may try to spread the tuple's items over several cells).
In future, posting the exact error message you're getting is very helpful!
I have managed it; the working code is below:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if this is done well?
Maybe there is a faster way to do it?
I am doing it for 12 columns in one for loop, as shown above.
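On the question of speed, one option that avoids both iterrows and the string round trip is to apply the translation to each column with Series.map. This is a sketch only: the dictionary contents, the second column name and the sample values are hypothetical, since the real ones are not shown above.

```python
import pandas as pd

# Hypothetical dictionary and data; the real ones are not shown in the post.
slownik = {"a": "alpha", "b": "beta"}
data = pd.DataFrame({
    "1st_service": ["ab", "ba", "a"],
    "2nd_service": ["b", "ab", "aa"],
})

def translate(text):
    """Translate a string character by character into a tuple."""
    return tuple(slownik.get(znak) for znak in text)

# Series.map calls translate once per cell and returns a new column,
# so the result is assigned back instead of written row by row.
for column in ["1st_service", "2nd_service"]:
    data[column] = data[column].map(translate)
```

The cells then hold real tuples rather than text fragments produced by split(')('), so nothing has to be parsed back afterwards.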

Write values from Pandas DataFrame columns into tkinter TreeView/Table Columns

I want to write values from a DataFrame into a tkinter Treeview/table, but I am not able to do this.
my code:
# Set up the tkinter window.
root = Tk()
tree = ttk.Treeview(root)
# Take the file input from the user through a dialog box.
file = filedialog.askopenfile(parent=root, mode='rb', title='Choose a xlsx file')
# Read the Excel file selected by the user and create a dataframe from it.
xls = pd.read_excel(file)
df = pd.DataFrame(xls)
# Store all the column headings in the variable "df_col".
df_col = df.columns.values
# All the column names are generated dynamically.
tree["columns"] = (df_col)
counter = len(df)
# Loop to create the columns and give them headings from df_col.
for x in range(len(df_col)):
    tree.column(x, width=100)
    tree.heading(x, text=df_col[x])
    # Loop to print the values of the dataframe in the treeview columns.
    for i in range(counter):
        tree.insert('', 0, values=(df[df_col[x]]][i]))
It does not print the columns and shows KeyError: 0.
Output Required:
The first argument of tree.column() should be the column name, which you assigned with:
tree["columns"]=(df_col)
The problem is that you have named the columns using strings, but you are attempting to access them using integers in:
for x in range(len(df_col)):
    tree.column(x, width=100)
    tree.heading(x, text=df_col[x])
Above, you are attempting to access tree.column(0) instead of tree.column('Company'), hence the key error.
Try instead:
for x in range(len(df_col)):
    tree.column(df_col[x], width=100)
    tree.heading(df_col[x], text=df_col[x])
Note that df_col is an ndarray, not a dataframe, which is why df_col[x] works correctly (df[x] would give a key error). This is because df.columns.values returns an ndarray. As a side note, it may be a bit confusing to name an ndarray df_col.
There are also a few issues with your insert. The second argument should correspond to the index of the entry you wish to address. One solution is then to use a row index as the second argument, followed by a row label as text="rowLabel", followed by a list of values for the row:
tree.insert('', i, text=rowLabels[i], values=df.iloc[i,:].tolist())
Where rowLabels should be defined as whatever you want to use in the first column of the table. I would suggest using an index column from the spreadsheet here, if possible. It could be defined by:
rowLabels = df.iloc[:,indexColumn].tolist()
or:
rowLabels = df.index.tolist()
The latter is viable if df has named indices defined by a column during the spreadsheet import. In the former, indexColumn is an int referring to a column number in df that contains unique identifiers.
The option values=df.iloc[i,:].tolist() converts all columns of the ith row into a list, and, since we have passed an index value (the second argument) that gets larger, the call will insert a new row every loop (from the python tkinter docs entry on Treeview --> insert: "if index is greater than or equal to the current number of children, it is inserted at the end").
Finally, I am not sure if you did not post the end of your code, but, in order for the tree to show up, you will also need to use pack, grid, etc.
tree.pack()
or
tree.grid(row=0, column=0)
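Putting the pieces of this answer together, a minimal sketch could look like the following. The helper names treeview_payload and show_dataframe and the sample data are hypothetical; the window is only opened when show_dataframe is called, so the data preparation can be checked without a display.

```python
import pandas as pd

def treeview_payload(df):
    """Return column names and row values as plain lists for a Treeview."""
    columns = [str(name) for name in df.columns]
    rows = df.values.tolist()
    return columns, rows

def show_dataframe(df):
    """Open a window that displays df in a ttk.Treeview."""
    # tkinter is imported here so the helper above works without a display.
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    columns, rows = treeview_payload(df)
    # show="headings" hides the empty tree column on the left.
    tree = ttk.Treeview(root, columns=columns, show="headings")
    for name in columns:
        # Columns are addressed by the names they were created with.
        tree.column(name, width=100)
        tree.heading(name, text=name)
    for row in rows:
        # "end" appends each row after the ones already inserted.
        tree.insert("", "end", values=row)
    tree.pack(fill="both", expand=True)
    root.mainloop()

# Hypothetical data standing in for the Excel file.
sample = pd.DataFrame({"Company": ["Acme", "Globex"], "Employees": [12, 340]})
columns, rows = treeview_payload(sample)
```

Calling show_dataframe(sample) opens the window. In the original program the DataFrame would come from pd.read_excel on the file chosen in the dialog.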
References:
https://docs.python.org/3/library/tkinter.ttk.html#tkinter.ttk.Treeview
This helpful example makes a few of the steps clear:
https://knowpapa.com/ttk-treeview/
As I was reading over your code, I noticed that the last line has an extra bracket:
df[df_col[x]]]
for i in range(counter):
    tree.insert('', 0, values=(df[df_col[x]]][i]))
I would assume that explains the KeyError, although an unbalanced bracket as posted would normally raise a SyntaxError before the code runs.
