Pandas add column if exists in index - python-3.x

I have several CSV files that differ from each other in terms of columns.
I know exactly which columns I want to see, but I am not sure that every file has all of those columns.
In my previous code, I had this sort of filtering in place:
keep_col = ['code', '#timestamp', 'message', 'name','ID', 'deviceAction']
Now I have implemented a for loop to go through all the CSV files in a folder and run some scripts against those files.
Here is where I am facing the issue.
As the CSV structure may vary, I can't keep keep_col that static, and I needed to add some extra columns,
like this:
keep_col = ['code', '#timestamp', 'message', 'name','ad.loginName','sourceServiceName','ad.destinationHosts','ID', 'deviceAction']
Unfortunately, my script fails because the new columns I added are not in every CSV's index. Fair enough, so I decided to put an if statement in place,
as follows:
if 'ad.loginName' and 'sourceServiceName' and 'ad.destinationHosts' in f.index.values:
    keep_col = ['Code', '#timestamp', 'message', 'name','ad.loginName','sourceServiceName','ad.destinationHosts','ID', 'deviceAction']
else:
    keep_col = ['Code', '#timestamp', 'message', 'name','ID', 'deviceAction']
I tried both AND and OR, and the output was wrong in both cases. Here is why:
OR: fails to run, because at least one condition has to be true and my first file does not have any of those columns.
AND: runs, but does not report the columns, because not all 3 conditions are true, so none of those 3 fields is reported.
Please, can any of you help me solve this?
I would like the script to check whether any of those columns exist and write the ones that do; any that do not exist in the index should just be ignored so the script can move on.
Thank you very much, and if you need more info just let me know.

I think you mean to check whether all of the columns you mention are present. Column labels live in f.columns rather than f.index.values:
if all(col in f.columns
       for col in ['ad.loginName', 'sourceServiceName',
                   'ad.destinationHosts']):
Or, in other words, check whether that set is a subset of f.columns:
if set(['ad.loginName', 'sourceServiceName',
        'ad.destinationHosts']) <= set(f.columns):
Or, back to your original problem, you want:
keep_col = ['Code', '#timestamp', 'message', 'name','ad.loginName','sourceServiceName','ad.destinationHosts','ID', 'deviceAction']
keep_col = [col for col in keep_col if col in f.columns]
Hint: a non-empty string is always truthy, so 'some string' and 'some other string' counts as true on its own. Can you spot what you did wrong in your code?
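To make the hint above concrete, here is a minimal sketch in plain Python (no pandas needed) of how the original condition is parsed. The `columns` list is a made-up stand-in for `f.columns` in a file that happens to lack 'ad.loginName'.

```python
# Hypothetical stand-in for f.columns; 'ad.loginName' is deliberately missing.
columns = ['Code', '#timestamp', 'message', 'name', 'sourceServiceName',
           'ad.destinationHosts', 'ID', 'deviceAction']

# Python reads the original condition as:
#   'ad.loginName' and 'sourceServiceName' and ('ad.destinationHosts' in columns)
# Non-empty strings are truthy, so only the last membership test matters.
chained = bool('ad.loginName' and 'sourceServiceName'
               and 'ad.destinationHosts' in columns)

# Testing every name explicitly gives the intended answer.
wanted = ['ad.loginName', 'sourceServiceName', 'ad.destinationHosts']
explicit = all(col in columns for col in wanted)

# Filtering keeps whichever columns exist, in keep_col order.
keep_col = ['Code', '#timestamp', 'message', 'name', 'ad.loginName',
            'sourceServiceName', 'ad.destinationHosts', 'ID', 'deviceAction']
present = [col for col in keep_col if col in columns]

print(chained)   # True, even though 'ad.loginName' is missing
print(explicit)  # False
print(present)
```

The filtered list is usually what you want here, since it keeps the optional columns that do exist instead of dropping all three when one is missing.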

Related

Replace a column of numbers with the associated label from another text file using the index on the text file

I'm trying to loop through a column of numbers and replace each one with the associated label from another text file, using the index of the lines in the text file. But it keeps returning an error.
#conditions.txt has the labels while 'Reason for absence' has the numbers
with open('conditions.txt') as f:
    lines = f.readlines()
ds = pd.DataFrame(lines)
index=ds.index
print(index)
newname=df['Reason for absence'].copy()
for value in newname:
    for i in index:
        if value in newname == index :
            newname=ds
This issue seems to be related to another posted question. While the approaches are different, you are possibly looking for the same end result. Take a look at the provided solution here. If that solution does not help, please post the error message.
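For reference, one possible way to do the replacement itself, sketched under assumptions the question does not confirm: conditions.txt holds one label per line, and each number in 'Reason for absence' is the zero-based line index of its label. The labels and numbers below are invented, and io.StringIO stands in for the real file.

```python
import io

import pandas as pd

# Stand-in for open('conditions.txt'); the labels are invented sample data.
conditions_file = io.StringIO("No reason\nFlu\nBack pain\n")
labels = [line.strip() for line in conditions_file]

# Invented sample of the numeric column.
df = pd.DataFrame({'Reason for absence': [2, 0, 1, 2]})

# Build {line index: label} and translate the whole column at once.
# Numbers with no matching line become NaN instead of raising an error.
lookup = dict(enumerate(labels))
df['Reason label'] = df['Reason for absence'].map(lookup)

print(df)
```

If the numbers in the real data start at 1 rather than 0, `dict(enumerate(labels, start=1))` would be the matching lookup.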

Python Warning Panda Dataframe "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

First post and total Python novice, so please be patient with my slow understanding!
I have a dataframe containing a list of transactions in order of transaction date.
I've appended a new field/column called ["DB/CR"] that, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' in the absence of "-".
Since the transactions are in date order, I've included another new field/column called ["Top x"]. In it I want to populate an incremental number (starting at 1), counted independently for debits and for credits.
To do that, I have created a simple loop with an associated 'if' / 'elif' statement (I could probably use else, as it's binary). It loops through the dataset from row 0 to the last row in the df and, depending on whether the row is 1) "Debit" or 2) "Credit", increments that type's own counter: integer 'i' for "Debit" and integer 'ii' for "Credit".
The code works as expected in terms of the 'Top x' output; however, I always receive the warning "A value is trying to be set on a copy of a slice from a DataFrame".
I'd like my script to run without any warnings, and I've been trying to understand what I'm doing incorrectly, but I'm not getting it for my use case.
I'd appreciate it if someone could shed light on, or propose, how the code needs to be refactored to avoid this warning.
Code (the df source data is an imported csv):
#top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i+1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii+1
        df["Top x"][ind] = ii
Interpreter
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc[ind, "Top x"] = i (row label first, then column) instead of the chained df["Top x"][ind] = i.
Use iterrows() to iterate over the DataFrame. However, updating a DataFrame while iterating over it is not recommended; see the documentation for iterrows():
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
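A minimal sketch of two ways around the warning, using an invented four-row frame in place of the imported CSV. The first keeps the loop but writes through a single .loc[row, column] call; the second drops the loop and numbers the rows per group with groupby(...).cumcount(). The column names follow the question, except "Top x (vectorized)", which is added here only for comparison.

```python
import pandas as pd

# Invented sample data standing in for the imported CSV.
df = pd.DataFrame({
    "Amount": ["-10.00", "25.00", "-3.50", "-7.25"],
    "DB/CR": ["Debit", "Credit", "Debit", "Debit"],
})

# Option 1: same loop, but one .loc[row, column] assignment per write
# instead of the chained df["Top x"][ind] = ...
df["Top x"] = 0
counters = {"Debit": 0, "Credit": 0}
for ind in df.index:
    kind = df.loc[ind, "DB/CR"]
    counters[kind] += 1
    df.loc[ind, "Top x"] = counters[kind]

# Option 2: no loop. cumcount() numbers the rows of each group from 0
# in their existing order, so adding 1 gives the running count per type.
df["Top x (vectorized)"] = df.groupby("DB/CR").cumcount() + 1

print(df)
```

Both columns should end up identical; the second form avoids row-by-row writes entirely, which tends to matter on larger statements.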

In Ruby, how would one create new CSV's conditionally from an original CSV?

I'm going to use this as sample data to simplify the problem:
data_set_1
I want to split the contents of this csv according to Column A - DEPARTMENT and place them on new csv's named after the department.
If it were done in the same workbook (so it can fit in one image) it would look like:
data_set_2
My initial thought was something pretty simple like:
CSV.foreach('test_book.csv', headers: true) do |asset|
  CSV.open("/import_csv/#{asset[1]}", "a") do |row|
    row << asset
  end
end
Since that should take care of the logic for me. However, from looking into it, CSV#foreach does not accept file access rights as second parameter, and it gets an error when I run it. Any help would be appreciated, thanks!
I don't see why you would need to pass file access rights to CSV#foreach. This method just reads the CSV. How I would do this is like so:
require 'csv'

# Parse the entire CSV into an array.
orig_rows = CSV.parse(File.read('test_book.csv'), headers: true)
# Group the rows by department.
# This becomes { 'deptA' => [<rows>], 'deptB' => [<rows>], etc }
groups = orig_rows.group_by { |row| row[1] }
# Write each group of rows to its own file
groups.each do |dept, rows|
  CSV.open("/import_csv/#{dept}.csv", "w") do |csv|
    rows.each do |row|
      # fields returns the row's values as a plain array
      csv << row.fields
    end
  end
end
A caveat, though. This approach does load the entire CSV into memory, so if your file is very large, it wouldn't work. In that case, the "streaming" approach (line-by-line) that you show in your question would be preferable.

Reordering data by manipulating column wise in Python

I have data in a csv file as follows:
60,27702,1938470,13935,18513,8
60,32424,1933740,16103,15082,11
60,20080,1946092,9335,14970,2
60,28236,1937936,13799,16871,6
60,22717,1943455,10809,16726,4
120,37702,2938470,23935,28513,8
120,42424,2933740,26103,25082,11
120,30080,2946092,2335,24970,2
120,38236,2937936,23799,26871,6
120,32717,2943455,20809,26726,4
180,47702,3938470,33935,8513,8
180,52424,3933740,36103,5082,11
180,40080,3946092,3335,4970,2
180,48236,3937936,33799,6871,6
180,42717,3943455,30809,6726,4
I then used the following code to insert column heading:
df = pd.read_csv("contikiMAC_new_out.csv", names=['Energest','CPU','LPM','Transmit','Listen','ID'])
I used df.groupby(['ID']) to view the data grouped by column 'ID'.
The problem is that the data in column 'LPM' gets reset after some time, so whenever the new value in the 'LPM' column is smaller than the previous one for a specific 'ID', I would like to add the previous value to the new value.
I tried doing:
for x in df.groupby(['ID']):
    for i in df.ID:
        if (df.loc[i, 'LPM'] < df.loc[i - 1, 'LPM']):
            df.loc[i, 'LPM'] = df.loc[i, 'LPM'] + df.loc[i - 1, 'LPM']
But I am not getting the result I want, because it mixes the 'LPM' values of different 'ID's and the process takes a long time. Can anyone please suggest a way to write the data group-wise to a csv file based on 'ID' after performing the sum operation?
The data structure I like to see is as follows:
60,27702,1938470,13935,18513,8
120,37702,2938470,23935,28513,8
180,47702,3938470,33935,37026,8
60,32424,1933740,16103,15082,11
120,42424,2933740,26103,25082,11
180,52424,3933740,36103,30164,11
60,20080,1946092,9335,14970,2
120,30080,2946092,2335,24970,2
180,40080,3946092,3335,29940,2
60,28236,1937936,13799,16871,6
120,38236,2937936,23799,26871,6
180,48236,3937936,33799,33742,6
60,22717,1943455,10809,16726,4
120,32717,2943455,20809,26726,4
180,42717,3943455,30809,33452,4
If I understood your problem correctly, DataFrame.shift is what you're looking for.
Something like:
df['LPM_prev'] = df.groupby(['ID'])['LPM'].shift(1)
And then you can work with that column
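Building on that shifted column, here is a sketch of the full adjustment. It assumes at most one reset per ID, as in the sample, because the comparison uses the original previous value rather than an already adjusted one. It is applied to 'Listen' because that is the column that actually changes in the expected output shown in the question (the text says 'LPM'); swap in whichever column resets in your data. Only three of the five IDs are included to keep it short.

```python
import io

import pandas as pd

# Subset of the question's data (IDs 8, 11 and 2 only).
raw = """60,27702,1938470,13935,18513,8
60,32424,1933740,16103,15082,11
60,20080,1946092,9335,14970,2
120,37702,2938470,23935,28513,8
120,42424,2933740,26103,25082,11
120,30080,2946092,2335,24970,2
180,47702,3938470,33935,8513,8
180,52424,3933740,36103,5082,11
180,40080,3946092,3335,4970,2
"""
names = ['Energest', 'CPU', 'LPM', 'Transmit', 'Listen', 'ID']
df = pd.read_csv(io.StringIO(raw), names=names)

col = 'Listen'  # assumption: the counter that resets

# Previous value within the same ID; 0 for the first row of each ID.
prev = df.groupby('ID')[col].shift(1, fill_value=0)

# A reset shows up as a value smaller than its predecessor.
was_reset = df[col] < prev
df[col] = df[col] + prev.where(was_reset, 0)

# Regroup rows by ID in order of first appearance, keeping row order per ID.
out = pd.concat([group for _, group in df.groupby('ID', sort=False)])

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = out.to_csv(header=False, index=False)
print(csv_text)
```

Passing a file name to `to_csv` instead would write the grouped result to disk. If a counter can reset more than once per ID, a per-group running offset would be needed rather than this single comparison.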

Replace all error values in all columns after importing data (while keeping the rows)

An Excel table used as a data source may contain error values (#NA, #DIV/0), which can disturb later steps of the transformation process in Power Query.
Depending on the following steps, we may get an error instead of any output. So how should these cases be handled?
I found two standard steps in Power Query to catch them:
Remove errors (UI: Home/Remove Rows/Remove Errors) -> all rows with an error will be removed
Replace error values (UI: Transform/Replace Errors) -> the columns first have to be selected before performing this operation.
The first possibility is not a solution for me, since I want to keep the rows and just replace the error values.
In my case, the data table will change over time, meaning column names may change (e.g. years) or new columns may appear. So the second possibility is too static, since I do not want to change the script each time.
So I've tried to find a dynamic way to clean all columns, independent of the column names (and the number of columns). It replaces the errors with a null value.
let
    Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
    //Remove errors of all columns of the data source. ColumnName doesn't play any role
    Cols = Table.ColumnNames(Source),
    ColumnListWithParameter = Table.FromColumns({Cols, List.Repeat({""}, List.Count(Cols))}, {"ColName" as text, "ErrorHandling" as text}),
    ParameterList = Table.ToRows(ColumnListWithParameter ),
    ReplaceErrorSource = Table.ReplaceErrorValues(Source, ParameterList)
in
    ReplaceErrorSource
Here are the messages from the three different queries after I added two new columns (with errors) to the source:
If anybody has another solution for this kind of data cleaning, please post it here.
let
    src = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
    cols = Table.ColumnNames(src),
    replace = Table.ReplaceErrorValues(src, List.Transform(cols, each {_, "!"}))
in
    replace
Just for Power Query novices like me:
"!" could be any string used as a substitute for error values. I initially thought it was a wildcard.
List.Transform(cols, each {_, "!"}) generates the per-column list of error replacements for the main function:
Table.ReplaceErrorValues(table_with_errors, {{col1,error_str1},{col2,error_str2},{},{}, ...,{coln,error_strn}})
Nice elegant solution, Sergei
