Pandas: how to treat the content of certain columns as lists - python-3.x

Let's say I have a simple pandas dataframe named df :
   0          1
0  a  [b, c, d]
I save this dataframe into a CSV file as follows:
df.to_csv("test.csv", index=False, sep="\t", encoding="utf-8")
Then, later in my script, I read this CSV back:
df = pd.read_csv("test.csv", index_col=False, sep="\t", encoding="utf-8")
Now I want to use explode() on column '1', but it does not work: after the round trip through the CSV file, the content of column '1' is a string, not a list.
So far I have tried to change the type of column '1' to list with astype(), without any success.
Thank you in advance.

Try this. Since you are reading from a CSV file, the values in your dataframe's column A (column '1' in your case) are essentially strings, so you need to parse them back into lists.
import ast
import pandas as pd

# Column A holds stringified lists, as it would after a CSV round trip
df = pd.DataFrame({"A": ["['a','b']", "['c']"], "B": [1, 2]})
# Safely evaluate each string back into a real Python list
df["A"] = df["A"].apply(ast.literal_eval)
Now the following works:
df.explode("A")

Related

Remove columns starting with same special character in a csv file using Python

My CSV file has the columns below:
AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove columns starting with "__mv".
I saw some posts where pandas is used to filter out columns.
Is it possible to do it using the csv module in Python? If yes, how?
Also, with pandas, what regex should I give:
df.filter(regex='')
df.to_csv(output_file_path)
P.S. I am using Python 3.8
You mean with standard Python? You can use a list comprehension, e.g.
import csv

with open('data.csv', 'r') as f:
    DataGenerator = csv.reader(f)
    Header = next(DataGenerator)
    Header = [Col.strip() for Col in Header]
    Data = list(DataGenerator)

# Drop a trailing empty row, if the file ends with a newline
if Data[-1] == []:
    del Data[-1]

# Keep only the cells whose header does not start with "__mv"
Data = [[Row[i] for i in range(len(Header)) if not Header[i].startswith("__mv")] for Row in Data]
Header = [Col for Col in Header if not Col.startswith("__mv")]
However, this is just a simple example. You'll probably have further things to consider, e.g. what types your CSV columns have, whether you want to read all the data at once like I do here or one-by-one from the generator to save memory, etc.
You could also use the built-in filter function instead of the inner list comprehension.
Also, if you have numpy installed and you want something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice (personally I prefer them to pandas anyway). Numpy also has its own CSV-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/).
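A rough sketch of that numpy route (np.genfromtxt and multi-field indexing are standard numpy; how well the dtypes are inferred depends on your data):
import numpy as np

# names=True takes the field names from the header row; dtype=None lets numpy infer types
data = np.genfromtxt("data.csv", delimiter=",", names=True, dtype=None, encoding="utf-8")
keep = [name for name in data.dtype.names if not name.startswith("__mv")]
cleaned = data[keep]  # structured array restricted to the kept fields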
You don't need to use .filter for that. You can just find out which columns those are and then drop them from the DataFrame:
import pandas as pd
# Load the dataframe (In our case create a dummy one with your columns)
df = pd.DataFrame(columns=["AFM_reversal_indicator", "Alert_Message", "axiom_key", "_timediff", "player", "__mv_Splunk_Alert_Id", "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])
# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]
# Drop the columns
df = df.drop(columns=mv_columns)
# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The mv_columns list comprehension iterates through the columns in your DataFrame and picks those that start with "__mv". Then .drop simply removes them from the DataFrame.
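If you do want the .filter one-liner the question asks about, a negative lookahead should do it (a sketch; filter(regex=...) keeps the columns whose names match the pattern):
# Keep every column whose name does not start with "__mv"
df = df.filter(regex=r"^(?!__mv)")
df.to_csv(output_file_path)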
If for some reason you want to use the csv module only, the solution might not be as elegant as with pandas, but here is a suggestion:
import csv

with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)

    # Locate the "__mv" columns from the header row
    header_row = next(reader)
    mv_columns = [col for col in header_row if col.startswith("__mv")]
    mv_column_indices = [header_row.index(col) for col in mv_columns]

    # Write the cleaned header, then filter every data row by index
    new_header_row = [col for col in header_row if col not in mv_columns]
    writer.writerow(new_header_row)
    for row in reader:
        new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
        writer.writerow(new_row)
So first you read the first row, which is supposed to be your header. With similar logic as before, you find the columns that start with "__mv" and get their indices. You write the new header, without the "__mv" columns, to the output file. Then you iterate through the rest of the CSV and remove those columns from each row as you go.

Converting Pandas DataFrame OrderedSet column into list

I have a Pandas DataFrame in which one column is an OrderedSet, like this:
df
OrderedSetCol
0 OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])
This is:
from ordered_set import OrderedSet
I am just trying to convert this column into a list:
df['OrderedSetCol_list'] = df['OrderedSetCol'].apply(lambda x: ast.literal_eval(str("\'" + x.replace('OrderedSet(','').replace(')','') + "\'")))
The code executes successfully, but my column is still of type str, not list:
type(df.loc[0]['OrderedSetCol_list'])
str
What am I doing wrong?
EDIT: My OrderedSetCol is actually a string column, as I am reading the file from disk; it was originally saved from an OrderedSet column.
Expected Output:
[1721754, 3622558, 2550234, 2344034, 8550040]
You can apply the list constructor, just like you would call list() on an OrderedSet itself:
import pandas as pd
from ordered_set import OrderedSet
df = pd.DataFrame({'OrderedSetCol': [OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])]})
df.OrderedSetCol.apply(list)
Output:
[1721754, 3622558, 2550234, 2344034, 8550040]
If your column is of string type, extract the numbers instead:
df.OrderedSetCol.str.findall(r'\d+')
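Note that .str.findall returns lists of strings; a small follow-up sketch, assuming you want the integers shown in the expected output:
df['OrderedSetCol_list'] = df.OrderedSetCol.str.findall(r'\d+').apply(lambda xs: [int(x) for x in xs])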

Merge duplicate rows in a text file using python based on a key column

I have a CSV file and I need to merge records of rows based on a key column, Name.
a.csv
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|234|a-vf|34
Mahesh|4554|a-bg|45
Keren|344|s-bg|45
yankie|999|z-bg|34
yankie|3453|g-bgbbg|45
Expected output: records are merged based on Name, so the values from both rows for Mahesh and yankie are combined:
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|[234,4554]|[a-vf,a-bg]|[34,45]
Keren|344|s-bg|45
yankie|[999,3453]|[z-bg,g-bgbbg]|[34,45]
Can someone help me with this in Python?
import pandas as pd

df = pd.read_csv("a.csv", sep="|", dtype=str)
# For each Name, collect the unique values of every other column into a list
# (single-row groups are left as plain scalars)
new_df = df.groupby('Name', as_index=False).aggregate(lambda tdf: tdf.unique().tolist() if tdf.shape[0] > 1 else tdf)
new_df.to_csv("data.csv", index=False, sep="|")
Output:
Name|Acc#|ID|Age
Keren|344|s-bg|45
Mahesh|['234', '4554']|['a-vf', 'a-bg']|['34', '45']
Suresh|2345|a-b2|24
yankie|['999', '3453']|['z-bg', 'g-bgbbg']|['34', '45']
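Note that groupby sorts the keys by default, which is why Keren now comes first. If the original row order matters, sort=False is a standard groupby parameter (a sketch):
new_df = df.groupby('Name', as_index=False, sort=False).aggregate(
    lambda tdf: tdf.unique().tolist() if tdf.shape[0] > 1 else tdf)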

How to read comma separated data into data frame without column header?

I am getting a comma-separated data set as bytes which I need to:
1. Convert into a string from bytes.
2. Create a CSV (I can skip this if there is any way to jump straight to step 3).
3. Format and read as a data frame without converting the first row into column names.
(Later I will be using this df to compare with an Oracle DB output.)
Input data:
val = '-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
type(val)
I managed to get to step 3, but my first row becomes the header. I am fine with giving the columns any value as headers, e.g. a, b, c, ...
# 1. Code I tried to convert bytes to str
strval = val.decode('ascii').strip()

# 2. Code to create the csv. First I created a blank csv and later appended the data
import csv
import pandas as pd

abc = ""
with open('csvfile.csv', 'w') as csvOutput:
    testData = csv.writer(csvOutput)
    testData.writerow(abc)
with open('csvfile.csv', 'a') as csvAppend:
    csvAppend.write(val)

# 3. Now converting it into a dataframe
df = pd.read_csv('csvfile.csv')
# hdf = pd.read_csv('csvfile.csv', column=none) -- this gives NameError: name 'none' is not defined
Output: displaying df shows the first data row being used as the header.
According to the read_csv documentation, it should be enough to add header=None as a parameter:
df = pd.read_csv('csvfile.csv', header=None)
In this way the header line will be interpreted as a row of data. If you want to exclude that line, then you need to add the skiprows=1 parameter:
df = pd.read_csv('csvfile.csv', header=None, skiprows=1)
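If you would rather have the letter column names mentioned in the question, names is a standard read_csv parameter (a sketch; the list must have one name per column, ten in this sample):
df = pd.read_csv('csvfile.csv', header=None, names=list('abcdefghij'))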
You can do it without saving to a CSV file, like this; you don't need to convert the bytes to a string or save them to a file.
Here val is of type bytes; if it is of type str, as in your example, you can use io.StringIO instead of io.BytesIO.
import pandas as pd
import io
val = b'-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
buf_bytes = io.BytesIO(val)
pd.read_csv(buf_bytes, header=None)
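For completeness, the string version of val from the question works the same way with io.StringIO (a sketch, truncated to one row):
import io
import pandas as pd

val = '-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n'
df = pd.read_csv(io.StringIO(val), header=None)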

pandas read_csv function fails

I am trying to convert the following CSV into a dataframe, simply using:
import pandas as pd
ticket = pd.read_csv("file.csv")
However, due to the missing quotation marks on the first column of the CSV:
A,"B","C","D"
0,"1","2","3"
it fails to assign each row properly to its rightful header.
import pandas as pd
df = pd.read_csv('your_csv.csv')
In [1]: df
Out[1]:
   A  "B"  "C"  "D"
0  0  "1"  "2"  "3"
Seems to work for me with the given data placed into a .csv; can you be more specific about the error?
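If the literal quote characters are ending up in your headers and values, a possible cleanup sketch (assuming you just want them stripped after reading):
# Strip stray double quotes from the column names and from string cells
df.columns = [c.strip('"') for c in df.columns]
df = df.apply(lambda col: col.str.strip('"') if col.dtype == object else col)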
