Write dataframe and strings in one single csv file - python-3.x

I want to export a dataframe (500 rows, 2 columns) from Python to a CSV file.
However, I need the first 20 rows to contain some text/strings, with the dataframe (500 rows, 2 columns) starting from the 21st row onwards.
I referred to the following link: Skip first rows when writing csv (pandas.DataFrame.to_csv). However, it does not satisfy my requirements.
Can somebody please let me know how to do this?

Get the first 20 rows and save them to another dataframe
Check if there are any null values
If there are no null values, remove the first 20 rows
Save df as a CSV file
df2 = df.head(20)
has_nulls = df2.isnull().values.any()
if not has_nulls:
    df = df[20:]  # drop the first 20 rows, as in the steps above
df.to_csv('updated.csv')
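For the original goal (20 lines of text first, then the dataframe from row 21), a minimal sketch; the file name, the notes, and the example dataframe are placeholders:
import pandas as pd

df = pd.DataFrame({'A': range(500), 'B': range(500)})   # placeholder dataframe
notes = ['note line %d' % i for i in range(1, 21)]      # placeholder 20 text lines

with open('report.csv', 'w', newline='') as f:
    for line in notes:
        f.write(line + '\n')       # rows 1-20: free text
    df.to_csv(f, index=False)      # row 21 onwards: the dataframe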

Related

Pandas read file incomplete column number comma separated file

Hello, I have a CSV file with no header, and I get an error while reading it. The data in the CSV looks like this:
School1,std1,std2
Schoo2,std3,std4,std5,std6,std7
School4,std1,std6
School6,std9,std10
Because of the incomplete columns I am not able to read it:
df = pd.read_csv("test.txt", sep=",", header=None)
Can anyone suggest how I can read this file?
If you know the number of columns of your largest row, you can just create a list to be the header. In your example, the row at index 1 has 6 columns, so:
import pandas as pd

col_names = [1, 2, 3, 4, 5, 6]
df = pd.read_csv("test.txt", sep=",", header=None, names=col_names)
If you're handling a huge number of rows and don't know the maximum number of columns, you can check this topic for a better solution:
import csv with different number of columns per row using Pandas
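A minimal sketch of that pre-scan idea, assuming the file is small enough to read twice:
import csv
import pandas as pd

# first pass: find the widest row so every column gets a name
with open('test.txt', newline='') as f:
    max_cols = max(len(row) for row in csv.reader(f))

# second pass: read with enough column names; shorter rows are padded with NaN
df = pd.read_csv('test.txt', sep=',', header=None, names=range(max_cols))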

Python3 - Delete Top Rows from CSV before headers

I have a CSV file that is a vulnerability report, and at the top of the CSV are 4 rows with one to two columns of text about the report, before the headers. I am trying to write a script that will delete these rows so that I can combine multiple files for calculations, reporting, etc.
I have tried using pandas to convert the CSV into a dataframe and then delete the rows, but because these top rows are not headers, the dataframe conversion fails. Any advice on how I can delete these top four rows from the CSV? Thanks!
Use the skiprows parameter of read_csv. For example:
import pandas as pd

# Skip the first 2 rows of the CSV and initialize a dataframe
usersDf = pd.read_csv('users.csv', skiprows=2)
# Skip rows at specific indexes
usersDf = pd.read_csv('users.csv', skiprows=[0, 2, 5])
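Applied to the question's reports (4 rows of text above the headers), a sketch for combining several files; the glob pattern and output name are assumptions:
import glob
import pandas as pd

# skip the 4 metadata rows in each report so the real headers are used
frames = [pd.read_csv(path, skiprows=4) for path in glob.glob('reports/*.csv')]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('combined.csv', index=False)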

How to append dataframe into csv file with columns order changed without iterating over each column?

I am trying to append new dataframes iteratively to a single CSV file. The problem is that the dataframes always have the same column names, but their order is random. Currently, I am using the following code to append new data to the CSV:
with open(name + '.csv', 'a') as f:
    df.to_csv(f, header=f.tell() == 0)
But this does not work when the column order changes. It keeps appending values in the order they come, without taking the headers into account. E.g., if the column order in the first dataframe is [A,B,C,D] and in the second dataframe the order is [D,C,A,B], the CSV becomes:
A,B,C,D
a,b,c,d
a,b,c,d
a,b,c,d
...
d,c,a,b
d,c,a,b
d,c,a,b
...
Any suggestions?
Use the reindex function:
with open(name + '.csv', 'a') as f:
    df.reindex(columns=list('ABCD')).to_csv(f, header=f.tell() == 0)
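If hard-coding the column order is undesirable, one variant (an assumption, not part of the answer) reads the order back from the header already in the file:
import os
import pandas as pd

path = name + '.csv'
if os.path.exists(path) and os.path.getsize(path) > 0:
    cols = pd.read_csv(path, nrows=0).columns      # column order already on disk
    df.reindex(columns=cols).to_csv(path, mode='a', header=False, index=False)
else:
    df.to_csv(path, index=False)                   # first write establishes the order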

How can I split the headers of data and the data itself to their own respective columns?

I have a file that has 312759 rows but only one column, with the different header names all packed into one row. I need to separate those rows into their own values and columns: the data frame is 312759 rows × 1 column, but I need 312759 rows × approx. 40 headers/columns. I am new to Python and to the Stack Overflow community, so any help would be appreciated.
Read the data using pandas:
import pandas as pd

read = pd.read_csv('output.csv')
read.drop(read.head(5).index, inplace=True)  # drop the first 5 rows
Then save it back as a .csv file:
read.to_csv("output2.csv")
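If the single column actually contains delimited text, a different sketch that splits it into columns; the ';' delimiter and the first line holding the ~40 header names are both assumptions:
import pandas as pd

# load each line as raw text so nothing is split yet (one column, as in the question)
with open('output.csv') as f:
    raw = pd.Series(f.read().splitlines())

parts = raw.str.split(';', expand=True)        # ';' delimiter is an assumption
parts.columns = parts.iloc[0]                  # promote the first line to header names
parts = parts.drop(index=0).reset_index(drop=True)
parts.to_csv('output2.csv', index=False)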

Appending blank rows to dataframe if column does not exist

This question is kind of odd and complex, so bear with me, please.
I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to filter out the relevant data. The issue is that not all of the CSV files have all of the columns I need (a property of the data system being used).
The problem is that, if the column doesn't exist in the file but is specified in usecols, read_csv throws an error.
Is there a straightforward way to force a specified column set in a dataframe and have pandas just return blank rows if the column doesn't exist? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.
Assuming some kind of master list all_cols_to_use, can you do something like:
import numpy as np
import pandas as pd

def parse_big_csv(csvpath):
    with open(csvpath, 'r') as infile:
        # read only the header line to learn which columns this file has
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    df.loc[:, missing_cols] = np.nan
    return df
This assumes that you're okay with filling the missing columns with np.nan, but should work. (Also, if you’re concatenating the data frames, the missing columns will be in the final df and filled with np.nan as appropriate.)
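A hypothetical usage sketch for combining several dumps with this helper; the master list and the glob pattern are assumptions:
import glob
import pandas as pd

all_cols_to_use = ['time', 'sensor_a', 'sensor_b']   # hypothetical master list
frames = [parse_big_csv(path) for path in glob.glob('dumps/*.csv')]
combined = pd.concat(frames, ignore_index=True)      # missing columns stay NaN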
