pandas: split dataframe into multiple csvs - python-3.x

I have a large file, imported into a single dataframe in Pandas.
I'm using pandas to split up a file into many segments, by the number of rows in the dataframe.
eg: 10 rows:
file 1 gets [0:4]
file 2 gets [5:9]
Is there a way to do this without having to create more dataframes?

Assign a new column g; you just need to specify how many items you want in each group. Here I am using 3.
df.assign(g=df.index//3)
Out[324]:
0 g
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 3
assign returns a new DataFrame, so store the result (df = df.assign(g=df.index//3)) and then call df[df.g==1] to get the segment you need.
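As a minimal sketch of turning that grouping idea into files: the helper column can be skipped by grouping directly on the row position divided by the chunk size. The column name `value`, the chunk size of 3, and the temporary output directory are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(1, 11)})  # 10 example rows
rows_per_file = 3                           # assumed chunk size

out_dir = tempfile.mkdtemp()                # hypothetical output location
written = []

# Group on row position // chunk size; this works for any index, not only 0..n-1.
group_ids = np.arange(len(df)) // rows_per_file
for g, part in df.groupby(group_ids):
    path = os.path.join(out_dir, f"part_{g}.csv")
    part.to_csv(path, index=False)
    written.append(path)
```

With 10 rows and a chunk size of 3 this writes four files, the last one holding the single leftover row.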

There are two ways of doing this. I believe you are looking for the former. Basically, we open a series of csv writers, then we write to the correct csv writer by using some basic math with the index, then we close all files.
A single DataFrame evenly divided into N number of CSV files
import pandas as pd
import csv, math
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values for 10 rows
NUMBER_OF_SPLITS = 2
fileOpens = [open(f"out{i}.csv","w") for i in range(NUMBER_OF_SPLITS)]
fileWriters = [csv.writer(v, lineterminator='\n') for v in fileOpens]
# assumes the default 0..n-1 index, so i is the row position
for i,row in df.iterrows():
    fileWriters[math.floor((i/df.shape[0])*NUMBER_OF_SPLITS)].writerow(row.tolist())
for file in fileOpens:
    file.close()
More than one DataFrame evenly divided into N number of CSV files
import pandas as pd
import numpy as np
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values for 10 rows
NUMBER_OF_SPLITS = 2
for i, new_df in enumerate(np.array_split(df,NUMBER_OF_SPLITS)):
    with open(f"out{i}.csv","w") as fo:
        fo.write(new_df.to_csv())

Use numpy.array_split to split your dataframe dfX and save it in N csv files of (nearly) equal size, dfX_1.csv to dfX_N.csv:
N = 10
for i, df in enumerate(np.array_split(dfX, N)):
    df.to_csv(f"dfX_{i + 1}.csv", index=False)

iterating over iloc's arguments will do the trick.
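One possible reading of that suggestion, as a sketch only: step through the start positions and slice with `iloc`, so each chunk is a slice of the original frame rather than a separately built DataFrame. The helper name `iter_chunks` and the chunk size are hypothetical.

```python
import pandas as pd


def iter_chunks(df, rows_per_file):
    """Yield consecutive row slices of df, each at most rows_per_file long."""
    for start in range(0, len(df), rows_per_file):
        yield df.iloc[start:start + rows_per_file]


df = pd.DataFrame({"value": range(1, 11)})  # 10 example rows
sizes = [len(chunk) for chunk in iter_chunks(df, 5)]

# Each chunk could then be written out, for example:
# for i, chunk in enumerate(iter_chunks(df, 5), start=1):
#     chunk.to_csv(f"out_{i}.csv", index=False)
```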

Related

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data but they are not in the same order. For example, if file 1 has the results of measurements A, B, C and D from River X printed in columns 1, 2, 3 and 4, respectively but file 2 has the same measurements taken for a different river, River Y, printed in columns 6, 7, 8, and 9 respectively, is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe (i.e. make it so that Sheet2 has the measurements for River Y printed in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally, not vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the presentation/layout of data is erratic with regards to each individual file, it would be useful to be able to put a single order on every spreadsheet without having to manually shift columns on Excel.
Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::-1]]  # reversed order
data = np.arange(9).reshape(3,3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list() # Get a list of column names; use col_list.sort() to sort in place, or
sorted_col_list = sorted(col_list, reverse=False) # use sorted(); reverse=True inverts the order
def rearrange_df_cols(df, target_order):
    df = df[target_order]
    print(df)
    return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on what's above and apply it to all of your files/sheets, provided that all column names exist in each one (NB they must be written identically).
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle you should check the dimensions of your data, e.g. df.shape, and based on the shape either call df.transpose() first and then your function to reorder the column names, or use your function directly.
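A rough sketch of combining both steps follows. Instead of inspecting `df.shape`, it assumes the orientation can be detected by checking whether the expected names appear as row labels; that detection rule and the helper name `normalise_layout` are assumptions, not part of the answer above.

```python
import pandas as pd


def normalise_layout(df, target_order):
    """Return df with columns in target_order, transposing first if needed."""
    names = set(target_order)
    # Assumption: a "horizontal" sheet carries the expected names as row labels.
    if not names.issubset(df.columns) and names.issubset(df.index):
        df = df.transpose()
    return df[target_order]


target = ["River_1", "River_2", "River_3"]
vertical = pd.DataFrame([[2, 1, 0], [5, 4, 3]],
                        columns=["River_3", "River_2", "River_1"])
horizontal = vertical.transpose()  # same data, laid out sideways

a = normalise_layout(vertical, target)
b = normalise_layout(horizontal, target)
```

Both `a` and `b` end up with the columns in the same order, so the function could be applied to each sheet in a loop.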

Iterate every 5 rows and create a file

I have an Excel file which I need to convert using Python pandas.
I want to create a file for every 5 rows, i.e. if I have 29 rows in the Excel file, I want to create 6 files in total: the first 5 files with 5 rows each and the last file with 4 rows. Can anyone help please?
You can read the whole excel file like this:
df = pd.read_excel(filename)
Then, you can split this df in batches of 5 rows like this:
n = 5 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
list_df will have 6 chunks for your case. 5 of them having 5 rows each and the 6th one having 4 rows.
You can use the code below. c is just a counter, x is the number of files you will need, and the output files will be named file_1.xlsx and so on:
import pandas as pd
import numpy as np
import math
df = pd.read_excel('path_to_your_file.xlsx') # create original df
c = 1
x = math.ceil(df.shape[0]/5)
for i in np.array_split(df, x):
    filename = 'file_'+str(c)
    pd.DataFrame(i).to_excel(filename+'.xlsx', index=False)
    c += 1
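One caveat worth checking: np.array_split balances the chunk sizes, so it gives 5-row files for 29 rows but not for every row count. The sketch below compares the two approaches on a hypothetical 26-row frame; only the sizes are computed and nothing is written to disk.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(26)})  # hypothetical 26-row input
n = 5                                    # rows wanted per file

# Fixed-size slicing: every chunk has n rows except possibly the last.
exact_sizes = [len(df.iloc[i:i + n]) for i in range(0, len(df), n)]

# array_split on the row positions: same number of chunks, balanced sizes.
n_files = math.ceil(len(df) / n)
balanced_sizes = [len(p) for p in np.array_split(np.arange(len(df)), n_files)]

print(exact_sizes)     # [5, 5, 5, 5, 5, 1]
print(balanced_sizes)  # [5, 5, 4, 4, 4, 4]
```

If every file must hold exactly 5 rows, the slicing approach is the safer choice.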

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in pandas.Dataframes with different increments in each column so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1 Col2 Col3
0 0.3 0.2
1 1.1 1.2
2 2.2 2.4
3 3 3.1
but with hundreds of columns and thousands of values.
When graphing this on a line-graph on excel or matplotlib, the values overlap with each other, so I would like to separate each column by adding the same values for each column like so:
Col1(+0) Col2(+10) Col3(+20)
0 10.3 20.2
1 11.1 21.2
2 12.2 22.4
3 13 23.1
By adding the same value to one column and increasing by an increment of 10 over each column, I am able to see each line without it overlapping in one graph.
I thought of using loops and iterations to automate this value-adding process, but I couldn't find any previous solutions on Stackoverflow that addresses how I could change the increment value (e.g. from adding 0 in Col1 in one loop, then adding 10 to Col2 in the next loop) between different columns, but not within the values in a column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd
#import CSV file
df = pd.read_csv ('data.csv')
#store csv data into dataframe
df1 = pd.DataFrame (data = df)
# Locate columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print (values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() along axis=1 with a list of offsets, one per column in df.columns:
df1 = df.add(list(range(0,len(df.columns)*10))[::10],axis=1)
Or, as @jezrael suggested, better:
df1=df.add(range(0,len(df.columns)*10, 10),axis=1)
print(df1)
Col1 Col2 Col3
0 0 10.3 20.2
1 1 11.1 21.2
2 2 12.2 22.4
3 3 13.0 23.1
Details :
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
I would recommend avoiding loops over the data frame, as they are inefficient; think of it as adding matrices instead.
e.g.
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: In case you do not want to increment with 10, 20 ,30 but 0,10,20 use this instead
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the 1 so there is only an end value -> default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x
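As a shorter variant of the same idea, the matrix of ones is not strictly required: a 1-D array with one offset per column is broadcast across the rows when added to a DataFrame. The sketch below uses the question's example values; the step of 10 is taken from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [0, 1, 2, 3],
    "Col2": [0.3, 1.1, 2.2, 3.0],
    "Col3": [0.2, 1.2, 2.4, 3.1],
})

# One offset per column: 0 for the first, 10 for the second, 20 for the third.
offsets = 10 * np.arange(df.shape[1])

# The 1-D array lines up with the columns and is applied to every row.
shifted = df + offsets
```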

Constructing new dataframe and keeping old one?

I am sorry for asking a naive question but it's driving me crazy at the moment. I have a dataframe df1, and I am creating a new dataframe df2 from it, as follows:
import pandas as pd
def NewDF(df):
    df['sum']=df['a']+df['b']
    return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(df1)
df2 =NewDF(df1)
print(df1)
which gives
a b
0 1 4
1 2 5
2 3 6
a b sum
0 1 4 5
1 2 5 7
2 3 6 9
Why am I losing df1's shape and getting a third column? How can I avoid this?
DataFrames are mutable so you should either explicitly pass a copy to your function, or have the first step in your function copy the input. Otherwise, just like with lists, any modifications your functions make also apply to the original.
Your options are:
def NewDF(df):
    df = df.copy()
    df['sum']=df['a']+df['b']
    return df
df2 = NewDF(df1)
or
df2 = NewDF(df1.copy())
Here we can see that everything in your original implementation refers to the same object
import pandas as pd
def NewDF(df):
    print(id(df))
    df['sum']=df['a']+df['b']
    print(id(df))
    return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(id(df1))
#2242099787480
df2 = NewDF(df1)
#2242099787480
#2242099787480
print(id(df2))
#2242099787480
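Another way to leave df1 untouched, offered as a sketch rather than as the answer's own method, is `DataFrame.assign`, which returns a new DataFrame instead of modifying its input. The function name `new_df` is illustrative.

```python
import pandas as pd


def new_df(df):
    # assign() builds and returns a new DataFrame; df itself is not modified.
    return df.assign(sum=df["a"] + df["b"])


df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = new_df(df1)

print(list(df1.columns))  # ['a', 'b']
print(list(df2.columns))  # ['a', 'b', 'sum']
```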
The third column that you are getting is the Index column. Every pandas DataFrame maintains an Index; however, you can choose to leave it out of your output.
import pandas as pd
def NewDF(df):
    df['sum']=df['a']+df['b']
    return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(df1.to_string(index=False))
df2 =NewDF(df1)
print(df1.to_string(index = False))
Gives the output as
a b
1 4
2 5
3 6
a b sum
1 4 5
2 5 7
3 6 9
Now you might ask why the index exists. The Index is backed by a hash table, which speeds up lookups and is highly desirable in many contexts. If this was just a one-off question, this should be enough, but if you want to learn more about pandas, I would advise you to look into indexing. You can begin here: https://stackoverflow.com/a/27238758/10953776

Appending values to a column in a loop

I have various files containing data. I want to extract one specific column from each file and create a new dataframe with one column containing all the extracted data.
So for example I have 3 files:
A B C
1 2 3
4 5 6
A B C
7 8 9
8 7 6
A B C
5 4 3
2 1 0
The new dataframe should only contain the values from column C:
C
3
6
9
6
3
0
So the column of the first file should be copied to the new dataframe, and the column from the second file should be appended to the new dataframe.
My code looks like this so far:
import pandas as pd
import glob
for filename in glob.glob('*.dat'):
    df= pd.read_csv(filename, delimiter="\t", header=6)
    df1= df["Bias"]
    print(df)
Now df1 is overwritten in each loop step. Would it be a good idea to create a temporary dataframe in each loop step and then copy the data to the new dataframe?
Any input is appreciated!
Use a list comprehension, or a for loop with append, to build a list of DataFrames. If you only need some columns, add the usecols parameter. Finally, concat everything together into one big DataFrame:
dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C']) for f in glob.glob('*.dat')]
Or:
dfs = []
for filename in glob.glob('*.dat'):
    df = pd.read_csv(filename, delimiter="\t", header=6, usecols=['C'])
    #if need all columns
    #df = pd.read_csv(filename, delimiter="\t", header=6)
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
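To make the pattern easy to try out, here is a self-contained sketch that first writes the three example files from the question to a temporary directory and then combines column C. The file names, the temporary directory, and the header on the first line (instead of `header=6`) are assumptions for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Write the question's three example tables as tab-separated .dat files.
tmp_dir = tempfile.mkdtemp()
samples = [
    [(1, 2, 3), (4, 5, 6)],
    [(7, 8, 9), (8, 7, 6)],
    [(5, 4, 3), (2, 1, 0)],
]
for i, rows in enumerate(samples):
    path = os.path.join(tmp_dir, f"file_{i}.dat")
    pd.DataFrame(rows, columns=list("ABC")).to_csv(path, sep="\t", index=False)

# Read only column C from each file, in a fixed order, and stack the results.
paths = sorted(glob.glob(os.path.join(tmp_dir, "*.dat")))
dfs = [pd.read_csv(p, sep="\t", usecols=["C"]) for p in paths]
combined = pd.concat(dfs, ignore_index=True)

print(combined["C"].tolist())  # [3, 6, 9, 6, 3, 0]
```

Sorting the paths keeps the row order predictable, since glob does not guarantee any particular file order.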
