Increment odd values and duplicate in pandas or excel

I want an Excel sheet column with odd values, where each value is duplicated.
The values of the column have to be like 1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, ... and so on up to 20000/20001.
Could you let me know how to do it, either in an Excel sheet or in a pandas DataFrame?
I tried duplicating in Excel, but it does not work.
I tried in pandas, but even numbers come up:
df['ABC'] = 2 + df.index//1

You can use numpy.arange with numpy.repeat:
import pandas as pd
import numpy as np
df = pd.Series(np.repeat(np.arange(1,20002,2),2)).to_frame("ABC")
Then (if needed) use pandas.DataFrame.to_excel to make a spreadsheet:
df.to_excel("out.xlsx", index=False)
# Output:
print(df)
         ABC
0          1
1          1
2          3
3          3
4          5
...      ...
19997  19997
19998  19999
19999  19999
20000  20001
20001  20001
[20002 rows x 1 columns]
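Alternatively, if you already have a DataFrame of the right length, a minimal index-based sketch (assuming the default RangeIndex starting at 0): integer-divide the index by two before scaling, so each pair of rows shares one odd value.
import pandas as pd

# assume a frame with 20002 rows and the default RangeIndex 0..20001
df = pd.DataFrame(index=range(20002))
# 1 + 2*(i//2) maps index pairs (0,1)->1, (2,3)->3, (4,5)->5, ...
df['ABC'] = 1 + 2 * (df.index // 2)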

Using ms365, try something like:
=TOCOL(LET(x,SEQUENCE(10001,,,2),IFERROR(EXPAND(x,,2),x)))
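A quick breakdown of the formula: SEQUENCE(10001,,,2) builds the odd numbers 1 through 20001 in a single column, EXPAND(x,,2) widens that to two columns padded with #N/A, IFERROR(...,x) replaces the #N/A column with the odd numbers again, and TOCOL reads the result row by row, giving 1, 1, 3, 3, 5, 5, ...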

Related

Read excel column and re-write into excel as date format using Pandas

My excel spreadsheet is of the form below:
A                  B
Part 1- 20210910   55
Part 2- 20210829   45
Part 3- 20210912   2
I would like to take the strings from column A, e.g. "Part 1- 20210910", but read them with pandas as "2021/09/10", a date format. How could I implement this?
IIUC:
df['A'] = df['A'].str.extract(r'(\d{8})', expand=False).astype('datetime64[ns]')
print(df)
# Output:
           A   B
0 2021-09-10  55
1 2021-08-29  45
2 2021-09-12   2
My beginner way of doing it:
import pandas as pd
df = pd.read_excel('file_name.xlsx')
# split on '-', keep the date part, strip the leading space, then parse it
df['A'] = df['A'].apply(lambda x: pd.to_datetime(x.split('-')[1].strip(), format='%Y%m%d'))
Output:
           A   B
0 2021-09-10  55
1 2021-08-29  45
2 2021-09-12   2

Iterate every 5 rows and create a file

I have an excel file which I need to convert using python pandas.
I want to create a file for every 5 rows, i.e. if I have 29 rows in an excel file, I want to create 6 files in total: the first 5 files with 5 rows each and the last file with 4 rows. Can anyone help please?
You can read the whole excel file like this:
df = pd.read_excel(filename)
Then, you can split this df into batches of 5 rows like this:
n = 5 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
list_df will have 6 chunks in your case: 5 of them with 5 rows each and the 6th with 4 rows.
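If you then need each chunk written to its own file, a minimal sketch of that last step (the input and chunk_{j}.xlsx names are just an illustration):
import pandas as pd

df = pd.read_excel('input.xlsx')  # hypothetical input file
n = 5  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]
# write each chunk to its own spreadsheet: chunk_1.xlsx, chunk_2.xlsx, ...
for j, chunk in enumerate(list_df, start=1):
    chunk.to_excel(f'chunk_{j}.xlsx', index=False)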
You can use the code below. c is just a counter, x is the number of files you will need, and the output files will be named file_1.xlsx and so on:
import pandas as pd
import numpy as np
import math

df = pd.read_excel('path_to_your_file.xlsx') # create original df
c = 1
x = math.ceil(df.shape[0]/5)
for i in np.array_split(df, x):
    filename = 'file_' + str(c)
    pd.DataFrame(i).to_excel(filename + '.xlsx', index=False)
    c += 1

Merging sheets of excel using python

I am trying to take data from two sheets and compare them with each other; if a name matches, I want to append a column. Let me explain what I am doing and what I am trying to get as output using python.
This is my sheet1 from excel.xlsx:
It contains four columns: name, class, age and group.
This is my sheet2 from excel.xlsx:
It contains a default column, and a name column with extra names in it.
So, now I am trying to match the names in sheet2 with sheet1; if a name in sheet1 also appears in sheet2, I want to add the default value corresponding to that name from sheet2.
This is the output I need:
As you can see, only Ravi and Neha have a default in sheet2 and names that match sheet1. Suhas and Aish don't have any default value, so nothing appears for them.
This is the code I tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and I am getting an output excel like this:
The default is not filled in against Ravi.
Please help me get the expected output using python.
Assuming you read each sheet into a dataframe (df = sheet1, df2 = sheet2),
it's quite easy, and there are a few options (ranked in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat
df = pd.concat([df.set_index('Name'), df2.set_index('Name').Default], axis=1, join='inner')
# .join
df = df.set_index('Name').join(df2.set_index('Name'))
# .map
df['Default'] = df.Name.map(df2.set_index('Name')['Default'].to_dict())
All of them will have the following output:
    Name  Default  Class  Age Group
0    NaN      NaN      4    2   tig
1   Ravi      2.0      5    5  rose
2    NaN      NaN      3    3  lily
3  Suhas      NaN      5    5  rose
4    NaN      NaN      2    2   sun
5   Neha      3.0      5    5  rose
6    NaN      NaN      5    2   sun
7   Aish      NaN      5    5  rose
Then you overwrite the original sheet by using df.to_excel
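For example, a minimal sketch of that write-back (the file and sheet names are placeholders):
# write the merged result back over the original sheet
df.to_excel('excel.xlsx', sheet_name='Sheet1', index=False)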
EDIT
So the code you shared has 3 problems, one of which seems to be a language barrier... You only need one of the options I gave you. Secondly, there was a missing ' when reading the first sheet into df1. And lastly, you were inconsistent with the DataFrame names: you defined df1 and df2 but used just df in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1') #Here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
# Now choose one of the options; I used map here, but you can pick any of them
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in pandas DataFrames with a different increment for each column, so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1  Col2  Col3
   0   0.3   0.2
   1   1.1   1.2
   2   2.2   2.4
   3   3.0   3.1
but with hundreds of columns and thousands of values.
When graphing this on a line graph in Excel or matplotlib, the values overlap with each other, so I would like to separate the columns by adding a constant offset to each one, like so:
Col1(+0)  Col2(+10)  Col3(+20)
       0       10.3       20.2
       1       11.1       21.2
       2       12.2       22.4
       3       13.0       23.1
By adding a constant to each column, increasing by 10 from one column to the next, I am able to see every line in one graph without overlap.
I thought of using loops and iteration to automate this value-adding process, but I couldn't find any previous solutions on Stack Overflow that address how to change the increment value between columns (e.g. adding 0 to Col1 in one loop, then 10 to Col2 in the next) while keeping it constant within each column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd

# import CSV file
df = pd.read_csv('data.csv')
# store csv data into a dataframe
df1 = pd.DataFrame(data=df)
# locate the columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
Here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() over axis=1 with a list built from the length of df.columns:
df1 = df.add(list(range(0, len(df.columns)*10))[::10], axis=1)
Or as @jezrael suggested, better:
df1 = df.add(range(0, len(df.columns)*10, 10), axis=1)
print(df1)
   Col1  Col2  Col3
0     0  10.3  20.2
1     1  11.1  21.2
2     2  12.2  22.4
3     3  13.0  23.1
Details:
list(range(0, len(df.columns)*10))[::10]
# [0, 10, 20]
I would recommend avoiding loops over the data frame, as they are inefficient; instead, think of it as adding matrices, e.g.:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
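Note that df + x returns a new DataFrame; assign it back (df = df + x) if you want to keep the shifted values.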
Edit: in case you do not want to increment with 10, 20, 30 but with 0, 10, 20, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the 1 so there is only an end value -> default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x

pandas: split dataframe into multiple csvs

I have a large file, imported into a single dataframe in Pandas.
I'm using pandas to split up a file into many segments, by the number of rows in the dataframe.
e.g. 10 rows:
file 1 gets [0:4]
file 2 gets [5:9]
Is there a way to do this without having to create more dataframes?
Assign a new column g here; you just need to specify how many items you want in each group. Here I am using 3:
df.assign(g=df.index//3)
Out[324]:
    0  g
0   1  0
1   2  0
2   3  0
3   4  1
4   5  1
5   6  1
6   7  2
7   8  2
8   9  2
9  10  3
and you can call df[df.g==1] to get the group you need.
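Building on that, a minimal sketch that writes each group straight to its own CSV without keeping a list of intermediate frames (the part_{g}.csv naming is just an assumption):
import pandas as pd

df = pd.DataFrame({0: range(1, 11)})  # the ten example rows above
# group every 3 consecutive rows and write each group to its own file
for g, chunk in df.groupby(df.index // 3):
    chunk.to_csv(f'part_{g}.csv', index=False)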
There are two ways of doing this; I believe you are looking for the former. Basically, we open a series of csv writers, then we write each row to the correct csv writer using some basic math on the index, and finally we close all files.
A single DataFrame evenly divided into N number of CSV files
import pandas as pd
import csv, math

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])  # uncreative input values for 10 rows
NUMBER_OF_SPLITS = 2
fileOpens = [open(f"out{i}.csv", "w") for i in range(NUMBER_OF_SPLITS)]
fileWriters = [csv.writer(v, lineterminator='\n') for v in fileOpens]
for i, row in df.iterrows():
    fileWriters[math.floor((i/df.shape[0])*NUMBER_OF_SPLITS)].writerow(row.tolist())
for file in fileOpens:
    file.close()
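The index math works because i/df.shape[0] maps each row to a fraction of the whole: with 10 rows and 2 splits, rows 0-4 floor to writer 0 and rows 5-9 to writer 1.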
More than one DataFrame evenly divided into N number of CSV files
import pandas as pd
import numpy as np

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])  # uncreative input values for 10 rows
NUMBER_OF_SPLITS = 2
for i, new_df in enumerate(np.array_split(df, NUMBER_OF_SPLITS)):
    with open(f"out{i}.csv", "w") as fo:
        fo.write(new_df.to_csv())
Use numpy.array_split to split your dataframe dfX and save it in N csv files of equal size, dfX_1.csv to dfX_N.csv:
import numpy as np

N = 10
for i, df in enumerate(np.array_split(dfX, N)):
    df.to_csv(f"dfX_{i + 1}.csv", index=False)
Iterating over iloc's arguments will also do the trick.
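A minimal sketch of that idea, stepping the iloc row slice through the frame in fixed-size steps (the out_{j}.csv naming is just an assumption):
import pandas as pd

df = pd.DataFrame({'a': range(10)})  # ten example rows
n = 5  # rows per file
# one file per iloc slice: rows [0:5] -> out_1.csv, rows [5:10] -> out_2.csv
for j, start in enumerate(range(0, len(df), n), start=1):
    df.iloc[start:start + n].to_csv(f'out_{j}.csv', index=False)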
