I have a script producing multiple sheets for loading into a database, but certain columns in my dataframes must follow strict number formats. I have created a sample dict based on the column headers and the number format required, plus a sample df.
import pandas as pd
df_int_headers=['GrossRevenue', 'Realisation', 'NetRevenue']
df={'ID': [654398,456789],'GrossRevenue': [3.6069109,7.584326], 'Realisation': [1.5129510,3.2659478], 'NetRevenue': [2.0939599,4.3183782]}
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 4}
df=pd.DataFrame.from_dict(df)
def formatter(header):
    for key, value in df_formats.items():
        for head in header:
            return header.round(value).astype(str).astype(float)

df[df_int_headers] = df[df_int_headers].apply(formatter)
df.to_excel('test.xlsx',index=False)
When using the current code, all columns are returned to 3 d.p. in my Excel sheet, whereas I require a different format for each column.
Look forward to your replies.
For me, passing a dictionary to DataFrame.round works. Note that for your original pair 'NetRevenue': 4, only 3 decimal places are shown; in my opinion the trailing 0 is removed because a float does not store it:
df={'ID': 654398,'GrossRevenue': 3.6069109,
'Realisation': 1.5129510, 'NetRevenue': 2.0939599}
df = pd.DataFrame(df, index=[0])
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 5}
df_int_headers = list(df_formats.keys())
df[df_int_headers] = df[df_int_headers].round(df_formats)
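Applied to the multi-row sample df from the question, the dict form rounds each column to its own precision in one call (a minimal sketch; note that a trailing zero such as the one in 2.0940 is still dropped, since a float does not store it):

```python
import pandas as pd

df = pd.DataFrame({'ID': [654398, 456789],
                   'GrossRevenue': [3.6069109, 7.584326],
                   'Realisation': [1.5129510, 3.2659478],
                   'NetRevenue': [2.0939599, 4.3183782]})
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 4}

# DataFrame.round accepts a dict of column -> decimals;
# columns not listed (here 'ID') are left untouched
df = df.round(df_formats)
print(df)
```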
I would like to generate an XLSX file with keys and values from a dictionary. Example below:
statistics = {
    "a:": "textt",
    "b": " ",
    "c": f"{len(list_1)}",
}
df = pd.DataFrame(
    {'Statistics': pd.Series(statistics.keys()),
     'Statistics Values': pd.Series(statistics.values())})
writer = pd.ExcelWriter(f"{output_xlsx_file}", engine='xlsxwriter')
df['Statistics'].to_excel(writer, sheet_name='Statistics', index=False)
df['Statistics Values'].to_excel(writer, sheet_name='Statistics', startcol=1, index=False)
writer.close()
The expected result is an xlsx file with 2 columns: the dict's keys in the first column and the dict's values in the second.
This does happen, with one exception: for dict values that are a number, like the 3rd one in my example, the XLSX shows a quote in front of the number.
Any idea how I can make that a number and get rid of the quote? If I want to add up the numbers in xlsx it fails, as they are not seen as numbers.
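One possible fix (a sketch with a simplified, made-up version of the dict, assuming the numeric entries are plain digit strings) is to convert the values column with pd.to_numeric before writing, so Excel receives real numbers instead of text:

```python
import pandas as pd

statistics = {"a": "textt", "b": " ", "c": "3"}  # "3" arrives as a string
df = pd.DataFrame({'Statistics': list(statistics.keys()),
                   'Statistics Values': list(statistics.values())})

# Coerce numeric-looking strings to numbers; non-numeric entries become NaN
# and then recover their original string value via fillna
converted = pd.to_numeric(df['Statistics Values'], errors='coerce')
df['Statistics Values'] = converted.fillna(df['Statistics Values'])
```

After this, `df.to_excel(...)` writes the 3rd value as a genuine number rather than quoted text.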
I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
thanks in advance!
Two things to notice: you want the method set_axis rather than reindex, and sorting by the original column order ensures the correct label is assigned to the correct column (this is done in the sorted(..., key=...) bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df = df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
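The same alignment can also be done with a plain lookup dict instead of sorting, assigning the MultiIndex directly (a sketch of an equivalent approach, not the answer's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5),
                  columns=['nyc', 'london', 'canada', 'chile', 'earth'])
coltuples = [('cities', 'nyc'), ('countries', 'canada'), ('countries', 'usa'),
             ('countries', 'chile'), ('planets', 'earth'), ('planets', 'mars'),
             ('cities', 'sf'), ('cities', 'london')]

# Map each second-level label to its (top, sub) pair,
# then rebuild the pairs in df's own column order
pair_for = {sub: (top, sub) for top, sub in coltuples}
df.columns = pd.MultiIndex.from_tuples([pair_for[c] for c in df.columns],
                                       names=['level1', 'level2'])
```

Because the tuples are built in `df.columns` order, each label lands on the column that already holds its data, so no realignment (and no NaN) occurs.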
I need to add a column to 40 excel files. The new column in each file will be filled with a name.
This is what I have:
files = ['16686_Survey.xlsx', '16687_Survey.xlsx', '16772_Survey.xlsx', ...] (40 files with more than 200 rows each)
filenames=['name1', 'name2', 'name3', ...] (40 names)
I need to add a column to each excel file and write its corresponding name along the new column.
With the following code I got what I need for one file.
import pandas as pd
df = pd.read_excel('16686_Survey.xlsx')
df.insert(0, "WellName", "Name1")
writer = pd.ExcelWriter('16686_Survey.xlsx')
df.to_excel(writer, index = False)
writer.save()
But it would be inefficient to do this 40 times, and I would like to learn how to use a loop for this type of problem because I have been in the same situation many times.
The image shows what I got with the code above: the first table is what I have; the second table is what I want.
Thank you for your help!
I'm not 100% sure I understand your question, but I think you're looking for this:
import pandas as pd
files=['16686_Survey.xlsx', '16687_Survey.xlsx', '16772_Survey.xlsx', ...]
filenames=['name1', 'name2', 'name3', ...]
for excel_file, other_name in zip(files, filenames):
    df = pd.read_excel(excel_file)
    df.insert(0, "WellName", other_name)
    writer = pd.ExcelWriter(excel_file)
    df.to_excel(writer, index=False)
    writer.save()
I combined both lists (I assumed they are the same length) using the zip function. zip takes items from the lists one by one and pairs them up: all the first items together, then all the second items, and so forth.
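As a quick illustration of the pairing zip produces:

```python
files = ['16686_Survey.xlsx', '16687_Survey.xlsx']
filenames = ['name1', 'name2']
pairs = list(zip(files, filenames))
print(pairs)  # [('16686_Survey.xlsx', 'name1'), ('16687_Survey.xlsx', 'name2')]
```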
I imported my csv file into Python using NumPy's loadtxt and the results look like this:
>>> print(FH)
array([['Probe_Name', '', 'A2M', ..., 'POS_D', 'POS_E', 'POS_F'],
['Accession', '', 'NM_000014.4', ..., 'ERCC_00092.1',
'ERCC_00035.1', 'ERCC_00034.1'],
['Class_Name', '', 'Endogenous', ..., 'Positive', 'Positive',
'Positive'],
...,
['CF33294_10', '', '6351', ..., '1187', '226', '84'],
['CF33299_11', '', '5239', ..., '932', '138', '64'],
['CF33300_12', '', '37372', ..., '981', '202', '58']], dtype=object)
Every single list is a column, and the first item of every column is the header. I want to plot the data in different ways. To do so, I want to make a variable for every single column. For example, for the first column I want print(Probe_Name) to show the column's values like this:
A2M
.
.
.
POS_D
POS_E
POS_F
and this is the case for the rest of the columns; then I will plot the variables.
I tried to do that in Python 3 like this:
def items(N_array):
    for item in N_array:
        name = item[0]
        content = item[1:]
        return name, content

print(items(FH))
It does not return what I expect. Do you know how to fix it?
One simple way to do this is with pandas dataframes. When you read the csv file using a pandas dataframe, you essentially get a collection of 'columns' (called series in pandas).
import pandas as pd
df = pd.read_csv("your filename.csv")
df
Probe_Name Accession
0 A2m MD_9999
1 POS_D NM_0014.4
2 POS_E 99999
Now we can deal with each column, which is named automatically by the header column.
print(df['Probe_Name'])
0 A2m
1 POS_D
2 POS_E
Furthermore, you can do plotting (assuming you have numeric data in here somewhere).
http://pandas.pydata.org/pandas-docs/stable/index.html
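For example (a minimal sketch using count values taken from the question's array, which arrive as strings), a column can be converted with pd.to_numeric before plotting:

```python
import pandas as pd

df = pd.DataFrame({'Probe_Name': ['CF33294_10', 'CF33299_11', 'CF33300_12'],
                   'A2M': ['6351', '5239', '37372']})  # counts read in as strings

counts = pd.to_numeric(df['A2M'])  # convert to integers so plotting works
# counts.plot(kind='bar')         # uncomment with matplotlib installed
print(counts.max())
```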
I'm retrieving real-time financial data.
Every 1 second, I pull the following list:
[{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': '0.01848300'}...]
The goal is to put this list into an already-existing pandas DataFrame.
What I've done so far is convert this list of dictionaries to a pandas DataFrame. My problem is that symbols and prices end up in two columns; I would like the symbols as the DataFrame header, with a new row added every 1 second containing the price values.
marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': '0.01848300'}...]
data = pd.DataFrame(marketInformation)
header = data['symbol'].values
newData = pd.DataFrame(columns=header)
while True:
    realTimeData = ...  # get a new marketInformation list of dicts
    newData.append(pd.DataFrame(realTimeData)['price'])
    print(newData)
Unfortunately, the printed DataFrame is always empty. I would like to have a new row added every second with new prices for each symbol with the current time.
I printed the below part:
pd.DataFrame(realTimeData)['price']
and it gives me a pandas.core.series.Series object with a length equal to the number of symbols.
What's wrong?
After you create newData, just do:
newData.loc[len(newData), :] = [item['price'] for item in realTimeData]
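A minimal sketch of that pattern with the sample data from the question:

```python
import pandas as pd

realTimeData = [{'symbol': 'ETHBTC', 'price': '0.03381600'},
                {'symbol': 'LTCBTC', 'price': '0.01848300'}]
newData = pd.DataFrame(columns=[d['symbol'] for d in realTimeData])

# Assigning to .loc with a new label grows the frame in place,
# unlike DataFrame.append, which returns a new frame that was being discarded
newData.loc[len(newData), :] = [d['price'] for d in realTimeData]
print(newData)
```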
You just need to set_index() and then transpose the df:
newData = pd.DataFrame(marketInformation).set_index('symbol').T
#In [245]: newData
#Out[245]:
#symbol      ETHBTC      LTCBTC
#price 0.03381600 0.01848300
# then change the index to whatever makes sense to your data
newdata_time = pd.Timestamp.now()
newData.rename(index={'price':newdata_time})
#Out[246]:
#symbol                        ETHBTC      LTCBTC
#2019-04-03 17:08:51.389359 0.03381600 0.01848300
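Putting the two steps together (note that rename returns a new frame, so the result must be assigned back to persist outside a REPL):

```python
import pandas as pd

marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'},
                     {'symbol': 'LTCBTC', 'price': '0.01848300'}]

# set_index + transpose turns the symbols into column headers
newData = pd.DataFrame(marketInformation).set_index('symbol').T
# stamp the single row with the time it was pulled
newData = newData.rename(index={'price': pd.Timestamp.now()})
print(newData)
```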