I imported my CSV file into Python using NumPy and the result looks like this:
>>> print(FH)
array([['Probe_Name', '', 'A2M', ..., 'POS_D', 'POS_E', 'POS_F'],
       ['Accession', '', 'NM_000014.4', ..., 'ERCC_00092.1',
        'ERCC_00035.1', 'ERCC_00034.1'],
       ['Class_Name', '', 'Endogenous', ..., 'Positive', 'Positive',
        'Positive'],
       ...,
       ['CF33294_10', '', '6351', ..., '1187', '226', '84'],
       ['CF33299_11', '', '5239', ..., '932', '138', '64'],
       ['CF33300_12', '', '37372', ..., '981', '202', '58']], dtype=object)
Every inner list is a column, and the first item of every column is the header. I want to plot the data in different ways, so I want to make a variable for every single column. For example, for the first column I want print(Probe_Name) to show the column's contents like this:
A2M
.
.
.
POS_D
POS_E
POS_F
The same goes for the rest of the columns, and then I will plot the variables.
I tried to do that in Python 3 like this:
def items(N_array):
    for item in N_array:
        name = item[0]
        content = item[1:]
        return name, content

print(items(FH))

It does not return what I expect. Do you know how to fix it?
One simple way to do this is with pandas DataFrames. When you read the CSV file into a pandas DataFrame, you essentially get a collection of columns (called Series in pandas).
import pandas as pd
df = pd.read_csv("your filename.csv")
df
Probe_Name Accession
0 A2m MD_9999
1 POS_D NM_0014.4
2 POS_E 99999
Now we can deal with each column, which is named automatically from the header row.
print(df['Probe_Name'])
0 A2m
1 POS_D
2 POS_E
Furthermore, you can do plotting (assuming you have numeric data in here somewhere). See the pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/index.html
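If you prefer to stay with NumPy, note that the original function returns only the first row because the `return` sits inside the loop. A minimal sketch of one way to get a variable per column (the toy array below stands in for `FH`; the second item of each row is skipped because it is an empty placeholder in the printout above):

```python
import numpy as np

# Toy stand-in for FH: each row is a "column" whose first item is the header
# and whose second item is an empty placeholder.
FH = np.array([
    ['Probe_Name', '', 'A2M', 'POS_D'],
    ['Accession', '', 'NM_000014.4', 'ERCC_00092.1'],
], dtype=object)

def columns_by_header(n_array):
    # Build a dict header -> values instead of returning inside the loop,
    # which would stop after the first row.
    return {row[0]: list(row[2:]) for row in n_array}

cols = columns_by_header(FH)
print(cols['Probe_Name'])  # ['A2M', 'POS_D']
```

Each column is then reachable by its header, e.g. `cols['Accession']`, ready for plotting.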
Related
I have a script producing multiple sheets for processing into a database, but I have strict number formats for certain columns in my DataFrames.
I have created a sample dict based on the column headers and the number format required, and a sample df.
import pandas as pd

df_int_headers = ['GrossRevenue', 'Realisation', 'NetRevenue']
df = {'ID': [654398, 456789],
      'GrossRevenue': [3.6069109, 7.584326],
      'Realisation': [1.5129510, 3.2659478],
      'NetRevenue': [2.0939599, 4.3183782]}
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 4}
df = pd.DataFrame.from_dict(df)

def formatter(header):
    for key, value in df_formats.items():
        for head in header:
            return header.round(value).astype(str).astype(float)

df[df_int_headers] = df[df_int_headers].apply(formatter)
df.to_excel('test.xlsx', index=False)
With the current code, every column comes back formatted to 3 d.p. in my Excel sheet, whereas I require a different format for each column.
Look forward to your replies.
For me, passing a dictionary to DataFrame.round works. Note that for your original key-value 'NetRevenue': 4, only 3 decimal places are shown, because the trailing 0 is dropped (2.0940 displays as 2.094):
df = {'ID': 654398, 'GrossRevenue': 3.6069109,
      'Realisation': 1.5129510, 'NetRevenue': 2.0939599}
df = pd.DataFrame(df, index=[0])

df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 5}
df_int_headers = list(df_formats.keys())

df[df_int_headers] = df[df_int_headers].round(df_formats)
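A self-contained version of the idea, showing the per-column precision in one call (values as in the question):

```python
import pandas as pd

df = pd.DataFrame({'ID': [654398], 'GrossRevenue': [3.6069109],
                   'Realisation': [1.5129510], 'NetRevenue': [2.0939599]})
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 5}

# DataFrame.round accepts a dict mapping column name -> number of decimals,
# so each column gets its own precision without a custom formatter function.
df[list(df_formats)] = df[list(df_formats)].round(df_formats)
print(df.loc[0, 'GrossRevenue'])  # 3.607
```

Columns not listed in the dict (here, `ID`) are left untouched.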
I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function that accepts dynamic variables so I could continuously add new rows from the data list.
It works up until the point where adding rows begins: the column headers are added, but the data are not. It either keeps the value only in the last column or adds nothing.
My scratch attempt was:
for title in titles:
    for x in data:
        table = {
            title: data[x]
        }

df = pd.DataFrame(table, columns=titles, index=[0])
columns list:
titles = ['timestamp', 'source', 'tracepoint']
data list:
data = ['first', 'second', 'third',
'first', 'second', 'third',
'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize a pandas DataFrame, you can use the DataFrame constructor.
You can also append a row using a dict.
Pandas provides other useful functions, such as concatenation between DataFrames and inserting/deleting columns. If you need them, please check the pandas docs.
import pandas as pd

# initialization by DataFrame constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
        ['first', 'second', 'third'],
        ['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)

# append row
new_row = {
    'timestamp': '2020/11/01',
    'source': 'xxx',
    'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
output
---initialization---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
---append result---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
3 2020/11/01 xxx yyy
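Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. With a recent pandas, the same row append can be sketched with `pd.concat` (same `titles`, `data`, and `new_row` as above):

```python
import pandas as pd

titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
        ['first', 'second', 'third'],
        ['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)

new_row = {'timestamp': '2020/11/01', 'source': 'xxx', 'tracepoint': 'yyy'}
# Concatenate a one-row frame instead of using the removed DataFrame.append
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df.iloc[-1]['source'])  # xxx
```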
I need to add a column to 40 excel files. The new column in each file will be filled with a name.
This is what I have:
files=[16686_Survey.xlsx, 16687_Survey.xlsx, 16772_Survey.xlsx, ...] (40 files with more than 200 rows each)
filenames=['name1', 'name2', 'name3', ...] (40 names)
I need to add a column to each excel file and write its corresponding name along the new column.
With the following code I got what I need for one file.
import pandas as pd
df = pd.read_excel('16686_Survey.xlsx')
df.insert(0, "WellName", "Name1")
writer = pd.ExcelWriter('16686_Survey.xlsx')
df.to_excel(writer, index = False)
writer.save()
But doing that 40 times would be inefficient, and I would like to learn how to use a loop for this type of problem because I have been in the same situation many times.
The image is what I got with the code above. The first table in the image is what I have. The second table is what I want
Thank you for your help!
I'm not 100% sure I understand your question, but I think you're looking for this:
import pandas as pd

files = ['16686_Survey.xlsx', '16687_Survey.xlsx', '16772_Survey.xlsx', ...]
filenames = ['name1', 'name2', 'name3', ...]

for excel_file, other_name in zip(files, filenames):
    df = pd.read_excel(excel_file)
    df.insert(0, "WellName", other_name)
    writer = pd.ExcelWriter(excel_file)
    df.to_excel(writer, index=False)
    writer.save()
I combined both lists (I assumed they are the same length) using the zip function, which pairs the items of the lists position by position: all the first items together, all the second items together, and so forth.
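To see how zip pairs the two lists:

```python
files = ['16686_Survey.xlsx', '16687_Survey.xlsx', '16772_Survey.xlsx']
filenames = ['name1', 'name2', 'name3']

# zip yields one (file, name) tuple per position,
# stopping at the end of the shorter list
pairs = list(zip(files, filenames))
print(pairs[0])  # ('16686_Survey.xlsx', 'name1')
```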
I am downloading data in json format in Python 3.7 and trying to display it as an Excel spreadsheet. I tried using pandas, but I think that there is a problem in getting the data into a dataframe. I also tried a csv approach but that was not successful.
I am new to learning python, so maybe there is something obvious I am missing.
import json, requests, urllib.request, csv
import pandas as pd
url = urllib.request.urlopen('https://library.brown.edu/search/solr_pub/iip/?start=0&rows=100&indent=on&wt=json&q=*')
str_response = url.read().decode('utf-8')
jsondata=json.loads(str_response)
df = pd.DataFrame(jsondata)
I was hoping to get a number of rows for each item, e.g., zoor0353, with the columns for each of the keys associated with it (e.g., region, date_desc, etc. -there are quite a few). Instead, it seemed only to take the first section, returning:
responseHeader \
QTime 1
docs NaN
numFound NaN
params {'q': '*', 'indent': 'on', 'start': '0', 'rows...
start NaN
status 0
response
QTime NaN
docs [{'inscription_id': 'zoor0353', 'metadata': ['...
numFound 4356
params NaN
start 0
status NaN
I tried this with the normalization method, but did no better. Ultimately, I would like to use a dataset made from an appended file of many calls to this API, and I am also wondering whether, and how, I will need to manipulate the data to get it to work with pandas.
Since I'm not sure what exactly you need to get, I'll show you:
how to use pandas.io.json.json_normalize
how to extract specific keys/values
json_normalize
import requests
import pandas as pd
from pandas.io.json import json_normalize
url = 'https://library.brown.edu/search/solr_pub/iip/?start=0&rows=100&indent=on&wt=json&q=*'
r = requests.get(url).json()
df = json_normalize(r['response']['docs']).T
This is a sample output; each column is a separate item of r['response']['docs'] containing the values, and the rows are the keys for those values.
The data are all messed up, and there are also some keys you probably don't want to use. That's why I think it's better to extract specific keys and values.
df = pd.DataFrame()
query = r['response']['docs']

for i in range(len(query)):
    records = [
        query[i]['inscription_id'], query[i]['city'], query[i]['_version_'],
        query[i]['type'][0], query[i]['language_display'][0],
        query[i]['physical_type'][0]
    ]
    data = pd.DataFrame.from_records(
        [records],
        columns=['inscription_id', 'city', '_version_',
                 'type', 'language_display', 'physical_type'])
    df = df.append(data)
# sample data
inscription_id city _version_ type language_display physical_type
0 unkn0103 Unknown 1613822149755142144 prayer Hebrew other_object
0 zoor0391 Zoora 1613822173978296320 funerary.epitaph Greek tombstone
0 zoor0369 Zoora 1613822168079007744 funerary.epitaph Greek tomb
0 zoor0378 Zoora 1613822170509606912 funerary.epitaph Greek tombstone
0 zoor0393 Zoora 1613822174648336384 funerary.epitaph Greek tombstone
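Since repeated df.append calls rebuild the frame on every iteration, the same extraction can also be sketched by collecting all records first and constructing the DataFrame once (the toy docs below stand in for r['response']['docs'], with a subset of the fields used above):

```python
import pandas as pd

# Toy stand-in for r['response']['docs']
docs = [
    {'inscription_id': 'zoor0353', 'city': 'Zoora', 'type': ['funerary.epitaph']},
    {'inscription_id': 'unkn0103', 'city': 'Unknown', 'type': ['prayer']},
]

# Build all records in one pass, then construct the frame once
records = [(d['inscription_id'], d['city'], d['type'][0]) for d in docs]
df = pd.DataFrame.from_records(records,
                               columns=['inscription_id', 'city', 'type'])
print(df['city'].tolist())  # ['Zoora', 'Unknown']
```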
edit
query=r[][] - what is it and how can I learn more about it?
r = requests.get(url).json()
by using .json() at the end of requests.get method, I converted the response object (data from the API) to a dict type. Hence, r is now a dictionary containing keys:
r.keys()
# dict_keys(['responseHeader', 'response'])
to access values of a specific key, I use a format: dictionary[key][key_inside_of_another_key]:
r['response'].keys()
# dict_keys(['numFound', 'start', 'docs'])
r['response']['docs'][0]
# sample output
{'inscription_id': 'zoor0353',
'metadata': ['zoor0353',
'Negev',
'Zoora',
'Zoora, Negev. cemetery in An Naq. \nNegev.\n Zoora. Found by local inhabitants in the northwest corner of the Bronze\n Age, Byzantine and Islamic cemetery in the An Naq neighborhood south of\n the Wadi al-Hasa, probably in secondary use in later graves. \n',
'funerary.epitaph',
...
...
...
# [0] here is an item inside of a key - in the case of your data,
# there were 100 such items: [0, 1, .. 99] inside of r['response']['docs']
# to access item #4, you'd use ['response']['docs'][4]
That's how you navigate through a dictionary. Now, to access specific key, say inscription_id or _version_:
r['response']['docs'][0]['inscription_id']
# 'zoor0353'
r['response']['docs'][99]['inscription_id']
# 'mger0001'
r['response']['docs'][33]['_version_']
# 1613822151126679552
Lastly, to iterate through all rows (items of the data), I used a for loop: i here is a substitute for a range of numbers representing each item of your data -- from r['response']['docs'][0] to r['response']['docs'][99].
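The navigation described above can be tried on a small mock of the response shape (made-up miniature of the real API payload):

```python
# Minimal mock of the API response dict described above
r = {'responseHeader': {'status': 0},
     'response': {'numFound': 4356, 'start': 0,
                  'docs': [{'inscription_id': 'zoor0353'},
                           {'inscription_id': 'mger0001'}]}}

print(list(r.keys()))                              # ['responseHeader', 'response']
print(r['response']['docs'][0]['inscription_id'])  # zoor0353

# Iterating over all docs, as the for loop above does
ids = [doc['inscription_id'] for doc in r['response']['docs']]
```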
I'm retrieving real-time financial data.
Every 1 second, I pull the following list:
[{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': '0.01848300'}, ...]
The goal is to put this list into an already-existing pandas DataFrame.
What I've done so far is converting this list of a dictionary to a pandas DataFrame. My problem is that symbols and prices are in two columns. I would like to have symbols as the DataFrame header and add a new row every 1 second containing price's values.
marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'},
                     {'symbol': 'LTCBTC', 'price': '0.01848300'}, ...]

data = pd.DataFrame(marketInformation)
header = data['symbol'].values
newData = pd.DataFrame(columns=header)

while True:
    realTimeData = ...  # get a new marketInformation list of dicts
    newData.append(pd.DataFrame(realTimeData)['price'])
    print(newData)
Unfortunately, the printed DataFrame is always empty. I would like a new row added every second containing the new prices for each symbol, together with the current time.
I printed the below part:
pd.DataFrame(realTimeData)['price']
and it gives me a pandas.core.series.Series object with a length equal to the number of symbols.
What's wrong?
DataFrame.append returns a new DataFrame rather than modifying newData in place, which is why your printed DataFrame stays empty. After you create newData, just do:
newData.loc[len(newData), :] = [item['price'] for item in realTimeData]
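A runnable sketch of that .loc append (reusing the sample prices from the question as the stand-in for one 1-second pull):

```python
import pandas as pd

marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'},
                     {'symbol': 'LTCBTC', 'price': '0.01848300'}]
header = [item['symbol'] for item in marketInformation]
newData = pd.DataFrame(columns=header)

realTimeData = marketInformation  # stand-in for the next pull
# len(newData) is the next free integer label, so .loc enlarges the
# frame in place by one row holding the latest prices
newData.loc[len(newData), :] = [item['price'] for item in realTimeData]
print(newData)
```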
You just need to set_index() and then transpose the df:
newData = pd.DataFrame(marketInformation).set_index('symbol').T
#In [245]: newData
#Out[245]:
#symbol      ETHBTC      LTCBTC
#price   0.03381600  0.01848300

# then change the index to whatever makes sense for your data
newdata_time = pd.Timestamp.now()
newData = newData.rename(index={'price': newdata_time})
#Out[246]:
#symbol                          ETHBTC      LTCBTC
#2019-04-03 17:08:51.389359  0.03381600  0.01848300
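Putting it together, each pull can be transposed and stamped with the current time before being concatenated onto the history (a sketch; the two pulls below use made-up prices, and a real loop would sleep one second between pulls):

```python
import pandas as pd

frames = []
for pull in ([{'symbol': 'ETHBTC', 'price': '0.03381600'},
              {'symbol': 'LTCBTC', 'price': '0.01848300'}],
             [{'symbol': 'ETHBTC', 'price': '0.03390000'},
              {'symbol': 'LTCBTC', 'price': '0.01850000'}]):
    # one-row frame: symbols as columns, prices as the values
    row = pd.DataFrame(pull).set_index('symbol').T
    row.index = [pd.Timestamp.now()]  # timestamp the row
    frames.append(row)

history = pd.concat(frames)
print(history.columns.tolist())  # ['ETHBTC', 'LTCBTC']
```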