I am trying to write ASCII data to a text file in the following format:
Time Heat Flux ...
0.023 1.793 ...
.
.
.
The text header comes from a list of tags of dimension 1 x n, and the numeric data has dimension m x n. When I know the number of rows and columns a priori, I usually print this information in this manner:
# ... open file object, etc.
# Print header
print('%16s \t %16s' % ('Time', 'Heat Flux'), file=fileObject)
for ii in range(len(heatFlux)):
    print('%16.3f \t %16.3f' % (heatFlux[ii][0], heatFlux[ii][1]), file=fileObject)
I want to have generic code that allows me to write these files with a dynamically sized array (in terms of the number of columns). I've tried to generate a string and insert the tags and spaces, which I then write to the file, but I am not sure how to "format-print" the string itself.
For example, I was trying:
tagHeader = ''
for tag in keyTags:
    tagHeader = tagHeader + tag + '\t'
# ...
print(tagHeader, file=fileObject)
Can someone help me with this? Thanks!
If you have a list with the header and a 2D data structure you can do something like this:
def table(header, data):
    print('\t'.join(['%16s'] * len(header)) % tuple(header))
    for row in data:
        print('\t'.join(['%16.3f'] * len(header)) % tuple(row))
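If you want the output to go to a file, as in the question, print() takes a file argument; here is a small variant of the same function (fileObject is assumed to be an open file handle, as in the question):

def table_to_file(header, data, fileObject):
    # one '%16s' / '%16.3f' format per column, joined by tabs
    fmt_header = '\t'.join(['%16s'] * len(header))
    fmt_row = '\t'.join(['%16.3f'] * len(header))
    print(fmt_header % tuple(header), file=fileObject)
    for row in data:
        print(fmt_row % tuple(row), file=fileObject)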
Here are some tests:
from random import randint, uniform

header1 = ['Time', 'Heat Flux']
header2 = ['Time', 'Heat Flux', 'New col']
header3 = ['This', 'is', 'a', 'test']
data1 = [[uniform(-10, 10) for _ in range(len(header1))] for _ in range(randint(2, 10))]
data2 = [[uniform(-10, 10) for _ in range(len(header2))] for _ in range(randint(2, 10))]
data3 = [[uniform(-10, 10) for _ in range(len(header3))] for _ in range(randint(2, 10))]
table(header1, data1)
table(header2, data2)
table(header3, data3)
Output:
Time Heat Flux
7.037 -1.528
8.058 5.649
Time Heat Flux New col
-9.590 4.846 -4.024
-8.597 9.718 -8.174
9.260 -0.947 -6.675
3.401 -5.101 8.323
0.099 -6.582 3.951
This is a test
-2.126 -0.678 4.782 -7.849
-9.007 -0.019 -4.402 8.017
-7.399 -7.617 6.235 9.320
-0.486 -5.304 -4.723 1.946
2.743 -2.150 -6.779 -2.099
-7.499 -2.618 -9.918 0.674
8.912 -6.648 -7.865 -0.101
0.682 -0.414 7.677 7.167
-3.105 -6.562 6.970 -2.147
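As a side note, the same formatting can be written with f-strings if you prefer (a stylistic variant, not part of the answer above):

def table_fstr(header, data):
    # right-align every cell in a 16-character field, tab-separated
    print('\t'.join(f'{h:>16}' for h in header))
    for row in data:
        print('\t'.join(f'{v:>16.3f}' for v in row))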
I have a pandas dataframe of the form:
benchmark_x benchmark_y ref_point_x ref_point_y
0 525039.140 175445.518 525039.145 175445.539
1 525039.022 175445.542 525039.032 175445.568
2 525038.944 175445.558 525038.954 175445.588
3 525038.855 175445.576 525038.859 175445.576
4 525038.797 175445.587 525038.794 175445.559
5 525038.689 175445.609 525038.679 175445.551
6 525038.551 175445.637 525038.544 175445.577
7 525038.473 175445.653 525038.459 175445.594
8 525038.385 175445.670 525038.374 175445.610
9 525038.306 175445.686 525038.289 175445.626
I am trying to find the shortest distance from the line to the benchmark, signed so that the distance is positive if the line is above the benchmark and negative if it is below. See the image below:
I used the KDTree from scipy like so:
from scipy.spatial import KDTree
tree = KDTree(df[["benchmark_x", "benchmark_y"]])
test = df.apply(lambda row: tree.query(row[["ref_point_x", "ref_point_y"]]), axis=1)
test = test.apply(pd.Series, index=["distance", "index"])
This seems to work, except that it fails to produce negative values when the line is below the benchmark.
import numpy as np
import pandas as pd

# recreating your example
columns = "benchmark_x benchmark_y ref_point_x ref_point_y".split(" ")
data = """525039.140 175445.518 525039.145 175445.539
525039.022 175445.542 525039.032 175445.568
525038.944 175445.558 525038.954 175445.588
525038.855 175445.576 525038.859 175445.576
525038.797 175445.587 525038.794 175445.559
525038.689 175445.609 525038.679 175445.551
525038.551 175445.637 525038.544 175445.577
525038.473 175445.653 525038.459 175445.594
525038.385 175445.670 525038.374 175445.610
525038.306 175445.686 525038.289 175445.626"""
data = [float(x) for x in data.replace("\n"," ").split(" ") if len(x)>0]
arr = np.array(data).reshape(-1,4)
df = pd.DataFrame(arr, columns=columns)
# adding your two new columns to the df
from scipy.spatial import KDTree

tree = KDTree(df[["benchmark_x", "benchmark_y"]])
df["distance"], df["index"] = tree.query(df[["ref_point_x", "ref_point_y"]])
Now, to decide whether one line is above the other, we have to evaluate y at the same x position. Therefore we need to interpolate the y values of one line at the x positions of the other.
df = df.sort_values("ref_point_x") # sorting is required for interpolation
xy_refpoint = df[["ref_point_x", "ref_point_y"]].values
df["ref_point_y_at_benchmark_x"] = np.interp(df["benchmark_x"], xy_refpoint[:,0], xy_refpoint[:,1])
And finally your criterion can be evaluated and applied:
df["distance"] = np.where(df["ref_point_y_at_benchmark_x"] < df["benchmark_y"], -df["distance"], df["distance"])
# or change < to >, <= or >= as you wish
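As a quick sanity check of that sign flip, here is a toy example with made-up numbers (not the question's data):

import numpy as np

# the reference line is above the benchmark at the first point, below at the second
ref_y_at_bench = np.array([2.0, 0.5])
bench_y = np.array([1.0, 1.0])
dist = np.array([1.0, 0.5])
print(np.where(ref_y_at_bench < bench_y, -dist, dist))  # [ 1.  -0.5]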
I want to do a Fisher exact test (one-sided) on every row of a 3000+ row table in a format matching the example below:
gene    sample_alt    sample_ref    population_alt    population_ref
One     4             556           770               37000
Two     5             555           771               36999
Three   6             554           772               36998
I would ideally like to make another column of the table equivalent to
[(4+556)!(4+770)!(770+37000)!(556+37000)!]/[4!(556!)770!(37000!)(4+556+770+37000)!]
for the first row of data, and so on and so forth for each row of the table.
I know how to do a Fisher test in R for simple 2x2 tables, but I wouldn't know how to apply the fisher.test() function to each row of a large table. I also can't use an Excel formula, because the factorials produce numbers so big that they exceed Excel's digit limit and result in a #NUM error. What's the best way to do this? Thanks in advance!
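As an aside on the Excel overflow: that quantity can be computed in log space, where the factorials never overflow. Here is a minimal Python sketch using math.lgamma, plugging in the first row's counts:

from math import exp, lgamma

def lfact(n):
    # log(n!) via the log-gamma function
    return lgamma(n + 1)

a, b, c, d = 4, 556, 770, 37000  # first row of the table
log_p = (lfact(a + b) + lfact(a + c) + lfact(c + d) + lfact(b + d)
         - lfact(a) - lfact(b) - lfact(c) - lfact(d)
         - lfact(a + b + c + d))
print(exp(log_p))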
Beginning with a tab-delimited text file on desktop (table.txt) with the same format as shown in the stem question
if(!require(psych)){install.packages("psych")}

multiFisher = function(file="Desktop/table.txt", saveit=TRUE,
                       outfile="Desktop/table.csv", progress=T,
                       verbose=FALSE, digits=3, ... )
{
  require(psych)

  Data = read.table(file, skip=1, header=F,
                    col.names=c("Gene", "MD", "WTD", "MC", "WTC"), ...)

  if(verbose){print(str(Data))}

  Data$Fisher.p = NA
  Data$phi      = NA
  Data$OR1      = format(0.123, nsmall=3)
  Data$OR2      = NA

  if(progress){cat("\n")}

  for(i in 1:length(Data$Gene)){
    Matrix = matrix(c(Data$WTC[i], Data$MC[i], Data$WTD[i], Data$MD[i]), nrow=2)
    Fisher = fisher.test(Matrix, alternative = 'greater')
    Data$Fisher.p[i] = signif(Fisher$p.value, digits=digits)
    Data$phi[i] = phi(Matrix, digits=digits)
    OR1 = (Data$WTC[i]*Data$MD[i])/(Data$MC[i]*Data$WTD[i])
    OR2 = 1 / OR1
    Data$OR1[i] = format(signif(OR1, digits=digits), nsmall=3)
    Data$OR2[i] = signif(OR2, digits=digits)
    if(progress) {cat(".")}
  }

  if(progress){cat("\n"); cat("\n")}

  if(saveit){write.csv(Data, outfile)}

  return(Data)
}

multiFisher()
multiFisher()
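If Python is also an option, scipy's fisher_exact runs the same one-sided test; a minimal sketch, assuming a tab-delimited file with the question's column names:

import pandas as pd
from scipy.stats import fisher_exact

df = pd.read_csv('table.txt', sep='\t')

def row_p(row):
    # 2x2 contingency table per gene, one-sided as in the question
    table = [[row['sample_alt'], row['sample_ref']],
             [row['population_alt'], row['population_ref']]]
    odds, p = fisher_exact(table, alternative='greater')
    return p

df['fisher_p'] = df.apply(row_p, axis=1)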
My data, stored in a .OUT file, looks like the below:
{ID=ISIN Name=yes PROGRAM=abc START_of_FIELDS CODE END-OF-FIELDS TIMESTARTED=Mon Nov 30 20:45:56
START-OF-DATA
CODE|ERR CODE|NUM|EXCH_CODE|
912828U rp|0|1|BERLIN|
1392917 rp|0|1|IND|
3CB0248 rp|0|1|BRAZIL|
END-OF-DATA***}
I need to extract the lines between START-OF-DATA and END-OF-DATA from the above .OUT file using Python and load them into a CSV file:
CODE|ERR CODE|NUM|EXCH_CODE|
912828U rp|0|1|BERLIN|
1392917 rp|0|1|IND|
3CB0248 rp|0|1|FRANKFURT|
You can use a regex with a non-greedy quantifier to get the entries between the two markers.
import re
import pandas as pd

with open('file.txt', 'r') as file:
    data = file.read()

pattern = re.compile(r'(?:START-OF-DATA(.*?)END-OF-DATA)', re.IGNORECASE | re.DOTALL)
g = re.findall(pattern, data)
Output:
[' \nCODE|ERR CODE|NUM|EXCH_CODE|\n912828U rp|0|1|BERLIN|\n1392917 rp|0|1|IND| \n3CB0248 rp|0|1|BRAZIL| \n']
# remove whitespace, split on newlines, and drop empty entries from the list
t = g[0].replace(" ","").split("\n")
new = list(filter(None, t))
Output:
['CODE|ERRCODE|NUM|EXCH_CODE|', '912828Urp|0|1|BERLIN|', '1392917rp|0|1|IND|', '3CB0248rp|0|1|BRAZIL|']
# create dataframe by splitting on the pipe delimiter
# (rstrip('|') drops the trailing pipe so there is no empty last column)
df = pd.DataFrame([i.rstrip('|').split('|') for i in new])
Output:
0 1 2 3
0 CODE ERRCODE NUM EXCH_CODE
1 912828Urp 0 1 BERLIN
2 1392917rp 0 1 IND
3 3CB0248rp 0 1 BRAZIL
# create csv from the dataframe
df.to_csv('file.csv')
The regex pattern defined here captures everything between "START-OF-DATA" and "END-OF-DATA" wherever such a match is found, and returns the captured text as its output.
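A regex-free alternative is to walk the file line by line; a sketch, assuming a single data block per file as in the sample:

import pandas as pd

rows = []
inside = False
with open('file.txt') as fh:
    for line in fh:
        if 'START-OF-DATA' in line:
            inside = True   # start collecting from the next line
            continue
        if 'END-OF-DATA' in line:
            break
        if inside and line.strip():
            rows.append(line.strip().rstrip('|').split('|'))

df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_csv('file.csv', index=False)

Unlike the replace(" ", "") step above, this keeps embedded spaces such as 'ERR CODE' intact.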
I would like to get the S&P 500 'Adj Close' column and replace the column header with the corresponding stock symbol. However, I am not able to replace the dataframe column because it gives me an error: KeyError: '5'
What I would like to achieve is to loop through all the available stocks from the list and replace the Adj Close with the stock symbol.
This is what I did:
First I have scraped the stock symbols from Wikipedia and added them to a list.
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
symbols = data[0]  # first table on the page
symbols.head()
stock = symbols['Symbol'].to_list()
print(stock[0:5])
this gives me a list of stock symbols as below:
['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN']
Then I queried Yahoo Finance to get the daily financial data, as below:
import csv
import requests
import pandas as pd
from io import StringIO

stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/{}?'
params = {
    'range'    : '1y',
    'interval' : '1d',
    'events'   : 'history'
}
response = requests.get(stock_url.format(stock[0]), params=params)
file = StringIO(response.text)
reader = csv.reader(file)
data = list(reader)
df = pd.DataFrame(data)
stock_data = df['5']
Fix for the KeyError
You are calling the URL with the whole list stock, and it gave a 404 response when I tried it.
Call the URL with an individual stock instead, like below:
requests.get(stock_url.format(stock[0]), params=params)
Also do the below. The column label 5 is stored as an integer, not a string; that is why you got the KeyError:
stock_data = df[5]
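The reason is that a DataFrame built from a plain list of rows gets integer column labels by default, so df[5] works while df['5'] raises a KeyError. A tiny illustration with made-up data:

import pandas as pd

demo = pd.DataFrame([['a', 'b'], ['c', 'd']])
print(demo.columns.tolist())  # [0, 1]  (integers, not strings)
print(demo[1].tolist())       # ['b', 'd']; demo['1'] would raise KeyError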
I tried it for the stock 'MMM' (stock[0]) and it prints the below:
0 1 2 3 4 5 \
0 Date Open High Low Close Adj Close
1 2019-12-11 168.380005 168.839996 167.330002 168.740005 162.682480
2 2019-12-12 166.729996 170.850006 166.330002 168.559998 162.508926
3 2019-12-13 169.619995 171.119995 168.080002 168.789993 162.730667
4 2019-12-16 168.940002 170.830002 168.190002 170.750000 164.620316
.. ... ... ... ... ... ...
249 2020-12-04 172.130005 173.160004 171.539993 172.460007 172.460007
250 2020-12-07 171.720001 172.500000 169.179993 170.149994 170.149994
251 2020-12-08 169.740005 172.830002 169.699997 172.460007 172.460007
252 2020-12-09 172.669998 175.639999 171.929993 175.289993 175.289993
253 2020-12-10 174.869995 175.399994 172.690002 173.490005 173.490005
[254 rows x 7 columns]
Loop through stocks and replace Adj Close (Edited as per requirements from comments)
Code for looping through the stocks and replacing 'Adj Close' with the stock symbol:
import csv
import io
import requests
import pandas as pd

stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/{}?'
params = {
    'range'    : '1y',
    'interval' : '1d',
    'events'   : 'history'
}

df = pd.DataFrame()
for i in stock:
    response = requests.get(stock_url.format(i), params=params)
    file = io.StringIO(response.text)
    reader = csv.reader(file)
    data = list(reader)
    df1 = pd.DataFrame(data)
    df1.loc[df1[5] == 'Adj Close', 5] = i
    df = df.append(df1)
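One caveat: DataFrame.append was removed in pandas 2.0, so on current pandas the same accumulation can be done with pd.concat (same loop, just collected into a list first):

frames = []
for i in stock:
    response = requests.get(stock_url.format(i), params=params)
    data = list(csv.reader(io.StringIO(response.text)))
    df1 = pd.DataFrame(data)
    df1.loc[df1[5] == 'Adj Close', 5] = i
    frames.append(df1)

df = pd.concat(frames, ignore_index=True)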
I tried the code for the first 3 stocks and here it is:
I am not getting the output as expected.
I am trying to convert a CSV to a dataframe, but it is not working:
sales=pd.read_csv('Downloads/item.csv',sep=',',delimeter='"',error_bad_lines=False,quotechar='"')
This is my CSV file sample:
"account_number,name,item_code,category,quantity,unit price,net_price,date "
"093356,Waters-Walker,AS-93055,Shirt,5,82.68,413.40,2013-11-17 20:41:11"
"659366,Waelchi-Fahey,AS-93055,Shirt,18,99.64,1793.52,2014-01-03 08:14:27"
"563905,""Kerluke, Reilly and Bechtelar"",AS-93055,Shirt,17,52.82,897.94,2013-12-04 02:07:05"
"995267,Cole-Eichmann,GS-86623,Shoes,18,15.28,275.04,2014-04-09 16:15:03"
"524021,Hegmann and Sons,LL-46261,Shoes,7,78.78,551.46,2014-06-18 19:25:10"
"929400,""Senger, Upton and Breitenberg"",LW-86841,Shoes,17,38.19,649.23,2014-02-10 05:55:56"
Please take a look at the company names in the CSV sample that are enclosed in doubled quotes ("").
Here is my proposal:
import pandas as pd

df = pd.read_csv('file.csv')
col_name = 'account_number,name,item_code,category,quantity,unit price,net_price,date'
z = df[col_name].str.split(r'(,(?=\S)|:)', expand=True)
z['date'] = z[14] + z[15] + z[16] + z[17] + z[18]
z = z.drop(columns=[1, 3, 5, 7, 9, 11, 13, 14, 15, 16, 17, 18])
z.columns = col_name.split(',')
Crucial here is the regex r'(,(?=\S)|:)', which splits on a comma not followed by a space, but I don't know why it also splits on :. If you can fix that, you don't have to manually concatenate the date columns.
Output:
account_number ... date
0 093356 ... 2013-11-17 20:41:11
1 659366 ... 2014-01-03 08:14:27
2 563905 ... 2013-12-04 02:07:05
3 995267 ... 2014-04-09 16:15:03
4 524021 ... 2014-06-18 19:25:10
5 929400 ... 2014-02-10 05:55:56
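An alternative reading of the same sample: since every physical line is one quoted CSV field (with inner quotes doubled), you can let the csv module unwrap the outer quotes first and then parse the inner, comma-separated records normally. A sketch under that assumption:

import csv
import io
import pandas as pd

with open('file.csv') as fh:
    # first pass unwraps the outer quotes; a doubled "" becomes a literal "
    inner_lines = [row[0] for row in csv.reader(fh)]

# second pass parses the real comma-separated records
records = list(csv.reader(io.StringIO('\n'.join(inner_lines))))
df = pd.DataFrame(records[1:], columns=[c.strip() for c in records[0]])

This keeps embedded commas in names like 'Kerluke, Reilly and Bechtelar' intact without any regex.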