I have a problem with R's paste function in combination with row and column selection on a data frame. It seems that paste wraps its input arguments in as.numeric() or something that does a similar job.
Here is a code snippet of what I am doing:
paste(df[1, c("entry1", "entry2")], collapse="; ")
This passes the first row of a data frame df, restricted to the columns "entry1" and "entry2". I expected output like this:
"Auffuellung; Holozaen"
Instead I am receiving the concatenated numeric codes of the passed data frame entries:
"1; 5"
Calling str(df[1, c("entry1", "entry2")]) on my real data set results in the following output (the names are German, do not wonder ;) ):
'data.frame': 1 obs. of 2 variables:
$ Hauptbestandteile: Factor w/ 38 levels "Auffuellung",..: 1
$ Chronografie : Factor w/ 18 levels "Devon","Famennium",..: 5
What am I doing wrong here? Until now I never faced such a problem with the paste function, and I would never have expected something like this to happen. So, how do I solve the problem and get the concatenated strings instead of the concatenated numeric codes?
Thank you in advance!
Your problem is related to the fact that your data are factor variables: paste is pasting the underlying integer codes. This is confusing, and it is not immediately obvious how to get around it. You need to turn the row into a vector using unlist(), and it will work as expected...
Example
df <- data.frame( Month = factor(month.name) , Short = factor(month.abb) )
df[ 1 , ]
# Month Short
#1 January Jan
paste( df[ 1 , ] , collapse = "; " )
#[1] "5; 5"
paste( unlist( df[ 1 , ] ) , collapse = "; " )
#[1] "January; Jan"
Of course, when reading your data in you can avoid strings being automatically converted to factors by passing the stringsAsFactors = FALSE argument to read.*, e.g. read.csv("file.csv", stringsAsFactors = FALSE).
See the R room chat log here for a discussion on this.
I recently needed to fill blank string values within a pandas dataframe with the value from an adjacent column in the same row.
I attempted df.apply(lambda x: x['A'].replace(...)) and also np.where. Neither worked. There were anomalies in how the "blank string values" were stored: I couldn't pick them up via '' or df['A'].replace(r'^\s$', df['B'], regex=True), or by replacing df['B'] with e.g. -. The only two things that worked were .isnull() and iterrows, where they appeared as nan.
So iterrows worked, but I never assign anything back to the dataframe.
How is pandas saving the changes?
import pandas as pd

mylist = {'A': ['fe', 'fi', 'fo', ''], 'B': ['fe1,', 'fi2', 'fi3', 'thum']}
coffee = pd.DataFrame(mylist)
print("output1\n", coffee.head())
for index, row in coffee.iterrows():
    if str(row['A']) == '':
        row['A'] = row['B']
print("output2\n", coffee.head())
output1
      A     B
0    fe  fe1,
1    fi   fi2
2    fo   fi3
3        thum
output2
      A     B
0    fe  fe1,
1    fi   fi2
2    fo   fi3
3  thum  thum
Note: the dataframe's columns are all of object dtype, BTW.
About pandas.DataFrame.iterrows, the documentation says:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
In your case, you can use one of these solutions (which should work with your real dataset as well):
coffee.loc[coffee["A"].eq(""), "A"] = coffee["B"]
Or :
coffee["A"] = coffee["B"].where(coffee["A"].eq(""), coffee["A"])
Or :
coffee["A"] = coffee["A"].replace("", None).fillna(coffee["B"])
It is still strange behaviour, though, that your original dataframe got updated within the loop without any re-assignment, not to mention that the row Series is supposed to be a copy and not a view.
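In fact, the quote explains what happened here: whether a write to the row touches the original frame depends on the dtypes. A minimal sketch of the difference (behaviour depends on the pandas version and internal block layout; under copy-on-write the write never propagates):
import pandas as pd

# All columns share a single object block, so the row Series handed out
# by iterrows may be a view; writing to it can leak back into the frame,
# as in the question.
obj_df = pd.DataFrame({'A': ['fe', ''], 'B': ['x', 'y']})
for _, row in obj_df.iterrows():
    if row['A'] == '':
        row['A'] = row['B']  # may modify obj_df in place
print(obj_df)  # possibly updated

# Mixed dtypes force iterrows to build a fresh copy per row, so the same
# write is silently lost.
mixed_df = pd.DataFrame({'A': ['fe', ''], 'B': [1, 2]})
for _, row in mixed_df.iterrows():
    if row['A'] == '':
        row['A'] = row['B']  # modifies only the copied row
print(mixed_df)  # unchanged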
I have a txt file that is output from another modelling program, which reports parameters of one modelled node at a time. The output is similar to the sample below. My problem is that the data comes as a single column, broken occasionally by a new header, after which the first part of the column (time) repeats but the second part is new. There are two things I would like to be able to do:
1) Break the data into two columns, time and the data for the node, and add the node label as the first column.
2) Later in the file there is another parameter for the node, not immediately below the first, under a header of the form Data 2 Node (XX,XX) with the same node id as before; I would like to match it up.
This would give me four columns in the end: the node id (repeated), the time, data parameter 1, and the matched data parameter 2.
I've included a small sample of the data below, but the output is over 1,000,000 lines, so it would be nice to use pandas or other Python functionality to manipulate the data.
Thanks for the help!
Name 20 vs 2
----------------------------------
time Data 1 Node( 72, 23)
--------------------- ---------------------
4.1203924E-003 -3.6406431E-005
1.4085015E-002 -5.8257871E-004
2.4049638E-002 6.8743013E-004
3.4014260E-002 8.2296302E-005
4.3978883E-002 -1.2276627E-004
5.3943505E-002 1.9813024E-004
....
Name 20 vs 2
----------------------------------
time Data 1 Node( 72, 24)
--------------------- ---------------------
4.1203924E-003 -3.6406431E-005
1.4085015E-002 -5.8257871E-004
2.4049638E-002 6.8743013E-004
3.4014260E-002 8.2296302E-005
4.3978883E-002 -1.2276627E-004
5.3943505E-002 1.9813024E-004
So after a fair amount of googling I managed to piece this one together. The data I was looking at was space separated, so I used the fixed-width file reader in pandas. I then looked at the indices of a few known elements and used them to break the data into two dataframes that I could merge and process afterwards. I removed the separator lines and NaN values, as they were not of interest to me. After that, the fun began on actually using the data.
import pandas as pd

filename = 'output.txt'  # placeholder: path to the model output file
widths = [28, 27]
df = pd.read_fwf(filename, widths=widths, names=["Times", "Items"])
data = df["Items"].astype(str)

# Row indices of the first "Data 1" and "Data 2" section headers
index1 = data.str.contains('Data 1').idxmax()
index2 = data.str.contains('Data 2').idxmax()

# Re-read each section into its own dataframe
df2 = pd.read_fwf(filename, widths=widths, skiprows=index1,
                  skipfooter=(len(df) - index2), header=None,
                  names=["Times", "Prin. Stress 1"])
df3 = pd.read_fwf(filename, widths=widths, skiprows=index2,
                  header=None, names=["Times", "Prin. Stress 2"])
df2["Prin. Stress 2"] = df3["Prin. Stress 2"]

df2 = df2[~df2["Times"].str.contains("---", na=False)]  # removes ---- lines
df2.dropna(inplace=True)
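The code above does not yet add the node id column the question asked for. A rough sketch of one way to pull it out of the matched header text (the regex and the "Node( 72, 23)" layout are assumptions based on the sample data, not part of the original solution):
import re

header = data.loc[index1]  # text of the "Data 1" header row, e.g. "Data 1 Node( 72, 23)"
match = re.search(r"Node\(\s*(\d+),\s*(\d+)\)", header)
if match:
    # Repeat the node id down a new first column
    df2.insert(0, "Node", "({}, {})".format(*match.groups()))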
I am trying to do some calculations across rows and columns in Python, and it is taking painfully long to execute for a large dataset. The calculation is as follows:
import numpy as np
import pandas as pd

Df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 2],
                   'unit': [1, 2, 1, 1, 1, 1, 2],
                   'D1': [100, 100, 100, 200, 300, 400, 3509],
                   'D2': [200, 200, 200, 300, 300, 400, 2500],
                   'D3': [50, 50, 50, 60, 50, 67, 98],
                   'Level1': [1, 4, 0, 4, 4, 4, 5],
                   'Level2': [45, 3, 0, 6, 7, 8, 9],
                   'Level3': [0, 0, 34, 8, 7, 0, 5]
                   })
For each value of A (in the example above, A=1 and A=2) I am running a function sequentially (i.e., I cannot run the function for A=1 and A=2 at the same time, since the outcome for A=1 changes some of the inputs for A=2). I am calculating a score as:
def score(data):
    data['score_Level1'] = np.where(data['Level1'] >= data['unit'], data['unit'], 0) * \
        ((np.where(data['Level1'] >= data['unit'], data['unit'], 0)).sum() * 100 + (10 / data['D1']))
    data['score_Level2'] = np.where(data['Level2'] >= data['unit'], data['unit'], 0) * \
        ((np.where(data['Level2'] >= data['unit'], data['unit'], 0)).sum() * 100 + (10 / data['D2']))
    data['score_Level3'] = np.where(data['Level3'] >= data['unit'], data['unit'], 0) * \
        ((np.where(data['Level3'] >= data['unit'], data['unit'], 0)).sum() * 100 + (10 / data['D3']))
    return data
What the code above does is go row by row and compute the score for Leveli (i = 1, 2, 3) as follows:
Step 1:
Compare the value of Leveli with the corresponding unit column: if Leveli >= unit, take unit, else 0.
Step 2:
Sum the result of Step 1 across all rows for Leveli, multiply by 100, and add 10/Di; call this "S".
Step 3:
Go row by row again and assign the score for Leveli as:
Step 1 * Step 2 (for each row)
The code above should yield the following results for A=1:
score(Df[Df['A']==1])
I am listing only the scoring for Level1; the same thing happens for Level2 and Level3.
Step 1:
Compare: 1 >= 1 is True, yields 1; 4 >= 2 is True, yields 2; 0 >= 1 is False, yields 0
Step 2:
(1 + 2 + 0) * 100 + 10/100 = 300.1
Step 3:
1 >= 1 is True, yields 1 * 300.1 = 300.1
4 >= 2 is True, yields 2 * 300.1 = 600.2
0 >= 1 is False, yields 0 * 300.1 = 0
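For reference, calling the function on the A=1 slice reproduces the hand-computed numbers (the .copy() here is my addition to avoid a SettingWithCopyWarning, not part of the original post):
out = score(Df[Df['A'] == 1].copy())
print(out['score_Level1'].tolist())  # [300.1, 600.2, 0.0]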
I am doing this for 200 million values of A. Since it has to be done sequentially (A=n depends on the outcome of A=n-1), it is taking a long time to compute.
Any suggestion for making it faster is much appreciated.
I think you can avoid the where, which should run faster.
Can you please try this code:
def score2(data, score_field, level_field, d_field):
    # Rows whose level reaches the unit threshold
    indexer = data[level_field] >= data['unit']
    data[score_field] = 0.0
    # unit * (sum of qualifying units * 100 + 10/Di), set only on qualifying rows
    data.loc[indexer, score_field] = data.loc[indexer, 'unit'] * (
        data.loc[indexer, 'unit'].sum() * 100 + 10 / data.loc[indexer, d_field])
    return data
score2(Df, 'score_Level1', 'Level1', 'D1')
score2(Df, 'score_Level2', 'Level2', 'D2')
score2(Df, 'score_Level3', 'Level3', 'D3')
The .loc in combination with the indexer replaces the where. On the left side of the assignment it will only set the values for the rows in which the level field is greater than or equal to unit. All others stay as they are; without the line data[score_field] = 0.0 they would contain NaN.
Btw., pandas has its own .where method, which works on Series. It is slightly different from the numpy implementation.
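A quick contrast of the two (a sketch with made-up Series: np.where selects between two branches, while Series.where keeps the caller's values where the condition holds and substitutes elsewhere):
import numpy as np
import pandas as pd

level = pd.Series([1, 4, 0])
unit = pd.Series([1, 2, 1])

# numpy: pick unit where the condition holds, else 0 (gives [1, 2, 0])
print(np.where(level >= unit, unit, 0))

# pandas: keep the calling Series where the condition holds, else 0;
# this reproduces the np.where call above
print(unit.where(level >= unit, 0).tolist())  # [1, 2, 0]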
I have a data frame which contains different columns ('features').
My goal is to calculate statistical measures for column X:
mean, standard deviation, variance,
but to calculate all of those with a dependency on column Y.
e.g. take all rows where Y = 1 and calculate mean, stddev, and var for them,
then do the same for all rows where Y = 2.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
I was told that the filter() approach is wasteful in terms of computation time, and was advised that, to make these calculations run faster (I'm using this on a 1 GB data file), it would be better to use the groupBy() method.
Can someone please help me transform these lines to do the same calculations using groupBy instead?
I got mixed up with the syntax and didn't manage to do it correctly.
Thanks.
Filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value), meaning you are scanning the data three times. The operation you are describing is best achieved with groupBy, which aggregates the data per value of the grouped column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
You can:
log_df.groupBy(log_df[flag_col]).agg(
    mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)
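For reference, a self-contained sketch of the grouped version (the SparkSession setup, toy data, and column names here are placeholders, not from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev, pow

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for log_df; "CONGESTION_FLAG" and "size" play the roles
# of flag_col and size_col
log_df = spark.createDataFrame(
    [(0, 10.0), (0, 12.0), (1, 7.0), (2, 3.0), (2, 5.0)],
    ["CONGESTION_FLAG", "size"],
)

# One scan over the data instead of three filtered scans
log_df.groupBy("CONGESTION_FLAG").agg(
    mean("size").alias("mean"),
    stddev("size").alias("stddev"),
    pow(stddev("size"), 2).alias("variance"),
).show()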
Can anyone tell me how I can write the output of my Fortran program in CSV format, so I can open the CSV file in Excel for plotting the data?
A slightly simpler version of the write statement could be:
write (1, '(1x, F, 3(",", F))') a(1), a(2), a(3), a(4)
Of course, this only works if your data is numeric or easily repeatable. You can leave the formatting to your spreadsheet program or be more explicit here.
I'd also recommend the csv_file module from FLIBS. Fortran is well equipped to read csv files, but not so much to write them. With the csv_file module, you put
use csv_file
at the beginning of your function/subroutine and then call it with:
call csv_write(unit, value, advance)
where unit = the file unit number, value = the array or scalar value you want to write, and advance = .true. or .false. depending on whether you want to advance to the next line or not.
Sample program:
program write_csv
  use csv_file
  implicit none
  integer :: a(3), b(2)
  open(unit=1, file='test.txt', status='unknown')
  a = (/1, 2, 3/)
  b = (/4, 5/)
  call csv_write(1, a, .true.)
  call csv_write(1, b, .true.)
end program
output:
1,2,3
4,5
If you instead just want to use the write command, I think you have to do it like this:
write(1,'(I1,A,I1,A,I1)') a(1),',',a(2),',',a(3)
write(1,'(I1,A,I1)') b(1),',',b(2)
which is very convoluted and requires you to know the maximum number of digits your values will have.
I'd strongly suggest using the csv_file module. It's certainly saved me many hours of frustration.
The Intel and gfortran (5.5) compilers recognize:
write(unit,'(*(G0.6,:,","))')array or data structure
which doesn't have excess blanks, and the line can have more than 999 columns.
To remove excess blanks with F95, first write into a character buffer and then use your own CSV_write program to take out the excess blanks, like this:
write(Buf,'(999(G21.6,:,","))')array or data structure
call CSV_write(unit,Buf)
You can also use
write(Buf,*)array or data structure
call CSV_write(unit,Buf)
where your CSV_write program replaces whitespace with "," in Buf. This is problematic in that it doesn't separate character variables unless there are extra blanks (i.e. 'a ','abc ' is OK).
I thought a full simple example without any other library might help. I assume you are working with matrices, since you want to plot from Excel (in any case it should be easy to extend the example).
tl;dr
Print one row at a time in a loop using the format format(1x, *(g0, ", "))
Full story
The purpose of the code below is to write in CSV format (that you can easily import in Excel) a (3x4) matrix.
The important line is the one labeled 101. It sets the format.
program testcsv
  IMPLICIT NONE
  INTEGER :: i, nrow
  REAL, DIMENSION(3,4) :: matrix

  ! Create a sample matrix
  matrix = RESHAPE(source = (/1,2,3,4,5,6,7,8,9,10,11,12/), &
                   shape = (/ 3, 4 /))

  ! Store the number of rows
  nrow = SIZE(matrix, 1)

  ! Formatting for CSV
  101 format(1x, *(g0, ", "))

  ! Open connection (i.e. create file where to write)
  OPEN(unit = 10, access = "sequential", action = "write", &
       status = "replace", file = "data.csv", form = "formatted")

  ! Loop across rows
  do i = 1, nrow
    WRITE(10, 101) matrix(i,:)
  end do

  ! Close connection
  CLOSE(10)
end program testcsv
We first create the sample matrix, then store the number of rows in the variable nrow (this is useful when you are not sure of the matrix's dimensions beforehand, and it is what the write loop iterates over). Skip the format statement for a second. What we do next is open (create or replace) the CSV file, named data.csv. Then we loop over the rows of the matrix (do statement) and write one row at a time (write statement) to the CSV file; rows are appended one after another.
In more detail, the write statement works as WRITE(U, FMT) WHAT. We write WHAT (the i-th row of the matrix: matrix(i,:)) to connection U (the one we created with the open statement), formatting WHAT according to FMT.
Note that in the example FMT=101, and 101 is the label of our format statement:
format(1x, *(g0, ", "))
What this does is: 1x inserts a white space at the beginning of the row; the * is used for unlimited format repetition, which means that the format in the following parentheses is repeated for all the data left in the object we are printing (i.e. all elements in the matrix's row). Thus, each number in a row is formatted as g0, ", ".
g is a general format descriptor that handles floats as well as characters, logicals and integers; the trailing 0 basically means "use the least amount of space needed to contain the object to be formatted" (it avoids unnecessary spaces). Then, after the formatted number, we require a comma plus a space: ", ". This produces our comma-separated values for a row of the matrix (you can use a separator other than "," if you need). We repeat this for every row, and that's it.
(The spaces in the format are not really needed, so one could use format(*(g0,",")).)
Reference: Metcalf, M., Reid, J., & Cohen, M. (2018). Modern Fortran Explained: Incorporating Fortran 2018. Oxford University Press.
Ten seconds' work with a search engine finds me the FLIBS library, which includes a module called csv_file that will write strings, scalars and arrays out in CSV format.