I have certain data that need to be converted to strings. Example:
[ABCGHDEF-12345, ABCDKJEF-123235,...]
The example above does not represent a constant or a string by itself but is taken from an Excel sheet (ranging upto 30+ items for each row). I want to convert these to strings. Since data is undefined, explicitly converting them doesn't work. Is there a way to do this iteratively without placing double/single quotes manually between each data element?
What I want finally:
["ABCGHDEF-12345", "ABCDKJEF-123235",...]
To convert the string to list of strings you can try:
s = "[ABCGHDEF-12345, ABCDKJEF-123235]"
s = s.strip("[]").split(", ")
print(s)
Prints:
['ABCGHDEF-12345', 'ABCDKJEF-123235']
i would like to read the from a text file shown below, loop through each individual digit and determine which digit occurs the maximum number of times. How can I go about it in pyspark.
Here is the txt file
1.4142135623 7309504880 1688724209 6980785696 7187537694 8073176679 7379907324 7846210703 8850387534 3276415727 3501384623 0912297024 9248360558 5073721264 4121497099 9358314132 2266592750 5592755799 9505011527 8206057147 0109559971 6059702745 3459686201 4728517418 6408891986 0955232923 0484308714 3214508397 6260362799 5251407989 6872533965 4633180882 9640620615 2583523950 5474575028 7759961729 8355752203 3753185701 1354374603 4084988471 6038689997 0699004815 0305440277 9031645424 7823068492 9369186215 8057846311 1596668713 0130156185 6898723723 5288509264 8612494977 1542183342 0428568606 0146824720 7714358548 7415565706 9677653720 2264854470 1585880162 0758474922 6572260020 8558446652 1458398893 9443709265 9180031138 8246468157 0826301005 9485870400 3186480342 1948972782 9064104507 2636881313 7398552561 1732204024 5091227700 2269411275 7362728049 5738108967 5040183698 6836845072 5799364729 0607629969 4138047565 4823728997 1803268024 7442062926 9124859052 1810044598 4215059112 0249441341 7285314781 0580360337 1077309182 8693147101 7111168391 6581726889 4197587165 8215212822 951848847
If you want to achieve for a single line or multiple lines in a text file, first eliminate all the special characters from the file, and then you need to get pair RDD (digits in text file as key) .
Bellow is the pseudo-code for it,
strRDD = df.flatMap(lambda line : list(str(line.value)))\
.filter(lambda digit : digit.isalnum())\
.map(lambda digit : (int(digit), 1))\
.reduceByKey(lambda a, b : a + b).max(lambda x : x[1])
print(strRDD)
Output for your text file is :(8, 113)
//Using RDD's - This will be better for your scenario.
load data using text file then use flatMap (since you said all numbers in single line) to split using space as delimiter. Then convert each number from string to float as some of your numbers are float. Now make RDD pair( number as key and value as one for each number). Using reduceBykey aggregate total count of each number and then sort by value using sortBy in descending order.
data = sc.textFile("your file path")
numcount = data.flatMap(lambda x: x.split( )).map(lambda x: (float(x),1)).reduceByKey(lambda a,b:a+b).sortBy(lambda x: x[1],ascending = False)
for i in numcount.collect(): print (i)
// Using DataFrames:
since you said all numbers in single line I am reading as text to get
all values into single column. Then creating new column and splitting the column to list with space as delimiter using split function. Then exploding the list to get each number into a row.
Now drop old column using drop function
Now using groupBy, count functions you can count each number how many times it show's up. Also using sort in descending order you can find order wise.
from pyspark.sql.functions import *
numcount = spark.read.text("your file path").withColumn('new', explode(split('value',' '))).drop('value')
numcount.groupBy('new').agg(count('new').alias('counts')).sort(desc('counts')).show()
Starting from a dataframe with the data you can do:
df = df.groupby("digit").count().sort("count")
This code returns the most repeated values. After that you can show this values and get the most repèated value with:
df.show()
df.limit(1).show()
I am trying to convert all items in my dataframe to a float. The types are varies at the moment. The following error persist -> ValueError: could not convert string to float: '116,584.54'
The file can be found at https://www.imf.org/external/pubs/ft/weo/2019/01/weodata/WEOApr2019all.xls
I checked the value in excel, it is a Number. I tried .replace, .astype, pd.to_numeric.
for i in weo['1980']:
if i == float:
print(i)
i.replace(",",'')
i.replace("--",np.nan)
else:
continue
Also, I have tried:
weo['1980'] = weo['1980'].apply(pd.to_numeric)
You can try using DataFrame.astype in order to conduct the conversion which is usually the recommended approach. As you already attempted in your question, you may have to remove all the comas form the string in column 1980 first as it may cause the same error as quoted in your question:
weo['1980'] = weo['1980'].replace(',', '')
weo['1980'] = weo['1980'].asytpe(float)
If you're reading your DataFrame from Excel using pandas.read_excel, you can also specify the thousands argument to do this conversion for you which will likely result in a higher performance:
pandas.read_excel(file, thousands=',')
I had types error all the time while playing with dataframes. I now always use this to convert all the values that can be converted into floats.
# Convert all columns that can be converted into float into float.
# Error were raised because their type was Object
df = df.apply(pd.to_numeric, errors='ignore')
I want to write a list of strings to a binary file. Suppose I have a list of strings mylist? Assume the items of the list has a '\t' at the end, except the last one has a '\n' at the end (to help me, recover the data back). Example: ['test\t', 'test1\t', 'test2\t', 'testl\n']
For a numpy ndarray, I found the following script that worked (got it from here numpy to r converter):
binfile = open('myfile.bin','wb')
for i in range(mynpdata.shape[1]):
binfile.write(struct.pack('%id' % mynpdata.shape[0], *mynpdata[:,i]))
binfile.close()
Does binfile.write automatically parses all the data if variable has * in front it (such in the *mynpdata[:,i] example above)? Would this work with a list of integers in the same way (e.g. *myIntList)?
How can I do the same with a list of string?
I tried it on a single string using (which I found somewhere on the net):
oneString = 'test'
oneStringByte = bytes(oneString,'utf-8')
struct.pack('I%ds' % (len(oneString),), len(oneString), oneString)
but I couldn't understand why is the % within 'I%ds' above replaced by (len(oneString),) instead of len(oneString) like the ndarray example AND also why is both len(oneString) and oneString passed?
Can someone help me with writing a list of string (if necessary, assuming it is written to the same binary file where I wrote out the ndarray) ?
There's no need for struct. Simply join the strings and encode them using either a specified or an assumed text encoding in order to turn them into bytes.
''.join(L).encode('utf-8')
Can any one tell me, how can I write my output of Fortran program in CSV format? So I can open the CSV file in Excel for plotting data.
A slightly simpler version of the write statement could be:
write (1, '(1x, F, 3(",", F))') a(1), a(2), a(3), a(4)
Of course, this only works if your data is numeric or easily repeatable. You can leave the formatting to your spreadsheet program or be more explicit here.
I'd also recommend the csv_file module from FLIBS. Fortran is well equipped to read csv files, but not so much to write them. With the csv_file module, you put
use csv_file
at the beginning of your function/subroutine and then call it with:
call csv_write(unit, value, advance)
where unit = the file unit number, value = the array or scalar value you want to write, and advance = .true. or .false. depending on whether you want to advance to the next line or not.
Sample program:
program write_csv
use csv_file
implicit none
integer :: a(3), b(2)
open(unit=1,file='test.txt',status='unknown')
a = (/1,2,3/)
b = (/4,5/)
call csv_write(1,a,.true.)
call csv_write(1,b,.true.)
end program
output:
1,2,3
4,5
if you instead just want to use the write command, I think you have to do it like this:
write(1,'(I1,A,I1,A,I1)') a(1),',',a(2),',',a(3)
write(1,'(I1,A,I1)') b(1),',',b(2)
which is very convoluted and requires you to know the maximum number of digits your values will have.
I'd strongly suggest using the csv_file module. It's certainly saved me many hours of frustration.
The Intel and gfortran (5.5) compilers recognize:
write(unit,'(*(G0.6,:,","))')array or data structure
which doesn't have excess blanks, and the line can have more than 999 columns.
To remove excess blanks with F95, first write into a character buffer and then use your own CSV_write program to take out the excess blanks, like this:
write(Buf,'(999(G21.6,:,","))')array or data structure
call CSV_write(unit,Buf)
You can also use
write(Buf,*)array or data structure
call CSV_write(unit,Buf)
where your CSV_write program replaces whitespace with "," in Buf. This is problematic in that it doesn't separate character variables unless there are extra blanks (i.e. 'a ','abc ' is OK).
I thought a full simple example without any other library might help. I assume you are working with matrices, since you want to plot from Excel (in any case it should be easy to extend the example).
tl;dr
Print one row at a time in a loop using the format format(1x, *(g0, ", "))
Full story
The purpose of the code below is to write in CSV format (that you can easily import in Excel) a (3x4) matrix.
The important line is the one labeled 101. It sets the format.
program testcsv
IMPLICIT NONE
INTEGER :: i, nrow
REAL, DIMENSION(3,4) :: matrix
! Create a sample matrix
matrix = RESHAPE(source = (/1,2,3,4,5,6,7,8,9,10,11,12/), &
shape = (/ 3, 4 /))
! Store the number of rows
nrow = SIZE(matrix, 1)
! Formatting for CSV
101 format(1x, *(g0, ", "))
! Open connection (i.e. create file where to write)
OPEN(unit = 10, access = "sequential", action = "write", &
status = "replace", file = "data.csv", form = "formatted")
! Loop across rows
do i=1,3
WRITE(10, 101) matrix(i,:)
end do
! Close connection
CLOSE(10)
end program testcsv
We first create the sample matrix. Then store the number of rows in the variable nrow (this is useful when you are not sure of the matrix's dimension beforehand). Skip a second the format statement. What we do next is to open (create or replace) the CSV file, names data.csv. Then we loop over the rows (do statement) of the matrix to write a row at a time (write statement) in the CSV file; rows will be appended one after another.
In more details how the write statement works is: WRITE(U,FMT) WHAT. We write "what" (the i-th row of the matrix: matrix(i,:)), to connection U (the one we created with the open statement), formatting the WHAT according to FMT.
Note that in the example FMT=101, and 101 is the label of our format statement:
format(1x, *(g0, ", "))
what this does is: "1x" insert a white space at the beginning of the row; the "*" is used for unlimited format repetition, which means that the format in the following parentheses is repeated for all the data left in the object we are printing (i.e. all elements in the matrix's row). Thus, each row number is formatted as: 'g0, ", "'.
g is a general format descriptor that handles floats as well as characters, logicals and integers; the trailing 0 basically means: "use the least amount of space needed to contain the object to be formatted" (avoids unnecessary spaces). Then, after the formatted number, we require the comma plus a space: **", ". This produces our comma-separated values for a row of the matrix (you can use other separators instead of "," if you need). We repeat for every row and that's it.
(The spaces in the format are not really needed, thus one could use format(*(g0,","))
Reference: Metcalf, M., Reid, J., & Cohen, M. (2018). Modern Fortran Explained: Incorporating Fortran 2018. Oxford University Press.
Tens seconds work with a search engine finds me the FLIBS library, which includes a module called csv_file which will write strings, scalars and arrays out is CSV format.