SPSS string extraction - string

I have a column with names of different persons separated by comma, for example, (all in 1 cell) Ben Lee, Paul Loy, Boy Lim. I want to separate each name into different columns. How will I do it? (in SPSS syntax).

See this thread with potential solution(s). Namely, credit David Marso and Jon Peck:
* General Parser *.
DATA LIST / X 1-80 (A).
BEGIN DATA 11-0101-423-7384
END DATA.
VECTOR NUMS(10).
COMPUTE #0=0.
LOOP.
COMPUTE #1=INDEX(X,'-').
COMPUTE #0=#0+1.
IF #1>0 NUMS(#0)=NUMBER(SUBSTR(X,1,#1-1),F8).
COMPUTE X=SUBSTR(X,#1+1).
END LOOP IF #1=0.
COMPUTE NUMS(#0)=NUMBER(X,F8).
MATCH FILES FILE * / DROP X.
LIST.
Or alternatively a python solution:
data list free /x(a13).
begin data.
1,13,5,6,99,8
end data.
dataset name data.
begin program.
def split(v):
return v.split(',')
end program.
spssinc trans result = v1 to v6
/formula "split(x)".

do repeat TXTname="Ben Lee" "Paul Loy" "Boy Lim"/VRname=BenLee PaulLoy BoyLim.
compute VRname=index(OriginalColumnName, TXTname)>0.
end repeat.
If there are many more names you might prefer to use standard variable names instead, and add the actual names in the labels instead:
do repeat TXTname="Ben Lee" "Paul Loy" "Boy Lim"/VRname=Name01 to Name03.
compute VRname=index(OriginalColumnName, TXTname)>0.
end repeat.
variable labels
Name01 "Ben Lee"
Name02 "Paul Loy"
Name03 "Boy Lim".

Related

What is the easiest way to split a string into a first name and last name?

The dataset has 14k rows and has many titles, etc.
I am a beginner in Pandas and Python and I'd like to know how to proceed with getting the output of first name and last name from this dataset.
Dataset:
0 Pr.Doz.Dr. Klaus Semmler Facharzt für Frauenhe...
1 Dr. univ. (Budapest) Dalia Lax
2 Dr. med. Jovan Stojilkovic
3 Dr. med. Dirk Schneider
4 Marc Scheuermann
14083 Bag Kinderarztpraxis
14084 Herr Ulrich Bromig
14085 Sohn Heinrich
14086 Herr Dr. sc. med. Amadeus Hartwig
14087 Jasmin Rieche
for name in dataset:
first = name.split()[-2]
last = name.split()[-1]
# save here
This will work for most names, not all. For repeatability you may need a list of titles such as (dr., md., univ.) to skip over
As it doesn't contain any structure, you're out of luck. An ad-hoc solution could be to just write down a list of all locations/titles/conjunctions and other noise you've identified and then strip those from the rows. Then, if you notice some other things you'd like to exclude, just add them to your list.
This will not solve the issue of certain rows having their name in reverse order. So it'll require you to manually go over everything and check if the row is valid, but it might be quicker than editing each row by hand.
A simple, brute-force example would be:
excludes = {'dr.', 'herr', 'budapest', 'med.', 'für', ... }
new_entries = []
for title in all_entries:
cleaned_result = []
parts = title.split(' ')
for part in parts:
if part.lowercase() not in excludes:
cleaned_result.append(part)
new_entries.append(' '.join(cleaned_result))

Defining a function to find the unique palindromes in a given string

I'm kinda new to python.I'm trying to define a function when asked would give an output of only unique words which are palindromes in a string.
I used casefold() to make it case-insensitive and set() to print only uniques.
Here's my code:
def uniquePalindromes(string):
x=string.split()
for i in x:
k=[]
rev= ''.join(reversed(i))
if i.casefold() == rev.casefold():
k.append(i.casefold())
print(set(k))
else:
return
I've tried to run this line
print( uniquePalindromes('Hanah asked Sarah but Sarah refused') )
The expected output should be ['hanah','sarah'] but its returning only {'hanah'} as the output. Please help.
Your logic is sound, and your function is mostly doing what you want it to. Part of the issue is how you're returning things - all you're doing is printing the set of each individual word. For example, when I take your existing code and do this:
>>> print(uniquePalindromes('Hannah Hannah Alomomola Girafarig Yes Nah, Chansey Goldeen Need log'))
{'hannah'}
{'alomomola'}
{'girafarig'}
None
hannah, alomomola, and girafarig are the palindromes I would expect to see, but they're not given in the format I expect. For one, they're being printed, instead of returned, and for two, that's happening one-by-one.
And the function is returning None, and you're trying to print that. This is not what we want.
Here's a fixed version of your function:
def uniquePalindromes(string):
x=string.split()
k = [] # note how we put it *outside* the loop, so it persists across each iteration without being reset
for i in x:
rev= ''.join(reversed(i))
if i.casefold() == rev.casefold():
k.append(i.casefold())
# the print statement isn't what we want
# no need for an else statement - the loop will continue anyway
# now, once all elements have been visited, return the set of unique elements from k
return set(k)
now it returns roughly what you'd expect - a single set with multiple words, instead of printing multiple sets with one word each. Then, we can print that set.
>>> print(uniquePalindromes("Hannah asked Sarah but Sarah refused"))
{'hannah'}
>>> print(uniquePalindromes("Hannah and her friend Anna caught a Girafarig and named it hannaH"))
{'anna', 'hannah', 'girafarig', 'a'}
they are not gonna like me on here if I give you some tips. But try to divide the amount of characters (that aren't whitespace) into 2. If the amount on each side is not equivalent then you must be dealing with an odd amount of letters. That means that you should be able to traverse the palindrome going downwards from the middle and upwards from the middle, comparing those letters together and using the middle point as a "jump off" point. Hope this helps

finding elements from list with different format of strings

I have a large list with elements as:
#1
#10
(on
)
0.0574
122-124
122A
Cat
Dog
elephant
elephant12
elephant-1
I want to search and be able to find only the following:
Cat
Dog
elephant
elephant12
elephant-1
i.e. elements which have an English alphabet at the beginning.
Use list comprehension:
import string
result = [item for item in my_list if item[0] in string.ascii_letters]
As #Jon commented, check if a character is a letter can simply be:
result = [item for item in my_list if item[0].isalpha()]
The above works when all items are string, and you expect items with leading English character. Change the if part as needed or even write a function if it is too complex.
If you are looking for memory-optimized version, consider generator.
You could work with a list of animals.
for example:
list=["cat","dog","elephant","fish","bird", "snake"]
then search for any of those strings in your input
I know there are better methods but I would do it with
def search(input):
for item in list:
if item in input:
result.append(item)
return result
Then you would add case ignorance precisions.
If you want the number associated with your animal name, you'll need to append the input. In this case, iterate your input.
Of course your list could take the dimension of a database if you need, for example to search for all possible existing word.

Need help working with lists within lists

I'm taking a programming class and have our first assignment. I understand how it's supposed to work, but apparently I haven't hit upon the correct terms to search to get help (and the book is less than useless).
The assignment is to take a provided data set (names and numbers) and perform some manipulation and computation with it.
I'm able to get the names into a list, and know the general format of what commands I'm giving, but the specifics are evading me. I know that you refer to the numbers as names[0][1], names[1][1], etc, but not how to refer to just that record that is being changed. For example, we have to have the program check if a name begins with a letter that is Q or later; if it does, we double the number associated with that name.
This is what I have so far, with ??? indicating where I know something goes, but not sure what it's called to search for it.
It's homework, so I'm not really looking for answers, but guidance to figure out the right terms to search for my answers. I already found some stuff on the site (like the statistics functions), but just can't find everything the book doesn't even mention.
names = [("Jack",456),("Kayden",355),("Randy",765),("Lisa",635),("Devin",358),("LaWanda",452),("William",308),("Patrcia",256)]
length = len(names)
count = 0
while True
count < length:
if ??? > "Q" # checks if first letter of name is greater than Q
??? # doubles number associated with name
count += 1
print(names) # self-check
numberNames = names # creates new list
import statistics
mean = statistics.mean(???)
median = statistics.median(???)
print("Mean value: {0:.2f}".format(mean))
alphaNames = sorted(numberNames) # sorts names list by name and creates new list
print(alphaNames)
first of all you need to iter over your names list. To do so use for loop:
for person in names:
print(person)
But names are a list of tuples so you will need to get the person name by accessing the first item of the tuple. You do this just like you do with lists
name = person[0]
score = person[1]
Finally to get the ASCII code of a character, you use ord() function. That is going to be helpful to know if name starts with a Q or above.
print(ord('A'))
print(ord('Q'))
print(ord('R'))
This should be enough informations to get you started with.
I see a few parts to your question, so I'll try to separate them out in my response.
check if first letter of name is greater than Q
Hopefully this will help you with the syntax here. Like list, str also supports element access by index with the [] syntax.
$ names = [("Jack",456),("Kayden",355)]
$ names[0]
('Jack', 456)
$ names[0][0]
'Jack'
$ names[0][0][0]
'J'
$ names[0][0][0] < 'Q'
True
$ names[0][0][0] > 'Q'
False
double number associated with name
$ names[0][1]
456
$ names[0][1] * 2
912
"how to refer to just that record that is being changed"
We are trying to update the value associated with the name.
In theme with my previous code examples - that is, we want to update the value at index 1 of the tuple stored at index 0 in the list called names
However, tuples are immutable so we have to be a little tricky if we want to use the data structure you're using.
$ names = [("Jack",456), ("Kayden", 355)]
$ names[0]
('Jack', 456)
$ tpl = names[0]
$ tpl = (tpl[0], tpl[1] * 2)
$ tpl
('Jack', 912)
$ names[0] = tpl
$ names
[('Jack', 912), ('Kayden', 355)]
Do this for all tuples in the list
We need to do this for the whole list, it looks like you were onto that with your while loop. Your counter variable for indexing the list is named count so just use that to index a specific tuple, like: names[count][0] for the countth name or names[count][1] for the countth number.
using statistics for calculating mean and median
I recommend looking at the documentation for a module when you want to know how to use it. Here is an example for mean:
mean(data)
Return the sample arithmetic mean of data.
$ mean([1, 2, 3, 4, 4])
2.8
Hopefully these examples help you with the syntax for continuing your assignment, although this could turn into a long discussion.
The title of your post is "Need help working with lists within lists" ... well, your code example uses a list of tuples
$ names = [("Jack",456),("Kayden",355)]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'tuple'>
$ names = [["Jack",456], ["Kayden", 355]]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'list'>
notice the difference in the [] and ()
If you are free to structure the data however you like, then I would recommend using a dict (read: dictionary).
I know that you refer to the numbers as names[0][1], names[1][1], etc, but
not how to refer to just that record that is being changed. For
example, we have to have the program check if a name begins with a
letter that is Q or later; if it does, we double the number associated
with that name.
It's not entirely clear what else you have to do in this assignment, but regarding your concerns above, to reference the ith"record that is being changed" in your names list, simply use names[i]. So, if you want to access the first record in names, simply use names[0], since indexing in Python begins at zero.
Since each element in your list is a tuple (which can also be indexed), using constructs like names[0][0] and names[0][1] are ways to index the values within the tuple, as you pointed out.
I'm unsure why you're using while True if you're trying to iterate through each name and check whether it begins with "Q". It seems like a for loop would be better, unless your class hasn't gotten there yet.
As for checking whether the first letter is 'Q', str (string) objects are indexed similarly to lists and tuples. To access the first letter in a string, for example, see the following:
>>> my_string = 'Hello'
>>> my_string[0]
'H'
If you give more information, we can help guide you with the statistics piece, as well. But I would first suggest you get some background around mean and median (if you're unfamiliar).

How to write Fortran Output as CSV file?

Can any one tell me, how can I write my output of Fortran program in CSV format? So I can open the CSV file in Excel for plotting data.
A slightly simpler version of the write statement could be:
write (1, '(1x, F, 3(",", F))') a(1), a(2), a(3), a(4)
Of course, this only works if your data is numeric or easily repeatable. You can leave the formatting to your spreadsheet program or be more explicit here.
I'd also recommend the csv_file module from FLIBS. Fortran is well equipped to read csv files, but not so much to write them. With the csv_file module, you put
use csv_file
at the beginning of your function/subroutine and then call it with:
call csv_write(unit, value, advance)
where unit = the file unit number, value = the array or scalar value you want to write, and advance = .true. or .false. depending on whether you want to advance to the next line or not.
Sample program:
program write_csv
use csv_file
implicit none
integer :: a(3), b(2)
open(unit=1,file='test.txt',status='unknown')
a = (/1,2,3/)
b = (/4,5/)
call csv_write(1,a,.true.)
call csv_write(1,b,.true.)
end program
output:
1,2,3
4,5
if you instead just want to use the write command, I think you have to do it like this:
write(1,'(I1,A,I1,A,I1)') a(1),',',a(2),',',a(3)
write(1,'(I1,A,I1)') b(1),',',b(2)
which is very convoluted and requires you to know the maximum number of digits your values will have.
I'd strongly suggest using the csv_file module. It's certainly saved me many hours of frustration.
The Intel and gfortran (5.5) compilers recognize:
write(unit,'(*(G0.6,:,","))')array or data structure
which doesn't have excess blanks, and the line can have more than 999 columns.
To remove excess blanks with F95, first write into a character buffer and then use your own CSV_write program to take out the excess blanks, like this:
write(Buf,'(999(G21.6,:,","))')array or data structure
call CSV_write(unit,Buf)
You can also use
write(Buf,*)array or data structure
call CSV_write(unit,Buf)
where your CSV_write program replaces whitespace with "," in Buf. This is problematic in that it doesn't separate character variables unless there are extra blanks (i.e. 'a ','abc ' is OK).
I thought a full simple example without any other library might help. I assume you are working with matrices, since you want to plot from Excel (in any case it should be easy to extend the example).
tl;dr
Print one row at a time in a loop using the format format(1x, *(g0, ", "))
Full story
The purpose of the code below is to write in CSV format (that you can easily import in Excel) a (3x4) matrix.
The important line is the one labeled 101. It sets the format.
program testcsv
IMPLICIT NONE
INTEGER :: i, nrow
REAL, DIMENSION(3,4) :: matrix
! Create a sample matrix
matrix = RESHAPE(source = (/1,2,3,4,5,6,7,8,9,10,11,12/), &
shape = (/ 3, 4 /))
! Store the number of rows
nrow = SIZE(matrix, 1)
! Formatting for CSV
101 format(1x, *(g0, ", "))
! Open connection (i.e. create file where to write)
OPEN(unit = 10, access = "sequential", action = "write", &
status = "replace", file = "data.csv", form = "formatted")
! Loop across rows
do i=1,3
WRITE(10, 101) matrix(i,:)
end do
! Close connection
CLOSE(10)
end program testcsv
We first create the sample matrix. Then store the number of rows in the variable nrow (this is useful when you are not sure of the matrix's dimension beforehand). Skip a second the format statement. What we do next is to open (create or replace) the CSV file, names data.csv. Then we loop over the rows (do statement) of the matrix to write a row at a time (write statement) in the CSV file; rows will be appended one after another.
In more details how the write statement works is: WRITE(U,FMT) WHAT. We write "what" (the i-th row of the matrix: matrix(i,:)), to connection U (the one we created with the open statement), formatting the WHAT according to FMT.
Note that in the example FMT=101, and 101 is the label of our format statement:
format(1x, *(g0, ", "))
what this does is: "1x" insert a white space at the beginning of the row; the "*" is used for unlimited format repetition, which means that the format in the following parentheses is repeated for all the data left in the object we are printing (i.e. all elements in the matrix's row). Thus, each row number is formatted as: 'g0, ", "'.
g is a general format descriptor that handles floats as well as characters, logicals and integers; the trailing 0 basically means: "use the least amount of space needed to contain the object to be formatted" (avoids unnecessary spaces). Then, after the formatted number, we require the comma plus a space: **", ". This produces our comma-separated values for a row of the matrix (you can use other separators instead of "," if you need). We repeat for every row and that's it.
(The spaces in the format are not really needed, thus one could use format(*(g0,","))
Reference: Metcalf, M., Reid, J., & Cohen, M. (2018). Modern Fortran Explained: Incorporating Fortran 2018. Oxford University Press.
Tens seconds work with a search engine finds me the FLIBS library, which includes a module called csv_file which will write strings, scalars and arrays out is CSV format.

Resources