How to use slicing in Python - python-3.x

I am finding slicing in Python a bit very difficult. Lets say if I want the first five and last five characters of a phrase to display how do i go about it. For example:
words = input("Enter a word ")
slice = words[:2]
print(slice)

You can use negative indexing for slice from end :
>>> s="teststring"
>>>
>>> s[-5:]
'tring'
>>> s[:5]
'tests'
Actually a slice notation observes the following law :
[start:end:step]
One way to remember how slices work is to think of the indices as pointing between characters, with the left edge of the first character numbered 0. Then the right edge of the last character of a string of n characters has index n, for example:
+---+---+---+---+---+---+
| P | y | t | h | o | n |
+---+---+---+---+---+---+
0 1 2 3 4 5 6
-6 -5 -4 -3 -2 -1
Read more about slicing https://docs.python.org/2/tutorial/introduction.html#strings
And https://docs.python.org/2.3/whatsnew/section-slices.html

Related

How to split data and assign it into designated variables?

I have data in Stata regarding the feeling of the current situation. There are seven types of feeling. The data is stored in the following format (note that the data type is a string, and one person can respond to more than 1 answer)
feeling
4,7
1,3,4
2,5,6,7
1,2,3,4,5,6,7
Since the data is a string, I tried to separate it by
split feeling, parse (,)
and I got the result
feeling1
feeling2
feeling3
feeling4
feeling5
feeling6
feeling7
4
7
1
3
4
2
5
6
7
1
2
3
4
5
6
7
However, this is not the result I want. which is that the representative number of feelings should go into the correct variable. For instance.
feeling1
feeling2
feeling3
feeling4
feeling5
feeling6
feeling7
4
7
1
3
4
2
5
6
7
1
2
3
4
5
6
7
I am not sure if there is any built-in command or function for this kind of problem. I am thinking about using forval in looping through every value in each variable and try to juggle it around into the correct variable.
A loop over the distinct values would be enough here. I give your example in a form explained in the Stata tag wiki as more helpful and then give code to get the variables you want as numeric variables.
* Example generated by -dataex-. For more info, type help dataex
clear
input str13 feeling
"4,7"
"1,3,4"
"2,5,6,7"
"1,2,3,4,5,6,7"
end
forval j = 1/7 {
gen wanted`j' = `j' if strpos(feeling, "`j'")
gen better`j' = strpos(feeling, "`j'") > 0
}
l feeling wanted1-better3
+---------------------------------------------------------------------------+
| feeling wanted1 better1 wanted2 better2 wanted3 better3 |
|---------------------------------------------------------------------------|
1. | 4,7 . 0 . 0 . 0 |
2. | 1,3,4 1 1 . 0 3 1 |
3. | 2,5,6,7 . 0 2 1 . 0 |
4. | 1,2,3,4,5,6,7 1 1 2 1 3 1 |
+---------------------------------------------------------------------------+
If you wanted a string result that would be yielded by
gen wanted`j' = "`j'" if strpos(feeling, "`j'")
Had the number of feelings been 10 or more you would have needed more careful code as for example a search for "1" would find it within "10".
Indicator (some say dummy) variables with distinct values 1 or 0 are immensely more valuable for most analysis of this kind of data.
Note Stata-related sources such as
this FAQ
this paper
and this paper.

output of "s = "0123456789" print(s[2:-1:-1])" slice operator

s = "0123456789"
print(s[2:-1:-1])
according to me, output of the above question should be "210" BUT it gives nothing
please explain to me how?
Syntax:
sequence [start:stop[:step]]
start:
Optional. Starting index of the slice. Defaults to 0.
stop:
Optional. The last index of the slice or the number of items to get. Defaults to len(sequence).
step:
Optional. Extended slice syntax. Step value of the slice. Defaults to 1.
+---+---+---+---+
|-4 |-3 |-2 |-1 | <= negative indexes
+---+---+---+---+
| A | B | C | D | <= sequence elements
+---+---+---+---+
| 0 | 1 | 2 | 3 | <= positive indexes
+---+---+---+---+
|<- 2:-1:-1 ->| <= extent of the slice: "ABCD"[2:-1:-1] (won't work)
Explanation:
Here in my example "ABCD"[2:-1:-1] If we interpret it, then it says:
start from index 2. (include that item)
Go till index -1 (exclude that item) which is the last item as you can see the table above.
With steps of -1 which basically means in reverse direction. Here you are contradicting your sequence. So it returns nothing.
So the solution would be "ABCD"[2::-1] as someone correctly answered in the comment. This says start from index 2 go till end either beginnig or end based on the steps which is -1 here so beginning.
So same answer to your question print(s[2::-1]) will print 210

Extract a substring new column based on a substring based on conditions ideally with Pandas

I got a data set (Excel) with hundreds of entries. In one string column there is most of the information. The information is divided by '_' and typed in by humans. Therefore, it is not possible to work with index positions.
To create a usable data basis it's mandatory to extract information from this column in another column.
The search pattern = '*v*' is alone not enough. But combined with the condition that the first item has to be a digit it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more. But the best solution I received with a for-loop.
pattern = '_*v*_'
test = []
for i in df['col']:
'#Split the string in substrings
i = i.split('_')
for c in i:
if c.find('x') == 1:
if c[0].isdigit():
# print(c)
test.append(c)
else:
'#To be able to fix a few rows manually
test.append(0)
[4]: test =[22v3, 33v55, 4v2]
#Input
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | |
| 33v55_bdd | |
| Ave_4v2 | |
+-----------+-----------+
#Output
+-----------+-----------+--+
| col | targetcol | |
+-----------+-----------+--+
| as_22v3 | 22v3 | |
| 33v55_bdd | 33v55 | |
| Ave_4v2 | 4v2 | |
+-----------+-----------+--+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message besides of course that it is not possible to assign the list to a DataFrame series since it has not the same size.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
0
0 22v3
1 33v55
2 4v2
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
EDIT
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
'2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
col result
0 as_22v3 22v3
1 33v55_bdd 33v55
2 Ave_4v2 4v2
3 _22 v3 22 v3
4 space 2,2v3 2,2v3
5 2.v3 NaN
6 2.111v999 111v999
7 asd.123v77 123v77
8 1 v7 1 v7
9 123 v 8135 NaN
You say it stops after 36 values? You say it is Excel file you are processing? One thing you could try is to save data set to .csv file and try to read this file in with pd.read_csv function. There are sometimes some extra characters in Excel file that are not easily visible.

spark dataframe sum of column based on condition

I want to calculate the portion of the value, with only two partitions( where type == red and where type != red)
ID | type | value
-----------------------------
1 | red | 10
2 | blue | 20
3 | yellow | 30
result should be :
ID | type | value | portion
-----------------------------
1 | red | 10 | 1
2 | blue | 20 |0.4
3 | yellow | 30 |0.6
The normal window function in spark only supports partitionby a whole column, but I need the "blue" and "yellow", together recognized as the "non-red" type.
Any idea?
First add a column is_red to easier differentiate between the two groups. Then you can groupBy this new column and get the sums for each of the two groups respectively.
To get the fraction (portion), simply divide each row's value by the correct sum, taking into account if the type is red or not. This part can be done using when and otherwise in Spark.
Below is the Scala code to do this. There is a sortBy since when using groupBy the order of results is not guaranteed. With the sort, sum1 below will contain the total sum for all non-red types while sum2 is the sum for red types.
val sum1 :: sum2 :: _ = df.withColumn("is_red", $"type" === lit("red"))
.groupBy($"is_red")
.agg(sum($"value"))
.collect()
.map(row => (row.getAs[Boolean](0), row.getAs[Long](1)))
.toList
.sortBy(_._1)
.map(_._2)
val df2 = df.withColumn("portion", when($"is_red", $"value"/lit(sum2)).otherwise($"value"/lit(sum1)))
The extra is_red column can be removed using drop.
Inspired by Shaido, I used an extra column is_red and the spark window function. But I'm not sure which one is better in performance.
df.withColumn("is_red", when(col("type").equalTo("Red"), "Red")
.otherwise("not Red")
.withColumn("portion", col("value")/sum("value)
.over(Window.partitionBy(col"is_Red")))
.drop(is_Red)

How to justify columns using format function?

I have a working function that takes a list made up of lists and outputs it as a table. I am just missing certain spacing and new lines. I'm pretty new to formatting strings (and python in general.) How do I use the format function to fix my output?
For examples:
>>> show_table([['A','BB'],['C','DD']])
'| A | BB |\n| C | DD |\n'
>>> print(show_table([['A','BB'],['C','DD']]))
| A | BB |
| C | DD |
>>> show_table([['A','BBB','C'],['1','22','3333']])
'| A | BBB | C |\n| 1 | 22 | 3333 |\n'
>>> print(show_table([['A','BBB','C'],['1','22','3333']]))
| A | BBB | C |
| 1 | 22 | 3333 |
What I am actually outputting though:
>>>show_table([['A','BB'],['C','DD']])
'| A | BB | C | DD |\n'
>>>show_table([['A','BBB','C'],['1','22','3333']])
'| A | BBB | C | 1 | 22 | 3333 |\n'
>>>show_table([['A','BBB','C'],['1','22','3333']])
| A | BBB | C | 1 | 22 | 3333 |
I will definitely need to use the format function but I'm not sure how?
This is my current code (my indenting is actually correct but I'm horrible with stackoverflow format):
def show_table(table):
if table is None:
table=[]
new_table = ""
for row in table:
for val in row:
new_table += ("| " + val + " ")
new_table += "|\n"
return new_table
You do actually have an indentation error in your function: the line
new_table += "|\n"
should be indented further so that it happens at the end of each row, not at the end of the table.
Side note: you'll catch this kind of thing more easily if you stick to 4 spaces per indent. This and other conventions are there to help you, and it's a very good idea to learn the discipline of keeping to them early in your progress with Python. PEP 8 is a great resource to familarise yourself with.
The spacing on your "what I need" examples is also rather messed up, which is unfortunate since spacing is the subject of your question, but I gather from this question that you want each column to be properly aligned, e.g.
>>> print(show_table([['10','2','300'],['4000','50','60'],['7','800','90000']]))
| 10 | 2 | 300 |
| 4000 | 50 | 60 |
| 7 | 800 | 90000 |
In order to do that, you'll need to know in advance what the maximum width of each item in a column is. That's actually a little tricky, because your table is organised into rows rather than columns, but the zip() function can help. Here's an example of what zip() does:
>>> table = [['10', '2', '300'], ['4000', '50', '60'], ['7', '800', '90000']]
>>> from pprint import pprint
>>> pprint(table, width=30)
[['10', '2', '300'],
['4000', '50', '60'],
['7', '800', '90000']]
>>> flipped = zip(*table)
>>> pprint(flipped, width=30)
[('10', '4000', '7'),
('2', '50', '800'),
('300', '60', '90000')]
As you can see, zip() turns rows into columns and vice versa. (don't worry too much about the * before table right now; it's a bit advanced to explain for the moment. Just remember that you need it).
You get the length of a string with len():
>>> len('800')
3
You get the maximum of the items in a list with max():
>>> max([2, 4, 1])
4
You can put all these together in a list comprehension, which is like a compact for loop that builds a list:
>>> widths = [max([len(x) for x in col]) for col in zip(*table)]
>>> widths
[4, 3, 5]
If you look carefully, you'll see that there are actually two list comprehensions in that line:
[len(x) for x in col]
makes a list with the lengths of each item x in a list col, and
[max(something) for col in zip(*table)]
makes a list with the maximum value of something for each column in the flipped (with zip) table … where something is the other list comprehension.
That's all kinda complicated the first time you see it, so spend a little while making sure you understand what's going on.
Now that you have your maximum widths for each column, you can use them to format your output. In order to do so, though, you need to keep track of which column you're in, and to do that, you need enumerate(). Here's an example of enumerate() in action:
>>> for i, x in enumerate(['a', 'b', 'c']):
... print("i is", i, "and x is", x)
...
i is 0 and x is a
i is 1 and x is b
i is 2 and x is c
As you can see, iterating over the result of enumerate() gives you two values: the position in the list, and the item itself.
Still with me? Fun, isn't it? Pressing on ...
The only thing left is the actual formatting. Python's str.format() method is pretty powerful, and too complex to explain thoroughly in this answer. One of the things you can use it for is to pad things out to a given width:
>>> "{val:5s}".format(val='x')
'x '
In the example above, {val:5s} says "insert the value of val here as a string, padding it out to 5 spaces". You can also specify the width as a variable, like this:
>>> "{val:{width}s}".format(val='x', width=3)
'x '
These are all the pieces you need … and here's a function that uses all those pieces:
def show_table(table):
if table is None:
table = []
new_table = ""
widths = [max([len(x) for x in c]) for c in zip(*table)]
for row in table:
for i, val in enumerate(row):
new_table += "| {val:{width}s} ".format(val=val, width=widths[i])
new_table += "|\n"
return new_table
… and here it is in action:
>>> table = [['10','2','300'],['4000','50','60'],['7','800','90000']]
>>> print(show_table(table))
| 10 | 2 | 300 |
| 4000 | 50 | 60 |
| 7 | 800 | 90000 |
I've covered a fair bit of ground in this answer. Hopefully if you study the final version of show_table() given here in detail (as well as the docs linked to throughout the answer), you'll be able to see how all the pieces described earlier on fit together.

Resources