I have a working function that takes a list of lists and outputs it as a table. I am just missing certain spacing and newlines. I'm pretty new to formatting strings (and Python in general). How do I use the format function to fix my output?
For example:
>>> show_table([['A','BB'],['C','DD']])
'| A | BB |\n| C | DD |\n'
>>> print(show_table([['A','BB'],['C','DD']]))
| A | BB |
| C | DD |
>>> show_table([['A','BBB','C'],['1','22','3333']])
'| A | BBB | C |\n| 1 | 22 | 3333 |\n'
>>> print(show_table([['A','BBB','C'],['1','22','3333']]))
| A | BBB | C |
| 1 | 22 | 3333 |
What I am actually outputting though:
>>> show_table([['A','BB'],['C','DD']])
'| A | BB | C | DD |\n'
>>> show_table([['A','BBB','C'],['1','22','3333']])
'| A | BBB | C | 1 | 22 | 3333 |\n'
>>> print(show_table([['A','BBB','C'],['1','22','3333']]))
| A | BBB | C | 1 | 22 | 3333 |
I will definitely need to use the format function but I'm not sure how?
This is my current code (my indenting is actually correct but I'm horrible with stackoverflow format):
def show_table(table):
  if table is None:
    table=[]
  new_table = ""
  for row in table:
    for val in row:
      new_table += ("| " + val + " ")
  new_table += "|\n"
  return new_table
You do actually have an indentation error in your function: the line
new_table += "|\n"
should be indented further so that it happens at the end of each row, not at the end of the table.
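A minimal sketch of just that fix, applied to the function from the question with nothing else changed (the name show_table_fixed is made up here so it doesn't clash with the original):

```python
def show_table_fixed(table):
    """The question's function with the newline moved inside the row loop."""
    if table is None:
        table = []
    new_table = ""
    for row in table:
        for val in row:
            new_table += "| " + val + " "
        # Now inside the "for row" loop, so it runs once per row
        # instead of once at the end of the table.
        new_table += "|\n"
    return new_table


print(show_table_fixed([['A', 'BB'], ['C', 'DD']]))
```

This gives one line per row, but the columns are not aligned yet.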
Side note: you'll catch this kind of thing more easily if you stick to 4 spaces per indent. This and other conventions are there to help you, and it's a very good idea to learn the discipline of keeping to them early in your progress with Python. PEP 8 is a great resource to familiarise yourself with.
The spacing on your "what I need" examples is also rather messed up, which is unfortunate since spacing is the subject of your question, but I gather from this question that you want each column to be properly aligned, e.g.
>>> print(show_table([['10','2','300'],['4000','50','60'],['7','800','90000']]))
| 10   | 2   | 300   |
| 4000 | 50  | 60    |
| 7    | 800 | 90000 |
In order to do that, you'll need to know in advance what the maximum width of each item in a column is. That's actually a little tricky, because your table is organised into rows rather than columns, but the zip() function can help. Here's an example of what zip() does:
>>> table = [['10', '2', '300'], ['4000', '50', '60'], ['7', '800', '90000']]
>>> from pprint import pprint
>>> pprint(table, width=30)
[['10', '2', '300'],
 ['4000', '50', '60'],
 ['7', '800', '90000']]
>>> flipped = list(zip(*table))  # list() is needed in Python 3, where zip() returns an iterator
>>> pprint(flipped, width=30)
[('10', '4000', '7'),
 ('2', '50', '800'),
 ('300', '60', '90000')]
As you can see, zip() turns rows into columns and vice versa. (Don't worry too much about the * before table right now; it's a bit advanced to explain for the moment. Just remember that you need it.)
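If you are curious anyway, here is a small sketch of what the * does: it passes each row to zip() as a separate argument, so for a three-row table it is the same as writing the three rows out by hand. The variable names with_star and by_hand are only for illustration.

```python
table = [['10', '2', '300'], ['4000', '50', '60'], ['7', '800', '90000']]

# zip(*table) passes each row as a separate argument ...
with_star = list(zip(*table))
# ... so it is the same as spelling the rows out by hand.
by_hand = list(zip(table[0], table[1], table[2]))

print(with_star == by_hand)  # True
print(with_star[0])          # ('10', '4000', '7')
```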
You get the length of a string with len():
>>> len('800')
3
You get the maximum of the items in a list with max():
>>> max([2, 4, 1])
4
You can put all these together in a list comprehension, which is like a compact for loop that builds a list:
>>> widths = [max([len(x) for x in col]) for col in zip(*table)]
>>> widths
[4, 3, 5]
If you look carefully, you'll see that there are actually two list comprehensions in that line:
[len(x) for x in col]
makes a list with the lengths of each item x in a list col, and
[max(something) for col in zip(*table)]
makes a list with the maximum value of something for each column in the flipped (with zip) table … where something is the other list comprehension.
That's all kinda complicated the first time you see it, so spend a little while making sure you understand what's going on.
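If it helps, here is the same computation written out as ordinary nested for loops. This is only a sketch for comparison; it should produce the same widths list as the one-liner.

```python
table = [['10', '2', '300'], ['4000', '50', '60'], ['7', '800', '90000']]

widths = []
for col in zip(*table):          # outer comprehension: one pass per column
    lengths = []
    for x in col:                # inner comprehension: length of each item
        lengths.append(len(x))
    widths.append(max(lengths))  # keep the widest item in this column

print(widths)  # [4, 3, 5]
```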
Now that you have your maximum widths for each column, you can use them to format your output. In order to do so, though, you need to keep track of which column you're in, and to do that, you need enumerate(). Here's an example of enumerate() in action:
>>> for i, x in enumerate(['a', 'b', 'c']):
...     print("i is", i, "and x is", x)
...
i is 0 and x is a
i is 1 and x is b
i is 2 and x is c
As you can see, iterating over the result of enumerate() gives you two values: the position in the list, and the item itself.
Still with me? Fun, isn't it? Pressing on ...
The only thing left is the actual formatting. Python's str.format() method is pretty powerful, and too complex to explain thoroughly in this answer. One of the things you can use it for is to pad things out to a given width:
>>> "{val:5s}".format(val='x')
'x    '
In the example above, {val:5s} says "insert the value of val here as a string, padding it out to 5 spaces". You can also specify the width as a variable, like this:
>>> "{val:{width}s}".format(val='x', width=3)
'x  '
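Strings are padded on the right (left-aligned) by default. As a small aside, the format spec also accepts an alignment character before the width: < for left, > for right and ^ for centred. A quick sketch:

```python
left = "{val:<5s}".format(val='x')     # same as the default for strings
right = "{val:>5s}".format(val='x')
centred = "{val:^5s}".format(val='x')

# repr() makes the padding visible
print(repr(left))     # 'x    '
print(repr(right))    # '    x'
print(repr(centred))  # '  x  '
```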
These are all the pieces you need … and here's a function that uses all those pieces:
def show_table(table):
    if table is None:
        table = []
    new_table = ""
    widths = [max([len(x) for x in c]) for c in zip(*table)]
    for row in table:
        for i, val in enumerate(row):
            new_table += "| {val:{width}s} ".format(val=val, width=widths[i])
        new_table += "|\n"
    return new_table
… and here it is in action:
>>> table = [['10','2','300'],['4000','50','60'],['7','800','90000']]
>>> print(show_table(table))
| 10   | 2   | 300   |
| 4000 | 50  | 60    |
| 7    | 800 | 90000 |
I've covered a fair bit of ground in this answer. Hopefully if you study the final version of show_table() given here in detail (as well as the docs linked to throughout the answer), you'll be able to see how all the pieces described earlier on fit together.
The df is formatted in this manner:
Zip Code | State | Carrier | Price
__________________________________
xxxxx    | XX    | ABCD    | 12.0
xxxxx    | XX    | TUSD    | 15.0
xxxxx    | XX    | PPLD    | 17.0
The Code:
carrier_sum = []
unique_carrier = a_df['Carrier'].unique()
for i in unique_carrier:
    x=0
    for y, row in a_df.iterrows():
        x = a_df.loc[a_df['Carrier'] == i, 'Price'].sum()
    print(i, x)
    carrier_sum.append([i,x])
This is my code. It first builds a unique_carrier list. Then, for each carrier, it uses iterrows() to go through the df, sums the 'Price' values, and appends the result to the empty carrier_sum list I created.
The problem is that it seems to take forever. I ran it once and it took over 15 minutes just to get the sum for the first unique carrier, and there are 8 of them.
What can I do to make it more efficient?
The dataset is over 300000 rows long.
One idea I had is to define the list of unique carriers beforehand, since I don't really need to look them up in the df. Another is to sort the main dataset alphabetically by carrier name and make the unique carrier list line up with that order.
Thank you for reading.
This should work for you:
df.groupby('Carrier')['Price'].sum()
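For intuition about why this is so much faster than the loop in the question: the inner iterrows() loop recomputes the same sum once per row, whereas a group-by only needs a single pass over the data. Here is a plain-Python sketch of that single pass. The rows list is made-up sample data, and this illustrates the idea rather than how pandas implements groupby.

```python
from collections import defaultdict

# Made-up sample rows: (carrier, price)
rows = [("ABCD", 12.0), ("TUSD", 15.0), ("PPLD", 17.0), ("ABCD", 3.0)]

# One pass: add each price to the running total for its carrier.
totals = defaultdict(float)
for carrier, price in rows:
    totals[carrier] += price

print(dict(totals))
```

Each row is visited once, no matter how many carriers there are.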
I have a data set (Excel) with hundreds of entries. Most of the information is in one string column. The pieces of information are separated by '_' and were typed in by humans, so it is not possible to work with index positions.
To create a usable data basis, I need to extract information from this column into another column.
The search pattern '*v*' alone is not enough, but combined with the condition that the first character has to be a digit, it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more, but the best result I got was with a for loop.
pattern = '_*v*_'
test = []
for i in df['col']:
    # Split the string into substrings
    i = i.split('_')
    for c in i:
        if c.find('x') == 1:
            if c[0].isdigit():
                # print(c)
                test.append(c)
        else:
            # To be able to fix a few rows manually
            test.append(0)
[4]: test =[22v3, 33v55, 4v2]
#Input
+-----------+-----------+
| col       | targetcol |
+-----------+-----------+
| as_22v3   |           |
| 33v55_bdd |           |
| Ave_4v2   |           |
+-----------+-----------+
#Output
+-----------+-----------+
| col       | targetcol |
+-----------+-----------+
| as_22v3   | 22v3      |
| 33v55_bdd | 33v55     |
| Ave_4v2   | 4v2       |
+-----------+-----------+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message, except of course that the list cannot be assigned to a DataFrame column because it is not the same size.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
       0
0   22v3
1  33v55
2    4v2
import pandas as pd

df = pd.DataFrame({
    'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
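To see what the pattern does without pandas, here is a small sketch using the standard re module on a plain list. str.extract likewise takes the first match in each string and gives NaN where nothing matches; None plays that role here. The last sample string is made up to show the no-match case.

```python
import re

# One or more digits, then one or more 'v', then one or more digits
pattern = re.compile(r'(\d+v+\d+)')
values = ['as_22v3', '33v55_bdd', 'Ave_4v2', 'no_match_here']

extracted = []
for s in values:
    m = pattern.search(s)  # first match anywhere in the string
    extracted.append(m.group(1) if m else None)

print(extracted)  # ['22v3', '33v55', '4v2', None]
```

Because the pattern requires digits on both sides of the v, the position of the value within the string and the number of '_' separators don't matter.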
EDIT
df = pd.DataFrame({
    'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
            '2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
           col   result
0      as_22v3     22v3
1    33v55_bdd    33v55
2      Ave_4v2      4v2
3       _22 v3    22 v3
4  space 2,2v3    2,2v3
5         2.v3      NaN
6    2.111v999  111v999
7   asd.123v77   123v77
8         1 v7     1 v7
9   123 v 8135      NaN
You say it stops after 36 values, and that you are processing an Excel file? One thing you could try is to save the data set to a .csv file and read that file in with the pd.read_csv function. Excel files sometimes contain extra characters that are not easily visible.
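If you want to check for such characters directly, a rough sketch along these lines may help. The helper name suspicious_chars and the sample strings are made up for illustration, and hidden characters being the cause is only a guess.

```python
import unicodedata


def suspicious_chars(text):
    """Return (index, name) for control, format and non-standard space characters."""
    found = []
    for i, ch in enumerate(text):
        # Cc = control, Cf = format (e.g. zero-width space), Zs = space separators
        if unicodedata.category(ch) in ("Cc", "Cf", "Zs") and ch != " ":
            found.append((i, unicodedata.name(ch, "U+%04X" % ord(ch))))
    return found


print(suspicious_chars("as_22v3\u00a0"))     # trailing no-break space
print(suspicious_chars("33v55\u200b_bdd"))   # zero-width space in the middle
print(suspicious_chars("Ave_4v2"))           # clean string
```

Running this over the first row where your loop stops would show whether something invisible is hiding in that cell.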
I am printing a "Table" to the console. I will be using this same table structure for several different variables. However as you can see from Output below, the lines don't all align.
One way to resolve it would be to increase the number of decimal places (e.g. 6.730000 for Standard Deviation) which would push the line into place.
However, I do not want this many decimal places.
Is it possible to add extra 0s to the end of a number, and make these invisible?
I am planning on using this table structure for several variables, and the length of Mean, Stddev, and Median will likely never be more than 6 characters.
EDIT - I would really like to ensure that each value which appears in the table will be 6 characters long, and if it is not 6 characters long, add additional "invisible" zeros.
Input
# Create and structure Table to store descriptive statistics for each variable.
subtitle = "| Mean | Stddev | Median |"
structure = '| {:0.2f} | {:0.2f} | {:0.2f} |'
lines = '=' * len(subtitle)
# Print table.
print(lines)
print(subtitle)
print(lines)
print(structure.format(mean, std, median))
print(lines)
Output:
======================================
| Mean | Stddev | Median |
======================================
| 181.26 | 6.73 | 180.34 |
======================================
Didn't really figure this out - but found a workaround.
I just did the following:
"| {:^6} | {:^6} | {:^6} | {:^6} | {:^6} |"
This keeps the width between | consistent.
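For anyone landing here, a sketch of how the width and the two decimal places can be combined in one format spec. The width of 8 is an arbitrary choice, and the numbers are the sample values from the output in the question.

```python
mean, std, median = 181.26, 6.73, 180.34  # sample values from the question's output

# ^8 centres each field in 8 characters; .2f keeps two decimal places.
header = "| {:^8} | {:^8} | {:^8} |".format("Mean", "Stddev", "Median")
row = "| {:^8.2f} | {:^8.2f} | {:^8.2f} |".format(mean, std, median)
lines = "=" * len(header)

print(lines)
print(header)
print(lines)
print(row)
print(lines)
```

Since the header and the row use the same field width, the | separators line up without padding the numbers with extra zeros.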
I want to calculate each value's portion of its group total, with only two partitions (where type == red and where type != red).
ID | type   | value
-----------------------------
1  | red    | 10
2  | blue   | 20
3  | yellow | 30
The result should be:
ID | type   | value | portion
-----------------------------
1  | red    | 10    | 1
2  | blue   | 20    | 0.4
3  | yellow | 30    | 0.6
The normal window function in Spark only supports partitionBy on a whole column, but I need "blue" and "yellow" to be treated together as the "non-red" type.
Any idea?
First add a column is_red to differentiate more easily between the two groups. Then you can groupBy this new column and get the sums for each of the two groups respectively.
To get the fraction (portion), simply divide each row's value by the correct sum, taking into account if the type is red or not. This part can be done using when and otherwise in Spark.
Below is the Scala code to do this. There is a sortBy since when using groupBy the order of results is not guaranteed. With the sort, sum1 below will contain the total sum for all non-red types while sum2 is the sum for red types.
// Keep the flagged DataFrame so that is_red is still available for df2 below
val dfWithFlag = df.withColumn("is_red", $"type" === lit("red"))

val sum1 :: sum2 :: _ = dfWithFlag
  .groupBy($"is_red")
  .agg(sum($"value"))
  .collect()
  .map(row => (row.getAs[Boolean](0), row.getAs[Long](1)))
  .toList
  .sortBy(_._1)
  .map(_._2)

val df2 = dfWithFlag.withColumn("portion", when($"is_red", $"value"/lit(sum2)).otherwise($"value"/lit(sum1)))
The extra is_red column can be removed using drop.
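To make the arithmetic concrete, here is a plain-Python sketch of the same two steps on the three sample rows from the question. It only illustrates the calculation and is not Spark code.

```python
# Sample rows from the question: (ID, type, value)
rows = [(1, "red", 10), (2, "blue", 20), (3, "yellow", 30)]

# Step 1: sum the values separately for the red and non-red groups.
sums = {True: 0, False: 0}
for _id, colour, value in rows:
    sums[colour == "red"] += value

# Step 2: divide each value by the sum of its own group.
portions = [(i, c, v, v / sums[c == "red"]) for i, c, v in rows]
for p in portions:
    print(p)
```

Red is alone in its group, so 10 / 10 gives 1.0, while blue and yellow share a total of 50 and get 0.4 and 0.6.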
Inspired by Shaido, I used an extra column is_red and the spark window function. But I'm not sure which one is better in performance.
df.withColumn("is_red", when(col("type").equalTo("red"), "red")
    .otherwise("not red"))
  .withColumn("portion", col("value") / sum("value")
    .over(Window.partitionBy(col("is_red"))))
  .drop("is_red")
I have a pyspark dataframe here like the picture below. I would like to group every 2 rows, but in a way that:
the first row would be that user from row 1 and 2 and
the second row would be from row 2 and 3 etc.
Something like this:
---CustomerID--previous_stockcodes----stock_codes-----
Prices and quantities are not used, previous basket and current basket are put into one. For example, the first row of CustomerID 12347 would be:
12347----[85116, 22375, 71...]-----[84625A, 84625C, ...]
I have written loops to do that but that's really inefficient and slow. I wonder if I can do something like that efficiently using pyspark but I am having trouble figuring that out. Thanks a lot in advance
You can get the next row with the lead function provided by Spark SQL.
lead is a window function.
Syntax: lead(column_name, int_value, default_value) over (partition by column_name order by column_name)
int_value is the number of rows to look ahead from the current row.
default_value is the value returned when there is no such following row.
>>> input_df.show()
+----------+---------+----------------+
|customerID|invoiceNo|  stockCode_list|
+----------+---------+----------------+
|     12347|   537626|  [85116, 22375]|
|     12347|   542237|[84625A, 84625C]|
|     12347|   549222|  [22376, 22374]|
|     12347|   556201|  [23084, 23162]|
|     12348|   539318|  [84992, 22951]|
|     12348|   541998|  [21980, 21985]|
|     12348|   548955|  [23077, 23078]|
+----------+---------+----------------+
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import lead,col
>>> win_func = Window.partitionBy("customerID").orderBy("invoiceNo")
>>> new_col = lead("stockCode_list",1,None).over(win_func)
>>> req_df = input_df.select(col("customerID"),col("invoiceNo"),col("stockCode_list"),new_col.alias("req_col"))
>>> req_df.orderBy("customerID","invoiceNo").show()
+----------+---------+----------------+----------------+
|customerID|invoiceNo|  stockCode_list|         req_col|
+----------+---------+----------------+----------------+
|     12347|   537626|  [85116, 22375]|[84625A, 84625C]|
|     12347|   542237|[84625A, 84625C]|  [22376, 22374]|
|     12347|   549222|  [22376, 22374]|  [23084, 23162]|
|     12347|   556201|  [23084, 23162]|            null|
|     12348|   539318|  [84992, 22951]|  [21980, 21985]|
|     12348|   541998|  [21980, 21985]|  [23077, 23078]|
|     12348|   548955|  [23077, 23078]|            null|
+----------+---------+----------------+----------------+
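If it is unclear what lead is doing within each partition, this plain-Python sketch mimics it on a shortened copy of the sample data. It is only an illustration of the semantics, not how Spark executes it.

```python
from itertools import groupby
from operator import itemgetter

# Shortened sample: (customerID, invoiceNo, stockCode_list)
rows = [
    (12347, 537626, ["85116", "22375"]),
    (12347, 542237, ["84625A", "84625C"]),
    (12347, 549222, ["22376", "22374"]),
    (12348, 539318, ["84992", "22951"]),
    (12348, 541998, ["21980", "21985"]),
]

result = []
# Sort by (customerID, invoiceNo), then walk each customer's rows in order.
ordered = sorted(rows, key=itemgetter(0, 1))
for customer, group in groupby(ordered, key=itemgetter(0)):
    group = list(group)
    for i, (_, invoice, codes) in enumerate(group):
        # "lead by 1": the next row of the same customer, or None at the end
        next_codes = group[i + 1][2] if i + 1 < len(group) else None
        result.append((customer, invoice, codes, next_codes))

for r in result:
    print(r)
```

The last row of each customer gets None, matching the null values in req_col. If you want the previous basket next to the current one instead, lag is the mirror image of lead.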