I'm using pandas to manipulate a CSV file. My data looks like this:
Col1 | Col2 | Street
------|----------|----------------------------
abc | 11092019 | abc,def,ghi,jkl,mno,pqr
def | 11092019 | abc,def,ghi,jkl
ghi | 11092019 | abc,def,ghi,jkl,mno
jkl | 11092019 | abc,def,ghi
mno | 11092019 | abc,def,ghi,jkl
pqr | 11092019 | abc,def
I am splitting the Street column by comma and, as the example shows, this can produce a different number of columns for each row.
My searching has got me to the point where I have the following code:
i = df.columns.get_loc('Street')
df2 = (df['Street'].str.split(',', expand=True).rename(columns=lambda x: f"Street{x+1}"))
pd.concat([df.iloc[:, :i], df2, df.iloc[:, i+1:]], axis=1)
This yields the following result:
Col1 | Col2 | Street1 | Street2 | Street3 | Street4 | Street5 | Street6
------|----------|---------|---------|---------|---------|---------|---------
abc | 11092019 | abc | def | ghi | jkl | mno | pqr
def | 11092019 | abc | def | ghi | jkl | |
ghi | 11092019 | abc | def | ghi | jkl | mno |
jkl | 11092019 | abc | def | ghi | | |
mno | 11092019 | abc | def | ghi | jkl | |
pqr | 11092019 | abc | def | | | |
This is so close to what I want, but I also want to retain the original column that was split (Street in this example). I just can't figure out how to keep that original column in the output. Can someone point me in the right direction?
Thanks!
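One possible approach, sketched under the assumption that the only problem is the slice boundary: the code above takes `df.iloc[:, :i]`, which stops just before Street, so slicing up to `i + 1` instead keeps the original column in place. The sample frame and the `Extra` column below are hypothetical and only there to show that column order is preserved.

```python
import pandas as pd

# Sample data mirroring the question; "Extra" is a hypothetical trailing column.
df = pd.DataFrame({
    "Col1": ["abc", "def", "pqr"],
    "Col2": ["11092019", "11092019", "11092019"],
    "Street": ["abc,def,ghi,jkl,mno,pqr", "abc,def,ghi,jkl", "abc,def"],
    "Extra": [1, 2, 3],
})

i = df.columns.get_loc("Street")
df2 = df["Street"].str.split(",", expand=True).rename(columns=lambda x: f"Street{x + 1}")

# Slice up to and including position i so that Street itself is kept,
# then add the split columns, then everything that came after Street.
result = pd.concat([df.iloc[:, :i + 1], df2, df.iloc[:, i + 1:]], axis=1)
print(result.columns.tolist())
```

If Street happens to be the last column, `df.join(df2)` should give the same layout with less slicing.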
I have the following data and I want to replace the 3rd occurrence of the | symbol with nothing.
ABC | DEF | GHI | XYZ | 123
ABC | DEF | GHI | XYZ | 123
ABC | DEF | GHI | XYZ | 123
Final output should be:
ABC | DEF | GHI XYZ | 123
ABC | DEF | GHI XYZ | 123
ABC | DEF | GHI XYZ | 123
You can run the following:
:%norm 3f|r
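This means (r must be followed by a literal space, which is the replacement character):
:%norm  on every line, run the following normal-mode commands
3f|  move the cursor to the 3rd occurrence of |
r  replace the character under the cursor with the next character typed, here a space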
You could of course do:
:%norm 3f|x
To delete the | completely.
Another way would be to use visual block mode (see :help visual-block),
although this only works if all the | are lined up (i.e. in the same
column).
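If the same edit has to be applied outside Vim, for example in a script, the idea carries over directly. The following is a minimal Python sketch, not part of the Vim answer; `replace_nth` is a hypothetical helper name, and the sample lines are copied from the question.

```python
def replace_nth(line: str, target: str, n: int, replacement: str = "") -> str:
    """Replace the n-th occurrence (1-based) of target in line.

    The line is returned unchanged if it has fewer than n occurrences.
    """
    pos = -1
    for _ in range(n):
        pos = line.find(target, pos + 1)
        if pos == -1:
            return line
    return line[:pos] + replacement + line[pos + len(target):]


lines = ["ABC | DEF | GHI | XYZ | 123"] * 3
for line in lines:
    # A space as replacement mirrors 3f|r followed by a space;
    # the default empty string mirrors 3f|x.
    print(replace_nth(line, "|", 3, " "))
```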
How to trim the additional spaces present between the names in PySpark dataframe?
Below is my dataframe
+----------------------+----------+
|name |account_id|
+----------------------+----------+
| abc xyz pqr | 1 |
| pqm rst | 2 |
+----------------------+----------+
Output I want
+-------------+----------+
|name |account_id|
+-------------+----------+
| abc xyz pqr | 1 |
| pqm rst | 2 |
+-------------+----------+
I tried using regexp_replace, but it removes the spaces completely. Is there another way to do this? Thanks a lot!
I used regexp_replace(col, '\s+', ' ') and got the desired output.
from pyspark.sql.functions import col, regexp_replace
df = df.withColumn("name", regexp_replace(col("name"), r'\s+', ' '))
Output
+-----------+----------+
| name |account_id|
+-----------+----------+
|abc xyz pqr| 1 |
| pqm rst| 2 |
+-----------+----------+
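To see what the pattern does independently of Spark, here is a small plain-Python sketch using `re.sub`. The input string is an assumption, since the exact spacing of the original names is not preserved above, and `squeeze_spaces` is a hypothetical helper name. Replacing `\s+` with a single space collapses runs of whitespace but still leaves one space at either end if the value started or ended with whitespace, so the sketch strips those as well.

```python
import re


def squeeze_spaces(name: str) -> str:
    # Collapse each run of whitespace into one space,
    # then drop any leading or trailing space that remains.
    return re.sub(r"\s+", " ", name).strip()


print(repr(squeeze_spaces("  abc   xyz  pqr ")))
```

In PySpark, the analogous extra step would be wrapping the `regexp_replace` call in `trim` from `pyspark.sql.functions`, if leading and trailing spaces also need to go.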
I've got a list of customers where the customers are repeated across multiple rows. I'd like to merge cells that are similar in Column A, but not touch anything else. If I could even format bold borders between customers, that'd be great.
Basically,
1 | abc | abc
1 | abc | def
1 | def | xyz
2 | abc |
2 | abc | def
3 | | xyz
4 | abc | qrs
4 | abc | def
5 | mni | xyz
To
1 | abc | abc
| abc | def
| def | xyz
2 | abc |
| abc | def
3 | | xyz
4 | abc | qrs
| abc | def
5 | mni | xyz
You don't have to merge cells. In fact, I recommend against it: merging causes more problems than it solves. What you could do instead is hide column "A", insert an empty column "B", put this formula in "B2", and then auto-fill:
=IF(A2=A1,"",A2)
Also, this solution avoids macros, which can be difficult and problematic if you are new to them.
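The logic of the formula is simply "show a value only when it differs from the cell directly above it". As an illustration only, here is the same rule as a short Python sketch; `blank_repeats` is a hypothetical name and the list stands in for column A of the example.

```python
def blank_repeats(values):
    """Return a copy in which a value equal to the one directly above it becomes ""."""
    out = []
    previous = object()  # sentinel that never equals a real cell value
    for value in values:
        out.append("" if value == previous else value)
        previous = value
    return out


column_a = [1, 1, 1, 2, 2, 3, 4, 4, 5]
print(blank_repeats(column_a))
```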
For the "customer separator" formatting that you're looking for, use conditional formatting.
I got that tip from this web site: Conditional Formatting
I have a data set in MS Excel 2013 which looks like this:
+------+------------+----------------+
| Name | DateWorked | HoursCompleted |
+------+------------+----------------+
| abc | 03/01/2016 | 4:53 |
| abc | 03/02/2016 | 5:22 |
| ghi | 03/03/2016 | 2:10 |
| jkl | 03/04/2016 | 5:30 |
| mno | 03/05/2016 | 4:20 |
| pqr | 03/06/2016 | 0:20 |
+------+------------+----------------+
where Name is string, DateWorked is mm/dd/yyyy, HoursCompleted is h:mm data type/format.
I am using ADODB to get some results out of this data, storing them in an ADODB.Recordset and displaying them in another sheet.
My code/query looks something like this:
countCommand.CommandText = "SELECT SUM([HoursCompleted]) FROM [Sheet1$] WHERE [Name] = 'abc' AND [DateWorked] BETWEEN #02/28/2016# AND #03/02/2016#"
Set countResults = countCommand.Execute
Now this works like a charm and gives me a result of 10:15.
However, if the first value under DateWorked is NOT a Date datatype, I do not get a result. For example, if my data looks like this:
+------+------------+----------------+
| Name | DateWorked | HoursCompleted |
+------+------------+----------------+
| abc | Not worked | 0:00 |
| abc | 03/02/2016 | 5:22 |
| ghi | 03/03/2016 | 2:10 |
| jkl | 03/04/2016 | 5:30 |
| mno | 03/05/2016 | 4:20 |
| pqr | 03/06/2016 | 0:20 |
+------+------------+----------------+
How do I set the datatype for my columns?
I have a datatable which displays records of country, state and district.
country | state | district
--------+-------+---------
ABC | A | Z
ABC | A | y
ABC | A | x
ABC | B | 1
ABC | B | 2
However, as you see in the example, the country is repeated over multiple rows. I would like to merge those into a single cell like so:
country | state | district
--------+-------+---------
ABC | A | Z
| A | y
| A | x
| B | 1
| B | 2
How can I achieve this?