Split a DataFrame column value on '|' and get all items except first - apache-spark

I need to split a column value on '|' and put all items except the first one into a new column 'address'.
What makes it more complicated is that the number of items is not always the same!
df1 = spark.createDataFrame([
["Luc Krier|2363 Ryan Road"],
["Jeanny Thorn|2263 Patton Lane|Raleigh North Carolina"],
["Teddy E Beecher|2839 Hartland Avenue|Fond Du Lac Wisconsin|US"],
["Philippe Schauss|1 Im Oberdor|Allemagne"],
["Meindert I Tholen|Hagedoornweg 138|Amsterdam|NL"]
]).toDF("s")
I've already tried split, size and substring, but can't get it done. Any help is much appreciated!
Expected output:
address
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2363 Ryan Road
2263 Patton Lane|Raleigh North Carolina
2839 Hartland Avenue|Fond Du Lac Wisconsin|US
1 Im Oberdor|Allemagne
Hagedoornweg 138|Amsterdam|NL

Try this:
from pyspark.sql.functions import concat_ws, slice, split
df1.select(concat_ws('|', slice(split('s', r'\|'), 2, 1000))).show()
+------------------------------------------+
|concat_ws(|, slice(split(s, \|), 2, 1000))|
+------------------------------------------+
|2363 Ryan Road|Long Lake South Dakota     |
|2263 Patton Lane|Raleigh North Carolina   |
|2839 Hartland Avenue|Fond Du Lac Wisconsin|
|1 Im Oberdor|Allemagne                    |
|Hagedoornweg 138|Amsterdam                |
+------------------------------------------+
Here slice starts at the second element (positions are 1-based), and 1000 is the maximum number of elements to take from the array; it is just an arbitrarily large integer for now.

The function 'instr' can be used to find the first '|', and 'substring' to get the result:
df1.selectExpr(
    "substring(s, instr(s,'|') + 1, length(s))"
)
Or use a regular expression that matches from the start of the string to the first '|' (Scala syntax):
df1.select(
    regexp_replace($"s", "^[^\\|]+\\|", "")
)
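To check the logic outside Spark, here is a minimal plain-Python sketch of the three ideas from the answers above: split and rejoin, take the substring after the first '|', and remove everything up to the first '|' with a regex. The helper names are hypothetical and not part of any Spark API; the sample rows come from the question.

```python
import re

def drop_first_by_split(s, sep="|"):
    # Split on the separator, drop the first item, and rejoin the rest.
    return sep.join(s.split(sep)[1:])

def drop_first_by_index(s, sep="|"):
    # Take everything after the first separator (the instr + substring idea).
    # str.find returns -1 when sep is missing, so the whole string is kept.
    return s[s.find(sep) + 1:]

def drop_first_by_regex(s):
    # Remove the leading run of non-'|' characters and the '|' that follows it.
    return re.sub(r"^[^|]+\|", "", s)

rows = [
    "Luc Krier|2363 Ryan Road",
    "Jeanny Thorn|2263 Patton Lane|Raleigh North Carolina",
    "Teddy E Beecher|2839 Hartland Avenue|Fond Du Lac Wisconsin|US",
]
for row in rows:
    print(drop_first_by_split(row))
```

All three helpers agree on rows that contain at least one '|'. They differ on rows without a separator: the split version returns an empty string, while the other two return the input unchanged.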

Related

How to get parents and grand parents tags given specific attribute in XML in python?

I have an xml with a structure like this one:
<cat>
<foo>
<fooID>1</fooID>
<fooName>One</fooName>
<bar>
<barID>a</barID>
<barName>small_a</barName>
<barClass>
<baz>
<qux>
<corge>
<corgeName>...</corgeName>
<corgeType>
<corgeReport>
<corgeReportRes Reference="x" Channel="High">
<Pos>1</Pos>
</corgeReportRes>
</corgeReport>
</corgeType>
</corge>
</qux>
</baz>
</barClass>
</bar>
<bar>
<barID>b</barID>
<barName>small_b</barName>
<barClass>
<baz>
<qux>
<corge>
<corgeName>...</corgeName>
<corgeType>
<corgeReport>
<corgeReportRes Reference="y" Channel="High">
<Pos>1</Pos>
</corgeReportRes>
</corgeReport>
</corgeType>
</corge>
</qux>
</baz>
</barClass>
</bar>
</foo>
<foo>
<fooID>2</fooID>
<fooName>Two</fooName>
<bar>
<barID>c</barID>
<barName>small_c</barName>
<barClass>
<baz>
<qux>
<corge>
<corgeName>...</corgeName>
<corgeType>
<corgeReport>
<corgeReportRes Reference="z" Channel="High">
<Pos>1</Pos>
</corgeReportRes>
</corgeReport>
</corgeType>
</corge>
</qux>
</baz>
</barClass>
</bar>
</foo>
</cat>
And, I would like to obtain the values of specific parent/grand parent/grand grand parent tags that have a node with attribute Channel="High". I would like to obtain only fooID value, fooName value, barID value, barName value.
I have the following code in Python 3:
import xml.etree.ElementTree as xmlET
root = xmlET.parse('file.xml').getroot()
test = root.findall(".//*[@Channel='High']")
Which is actually giving me a list of elements that match, however, I still need the information of the specific parents/grand parents/grand grand parents.
How could I do that?
fooID | fooName | barID | barName
- - - - - - - - - - - - - - - - -
1 | One | a | small_a <-- This is the information I'm interested in
1 | One | b | small_b <-- Also this
2 | Two | c | small_c <-- And this
Edit: fooID and fooName nodes are siblings of the grand-grand-parent bar, the one that contains the Channel="High". It's almost the same case for barID and barName, they are siblings of the grand-parent barClass, the one that contains the Channel="High". Also, what I want to obtain is the values 1, One, a and small_a, not filtering by it, since there will be multiple foo blocks.
If I understand you correctly, you are probably looking for something like this (using lxml):
from lxml import etree
foos = """[your xml above]"""
doc = etree.fromstring(foos)
print('fooID | fooName | barID | barName')
# Select every bar that has a corgeReportRes descendant with Channel="High"
for bar in doc.xpath('//bar[.//corgeReportRes[@Channel="High"]]'):
    foo = bar.getparent()  # fooID and fooName are siblings of bar
    items = [
        foo.findtext('fooID'),
        foo.findtext('fooName'),
        bar.findtext('barID'),
        bar.findtext('barName'),
    ]
    print(' | '.join(items))
Output:
fooID | fooName | barID | barName
1 | One | a | small_a
1 | One | b | small_b
2 | Two | c | small_c
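The question uses the standard-library xml.etree.ElementTree, whose elements have no parent pointer. One way to stay with it is to walk down from each foo to its bar children and test whether a bar has a matching descendant. The sketch below assumes a shortened sample that differs from the question's XML (one Channel is set to "Low" to show the filtering), and rows_with_channel is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same foo/bar nesting as the question.
SAMPLE = """
<cat>
  <foo>
    <fooID>1</fooID>
    <fooName>One</fooName>
    <bar>
      <barID>a</barID>
      <barName>small_a</barName>
      <barClass><corgeReportRes Reference="x" Channel="High"/></barClass>
    </bar>
    <bar>
      <barID>b</barID>
      <barName>small_b</barName>
      <barClass><corgeReportRes Reference="y" Channel="Low"/></barClass>
    </bar>
  </foo>
  <foo>
    <fooID>2</fooID>
    <fooName>Two</fooName>
    <bar>
      <barID>c</barID>
      <barName>small_c</barName>
      <barClass><corgeReportRes Reference="z" Channel="High"/></barClass>
    </bar>
  </foo>
</cat>
"""

def rows_with_channel(root, channel="High"):
    rows = []
    for foo in root.iter("foo"):
        for bar in foo.findall("bar"):
            # Keep the bar only if some descendant carries the attribute value.
            if bar.find(f".//*[@Channel='{channel}']") is not None:
                rows.append((foo.findtext("fooID"), foo.findtext("fooName"),
                             bar.findtext("barID"), bar.findtext("barName")))
    return rows

root = ET.fromstring(SAMPLE)
for row in rows_with_channel(root):
    print(" | ".join(row))
```

Walking top-down avoids having to build a child-to-parent map, at the cost of hard-coding the foo and bar tag names.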

Is there a Python function that removes characters (with a digit) from a string?

I am working on a project about gentrification. My teammates pulled data from the census and cleaned it to get the values we need. The issue is that the zip code values are printed without their leading 0's (i.e. "2322" when it should be "02322"). We managed to find the tact value that prints the full zip code with the tact codes ("ZCTA5 02322"). I want to remove "ZCTA5" to get the zip code alone.
I've tried the below code but it only gets rid of the "ZCTA" instead of "ZCTA5" (i.e. "502322"). I'm also concerned that if I manage to remove the 5 with the characters, it will remove all 5's in the zip codes as well.
From there I will be pulling from pgeocode to access the respective lat & lng values to create the heatmap. Please help?
I've tried the .replace() and .translate() functions. Replace still prints the zip codes with the 5. Translate raises an AttributeError.
Sample data
Zipcode | Name | Change_In_Value | Change_In_Income | Change_In_Degree | Change_In_Rent
2322 | ZCTA5 02322 | -0.050242 | -0.010953 | 0.528509 | -0.013263
2324 | ZCTA5 02324 | 0.012279 | -0.022949 | -0.040456 | 0.210664
2330 | ZCTA5 02330 | 0.020438 | 0.087415 | -0.095076 | -0.147382
2332 | ZCTA5 02332 | 0.035024 | 0.054745 | 0.044315 | 1.273772
2333 | ZCTA5 02333 | -0.012588 | 0.079819 | 0.182517 | 0.156093
Translate
zipcode = []
test2 = gent_df['Name'] = gent_df['Name'].astype(str).translate({ord('ZCTA5'): None}).astype(int)
zipcode.append(test2)
test2.head()
Replace
zipcode = []
test2 = gent_df['Name'] = gent_df['Name'].astype(str).replace(r'\D', '').astype(int)
zipcode.append(test2)
test2.head()
Replace
Expected:
24093
26039
34785
38944
29826
Actual:
524093
526039
534785
538944
529826
Translate
Expected:
24093
26039
34785
38944
29826
Actual:
AttributeError Traceback (most recent call last)
<ipython-input-71-0e5ff4660e45> in <module>
3 zipcode = []
4
----> 5 test2 = gent_df['Name'] = gent_df['Name'].astype(str).translate({ord('ZCTA5'): None}).astype(int)
6 # zipcode.append(test2)
7 test2.head()
~\Anaconda3\envs\MyPyEnv\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5178 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5179 return self[name]
-> 5180 return object.__getattribute__(self, name)
5181
5182 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'translate'
It looks like you are using pandas, so you should be able to use the .str.lstrip() method. Note that to_strip is treated as a set of characters rather than a prefix, so stripping stops at the space after "ZCTA5"; a following .str.strip() removes that space:
gent_df.Name = gent_df.Name.str.lstrip(to_strip='ZCTA5').str.strip()
See the pandas documentation for .str.strip(), .str.lstrip(), and .str.rstrip().
I hope this helps!
There are many ways to do this. I can think of 2 off the top of my head.
If you want to keep the last 5 characters of the zipcode string, regardless of whether they are digits or not:
gent_df['Name'] = gent_df['Name'].str[-5:]
If you want to get the last 5 digits of the zipcode string:
gent_df['Name'] = gent_df['Name'].str.extract(r'(\d{5})$')[0]
Include some sample data for a more specific answer.
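The following plain-Python sketch (no pandas required) illustrates two points from the answers above: why lstrip behaves the way it does, and how the trailing-digits regex keeps the leading zero. zip_from_name is a hypothetical helper, and "ZCTA5 55501" is a made-up value used only to show the edge case.

```python
import re

def zip_from_name(name):
    # Return the trailing five digits of a label such as "ZCTA5 02322".
    match = re.search(r"(\d{5})$", name)
    return match.group(1) if match else None

# str.lstrip treats its argument as a set of characters, not as a prefix.
print(repr("ZCTA5 02322".lstrip("ZCTA5")))   # the space stops the stripping
print(repr("ZCTA5 55501".lstrip("ZCTA5 ")))  # adding the space also eats leading 5s

print(zip_from_name("ZCTA5 02322"))
```

Keeping the result as a string matters here: converting it to an integer would drop the leading zero again.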

How to remove '/5' from CSV file

I am cleaning a restaurant data set using Pandas' read_csv.
I have columns like this:
name, online_order, book_table, rate, votes
xxxx, Yes, Yes, 4.5/5, 705
I expect them to be like this:
name, online_order, book_table, rate, votes
xxxx, Yes, Yes, 4.5, 705
You basically need to split each item of dataframe["rate"] on "/" and keep the part you need. Define a helper and apply it to the column:
def getRate(x):
    return str(x).split("/")[0]
To use it with the column named rate:
dataframe["rate"] = dataframe["rate"].apply(getRate)
You can use the python .split() function to remove specific text, given that the text is consistently going to be "/5", and there are no instances of "/5" that you want to keep in that string. You can use it like this:
num = "4.5/5"
num.split("/5")[0]
output: '4.5'
If this isn't exactly what you need, Python's re module offers more flexible pattern matching.
You can use Series.apply() to make your replacement operation on the rate column:
def clean(x):
    if "/" not in x:
        return x
    else:
        return x[0:x.index('/')]
df.rate = df.rate.apply(clean)
print(df)
Output
+----+-------+---------------+-------------+-------+-------+
| | name | online_order | book_table | rate | votes |
+----+-------+---------------+-------------+-------+-------+
| 0 | xxxx | Yes | Yes | 4.5 | 705 |
+----+-------+---------------+-------------+-------+-------+
EDIT
Edited to handle situations in which there could be multiple / or a denominator other than /5 (i.e. /4 or /1/3 ...)
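As a rough, self-contained sketch of the same idea, the helper below keeps the text before the first "/" and passes non-string values through untouched. clean_rate is a hypothetical name, and the sample values other than "4.5/5" are invented for illustration.

```python
def clean_rate(value):
    # Keep the text before the first "/", e.g. "4.5/5" -> "4.5".
    # Non-string values (for example a missing rating) pass through unchanged.
    if not isinstance(value, str):
        return value
    return value.partition("/")[0].strip()

for raw in ["4.5/5", "3.9 /5", "NEW", "4/4"]:
    print(raw, "->", clean_rate(raw))
```

With pandas this could be applied as df["rate"].apply(clean_rate). The type check avoids the TypeError that `"/" not in x` would raise on a float NaN.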

Python Print Table for Term and Definition with Handled Overflow

I'm trying to make a program that prints out a two column table (Term and Definition) something like this: (table width should be 80 characters)
+--------------------------------------------------------------------------+
| Term                                 | Definition                        |
+--------------------------------------+-----------------------------------+
| this is the first term.              |This is the definition for thefirst|
|                                      |term that wraps around because the |
|                                      |definition is longer than the width|
|                                      |of the column.                     |
+--------------------------------------+-----------------------------------+
|The term may also be longer than the  |This is the definition for the     |
|width of the column and should wrap   |second term.                       |
|around as well.                       |                                   |
+--------------------------------------+-----------------------------------+
I have existing code for this, but it prints out "this is the first term" on every line because I have used a nested for loop. (Also tried implementing the textwrap module) Here is the code that I have:
# read file
with open(setsList[selectedSet-1], "r", newline="") as setFile:
    cardList = list(csv.reader(setFile))
    setFile.close()
for i in range(len(cardList)):
    wrapped_term = textwrap.wrap(cardList[i][0], 30)
    wrapped_definition = textwrap.wrap(cardList[i][1], 30)
    for line in wrapped_term:
        for line2 in wrapped_definition:
            print(line, " ",line2)
    print("- - - - - - - - - - - - - - - - - - - - - - - - - - -")
Can anyone suggest a solution? Thank you.
After a lot of trial and error and random YouTube videos, here is the solution (in case anyone has a similar problem):
import csv
import textwrap
from itertools import zip_longest
with open("table.csv", "r", newline="") as setFile:
    cardList = list(csv.reader(setFile))
# Text widths chosen so that every row is as wide as the 80-character borders.
column1 = 36
column2 = 37
row_border = "+" + "-" * 38 + "+" + "-" * 39 + "+"
print("+" + "-" * 78 + "+")
print("|", "Term".ljust(column1), "|", "Definition".ljust(column2), "|")
for card in cardList:
    wrapped_term = textwrap.wrap(card[0], 30)
    wrapped_definition = textwrap.wrap(card[1], 30)
    print(row_border)
    # zip_longest pairs the wrapped lines and fills the shorter column with "".
    for term_line, definition_line in zip_longest(wrapped_term, wrapped_definition, fillvalue=""):
        print("|", term_line.ljust(column1), "|", definition_line.ljust(column2), "|")
print(row_border)
Basically, I created a wrapped version of each term and definition.
The wrapped lines are then paired up row by row; whichever of the term or the definition has fewer lines is padded with blank lines, so neither column loses text.
Each line is padded to the column width and printed between the borders.
With help from this video: (https://www.youtube.com/watch?v=B9BRuhqEb2Q), I formatted the table.
Hope this helps anyone struggling with a similar problem. The same approach can be applied to any number of columns in a table, and any length of csv file.
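To illustrate the claim that the approach extends to any number of columns, here is a minimal sketch that wraps every cell and pads the shorter cells with blank lines. render_table and its widths parameter are hypothetical names, not part of the original program, and the sample rows are invented.

```python
import textwrap
from itertools import zip_longest

def render_table(rows, widths):
    # Build the table as a list of text lines; widths[i] is the text width of column i.
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    lines = [border]
    for row in rows:
        # Wrap every cell, then walk the wrapped cells line by line.
        wrapped = [textwrap.wrap(cell, w) for cell, w in zip(row, widths)]
        for parts in zip_longest(*wrapped, fillvalue=""):
            cells = (part.ljust(w) for part, w in zip(parts, widths))
            lines.append("| " + " | ".join(cells) + " |")
        lines.append(border)
    return lines

rows = [
    ("Term", "Definition"),
    ("short term", "a definition that is long enough to wrap onto a second line"),
]
print("\n".join(render_table(rows, [12, 30])))
```

Returning the lines instead of printing them directly makes the layout easy to check: every line should have the same length.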

Visual CSV Reader that doesn't keep the file "open"

I'm looking for a way to view CSV files as you would in Excel (nice, clear layout). The only issue with Excel is that it doesn't notify you of updates, nor does it release the file once it's "open". Notepad++, by contrast, tells you when the file has been updated and lets you keep the file open while you manipulate it in, let's say, Python.
The only problem with Notepad++ is that a CSV is hard to read in it.
Notepad++
Pros
"This file has been modified by another program, Would you like to reload it?"
Closes the file once the data is loaded visually.
Can compare against other files easily.
Cons
No clear visual viewer.
Excel
Pros
Clear visual viewer.
Cons
Doesn't close the file once the data is loaded visually.
Doesn't alert you when the file has been modified by another program (you can't modify it if it's open in Excel)
Can't compare to another file easily.
Anyone know of a program to help me?
Before
data,data,datalongdatadatalongdata,datalongdata,datalongdata,data,data,data,datalongdata,data,data,data,data,datalongdata,datalongdata,data
data,data,data,data,data,data,data,data,data,data,data,data,data,data,data,data
data,data,data,data,data,data,data,data,data,data,data,data,data,data,data,data
data,data,data,data,data,data,data,data,data,data,data,data,data,data,datalongdatadatalongdatadatalongdatadatalongdatadatalongdata,data
data,data,data,data,data,data,data,data,data'data,data,data,data,data,data,data,data
data,data,data,data,data,data,data,data,data,data,data,data,data,data,data,data
data,data,data,data,data,data,data,data,data,data,data,data,data,data,data,data
data,data,data,data,data,data,data,data,data,data,data,data,data,datalongdatadatalongdatadatalongdatadatalongdata,data,data
After (Look at line 5 for data'data)
data,data,datalongdatadatalongdata,datalongdata,datalongdata,data,data,data,datalongdata,data,data,data,data,datalongdata,datalongdata ,data
data,data,data ,data ,data ,data,data,data,data ,data,data,data,data,data ,data ,data
data,data,data ,data ,data ,data,data,data,data ,data,data,data,data,data ,data ,data
data,data,data ,data ,data ,data,data,data,data ,data,data,data,data,data ,datalongdatadatalongdatadatalongdatadatalongdatadatalongdata,data
data,data,data ,data ,data ,data,data,data,data'data,data,data,data,data,data,data,data
data,data,data ,data ,data ,data,data,data,data,data,data,data,data,data,data,data
data,data,data ,data ,data ,data,data,data,data,data,data,data,data,data,data,data
data,data,data ,data ,data ,data,data,data,data,data,data,data,data,datalongdatadatalongdatadatalongdatadatalongdata,data,data
You will never achieve the great visual experience of Excel in Notepad++!
The only "solution" I know of lies inside the TextFX plugin.
Select all your text, and then go to TextFX > TextFX Edit > Line up multiple lines by (,). This will convert the following example:
heeey,this,is,a,testtttttttttt
34,3,2234,3,5
123,123,123,123,123
To:
heeey,this,is  ,a  ,testtttttttttt
34   ,3   ,2234,3  ,5
123  ,123 ,123 ,123,123
PS. You might want to check CSVed. Never had to use it so I don't know if it has all the features you need, but from the screenshot it looks good :)
I'm back with a new answer :) After you found that bug in TextFX, I decided to create something better using the Python Script plugin.
Examples
It will convert the following example:
heeey,this,is,a,testtttttttttt
34,3
123,123,123,123,123
To:
+ ----------------------------------------- +
| heeey | this | is  | a   | testtttttttttt |
| 34    | 3    |     |     |                |
| 123   | 123  | 123 | 123 | 123            |
+ ----------------------------------------- +
And the following:
title1,title2,title3,title4,title5,title6,title7,title8,title9
datalongdata,datalongdata,data,data,data,datalongdata,data,data,data
data,data,data,data,data,data,data,datalongdatadatalongdatadatalongdatadatalongdatadatalongdata,data
data,data'data,data,data,data,data,data,data,data
To:
+ ------------------------------------------------------------------------------------------------------------------------------------------------------ +
| title1       | title2       | title3 | title4 | title5 | title6       | title7 | title8                                                       | title9 |
+ ------------------------------------------------------------------------------------------------------------------------------------------------------ +
| datalongdata | datalongdata | data   | data   | data   | datalongdata | data   | data                                                         | data   |
| data         | data         | data   | data   | data   | data         | data   | datalongdatadatalongdatadatalongdatadatalongdatadatalongdata | data   |
| data         | data'data    | data   | data   | data   | data         | data   | data                                                         | data   |
+ ------------------------------------------------------------------------------------------------------------------------------------------------------ +
Installation
Install Python Script plugin, from Plugin Manager or from the official website.
When installed, go to Plugins > Python Script > New Script. Choose a filename for your new file (eg pretty_csv.py) and copy the code that follows.
Open your csv file and then run Plugins > Python Script > Scripts > pretty_csv.py. This will open a new tab with your table.
Please note that in the first few lines of the script you can alter some parameters. I hope that the variable names are self-explanatory! I guess the most important ones are the boolean ones, border and header.
#define parameters
delimiter=","
new_delimiter=" | "
border=True
border_vertical_left="| "
border_vertical_right=" |"
border_horizontal="-"
border_corner_tl="+ "
border_corner_tr=" +"
border_corner_bl="+ "
border_corner_br=" +"
header=True
border_header_separator="-"
border_header_left="+ "
border_header_right=" +"
newline="\n"
#load csv
content=editor.getText()
content=content.rstrip(newline)
rows=content.split(newline)
#find the max number of columns (so having rows with different number of columns is no problem)
max_columns=max([row.count(delimiter) for row in rows])
if max_columns>0:
    max_columns=max_columns+1
    #find the max width of each column
    column_max_width=[0]*max_columns
    for row in rows:
        for index,column in enumerate(row.split(delimiter)):
            width=len(column)
            if width>column_max_width[index]:
                column_max_width[index]=width
    total_length=sum(column_max_width)+len(new_delimiter)*(max_columns-1)
    #create new document
    notepad.new()
    #apply the changes
    left=border_vertical_left if border is True else ""
    right=border_vertical_right if border is True else ""
    left_header=border_header_left if border is True else ""
    right_header=border_header_right if border is True else ""
    for row_number,row in enumerate(rows):
        columns=row.split(delimiter)
        max_index=len(columns)-1
        for index in range(max_columns):
            if index>max_index:
                columns.append(' ' * column_max_width[index])
            else:
                diff=column_max_width[index]-len(columns[index])
                columns[index]=columns[index] + ' ' * diff
        if row_number==0 and border is True: #draw top border
            editor.addText(border_corner_tl + border_horizontal * total_length + border_corner_tr + newline)
        editor.addText(left + new_delimiter.join(columns) + right + newline) #print the new row
        if row_number==0 and header is True: #draw header's separator
            editor.addText(left_header + border_header_separator * total_length + right_header + newline)
        if row_number==len(rows)-1 and border is True: #draw bottom border
            editor.addText(border_corner_bl + border_horizontal * total_length + border_corner_br)
else:
    console.clear()
    console.show()
    console.writeError("No \"%s\" delimiter found!" % delimiter)
If you find any bugs or have any suggestions please let me know!
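For anyone who wants to try the alignment logic outside Notepad++, the sketch below reimplements its core as a plain function that takes the CSV text and returns the table as a string. It is a simplified approximation, not the plugin script itself: pretty_csv is a hypothetical function name, it always draws the outer border and omits the header separator, and it does not use the editor, notepad or console objects.

```python
def pretty_csv(text, delimiter=",", new_delimiter=" | "):
    # Split into rows and cells; rows may have different numbers of cells.
    rows = [line.split(delimiter) for line in text.rstrip("\n").split("\n")]
    n_cols = max(len(row) for row in rows)
    # The widest cell in each column decides that column's width.
    widths = [0] * n_cols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    total = sum(widths) + len(new_delimiter) * (n_cols - 1)
    border = "+ " + "-" * total + " +"
    out = [border]
    for row in rows:
        # Pad short rows with empty cells, then pad every cell to its column width.
        cells = row + [""] * (n_cols - len(row))
        padded = [cell.ljust(w) for cell, w in zip(cells, widths)]
        out.append("| " + new_delimiter.join(padded) + " |")
    out.append(border)
    return "\n".join(out)

sample = "heeey,this,is,a,testtttttttttt\n34,3\n123,123,123,123,123\n"
print(pretty_csv(sample))
```

Like the original script, this splits on the raw delimiter and so does not understand quoted fields; Python's csv module would be the safer choice for files that contain them.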
