I have a dataframe which contains negative numbers, with accountancy notation i.e.:
df.select('sales').distinct().show()
+------------+
| sales |
+------------+
| 18 |
| 3 |
| 10 |
| (5)|
| 4 |
| 40 |
| 0 |
| 8 |
| 16 |
| (2)|
| 2 |
| (1)|
| 14 |
| (3)|
| 9 |
| 19 |
| (6)|
| 1 |
| (9)|
| (4)|
+------------+
only showing top 20 rows
The numbers wrapped in () are negative. How can I replace them to have minus values instead i.e. (5) becomes -5 and so on.
Here is what I have tried:
sales = (
df
.select('sales')
.withColumn('sales_new',
sf.when(sf.col('sales').substr(1,1) == '(',
sf.concat(sf.lit('-'), sf.col('sales').substr(2,3)))
.otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|salees |sales_new|
+---------+---------+
| 151 | 151 |
| 134 | 134 |
| 151 | 151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
| 151 | 151 |
| 50 | 50 |
| 101 | 101 |
| 134 | 134 |
|(134) |-134 |
| 46 | 46 |
| 151 | 151 |
| 134 | 134 |
| 185 | 185 |
| 84 | 84 |
| 188 | 188 |
|(94) |-94) |
| 38 | 38 |
| 21 | 21 |
+---------+---------+
The issue is that the length of sales can vary so hardcoding a value into the substring() won't work in some cases.
I have tried using regexp_replace but get an error that:
PatternSyntaxException: Unclosed group near index 1
sales = (
df
.select('sales')
.withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)
This can be solved with a case statement and regular expression together:
from pyspark.sql.functions import regexp_replace, col
sales = (
df
.select('sales')
.withColumn('sales_new', sf.when(sf.col('sales').substr(1,1) == '(',
sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), '\(|\)', '')))
.otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|sales |sales_new|
+---------+---------+
|151 |151 |
|134 |134 |
|151 |151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
|151 |151 |
|50 |50 |
|101 |101 |
|134 |134 |
|(134) |-134 |
|46 |46 |
|151 |151 |
|134 |134 |
|185 |185 |
|84 |84 |
|188 |188 |
|(94) |-94 |
|38 |38 |
|21 |21 |
+---------+---------+
You can slice the string from the second character to the second last character, and then convert it to float, for example:
def convert(number):
try:
number = float(number)
except:
number = number[1:-1]
number = float(number)
return number
You can iterate through all the elements and apply this function.
I'm trying to scrape an IMDB list, but currently all I that prints out in the table is the first movie (Toy Story).
I've tried to initialize count = 0 and then I've tried to update first_movie = movie_containers[count+1] at the end of the for loop, but it doesn't work. Whatever I try, I get various errors such as 'Arrays Must Be the Same Length'. When it does work, like I said, only the first movie on the page is printed into the table 50 times.
from bs4 import BeautifulSoup
from requests import get
import pandas as pd
url = 'https://www.imdb.com/search/title/?genres=comedy&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=3PWY0EZBAKM22YP2F114&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_1'
response = get(url)
html = BeautifulSoup(response.text, 'lxml')
movie_containers = html.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
name = first_movie.h3.a.text
year = first_movie.find('span', class_='lister-item-year text-muted unbold').text
rating = float(first_movie.find('div', class_='inline-block ratings-imdb-rating').text.strip())
metascore = int(first_movie.find('span', class_='metascore favorable').text)
vote = first_movie.find('span', attrs={'name':'nv'})
vote = vote['data-value']
gross = first_movie.find('span', attrs={'data-value':'272,257,544'})
gross = '$' + gross['data-value']
info_container = first_movie.findAll('p', class_='text-muted')[0]
certificate = info_container.find('span', class_='certificate').text
runtime = info_container.find('span', class_='runtime').text
genre = info_container.find('span', class_='genre').text.strip()
description = first_movie.findAll('p', class_='text-muted')[1].text.strip()
#second_movie_metascore = movie_containers[1].find('div', class_='ratings-metascore')
names = []
years = []
ratings = []
metascores = []
votes = []
grossing = []
certificates = []
runtimes = []
genres = []
descriptions = []
for container in movie_containers:
try:
name = first_movie.h3.a.text
names.append(name)
except:
continue
try:
year = first_movie.find('span', class_='lister-item-year text-muted unbold').text
years.append(year)
except:
continue
try:
rating = float(first_movie.find('div', class_='inline-block ratings-imdb-rating').text.strip())
ratings.append(rating)
except:
continue
try:
metascore = int(first_movie.find('span', class_='metascore favorable').text)
metascores.append(metascore)
except:
continue
try:
vote = first_movie.find('span', attrs={'name':'nv'})
vote = vote['data-value']
votes.append(vote)
except:
continue
try:
gross = first_movie.find('span', attrs={'data-value':'272,257,544'})
gross = '$' + gross['data-value']
grossing.append(gross)
except:
continue
try:
certificate = info_container.find('span', class_='certificate').text
certificates.append(certificate)
except:
continue
try:
runtime = info_container.find('span', class_='runtime').text
runtimes.append(runtime)
except:
continue
try:
genre = info_container.find('span', class_='genre').text.strip()
genres.append(genre)
except:
continue
try:
description = first_movie.findAll('p', class_='text-muted')[1].text.strip()
descriptions.append(description)
except:
continue
test_df = pd.DataFrame({'Movie': names,
'Year': years,
'IMDB': ratings,
'Metascore': metascores,
'Votes': votes,
'Gross': grossing,
'Certificate': certificates,
'Runtime': runtimes,
'Genres': genres,
'Descriptions': descriptions
})
#print(test_df.info())
print(test_df)
Also, how do I start the pd list at 1, not 0 when it prints out a table?
You can try this code to scrape the data. I'm now printing it on screen, but you will put the data into the Panda's dataframe:
from bs4 import BeautifulSoup
import requests
import textwrap
url = 'https://www.imdb.com/search/title/?genres=comedy&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=3PWY0EZBAKM22YP2F114&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
names = []
years = []
ratings = []
metascores = []
votes = []
grossing = []
certificates = []
runtimes = []
genres = []
descriptions = []
for i in soup.select('.lister-item-content'):
for t in i.select('h3 a'):
names.append(t.text)
break
else:
names.append('-')
for t in i.select('.lister-item-year'):
years.append(t.text)
break
else:
years.append('-')
for t in i.select('.ratings-imdb-rating'):
ratings.append(t.text.strip())
break
else:
ratings.append('-')
for t in i.select('.metascore'):
metascores.append(t.text.strip())
break
else:
metascores.append('-')
for t in i.select('.sort-num_votes-visible span:contains("Votes:") + span[data-value]'):
votes.append(t['data-value'])
break
else:
votes.append('-')
for t in i.select('.sort-num_votes-visible span:contains("Gross:") + span[data-value]'):
grossing.append(t['data-value'])
break
else:
grossing.append('-')
for t in i.select('.certificate'):
certificates.append(t.text.strip())
break
else:
certificates.append('-')
for t in i.select('.runtime'):
runtimes.append(t.text.strip())
break
else:
runtimes.append('-')
for t in i.select('.genre'):
genres.append(t.text.strip().split(','))
break
else:
genres.append('-')
for t in i.select('p.text-muted')[1:2]:
descriptions.append(t.text.strip())
break
else:
descriptions.append('-')
for row in zip(names, years, ratings, metascores, votes, grossing, certificates, runtimes, genres, descriptions):
for col_num, data in enumerate(row):
if col_num == 0:
t = textwrap.shorten(str(data), 35)
print('{: ^35}'.format(t), end='|')
elif col_num in (1, 2, 3, 4, 5, 6, 7):
t = textwrap.shorten(str(data), 12)
print('{: ^12}'.format(t), end='|')
else:
t = textwrap.shorten(str(data), 35)
print('{: ^35}'.format(t), end='|')
print()
Prints:
Toy Story 4 | (2019) | 8.3 | 84 | 50496 |272,257,544 | G | 100 min |['Animation', ' Adventure', ' [...]|When a new toy called "Forky" [...]|
Charlie's Angels | (2019) | - | - | - | - | - | - |['Action', ' Adventure', ' Comedy']| Reboot of the 2000 action [...] |
Murder Mystery | (2019) | 6.0 | 38 | 46255 | - | PG-13 | 97 min | ['Action', ' Comedy', ' Crime'] | A New York cop and his wife [...] |
Eile veel |(III) (2019)| 7.1 | 56 | 10539 | 26,132,740 | PG-13 | 116 min | ['Comedy', ' Fantasy', ' Music'] | A struggling musician [...] |
Mehed mustas: globaalne oht | (2019) | 5.7 | 38 | 24338 | 66,894,949 | PG-13 | 114 min |['Action', ' Adventure', ' Comedy']|The Men in Black have always [...] |
Good Omens | (2019) | 8.3 | - | 24804 | - | - | 60 min | ['Comedy', ' Fantasy'] | A tale of the bungling of [...] |
Ükskord Hollywoodis | (2019) | 9.6 | 88 | 6936 | - | - | 159 min | ['Comedy', ' Drama'] |A faded television actor and [...] |
Aladdin | (2019) | 7.4 | 53 | 77230 |313,189,616 | PG | 128 min |['Adventure', ' Comedy', ' Family']|A kind-hearted street urchin [...] |
Mr. Iglesias | (2019– ) | 7.2 | - | 2266 | - | - | 30 min | ['Comedy'] | A good-natured high school [...] |
Shazam! | (2019) | 7.3 | 70 | 129241 |140,105,000 | PG-13 | 132 min |['Action', ' Adventure', ' Comedy']| We all have a superhero [...] |
Shaft | (2019) | 6.4 | 40 | 12016 | 19,019,975 | R | 111 min | ['Action', ' Comedy', ' Crime'] | John Shaft Jr., a cyber [...] |
Kontor |(2005–2013) | 8.8 | - | 301620 | - | - | 22 min | ['Comedy'] |A mockumentary on a group of [...] |
Sõbrad |(1994–2004) | 8.9 | - | 683205 | - | - | 22 min | ['Comedy', ' Romance'] | Follows the personal and [...] |
Lelulugu | (1995) | 8.3 | 95 | 800957 |191,796,233 | - | 81 min |['Animation', ' Adventure', ' [...]| A cowboy doll is profoundly [...] |
Lelulugu 3 | (2010) | 8.3 | 92 | 689098 |415,004,880 | - | 103 min |['Animation', ' Adventure', ' [...]| The toys are mistakenly [...] |
Orange Is the New Black | (2013– ) | 8.1 | - | 256417 | - | - | 59 min | ['Comedy', ' Crime', ' Drama'] | Convicted of a decade old [...] |
Brooklyn Nine-Nine | (2013– ) | 8.4 | - | 154342 | - | - | 22 min | ['Comedy', ' Crime'] | Jake Peralta, an immature, [...] |
Always Be My Maybe | (2019) | 6.9 | 64 | 26210 | - | PG-13 | 101 min | ['Comedy', ' Romance'] | A pair of childhood friends [...] |
The Dead Don't Die | (2019) | 6.0 | 54 | 6841 | 6,116,830 | R | 104 min | ['Comedy', ' Fantasy', ' Horror'] | The peaceful town of [...] |
Suure Paugu teooria |(2007–2019) | 8.2 | - | 653122 | - | - | 22 min | ['Comedy', ' Romance'] | A woman who moves into an [...] |
Lelulugu 2 | (1999) | 7.9 | 88 | 476104 |245,852,179 | - | 92 min |['Animation', ' Adventure', ' [...]|When Woody is stolen by a toy [...]|
Fast & Furious Presents: [...] | (2019) | - | - | - | - | PG-13 | - |['Action', ' Adventure', ' Comedy']|Lawman Luke Hobbs and outcast [...]|
Dead to Me | (2019– ) | 8.2 | - | 23149 | - | - | 30 min | ['Comedy', ' Drama'] | A series about a powerful [...] |
Pintsaklipslased | (2011– ) | 8.5 | - | 328568 | - | - | 44 min | ['Comedy', ' Drama'] | On the run from a drug deal [...] |
The Secret Life of Pets 2 | (2019) | 6.6 | 55 | 8613 |135,983,335 | PG | 86 min |['Animation', ' Adventure', ' [...]| Continuing the story of Max [...] |
Good Girls | (2018– ) | 7.9 | - | 18518 | - | - | 43 min | ['Comedy', ' Crime', ' Drama'] | Three suburban mothers [...] |
Ralph Breaks the Internet | (2018) | 7.1 | 71 | 91165 |201,091,711 | PG | 112 min |['Animation', ' Adventure', ' [...]|Six years after the events of [...]|
Trolls 2 | (2020) | - | - | - | - | - | - |['Animation', ' Adventure', ' [...]| Sequel to the 2016 animated hit. |
Booksmart | (2019) | 7.4 | 84 | 24935 | 21,474,121 | R | 102 min | ['Comedy'] | On the eve of their high [...] |
The Old Man & the Gun | (2018) | 6.8 | 80 | 27337 | 11,277,120 | PG-13 | 93 min |['Biography', ' Comedy', ' Crime'] | Based on the true story of [...] |
Fleabag | (2016– ) | 8.6 | - | 25041 | - | - | 27 min | ['Comedy', ' Drama'] |A comedy series adapted from [...] |
Schitt's Creek | (2015– ) | 8.2 | - | 18112 | - | - | 22 min | ['Comedy'] |When rich video-store magnate [...]|
Catch-22 | (2019– ) | 7.9 | - | 6829 | - | - | 45 min | ['Comedy', ' Crime', ' Drama'] |Limited series adaptation of [...] |
Häbitu | (2011– ) | 8.7 | - | 171782 | - | - | 46 min | ['Comedy', ' Drama'] | A scrappy, fiercely loyal [...] |
Jane the Virgin | (2014– ) | 7.8 | - | 30106 | - | - | 60 min | ['Comedy'] | A young, devout Catholic [...] |
Parks and Recreation |(2009–2015) | 8.6 | - | 178220 | - | - | 22 min | ['Comedy'] | The absurd antics of an [...] |
One Punch Man: Wanpanman | (2015– ) | 8.9 | - | 87166 | - | - | 24 min | ['Animation', ' Action', ' [...] |The story of Saitama, a hero [...] |
The Boys | (2019– ) | - | - | - | - | - | 60 min | ['Action', ' Comedy', ' Crime'] |A group of vigilantes set out [...]|
Pokémon Detective Pikachu | (2019) | 6.8 | 53 | 65217 |142,692,000 | PG | 104 min |['Action', ' Adventure', ' Comedy']| In a world where people [...] |
Kuidas ma kohtasin teie ema |(2005–2014) | 8.3 | - | 544472 | - | - | 22 min | ['Comedy', ' Romance'] | A father recounts to his [...] |
It's Always Sunny in Philadelphia | (2005– ) | 8.7 | - | 171517 | - | - | 22 min | ['Comedy'] | Five friends with big egos [...] |
Stuber | (2019) | 5.9 | 53 | 794 | - | R | 93 min | ['Action', ' Comedy'] |A detective recruits his Uber [...]|
Moodne perekond | (2009– ) | 8.4 | - | 314178 | - | - | 22 min | ['Comedy', ' Romance'] | Three different but related [...] |
The Umbrella Academy | (2019– ) | 8.1 | - | 73654 | - | - | 60 min |['Action', ' Adventure', ' Comedy']| A disbanded group of [...] |
Happy! |(2017–2019) | 8.3 | - | 25284 | - | - | 60 min | ['Action', ' Comedy', ' Crime'] | An injured hitman befriends [...] |
Rick and Morty | (2013– ) | 9.3 | - | 279411 | - | - | 23 min |['Animation', ' Adventure', ' [...]| An animated series that [...] |
Cobra Kai | (2018– ) | 8.8 | - | 34069 | - | - | 30 min | ['Action', ' Comedy', ' Drama'] |Decades after their 1984 All [...] |
Roheline raamat | (2018) | 8.2 | 69 | 215069 | 85,080,171 | PG-13 | 130 min |['Biography', ' Comedy', ' Drama'] | A working-class Italian- [...] |
Kondid |(2005–2017) | 7.9 | - | 130232 | - | - | 40 min | ['Comedy', ' Crime', ' Drama'] | Forensic anthropologist Dr. [...] |
Sex Education | (2019– ) | 8.4 | - | 68509 | - | - | 45 min | ['Comedy', ' Drama'] | A teenage boy with a sex [...] |
I have a set of data as below.
SHEET 1
+------+-------+
| JANUARY |
+------+-------+
+----+----------+------+-------+
| ID | NAME |COUNT | PRICE |
+----+----------+------+-------+
| 1 | ALFRED | 11 | 150 |
| 2 | ARIS | 22 | 120 |
| 3 | JOHN | 33 | 170 |
| 4 | CHRIS | 22 | 190 |
| 5 | JOE | 55 | 120 |
| 6 | ACE | 11 | 200 |
+----+----------+------+-------+
SHEET2
+----+----------+------+-------+
| ID | NAME |COUNT | PRICE |
+----+----------+------+-------+
| 1 | CHRIS | 13 | 123 |
| 2 | ACE | 26 | 165 |
| 3 | JOE | 39 | 178 |
| 4 | ALFRED | 21 | 198 |
| 5 | JOHN | 58 | 112 |
| 6 | ARIS | 11 | 200 |
+----+----------+------+-------+
The RESULT should look like this in sheet1 :
+------+-------++------+-------+
| JANUARY | FEBRUARY |
+------+-------++------+-------+
+----+----------+------+-------++-------+-------+
| ID | NAME |COUNT | PRICE || COUNT | PRICE |
+----+----------+------+-------++-------+-------+
| 1 | ALFRED | 11 | 150 || 21 | 198 |
| 2 | ARIS | 22 | 120 || 11 | 200 |
| 3 | JOHN | 33 | 170 || 58 | 112 |
| 4 | CHRIS | 22 | 190 || 13 | 123 |
| 5 | JOE | 55 | 120 || 39 | 178 |
| 6 | ACE | 11 | 200 || 26 | 165 |
+----+----------+------+-------++-------+-------+
I need formula in column name "FEBRUARY". this formula will find its match in sheet 2
Assuming the first Count value should go in cell E3 of Sheet1, the following formula would be the usual way of doing it:-
=INDEX(Sheet2!C:C,MATCH($B3,Sheet2!$B:$B,0))
Then the Price (in F3) would be given by
=INDEX(Sheet2!D:D,MATCH($B3,Sheet2!$B:$B,0))
I think this query will work fine for your requirement
SELECT `Sheet1$`.ID,`Sheet1$`.NAME, `Sheet1$`.COUNT AS 'Jan-COUNT',`Sheet1$`.PRICE AS 'Jan-PRICE', `Sheet2$`.COUNT AS 'Feb-COUNT',`Sheet2$`.PRICE AS 'Feb-PRICE'
FROM `C:\Users\Nagendra\Desktop\aaaaa.xlsx`.`Sheet1$` `Sheet1$`, `C:\Users\Nagendra\Desktop\aaaaa.xlsx`.`Sheet2$` `Sheet2$`
WHERE (`Sheet1$`.NAME=`Sheet2$`.NAME)
Provide Actual path insted of
C:\Users\Nagendra\Desktop\aaaaa.xlsx
First you need to know about how to make connection. So refer http://smallbusiness.chron.com/use-sql-statements-ms-excel-41193.html