Problem HTTP error 403 in Python 3 Web Scraping
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Hypnotic Puzzle2
--
Chapters
00:00 Problem Http Error 403 In Python 3 Web Scraping
00:49 Accepted Answer Score 365
01:33 Answer 2 Score 57
01:55 Answer 3 Score 27
02:40 Answer 4 Score 10
02:53 Thank you
--
Full question
https://stackoverflow.com/questions/1662...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #http #webscraping #httpstatuscode403
#avk47
ACCEPTED ANSWER
Score 365
This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen
req = Request(
url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.
TIP: since this is exercise, choose a different, non restrictive site. Maybe they are blocking urllib for some reason...
ANSWER 2
Score 57
Definitely it's blocking because of your use of urllib based on the user agent. This same thing is happening to me with OfferUp. You can create a new class called AppURLopener which overrides the user-agent with Mozilla.
import urllib.request
class AppURLopener(urllib.request.FancyURLopener):
version = "Mozilla/5.0"
opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
ANSWER 3
Score 27
"This is probably because of mod_security or some similar server security feature which blocks known
spider/bot
user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo
from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
The web_byte is a byte object returned by the server and the content type present in webpage is mostly utf-8. Therefore you need to decode web_byte using decode method.
This solves complete problem while I was having trying to scrape from a website using PyCharm
P.S -> I use python 3.4
ANSWER 4
Score 10
Based on previous answers this has worked for me with Python 3.7 by increasing the timeout to 10.
from urllib.request import Request, urlopen
req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()
print(webpage)