Python Web Scraping Tutorial

Web scraping, also called web data mining or web harvesting, is the process of building an agent that can automatically extract, parse, download, and organize useful information from the web.
Reaching as many potential clients as possible is important for most startups, since that is how they generate leads. One of the easiest ways to build a good client base is to collect as many business email addresses as possible and send them your service details from time to time.

There are many scraping tools on the internet that provide this service for free, but they impose data-extraction limits. Paid plans remove those limits, but why pay for one when you can build a scraper with your own hands?

This article will demonstrate how easy it is to build a simple web crawler in Python. Although it is a very simple example, it will be a good learning experience for beginners, especially those who are new to web scraping. This step-by-step tutorial will help you collect email addresses without any limits.
Let’s start building our web scraper. I will divide the code into pieces, commenting on what’s going on so that you get a deeper insight into how the whole process works. I will also share the entire code at the end of the post so you can analyze the whole process.
Step 1: Importing Modules
We will be using the following six modules for our project.
The details of the imported modules are given below:
- re is for regular expression matching.
- requests for sending HTTP requests.
- urlsplit for dividing the URLs into component parts.
- deque is a list-like container that supports fast appends and pops from either end.
- BeautifulSoup for pulling data from HTML files of different web pages.
- pandas for collecting the emails into a DataFrame and for further operations.
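Sketched out, the imports look like this (requests, beautifulsoup4, and pandas are third-party packages you may need to install with pip):

```python
import re                          # regular-expression matching
import requests                    # sending HTTP requests
from urllib.parse import urlsplit  # splitting a URL into its components
from collections import deque      # double-ended queue for the URL frontier
from bs4 import BeautifulSoup      # pulling data out of HTML
import pandas as pd                # formatting the emails into a DataFrame
```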
Step 2: Initializing Variables
In this step, we will initialize a deque that holds the unscraped URLs, a set for the URLs we have already scraped, and a set for the emails scraped successfully from the websites.
Duplicate elements are not allowed in a set, so they are all unique.
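A minimal sketch of this initialization, assuming a hypothetical starting URL:

```python
from collections import deque

original_url = "https://example.com"  # hypothetical starting point

unscraped = deque([original_url])  # URLs still waiting to be visited
scraped = set()                    # URLs we have already processed
emails = set()                     # a set silently drops duplicate addresses
```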
Step 3: Starting the Scraping Process
- The first step is to distinguish between the scraped and unscraped URLs. We do this by moving each URL from the unscraped queue to the scraped set as we visit it.
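A minimal sketch of that hand-off, using a deque of unscraped URLs and a set of scraped ones:

```python
from collections import deque

unscraped = deque(["https://example.com"])  # hypothetical queue
scraped = set()

url = unscraped.popleft()  # take the next URL to visit...
scraped.add(url)           # ...and mark it as already scraped
```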
- The next step is to extract data from different parts of the URL. For this purpose, we will use urlsplit.
urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).

I can’t show my own sample input and output for urlsplit() here for confidentiality reasons, but once you try it, the code will ask you to input a value (a website address), and the output will display a SplitResult() containing five attributes.
This will allow us to get the base and path parts of the website URL.
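Here is a sketch with a hypothetical URL showing the five attributes and how the base and path are derived from them:

```python
from urllib.parse import urlsplit

url = "https://example.com/contact?ref=footer"  # hypothetical input
parts = urlsplit(url)
# SplitResult(scheme='https', netloc='example.com', path='/contact',
#             query='ref=footer', fragment='')

base_url = "{0.scheme}://{0.netloc}".format(parts)              # site root
path = url[: url.rfind("/") + 1] if "/" in parts.path else url  # for relative links
```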
- Now it is time to send an HTTP GET request to the website.
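A hedged sketch of this step — the URL is a placeholder, and the try/except keeps the crawl loop alive when a page is unreachable:

```python
import requests

url = "https://example.com"  # placeholder; any URL from the unscraped queue

try:
    response = requests.get(url, timeout=10)
    html = response.text
except requests.exceptions.RequestException:
    html = ""  # skip pages that cannot be fetched or time out
```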
- To extract the email addresses, we will use a regular expression and then add the matches to the email set.
Regular expressions are of massive help when you want to extract the information of your own choice. If you are not comfortable with them, you can have a look at Python RegEx for more details.
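The pattern below is a simple sketch that covers common email formats; real-world addresses can be more varied, so treat it as a starting point:

```python
import re

# Case-insensitive pattern: local part, "@", domain, dot, top-level domain.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)

text = "Contact sales@example.com or support@example.org for details."
emails = set(EMAIL_RE.findall(text))  # the set drops any duplicates
```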
- The next step is to find all linked URLs to the website.
The <a href=""> tag indicates a hyperlink and can be used to find all the linked URLs in the document.

Then we will find the new URLs and add them to the unscraped queue if they are in neither the scraped set nor the unscraped queue.

When you try the code yourself, you will notice that not all links can be scraped, so we also need to exclude them.
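A sketch of that link-gathering step on a small hypothetical page, making relative links absolute and excluding unscrapable ones such as mailto: links:

```python
from collections import deque
from bs4 import BeautifulSoup

html = """<html><body>
<a href="/contact">Contact</a>
<a href="https://example.com/about">About</a>
<a href="mailto:hi@example.com">Email us</a>
</body></html>"""  # hypothetical page content

base_url = "https://example.com"
scraped, unscraped = set(), deque()

soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a"):
    link = anchor.get("href", "")
    if link.startswith("/"):
        link = base_url + link  # make a relative link absolute
    elif not link.startswith("http"):
        continue                # exclude mailto:, javascript:, and similar
    if link not in scraped and link not in unscraped:
        unscraped.append(link)
```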
Step 4: Exporting Emails to a CSV file
To analyze the results more easily, we will export the emails to a CSV file.
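A sketch of the export, using hypothetical addresses since real scraped ones can’t be shown:

```python
import pandas as pd

emails = {"sales@example.com", "support@example.org"}  # hypothetical results

df = pd.DataFrame(sorted(emails), columns=["Email"])
df.to_csv("email.csv", index=False)  # one address per row, with a header
```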
If you are using Google Colab, you can download the file to your local machine with files.download() from the google.colab module.

As already explained, I can’t show the scraped email addresses due to confidentiality issues.
[Disclaimer! Some websites don’t allow web scraping and use very intelligent bots that can permanently block your IP, so scrape at your own risk.]
Complete Code
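Putting the steps together, here is a sketch of the full crawler. The function name and the max_pages safety limit are my additions; adjust both to your needs:

```python
import re
from collections import deque
from urllib.parse import urlsplit

import requests
from bs4 import BeautifulSoup

# Simple pattern for common email formats (see Step 3).
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)


def scrape_emails(start_url, max_pages=100):
    """Crawl pages reachable from start_url and collect email addresses."""
    unscraped = deque([start_url])
    scraped, emails = set(), set()

    while unscraped and len(scraped) < max_pages:
        url = unscraped.popleft()   # move the URL from unscraped...
        scraped.add(url)            # ...to scraped

        parts = urlsplit(url)
        base_url = "{0.scheme}://{0.netloc}".format(parts)
        path = url[: url.rfind("/") + 1] if "/" in parts.path else url

        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException:
            continue  # skip pages that cannot be fetched

        emails.update(EMAIL_RE.findall(response.text))

        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a"):
            link = anchor.get("href", "")
            if link.startswith("/"):
                link = base_url + link   # absolute path on the same site
            elif not link.startswith("http"):
                if ":" in link:
                    continue             # mailto:, javascript:, tel:, etc.
                link = path + link       # resolve against the current path
            if link not in scraped and link not in unscraped:
                unscraped.append(link)

    return emails
```

Call it as scrape_emails("https://example.com", max_pages=50), then export the returned set with pandas as shown in Step 4.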
Wrapping Up
In this article, we explored one more wonder of web scraping with a practical example: scraping email addresses. We took a straightforward approach, building our web crawler with Python and its simple yet powerful library BeautifulSoup. Web scraping can be of massive help if done responsibly and with your requirements in mind. Although the code we wrote for scraping email addresses is very simple, it is totally free of cost, and you don’t need to rely on other services. I tried my best to keep the code as simple as possible and also left room for customization, so you can optimize it according to your own requirements.
If you are looking for proxy services to use during your scraping projects, don’t forget to look at ProxyScrape’s residential and premium proxies.
That was all for this article. See you in the next ones!