
Monday, July 3, 2017

Web Scraping with Python


The main modules needed for web-related operations are urllib (with its urlopen function), requests and selenium. Besides these, there are other supporting modules such as beautifulsoup and webbrowser. So, let's start with urllib -

Follow the code below closely -

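import urllib.request  # Python 3; in Python 2 this function was urllib.urlopen

# Fetch the page and print it line by line
ts = urllib.request.urlopen("http://pythonbeginner.in")
for textscript in ts:
    # Each line arrives as bytes, so decode it before printing
    print(textscript.decode("utf-8", errors="replace"))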

This code fetches the content of a web page and displays it line by line. It is as simple as that. In web scraping, we have to find a given piece of information in a web page. The projects I have got are mostly something like this: visit a website on a regular basis and update the price of a product accordingly. It can also be useful in share market auto trading, etc.
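To make the price-tracking idea a little more concrete, here is a minimal sketch of pulling a price out of already-downloaded HTML. The markup in SAMPLE_HTML, the pattern and the function name extract_price are all assumptions made up for illustration; a real shop page will look different and will need its own pattern.

```python
import re

# Hypothetical page fragment; a real site will use different markup.
SAMPLE_HTML = '<div class="product"><span class="price">$19.99</span></div>'

# Assumed pattern: a dollar price inside a <span class="price"> element.
PRICE_PATTERN = re.compile(r'<span class="price">\$([0-9]+(?:\.[0-9]+)?)</span>')


def extract_price(html):
    """Return the first price found in html as a float, or None if absent."""
    match = PRICE_PATTERN.search(html)
    if match is None:
        return None
    return float(match.group(1))


print(extract_price(SAMPLE_HTML))
```

Run on a schedule, a function like this could compare the new value with the last stored one and update it when the price changes.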
Suppose we have to find how many times the word 'python' is used in the page's source. What do we do then? Web scrapers often face problems like this. Let's look at the second code -

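import re
import urllib.request  # Python 3; in Python 2 this function was urllib.urlopen

ts = urllib.request.urlopen("http://pythonbeginner.in")
count = 0
for raw_line in ts:
    line = raw_line.decode("utf-8", errors="replace")  # lines arrive as bytes
    # Count each line that contains 'python' at least once
    if re.search('python', line):
        print(line)
        count += 1
print(count)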

Remember the code from the Regular Expression entry. Here I used the same approach to look for the word 'python'. Note that it counts matching lines, so a line containing the word twice is counted once, and the match is case-sensitive. It reported 38 for the website's main page😮.
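If you want every occurrence rather than every matching line, re.findall is one way to get it. This is only a sketch: the helper name count_word and the sample text are my own assumptions, and the sample string stands in for a downloaded page so that the example runs without a network connection.

```python
import re


def count_word(text, word):
    """Count every occurrence of word in text, ignoring case."""
    return len(re.findall(re.escape(word), text, flags=re.IGNORECASE))


# Sample text standing in for a downloaded page
sample = "Python is fun.\nLearn python with Python Beginner.\nNo match here."
print(count_word(sample, "python"))
```

For a real page, you could pass in the decoded page text instead of the sample string.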

You can open any page in your default browser with the module webbrowser. Look at the example -

import webbrowser
webbrowser.open("https://www.yahoo.com")  # include the scheme so the string is treated as a web address

For dynamic web pages, where the content is created by AJAX, JavaScript or any other run-time program, web scraping becomes more difficult. In the examples above we are fetching a static web page, not a dynamic one. To get dynamic values we need to use selenium. I will discuss it in the next blog entry. Wait for it.

