Search Engine With Intelligent Web Spider

A search engine is a tool designed to search for information on the World Wide Web. The search results are usually presented in a list and are commonly called hits. The information may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input.
1. Functional Requirements:
    The functional requirements of a search engine consist of the following four blocks:
A Web Crawler or Spider that follows HTML links in Web pages to gather documents.
An Indexer that indexes the documents crawled using some indexing rules and saves the indexed results for searching.
A Query Engine that performs the actual search and returns ranked results.
An Interface that allows users to interact with the query engine.
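As a rough illustration of how the indexer and query engine fit together, the minimal sketch below builds an in-memory inverted index and answers a query against it. The function names and data layout here are assumptions for illustration only, not part of the proposed design.

```python
# A minimal in-memory inverted index; names and structure are illustrative only.
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the document ids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())
    return results

docs = {1: "web search engine", 2: "web crawler spider", 3: "search results ranking"}
idx = build_index(docs)
print(search(idx, "web search"))   # {1}
```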
2. Project Scope:
This software project will cover two major areas:

1. Database population (with the spider/crawler)

2. A web interface to search the database.

The responsibility of the first module will be to move like a spider from one site to another and to find the keywords of each web page. Keywords will be found by filtering out stop words such as a, an, the, that, is, are, etc. The remaining keywords will then be stemmed using a stemming algorithm, so that, for example, "Management" becomes "Manage", "Meetings" becomes "Meet", and "Cats" becomes "Cat". These stemmed words will be stored in the database. While moving from site to site, the software will respect the privileges set for crawlers on each site by examining robots.txt (a file usually placed at the root of a domain that describes which parts of the site may be used). The software will also handle cycles while moving from site to site and will count the number of times a given web site is cited, i.e. it will perform statistical analysis. On the basis of these statistics, the system will organize the found keywords into manageable clusters/categories.
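The sketch below illustrates the per-page steps described above (robots.txt check, stop-word removal, stemming, and link extraction). It assumes a small sample stop-word list and uses a crude suffix-stripping stemmer as a stand-in for a full stemming algorithm such as Porter's; all names here are illustrative.

```python
# A sketch of the spider's per-page processing: robots.txt check, stop-word
# removal, naive stemming, and link extraction. The stemmer is a crude
# stand-in for a real stemming algorithm; the stop-word list is a sample.
import re
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlparse

STOP_WORDS = {"a", "an", "the", "that", "is", "are", "of", "to", "and", "in"}

def allowed_by_robots(url, user_agent="*"):
    """Check the site's robots.txt before fetching a page."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()
    return rp.can_fetch(user_agent, url)

def crude_stem(word):
    """Very rough suffix stripping: Meetings -> meet, Cats -> cat."""
    word = word.lower()
    for suffix in ("ment", "ings", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_keywords(html):
    """Strip tags, drop stop words, and stem whatever remains."""
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"[A-Za-z]+", text)
    return {crude_stem(w) for w in words if w.lower() not in STOP_WORDS}

def extract_links(base_url, html):
    """Collect absolute link targets so the spider can move to the next page."""
    return {urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)}

# Example usage (network access assumed):
# if allowed_by_robots("http://example.com/index.html"):
#     html = urllib.request.urlopen("http://example.com/index.html").read().decode("utf-8", "ignore")
#     print(extract_keywords(html))
#     print(extract_links("http://example.com/index.html", html))
```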
The second module will be web based, and its responsibility will be to search the database maintained by the first module. It will take a query from the user and, before searching the database, stem the input using the same stemming algorithm as the first module.
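A minimal sketch of this search path follows, assuming the first module stores its stemmed keywords in an SQLite table keywords(word, url); the table name, schema, and function name are assumptions for illustration.

```python
# A sketch of the search module, assuming the spider stored stemmed keywords
# in an SQLite table keywords(word TEXT, url TEXT). Schema is an assumption.
import sqlite3

def search_keywords(db_path, query, stem):
    """Stem each query word with the same stemmer used by the spider,
    then return the URLs whose pages contain all of them."""
    conn = sqlite3.connect(db_path)
    try:
        results = None
        for word in query.split():
            rows = conn.execute(
                "SELECT url FROM keywords WHERE word = ?", (stem(word),)
            )
            urls = {row[0] for row in rows}
            results = urls if results is None else results & urls
        return results or set()
    finally:
        conn.close()

# Example, reusing the crude_stem() sketch from the first module:
# print(search_keywords("search.db", "management meetings", crude_stem))
```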

3. Significance of the Project:

This project can be used for the following tasks:

1. To test for bad links in the pages of a site. If someone uploads a site with a large number of pages, it is tedious to check the correctness of the site's hierarchy. This software will make it easy to test the correctness of the hierarchy and links by providing a visual structure/view.
It can also help to classify the links in a web page as good (i.e. valid and having a target page) or bad (i.e. invalid or without a target page); a sketch of this check appears after the second task below.

2. To search for copyright violations.

   By using this spider, we can quickly map out all of the pages/links contained in a web site. If a site points/links to copyrighted content hosted elsewhere, the violation can be caught by examining the structure of that site. For example, we can check whether a site tries to point/link to the *.mp3 files placed at http://www.apnaymp3.com/ illegally.
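A minimal sketch covering both tasks follows. It assumes a link counts as bad when the request for its target fails, and that links to *.mp3 files on apnaymp3.com are the ones to flag; the classification rule and function name are illustrative assumptions, not the project's final criteria.

```python
# Classify each link on a page as good or bad, and flag links pointing at
# *.mp3 files on apnaymp3.com. The "bad link" rule (request fails) is an
# assumption for illustration.
import re
import urllib.request
from urllib.parse import urljoin, urlparse

def classify_links(page_url, html):
    good, bad, suspect = [], [], []
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(page_url, href)
        parsed = urlparse(link)
        if parsed.netloc.endswith("apnaymp3.com") and parsed.path.lower().endswith(".mp3"):
            suspect.append(link)          # possible copyright violation
        try:
            urllib.request.urlopen(link, timeout=10)
            good.append(link)             # target page exists and responded
        except Exception:
            bad.append(link)              # 4xx/5xx, no target page, or unreachable
    return good, bad, suspect
```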

There are many other tasks where this type of project can be used.

4. Proposed Methodology:

We will study different methodologies for this project and select the one that gives the best results. In the starting phase, we will study and apply the following techniques/algorithms:

1. A stemming algorithm to extract keywords from web pages.

2. Object-oriented techniques for development.

3. Good indexing of database fields for fast searching.
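As an illustration of the third point, the sketch below creates an index on the word column of the assumed keywords(word, url) table so that lookups by stemmed keyword avoid a full table scan; the database and table names are carried over from the earlier sketches as assumptions.

```python
# Create an index on the keyword column of the assumed keywords(word, url)
# table, so searches by stemmed word use an index lookup instead of a scan.
import sqlite3

conn = sqlite3.connect("search.db")
conn.execute("CREATE TABLE IF NOT EXISTS keywords (word TEXT, url TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_keywords_word ON keywords (word)")
conn.commit()
conn.close()
```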

During the project, if any other algorithm satisfies the project requirements and our faculty supervisor, we will use it as well. The decision will be made on the basis of its effects.

 
