  1. Webcrawler
    1. What is a Webcrawler?
    2. What can it do for me?
    3. And what not?
  2. How does an indexer work and why do I get so much spam?
  3. Where can I get it and how do I install it?
  4. TODO


Webcrawler:

1.1. What is a Webcrawler?

A webcrawler, also called a (web) robot or spider, is a small program that follows hyperlinks on the internet and saves the pages it fetches for later use.
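To make the idea concrete, here is a minimal sketch of such a crawl loop in Python. It is not how Larbin is implemented; it only illustrates the principle of "fetch a page, save it, follow its links". The names crawl, LinkParser and FAKE_WEB are made up for this example, and the pages come from a small in-memory dictionary instead of real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, save it, queue its links."""
    saved = {}                      # url -> page content, kept for later use
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:            # page could not be fetched
            continue
        saved[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved


# A tiny in-memory "web" (hypothetical example.org pages) instead of real HTTP.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": "no links here",
}

pages = crawl("http://example.org/", FAKE_WEB.get)
```

The fetch argument is any function that returns the page content for a URL, or None on failure. A real crawler would plug in an HTTP client there and would also need to respect robots.txt and rate limits, which this sketch leaves out.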

1.2. What can it do for me?

Take my real-life example: I use Larbin as a webcrawler, combined with a
self-made indexer that parses the websites into my database. Some example scripts, like extract_mails, show how simple it can be to fetch millions of mail addresses. With small modifications to your query it is possible to get music, movies, documents, addresses etc.

Larbin uses an asynchronous-capable DNS client library, which is very fast ...
Snip:

Larbin should be able to fetch more than 100 millions pages on a standard PC.
The current version of Larbin can fetch 5,000,000 pages a day on a standard PC, but this speed mainly depends on your network.


1.3. And what not?

Larbin is just a webcrawler: it can fetch any information from the web for you, but it does not index it into a database. If you would like to have all these mail addresses, mp3s, divx, mpgs etc., you must write some code yourself.

2. How does an indexer work and why do I get so much spam?

Simply put, the job of an indexer is to save all this content into a database. But it can do a lot more, for example sort the data and extract the relevant content.
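As a rough illustration of that job, the following Python sketch stores fetched pages in a database and extracts one kind of "relevant content", namely mail addresses. It is not the code of webtools4larbin; the table names, the index_pages function and the deliberately simple MAIL_RE pattern are assumptions made for this example, and SQLite stands in for a real database server. It also shows why you get so much spam: any address published as plain text on a page is trivial to collect.

```python
import re
import sqlite3

# Deliberately simple pattern; real-world address syntax is more complex.
MAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def index_pages(db, pages):
    """Store each page and the mail addresses found in its text."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS mails "
        "(address TEXT, url TEXT, UNIQUE(address, url))"
    )
    for url, content in pages.items():
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, content))
        for address in set(MAIL_RE.findall(content)):
            # Lower-case the address so duplicates collapse into one row.
            db.execute(
                "INSERT OR IGNORE INTO mails VALUES (?, ?)",
                (address.lower(), url),
            )
    db.commit()


# Hypothetical crawler output: url -> page text.
pages = {
    "http://example.org/contact.html":
        "Write to webmaster@example.org or WEBMASTER@example.org",
    "http://example.org/about.html": "No address on this page.",
}
db = sqlite3.connect(":memory:")
index_pages(db, pages)
rows = db.execute("SELECT address, url FROM mails ORDER BY address").fetchall()
```

Once the data is in a database, sorting and filtering it is just a matter of changing the query, which is what makes the indexer the more powerful half of the pair.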

3. Where can I get it and how do I install it?

The webcrawler Larbin is open source, and my set of PHP and Perl scripts for
handling Larbin's output, called "webtools4larbin", is open source as well.
I use applications that support MySQL and PostgreSQL, but for commercial use I have
a DB abstraction layer that can handle almost any type of database.

Larbin (Download here)

For the database indexer, please check my little project:

http://freshmeat.net/projects/webtools4larbin

FAQ:
Why do you describe how to get millions of email addresses?

I think it is necessary to understand how a webcrawler/indexer works before we can think about systems that protect you from spam.

More questions?

Why not leave a short comment? ;)


Have a lot of fun

nfo