[squid-users] Squid URL list -- as search engine helper?

From: <[email protected]>
Date: 01 Nov 2002 20:26:27 -0500

I have a slight variation on the "what URLs does squid know about"
question... It would be useful to use squid to reduce web-crawling
overhead for a search engine, not merely by direct caching, but by
having secondary indexers get an explicit list of what pages
are "free" to fetch.

I've come up with a few ways of doing this, all of which have flaws:

1) just export the squid logs (a parsing sketch follows the list)
   * not incremental
   * not accurate - they show what has been seen, but not what is
     still around
2) use a redirect_program (a helper sketch follows the list)
   * performance risk
   * *only* incremental
   * requires new programs on the cache box
3) squidclient cachemgr:objects
   * not incremental (but fast enough to make this less of a problem)
   * only has MD5 hashes for objects that aren't still in memory
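
For what it's worth, option 1 is nearly trivial to script. Here is a
minimal sketch, assuming the default native access.log format where
the URL is the seventh whitespace-separated field (a custom log
format will differ):

#!/usr/bin/env python
# Sketch for option 1: pull the unique URLs out of squid's native
# access.log.  Assumes the default log format:
#   time elapsed client code/status bytes method URL ident hierarchy type
import sys

def urls_from_access_log(path):
    seen = set()
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 7:
                seen.add(fields[6])   # fields[5] is the method, [6] the URL
    return seen

if __name__ == "__main__":
    for url in sorted(urls_from_access_log(sys.argv[1])):
        print(url)

This inherits both flaws above, of course: it only runs in batch, and
a URL in the log may long since have been purged from the cache.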
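
Option 2 needs hardly any code either. A sketch of a pass-through
helper, assuming the classic redirector protocol (squid writes one
request per line, roughly "URL client_ip/fqdn ident method", and
blocks until the helper answers; the log path here is made up):

#!/usr/bin/env python
# Sketch for option 2: a redirect_program that records every URL it is
# asked about and rewrites nothing.  Echoing the URL back unchanged
# tells squid "no rewrite".
import sys

SEEN_LOG = "/var/log/squid/seen-urls"   # hypothetical path

def main():
    log = open(SEEN_LOG, "a", buffering=1)   # line-buffered
    for line in sys.stdin:
        fields = line.split()
        if not fields:
            continue
        url = fields[0]
        log.write(url + "\n")
        sys.stdout.write(url + "\n")   # echo unchanged: no rewrite
        sys.stdout.flush()             # squid blocks on this answer

if __name__ == "__main__":
    main()

The performance risk is real, though: squid serializes requests
through each helper process, so you would want to raise
redirect_children and keep the helper this dumb.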

The last option was the most interesting until I discovered the
in-memory distinction -- I've learned a bunch more about squid
internals in the process, though :-) Basically, vm_objects lists
every object whose URL is still held in memory; objects lists
everything, but for the purely on-disk objects it can report only the
hash key (since that's all squid keeps in memory for them).
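
To make that concrete, here is roughly how the two lists can be
pulled apart -- a sketch only, since the report layout (a "KEY <md5>"
line per entry, with an indented request line carrying the URL when
squid still has it in memory) is assumed from squid 2.x output and
may vary by version:

#!/usr/bin/env python
# Sketch: fetch a cache manager object report via squidclient and pair
# each MD5 store key with its URL, where one is available.
import subprocess

def manager_report(page, host="localhost", port=3128):
    result = subprocess.run(
        ["squidclient", "-h", host, "-p", str(port), "mgr:" + page],
        capture_output=True, text=True, check=True)
    return result.stdout

def parse_entries(report):
    entries = []
    for line in report.splitlines():
        fields = line.split()
        if line.startswith("KEY ") and len(fields) == 2:
            entries.append({"key": fields[1], "url": None})
        elif entries and len(fields) >= 2 and fields[0] in ("GET", "HEAD", "POST"):
            entries[-1]["url"] = fields[1]
    return entries

if __name__ == "__main__":
    # With mgr:vm_objects every entry should carry a URL; with
    # mgr:objects the disk-only entries fall back to the bare key.
    for entry in parse_entries(manager_report("objects")):
        print(entry["url"] or entry["key"])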

This approach might be salvageable, for example, if there were a way
[which I haven't found] to retrieve a document by cache-key instead of
filename -- the cache knows what URL the cache-key maps to once it
opens the file, after all.

Any thoughts? Suggestions for other approaches? Of course I prefer
built-in features, since it is easier to get administrative
cooperation for config-file changes than for installing new programs.

> You can't. Not even Squid knows this. Squid only knows the MD5
> hashes of all URLs. MD5 is used to conserve memory and speed up
> lookups. (An MD5 digest is 16 bytes, while a URL is anything from 9
> bytes to several KB.)
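
The quoted size argument is easy to check. A sketch below -- with the
caveat that this is not squid's exact store key (as far as I can tell
squid mixes a method-id byte into the hash, so a plain MD5 of the URL
will not match the cache manager's KEY lines):

#!/usr/bin/env python
# Illustrates the memory arithmetic: an MD5 digest is a fixed 16 bytes
# no matter how long the URL is.
import hashlib

for url in ("http://a.b/", "http://example.com/" + "x" * 2000):
    digest = hashlib.md5(url.encode()).digest()
    print("%4d-byte URL -> %2d-byte key %s..." %
          (len(url), len(digest), digest.hex()[:12]))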