Hi,
I am just working on a tool to dump information from the squid cache.
It is fairly simple, but is intended to be the foundation for other tools
that work with squid. For example, I want to use it to generate an HTML
file with a list of all the URLs in the cache sorted by last modification
time on the original server.
The tool is called squid-cat. It is one original source file, plus the
store_metaswap.c file from squid itself. It has these major options:
squid-cat -s, prints the Squid "meta" information about the cache file.
The meta information includes the timestamp, lastmod times from the
original server and also the URL for the file.
squid-cat -h, prints the HTTP headers from a cache file.
squid-cat -b, prints the body (as in the HTML for the URL) of the cache file.
Be careful with this one: if the URL is an image, you will get binary data
as output.
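To give an idea of what -s has to do, here is a rough sketch (not the
actual squid-cat source) of reading the meta section. It assumes the
squid-2 on-disk layout from store_metaswap.c: a one-byte magic (0x03),
a native-endian int holding the total length of the meta prefix, then
type/length/value triples. The type codes and the STD field order are
my reading of the squid-2 headers, so check them against your version:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STORE_META_OK  0x03     /* first byte of every swap file */
#define STORE_META_URL 4        /* TLV type: the object's URL */
#define STORE_META_STD 5        /* TLV type: timestamp/lastref/expires/lastmod, ... */

static void
dump_meta(const char *path)
{
    FILE *fp = fopen(path, "rb");
    char magic, type;
    int buflen, length, off;
    if (fp == NULL) {
        perror(path);
        return;
    }
    if (fread(&magic, 1, 1, fp) != 1 || magic != STORE_META_OK
        || fread(&buflen, sizeof(int), 1, fp) != 1) {
        fprintf(stderr, "%s: not a squid swap file?\n", path);
        fclose(fp);
        return;
    }
    off = 1 + (int) sizeof(int);
    while (off + 1 + (int) sizeof(int) <= buflen) {
        char *val;
        if (fread(&type, 1, 1, fp) != 1)
            break;
        if (fread(&length, sizeof(int), 1, fp) != 1 || length < 0)
            break;
        off += 1 + (int) sizeof(int) + length;
        if (off > buflen)
            break;              /* corrupt entry */
        val = malloc(length + 1);
        if (val == NULL || fread(val, 1, (size_t) length, fp) != (size_t) length) {
            free(val);
            break;
        }
        val[length] = '\0';
        if (type == STORE_META_URL) {
            printf("url: %s\n", val);
        } else if (type == STORE_META_STD
            && length >= 4 * (int) sizeof(time_t)) {
            /* first four fields: timestamp, lastref, expires, lastmod */
            time_t *t = (time_t *) val;
            printf("timestamp: %s", ctime(&t[0]));
            printf("lastmod: %s", ctime(&t[3]));
        }
        free(val);
    }
    /* the stored HTTP reply headers and body follow the meta prefix */
    fclose(fp);
}

int
main(int argc, char **argv)
{
    int i;
    for (i = 1; i < argc; i++)
        dump_meta(argv[i]);
    return 0;
}

Building the real tool against squid's own store_metaswap.c, as
squid-cat does, avoids hard-coding those constants.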
It is meant to be used like this:
cd /var/spool/squid/cache # or wherever you keep them
find . -type f -print | xargs squid-cat -sh
to print all the squid and HTTP headers for all files in the squid cache.
I am not happy with the output format yet; it prints "key: value" lines.
It would be more useful if it accepted a list of wanted fields and
printed one value per column. Then shell/awk/perl scripts could
process the output more easily.
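For example, with a hypothetical -f option taking a comma-separated
field list (squid-cat does not have this yet), and assuming lastmod
came out as epoch seconds in the first column, the sorted URL list I
mentioned above would reduce to:

cd /var/spool/squid/cache
find . -type f -print | xargs squid-cat -f lastmod,url | sort -n | awk '{ print $2 }'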
In your case, you would use squid-cat -h to find the cached files that
contain HTML (by looking for Content-Type: text/html). Then use
squid-cat -s to get the URLs for those files, and squid-cat -b to get
their HTML bodies and feed them into an indexing engine.
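With the current "key: value" output, that workflow might look
something like this untested sketch (your-indexer stands in for
whatever indexing engine you feed; the url: label matches the -s
output described above):

cd /var/spool/squid/cache
find . -type f -print | while read f; do
    squid-cat -h "$f" | grep -qi 'Content-Type: text/html' && echo "$f"
done > /tmp/html-files
while read f; do
    squid-cat -s "$f" | grep '^url:'
    squid-cat -b "$f"
done < /tmp/html-files | your-indexer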
If you want to check it out, just ask.
Brian Beuning
bbeuning@mindspring.com
"Blair, Bill" wrote:
> I had a long look through the FAQs, but could not find an answer to the
> following:
>
> > I sent the original message to squid@ircache.net, Alex Rousskov was
> > kind enough to respond and suggest sending to
> > squid-users. He thought that "Squid does not have such a
> > functionality. It is possible to write a stand-alone program that will
> > search through the cache for pages satisfying some criteria. However,
> > that is not trivial. You will save yourself a lot of
> > time if somebody else has already written such a tool."
> >
> original message:
>
> > hi
> >
> > sorry if this is the wrong approach to asking this question about squid.
> >
> > I wonder if your software, or a related product, offers the following
> > functionality. Does it have a search engine capable of searching
> > through the content of pages that are contained in the cache?
> >
> > My situation is that we have a number of clients that are proxying
> > through an HTTP proxy which is then connected to the wider network.
> > I'd like to offer a search engine on the proxy/cache that will
> > search the cache, rather than the wider network, for documents that
> > meet the search criteria. Preferably, the client should be given
> > the original URL of these pages so that they can force a renewed
> > load if the page has become outdated.
> >
> > If no documents are found, then the client can perhaps connect to
> > an external search engine to search the wider network.
> >
> > Our situation is generated by the client being on the end of a
> > moderately fast satellite broadcast. Some clients will have a
> > low-rate phone connection to request items to be put onto the
> > broadcast, but not everyone will have this facility. There is a
> > proxy/cache monitoring all transmissions on the broadcast, so the
> > receive-only sites can get access to documents requested by other
> > (bi-directional) sites, but they need a search engine into the
> > proxy cache to know what might lurk there.
> > hope you can help
> >
> > cheers
> >
> > Bill Blair