Re: Adult Sites

From: Brian Ristuccia <[email protected]>
Date: Tue, 23 Feb 1999 23:19:46 -0500

On Wed, Feb 24, 1999 at 11:41:13AM +0800, David Luyer wrote:
>
> Josh Kuperman wrote:
>
> > You could try to filter the regular expression "sex" which would stifle
> > about 10%.
>
> Is that 10% of sex sites, or 10% of the net's legitimate content?
>

Most likely 10% of the net's legitimate content, and less than 1% of all sex
sites...

> Placenames such as Essex, Sussex and Middlesex, programming references
> such as the header file 'bytesex.h', links to useful places like Sexual
> Abuse Recovery, Sexual Harassment sites, Sexuality Information Department,
> articles about sextuplets, the benefits of de-sexing of pets, ... and so
> on and so on.
>
> There are appropriate words that can be used to filter out sex sites.
> 'sex' just isn't one of them.
>

I urge you both to rethink your strategies. Site blocking by keywords, even
very carefully chosen ones, is random at best. Whether you opt for keyword
matching in the URL, filename, or document text, the risks are grave that
you will inadvertently block a large percentage of the Internet while still
missing many of the adult sites you intended to block.

keyword  filename         document text
-------  ---------------  ----------------------------------------------------
ass      assemble.html    "..or the right of the people peaceably to
                          assemble.."

tit      petition.html    "..and to petition the government.."

fuck     fucking-ie.html  "..After a week of debugging the proxy system, we
                          tracked the problem to yet another fucking bug in IE.
                          This patch will allow the proxy to work around the
                          problem."

breast   breast.gif       "Yesterday, we found the breast possible solution to
                          the problem that was causing documents to be
                          incorrectly cached."

In cases 1 and 2, the "adult" keyword gets matched as a substring of another
word. In case 3, it's used as an expletive by someone who's angry about
having to work around someone else's broken software again. In case 4's
document text, we have a simple typing error, where someone inadvertently
entered "breast" instead of "best". In case 4's filename, we have a picture
from a chicken recipe or women's health site.
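
To make the failure concrete, here is a minimal sketch of the kind of
substring filter being discussed, run against the examples from the table.
The function name blocked_by and the KEYWORDS list are made up for
illustration; this is not how any particular proxy or filtering product
does it.

```python
# Hypothetical blocklist, taken from the keyword column of the table above.
KEYWORDS = ["ass", "tit", "fuck", "breast"]


def blocked_by(text, keywords=KEYWORDS):
    """Return the keywords found anywhere in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


# Filenames and document text from the table; none of these is an adult site.
samples = [
    "assemble.html",
    "petition.html",
    "the right of the people peaceably to assemble",
    "we found the breast possible solution to the problem",
]

for sample in samples:
    print(sample, "->", blocked_by(sample))
```

Every sample comes back with a keyword attached, so a filter built this way
would block all of them.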

> (I keep seeing this suggestion. It really _isn't_ a good idea.)
>

Unfortunately, neither would the use of any other words. No matter how "porn
only" a word may sound, it's often just one or two characters away from a
commonly used word, a commonly found substring (like the Essex, Sussex,
Middlesex examples you gave), or used as an expletive by casual programmers.
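
One refinement someone might suggest is to match keywords only as whole
words. The sketch below assumes that approach (the helper name
whole_word_match is hypothetical, and nobody in this thread proposed it). It
gets rid of the placename matches, but the typing-error and recipe-picture
cases are whole words already, so they still get caught.

```python
import re


def whole_word_match(keyword, text):
    """True if keyword appears in text as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


# Substring cases no longer match once word boundaries are required.
print(whole_word_match("sex", "Essex, Sussex and Middlesex"))
print(whole_word_match("ass", "assemble.html"))

# A typo or an innocent filename is a whole word, so it still matches.
print(whole_word_match("breast", "we found the breast possible solution"))
print(whole_word_match("breast", "breast.gif"))
```

The first two calls report no match and the last two report a match, so
word boundaries only address one of the failure modes above.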

-- 
Brian Ristuccia
brianr@osiris.978.org
brianr@debian.org
bristucc@cs.uml.edu
Received on Tue Feb 23 1999 - 20:59:23 MST
