Identifying and blocking Pornography web-sites

The keyword filtering feature allows you to block pages which may contain inappropriate content using a scoring system.

Post by Manish » Fri Jun 30, 2006 1:27 pm

Most web-sites that serve pornographic content try very hard to promote themselves on search engines.

Search engines list and classify web-sites by "looking into" the keywords and description meta tags and the page title.

A typical well-crafted web-page has the following tags in the head section:
Code: Select all
<meta name="keywords" content="relevant,keywords,for,web-page,promotion">
<meta name="description" content="A short description of the page content. It could be approximately 200 to 250 characters">
<title>The title of the web-page, approximately 30 to 60 characters. </title>


SafeSquid's Keyword filter can be used to decisively identify such web-sites.

Consider this PCRE:
Code: Select all
(<meta name=.*(keywords|description).*content=|<title>).*\b(all bad phrases|and|words|add as many|pipes as you wish)\b.*>


This expression identifies web-sites that use 'all bad phrases', 'and', 'words', 'add as many' or 'pipes as you wish' as phrases in any of the tags listed above.
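
To try the expression outside the proxy, here is a minimal Python sketch. It assumes Python's re module behaves closely enough to PCRE for this pattern. The terms 'badword' and 'bad phrase' are hypothetical placeholders, and the case-insensitive flag is an assumption of the sketch, not something stated in this post.

```python
import re

# Hypothetical placeholder terms; substitute your own list between the pipes.
BAD_TERMS = r"badword|bad phrase"

# Same shape as the expression above, with the placeholder list dropped in.
# re.IGNORECASE is an assumption made for this sketch.
PATTERN = re.compile(
    r"(<meta name=.*(keywords|description).*content=|<title>)"
    r".*\b(" + BAD_TERMS + r")\b.*>",
    re.IGNORECASE,
)

def is_flagged(line):
    """Return True when a single line of HTML matches the expression."""
    return PATTERN.search(line) is not None

samples = [
    '<meta name="keywords" content="cheap,badword,pictures">',
    '<title>A bad phrase in the title</title>',
    '<meta name="keywords" content="gardening,roses">',
]
for s in samples:
    print(is_flagged(s), s)
```

The first two samples are flagged and the third is not, because none of the placeholder terms appears in it.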

The trick is to create as many rules as required in the keyword-filter section and give each rule its own score. One popular approach is to set the overall threshold to 100, and then give each rule a weight between 0 and 100 according to how likely its words and phrases are to appear on objectionable web-sites.
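
The arithmetic behind the threshold can be sketched in a few lines of Python. This is only an illustration of the scoring idea, not SafeSquid's implementation: the terms and weights are hypothetical, and the sketch assumes each rule contributes its score once when it matches at all.

```python
import re

THRESHOLD = 100  # overall threshold suggested above

# (pattern, score) pairs; the terms and weights here are hypothetical.
RULES = [
    (re.compile(r"\b(strongterm)\b", re.IGNORECASE), 100),
    (re.compile(r"\b(likelyterm|another likely term)\b", re.IGNORECASE), 60),
    (re.compile(r"\b(weakterm)\b", re.IGNORECASE), 40),
]

def page_score(text):
    """Add up the score of every rule that matches at least once."""
    return sum(score for pattern, score in RULES if pattern.search(text))

def should_block(text):
    """Block when the accumulated score reaches the threshold."""
    return page_score(text) >= THRESHOLD

print(page_score("weakterm and likelyterm"), should_block("weakterm and likelyterm"))
print(page_score("only weakterm here"), should_block("only weakterm here"))
```

With these weights, one strong term blocks a page on its own, while weaker terms only block it in combination.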

We found Google's keyword analysis tool extremely useful for generating lists of phrases and words on any subject, including pornography.
This on-line tool is available at:
https://adwords.google.com/select/KeywordToolExternal

The tool also lets us create word-lists in different languages.

Yes, we can also create similar rules and PCREs to analyse the entire web-document, which is generally contained within the <BODY> </BODY> tags.

To cover a large word-list, we must group the words on the basis of probability and create multiple rules that together cover the whole list.
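
As a rough sketch of that grouping, the Python snippet below turns word groups into one expression per score. The groups, scores and the build_rule helper are hypothetical. The wrapper is the meta/title expression from this post, and re.escape is used so that punctuation inside a phrase is matched literally.

```python
import re

# Hypothetical word groups keyed by the score each group should carry.
GROUPS = {
    90: ["term one", "term.two"],
    50: ["term three", "term four"],
    20: ["term five"],
}

def build_rule(words):
    """Join a word list into one alternation inside the meta/title wrapper."""
    alternation = "|".join(re.escape(w) for w in words)
    return (r"(<meta name=.*(keywords|description).*content=|<title>)"
            r".*\b(" + alternation + r")\b.*>")

# One (expression, score) pair per group, highest score first.
rules = [(build_rule(words), score)
         for score, words in sorted(GROUPS.items(), reverse=True)]

for expr, score in rules:
    print(score, expr)
```

Each printed expression would then go into its own rule in the keyword-filter section, with the score shown beside it.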
-----------------
Manish Kochar
Manish
Site Admin
 
Posts: 1318
Joined: Wed Apr 14, 2004 9:09 pm
Location: Mumbai

blocking pornography

Post by Manish » Wed Jun 27, 2007 3:54 pm

Ok,

I did a little more research, and noticed that the web-masters of porn web-sites use a few more tricks, like omitting the "," or using plurals for the keywords.

I recommend these expressions as a better alternative to the earlier ones:
Code: Select all
(?!\<.*\<)(<meta name[\s]*=(\"|[\s])*(keyword(|s)|description(|s))(\"|[\s])*content(|s)(\"|[\s])*=(\"|[\s])*)[^<>]*(,|\b)(all|bad|words|and phrases)(,|\b)[^<>]*>


Notice the part 'all|bad|words|and phrases'?
You should replace just that much with your own list. The expression will then detect the presence of unacceptable words in the keywords and description meta tags.

The following variant of the above expression will detect the bad words even if they carry a plural suffix, so if you specified 'rose|bell|horse' it will detect rose, roses, bell, bells, belles, horse and horses:

Code: Select all
(?!\<.*\<)(<meta name[\s]*=(\"|[\s])*(keyword(|s)|description(|s))(\"|[\s])*content(|s)(\"|[\s])*=(\"|[\s])*)[^<>]*(,|\b)((all|bad|word|and phrase)(|s|es|ies))(,|\b)[^<>]*>

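The plural behaviour can be checked with a short Python sketch. It assumes Python's re module treats this pattern the same way as PCRE. Only the word list is substituted, the tag helper and its sample content are hypothetical, and re.IGNORECASE is an assumption of the sketch.

```python
import re

WORDS = "rose|bell|horse"  # the example list from the post

# The plural-aware expression, with only the word list substituted.
# re.IGNORECASE is an assumption of this sketch.
META_PLURAL = re.compile(
    r'(?!\<.*\<)(<meta name[\s]*=(\"|[\s])*(keyword(|s)|description(|s))'
    r'(\"|[\s])*content(|s)(\"|[\s])*=(\"|[\s])*)[^<>]*(,|\b)'
    r'((' + WORDS + r')(|s|es|ies))(,|\b)[^<>]*>',
    re.IGNORECASE,
)

def meta_hit(line):
    """Return True when one line of HTML matches the expression."""
    return META_PLURAL.search(line) is not None

def tag(word):
    """Build a hypothetical keywords tag containing `word`."""
    return '<meta name="keywords" content="garden,' + word + ',music">'

for w in ["rose", "roses", "bell", "bells", "belles", "horse", "horses", "rosemary"]:
    print(w, meta_hit(tag(w)))
```

In this sketch the seven forms listed above match, while 'rosemary' does not, because the word must end at a comma or a word boundary. One thing to watch: the leading (?!\<.*\<) lookahead rejects a meta tag that is followed by another tag on the same line, so a line such as tag("roses") + "<title>x</title>" is not matched here.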

The logic in the above two expressions can be extended to the page title, with this expression:

Code: Select all
<title>[^<>]*\b(all|bad|word|and phrase)(|s|es|ies)\b[^<>]*</title>

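A similar Python sketch for the title expression, again using the 'rose|bell|horse' list as a stand-in. The title_hit helper is hypothetical, and re.IGNORECASE is an assumption added so that tags written as <Title> or <TITLE> are also covered.

```python
import re

WORDS = "rose|bell|horse"  # stand-in list; replace with your own

# The title expression from the post, with only the word list substituted.
TITLE_RE = re.compile(
    r"<title>[^<>]*\b(" + WORDS + r")(|s|es|ies)\b[^<>]*</title>",
    re.IGNORECASE,  # assumption of this sketch
)

def title_hit(line):
    """Return the matched word with its suffix, or None when nothing matches."""
    m = TITLE_RE.search(line)
    return m.group(1) + m.group(2) if m else None

for line in [
    "<title>Wild horses of the west</title>",
    "<TITLE>Bells</TITLE>",
    "<title>Horseshoe crafts</title>",
]:
    print(title_hit(line), line)
```

The word boundaries keep 'Horseshoe' from matching, since 'horse' there is followed by more letters that are not one of the listed suffixes.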

I tested these expressions using various words generally found on pornographic web-sites.

I will make a post on links and image tags a little later.

Feel free to experiment, and let me know how it goes.
-----------------
Manish Kochar