Search engines list and classify web-sites by "looking into" the keywords and description meta tags, and the title tag.
A typical well-crafted web-page would have the following tags in the head section:
<meta name="keywords" content="relevant,keywords,for,web-page,promotion">
<meta name="description" content="A short description of the page content. It could be approximately 200 to 250 words">
<title>The title of the web-page, approximately 30 to 60 characters.</title>
SafeSquid's Keyword filter can be used to decisively identify such web-sites.
Consider this PCRE:
(<meta name=.*(keywords|description).*content=|<title>).*\b(all bad phrases|and|words|add as many|pipes as you wish)\b.*>
This expression very precisely identifies web-sites that use 'all bad phrases', 'and', 'words', 'add as many' or 'pipes as you wish' as phrases in any of the tags listed above.
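To see how the expression behaves, here is a minimal sketch that runs it against a few sample lines. It uses Python's re module rather than SafeSquid itself, so it only approximates what the keyword filter does. The IGNORECASE flag, the matched_phrase helper and the sample lines are assumptions made for illustration.

```python
import re

# The expression from the post, split across two string literals for readability.
# Assumption: IGNORECASE stands in for however the filter treats letter case.
PATTERN = re.compile(
    r"(<meta name=.*(keywords|description).*content=|<title>)"
    r".*\b(all bad phrases|and|words|add as many|pipes as you wish)\b.*>",
    re.IGNORECASE,
)


def matched_phrase(line):
    """Return the listed phrase found on this line, or None if nothing matches."""
    m = PATTERN.search(line)
    return m.group(3) if m else None


# Hypothetical sample lines, not taken from any real page.
sample_head = [
    '<meta name="keywords" content="some,words,here">',
    '<meta name="description" content="A harmless page about gardening">',
    '<title>Tips and tricks</title>',
    '<p>words outside the head tags</p>',
]

hits = [matched_phrase(line) for line in sample_head]
print(hits)
```

The last sample line contains 'words' but is not one of the targeted tags, so the expression ignores it. Note that '.' does not cross line breaks by default, so a tag split over several lines would not match in this sketch.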
The trick is to create as many rules as required in the keyword-filter section, and give each rule a different score. One popular way of working is to set the overall threshold to 100, and then give each rule a weight between 0 and 100, depending on how probable it is that its words and phrases appear in objectionable web-sites.
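The weighting idea can be sketched as below. This is not SafeSquid's implementation. It assumes that the scores of all matching rules are added up, that each rule counts once, and that a page is blocked when the total reaches the threshold. The rules, scores and names (RULES, page_score, is_blocked) are hypothetical, and a gambling word-list is used only as a neutral example.

```python
import re

THRESHOLD = 100  # overall threshold, as suggested in the post

# Hypothetical rules: (pattern, score). Higher score = stronger indicator.
RULES = [
    (re.compile(r"\b(online casino|poker bonus)\b", re.IGNORECASE), 100),
    (re.compile(r"\b(jackpot|roulette)\b", re.IGNORECASE), 50),
    (re.compile(r"\b(bet|win)\b", re.IGNORECASE), 20),
]


def page_score(text):
    """Sum the scores of every rule that matches (assumed scoring model)."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def is_blocked(text):
    """A page is blocked once its total score reaches the threshold."""
    return page_score(text) >= THRESHOLD


for sample in ("Gardening tips",
               "Win the jackpot at roulette",
               "Online casino: bet and win the jackpot"):
    print(sample, page_score(sample), is_blocked(sample))
```

With these weights, a single strong phrase blocks the page on its own, while weaker words only block it when several rules match together.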
We found Google's Keyword analysis tool to be extremely useful for generating lists of phrases and words for any subject, including pornography.
This is an on-line tool available at:
https://adwords.google.com/select/KeywordToolExternal
This tool allows us to create the word-lists in different languages.
Yes, we can also create similar rules and PCREs to analyse the entire web-document, as generally contained within the <BODY> </BODY> tags.
To cover a large word-list, we must group the words and phrases on the basis of probability, and create multiple rules that together cover the whole list.
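The grouping step could be sketched roughly like this. It is an illustration only: the word-list, the probabilities, the tier boundaries and the names (WORD_LIST, TIERS, build_rules) are all assumptions. The generated expressions follow the shape of the PCRE shown earlier in this post, and each one would then be pasted into its own keyword-filter rule with the matching score.

```python
import re

# Hypothetical word-list: phrase -> estimated probability that a page
# using it is objectionable. The figures are invented for illustration.
WORD_LIST = {
    "online casino": 0.95,
    "poker bonus": 0.90,
    "jackpot": 0.60,
    "roulette": 0.55,
    "bet": 0.20,
}

# Assumed tiers: (minimum probability, score given to the rule).
TIERS = [(0.9, 100), (0.5, 50), (0.0, 20)]

# Same shape as the PCRE in the post, with a gap for the phrase alternation.
PREFIX = r"(<meta name=.*(keywords|description).*content=|<title>).*\b("
SUFFIX = r")\b.*>"


def weight_for(prob):
    """Return the score of the first tier whose minimum the probability meets."""
    for minimum, weight in TIERS:
        if prob >= minimum:
            return weight
    return 0


def build_rules(word_list):
    """Group phrases by score and build one expression per group."""
    groups = {}
    for phrase, prob in word_list.items():
        groups.setdefault(weight_for(prob), []).append(re.escape(phrase))
    return {w: PREFIX + "|".join(sorted(p)) + SUFFIX for w, p in groups.items()}


rules = build_rules(WORD_LIST)
for weight in sorted(rules, reverse=True):
    print(weight, rules[weight])
```

re.escape is used so that a phrase containing characters with a special meaning in regular expressions is matched literally.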
