
HOW-TO proxy HTTP

Adrien Di Mascio


Table of Contents

The proxy recipe
http_parse_request
get_and_filter
Write_back_to_socket
memory.xml parameters
The spam.xml file
Known Bugs

Abstract

The aim of this module is to let users personalize their Web browsing through a proxy. The proxy works by instantiating the same recipe repeatedly, once per request.

To use the proxy, you must start "Engine" with the --socket-manager option.

The proxy recipe

The first thing to do is to configure your browser to go through the proxy when connecting to the Internet. For example, if you use Netscape, click on "Edit", then "Preferences". In the "Advanced" menu, choose "Proxies", then select "Manual proxy configuration". Now you have to tell the browser where your proxy is, by filling in the appropriate values in the text fields: for "HTTP Proxy" you would write 'localhost', and for "Port" the port number the proxy is listening on (see the "memory.xml parameters" section below).

Your browser will now go through the proxy. The "Web proxy" recipe is fired each time the browser sends a request. This recipe is made of four steps and two unconditional transitions:

http_parse_request

This action receives a request from the browser on the listening port. Its role is to extract and parse all the request information (requested URL, headers, content, ...). Once parsed, this information is written into Narval's memory so it can be passed on to the next actions.
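As an illustration (not Narval's actual code), the parsing performed by this action can be sketched in Python; the function name and the returned dictionary layout are assumptions:

```python
# Hypothetical sketch of what http_parse_request does: split a raw HTTP
# request into method, URL, version, headers and content. The real
# action writes these pieces into Narval's memory instead of returning
# them.
def parse_http_request(raw):
    head, _, content = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, url, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {"method": method, "url": url, "version": version,
            "headers": headers, "content": content}
```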

Once this step is done, two steps are fired simultaneously. We only describe one of them here, as the other merely enables logging and is not yet needed to use the proxy.

get_and_filter

This is the most important action in the recipe. It can be seen as a sequence of three different steps:

  • Retrieve the request specifications and the filtering mode from Narval's memory.

  • Then we read the request sent by the browser and retrieve the data at the requested URL. If translation mode is on (see the "memory.xml parameters" section below), we try to guess the document's language with the "guesslang" module from the "infopal" package (if this package is not installed, translation mode does not work) and, if necessary, we translate the requested document with Altavista's translation tool.

    For instance, if the original URL was:

    http://br.yahoo.com

    (Yahoo Brazil's home page), and the user only speaks English, then we change the URL so as to get the translated page:

    http://babel.altavista.com/urltrurl?lp=pt_en&url=http%3A%2F%2Fbr.yahoo.com

  • The last part of this action covers image and text filtering. There are different ways to filter images: by server name, by domain name, or according to special rules defined in the file spam.xml (see "The spam.xml file" section below).

    For image filtering, the idea is: once the HTML document is retrieved, we scan it and, whenever we see a reference to an image that must be filtered, we replace the remote reference with a local reference to a blank image.

    For text filtering, the idea is nearly the same, but the rules to apply are defined in the file spam.xml. These rules define a list of patterns to replace with other patterns.

  • Once the data has been processed, we store it in Narval's memory in an organized way (headers, content, ...).
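The URL rewriting used for translation in the second step above can be sketched as follows; the query format is taken from the Yahoo Brazil example, and the helper name is hypothetical:

```python
from urllib.parse import quote

# Hypothetical sketch of the translation rewrite: wrap the requested URL
# in a call to Altavista's translation tool. lang_pair is a code such as
# "pt_en" (translate Portuguese into English).
def translation_url(url, lang_pair):
    return ("http://babel.altavista.com/urltrurl?lp=%s&url=%s"
            % (lang_pair, quote(url, safe="")))
```

For the Yahoo Brazil example, translation_url("http://br.yahoo.com", "pt_en") produces a properly escaped translated-page URL.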

Write_back_to_socket

In this action, we retrieve the data stored in Narval's memory, which is the server's response filtered as requested, and send it back to the browser.
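A minimal sketch of this step, assuming the response pieces have already been read back from memory (the function name and argument layout are illustrative, not Narval's actual API):

```python
# Hypothetical sketch of write_back_to_socket: reassemble the filtered
# response (status line, headers, body) and send it to the browser's
# socket in one call.
def write_back(sock, status_line, headers, content):
    head = status_line + "\r\n"
    head += "".join("%s: %s\r\n" % (k, v) for k, v in headers.items())
    sock.sendall((head + "\r\n").encode("latin-1") + content)
```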

memory.xml parameters

The file memory.xml can contain various information related to Narval's use. For the specific use of the proxy, the following information is necessary:

  • The listening port number must be specified. To do this, create a "server-socket" element with two attributes: the port number and the recipe to instantiate each time a request is received. For example, we could have:

             
    <server-socket port="7777" recipe="proxy.Web proxy"/>
    With this, the proxy will listen on port 7777. You will then have to tell your browser that the proxy is listening on that port.

The main interest of this proxy is to filter images in Web pages, so you need to specify how you want the proxy to filter. The following settings are optional, but they represent the proxy's main interest. There are two kinds of specifications: the first concerns the filtering mode, the second the translation mode:

  • For filtering, you must add a "junkbuster-method" element. For example:

    <junkbuster-method value="18" />
               
    The "value" attribute's content depends on the filtering type you want. The recognized values are:
    • 1 for image filtering according to server name

    • 2 for image filtering according to domain name

    • 4 for image filtering according to the specific rules defined in spam.xml

    • 8 for text filtering according to the "filter" rules defined in spam.xml

    • 16 to enable on the fly translation when necessary

    • Any combination of these values. For example, 18 means 2 (image filtering by domain name) plus 16 (translation).

  • To list all spoken languages (only useful if translation mode is on), you have to add a "spoken-languages" element. Here is an example of what you could have in your memory.xml:

                
    <spoken-languages>
      <language name="italian" code="it" />
      <language name="english" code="en" />
      <language name="french" code="fr" />
    </spoken-languages>
    This means that you speak Italian, English and French. The proxy will then translate every page that does not match any of these languages. The order of the "language" elements matters: the proxy first tries to translate into the language of the first element, then the second, and so on.

    The code attribute's value is important too, because it is used as an identifier by Altavista's translation tool. For now, the recognized languages are:

    • English, code: "en". We can translate English into all the other mentioned languages.

    • German, code: "de". We can translate German into English and French.

    • Spanish, code: "es". We can translate Spanish into English.

    • Italian, code: "it". We can translate Italian into English.

    • Portuguese, code: "pt". We can translate Portuguese into English.

    • French, code: "fr". We can translate French into English and German.

    The name attribute's value does not matter.
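Since the junkbuster-method value above is a bitmask, the proxy presumably decodes it with bitwise tests. A sketch, with illustrative flag names (only the numeric values 1, 2, 4, 8 and 16 come from this documentation):

```python
# Hypothetical decoding of the junkbuster-method bitmask: each power of
# two enables one filtering or translation mode, and any sum of values
# enables the corresponding combination.
def decode_method(value):
    flags = []
    for bit, name in [(1, "filter-by-server"), (2, "filter-by-domain"),
                      (4, "spam-rules"), (8, "text-filter"),
                      (16, "translate")]:
        if value & bit:
            flags.append(name)
    return flags
```

For instance, decode_method(18) yields ["filter-by-domain", "translate"], matching the example value 18 given above.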

If you want to write another recipe to replace the proxy, you can specify your recipe's name in the "server-socket" element instead of "proxy.Web proxy".

The spam.xml file

This file contains the specific rules for image filtering. You can use different elements:

  • The "trash" and "allow" elements, which respectively deny or allow image downloads from specific sites.

  • The "rule" elements, which define a host and the addresses to deny or accept on its pages. For example:

               
    <rule host='.businessweek\.com.*' url='.*\/sponsors.*'/>
    will define the "businessweek.com" site and specify that only URLs containing "/sponsors" are concerned. Once these URLs have been matched by a rule element, nothing is filtered yet: you allow or deny them by putting the "rule" element inside a "trash" or an "allow" element.

    In the special case where you want to deny all requests to a specific host, and not only certain images, you don't have to specify a "url" attribute; only the "host" one is needed.

  • You can add "and" elements inside "trash" or "allow". An "and" element has a list of rules as children and means that all these rules must apply.

  • The "filter" elements are used for text filtering. They are composed of "text_match" and "replace_by" elements. For example,

                  
    <filter host='*.com'>
      <text_match>Internet</text_match>
      <replace_by>________</replace_by>
    </filter>
                  
               
    will replace the word "Internet" with "________" in all Web pages whose address ends with ".com".

    Text filtering is done directly on the HTML source, so be careful not to replace tag content when using this option.

Each "allow" or "trash" rule must be placed in a "policy" element; several rules can go in the same "policy" element. spam.xml must have a "spam-policy" root element, which holds the "policy" elements.
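The host and url attributes in the examples above look like regular expressions; matching a request against one rule can then be sketched as follows (the search semantics and the dictionary representation of a rule are assumptions):

```python
import re

# Hypothetical sketch of spam.xml rule matching: a rule applies when its
# host pattern matches the request's host and its url pattern, if any,
# matches the requested path. Whether the match allows or denies the
# request depends on the enclosing "allow" or "trash" element.
def rule_matches(rule, host, path):
    if not re.search(rule["host"], host):
        return False
    if rule.get("url"):
        return re.search(rule["url"], path) is not None
    return True
```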

Known Bugs

  • The translation tool only accepts groups of about 150 words, so in some cases it won't translate the whole page.

  • "POST" requests often cause problems, especially when using the translation tool.

  • Only images referenced by an "img" tag in the HTML source are filtered. This means that others (produced by JavaScript, Flash, ...) won't be filtered.

  • "Connection reset by peer" errors are mishandled.