Amit’s Web Proxy Project

Amit wrote a series of web proxies from 1997 to 2000, starting as part of a research project at Stanford. These proxies are old and may not work in current versions of Python without some modification.

Back in those days, some folk found Proxy 3 to be useful for their basic web surfing. Proxy 4 is more scalable and robust. However, newer browsers address the problems that these proxies solved, so these proxies are no longer very useful.

Proxy 2

[1997] The typical interaction between a web browser and a web server occurs in two stages: first, the browser sends a request to a server, which performs checks to ensure the request is to a valid document and that the browser has permission to access the document; second, the server sends a reply to the browser, which performs checks to ensure the document is valid and, if it is executable (Java applet), that it does not violate any security restrictions as it executes. A web proxy server can sit between a browser and server. Most web proxies are used for firewalls, caching, or filtering. The MURI Proxy (part of the MURI research project) both filters and modifies documents sent in a reply. In particular, our proxy modifies Java applets to restrict their behavior. This proxy is not designed to be a foolproof system to catch all hostile Java classes that are sent over the network; rather, it is a convenient tool with which we can experiment with restrictions for Java applets.

The general architecture of the MURI Proxy involves intercepting and acting on both HTTP requests and replies. A request can be handled in one of four ways:

  1. Block. A blocked request is never answered; the browser’s connection is closed and the browser reports an error or uses a “broken image” icon.
  2. Answer. An answered request is handled directly by the proxy. In a sense, the proxy acts as a web server.
  3. Redirect. A redirected request is sent to a location other than the one for which it was originally intended.
  4. Forward. A forwarded request is not modified by the proxy; it is sent to the web server for which it was originally intended.

Requests that are redirected or forwarded to a remote web server will result in a reply document, which is intercepted by the proxy. A series of transformations can be applied to the document, with the result being sent to the browser. The URL and MIME type for the document are examined to determine how it is to be transformed. The web browser will not know that the document was modified by the proxy.
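The reply-transformation step described above can be sketched as a small table of rules keyed on URL pattern and MIME type. This is only an illustration, not the MURI Proxy's actual code: the names `TRANSFORMS`, `transform_reply`, `strip_blink`, and `collapse_blank_lines`, and the example URLs, are all assumptions made for this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical transformations; the real proxy's modules are not shown here.
def strip_blink(body: bytes) -> bytes:
    """Remove <blink> tags but keep the text inside them."""
    return body.replace(b"<blink>", b"").replace(b"</blink>", b"")

def collapse_blank_lines(body: bytes) -> bytes:
    """Reduce runs of three or more newlines to exactly two."""
    while b"\n\n\n" in body:
        body = body.replace(b"\n\n\n", b"\n\n")
    return body

# Each rule is (URL pattern, MIME type, transformation). Rules are assumptions.
TRANSFORMS = [
    ("*", "text/html", strip_blink),
    ("http://www.example.org/*", "text/html", collapse_blank_lines),
]

def transform_reply(url: str, mime_type: str, body: bytes) -> bytes:
    """Apply, in order, every transformation whose URL pattern and MIME type match."""
    for pattern, mime, func in TRANSFORMS:
        if mime == mime_type and fnmatchcase(url, pattern):
            body = func(body)
    return body
```

A reply whose URL and MIME type match no rule passes through unchanged, which mirrors the idea that the browser simply receives whatever the series of transformations produces.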

At present, the MURI Proxy is capable of the following actions on HTTP requests:

  1. Block. The proxy blocks access to a number of sites known to serve only advertising.
  2. Answer. The proxy handles documents at http://_proxy/*. The most important is at /start/, which launches a window with a user interface applet giving the user control of and information about the proxy.
  3. Redirect. The proxy redirects requests for special Java classes (see Java class filtering, below) to a site that contains bytecode for these classes.
  4. Forward. All other requests are forwarded.
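As a rough sketch, the four-way decision listed above could be expressed as a single classification function. The host names, the redirect table, and the names `Action` and `classify` are hypothetical, chosen only for illustration; only the `_proxy` host comes from the description above.

```python
from enum import Enum
from urllib.parse import urlsplit

class Action(Enum):
    BLOCK = "block"
    ANSWER = "answer"
    REDIRECT = "redirect"
    FORWARD = "forward"

# Hypothetical rule tables; the real proxy's lists are not reproduced here.
BLOCKED_HOSTS = {"ads.example.com"}
PROXY_HOST = "_proxy"
REDIRECTS = {
    "http://applets.example.org/Frame.class": "http://safe.example.net/Frame.class",
}

def classify(url):
    """Return (action, target_url) for a requested URL."""
    host = urlsplit(url).hostname or ""
    if host in BLOCKED_HOSTS:
        return Action.BLOCK, None          # close the connection, no reply
    if host == PROXY_HOST:
        return Action.ANSWER, url          # the proxy acts as the web server
    if url in REDIRECTS:
        return Action.REDIRECT, REDIRECTS[url]
    return Action.FORWARD, url             # pass through unmodified
```

Checking the rules in a fixed order keeps the behavior predictable: anything that is not explicitly blocked, answered, or redirected falls through to forwarding.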

The MURI Proxy also contains several modules for transforming reply documents:

  1. HTML documents containing references to an image from a blocked site are modified to omit those references.
  2. Java applets using frame windows (windows that appear outside the browser) are modified to:
    1. restrict the number of frame windows to ten; and
    2. restrict the size of frame windows to 500x400.
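The first module above (omitting image references to blocked sites) might look something like the following. This is a deliberately simplified sketch that uses regular expressions rather than a full HTML parser; the blocked host and the name `remove_blocked_images` are assumptions, and the original module may have worked quite differently.

```python
import re
from urllib.parse import urlsplit

# Hypothetical block list for illustration.
BLOCKED_HOSTS = {"ads.example.com"}

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(r"""\bsrc\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)

def remove_blocked_images(html):
    """Drop <img> tags whose src points at a blocked host; keep everything else."""
    def replace(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if src:
            host = urlsplit(src.group(1)).hostname  # None for relative URLs
            if host in BLOCKED_HOSTS:
                return ""
        return tag
    return IMG_TAG.sub(replace, html)
```

Relative image URLs have no host, so they are always kept. A regex like this can be fooled by unusual markup (for example a ">" inside an attribute value), which is acceptable for a sketch but not for a robust filter.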

Download

To run the proxy demo, you need a browser (such as Netscape Navigator or Microsoft Internet Explorer) that supports web proxies and Java applets.

To see what the proxy does, compare how your browser behaves with and without the proxy. Note: Be sure to exit and restart your browser after changing proxy settings; otherwise, documents may be in the cache and may not be loaded through the proxy.

These pages will act differently when the proxy is being used:

  1. CNN.com[2] and many other pages have their advertisements removed.
  2. Java Applets[3]: the size and number of windows are restricted.

You can also bring up the user interface applet[4], which displays information about the proxy. The menus let you control which proxy modules will be active.

Proxy 3

[1998] Version 3 of the proxy is focused less on Java applet security research and more on HTML and JavaScript filtering. I found that I liked using the proxy for my daily web surfing, but there were several problems. The main problem was that version 2 did not perform well with multiple browser windows open simultaneously. Version 3 uses an event-driven architecture that can handle hundreds of simultaneous connections. (I wrote this at a time in my life when I hated threads.) In addition, the content filtering in version 3 is more modular and can handle streaming, so portions of a document can be filtered and sent on to the browser before the entire document is loaded. Version 3 of the proxy supports a configuration file and loadable modules. Loadable modules include:

The approach taken by proxy3 is to alter the content. This works differently than proxies like Junkbuster, which leave the content alone but block at the HTTP level. The WWW 2003 conference includes a paper that uses a similar but more extensible and principled approach[5].
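To illustrate what streaming content filtering means in practice, here is a minimal sketch of a filter that is fed a document chunk by chunk and returns whatever output is safe to send on immediately. The class name `StreamingTagFilter` and its `feed`/`close` interface are assumptions for this example, not proxy3's actual module API, and the tag handling is intentionally naive (it treats the first ">" after a "<" as the end of the tag).

```python
class StreamingTagFilter:
    """Drop chosen tags (not their contents) from an HTML stream, chunk by chunk."""

    def __init__(self, drop_tags):
        self.drop_tags = {t.lower() for t in drop_tags}
        self.pending = ""  # incomplete tag held back until the next chunk

    def feed(self, chunk):
        """Accept one chunk and return the filtered text that can be sent now."""
        data = self.pending + chunk
        self.pending = ""
        out = []
        pos = 0
        while True:
            lt = data.find("<", pos)
            if lt == -1:                 # no more tags: everything left is text
                out.append(data[pos:])
                break
            out.append(data[pos:lt])     # text before the tag
            gt = data.find(">", lt)
            if gt == -1:                 # tag is split across chunks: hold it back
                self.pending = data[lt:]
                break
            tag = data[lt:gt + 1]
            if self._tag_name(tag) not in self.drop_tags:
                out.append(tag)
            pos = gt + 1
        return "".join(out)

    def close(self):
        """Flush anything still held back at end of document."""
        tail, self.pending = self.pending, ""
        return tail

    @staticmethod
    def _tag_name(tag):
        inner = tag[1:-1].strip().strip("/").split(None, 1)
        return inner[0].lower() if inner else ""
```

The key property is that only an incomplete tag at the end of a chunk is buffered; everything before it is released right away, so the browser can start rendering before the whole document has arrived.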

Download

The proxy has not been tested on non-Unix systems.

Take a look at this clever use of the proxy for spam filtering[6].

Proxy 4

[2000] Whereas Proxy 3 allowed me to experiment with content-altering features to speed up browsing, Proxy 4 allowed me to experiment with connection-altering features to speed up browsing. Keep in mind that this was over 10 years ago, when browsers didn’t fully support all of HTTP/1.1. Proxy 4 separates the browser connection from the server connection so that the server connection could use HTTP/1.1 features even if the browser didn’t support them.
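One way to picture this separation: the proxy reads whatever request the browser sends (possibly HTTP/1.0) and composes its own HTTP/1.1 request for the server side. The sketch below is an assumption about how such a translation could look, not Proxy 4's actual code; the function name `build_server_request` and the choice of headers to drop are hypothetical.

```python
from urllib.parse import urlsplit

# Headers that describe the browser-to-proxy hop and should not be copied as-is.
HOP_HEADERS = ("host", "connection", "proxy-connection")

def build_server_request(browser_request_line, browser_headers):
    """Turn a browser's proxy-style request into an HTTP/1.1 request for the server."""
    method, url, _browser_version = browser_request_line.split()
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [f"{method} {path} HTTP/1.1", f"Host: {parts.netloc}"]
    for name, value in browser_headers.items():
        if name.lower() not in HOP_HEADERS:
            lines.append(f"{name}: {value}")
    lines.append("Connection: keep-alive")  # persistent connection on the server side
    return "\r\n".join(lines) + "\r\n\r\n"
```

Because the server-side request is built independently, the version and connection options the browser used do not constrain what the proxy negotiates with the server.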

A side effect of separating the browser connection object from the server connection object is that the input and output buffers are decoupled, making the proxy much smoother. Proxy 4 is much more reliable than my proxy 2 and proxy 3 projects.
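The decoupled buffers can be sketched with two connection objects, each owning its own read and write queues, plus a small function that moves data from one side to the other. The names `BufferedConnection` and `pump` are hypothetical and this is not how Proxy 4 is actually structured internally; it only illustrates why each side can proceed at its own pace.

```python
from collections import deque

class BufferedConnection:
    """One side of the proxy (browser or server) with its own buffers."""

    def __init__(self):
        self.read_buffer = deque()   # data received from the socket
        self.write_buffer = deque()  # data waiting to be sent to the socket

    def on_readable(self, data):
        """Called when the socket has delivered some bytes."""
        self.read_buffer.append(data)

    def on_writable(self, max_bytes):
        """Called when the socket can accept data; return up to max_bytes."""
        if not self.write_buffer:
            return b""
        chunk = self.write_buffer.popleft()
        if len(chunk) > max_bytes:
            # Put the unsent remainder back at the front of the queue.
            self.write_buffer.appendleft(chunk[max_bytes:])
            chunk = chunk[:max_bytes]
        return chunk

def pump(source, sink):
    """Move everything the source has read into the sink's write queue."""
    while source.read_buffer:
        sink.write_buffer.append(source.read_buffer.popleft())
```

With this arrangement a fast server can deliver a whole reply into the browser connection's write queue, and a slow browser drains it in small pieces without stalling the server side.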

My intent was to reimplement the proxy 3 modules on top of proxy 4, but I never did that.

Download

The proxy has not been tested on non-Unix systems. However, it’s more likely than earlier versions to work across platforms, because it uses the cross-platform asyncore[8] library for networking.

You may want to look at WebCleaner[9], a proxy that is based on my proxy4 but offers a whole lot more!

Proxy 5

[2004] Each of the above proxies implements some feature that I wanted at the time. However, none of them implements all the features. In particular, a combination of proxy 3 and proxy 4 would be nice. Someday I’d like to work on proxy 5, a multithreaded proxy (perhaps actor/agent-based) that both deals with HTTP in the same way as proxy 4 and deals with HTML, JavaScript, and CSS in the same way as proxy 3.

There are so many proxies out there[10] that maybe I should take a look at their goals and architecture before I design proxy5.

[2005-04-12] A lot of the HTML-modifying tricks I wanted to implement are easier to implement in GreaseMonkey[11] or Stylish[12], so I haven’t had much motivation to work on a proxy to do these things.
