Starting as part of a research project at Stanford, Amit has written a series of web proxies, from 1997 to 2000. These proxies are old and may not work in current versions of Python without some modification.
- Proxy 2 was for a research project involving security and Java applets. But I also experimented with content blocking. I had implemented an ad blocker as early as 1997; Wikipedia says ad blockers were first developed in 1996[1].
- Proxy 3 dealt with content transformation (HTML, Javascript, images).
- Proxy 4 dealt with connection transformation (HTTP/1.1 connections, pipelining, compression, chunking).
Back in those days, some folk found Proxy 3 to be useful for their basic web surfing. Proxy 4 is more scalable and robust. However, newer browsers address the problems that these proxies solved, so these proxies are no longer very useful.
Proxy 2
[1997] The typical interaction between a web browser and a web server occurs in two stages: first, the browser sends a request to a server, which performs checks to ensure the request is to a valid document and that the browser has permission to access the document; second, the server sends a reply to the browser, which performs checks to ensure the document is valid and, if it is executable (Java applet), that it does not violate any security restrictions as it executes. A web proxy server can sit between a browser and server. Most web proxies are used for firewalls, caching, or filtering. The MURI Proxy (part of the MURI research project) both filters and modifies documents sent in a reply. In particular, our proxy modifies Java applets to restrict their behavior. This proxy is not designed to be a foolproof system to catch all hostile Java classes that are sent over the network; rather, it is a convenient tool with which we can experiment with restrictions for Java applets.
The general architecture of the MURI Proxy involves intercepting and acting on both HTTP requests and replies. A request can be handled in one of four ways:
- Block. A blocked request is never answered; the browser’s connection is closed and the browser reports an error or uses a “broken image” icon.
- Answer. An answered request is handled directly by the proxy. In a sense, the proxy acts as a web server.
- Redirect. A redirected request is sent to a location other than the one for which it was originally intended.
- Forward. A forwarded request is not modified by the proxy; it is sent to the web server for which it was originally intended.
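As a rough illustration of the four-way decision above, here is a minimal Python sketch. It is not the MURI Proxy's code: the rule tables, the example hostnames, and the `classify_request` function are hypothetical, and only the `_proxy` host name comes from the description on this page.

```python
from urllib.parse import urlsplit

# Hypothetical rule tables; the real proxy's lists are not given here.
BLOCKED_HOSTS = {"ads.example.com"}
REDIRECTS = {
    "http://applets.example.com/Frame.class":
        "http://safe.example.org/classes/Frame.class",
}
PROXY_HOST = "_proxy"  # requests to this host are answered by the proxy


def classify_request(url):
    """Return (action, target) for a request URL.

    action is one of "block", "answer", "redirect", or "forward".
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host in BLOCKED_HOSTS:
        return ("block", None)               # close the connection, no reply
    if host == PROXY_HOST:
        return ("answer", parts.path)        # the proxy acts as the web server
    if url in REDIRECTS:
        return ("redirect", REDIRECTS[url])  # fetch from a different location
    return ("forward", url)                  # pass through unmodified
```

For example, `classify_request("http://_proxy/start/")` yields an "answer" action for the path `/start/`, while an unlisted URL falls through to "forward".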
Requests that are redirected or forwarded to a remote web server will result in a reply document, which is intercepted by the proxy. A series of transformations can be applied to the document, with the result being sent to the browser. The URL and MIME type for the document are examined to determine how it is to be transformed. The web browser will not know that the document was modified by the proxy.
At present, the MURI Proxy is capable of the following actions on HTTP requests:
- Block. The proxy blocks access to a number of sites known to serve only advertising.
- Answer. The proxy handles documents at http://_proxy/*. The most important is at /start/, which launches a window with a user interface applet giving the user control of and information about the proxy.
- Redirect. The proxy redirects requests for special Java classes (see Java class filtering, below) to a site that contains bytecode for these classes.
- Forward. All other requests are forwarded.
The MURI Proxy also contains several modules for transforming documents:
- HTML documents containing references to an image from a blocked site are modified to omit those references.
- Java applets using frame windows (windows that appear outside the browser) are modified to:
  - restrict the number of frame windows to ten; and
  - restrict the size of frame windows to 500x400.
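The HTML transformation in the first item above (omitting references to images from blocked sites) could be sketched roughly as follows. This is an assumption-laden illustration in Python, not the original module: the block list, the regular expression, and the `strip_blocked_images` name are all hypothetical, and a regular expression is only a crude stand-in for real HTML parsing.

```python
import re
from urllib.parse import urlsplit

BLOCKED_HOSTS = {"ads.example.com"}  # hypothetical block list

# Matches an <img ...> tag and captures the value of its src attribute.
IMG_RE = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*["']?([^"'\s>]+)[^>]*>""",
    re.IGNORECASE,
)


def strip_blocked_images(html):
    """Drop <img> tags whose src points at a blocked host."""
    def replace(match):
        host = urlsplit(match.group(1)).hostname or ""
        # Relative URLs have no host, so they are always kept.
        return "" if host in BLOCKED_HOSTS else match.group(0)
    return IMG_RE.sub(replace, html)
```

Because the tag is removed from the document itself, the browser never issues a request for the blocked image at all.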
Download
To run the proxy demo, you need a browser (such as Netscape Navigator or Microsoft Internet Explorer) that supports web proxies and Java applets.
- proxy2[2]
To see what the proxy does, compare your browser's behavior with and without the proxy. Note: Be sure to exit and restart your browser after changing proxy settings; otherwise, documents may still be in the cache and may not be loaded through the proxy.
These pages will act differently when the proxy is being used:
- CNN.com[3] and many other pages have their advertisements removed.
- Java Applets[4]: the size and number of windows are restricted.
You can also bring up the user interface applet[5], which displays information about the proxy. The menus let you control which proxy modules will be active.
Proxy 3
[1998] Version 3 of the proxy focuses less on Java applet security research and more on HTML and Javascript filtering. I found that I liked using the proxy for my daily web surfing, but there were several problems. The main problem was that version 2 did not perform well when running multiple browser windows simultaneously. Version 3 uses an event-driven architecture that can handle hundreds of simultaneous connections. (I was writing this at a time in my life when I hated threads.) In addition, the content filtering in version 3 is more modular and can handle streaming, so that portions of the document can be filtered and sent on to the browser before the entire document is loaded. Version 3 of the proxy supports a configuration file and loadable modules. Loadable modules include:
- mod_proxy: provide magic URLs that display the proxy’s internal state.
- mod_stdio: listen for proxy events (HTTP connections, errors, timeouts, ad removal, etc.) and display them on stdout.
- mod_curses: listen for proxy events and display them in a curses UI.
- mod_gtk: listen for proxy events and display them in a GTK UI; also allow changing settings.
- mod_ui: listen for proxy events and display them in a Java applet (not included); also allow changing settings.
- mod_stats: listen for proxy events and display statistics when the proxy exits.
- mod_timing: display slow DNS lookups and slow proxy filters.
- mod_cookies: listen for cookie events (sent by server, sent by browser) and display them on stdout.
- mod_headers: display HTTP headers (sent by server, sent by browser) on stdout.
- mod_html: modify HTML -- change Slashdot color scheme from green to blue; rearrange My Excite portal layout; change Microsoft quotes to standard ASCII quotes; remove popup ads; remove banner ads.
- mod_geocities: modify HTML -- remove Geocities popups.
- mod_java: modify Java bytecode (wrap audio, thread, frame, socket objects).
- mod_slashdot: modify images -- change Slashdot color scheme from green to blue by altering the GIF files.
- mod_block: block clear GIFs.
- mod_cache: cache documents forever.
- mod_dnsprefetch: parse HTML documents, find hostnames, prefetch the DNS lookups for them so when you click on a link, the (often slow) DNS lookup is already performed.
- mod_formdata: display form upload data.
- mod_ignorecache: remove headers which tell the browser not to cache certain sites.
- mod_nocookie: block servers from setting cookies.
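To make the streaming idea concrete (filtering portions of a document before the whole document has arrived), here is a minimal Python sketch. It is not proxy3's filter interface: the `StreamingTagFilter` class, its `feed`/`close` methods, and the `drop_images` rewriter are hypothetical names chosen for illustration. The one subtlety it shows is that a chunk may end in the middle of a tag, so the filter holds back the partial tag until the rest arrives.

```python
import re

TAG_RE = re.compile(r"<[^<>]*>")  # a complete tag; ignores ">" inside quotes


class StreamingTagFilter:
    """Rewrites complete tags as chunks arrive, holding back a partial tag."""

    def __init__(self, rewrite_tag):
        self._rewrite_tag = rewrite_tag  # function: tag text -> replacement
        self._pending = ""               # unfinished tag from the last chunk

    def feed(self, chunk):
        """Filter one chunk and return the text that is safe to send on."""
        data = self._pending + chunk
        cut = data.rfind("<")
        if cut != -1 and ">" not in data[cut:]:
            # The chunk ends inside a tag: keep that part for the next feed.
            self._pending, data = data[cut:], data[:cut]
        else:
            self._pending = ""
        return TAG_RE.sub(lambda m: self._rewrite_tag(m.group(0)), data)

    def close(self):
        """Flush whatever is still held back when the document ends."""
        leftover, self._pending = self._pending, ""
        return leftover


def drop_images(tag):
    """Example rewriter: remove every <img> tag, keep all other tags."""
    return "" if tag.lower().startswith("<img") else tag
```

Feeding `'<p>Hi <im'` and then `'g src="banner.gif">there</p>'` through `StreamingTagFilter(drop_images)` sends the first part of the paragraph to the browser immediately and drops the image tag once its second half arrives.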
The approach taken by proxy3 is to alter the content. This works differently than proxies like Junkbuster, which leave the content alone but block at the HTTP level. The WWW 2003 conference includes a paper that uses a similar but more extensible and principled approach[6].
Download
The proxy has not been tested on non-Unix systems.
Take a look at this clever use of the proxy for spam filtering[7].
Proxy 4
[2000] Whereas Proxy 3 allowed me to experiment with content-altering features to speed up browsing, Proxy 4 allowed me to experiment with connection-altering features to speed up browsing. Keep in mind that this was over 10 years ago, when browsers didn’t fully support all of HTTP/1.1. Proxy 4 separates the browser connection and server connection so that the server connection can use HTTP/1.1 features even if the browser doesn’t support them.
- Parallel DNS lookups are made in the background, and their results are shared among all connections that need the DNS results. At the time, some browsers would perform DNS lookups one at a time; performing them in parallel greatly speeds up browsing. Proxy 4’s infrastructure also supports prefetching DNS lookups, which is what proxy 3’s mod_dnsprefetch module needed (but in proxy 3 it was a hack).
- The server connection can use HTTP/1.1 chunked encoding even if the browser does not. This allows web servers to send parts of the page before the entire page is ready.
- The server connection can use HTTP/1.1 gzip encoding even if the browser does not. This allows web servers to compress the page, saving bandwidth.
- The server connection can use HTTP/1.1 keep-alive even if the browser does not. This allows web servers to reuse TCP connections. Creating new TCP connections is slow, especially for certain types of networks, so reusing them helps speed up browsing. (A related feature, pipelining, allows multiple requests to be sent over that connection without waiting for each reply, but isn’t widely used even in 2011[8].)
- Browser connections can exist with no server connection. This is used for special debugging URLs, and it also lets the proxy answer requests from its cache. Proxy 3’s mod_proxy and mod_cache could have used this.
- Multiple browser connections can use the same server connection. For example, if a web page includes the same image several times on the page, some (older) browsers would request that image several times from the server. Proxy 4 can consolidate these into a single server request and then feed that content several times to the browser. (This isn’t needed in newer browsers, but it was useful at the time.)
- Server connections can exist with no browser connection. For example, the proxy could prefetch content based on heuristics. I had wanted this capability for proxy 3’s modules.
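As an illustration of the chunked and gzip items above, the following Python sketch shows how a reply received with both encodings could be turned back into a plain body for a browser that understands neither. It is a simplified assumption rather than proxy4's code: the function names are hypothetical, the whole body is decoded at once instead of incrementally, and chunk trailers are ignored.

```python
import gzip


def decode_chunked(body):
    """Decode an HTTP/1.1 chunked message body into its payload bytes."""
    payload = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The chunk size is hexadecimal; drop any ";extension" after it.
        size_text = body[pos:line_end].split(b";")[0].strip()
        size = int(size_text.decode("ascii"), 16)
        if size == 0:
            break                   # last chunk; trailers are ignored here
        start = line_end + 2
        payload += body[start:start + size]
        pos = start + size + 2      # skip the CRLF that ends the chunk
    return bytes(payload)


def plain_body_for_browser(body, transfer_encoding="", content_encoding=""):
    """Undo the encodings the browser does not understand."""
    if transfer_encoding.lower() == "chunked":
        body = decode_chunked(body)
    if content_encoding.lower() == "gzip":
        body = gzip.decompress(body)
    return body
```

A real proxy would also have to rewrite the reply headers to match (for example, removing the encoding headers and supplying a correct Content-Length), which this sketch leaves out.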
A side effect of separating the browser connection object from the server connection object is that the input and output buffers are decoupled, making the proxy much smoother. Proxy 4 is much more reliable than my proxy 2 and proxy 3 projects.
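The sharing described in the list above, where several browser connections wait on a single DNS lookup or a single server request, can be sketched in Python as a small table of in-flight fetches. This is a hypothetical illustration, not proxy4's design: the `SharedFetch` class and its callback-based interface are assumptions made for the example.

```python
class SharedFetch:
    """Lets several waiters share the result of one in-flight fetch per key."""

    def __init__(self, start_fetch):
        self._start_fetch = start_fetch  # called once per key: start_fetch(key)
        self._waiters = {}               # key -> callbacks awaiting the result

    def request(self, key, callback):
        """Ask for key; start a fetch only if none is already in flight."""
        if key in self._waiters:
            self._waiters[key].append(callback)  # join the existing fetch
        else:
            self._waiters[key] = [callback]
            self._start_fetch(key)

    def complete(self, key, result):
        """Deliver the result to every waiter and forget the fetch."""
        for callback in self._waiters.pop(key, []):
            callback(result)
```

The key could be a hostname (for DNS lookups) or a URL (for repeated image requests); in both cases only the first request triggers real network activity, and later requests for the same key simply wait for its result.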
My intent was to reimplement the proxy 3 modules on top of proxy 4, but I never did that.
Download
The proxy has not been tested on non-Unix systems. However, it’s more likely than earlier versions to work across platforms, because it uses the cross-platform asyncore[9] library for networking.
You may want to look at WebCleaner[10], a proxy that is based on my proxy4 but offers a whole lot more!
Proxy 5
[2004] Each of the above proxies implements some feature that I wanted at the time. However, none of them implements all the features. In particular, a combination of proxy 3 and proxy 4 would be nice. Someday I’d like to work on proxy 5, a multithreaded proxy (perhaps actor/agent-based) that both deals with HTTP in the same way as proxy 4 and deals with HTML, Javascript, and CSS in the same way as proxy 3.
There are so many proxies out there[11] that maybe I should take a look at their goals and architecture before I design proxy5.
[2005-04-12] A lot of the HTML-modifying tricks I wanted to implement are easier to implement in GreaseMonkey[12] or Stylish[13], so I haven’t had much motivation to work on a proxy to do these things.