package HTTP::Proxy::BodyFilter::htmlparser;

use strict;
use Carp;
use HTTP::Proxy::BodyFilter;
use vars qw( @ISA );
@ISA = qw( HTTP::Proxy::BodyFilter );

sub init {
    croak "First parameter must be a HTML::Parser object"
      unless $_[1]->isa('HTML::Parser');

    my $self = shift;
    $self->{_parser} = shift;

    # remaining arguments are options; only 'rw' is recognised
    my %args = (@_);
    $self->{rw} = delete $args{rw};
}

sub filter {
    my ( $self, $dataref, $message, $protocol, $buffer ) = @_;

    # reset the output and give the parser access to the current context
    @{ $self->{_parser} }{qw( output message protocol )} =
      ( "", $message, $protocol );

    $self->{_parser}->parse($$dataref);
    $self->{_parser}->eof if not defined $buffer;    # last chunk

    # only a read-write filter replaces the body data
    $$dataref = $self->{_parser}{output} if $self->{rw};
}

sub will_modify { $_[0]->{rw} }

1;

__END__

=head1 NAME

HTTP::Proxy::BodyFilter::htmlparser - Filter using HTML::Parser

=head1 SYNOPSIS

    use HTTP::Proxy::BodyFilter::htmlparser;

    # $parser is a HTML::Parser object
    $proxy->push_filter(
        mime     => 'text/html',
        response => HTTP::Proxy::BodyFilter::htmlparser->new( $parser ),
    );

=head1 DESCRIPTION

The HTTP::Proxy::BodyFilter::htmlparser filter lets you create a
filter based on the HTML::Parser object of your choice.

This filter takes an HTML::Parser object as an argument to its constructor.
The filter is either read-only or read-write. A read-only filter does
not let you change the data on the fly; it can only inspect it.

With a read-write filter, you B<must> recreate the whole body data. This
is mainly because HTML::Parser has its own buffering system, and there
is no easy way to correlate the data that triggered an HTML::Parser
event with its original position in the chunk sent by the origin server.
See below for details.
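
As a minimal sketch of what recreating the whole body can look like,
the parser below copies every event's original text to the C<output>
attribute described later in this document, except for comments, which
it drops. It assumes that HTML::Parser's C<default> handler is enough
to copy through every event that has no handler of its own. Stripping
comments is only an illustration, not something this module does by
itself, and C<$proxy> is a hypothetical HTTP::Proxy object.

    use strict;
    use warnings;
    use HTTP::Proxy;
    use HTML::Parser;
    use HTTP::Proxy::BodyFilter::htmlparser;

    my $parser = HTML::Parser->new( api_version => 3 );

    # copy the original text of every event to the output...
    $parser->handler(
        default => sub {
            my ( $self, $text ) = @_;
            $self->{output} .= $text;
        },
        "self,text"
    );

    # ...except comments, which are ignored and so never reach the output
    $parser->handler( comment => sub { }, "self" );

    # hypothetical proxy object, created here for illustration
    my $proxy = HTTP::Proxy->new;
    $proxy->push_filter(
        mime     => 'text/html',
        response =>
          HTTP::Proxy::BodyFilter::htmlparser->new( $parser, rw => 1 ),
    );

Because the handlers always append to C<output>, text that HTML::Parser
keeps in its own buffer between two chunks is emitted with a later
chunk rather than lost.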

Note that a simple filter that only modifies the HTML text (not the tags)
can be created more easily with HTTP::Proxy::BodyFilter::htmltext.

=head2 Creating a HTML::Parser that rewrites pages

A read-write filter is declared by passing C<rw =E<gt> 1> to the constructor:

    HTTP::Proxy::BodyFilter::htmlparser->new( $parser, rw => 1 );

To be able to modify the body of a message, a filter created with
HTTP::Proxy::BodyFilter::htmlparser must rewrite it completely. It does
so through a special attribute of the HTML::Parser object, named
C<output>. To update it, an HTML::Parser handler must request C<self>
in its argspec (that is to say, ask for access to the parser itself)
and append to the parser's C<output> key.

The following attributes are added to the HTML::Parser object by this filter:

=over 4

=item output

A string that will hold the data sent back by the proxy. It is reset
to the empty string each time the filter receives a new chunk of data.

This string will be used as a replacement for the body data only
if the filter is read-write, that is to say, if it was initialised with
C<rw =E<gt> 1>.

Data should always be B<appended> to C<$parser-E<gt>{output}>.

=item message

A reference to the HTTP::Message that triggered the filter.

=item protocol

A reference to the LWP::Protocol object.

=back

=head1 METHODS

This filter defines three methods, called automatically:

=over 4

=item filter()

The C<filter()> method handles all the interactions with the HTML::Parser
object.

=item init()

Initialise the filter with the HTML::Parser object passed to the constructor.

=item will_modify()

This method returns a boolean value that tells the system whether
the filter will modify the data passing through. The value is actually
the value of the C<rw> parameter passed to the constructor.

=back

=head1 SEE ALSO

L<HTTP::Proxy>, L<HTTP::Proxy::BodyFilter>,
L<HTTP::Proxy::BodyFilter::htmltext>.

=head1 AUTHOR

Philippe "BooK" Bruhat, E<lt>book@cpan.orgE<gt>.

=head1 COPYRIGHT

Copyright 2003-2006, Philippe Bruhat.

=head1 LICENSE

This module is free software; you can redistribute it or modify it under
the same terms as Perl itself.

=cut
