Regex gurus: I need your help!



Ezekiel
09-22-2007, 06:10 PM
Currently, I run Apache on my home computer for various purposes (mainly to share files); I'll eventually use it for a lot more. This is for personal/school stuff -- not really intended to be associated with any of my online aliases. I run it on a non-standard port and only forward it in my router's settings when needed.

I recently decided to host a web-proxy from my machine, so I can browse my favourite sites when I'm bored (or have finished all the work) at school. They have really strict filtering and route everything through their own internal proxy, so I think only HTTP traffic is allowed. Thus, a web-proxy is the solution.

The problem is, they block sites based on words in addition to the standard stuff. It's not a matter of certain domains being blocked -- if any 'un-educational' text is found, access is denied.

The first thought I had was to set up SSL on Apache so they can't monitor my requested pages, but I'm too lazy to do that right now (I forgot how I did it before...) and I'm not sure their network would allow SSL connections on my weird port.

After all this contemplating, I concluded that I would modify PHProxy (the proxy web-app) to get around this word-filtering shit.

My intention is to put a space between every character in the text of every page requested by the user (but leaving HTML tags intact). I may eventually find a character to replace the space that is suitably invisible, but it's fine for the moment.

How will this beat the filter? If every word has a space between every character, their filters can't recognise it.

So far I've found the correct parts to edit in PHProxy's scripts, and here's one of my modified output lines (with the regex I came up with):


echo preg_replace("/(>.*)(.)(.*<)/g", "$1 $2 $3", $PHProxy->return_response());

It only puts the spaces once in each whole block of text (as defined by the space between HTML end tags and start tags). This is useless -- I want spaces between every character!

I know there is a simple solution to this, but I'm too tired and I've neglected regexes during my time on the net.

JayT
09-27-2007, 10:09 PM
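There is another space character: the non-breaking space.

It's the same as the HTML &nbsp;

Character code 160 (strictly speaking it is outside 7-bit ASCII; 160 is its code in ISO-8859-1).

It's treated like a solid character, but is invisible like a regular space.

This way you can have multiple spaces between characters or words if you want, which is not possible in plain HTML unless you use that code, because the browser collapses runs of regular spaces.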





Maybe you don't need regex.

Perhaps you could convert all normal spaces to character code 160.
It looks identical printed out, but it is not the same string.


This replaces all normal spaces with code 160 instead.



$x = "This is a test string";

// Swap every regular space (code 32) for a non-breaking space (code 160).
$y = str_replace(" ", chr(160), $x);

print $y;


The printed output string ($y) looks the same as the input string ($x), but the spaces are not ASCII code 32 (hex 20) anymore, so the strings are no longer equal.
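
One way to check this is to dump the bytes of both strings. This is only a minimal sketch, and it assumes the page is in a single-byte encoding such as ISO-8859-1. On a UTF-8 page a lone byte 160 is not a valid character; there the non-breaking space would be the two bytes "\xC2\xA0" (that constant is my addition, it is not in the post above).

```php
<?php
$x = "a b";

// Single-byte encodings (ISO-8859-1, Windows-1252): NBSP is the one byte 0xA0.
$latin1 = str_replace(" ", chr(160), $x);

// UTF-8: NBSP (U+00A0) is encoded as the two bytes 0xC2 0xA0.
$utf8 = str_replace(" ", "\xC2\xA0", $x);

echo bin2hex($x), "\n";      // 612062
echo bin2hex($latin1), "\n"; // 61a062
echo bin2hex($utf8), "\n";   // 61c2a062

var_dump($x === $latin1);    // bool(false)
```

Which of the two replacements is right depends on the charset the proxied page declares, so it is probably worth checking that before picking one.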





Or maybe this. It will place a space between every character in a string.



$x = "This is a test string.";

print Spaced_String($x); // = "T h i s i s a t e s t s t r i n g ."

// Strips the regular spaces, then puts one space after every remaining character.
function Spaced_String($StrArg)
{
    $w1 = str_replace(" ", "", $StrArg);
    $w2 = "";

    for ($i = 0; $i < strlen($w1); $i++) {
        $w2 .= substr($w1, $i, 1) . " ";
    }

    // Drop the trailing space left by the loop.
    return trim($w2);
}


But the problem is, how do you then know which spaces separate individual words as opposed to individual characters so as to change the string back to normal?

Ezekiel
10-02-2007, 07:11 PM
My problem is identifying which areas are text (not HTML tags) and transforming these areas in some way to avoid the filter.

Your idea of using str_replace() would work, but I would have to split the page up using strtok() or something to separate the text and tags, then use str_replace() on the text areas.
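
A rough sketch of that split-then-transform idea, using preg_split() rather than strtok(). The function name space_out_text() is made up for illustration and is not part of PHProxy. The tag pattern is deliberately naive: it assumes no ">" inside attribute values, it does not treat comments or the contents of script/style blocks specially, and it works byte by byte, so it assumes a single-byte encoding.

```php
<?php
// Hypothetical helper, not part of PHProxy: spaces out the text between tags.
function space_out_text($html)
{
    // Split on tags but keep them in the result. With one capture group and
    // PREG_SPLIT_DELIM_CAPTURE, even indexes are text and odd indexes are tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = "";
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part; // a tag: leave it untouched
        } else {
            // text: one space between every pair of neighbouring bytes
            $out .= implode(" ", str_split($part));
        }
    }
    return $out;
}

echo space_out_text("<p>Hi <b>all</b></p>"), "\n";
// <p>H i  <b>a l l</b></p>
```

In the modified output line this would presumably wrap the response, along the lines of echo space_out_text($PHProxy->return_response()); but that wiring is a guess based on the line quoted in the first post.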

As for the spaces, well if you insert a space between every character, you also insert a space after a space, thus the words are still separated.
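
To make that concrete, here is a small sketch with two made-up helpers, spread() and unspread(). It assumes single-byte text. Because a space goes between every pair of characters, real spaces included, the inserted spaces always sit at the odd offsets, and dropping them gives the original string back.

```php
<?php
// Hypothetical helpers for illustration only.
function spread($text)
{
    // One space between every pair of neighbouring characters, real spaces included.
    return implode(" ", str_split($text));
}

function unspread($spaced)
{
    // The inserted spaces sit at the odd offsets, so keep only the even ones.
    $out = "";
    for ($i = 0; $i < strlen($spaced); $i += 2) {
        $out .= $spaced[$i];
    }
    return $out;
}

$s = spread("hi you");
echo $s, "\n";           // h i   y o u   (three spaces at the word gap)
echo unspread($s), "\n"; // hi you
```

One caveat: a browser collapses a run of regular spaces into one when it renders the page, so the wider word gap exists in the source but would not show on screen unless the inserted character is something like the non-breaking space.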